{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "from pathlib import Path"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# File Encoding"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Computers store data as sequences of *bits*: `0`s and `1`s. \n",
    "*Character encodings*, like ASCII, tell the computer how to translate between bits and text.\n",
    "For example, in ASCII, the bits `100 001` stand for the letter `A` and `100 010` for `B`.\n",
    "The most basic kind of plain text supports only standard ASCII characters, which\n",
    "includes the uppercase and lowercase English letters, numbers, punctuation symbols, and spaces. \n",
    " \n",
    "ASCII encoding does not include a lot of special characters or characters from other languages.\n",
    "Other, more modern character encodings have many more characters that can be represented.\n",
    "Common encodings for documents and web pages are Latin-1 (ISO-8859-1) and UTF-8. UTF-8 has over a\n",
    "million characters and is backward compatible with ASCII, meaning that it\n",
    "uses the same representation for English letters, numbers, and punctuation as\n",
    "ASCII.\n",
    "\n",
    "When we have a text file, we usually need to figure out its \n",
    "encoding. If we choose the wrong encoding to read in a file, Python \n",
    "either reads incorrect values or throws an error. The best way to find the encoding\n",
    "is by checking the data's documentation, which often explicitly says what the\n",
    "encoding is."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we don't know the encoding, we have to make a guess. The `chardet`\n",
    "package has a function called `detect()` that infers a file's encoding.\n",
    "Since these guesses are imperfect, the function also returns a confidence\n",
    "between 0 and 1. We use this function to look at the files from our examples:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "File Name                 Encoding   Confidence\n",
      "data/inspections.csv      ascii      1.0\n",
      "data/co2_mm_mlo.txt       ascii      1.0\n",
      "data/violations.csv       ascii      1.0\n",
      "data/DAWN-Data.txt        ascii      1.0\n",
      "data/legend.csv           ascii      1.0\n",
      "data/businesses.csv       ISO-8859-1 0.73\n"
     ]
    }
   ],
   "source": [
    "import chardet\n",
    "\n",
    "line = '{:<25} {:<10} {}'.format\n",
    "\n",
    "# for each file, print its name, encoding & confidence in the encoding\n",
    "print(line('File Name', 'Encoding', 'Confidence'))\n",
    "\n",
    "for filepath in Path('data').glob('*'):\n",
    "    result = chardet.detect(filepath.read_bytes())\n",
    "    print(line(str(filepath), result['encoding'], result['confidence']))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The detection function is quite certain that all but one of the files are\n",
    "ASCII encoded. The exception is _businesses.csv_, which appears to have an ISO-8859-1\n",
    "encoding. We run into trouble if we ignore this encoding and try to read the\n",
    "business file into `pandas` without specifying the special encoding:\n",
    "\n",
    "```python\n",
    "# naively reads file without considering encoding\n",
    ">>> pd.read_csv('data/businesses.csv')\n",
    "[...stack trace omitted...]\n",
    "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in\n",
    "position 8: invalid continuation byte\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To successfully read the data, we must specify the ISO-8859-1 encoding:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "remove-input"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>business_id</th>\n",
       "      <th>name</th>\n",
       "      <th>address</th>\n",
       "      <th>postal_code</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>19</td>\n",
       "      <td>NRGIZE LIFESTYLE CAFE</td>\n",
       "      <td>1200 VAN NESS AVE, 3RD FLOOR</td>\n",
       "      <td>94109</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>24</td>\n",
       "      <td>OMNI S.F. HOTEL - 2ND FLOOR PANTRY</td>\n",
       "      <td>500 CALIFORNIA ST, 2ND  FLOOR</td>\n",
       "      <td>94104</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>31</td>\n",
       "      <td>NORMAN'S ICE CREAM AND FREEZES</td>\n",
       "      <td>2801 LEAVENWORTH ST</td>\n",
       "      <td>94133</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>45</td>\n",
       "      <td>CHARLIE'S DELI CAFE</td>\n",
       "      <td>3202 FOLSOM ST</td>\n",
       "      <td>94110</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   business_id                                name  \\\n",
       "0           19               NRGIZE LIFESTYLE CAFE   \n",
       "1           24  OMNI S.F. HOTEL - 2ND FLOOR PANTRY   \n",
       "2           31      NORMAN'S ICE CREAM AND FREEZES   \n",
       "3           45                 CHARLIE'S DELI CAFE   \n",
       "\n",
       "                         address postal_code  \n",
       "0   1200 VAN NESS AVE, 3RD FLOOR       94109  \n",
       "1  500 CALIFORNIA ST, 2ND  FLOOR       94104  \n",
       "2           2801 LEAVENWORTH ST        94133  \n",
       "3                3202 FOLSOM ST        94110  "
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "bus.iloc[:4, [0,1,2,5]]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "File encoding can be a bit mysterious to figure out, and unless there is metadata that explicitly gives us the encoding, guesswork comes into play. \n",
    "When an encoding is not 100% confirmed, it's a good idea to seek additional documentation. "
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another potentially important aspect of a source file is its size. \n",
    "If a file is huge, then we might not be able to read it into a dataframe.\n",
    "In the next section, we discuss how to figure out a source file's size."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}