{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", " sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "from pathlib import Path" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# File Encoding" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Computers store data as sequences of *bits*: `0`s and `1`s. \n", "*Character encodings*, like ASCII, tell the computer how to translate between bits and text.\n", "For example, in ASCII, the bits `100 001` stand for the letter `A` and `100 010` for `B`.\n", "The most basic kind of plain text supports only standard ASCII characters, which\n", "includes the uppercase and lowercase English letters, numbers, punctuation symbols, and spaces. \n", " \n", "ASCII encoding does not include a lot of special characters or characters from other languages.\n", "Other, more modern character encodings have many more characters that can be represented.\n", "Common encodings for documents and web pages are Latin-1 (ISO-8859-1) and UTF-8. UTF-8 has over a\n", "million characters and is backward compatible with ASCII, meaning that it\n", "uses the same representation for English letters, numbers, and punctuation as\n", "ASCII.\n", "\n", "When we have a text file, we usually need to figure out its \n", "encoding. If we choose the wrong encoding to read in a file, Python \n", "either reads incorrect values or throws an error. The best way to find the encoding\n", "is by checking the data's documentation, which often explicitly says what the\n", "encoding is." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "When we don't know the encoding, we have to make a guess. The `chardet`\n", "package has a function called `detect()` that infers a file's encoding.\n", "Since these guesses are imperfect, the function also returns a confidence\n", "between 0 and 1. We use this function to look at the files from our examples:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "File Name Encoding Confidence\n", "data/inspections.csv ascii 1.0\n", "data/co2_mm_mlo.txt ascii 1.0\n", "data/violations.csv ascii 1.0\n", "data/DAWN-Data.txt ascii 1.0\n", "data/legend.csv ascii 1.0\n", "data/businesses.csv ISO-8859-1 0.73\n" ] } ], "source": [ "import chardet\n", "\n", "line = '{:<25} {:<10} {}'.format\n", "\n", "# for each file, print its name, encoding & confidence in the encoding\n", "print(line('File Name', 'Encoding', 'Confidence'))\n", "\n", "for filepath in Path('data').glob('*'):\n", " result = chardet.detect(filepath.read_bytes())\n", " print(line(str(filepath), result['encoding'], result['confidence']))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "The detection function is quite certain that all but one of the files are\n", "ASCII encoded. The exception is _businesses.csv_, which appears to have an ISO-8859-1\n", "encoding. We run into trouble if we ignore this encoding and try to read the\n", "business file into `pandas` without specifying the special encoding:\n", "\n", "```python\n", "# naively reads file without considering encoding\n", ">>> pd.read_csv('data/businesses.csv')\n", "[...stack trace omitted...]\n", "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in\n", "position 8: invalid continuation byte\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "To successfully read the data, we must specify the ISO-8859-1 encoding:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "remove-input" ] }, "outputs": [ { "data": { "text/html": [ "
\n", " | business_id | \n", "name | \n", "address | \n", "postal_code | \n", "
---|---|---|---|---|
0 | \n", "19 | \n", "NRGIZE LIFESTYLE CAFE | \n", "1200 VAN NESS AVE, 3RD FLOOR | \n", "94109 | \n", "
1 | \n", "24 | \n", "OMNI S.F. HOTEL - 2ND FLOOR PANTRY | \n", "500 CALIFORNIA ST, 2ND FLOOR | \n", "94104 | \n", "
2 | \n", "31 | \n", "NORMAN'S ICE CREAM AND FREEZES | \n", "2801 LEAVENWORTH ST | \n", "94133 | \n", "
3 | \n", "45 | \n", "CHARLIE'S DELI CAFE | \n", "3202 FOLSOM ST | \n", "94110 | \n", "