8.3. File Encoding
Computers store data as sequences of bits: 0s and 1s. Character encodings, like ASCII, tell the computer how to translate between bits and text. For example, in ASCII, the bits 1000001 stand for the letter A and 1000010 for B.
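We can check these bit patterns in Python: `ord` gives a character's numeric code, and `bin` shows that code's bits.

```python
# ASCII assigns the letter A the code 65 and B the code 66;
# bin() reveals the underlying bit patterns
print(bin(ord('A')))  # 0b1000001
print(bin(ord('B')))  # 0b1000010
```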
The most basic kind of plain text supports only standard ASCII characters, which include the uppercase and lowercase English letters, numbers, punctuation symbols, and spaces.
ASCII encoding does not include many special characters or characters from other languages. Other, more modern character encodings can represent many more characters. Common encodings for documents and web pages are Latin-1 (ISO-8859-1) and UTF-8. UTF-8 can represent over a million characters and is backward compatible with ASCII, meaning that it uses the same representation for English letters, numbers, and punctuation as ASCII.
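We can see this backward compatibility directly by comparing the bytes Python produces for the same text under different encodings: ASCII text encodes to identical bytes in UTF-8, while a non-ASCII character takes different byte sequences in UTF-8 and Latin-1.

```python
# ASCII characters have identical byte representations in UTF-8
s = 'ABC'
print(s.encode('ascii') == s.encode('utf-8'))  # True

# a non-ASCII character encodes differently:
# two bytes in UTF-8, one byte in Latin-1
print('é'.encode('utf-8'))    # b'\xc3\xa9'
print('é'.encode('latin-1'))  # b'\xe9'
```

This byte-level difference is exactly why reading Latin-1 data as UTF-8 fails, as we see later in this section.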
When we have a text file, we usually need to figure out its encoding. If we choose the wrong encoding to read in a file, Python either reads incorrect values or throws an error. The best way to find the encoding is by checking the data’s documentation, which often explicitly says what the encoding is.
When we don't know the encoding, we have to make a guess. The chardet package has a function called detect() that infers a file's encoding. Since these guesses are imperfect, the function also returns a confidence level between 0 and 1. We use this function to look at the files from our examples:
```python
from pathlib import Path

import chardet

line = '{:<25} {:<10} {}'.format

# for each file, print its name, encoding & confidence in the encoding
print(line('File Name', 'Encoding', 'Confidence'))
for filepath in Path('data').glob('*'):
    result = chardet.detect(filepath.read_bytes())
    print(line(str(filepath), result['encoding'], result['confidence']))
```
```
File Name                 Encoding   Confidence
data/inspections.csv      ascii      1.0
data/co2_mm_mlo.txt       ascii      1.0
data/violations.csv       ascii      1.0
data/DAWN-Data.txt        ascii      1.0
data/legend.csv           ascii      1.0
data/businesses.csv       ISO-8859-1 0.73
```
The detection function is quite certain that all but one of the files are ASCII encoded. The exception is businesses.csv, which appears to have an ISO-8859-1 encoding. We run into trouble if we ignore this encoding and try to read the business file into pandas without specifying it:
```python
# naively reads file without considering encoding
>>> pd.read_csv('data/businesses.csv')
[...stack trace omitted...]
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in
position 8: invalid continuation byte
```
To successfully read the data, we must specify the ISO-8859-1 encoding:
```python
bus = pd.read_csv('data/businesses.csv', encoding='ISO-8859-1')
```
| | business_id | name | address | postal_code |
|---|---|---|---|---|
| 0 | 19 | NRGIZE LIFESTYLE CAFE | 1200 VAN NESS AVE, 3RD FLOOR | 94109 |
| 1 | 24 | OMNI S.F. HOTEL - 2ND FLOOR PANTRY | 500 CALIFORNIA ST, 2ND FLOOR | 94104 |
| 2 | 31 | NORMAN'S ICE CREAM AND FREEZES | 2801 LEAVENWORTH ST | 94133 |
| 3 | 45 | CHARLIE'S DELI CAFE | 3202 FOLSOM ST | 94110 |
File encoding can be a bit mysterious to figure out, and unless there is metadata that explicitly gives us the encoding, guesswork comes into play. When an encoding is not 100% confirmed, it’s a good idea to seek additional documentation.
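When no documentation is available, one rough way to guess is to try decoding the raw bytes with a few candidate encodings in turn. This is a minimal sketch, not a substitute for chardet or real documentation; the candidate list and the sample bytes here are illustrative. Note that ISO-8859-1 decodes any byte sequence without error, so it only makes sense as the last resort in the list.

```python
def guess_encoding(raw_bytes, candidates=('ascii', 'utf-8', 'ISO-8859-1')):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# b'\xe9' is 'é' in ISO-8859-1 but invalid as ASCII or UTF-8
print(guess_encoding(b'caf\xe9'))  # ISO-8859-1
print(guess_encoding(b'cafe'))     # ascii
```

A successful decode does not prove the guess is right, only that it didn't fail, which is why confirming against documentation remains the better option.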
Another potentially important aspect of a source file is its size. If a file is huge, then we might not be able to read it into a dataframe. In the next section, we discuss how to figure out a source file’s size.