8.2. File Formats

A file format describes how data are stored on a computer’s disk or other storage device. Understanding the file format helps us figure out how to read the data into Python in order to work with it as a data table. In this section, we introduce several popular formats used to store data tables. These are all plain text formats, meaning they are easy for us to read with a text editor such as Sublime, Vim, and Emacs.

Note

The file format and the structure of the data are two different things. The data’s structure is a mental representation of the data and tells us what kinds of operations we can do. For example, a table structure corresponds to data values arranged in rows and columns. But, the same table can be stored in many different types of file formats.

8.2.1. Delimited format

Delimited formats use a specific character to separate data values. Usually, these separators are either a comma (Comma-Separated-Values or CSV for short), a tab (Tab-Separated Values or TSV), white-space, or a colon. These formats are natural for storing data that have a table structure. Each line in the file represents a record, which are delimited by newlines (\n or \r\n) characters. And, within a line, the record’s information is delimited by the comma character (,) for CSV or the tab character (\t) for TSV, and so on. The first line of these files often contains the names of the table’s columns/features.

The San Francisco restaurant scores are stored in CSV-formatted files. Let’s display the first few lines of the inspections.csv file. In Python, the built-in pathlib library has a useful Path object to specify paths to files and folders that work across platforms. The data are stored in the file data/inspections.csv, so we use Path() to create the full path name. The Path object below has many useful methods, such as open() as shown below.

from pathlib import Path

# Create a Path pointing to our data file
insp_path = Path() / 'data' / 'inspections.csv'

with insp_path.open() as f:
    for _ in range(6):
        print(f.readline(), end='')
"business_id","score","date","type"
19,"94","20160513","routine"
19,"94","20171211","routine"
24,"98","20171101","routine"
24,"98","20161005","routine"
24,"96","20160311","routine"

Note

Paths are tricky when working across different operating systems (OS). For instance, a typical path in Windows might look like C:\files\data.csv, while a path in Unix or MacOS might look like ~/files/data.csv. Because of this, code that works on one OS can fail to run on other operating systems.

The pathlib Python library was created to avoid OS-specific path issues. By using it, the code shown here is more portable—it works across Windows, MacOS, and Unix.

Displaying the first few lines of a file is something we’ll do often, so we create a function as a shortcut:

def head(filepath, n=5, width=-1):
    '''Prints the width characters of first n lines of filepath'''
    with filepath.open() as f:
        for _ in range(n):
            print(f.readline(), end='') if width < 0  else print(f.readline()[:width])

Notice that the field names appear in the first line of the file; these names are comma-separated and in quotations. We see four fields: the business identifier, the restaurant’s score, the date of the inspection, and the type of inspection. Each line in the file corresponds to one inspection, and the ID, score, date and type values are separated by commas. In addition to identifying the file format, we also want to identify the format of the features. We see two things of note: the scores and dates both appear as strings. We will want to convert the scores to numbers so we can calculate summary statistics and create visualizations. And, we will convert the date into a date-time format so that we can make time-series plots. We show how to carry out these transformations in Chapter 9.

Note

People often confuse CSV and TSV files with spreadsheets. This is in part because most spreadsheet software (like Microsoft Excel) will automatically display a CSV file as a table in a workbook. Behind the scenes, Excel looks at the file format and encoding just like we’ve done in this section. However, Excel files have a different format than CSV and TSV files, and we need to use different pandas functions to read these formats into Python.

All three of the restaurant source files are CSV formatted. In contrast, the DAWN source file has a fixed-width format. We describe this kind of formatting next.

8.2.2. Fixed-width Format

The fixed-width format (FWF) does not use delimiters to separate data values. Instead, the values for a specific field appear in the exact same position in each line. The DAWN source file has this format. Each line in the file is very long. For display purposes, we only show the first few characters from the first 5 lines in the file.

dawn_path = Path() / 'data' / 'DAWN-Data.txt'
head(dawn_path, width=65)
     1 2251082    .9426354082   3 4 1 2201141 2 865 105 1102005 1
     2 2291292   5.9920106887   911 1 3201134 12077  81  82 283-8
     3 7 7 251   4.7231718669   611 2 2201143 12313   1  12  -7-8
     410 8 292   4.0801470012   6 2 1 3201122 1 234 358  99 215 2
     5 122 942   5.1777093467  10 6 1 3201134 3 865 105 1102005 1

Notice how the values appear to align from one row to the next. For example, there is a decimal point in the same position (the 19th character) in each line. Notice also that some of the values seem to be squished together, and we need to know the exact position of each piece of information in a line in order to make sense of it. SAMHSA provides a 2,000-page codebook with all of this information including some basic checks so we can confirm that we have correctly read the file. The codebook tells us that the age field appears in positions 34-35 and is coded in intervals from 1 to 11. For instance, the first two records shown above have age categories of 4 and 11; the codebook tells us that a 4 stands for the age bracket “6 to 11”, and 11 is for “65+”.

Note

A widely adopted convention is to use the filename extension, such as .csv, .tsv, and .txt, to indicate the format of the contents of the file. File names that end with .csv are expected to contain comma-separated values, .tsv tab-separated values, and .txt generally is plain text without a designated format. However, these extension names are only suggestions. Even if a file has a .csv extension, the actual contents might not be formatted properly! It’s good practice to inspect the contents of the file before loading it into a data frame. If the file is not too large, you can open and examine it with a plain text editor. Otherwise, you view a couple of lines using .readline() or shell commands.

Other plain text formats that are popular include hierarchical formats and loosely formatted text (in contrast to formats that directly support table structures). These are covered in greater detail in other chapters, but for completeness we briefly describe them here.

8.2.3. Hierarchical Formats

Hierarchical formats store data in a nested form. For instance, JavaScript Object Notation (JSON), which is commonly used for communication by web servers, includes key-value pairs and arrays that can be nested, similar to a Python dictionary. The eXtensible Markup Language (XML) and HyperText Markup Language (HTML) are other common formats for storing documents on the Internet. Like JSON, these files also have a hierarchical, key-value format. We cover both formats (JSON and XML) in more detail in Chapter 14.

Next, we briefly describe other plain text files that don’t fall into any of the previous categories but still have some structure to them that enables us to read and extract information.

8.2.4. Loosely Formatted Text

Web logs, instrument readings, and program logs typically provide data in plain text. For example, below is one line of a Web log (we’ve split it across multiple lines for readability). It contains information such as the date and time and type of request made to a Web site.

169.237.46.168 - -
[26/Jan/2004:10:47:58 -0800]"GET /stat141/Winter04 HTTP/1.1" 301 328
"http://anson.ucdavis.edu/courses"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; .NET CLR 1.1.4322)"

There are organizational patterns present, but not in a simple delimited format. This is what we mean by “loosely formatted”. We see that the date and time appear between square brackets and the type of request (GET in this case) follows the date-time information and appears in quotes. Later in Chapter 13, we use these observations about the Web log’s format and string manipulation tools to extract values of interest into a data table.

As another example, below is a single record taken from a wireless device log. The device reports the timestamp, identifier, location of the device, and the signal strengths that it picks up from other devices. This information uses a combination of formats: key=value pairs, semicolon delimited, and comma delimited values.

t=1139644637174;id=00:02:2D:21:0F:33;pos=2.0,0.0,0.0;degree=45.5;
00:14:bf:b1:97:8a=-33,2437000000,3;00:14:bf:b1:97:8a=-38,2437000000,3;

Like with the Web logs, we can use string manipulation and the patterns in the records to extract features into a table.

We have primarily introduced formats for plain text data that are widely used for storing and exchanging tables. The comma-separated-value format is the most common, but others, such as tab-separated and fixed-width, are also prevalent. And, there are many types of file formats that store data!

So far, we have used the term ‘plain text’ to broadly cover formats that can be viewed with a text editor. However, a plain text file may have different encodings, and if we don’t specify the encoding correctly, the values in the data frame might contain gibbersih. We give an overview of file encoding next.