8. Wrangling Files

Before you can work with data in Python, it helps to understand the source files that store the data. You want answers to a couple of basic questions:

  • How much data do you have?

  • How is the source file formatted?

The answers to these questions guide how you read in the data. If your file is too large, you may need special approaches to read it in. If your file isn’t formatted the way you expect, you may not be able to load it properly into a dataframe.
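As a rough illustration of these two questions, the following sketch reports a file's size in bytes and shows its first few lines, without loading the whole file into memory. The helper name `describe_file`, the sample filename, and the sample contents are hypothetical and chosen only for this example, and the UTF-8 encoding is an assumption about the file.

```python
import tempfile
from itertools import islice
from pathlib import Path


def describe_file(path, n_lines=3, encoding="utf-8"):
    """Return (size in bytes, first n_lines lines) for a text file.

    Hypothetical helper for illustration; assumes the file is plain text.
    """
    path = Path(path)
    # How much data do you have? Ask the filesystem for the size on disk.
    size_bytes = path.stat().st_size
    # How is the file formatted? Peek at the first few lines only.
    # errors="replace" keeps the peek from failing if the encoding guess is wrong.
    with path.open(encoding=encoding, errors="replace") as f:
        head = [line.rstrip("\n") for line in islice(f, n_lines)]
    return size_bytes, head


# Write a tiny made-up CSV file to a temporary folder, then inspect it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes(b"name,score\nada,90\ngrace,95\n")
    size_bytes, head = describe_file(sample, n_lines=2)
    print(f"{size_bytes} bytes")
    for line in head:
        print(line)
```

Reading only the first few lines keeps the check cheap even for a very large file, and the lines themselves hint at the delimiter and at whether the file has a header row.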

Although there are many types of structures that can represent data, in this book we primarily work with data tables, such as Pandas DataFrames and SQL relations. (Chapter 13 examines less-structured text data, and Chapter 14 introduces hierarchical formats and binary files.) We focus on data tables for several reasons. Research on how to store and manipulate data tables has resulted in stable and efficient tools for working with tables. Plus, data in a tabular format are close cousins of matrices, the mathematical objects of the immensely rich field of linear algebra. And of course, data tables are quite common.

In this chapter, we introduce typical file formats and encodings for plain text, describe measures of file size, and use Python tools to examine source files. Later in the chapter, we introduce an alternative approach for working with files: the shell interpreter. Shell commands give us a programmatic way to get information about a file outside of the Python environment, and the shell can be especially useful when files are large. Finally, we check the data table’s shape (the number of rows and columns) and granularity (what a row represents). These simple checks are the starting point for cleaning and analyzing our data.

We first provide brief descriptions of the datasets that we use as examples throughout this chapter.