# 8.4. File Size

Computers have finite resources. You have likely encountered these limits firsthand if your computer has slowed down from having too many applications open at once. We want to make sure that we do not exceed the computer’s limits while working with data, and we might choose to examine a file differently, depending on its size. If we know that our dataset is relatively small, then a text editor or a spreadsheet can be convenient to look at the data. On the other hand, for large datasets, a more programmatic exploration or even distributed computing tools may be needed.

In many situations, we analyze datasets downloaded from the Internet. These files reside on the computer’s disk storage. In order to use Python to explore and manipulate the data, we need to read the data into the computer’s memory, also known as random access memory (RAM). All Python code requires the use of RAM, no matter how short the code is. A computer’s RAM is typically much smaller than its disk storage. For example, one computer model released in 2018 had 32 times more disk storage than RAM. Unfortunately, this means that data files can often be much bigger than what is feasible to read into memory.

Both disk storage and RAM capacity are measured in terms of bytes. Roughly speaking, each character in a text file adds one byte to a file’s size. To succinctly describe the sizes of larger files, we use the prefixes described in Table 8.1. For example, a file containing 52428800 characters takes up $$52428800 / 1024^2 = 50~{\textrm{mebibytes}}$$, or 50 MiB, on disk.

Table 8.1 Prefixes for common file sizes.

| Multiple | Notation | Number of Bytes |
|----------|----------|-----------------|
| Kibibyte | KiB      | 1024            |
| Mebibyte | MiB      | 1024²           |
| Gibibyte | GiB      | 1024³           |
| Tebibyte | TiB      | 1024⁴           |
| Pebibyte | PiB      | 1024⁵           |
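The conversion between bytes and the prefixes in Table 8.1 can be sketched in a few lines of Python. The helper name `format_size` is our own choice for illustration and is not part of any library; it divides by 1024 until the value fits the largest suitable prefix.

```python
# Binary prefixes from Table 8.1, smallest to largest.
PREFIXES = ['B', 'KiB', 'MiB', 'GiB', 'TiB', 'PiB']

def format_size(num_bytes):
    """Express a byte count using the largest binary prefix that fits.

    `format_size` is a hypothetical helper written for illustration.
    """
    size = float(num_bytes)
    for prefix in PREFIXES:
        # Stop once the value is below 1024 or we run out of prefixes.
        if size < 1024 or prefix == PREFIXES[-1]:
            return f'{size:.1f} {prefix}'
        size /= 1024

print(format_size(52428800))  # the 50 MiB file from the text
print(format_size(1024**3))
```

Applied to the 52428800-byte file from the example above, the helper reports 50.0 MiB.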

Note

Why use multiples of 1024 instead of simple multiples of 1000 for these prefixes? This is a historical result of the fact that most computers use a binary number scheme where powers of 2 are simpler to represent ($$1024 = 2^{10}$$). You will also see the typical SI prefixes used to describe size—kilobytes, megabytes, and gigabytes, for example. Unfortunately, these prefixes are used inconsistently. Sometimes a kilobyte refers to 1000 bytes; other times, a kilobyte refers to 1024 bytes. To avoid confusion, we stick to kibi-, mebi-, and gibibytes which clearly represent multiples of 1024.

It is not uncommon to have a data file happily stored on a computer that will overflow the computer’s memory if we attempt to manipulate it with a program. So, we often begin our data work by making sure the files are of manageable size. To do this, we use the built-in os library.

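```python
from pathlib import Path
import os

import numpy as np

kib = 1024
# Format each row as a left-aligned, 25-character name followed by the size.
line = '{:<25} {}'.format

print(line('File', 'Size (KiB)'))
for filepath in Path('data').glob('*'):
    size = os.path.getsize(filepath)  # size in bytes
    print(line(str(filepath), np.round(size / kib)))
```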

```
File                      Size (KiB)
data/inspections.csv      455.0
data/co2_mm_mlo.txt       50.0
data/violations.csv       3639.0
data/DAWN-Data.txt        273531.0
data/legend.csv           0.0
data/businesses.csv       645.0
```


We see that the businesses.csv file takes up 645 KiB on disk, which is well within the memory capacity of most systems. Although the violations.csv file takes up 3.6 MiB of disk storage, most machines can easily read it into a pandas DataFrame too. But DAWN-Data.txt, which contains the DAWN survey data, is much larger.

The DAWN file takes up roughly 270 MiB of disk storage, and while some computers can work with this file in memory, it can slow down other systems. To make this data more manageable in Python, we can, for example, load in a subset of the columns rather than all of them.
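As a minimal sketch of loading only some columns, the pandas `read_csv` function accepts a `usecols` argument. The tiny in-memory CSV and its column names below are made up for illustration; they stand in for a wide file and are not the DAWN data, whose layout is not described here.

```python
import io

import pandas as pd

# A made-up CSV standing in for a wide data file (not the DAWN data).
csv_text = """id,age,sex,notes
1,34,F,long free text
2,51,M,more free text
"""

# Only the listed columns are parsed and kept in memory.
subset = pd.read_csv(io.StringIO(csv_text), usecols=['id', 'age'])
print(subset.columns.tolist())
```

Columns that are never loaded take up no memory, so dropping wide text fields at read time can shrink the resulting data frame considerably.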

Sometimes we are interested in the total size of a folder instead of the size of individual files. For example, we have three restaurant files, and might like to see whether we can combine all the data into a single data frame. Below, we calculate the size of the data folder, including all files in it.

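```python
# Path and os are imported in the previous code block.
mib = 1024**2

# Add up the sizes of all entries directly inside data/, in MiB.
total = 0
for filepath in Path('data').glob('*'):
    total += os.path.getsize(filepath) / mib

print(f'The data/ folder contains {total:.2f} MiB')
```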

```
The data/ folder contains 271.80 MiB
```
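The pattern `glob('*')` matches only the entries directly inside a folder. If a folder has subfolders, one possible variation is to walk it recursively with `rglob` and count regular files only. The function name `folder_size_bytes` and the throwaway folder below are our own, used so that the sketch does not depend on the contents of data/.

```python
import tempfile
from pathlib import Path

def folder_size_bytes(folder):
    """Sum the sizes of all regular files under `folder`, including subfolders."""
    return sum(p.stat().st_size for p in Path(folder).rglob('*') if p.is_file())

# Demonstrate on a temporary folder with one nested file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / 'a.txt').write_bytes(b'x' * 2048)
    (root / 'nested').mkdir()
    (root / 'nested' / 'b.txt').write_bytes(b'y' * 1024)
    demo_total = folder_size_bytes(root)

print(f'{demo_total / 1024:.1f} KiB')
```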


Note

As a rule of thumb, reading a file into pandas usually requires available memory of at least five times the file size. For example, reading in a 1 GiB file will typically require at least 5 GiB of available memory. Memory is shared by all programs running on a computer, including the operating system, web browsers, and the Jupyter notebook itself. A computer with 4 GiB of total RAM might have only 1 GiB of available RAM when many applications are running. With 1 GiB of available RAM, it is unlikely that pandas will be able to read in a 1 GiB file.
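The rule of thumb amounts to a single comparison, sketched here. The function name `likely_fits` is hypothetical, and the amount of available memory is passed in by hand, since how to measure it depends on the operating system.

```python
GIB = 1024**3

def likely_fits(file_bytes, available_bytes, factor=5):
    """Rule of thumb: we want about `factor` times the file size in free memory."""
    return available_bytes >= factor * file_bytes

# A 1 GiB file with 1 GiB of available RAM, as in the example above.
print(likely_fits(1 * GIB, 1 * GIB))
# The same file with 8 GiB of available RAM.
print(likely_fits(1 * GIB, 8 * GIB))
```

The factor of five is only a rough guide; the memory actually needed depends on the column types and on how the file is parsed.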

There are several strategies for working with data that are far larger than what is feasible to load into memory. We describe a few of them next.

## 8.4.1. Working with Large Datasets

The popular term “big data” generally refers to the scenario where the data are large enough that even top-of-the-line computers can’t read the data directly into memory. This is a common scenario in scientific domains like astronomy, where telescopes capture many large images of space that together can take up petabytes of storage (a pebibyte is $$2^{50}$$ bytes). While not quite as big, social media giants, health care providers, and other companies can also struggle with large amounts of data.

Figuring out how to draw insights from large datasets is an important research problem that motivates the fields of database engineering and distributed computing. While we won’t cover these fields in this book, we provide a brief overview of basic approaches.

Subset the Data.

One simple approach is to work with portions of data. Rather than loading in the entire source file, we can either select a specific part of it (e.g., one day’s worth of data), or we can randomly sample the dataset. Because of its simplicity, we use this approach quite often in this book. The natural downside is that this approach loses many of the benefits of analyzing a large dataset, like being able to study rare events.
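Both kinds of subsetting can be done while reading a file, so that the full dataset never has to sit in memory. The sketch below uses the `nrows` and `skiprows` arguments of the pandas `read_csv` function on made-up data; the 10% sampling rate and the seed are arbitrary choices for illustration.

```python
import io
import random

import pandas as pd

# Made-up data: a header line plus 1,000 data rows.
csv_text = 'row,value\n' + ''.join(f'{i},{i * 2}\n' for i in range(1000))

# Option 1: read a specific part, here the first 100 data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=100)

# Option 2: keep each data row with probability 0.1.
# Line 0 is the header, so it is never skipped.
rng = random.Random(42)
sample = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=lambda i: i > 0 and rng.random() > 0.1,
)

print(len(head), len(sample))
```

Reading the first rows is fast but can be unrepresentative if the file is sorted; the random sample avoids that at the cost of scanning the whole file.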

Use a Database System.

As discussed in Chapter 7, relational database management systems (RDBMS) are specifically designed to store large datasets. These systems can manipulate data that are too large to fit into memory by using SQL queries. Because of their advantages, RDBMSs are commonly used for data storage in research and industry settings. One downside is that they often require a separate server for the data that needs its own configuration. Another downside is that SQL is less flexible in what it can compute than Python, which becomes especially relevant for modeling. A useful hybrid approach is to use SQL to subset, aggregate, or sample the data into batches that are small enough to read into Python. Then, we can use Python for more sophisticated analyses.
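The hybrid approach might look like the following sketch, which uses an in-memory SQLite database as a small stand-in for a large RDBMS. The table `visits` and its columns are invented for illustration. The database performs the aggregation, and only the summary is read into a data frame.

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database with a made-up table, standing in for
# a large table on a database server.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE visits (day TEXT, n_patients INTEGER)')
conn.executemany(
    'INSERT INTO visits VALUES (?, ?)',
    [('mon', 3), ('mon', 5), ('tue', 2), ('tue', 7), ('tue', 1)],
)

# Let the database do the aggregation; only the summary reaches Python.
query = '''
    SELECT day, SUM(n_patients) AS total
    FROM visits
    GROUP BY day
    ORDER BY day
'''
daily = pd.read_sql_query(query, conn)
conn.close()
print(daily)
```

Here five rows shrink to two, one per day; with a real table the same query could reduce millions of rows to a summary that fits comfortably in memory.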

Use a Distributed Computing System.

Another approach to handling complex computations on large datasets is to use a distributed computing system like MapReduce, Spark, or Ray. These systems work best on tasks that can be split into many smaller parts: they divide a large dataset into smaller pieces and run programs on all of the pieces at once. These systems have great flexibility and can be used in a variety of scenarios. Their main downside is that they can require a lot of work to install and configure properly, because they are typically installed across many computers that need to coordinate with each other.
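The split, compute, and combine idea behind these systems can be illustrated on a single machine. The sketch below does not use the API of MapReduce, Spark, or Ray; it uses threads from the Python standard library as stand-ins for the workers of a cluster, and the helper names `split` and `partial_sum_count` are our own.

```python
from concurrent.futures import ThreadPoolExecutor

def split(values, n_parts):
    """Divide a list into n_parts roughly equal, contiguous pieces."""
    size = -(-len(values) // n_parts)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]

def partial_sum_count(piece):
    """Work done independently on one piece: its sum and its length."""
    return sum(piece), len(piece)

data = list(range(1, 101))
pieces = split(data, 4)

# Threads on one machine stand in for the workers of a real cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_count, pieces))

# Combine the partial results into the overall mean.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count
print(mean)
```

The mean works well here because it can be rebuilt from per-piece sums and counts. A statistic like the median cannot be combined this simply, which is one reason some tasks suit these systems better than others.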

It can be convenient to use Python to determine a file’s format, encoding, and size. Another powerful tool for working with files is the shell; the shell is widely used and has a more succinct syntax than Python. In the next section, we introduce a few command-line tools available in the shell for carrying out the same tasks of finding out information about a file before reading it into a data frame.