Working With Relations Using SQL
7. Working With Relations Using SQL¶
Note
This chapter repeats the data analyses in the Working With Dataframes Using pandas chapter using
relations and SQL instead of dataframes and Python. The datasets, data
manipulations, and conclusions are nearly identical across the two chapters so
that it’s easier for the reader to see how the same data manipulations are
performed in both pandas
and SQL.
If you’ve already read the dataframe chapter, you can focus your attention on this section where we introduce the relation, and the specific SQL code examples in the sections that follow.
Data scientists work with data stored in tables. This chapter introduces relations, one of the most widely used ways to represent data tables. We’ll also introduce SQL, the standard programming language for working with relations. Here’s an example of a relation that holds information about popular dog breeds:
breed | grooming | food_cost | kids | size |
---|---|---|---|---|
Labrador Retriever | weekly | 466.0 | high | medium |
German Shepherd | weekly | 466.0 | medium | large |
Beagle | daily | 324.0 | high | small |
Golden Retriever | weekly | 466.0 | high | medium |
Yorkshire Terrier | daily | 324.0 | low | small |
Bulldog | weekly | 466.0 | medium | medium |
Boxer | weekly | 466.0 | high | medium |
In a relation, each row represents a single record—in this case, a single
dog breed. Each column represents a feature about the record—for example, the
grooming
column represents how often each dog breed needs to be groomed.
Relations have labels for columns. For instance, this relation has a column
labeled grooming
. Within a column, data have the same type. For instance, the
food_cost
column contains numbers, and the size
column contains categories.
But data types can be different within a row.
Because of these properties, relations enable all sorts of useful operations.
Note
As a data scientist, you’ll often find yourself working with people from different backgrounds who use different terms. For instance, computer scientists say that the columns of a relation represent features of the data, while statisticians call them variables instead.
Other times, people will use the same term to refer to slightly different
things. Data types in a programming sense refers to how a computer stores
data internally. For instance, the size
column has a string data type in
Python. But from a statistical point of view, the size
column stores ordered
categorical data (ordinal data). We talk more about this specific distinction
in the Exploratory Data Analysis chapter.
In this chapter, we’ll show you how to do common relation operations using SQL. First, we’ll explain the structure of SQL queries. Then, we’ll show how to use SQL to perform common data manipulation tasks, like slicing, filtering, sorting, grouping, and joining.