9.6. Example: Wrangling Restaurant Safety Violations

We wrap up this chapter with an example that demonstrates many data wrangling techniques. Recall from Chapter 8 that the San Francisco restaurant inspection data are stored in three tables: bus (for businesses/restaurants), insp (for inspections), and viol (for safety violations). The violations dataset contains detailed descriptions of violations found during an inspection. We would like to capture some of this information and connect it to the inspection score, which is in the inspection-level dataset.

Our goal is to figure out which kinds of safety violations are associated with lower restaurant safety scores. This example covers several key ideas in data wrangling related to changing structure:

  • filtering to focus on a narrower segment of data

  • aggregation to modify the granularity of a table

  • joining to bring together information across tables

Additionally, an important part of this example demonstrates how we transform text data into numeric quantities for analysis.

As a first step, let’s simplify the structure by reducing the data to inspections from one year. (Recall that this dataset contains four years of inspection information.) Below we tally the number of records for each year in the inspections table.

insp['year'].value_counts()
2016    5443
2017    5166
2015    3305
2018     308
Name: year, dtype: int64

Reducing the data to cover one year of inspections will simplify our analysis. Later, if we want, we can return to carry out an analysis with all four years of data.

9.6.1. Narrowing the Focus

We restrict our data wrangling to inspections that took place in 2016. Here, we can use the pipe function again to apply the same filtering to both the inspections and violations data frames.

def subset_2016(df):
    return df.query('year == 2016')

vio2016 = viol.pipe(subset_2016)
ins2016 = insp.pipe(subset_2016)
ins2016.head(5)
business_id score date type timestamp year
0 19 94 20160513 routine 2016-05-13 2016
3 24 98 20161005 routine 2016-10-05 2016
4 24 96 20160311 routine 2016-03-11 2016
6 45 78 20160104 routine 2016-01-04 2016
9 45 84 20160614 routine 2016-06-14 2016

In Chapter 8 we found that business_id and timestamp together uniquely identify the inspections (with a couple of exceptions). We also see here that restaurants can receive multiple inspections in a year—business #24 had two inspections in 2016, one in March and another in October.
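A quick way to check such a claim about a key is to count duplicated (business_id, timestamp) pairs. This is a sketch on a tiny stand-in frame, not the chapter's full ins2016 data:

```python
import pandas as pd

# Toy stand-in for ins2016; the real frame has 5,443 rows
ins = pd.DataFrame({
    'business_id': [19, 24, 24],
    'timestamp': pd.to_datetime(['2016-05-13', '2016-10-05', '2016-03-11']),
    'score': [94, 98, 96],
})

# Rows whose (business_id, timestamp) pair repeats an earlier row;
# zero means the pair uniquely identifies each inspection
n_dups = ins.duplicated(subset=['business_id', 'timestamp']).sum()
```

Running the same check on the full table would surface the couple of exceptions mentioned above.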

Next, let’s look at a few records from the violations table.

vio2016.head(5)
business_id date description timestamp year
2 19 20160513 Unapproved or unmaintained equipment or utensi... 2016-05-13 2016
3 19 20160513 Unclean or degraded floors walls or ceilings ... 2016-05-13 2016
4 19 20160513 Food safety certificate or food handler card n... 2016-05-13 2016
6 24 20161005 Unclean or degraded floors walls or ceilings ... 2016-10-05 2016
7 24 20160311 Unclean or degraded floors walls or ceilings ... 2016-03-11 2016

Notice that the first few records are all for the same restaurant. If we want to bring violation information into the inspections table, we need to address the different granularities of these tables. One approach is to aggregate the violations in some way. We discuss this next.

9.6.2. Aggregating Violations

One simple aggregation of the violations is to count them and add that count to the inspections data table. To find the number of violations at an inspection, we can group the violations by business_id and timestamp, and then find the size of each group. Essentially, this grouping changes the granularity of violations to an inspection level.

num_vios = (vio2016
            .groupby(['business_id', 'timestamp'])
            .size()
            .reset_index()
            .rename(columns={0: 'num_vio'}))
num_vios.head(3)
business_id timestamp num_vio
0 19 2016-05-13 3
1 24 2016-03-11 2
2 24 2016-10-05 1

Now we need to merge this new information with ins2016. Specifically, we want to left join ins2016 with num_vios because there could be inspections that do not have any violations and we don’t want to lose them.

def left_join_vios(ins):
    return ins.merge(num_vios, on=['business_id', 'timestamp'], how='left')

ins_and_num_vios = ins2016.pipe(left_join_vios)
ins_and_num_vios
business_id score date type timestamp year num_vio
0 19 94 20160513 routine 2016-05-13 2016 3.0
1 24 98 20161005 routine 2016-10-05 2016 1.0
2 24 96 20160311 routine 2016-03-11 2016 2.0
... ... ... ... ... ... ... ...
5440 90096 91 20161229 routine 2016-12-29 2016 2.0
5441 90268 100 20161229 routine 2016-12-29 2016 NaN
5442 90269 100 20161229 routine 2016-12-29 2016 NaN

5443 rows × 7 columns

When there are no violations at an inspection, the feature num_vio has a missing value (NaN). We can check how many values are missing:

ins_and_num_vios['num_vio'].isnull().sum()
833

About 15% of restaurant inspections in 2016 had no safety violations recorded. We can correct these missing values by setting them to 0 if the restaurant had a perfect safety score of 100. This is an example of deductive imputation since we’re using domain knowledge to fill in missing values.

def zero_vios_for_perfect_scores(df):
    df = df.copy()
    df.loc[df['score'] == 100, 'num_vio'] = 0
    return df

ins_and_num_vios = (ins2016.pipe(left_join_vios)
                    .pipe(zero_vios_for_perfect_scores))

We can count the number of inspections with missing violation counts again.

ins_and_num_vios['num_vio'].isnull().sum()
65

We have corrected a large number of missing values. With further investigation, we find that some of the businesses have inspection dates that are close but don’t quite match. We could do a fuzzy match where inspections with dates that are only one or two days apart are matched. But for now, we just leave them as NaN.
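One way to sketch such a fuzzy match is with pd.merge_asof, which pairs each row with the nearest match on a sorted key within a tolerance. The mini-frames below are hypothetical stand-ins for ins2016 and num_vios:

```python
import pandas as pd

# Hypothetical mini-versions of the inspection and violation-count tables;
# the first violation-count date is off by one day from the inspection date
ins = pd.DataFrame({'business_id': [45, 45],
                    'timestamp': pd.to_datetime(['2016-01-04', '2016-06-14']),
                    'score': [78, 84]})
counts = pd.DataFrame({'business_id': [45, 45],
                       'timestamp': pd.to_datetime(['2016-01-05', '2016-06-14']),
                       'num_vio': [4, 2]})

# Match each inspection to the nearest violation count within two days;
# both frames must be sorted on the key, and by= keeps businesses separate
fuzzy = pd.merge_asof(ins.sort_values('timestamp'),
                      counts.sort_values('timestamp'),
                      on='timestamp', by='business_id',
                      direction='nearest', tolerance=pd.Timedelta('2d'))
```

The off-by-one-day inspection now picks up its violation count instead of a NaN.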

Let’s examine the relationship between the number of violations and the inspection score.

px.strip(ins_and_num_vios, x="num_vio", y="score",
         height=250, width=400)

[Figure: strip plot of inspection score versus number of violations]

As we might expect, there is a negative relationship between the number of violations and the inspection score. We can also see that the variability in scores grows with the number of violations. It appears that some violations are more serious than others and have a greater impact on the score. We extract information about the kinds of violations next.
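One way to quantify this growing spread is to compute the standard deviation of score within each violation-count group. The frame below is a tiny made-up example; with the real data we would group ins_and_num_vios instead:

```python
import pandas as pd

# Made-up scores; a stand-in for grouping ins_and_num_vios
df = pd.DataFrame({
    'num_vio': [0, 0, 1, 1, 2, 2, 2],
    'score':   [100, 100, 96, 92, 90, 80, 70],
})

# Standard deviation of score within each violation-count group;
# increasing values confirm that spread grows with violations
spread = df.groupby('num_vio')['score'].std()
```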

9.6.3. Extracting Information from Violation Descriptions

We saw earlier that the feature description in the violations data frame has a lot of text, including information in square brackets about when the violation was corrected. We can tally the descriptions and examine the most common violations.

display_df(vio2016['description'].value_counts().head(20), rows=20)
Unclean or degraded floors walls or ceilings                                 161
Unapproved or unmaintained equipment or utensils                              99
Moderate risk food holding temperature                                        95
Inadequate and inaccessible handwashing facilities                            93
Inadequately cleaned or sanitized food contact surfaces                       92
Improper food storage                                                         81
Wiping cloths not clean or properly stored or inadequate sanitizer            71
Food safety certificate or food handler card not available                    64
Moderate risk vermin infestation                                              58
Foods not protected from contamination                                        56
Unclean nonfood contact surfaces                                              54
Inadequate food safety knowledge or lack of certified food safety manager     52
Improper storage of equipment utensils or linens                              41
Permit license or inspection report not posted                                41
Low risk vermin infestation                                                   34
High risk food holding temperature                                            32
Inadequate warewashing facilities or equipment                                31
Improper or defective plumbing                                                28
Inadequate ventilation or lighting                                            26
Unclean or unsanitary food contact surfaces                                   23
Name: description, dtype: int64

Reading through these wordy descriptions, we see that some are related to the cleanliness of facilities, others to food storage, and still others to staff hygiene.

Since there are many types of violations, we can try to group them into larger categories. One way to do this is to create a simple boolean flag based on whether the text contains a specific word or phrase, like “vermin”, “hand”, or “high risk”.

With this approach, we create eight new features for different categories of violations. Don’t worry about the particular details of the code for now—this code uses regular expressions, covered later in Chapter 13. The important idea is that this code creates features containing True or False based on whether the violation description contains specific words.

def make_vio_categories(vio):
    def has(term):
        return vio['description'].str.contains(term)

    vio = vio.copy()
    vio['high_risk'] = has(r"high risk")
    vio['clean'] = has(r"clean|sanit")
    vio['food_surface'] = (has(r"surface") & has(r"\Wfood"))
    vio['vermin'] = has(r"vermin")
    vio['storage'] = has(r"thaw|cool|therm|storage")
    vio['permit'] = has(r"certificate|permit")
    vio['non_food_surface'] = has(r"wall|ceiling|floor|surface")
    vio['human'] = has(r"hand|glove|hair|nail")
    return vio
# The violations with the new categories
vio2016 = (viol.pipe(subset_2016)
           .pipe(make_vio_categories))

Now that we have these new features in vio2016, we again want to merge this information with the inspection information. We roll up the violations again to the restaurant/date level and track whether any of the values in the group is True.

feature_names = [
    'high_risk', 'clean', 'food_surface', 'vermin', 'storage',
    'permit', 'non_food_surface', 'human'
]

features = (vio2016
            .groupby(['business_id', 'timestamp'])
            [feature_names]
            .max())

Once again we use a left join to merge these new features into the inspection-level data frame. And, for the special case of a score of 100, we set all of the new features to False.

def left_join_features(ins):
    return ins.merge(features, on=['business_id', 'timestamp'], how='left')

def zero_features_for_perfect_scores(ins):
    ins = ins.copy()
    ins.loc[ins['score'] == 100, feature_names] = False
    return ins
ins_and_vios = (ins2016.pipe(left_join_features)
                .pipe(zero_features_for_perfect_scores))
ins_and_vios.head(3)
business_id score date type ... storage permit non_food_surface human
0 19 94 20160513 routine ... False True True True
1 24 98 20161005 routine ... False False True False
2 24 96 20160311 routine ... False False True False

3 rows × 14 columns

To see how each violation category relates to the score, we can make a collection of boxplots that compares the score distributions with and without each violation.

Now, we make our plot:

[Figure: boxplots comparing score distributions with and without each violation category]

There are several interesting takeaways in these box plots. Restaurants that have no cleanliness-related violations have the highest scores. Having a permit violation isn’t much of a detractor. And, “high risk” violations lower the distribution of scores more than any other violation. These observations are just the beginning of a possible analysis into understanding what these scores mean.