18.2. Wrangling and Transforming
We begin by taking a peek at the contents of our data file. To do this, we open the file and examine the first few rows (Chapter 8):
from pathlib import Path

# Create a Path pointing to our data file
insp_path = Path('data/donkeys.csv')

with insp_path.open() as f:
    # Display first five lines of file
    for _ in range(5):
        print(f.readline(), end='')
BCS,Age,Sex,Length,Girth,Height,Weight,WeightAlt
3,<2,stallion,78,90,90,77,NA
2.5,<2,stallion,91,97,94,100,NA
1.5,<2,stallion,74,93,95,74,NA
3,<2,female,87,109,96,116,NA
Since the file is CSV formatted, we can easily read it into a dataframe:
import pandas as pd

donkeys = pd.read_csv("data/donkeys.csv")
donkeys
|     | BCS | Age   | Sex      | Length | Girth | Height | Weight | WeightAlt |
|-----|-----|-------|----------|--------|-------|--------|--------|-----------|
| 0   | 3.0 | <2    | stallion | 78     | 90    | 90     | 77     | NaN       |
| 1   | 2.5 | <2    | stallion | 91     | 97    | 94     | 100    | NaN       |
| 2   | 1.5 | <2    | stallion | 74     | 93    | 95     | 74     | NaN       |
| ... | ... | ...   | ...      | ...    | ...   | ...    | ...    | ...       |
| 541 | 2.5 | 10-15 | stallion | 103    | 118   | 103    | 174    | NaN       |
| 542 | 3.0 | 2-5   | stallion | 91     | 112   | 100    | 139    | NaN       |
| 543 | 3.0 | 5-10  | stallion | 104    | 124   | 110    | 189    | NaN       |
544 rows × 8 columns
Over 500 donkeys participated in the survey, and eight measurements were made on each donkey. According to the documentation, the granularity is a single donkey (Chapter 9). Table 18.1 provides descriptions of the eight features.
| Feature   | Data type | Feature type | Description |
|-----------|-----------|--------------|-------------|
| BCS       | float64   | Ordinal      | Body condition score: from 1 (emaciated) to 3 (healthy) to 5 (obese) in increments of 0.5 |
| Age       | string    | Ordinal      | Age in years: under 2, 2–5, 5–10, 10–15, 15–20, and over 20 |
| Sex       | string    | Nominal      | Sex categories: stallion, gelding, female |
| Length    | int64     | Numeric      | Body length (cm) from front leg elbow to back of pelvis |
| Girth     | int64     | Numeric      | Body circumference (cm), measured just behind front legs |
| Height    | int64     | Numeric      | Body height (cm) up to point where neck connects to back |
| Weight    | int64     | Numeric      | Weight (kg) |
| WeightAlt | float64   | Numeric      | Second weight measurement (kg), taken on a subset of donkeys |
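As a quick check against Table 18.1, we can ask pandas for the column types it inferred when reading the file (pandas reports string columns as object):

# Compare the inferred column types with Table 18.1
donkeys.dtypes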
Figure 18.1 is a stylized representation of a donkey as a cylinder with neck and legs appended. Height is measured from the ground to the base of the neck above the shoulders; girth is around the body, just behind the legs; and length is from the front elbow to the back of the pelvis.
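This geometric picture also hints at why these three measurements could predict weight: treating the barrel of the donkey as a cylinder with circumference equal to the girth, the enclosed volume is girth² × length / (4π). The helper below is a minimal sketch of that idea; the function name and volume proxy are our own illustration, not part of the survey's protocol:

import numpy as np

def cylinder_volume_proxy(length_cm, girth_cm):
    # A cylinder with circumference C has radius C / (2 * pi), so its
    # volume is pi * r**2 * L = C**2 * L / (4 * pi), in cubic centimeters
    return girth_cm**2 * length_cm / (4 * np.pi)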
Our next step is to perform some quality checks on the data. In the previous section, we listed a few potential quality concerns based on scope. Next, we check the quality of the measurements and their distributions.
Let’s start by comparing the two weight measurements made on the subset of donkeys to check on the consistency of the scale. We make a histogram of the difference between these two measurements for the 31 donkeys that were weighed twice:
import plotly.express as px

donkeys = donkeys.assign(difference=donkeys["WeightAlt"] - donkeys["Weight"])

px.histogram(donkeys, x="difference", nbins=15,
             labels=dict(
                 difference="Differences of two weighings (kg)<br>on the same donkey"
             ),
             width=350, height=250,
)
The measurements are all within 1 kg of each other, and the majority are exactly the same (to the nearest kilogram). This gives us confidence in the accuracy of the measurements.
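To confirm the histogram's story numerically, a couple of quick summaries of the difference column we just created suffice:

# Restrict to the donkeys weighed twice, then summarize the differences
paired = donkeys.dropna(subset=['WeightAlt'])
print(paired['difference'].abs().max())    # largest absolute discrepancy (kg)
print((paired['difference'] == 0).mean())  # fraction of exact agreements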
Next, we look for unusual values in the body condition score:
donkeys['BCS'].value_counts()
BCS
3.0 307
2.5 135
3.5 55
...
1.5 5
4.5 1
1.0 1
Name: count, Length: 8, dtype: int64
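The output above is truncated. To see all eight score levels in order of BCS rather than by count, we can sort on the index instead:

# List every BCS level from 1.0 to 4.5 with its count
donkeys['BCS'].value_counts().sort_index()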
From this output, we see that there’s only one emaciated (BCS = 1) and one obese (BCS = 4.5) donkey. Let’s look at the complete records for these two donkeys:
donkeys[(donkeys['BCS'] == 1.0) | (donkeys['BCS'] == 4.5)]
|     | BCS | Age   | Sex    | Length | Girth | Height | Weight | WeightAlt |
|-----|-----|-------|--------|--------|-------|--------|--------|-----------|
| 291 | 4.5 | 10-15 | female | 107    | 130   | 106    | 227    | NaN       |
| 445 | 1.0 | >20   | female | 97     | 109   | 102    | 115    | NaN       |
Since these BCS values are extreme, we want to be cautious about including these two donkeys in our analysis. With only one donkey in each extreme category, our model might well not extend to donkeys with a BCS of 1 or 4.5, so we remove these two records and note that our analysis may not apply to emaciated or obese donkeys. In general, we exercise caution when dropping records from a dataframe: we need good reasons for the exclusion, we document our actions, and we keep the number of excluded observations small, lest we overfit a model by discarding every record that disagrees with it. Later, we may also decide to remove the five donkeys with a score of 1.5 if they appear anomalous in our analysis, but for now we keep them in our dataframe.
We remove these two outliers next:
def remove_bcs_outliers(donkeys):
    return donkeys[(donkeys['BCS'] >= 1.5) & (donkeys['BCS'] <= 4)]

donkeys = (pd.read_csv('data/donkeys.csv')
           .pipe(remove_bcs_outliers))
Now, we examine the distribution of values for weight to see if there are any issues with quality:
px.histogram(donkeys, x='Weight', nbins=40, width=350, height=250,
             labels={'Weight': 'Weight (kg)'})
It appears there is one very light donkey, weighing less than 30 kg. To see whether this record poses a problem for our analysis, we check the relationship between weight and height:
px.scatter(donkeys, x='Height', y='Weight', width=350, height=250,
           labels={'Weight': 'Weight (kg)', 'Height': 'Height (cm)'})
The small donkey is far from the main concentration of donkeys and would overly influence our models. For this reason, we exclude it. Again, we keep in mind that we may also want to exclude the one or two heavy donkeys if they appear to overly influence our future model fitting:
def remove_weight_outliers(donkeys):
    return donkeys[donkeys['Weight'] >= 40]

donkeys = (pd.read_csv('data/donkeys.csv')
           .pipe(remove_bcs_outliers)
           .pipe(remove_weight_outliers))
donkeys.shape
(541, 8)
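As a quick sanity check that our two filters dropped exactly the records we intended, we can compare against the raw file:

# We started with 544 rows; the BCS and weight filters should remove 3
raw = pd.read_csv('data/donkeys.csv')
print(len(raw) - len(donkeys))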
In summary, based on our cleaning and quality checks, we removed three anomalous observations from the dataframe. Now we’re nearly ready to begin our exploratory analysis. Before we proceed, we set aside some of our data as a test set.
We discussed why it's important to separate a test set from the training set in Chapter 16. A best practice is to set aside the test set early in the analysis, before we explore the data in detail, because during EDA we begin to make decisions about what kinds of models to fit and which variables to include. It's important that our test set isn't involved in these decisions so that it imitates how our model would perform on entirely new data.
We divide our data into an 80/20 split, where we use 80% of the data to explore and build a model. Then we evaluate the model with the 20% that has been set aside. We use a simple random sample to split the dataframe into the test and train sets. To begin, we randomly shuffle the indices of the dataframe:
import numpy as np

np.random.seed(42)

n = len(donkeys)
indices = np.arange(n)
np.random.shuffle(indices)
n_train = int(np.round(0.8 * n))
Next, we assign the first 80% of the dataframe to the train set and the remaining 20% to the test set:
train_set = donkeys.iloc[indices[:n_train]]
test_set = donkeys.iloc[indices[n_train:]]
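For reference, scikit-learn offers an equivalent one-liner. This sketch assumes scikit-learn is available; it produces the same kind of 80/20 split, though not the identical rows, since it shuffles with its own random state:

from sklearn.model_selection import train_test_split

# Shuffle and split in one step; random_state makes the split reproducible
train_set, test_set = train_test_split(donkeys, test_size=0.2, random_state=42)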
Now we’re ready to explore the training data and look for useful relationships and distributions that inform our modeling.