Data scientists transform dataframe columns when they need to change each value in a feature in the same way. For example, if a feature contains heights of people in feet, a data scientist might want to transform the heights to centimeters. In this section, we’ll introduce apply, an operation that transforms columns of data using a user-defined function.
import pandas as pd

baby = pd.read_csv('babynames.csv')
baby
2020722 rows × 4 columns
In the New York Times article on baby names [Williams, 2021], Pamela
mentions that names starting with the letters “L” and “K” became popular
after 2000. On the other hand, names starting with the letter “J” peaked in
popularity in the 1970s and 1980s and have dropped off since. We
can verify these claims using the baby data.
We approach this problem using the following steps:
Transform the Name column into a new column that contains the first letter of each value in Name.
Group the dataframe by the first letter and year.
Aggregate the name counts by summing.
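The three steps above can be sketched on a small made-up table. This is a minimal illustration, assuming a dataframe with Name, Year, and Count columns like baby; the `toy` data and the lambda are hypothetical stand-ins, not the real dataset.

```python
import pandas as pd

# Toy stand-in for the baby names data (values are made up for illustration)
toy = pd.DataFrame({
    'Name': ['Liam', 'Luna', 'Jack', 'Liam', 'Jack'],
    'Year': [2000, 2000, 2000, 2001, 2001],
    'Count': [10, 5, 8, 12, 7],
})

# Step 1: transform Name into a new column of first letters
toy_letters = toy.assign(Firsts=toy['Name'].apply(lambda name: name[0]))

# Steps 2 and 3: group by first letter and year, then sum the counts
toy_counts = (toy_letters
    .groupby(['Firsts', 'Year'])
    ['Count']
    .sum()
    .reset_index()
)
print(toy_counts)
```

Each row of `toy_counts` holds the total count for one first letter in one year, so the two "L" names in 2000 are combined into a single row.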
To complete the first step, we’ll apply a function to the Name column.
pd.Series objects contain an
.apply() method that takes in a function and
applies it to each value in the series. For instance, to find the length of
each name, we apply the len function:
names = baby['Name']
names.apply(len)
0          4
1          4
2          6
          ..
2020719    6
2020720    6
2020721    5
Name: Name, Length: 2020722, dtype: int64
To extract the first letter of each name, define a custom function and pass it
into .apply():
# The argument to the function is an individual value in the series.
def first_letter(string):
    return string[0]

names.apply(first_letter)
0          L
1          N
2          O
          ..
2020719    V
2020720    V
2020721    W
Name: Name, Length: 2020722, dtype: object
.apply() is similar to using a
for loop. The code above is roughly
equivalent to writing:
result = []
for name in names:
    result.append(first_letter(name))
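To make the comparison concrete, here is a small self-contained sketch that runs both versions on a hypothetical three-name series (`toy_names` is made up for illustration) and checks that they produce the same values. One difference worth noting: the loop builds a plain Python list, while .apply() returns a pd.Series that keeps the original index.

```python
import pandas as pd

def first_letter(string):
    return string[0]

# Hypothetical sample standing in for baby['Name']
toy_names = pd.Series(['Liam', 'Noah', 'Oliver'])

# Explicit loop: call the function on each value and collect the results
result = []
for name in toy_names:
    result.append(first_letter(name))

# .apply() calls the same function on each value and returns a pd.Series
applied = toy_names.apply(first_letter)

print(result)
print(applied.tolist())
```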
Now, we can assign the first letters to a new column in the dataframe:
letters = baby.assign(Firsts=names.apply(first_letter))
letters
2020722 rows × 5 columns
To create a new column in a dataframe, you might also encounter this syntax:
baby['Firsts'] = names.apply(first_letter)
This mutates the
baby table by adding a new column called
Firsts. In the
code above, we use
.assign() which doesn’t mutate the
baby table itself; it
creates a new dataframe instead. Mutating dataframes isn’t wrong but can be a
common source of bugs. Because of this, we’ll mostly use
.assign() in this book.
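The difference between the two approaches can be seen on a small made-up table. This is a sketch under the assumption that the dataframe has a Name column; `toy` and the helper variable names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Liam', 'Noah'], 'Count': [10, 8]})

# .assign() returns a new dataframe and leaves toy unchanged
with_firsts = toy.assign(Firsts=toy['Name'].apply(lambda name: name[0]))
columns_after_assign = list(toy.columns)
print(columns_after_assign)        # toy still has only Name and Count
print(list(with_firsts.columns))   # the new dataframe has Firsts as well

# Bracket assignment mutates toy in place
toy['Firsts'] = toy['Name'].apply(lambda name: name[0])
columns_after_mutation = list(toy.columns)
print(columns_after_mutation)      # toy itself now has Firsts
```

After the bracket assignment, any later code that uses `toy` sees the extra column, which is how mutation can lead to surprises in a long notebook.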
6.4.2. Example: Popularity of “L” Names
Now, we can use the
letters dataframe to see the popularity of first letters over time:
letter_counts = (letters
    .groupby(['Firsts', 'Year'])
    ['Count']
    .sum()
    .reset_index()
)
letter_counts
3641 rows × 3 columns
fig = px.line(letter_counts.loc[letter_counts['Firsts'] == 'L'],
              x='Year', y='Count', title='Popularity of "L" names',
              width=350, height=250)
margin(fig, t=30)
The plot shows that “L” names were popular in the 1960s, dipped in the decades after, but have indeed resurged in popularity after 2000.
What about “J” names?
fig = px.line(letter_counts.loc[letter_counts['Firsts'] == 'J'],
              x='Year', y='Count', title='Popularity of "J" names',
              width=350, height=250)
margin(fig, t=30)
The NYT article says that “J” names were popular in the 1970s and 80s. The plot agrees, and also shows that they have become less popular after 2000.
6.4.3. The Price of Apply
The power of
.apply() is its flexibility—you can call it with any function
that takes in a single data value and outputs a single data value.
Its flexibility has a price, though. Using
.apply() can be slow, since
pandas can’t optimize arbitrary functions. For example, using .apply() for
numeric calculations is much slower than using vectorized operations directly:
%%timeit
# Calculate the decade using vectorized operators
baby['Year'] // 10 * 10
20.5 ms ± 442 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def decade(yr):
    return yr // 10 * 10

# Calculate the decade using apply
baby['Year'].apply(decade)
549 ms ± 35.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The version using
.apply() is about 27 times slower (549 ms versus 20.5 ms)! For numeric
operations in particular, we recommend operating on pd.Series objects directly.
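As a quick check that the two approaches differ only in speed, the sketch below computes decades both ways on a small made-up table and compares the values. It also shows the .str accessor, which pandas provides for element-wise string operations, as an alternative way to extract first letters; this swaps in a different technique from the first_letter function used earlier, and no timing claim is made for it here. The `toy` data is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Liam', 'Noah', 'Oliver'],
                    'Year': [1987, 2004, 2019]})

def decade(yr):
    return yr // 10 * 10

# Operating on the series directly and using .apply() give the same values
vectorized = toy['Year'] // 10 * 10
applied = toy['Year'].apply(decade)
print(vectorized.tolist() == applied.tolist())

# For strings, the .str accessor indexes into every value at once
first_by_str = toy['Name'].str[0]
first_by_apply = toy['Name'].apply(lambda name: name[0])
print(first_by_str.tolist() == first_by_apply.tolist())
```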
In this section, we introduced data transformations.
To transform values in a dataframe, we commonly use the .apply() and .assign() methods.
In the next section, we’ll compare dataframes with other ways to represent and
manipulate data tables.