Python is an excellent data analysis tool since it features a rich ecosystem of data-centric packages. Pandas is one of the more popular packages that simplifies importing and analyzing data.
In this guide, we discuss how the .groupby() method works using split-apply-combine and also how to access groups and transform data.
What is the Pandas GroupBy Method?
The GroupBy method in Pandas is designed to mimic the functionality of the GROUP BY statement in SQL. The Pandas method works similar to the SQL statement, splitting the data, then aggregating it as instructed before combining the data.
Since the .groupby() splits the data first, Python users can work with the groups directly. Furthermore, the aggregations are done after the splitting, giving users full control over how it is aggregated.
Loading a Pandas DataFrame
To understand how to use the GroupBy method, begin by loading a sample dataset. You can follow along on your machine by pasting the code below that uses the .read_csv() method to load the data:
import pandas as pd df = pd.read_csv('https://raw.githubusercontent.com/datagy/data/main/sales.csv', parse_dates=['date'])
Before getting into how the GroupBy method’s objects work, let’s print the first five rows of the dataset using the .head() method:
You will notice that there is a date column indicating the transaction dates. The gender and region columns are string-type columns holding the salesperson’s data. The sales column represents the respective sales value.
How Pandas GroupBy Objects Work
Creating a GroupBy object is as simple as applying the method to the DataFrame. You can either pass a single column or a group of them. Here’s what this looks like:
print(df.groupby('region')) # Output: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fb78815a4f0>
As you can see, it returns a DataFrameGroupBy object, and since it’s an object, we can explore some of its attributes.
Attributes of GroupBy Objects
The objects come with a .ngroups attribute that stores the number of groups in the grouping. Here’s how you can count the groups in the object:
print(df.groupby('region').ngroups) # Output: 3
In the same way, you can use the .groups attribute to find more details about the groups. The attribute outputs a dictionary-like object. The object holds the groups as keys, and the value of these keys are the indices of the rows of the group.
To access the groups in the object, you can run:
But if you only want to find the group names of the object, you could return just the keys of the dictionary, like so:
print(df.groupby('region').groups.keys()) # Output: dict_keys(['North-East', 'North-West', 'South'])
Selecting a Pandas GroupBy Group
The GroupBy method allows you to select all the records in a specific group. This can be invaluable when you want an overview of the data in a group. You can do this by using the .get_group() method and passing the group’s name.
For example, running the following statement will show you the data of the “South” region in our sample dataset:
Understanding Pandas GroupBy Split-Apply-Combine
The GroupBy method utilizes a process called split-apply-combine to offer useful modifications or aggregations of the DataFrame. The process is quite self-explanatory and works as follows:
- The data is split into groups based on some criteria
- Functions are applied to each group independently
- The results are combined into a data structure
The previous sections walk you through using the .groupby() method to divide the data into groups according to the passed criteria. Therefore, you’re already familiar with the first step.
The idea behind splitting up the data is to break down the large data analysis problem into smaller, more workable parts. When the problems you need to deal with are smaller, it will be easier to perform operations on parts before putting them back together.
While the apply and combine steps are distinct and run separately, the library makes it look like the two steps happen in one go.
Aggregating Data with GroupBy
Running the following code will aggregate the data:
averages = df.groupby('region')['sales'].mean() print(averages) # Output: # region # North-East 17386.072046 # North-West 15257.732919 # South 24466.864048 # Name: sales, dtype: float64
Let’s break down this code to understand how it works:
- df.groupby(‘region’) splits the data into groups based on the region column.
- The [‘sales’] picks only the region column from the groups.
- The .mean() applies the mean method to the column in every group.
- The data is then combined into the final DataFrame.
Other Aggregations with GroupBy
Now that you understand the split-apply-combine process, here’s an overview of the various aggregation functions available:
Number of non-null records
Maximum value of the group
Arithmetic mean of values
Median of the values
Minimum value of the group
Most frequent value in the group
Standard deviation of the group
Sum of values
Variance of the group
You can freely use these methods to work with the data as you desire. For instance, if you want to calculate the standard deviation of each group, the following code would do it:
standard_deviations = df.groupby('region')['sales'].std() print(standard_deviations)
Applying Multiple Aggregations
One of the most powerful features of the Pandas library is that it allows you to apply multiple aggregations in the GroupBy method via the .agg() method. With this method, you can pass a list of callables to GroupBy. Here’s how you could apply multiple aggregations with the .agg() function:
import numpy as np aggs = df.groupby('region')['sales'].agg([np.mean, np.std, np.var]) print(aggs) # Output: # mean std var # region # North-East 17386.072046 2032.541552 4.131225e+06 # North-West 15257.732919 3621.456493 1.311495e+07 # South 24466.864048 5253.702513 2.760139e+07
The .agg() function enables you to generate summary statistics based on different groups. The function makes working with the data convenient since there is no need to use the .groupby() method three times to get the same result.
Transforming Data with GroupBy
The GroupBy method also makes it easy for users to transform the data. Put simply, transforming the data means to perform an operation specific to that group.
This could include working with missing data by attributing a value based on the group and standardizing data using a z-score.
But what makes the transformation different from aggregation and filtering? Unlike during aggregation and filtering, the output DataFrame will always have the same dimensions as the original data after transformation. This is not always true for aggregating and filtering.
Using the .transform() method returns one value for every record in the original dataset. This guarantees that the result will be the same size.
Understanding how .transform() works will be easier with an example. Let’s assume you want to calculate the percentage of a region’s total sales. You can pass the “sum” callable and return the sum for the group onto each row. You can then divide the original sales column by the sum, like so:
df['Percent Of Region Sales'] = df['sales'] / df.groupby('region')['sales'].transform('sum') print(df.head())
Interestingly, you can also transform data in GroupBy without using the .transform() method. You can apply a function that returns a single value without aggregating the data.
For instance, if you apply the .rank() method to your grouping, it will rank the values in every group:
df['ranked'] = df.groupby('region')['sales'].rank(ascending=False) print(df.sort_values(by='sales', ascending=False).head())
Running this code will return a Pandas series having the same length as the original DataFrame. You can then assign the series to a new column.
Filtering Data with GroupBy
Filtering the DataFrame is one of the most useful ways of using the .groupby() method. But it’s important to know that this approach differs from a regular filter since GroupBy allows you to apply the filtering method based on aggregations of groups’ values.
For instance, you can filter your DataFrame to get rid of the rows where a group’s average sale price is less than 20,000:
df = df.groupby('region').filter(lambda x: x['sales'].mean() < 20000) print(df.head())
Here’s a breakdown of how this code works:
- First, the code groups the data by the “region” column.
- Next, the .filter() method filters the data based on the lambda function you pass.
- Finally, the lambda function works out if the average value in the group for the “sales” column is lower than 20,000.
Filtering the data this way removes the need for you to determine the average value for every group before filtering out those values. While approaching filtering this way may seem unnecessary in this example, it can be invaluable when working with smaller groups.
Grouping a DataFrame by Multiple Columns
You can explore the other functionality of the GroupBy method by grouping your data by multiple columns. In all the examples above, we pass a string representing a single column and group the DataFrame by that column.
However, GroupBy also allows you to pass a list of strings, each representing different columns. This way, you can split the data further.
Let’s explore this functionality of GroupBy by working out the sum of all the sales grouped by “region” and “gender”:
sums = df.groupby(['region', 'gender']).sum() print(sums.head())
What’s more interesting is that all the methods covered in this post can be used in this manner. You could apply the .rank() function again to identify the top sales in every combination of region and gender, like so:
df['rank'] = df.groupby(['region', 'gender'])['sales'].rank(ascending=False) print(df.head())
Using Custom Functions with GroupBy
One of the best features of the GroupBy method is that you can apply your own functions to work with the data. You have the option of using an anonymous lambda function, but you can also define functions specific to your analysis needs.
Let’s understand how this works by defining a custom function that returns a group’s range by working out the difference between the minimum and maximum values. Here’s how you can define such a function before applying it to your .groupby() method call:
def group_range(x): return x.max() - x.min() ranges = df.groupby(['region', 'gender'])['sales'].apply(group_range) print(ranges) # Output: # region gender # North-East Female 10881 # Male 10352 # North-West Female 20410 # Male 17469 # South Female 30835 # Male 27110 # Name: sales, dtype: int64
The group_range() function defined above accepts a single parameter, which, in our case, is the series of “sales” groupings. Next, the code finds the smallest and largest values before returning the difference between the two. This way, it helps us see how the group ranges differ.
The GroupBy function is accessible and easy to use, making it one of the best data analysis methods. Some practical tips to remember when working with the method include:
- When passing multiple group keys, only the rows that have the same group key values match will each other will be added to the group.
- Setting the sort parameter to False can improve code execution times.
- Chaining the grouping steps and applying a function reduces the lines of code.
Now that you understand how to use GroupBy, it’s time to test it yourself and reap the benefits of fast and easy data analysis.