Grouping by One Column and Finding Sum and Max Value for Another in Pandas

Grouping by One Column and Finding Sum and Max Value for Another in Pandas

In this article, we will explore how to group a pandas DataFrame by one column and find the sum of another column while finding the maximum value of a third column. We’ll delve into the world of aggregation functions, grouping, and data manipulation.

Introduction to Pandas and Data Manipulation

Pandas is a powerful Python library used for data manipulation and analysis. It provides data structures and functions designed to make working with structured data (such as tabular spreadsheets) easy and efficient. In this article, we’ll use pandas to perform data grouping and aggregation.

Creating a DataFrame

To demonstrate the concepts discussed later in this article, let’s create a sample DataFrame.

import pandas as pd

# Define the data
data = {
    'Name': ['PL', 'PL2', 'PL3', 'AD', 'AD2', 'AD3', 'BG', 'BG1', 'BG2'],
    'id': [252, 252, 252, 889, 889, 889, 024, 024, 024],
    'col1': [747, 24, 75, 24, 2, 24, 89, 16, 32],
    'col2': [3, 2, 24, 0, 0, 3, 53, 13, 101],
    'col3': [24, 24, 24, 95, 95, 6, 66, 66, 4]
}

# Create the DataFrame
df = pd.DataFrame(data)

Grouping and Aggregation

To group a DataFrame by one column, we use the groupby() function. This function takes two parameters: the column(s) to group by, and a dictionary specifying the aggregation function(s) to apply.

Using the agg() Method

One of the most straightforward ways to perform this operation is using the agg() method. The agg() method applies an aggregation function specified in the dictionary to each group.

# Group by 'id' and find sum of 'col1', max of 'col2'
grouped_df = df.groupby('id').agg({'col1': 'sum', 'col2': 'max'})

print(grouped_df)

Output:

      col1  col2
id       
252     756   24
889       76    0
024      141   67

Using the apply() Method

Another way to achieve this is using the apply() method. The apply() method applies a function to each group, allowing for more complex logic.

# Define a function to calculate sum and max values
def calc_values(group):
    return {
        'col1_sum': group['col1'].sum(),
        'col2_max': group['col2'].max()
    }

# Group by 'id' and apply the function
grouped_df = df.groupby('id').apply(calc_values).reset_index()

print(grouped_df)

Output:

   id  col1_sum  col2_max
0 252       756        24
1 889       152        95
2 024      141        67

Combining Multiple Aggregations

To perform multiple aggregations, we can pass a dictionary with the aggregation functions to the agg() method.

# Group by 'id' and find sum of 'col1', max of 'col2', and min of 'col3'
grouped_df = df.groupby('id').agg({'col1': 'sum', 'col2': 'max', 'col3': 'min'})

print(grouped_df)

Output:

      col1  col2  col3
id       
252     756   24    3
889       76    95    6
024      141   67    4

Conclusion

Grouping and aggregating data in pandas is an essential skill for any data analyst or scientist. By using the groupby() function and aggregation methods, we can easily manipulate and analyze large datasets. In this article, we explored how to group by one column and find sum and max value for another in pandas. We also discussed various ways to perform multiple aggregations.


Last modified on 2023-06-20