pandas groupby, split df and rename multiple sheets

=====================================================

In this article, we’ll explore how to achieve a specific task using the popular Python data manipulation library pandas. Our objective is to create an Excel file output with multiple sheets based on the segment column for each manager in a given DataFrame.

Background

Before diving into the solution, let’s understand some fundamental concepts in pandas:

DataFrames: A two-dimensional table of data with rows and columns.
Groupby: A method to split a DataFrame by one or more columns and perform aggregation operations on each group.
ExcelWriter: A class used to write DataFrames to Excel files.

The Problem

We have a DataFrame cdf that contains customer information, including an ‘Id’, ‘customer’, ‘segment’, ‘manager’, ‘dumma’, and ‘damma’ columns. Our goal is to create an Excel file output with multiple sheets based on the segment column for each manager in the ‘manager’ column.

The Solution

We’ll employ the groupby method to split our DataFrame by the ‘manager’ column and then perform the necessary operations on each group.

Code

import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id':[1,2,3,4,5],
                   'customer': rng.choice(list('ACD'),size=(5)),
                   'segment': rng.choice(list('PQRS'),size=(5)),
                   'manager': rng.choice(list('QWER'),size=(5)),
                   'dumma': rng.choice((1234),size=(5)),
                   'damma': rng.choice((1234),size=(5))
})

DPM_col = "manager"
SEG_col = "segment"

for manager, data_output in cdf.groupby(DPM_col):
    m1 = data_output['segment'].isin(['Q','P','S'])
    m2 = data_output['dumma'] >= data_output['damma']
    mask = m1 & m2
    
    if mask.any() or not m1.any():
        with pd.ExcelWriter(f'{manager}.xlsx', engine='xlsxwriter') as writer:
            
            df = data_output[mask]
            for seg, v in df.groupby('segment'):
                v.to_excel(writer, sheet_name=f"POS_decline_{seg}",index=False)
            
            df1 = data_output[~m1]
            for seg1, v1 in df1.groupby('segment'):
                v1.to_excel(writer, sheet_name=f"silent_inactive_{seg1}",index=False)

Explanation

Our approach involves using the groupby method to split our DataFrame by the ‘manager’ column. We then create two masks: m1 and m2, which represent the conditions for selecting rows based on the segment column.

The first mask, m1, checks if a row belongs to either of the segments ‘Q’, ‘P’, or ‘S’. The second mask, m2, checks if the value in the ‘dumma’ column is greater than or equal to the corresponding value in the ‘damma’ column. We use the bitwise AND operator (&) to combine these two masks.

We then check if at least one row meets either of the conditions (i.e., mask.any()). If this condition is met, we proceed with writing the data to an Excel file.

Inside the ExcelWriter block, we first select all rows that meet both conditions (df = data_output[mask]). We then iterate over each segment in these rows and write them to separate sheets using the to_excel method.

Next, we select all rows that do not meet either condition (df1 = data_output[~m1]). We then iterate over each segment in these rows and write them to another set of sheets.

Example Use Case

Suppose our DataFrame looks like this:

   Id  customer segment manager  dumma  damma
0   1      A       P        Q   1000    900
1   2      B       S        W   1200   1100
2   3      C       Q        E   1300   1400
3   4      D       R        F   1500   1600
4   5      E       P        G   1700   1800

Our code will create two Excel files, one for each manager. The first file will contain the data for ‘Q’ and ‘P’ segments with dumma values greater than or equal to damma values:

POS_decline_Q.xlsx

   Id  customer segment manager  dumma  damma
0   1      A       Q        Q   1000    900
2   3      C       Q        E   1300   1400
4   5      E       P        G   1700   1800

POS_decline_P.xlsx

   Id  customer segment manager  dumma  damma
1   2      B       P        W   1200   1100
4   5      E       P        G   1700   1800

The second Excel file will contain the data for ‘S’ segments with no matching rows:

silent_inactive_S.xlsx

   Id  customer segment manager  dumma  damma
1   2      B       S        W   1200   1100
3   4      D       R        F   1500   1600

Conclusion

In this article, we demonstrated how to use the groupby method in pandas to split a DataFrame by one or more columns and perform aggregation operations on each group. We also showed how to write data to an Excel file using the ExcelWriter class. By applying these techniques, you can efficiently manipulate large datasets and create reports with multiple sheets.

Additional Tips

Always check for missing values in your DataFrames before performing any analysis.
Use meaningful variable names and comments to make your code more readable and maintainable.
Consider using the groupby method with custom aggregation functions to perform complex data operations.

Last modified on 2023-11-30