Data Manipulation with Pandas: Removing Unwanted Data from a DataFrame
When working with data in Python, it’s not uncommon to encounter unwanted or irrelevant information that needs to be removed or transformed. One such scenario is when dealing with date columns in a pandas DataFrame. In this article, we’ll explore how to remove unwanted data from a specific column and create a new column for the extracted day.
Understanding the Problem
The problem at hand involves removing the unwanted “Date” portion from a string and extracting only the day of the week. For instance, the input Dec 13, 2021Mon should be transformed into just Monday.
Additionally, we want to remove the last column (Volume) from the DataFrame.
Solution Overview
To achieve this, we’ll use pandas’ powerful data manipulation capabilities. The steps involved are:
- Extracting the day of the week from the “Date” string
- Removing the unwanted “Date” portion
- Creating a new column for the extracted day
- Dropping the last column (
Volume)
Step 1: Import Necessary Libraries and Define Variables
We’ll start by importing the necessary libraries, including pandas.
import requests
import pandas as pd
Next, we define our list of ISIN codes.
isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
Step 2: Retrieve Data Using pd.read_html
Now, we’ll use pd.read_html to retrieve the data from the specified URLs. The requests library is used to send GET requests to the URLs.
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
Step 3: Concatenate DataFrames
After retrieving the data, we concatenate all the DataFrames into a single DataFrame.
df = pd.concat(dfs)
print(df)
Step 4: Extract Day of Week from “Date” Column
We’ll insert a new column Day at index 0 and populate it with the day of the week extracted from the “Date” string using the apply function.
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
Step 5: Reformat Date to Desired Format
Next, we reformulate the date by taking only the last 12 characters of each “Date” string using the apply function.
df['Date'] = df['Date'].apply(lambda d: d[-12:])
Step 6: Remove Last Column (Volume)
We’ll remove the unwanted Volume column from the DataFrame by popping it using the pop method.
df.pop('Volume')
print(df)
Combining Code into a Single Function
Here’s the complete code combined into a single function:
import requests
import pandas as pd
def clean_data():
# Define list of ISIN codes
isins = ['LU0526609390:EUR', 'IE00BHBX0Z19:EUR']
# Retrieve data using pd.read_html
dfs = []
for isin in isins:
html = requests.get(f'https://markets.ft.com/data/funds/tearsheet/historical?s={isin}').content
dfs.extend(pd.read_html(html))
# Concatenate DataFrames
df = pd.concat(dfs)
# Insert new column 'Day' and populate it with day of week extracted from 'Date'
df.insert(0, 'Day', df['Date'].apply(lambda d: d[:d.find(',')]))
# Reformat date to desired format
df['Date'] = df['Date'].apply(lambda d: d[-12:])
# Remove last column ('Volume')
df.pop('Volume')
clean_data()
Conclusion
In this article, we demonstrated how to remove unwanted data from a specific column and create a new column for the extracted day using pandas. We also removed the last column (Volume). By following these steps and combining them into a single function, you can efficiently clean your DataFrame and extract relevant information.
This code will work with any list of ISIN codes provided in the isins variable, allowing it to be reused across different scenarios.
Last modified on 2024-10-30