Essential Pandas for Machine Learning: Part 2
Pandas is a powerful and versatile open-source library for data analysis in Python. It provides easy-to-use data structures like Series and DataFrames, making it an essential tool for handling and manipulating data in machine learning projects. In this blog post, we will explore some key aspects of Pandas that are crucial for anyone working in the field of machine learning.

Agenda
Let’s break down the agenda for this Pandas tutorial:
- Handling Duplicates
- Function Application – map, apply, groupby, rolling, str
- Merge, Join & Concatenate
- Pivot-tables
- Normalizing JSON
Handling Duplicates
Real-world datasets often contain repeated records, so identifying and eliminating duplicate entries is a crucial part of the data cleaning step.
import pandas as pd
df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})
Detecting Duplicates with duplicated()
The duplicated() function helps identify duplicate rows in a DataFrame.
df.duplicated()
This returns a boolean Series that is True for every row repeating an earlier row. In the example, row 5 is identified as a duplicate because it repeats row 1.
Displaying Duplicate Rows
To display the duplicate rows, you can use boolean indexing:
df[df.duplicated()]
This returns only the rows flagged as duplicates, here row 5. The first occurrence (row 1) is not included.
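As a small sketch of how the flags behave, the example below reuses the DataFrame from above and compares the default behaviour with keep=False, which marks every member of a duplicated group rather than only the later repeats. The variable names mask_first and mask_all are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 4, 5, 1],
                   'B': [1, 1, 3, 7, 8, 1],
                   'C': [3, 1, 1, 6, 7, 1]})

# Default keep='first': only later repeats of a row are flagged.
mask_first = df.duplicated()

# keep=False: every row that has a twin is flagged, including the first one.
mask_all = df.duplicated(keep=False)

print(df[mask_first])  # row 5 only
print(df[mask_all])    # rows 1 and 5
```

Using keep=False is handy when you want to inspect all copies side by side before deciding which one to keep.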
Identifying Duplicates Based on Specific Columns
You can narrow down duplicate detection by specifying columns using the subset parameter:
df[df.duplicated(subset=['A','B'])]
This example considers columns ‘A’ and ‘B’ together, revealing rows 1 and 5 as duplicates based on these columns, since both repeat the values in row 0.
Handling duplicates is a crucial step in ensuring data accuracy and reliability. By using Pandas functions like duplicated(), you can easily identify and manage duplicate entries in your datasets.
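Once duplicates are identified, they usually need to be removed. A minimal sketch using drop_duplicates on the same DataFrame is shown below; the choice of keep='last' in the second call is only an illustration of the parameter, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 4, 5, 1],
                   'B': [1, 1, 3, 7, 8, 1],
                   'C': [3, 1, 1, 6, 7, 1]})

# Drop rows that repeat an earlier row across all columns (removes row 5).
deduped = df.drop_duplicates()

# Compare only columns A and B, and keep the last occurrence of each pair.
deduped_ab = df.drop_duplicates(subset=['A', 'B'], keep='last')

print(deduped)
print(deduped_ab)
```

drop_duplicates returns a new DataFrame and keeps the original index labels, so a reset_index(drop=True) may be wanted afterwards.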
Function Application
Using map for Transformation
The map function in Pandas is excellent for transforming one column into another. Let’s look at an example where we create a new column, ‘age_category,’ based on the ‘Age’ column.
titanic_data['age_category'] = titanic_data.Age.map(lambda age: 'Kid' if age < 18 else 'Adult')
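One detail worth checking with map is missing values: a comparison such as NaN < 18 is False, so the lambda above labels passengers with an unknown age as 'Adult'. The sketch below uses a tiny made-up stand-in for titanic_data and a hypothetical 'Unknown' label to handle that case explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_data; the real dataset has more columns.
titanic_data = pd.DataFrame({'Age': [22.0, 4.0, np.nan, 35.0]})

def categorize(age):
    # 'Unknown' is an illustrative label for missing ages, not part of the dataset.
    if pd.isna(age):
        return 'Unknown'
    return 'Kid' if age < 18 else 'Adult'

titanic_data['age_category'] = titanic_data.Age.map(categorize)
print(titanic_data)
```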
Using apply on Series and DataFrames
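The apply function is versatile and can be applied to both Series and DataFrames. Here, we first calculate the sum of the ‘Age’ column by passing the name of an aggregation (equivalent to calling titanic_data.Age.sum()), and then derive an age category for each passenger with a custom function. Neither call modifies titanic_data; assign the result to a column if you want to keep it.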
titanic_data.Age.apply('sum')
titanic_data.Age.apply(lambda age: 'Kid' if age < 18 else 'Adult')
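Series.apply also forwards extra keyword arguments to the function, which avoids hard-coding values inside a lambda. The sketch below assumes a small made-up Age column and a hypothetical threshold parameter.

```python
import pandas as pd

# Hypothetical stand-in for the Age column of titanic_data.
titanic_data = pd.DataFrame({'Age': [22.0, 4.0, 17.0, 35.0]})

def categorize(age, threshold=18):
    # threshold is an illustrative parameter, not something from the dataset.
    return 'Kid' if age < threshold else 'Adult'

# Element-wise application with the default threshold.
default_labels = titanic_data.Age.apply(categorize)

# Extra keyword arguments are passed through to the function.
custom_labels = titanic_data.Age.apply(categorize, threshold=16)

# For a plain aggregation, calling the method directly is the simplest form.
total_age = titanic_data.Age.sum()

print(default_labels.tolist(), custom_labels.tolist(), total_age)
```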
Using apply on DataFrames for Multiple Columns
The apply function on DataFrames allows us to work with multiple columns simultaneously. In this example, we use a function to calculate fares differently for male and female passengers.
def fare_function(row):
    if row.Sex == 'male':
        return row.Fare * 2
    else:
        return row.Fare
titanic_data.apply(fare_function, axis=1)
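Row-wise apply calls the Python function once per row, which can be slow on large DataFrames. As a sketch of an alternative, the same rule can be expressed with numpy.where on whole columns. The three-row DataFrame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for titanic_data with only the columns used here.
titanic_data = pd.DataFrame({
    'Sex': ['male', 'female', 'male'],
    'Fare': [7.25, 71.5, 8.0],
})

def fare_function(row):
    if row.Sex == 'male':
        return row.Fare * 2
    return row.Fare

# Row-wise: the function receives each row as a Series.
row_wise = titanic_data.apply(fare_function, axis=1)

# Column-wise: the condition and both outcomes are evaluated on whole columns.
vectorized = pd.Series(
    np.where(titanic_data.Sex == 'male', titanic_data.Fare * 2, titanic_data.Fare),
    index=titanic_data.index,
)

print(row_wise.tolist())
print(vectorized.tolist())
```

Both approaches give the same values; the row-wise form is easier to extend with complex logic, while the column-wise form tends to scale better.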
Grouping Data with groupby
The groupby function is handy for splitting data into groups, applying a function to each group, and combining the results. Here, we calculate the mean age for male and female passengers, and then the mean, minimum, and maximum age per group in a single call with agg.
titanic_data.groupby(['Sex']).Age.mean()
titanic_data.groupby(['Sex']).Age.agg(['mean', 'min', 'max'])
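The split-apply-combine steps can be seen on a small example. The sketch below uses made-up ages and also shows named aggregation, which gives the output columns readable names; the names mean_age and oldest are chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for titanic_data.
titanic_data = pd.DataFrame({
    'Sex': ['male', 'female', 'male', 'female'],
    'Age': [20.0, 30.0, 40.0, 50.0],
})

# One value per group.
mean_age = titanic_data.groupby('Sex').Age.mean()

# Several statistics per group, one column each.
summary = titanic_data.groupby('Sex').Age.agg(['mean', 'min', 'max'])

# Named aggregation: output_name=(column, function).
named = titanic_data.groupby('Sex').agg(
    mean_age=('Age', 'mean'),
    oldest=('Age', 'max'),
)

print(mean_age)
print(summary)
print(named)
```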
Window-based Operations with rolling
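The rolling function is useful for window-based operations. In this example, we calculate the sum and minimum value for a rolling window of 5 in the ‘Age’ column. Setting min_periods=1 produces a result as soon as one value is available, instead of returning NaN until the window is full.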
titanic_data.Age.rolling(window=5, min_periods=1).agg(['sum', 'min'])
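To make the window behaviour concrete, here is a small sketch on a made-up Series with a window of 3, comparing min_periods=1 against the default, which requires a full window.

```python
import pandas as pd

# Made-up values standing in for the Age column.
ages = pd.Series([10.0, 20.0, 30.0, 40.0])

# With min_periods=1, the first rows use whatever values are available so far.
rolled = ages.rolling(window=3, min_periods=1).agg(['sum', 'min'])

# Without min_periods, rows before the window is full are NaN.
strict = ages.rolling(window=3).sum()

print(rolled)
print(strict)
```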
String Utilities with str
For columns containing strings, Pandas provides str utilities. This example filters rows containing ‘Mr’ in the ‘Name’ column.
titanic_data[titanic_data.Name.str.contains('Mr')]
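Keep in mind that a plain substring match on 'Mr' also matches 'Mrs'. The sketch below, on three illustrative names, contrasts the loose match with a literal match on 'Mr.'; regex=False is used so the dot is not treated as a wildcard, and na=False treats missing names as non-matches.

```python
import pandas as pd

# Illustrative names in the same 'Surname, Title. Given names' layout.
titanic_data = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'Heikkinen, Miss. Laina',
]})

# Substring match: 'Mrs.' contains 'Mr', so two rows are returned.
loose = titanic_data[titanic_data.Name.str.contains('Mr')]

# Literal match on 'Mr.' excludes 'Mrs.'.
strict = titanic_data[titanic_data.Name.str.contains('Mr.', regex=False, na=False)]

print(loose)
print(strict)
```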
Merge, Join & Concatenate
Using concat for Stacking DataFrames
The concat function stacks DataFrames vertically. It replaces the older DataFrame.append method, which was deprecated in pandas 1.4 and removed in pandas 2.0.
pd.concat([df1, df2], ignore_index=True)
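A minimal sketch of vertical stacking with pd.concat follows. It works across pandas versions, including 2.0 and later, where DataFrame.append is no longer available. The two small frames are made up for illustration.

```python
import pandas as pd

# Two made-up frames with the same columns.
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3], 'B': ['z']})

# ignore_index=True renumbers the rows 0..n-1 in the result.
stacked = pd.concat([df1, df2], ignore_index=True)

# Without it, the original index labels are kept and may repeat.
kept_index = pd.concat([df1, df2])

print(stacked)
print(kept_index)
```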
Merging DataFrames with merge
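The merge function combines DataFrames based on a specified key column. By default it performs an inner join, keeping only keys present in both DataFrames; how='left' keeps every row of the left DataFrame and fills unmatched columns with NaN.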
left.merge(right, on='key')
left.merge(right, on='key', how='left')
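The effect of the how parameter is easiest to see on a small example. The sketch below uses two made-up frames whose keys only partly overlap, and adds indicator=True on an outer join to show where each row came from.

```python
import pandas as pd

# Made-up frames: keys 'a' and 'b' appear in both, 'c' and 'd' in only one.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'rval': [10, 20, 40]})

# Inner join (default): only keys found in both frames.
inner = left.merge(right, on='key')

# Left join: all rows from left, NaN where right has no match.
left_join = left.merge(right, on='key', how='left')

# Outer join with a _merge column describing the origin of each row.
outer = left.merge(right, on='key', how='outer', indicator=True)

print(inner)
print(left_join)
print(outer)
```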
Combining DataFrames with join
The join function combines DataFrames based on index values. By default it performs a left join, keeping every index label of the calling DataFrame.
left.join(right)
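As a sketch of index-based joining, the example below uses two made-up frames indexed by letter. Note that join raises an error if both frames share a column name, unless lsuffix or rsuffix is supplied.

```python
import pandas as pd

# Made-up frames that share some index labels.
left = pd.DataFrame({'lval': [1, 2, 3]}, index=['a', 'b', 'c'])
right = pd.DataFrame({'rval': [10, 20]}, index=['a', 'b'])

# Default is a left join on the index: 'c' is kept with NaN for rval.
joined = left.join(right)

# how='inner' keeps only index labels present in both frames.
inner = left.join(right, how='inner')

print(joined)
print(inner)
```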
Pivot Tables
Extracting Information with Pivot Tables
Pivot tables are a powerful way to extract important information from data. In this example, we create a pivot table to summarize sales data.
pd.pivot_table(sales_data, index=['Manager', 'Rep'], values=['Account', 'Price'], aggfunc=['sum', 'mean'])
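Since sales_data is not shown in this post, the sketch below builds a tiny made-up version with only the Manager, Rep, and Price columns (the names and numbers are invented) and passes the aggregations as strings.

```python
import pandas as pd

# Hypothetical sales data; names and prices are invented for illustration.
sales_data = pd.DataFrame({
    'Manager': ['Ann', 'Ann', 'Ann', 'Bob'],
    'Rep': ['Cy', 'Cy', 'Di', 'Ed'],
    'Price': [100, 300, 50, 70],
})

# One row per (Manager, Rep) pair, one column per (aggregation, value) pair.
table = pd.pivot_table(
    sales_data,
    index=['Manager', 'Rep'],
    values=['Price'],
    aggfunc=['sum', 'mean'],
)

print(table)
```

The result has a two-level row index and two-level columns, so a single cell is addressed with tuples, for example table.loc[('Ann', 'Cy'), ('sum', 'Price')].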
Normalizing JSON Data
Handling Hierarchical JSON Data
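JSON data is not always flat; it can be hierarchical. The json_normalize function is useful for flattening such data into a table. In this example, each entry in ‘counties’ becomes a row, while ‘state’ and the nested governor field are repeated on every row as metadata.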
json_data = [{'state': 'Florida', 'shortname': 'FL', 'info': {'governor': 'Rick Scott'}, 'counties': [{'name': 'Dade', 'population': 12345}, {'name': 'Broward', 'population': 40000}]}]
pd.json_normalize(json_data, 'counties', ['state', ['info', 'governor']])
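Here is the same call written with explicit keyword arguments, as a sketch that assumes pandas 1.0 or later, where the function is exposed as pd.json_normalize. record_path names the nested list to expand into rows, and meta lists the fields to copy onto each row.

```python
import pandas as pd

json_data = [{
    'state': 'Florida',
    'shortname': 'FL',
    'info': {'governor': 'Rick Scott'},
    'counties': [
        {'name': 'Dade', 'population': 12345},
        {'name': 'Broward', 'population': 40000},
    ],
}]

# Each county becomes a row; nested meta paths are joined with '.' by default.
flat = pd.json_normalize(
    json_data,
    record_path='counties',
    meta=['state', ['info', 'governor']],
)

print(flat)
```

The nested governor field ends up in a column named 'info.governor'; the sep parameter changes the separator if a different naming scheme is preferred.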
By mastering these Pandas techniques, you’ll be better equipped to handle complex data manipulation tasks in your data analysis and machine learning projects.
Conclusion
Pandas is an indispensable tool for any data scientist or machine learning practitioner. In this tutorial, we’ve covered just a fraction of what Pandas can offer. As you delve deeper into data analysis and machine learning, mastering Pandas will undoubtedly enhance your productivity and analytical capabilities.
Stay tuned for more tutorials on advanced Pandas functionalities!
