Essential Pandas for Machine Learning: Part 2

Pandas is a powerful and versatile open-source library for data analysis in Python. It provides easy-to-use data structures like Series and DataFrames, making it an essential tool for handling and manipulating data in machine learning projects. In this blog post, we will explore some key aspects of Pandas that are crucial for anyone working in the field of machine learning.

Agenda

Let’s break down the agenda for this Pandas tutorial:

  1. Handling Duplicates
  2. Function Application – map, apply, groupby, rolling, str
  3. Merge, Join & Concatenate
  4. Pivot Tables
  5. Normalizing JSON

Handling Duplicates

Duplicate records are easy to introduce and hard to spot, so identifying and eliminating them is an important part of data cleaning.

import pandas as pd

df = pd.DataFrame({'A':[1,1,3,4,5,1], 'B':[1,1,3,7,8,1], 'C':[3,1,1,6,7,1]})

Detecting Duplicates with duplicated()

The duplicated() function helps identify duplicate rows in a DataFrame.

df.duplicated()

This returns a boolean Series with True for each row that repeats an earlier row; by default, the first occurrence is not marked. In the example, row 5 is flagged because it repeats row 1.

Displaying Duplicate Rows

To display the duplicate rows, you can use boolean indexing:

df[df.duplicated()]

This returns only the rows flagged as duplicates, which here is row 5.

Identifying Duplicates Based on Specific Columns

You can narrow down duplicate detection by specifying columns using the subset parameter:

df[df.duplicated(subset=['A','B'])]

This example considers only columns ‘A’ and ‘B’, so rows 1 and 5 are returned: both repeat the (1, 1) pair first seen in row 0.

Handling duplicates is a crucial step in ensuring data accuracy and reliability. By using Pandas functions like duplicated(), you can easily identify and manage duplicate entries in your datasets.
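
Once duplicates are identified, the usual next step is to remove them. The sketch below reuses the example DataFrame and assumes the goal is to keep the first occurrence of each row. drop_duplicates() and the keep parameter are not part of the walkthrough above and are shown only as the natural follow-up:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 3, 4, 5, 1],
                   'B': [1, 1, 3, 7, 8, 1],
                   'C': [3, 1, 1, 6, 7, 1]})

# Drop rows that repeat an earlier row across all columns (row 5 repeats row 1)
deduped = df.drop_duplicates()

# Drop rows that repeat an earlier (A, B) pair (rows 1 and 5 repeat row 0)
deduped_ab = df.drop_duplicates(subset=['A', 'B'])

# keep=False flags every member of a duplicate group, including the first one
all_copies = df[df.duplicated(keep=False)]
```

The original index labels are preserved; call reset_index(drop=True) on the result if a clean 0..n-1 index is needed.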

Function Application

Using map for Transformation

The map function in Pandas is excellent for transforming one column into another. Let’s look at an example where we create a new column, ‘age_category,’ based on the ‘Age’ column.

titanic_data['age_category'] = titanic_data.Age.map(lambda age: 'Kid' if age < 18 else 'Adult')
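
One detail worth checking with this kind of mapping is how missing ages are handled. The sketch below uses a small, hypothetical stand-in for titanic_data (the real dataset is not loaded in this post) and assumes missing ages are stored as NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data, with one missing age
titanic_data = pd.DataFrame({'Age': [22.0, 4.0, np.nan, 35.0]})

def categorize(age):
    return 'Kid' if age < 18 else 'Adult'

# NaN < 18 evaluates to False, so the missing age is silently labelled 'Adult'
naive = titanic_data.Age.map(categorize)

# na_action='ignore' skips missing values and leaves them as NaN in the result
safe = titanic_data.Age.map(categorize, na_action='ignore')
```

Whether a missing age should become 'Adult', stay missing, or get its own category depends on the modelling task; the point is only that the default behaviour is easy to overlook.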

Using apply on Series and DataFrames

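The apply function works on both Series and DataFrames. On a Series, it accepts either the name of a method as a string or a callable that is applied to each element. Here, the first call returns the sum of the ‘Age’ column, and the second returns a new Series of age categories; neither call modifies titanic_data unless the result is assigned back to a column.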

titanic_data.Age.apply('sum')
titanic_data.Age.apply(lambda age: 'Kid' if age < 18 else 'Adult')

Using apply on DataFrames for Multiple Columns

With axis=1, the apply function on a DataFrame passes each row to the function, so several columns can be used at once. In this example, the function doubles the fare for male passengers and returns it unchanged for everyone else.

def fare_function(row):
    if row.Sex == 'male':
        return row.Fare * 2
    else:
        return row.Fare
    
titanic_data.apply(fare_function, axis=1)
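
Row-wise apply calls a Python function once per row, which can be slow on large frames. As a rough sketch, the same result can be computed in one vectorized step with NumPy's np.where; this is an alternative technique rather than something the example above requires, and the three-row titanic_data below is a hypothetical stand-in:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic data
titanic_data = pd.DataFrame({'Sex': ['male', 'female', 'male'],
                             'Fare': [7.25, 71.28, 8.05]})

def fare_function(row):
    if row.Sex == 'male':
        return row.Fare * 2
    return row.Fare

# Row-wise version from the text
row_wise = titanic_data.apply(fare_function, axis=1)

# Vectorized version: pick Fare * 2 where Sex is 'male', else Fare
vectorized = pd.Series(
    np.where(titanic_data.Sex == 'male', titanic_data.Fare * 2, titanic_data.Fare),
    index=titanic_data.index,
)
```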

Grouping Data with groupby

The groupby function is handy for splitting data into groups, applying a function to each group, and combining the results. Here, we first calculate the mean age for male and female passengers, then the mean, minimum, and maximum age in a single agg call.

titanic_data.groupby(['Sex']).Age.mean()
titanic_data.groupby(['Sex']).Age.agg(['mean', 'min', 'max'])
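
To make the split-apply-combine steps concrete, here is a minimal sketch on a hypothetical four-row stand-in for titanic_data. The named-aggregation call at the end is an addition to the calls above, and its output column names (mean_age, oldest) are made up for illustration:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
titanic_data = pd.DataFrame({'Sex': ['male', 'female', 'male', 'female'],
                             'Age': [20.0, 30.0, 40.0, 50.0]})

# One value per group: female -> 40.0, male -> 30.0
mean_age = titanic_data.groupby('Sex').Age.mean()

# Several statistics per group, one column per statistic
summary = titanic_data.groupby('Sex').Age.agg(['mean', 'min', 'max'])

# Named aggregation: choose the output column names explicitly
named = titanic_data.groupby('Sex').agg(mean_age=('Age', 'mean'),
                                        oldest=('Age', 'max'))
```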

Window-based Operations with rolling

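The rolling function supports window-based operations. In this example, we calculate the sum and minimum over a rolling window of 5 rows in the ‘Age’ column; min_periods=1 lets the first four rows, whose windows are not yet full, still produce a value instead of NaN.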

titanic_data.Age.rolling(window=5, min_periods=1).agg(['sum', 'min'])
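
The effect of the window and of min_periods is easier to see on a short series. The sketch below uses a hypothetical ages Series in place of the ‘Age’ column:

```python
import pandas as pd

# Hypothetical values standing in for the 'Age' column
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])

# Each row aggregates itself plus up to four preceding rows
rolled = ages.rolling(window=5, min_periods=1).agg(['sum', 'min'])
# sum: 10, 30, 60, 100, 150, 200
# min: 10, 10, 10, 10, 10, 20

# Without min_periods, a full window of 5 is required, so the first 4 rows are NaN
strict = ages.rolling(window=5).sum()
```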

String Utilities with str

For columns containing strings, Pandas provides vectorized string methods through the str accessor. This example keeps rows whose ‘Name’ contains the substring ‘Mr’; note that this also matches ‘Mrs’.

titanic_data[titanic_data.Name.str.contains('Mr')]
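
Because str.contains treats its pattern as a regular expression by default, the match can be tightened when only the title ‘Mr.’ is wanted. A minimal sketch, using three hypothetical names in the style of the Titanic data:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic data
titanic_data = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris',
                                      'Cumings, Mrs. John Bradley',
                                      'Heikkinen, Miss. Laina']})

# Plain substring: matches both 'Mr.' and 'Mrs.'
loose = titanic_data[titanic_data.Name.str.contains('Mr')]

# Escaped dot: matches 'Mr.' only; na=False treats missing names as non-matches
strict = titanic_data[titanic_data.Name.str.contains(r'Mr\.', regex=True, na=False)]
```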

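Merge, Join & Concatenate

Stacking DataFrames with concat

Older tutorials stack DataFrames vertically with DataFrame.append, but that method was deprecated in pandas 1.4 and removed in pandas 2.0. Use pd.concat instead; ignore_index=True renumbers the rows of the result from 0.

pd.concat([df1, df2], ignore_index=True)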
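
To make the stacking concrete, here is a minimal sketch with pd.concat, which is the current pandas way to stack frames vertically. The frames df1 and df2 and their contents are hypothetical:

```python
import pandas as pd

# Hypothetical frames with the same columns
df1 = pd.DataFrame({'A': [1, 2], 'B': ['x', 'y']})
df2 = pd.DataFrame({'A': [3], 'B': ['z']})

# ignore_index=True gives the result a fresh 0..n-1 index
stacked = pd.concat([df1, df2], ignore_index=True)

# Without it, the original labels are kept, so label 0 appears twice
kept = pd.concat([df1, df2])
```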

Merging DataFrames with merge

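The merge function combines DataFrames on one or more key columns, much like a SQL join. By default it performs an inner join, keeping only keys present in both frames; how='left' keeps every row of the left frame.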

left.merge(right, on='key')
left.merge(right, on='key', how='left')
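
The difference between the two calls shows up when the frames do not share all keys. A minimal sketch, with hypothetical left and right frames:

```python
import pandas as pd

# Hypothetical frames: K2 exists only in left, K3 only in right
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': [1, 2, 3]})
right = pd.DataFrame({'key': ['K0', 'K1', 'K3'], 'B': [10, 20, 30]})

# Inner join (default): only K0 and K1 survive
inner = left.merge(right, on='key')

# Left join: K2 is kept, with NaN in column B
left_join = left.merge(right, on='key', how='left')
```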

Combining DataFrames with join

The join function combines DataFrames on their index values. By default it is a left join, so every row of the calling DataFrame is kept.

left.join(right)
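
A small sketch of index-based joining, with hypothetical frames whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical frames indexed by key rather than holding it in a column
left = pd.DataFrame({'A': [1, 2, 3]}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'B': [10, 20]}, index=['K0', 'K2'])

# All rows of left are kept; K1 has no match, so its B value is NaN
joined = left.join(right)
```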

Pivot Tables

Extracting Information with Pivot Tables

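Pivot tables summarize data by grouping on one or more columns and aggregating the rest. In this example, we group sales data by ‘Manager’ and ‘Rep’ and compute the sum and mean of the ‘Account’ and ‘Price’ columns. The aggregations are passed as strings, which avoids a separate NumPy import and the warning that recent pandas releases raise for np.sum and np.mean.

pd.pivot_table(sales_data, index=['Manager', 'Rep'], values=['Account', 'Price'], aggfunc=['sum', 'mean'])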
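
The sales_data frame is not shown in this post, so the sketch below builds a small hypothetical one with the same ‘Manager’, ‘Rep’, and ‘Price’ columns (names and figures are invented) to show the shape of the result:

```python
import pandas as pd

# Hypothetical sales data; only the column names follow the example above
sales_data = pd.DataFrame({
    'Manager': ['Debra', 'Debra', 'Fred', 'Fred'],
    'Rep': ['Craig', 'Craig', 'Cedric', 'Wendy'],
    'Price': [30000, 10000, 5000, 65000],
})

# One row per (Manager, Rep) pair; columns are (aggregation, value) pairs
table = pd.pivot_table(sales_data, index=['Manager', 'Rep'],
                       values=['Price'], aggfunc=['sum', 'mean'])

# Total and average price for one manager/rep pair
craig_total = table['sum']['Price'].loc[('Debra', 'Craig')]
craig_mean = table['mean']['Price'].loc[('Debra', 'Craig')]
```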

Normalizing JSON Data

Handling Hierarchical JSON Data

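JSON data is often nested rather than flat. The pd.json_normalize function flattens it into a DataFrame: the second argument (record_path) names the nested list whose items become the rows, and the third (meta) lists the parent fields to repeat on each row.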

json_data = [{'state': 'Florida', 'shortname': 'FL', 'info': {'governor': 'Rick Scott'}, 'counties': [{'name': 'Dade', 'population': 12345}, {'name': 'Broward', 'population': 40000}]}]
pd.json_normalize(json_data, 'counties', ['state', ['info', 'governor']])
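
Putting the two lines above into a self-contained sketch shows what the flattened result looks like. The keyword names record_path and meta are the parameter names of pd.json_normalize; the data is the same as above:

```python
import pandas as pd

json_data = [{'state': 'Florida',
              'shortname': 'FL',
              'info': {'governor': 'Rick Scott'},
              'counties': [{'name': 'Dade', 'population': 12345},
                           {'name': 'Broward', 'population': 40000}]}]

# One row per county; 'state' and the nested governor are repeated on each row
flat = pd.json_normalize(json_data,
                         record_path='counties',
                         meta=['state', ['info', 'governor']])
# Columns: name, population, state, info.governor
```

Nested meta paths are joined with a dot, which is why the governor ends up in a column called info.governor.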

By mastering these Pandas techniques, you’ll be better equipped to handle complex data manipulation tasks in your data analysis and machine learning projects.

Conclusion

Pandas is an indispensable tool for any data scientist or machine learning practitioner. In this tutorial, we’ve covered just a fraction of what Pandas can offer. As you delve deeper into data analysis and machine learning, mastering Pandas will undoubtedly enhance your productivity and analytical capabilities.

Stay tuned for more tutorials on advanced Pandas functionalities!
