Pre-Processing Data Like a Pro: A Deep Dive into .map(), .zip(), .enumerate(), and .groupby()

Edward Mendoza
4 min read · Sep 10, 2023
Credits to: Midjourney

As a data scientist, you are often tasked with a plethora of pre-processing chores that can take forever and a day to accomplish, but with these go-to functions, cleaning and formatting data will be easier than importing pandas and NumPy into your IDE!

Today, we’re going to learn about four Python tools that can make your data sparkle: the pandas methods .map() and .groupby(), and the built-in functions zip() and enumerate(). The two built-ins are called on their own, without a leading dot.

`.map()`: The Data Transformer

Imagine you have a list of grades in letters, but you want to convert them into numbers. You can use .map() to do that. On a pandas Series, .map() looks up each value in a dictionary (or passes it through a function) and returns the result, so it works like a magic wand that changes one thing into another.

How to Use .map()

Here’s a Python example using the seaborn dataset “tips”. A good recommendation before diving into any dataset is to check the .head(), or the first 5 rows of the dataset, to get a good grasp of data that you’ll be working with.

import seaborn as sns

#Load the tips dataset from Seaborn
df = sns.load_dataset("tips")
df.head()

The first five rows should look like this:

Output of `df.head()`

As you can see, the dataset records what customers paid for their meals, some characteristics of those customers, and when they ate. Before we feed the data into a model, we must convert some of the columns to numerical values, because most models cannot work with text labels directly. We will update the sex column, changing “Male” to 0 and “Female” to 1. That can be accomplished with the following code snippet:

# Convert 'sex' column to numbers: Male to 0 and Female to 1
df['sex'] = df['sex'].map({'Male': 0, 'Female': 1})
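
One thing to watch when passing a dictionary to .map(): any value that is not a key in the dictionary comes back as NaN. The sketch below uses a small hand-made Series (hypothetical values, not the tips data) to show this behavior and one simple way to spot the rows that were not mapped.

```python
import pandas as pd

# Hypothetical sample values, made up for illustration (not from the tips dataset)
sex = pd.Series(["Male", "Female", "Female", "Unknown"])

# Same mapping as in the article: Male to 0 and Female to 1
encoded = sex.map({"Male": 0, "Female": 1})

# Values that are missing from the dictionary become NaN
unmapped = sex[encoded.isna()]

print(encoded.tolist())
print(unmapped.tolist())
```

Checking for NaN right after the mapping is a cheap way to catch typos or unexpected categories before they reach a model.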

`zip()`: The Data Combiner

Let’s say you’re interested in creating a new DataFrame that contains only the total_bill and tip columns. One way to do this is with Python’s built-in zip() function, which takes multiple lists (or other iterables) and pairs their elements together into tuples, making it easy to loop through several lists at the same time.

To see zip() in action, take a look at this code snippet:

import pandas as pd

# Pair each total_bill with its tip
bill_tip_pair = list(zip(df['total_bill'], df['tip']))

# Convert the pairs into a new DataFrame
df_bill_tip = pd.DataFrame(bill_tip_pair, columns=['Total_bill', 'Tip'])

The resulting DataFrame has two columns, Total_bill and Tip, with one row for each pair.
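
To see the pairing on its own, here is a minimal sketch with plain Python lists. The numbers are hypothetical and made up for illustration. It also shows a detail that is easy to miss: zip() stops at the shortest input.

```python
# Hypothetical bills and tips, made up for illustration
total_bills = [20.50, 12.00, 31.75]
tips = [3.00, 2.00, 5.50]

# zip() pairs the first elements together, then the second, and so on
pairs = list(zip(total_bills, tips))

# Loop over both lists at once and compute each tip as a share of its bill
tip_rates = [round(tip / bill, 3) for bill, tip in zip(total_bills, tips)]

# zip() stops at the shorter input, so the extra item is silently dropped
short = list(zip([1, 2, 3], ["a", "b"]))

print(pairs)
print(tip_rates)
print(short)
```

If all you need is a two-column DataFrame, df[['total_bill', 'tip']] selects those columns directly; zip() earns its keep when you want to loop over several sequences in step.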

`enumerate()`: The Counter

Sometimes you may want to keep track of the row numbers while you’re looking at a DataFrame. This can be handy if, say, you want to know the table numbers and how much each table spent.
Here’s how to do it:

# Enumerate through the first 5 rows of the DataFrame
for i, row in enumerate(df.head().itertuples()):
    print(f"Table {i+1} had a total bill of ${row.total_bill}.")
#The result seen below...

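As you can see here, we use enumerate() to go through the first 5 rows of the DataFrame. It attaches a counter, starting at 0, to each item it is given. We also use .itertuples() because it yields the rows one at a time, with each column available as an attribute such as row.total_bill.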

print(f"Table {i+1} had a total bill of ${row.total_bill}."): We print out the table number (starting from 1) and its total bill.

This way you can easily keep track of table numbers and their corresponding total bills, making your data exploration more organized.
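
As a small variation, enumerate() accepts a start argument, which removes the need for i+1. The sketch below uses a plain list of hypothetical bills (made up for illustration) rather than the tips DataFrame.

```python
# Hypothetical total bills, made up for illustration
total_bills = [18.50, 9.75, 24.00]

# start=1 makes the counter begin at 1, so there is no need for i + 1
lines = []
for table, bill in enumerate(total_bills, start=1):
    lines.append(f"Table {table} had a total bill of ${bill:.2f}.")

print("\n".join(lines))
```

The :.2f format keeps every amount at two decimal places, which reads better for currency.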

`.groupby()`: The Data Explorer Function!

Imagine you’re the restaurant manager and you want to know which days are the most lucrative in terms of tips. You can use .groupby() to group all the tips by the day of the week, and then find the average tip for each day.

Here’s the code:

import pandas as pd

# Group the data by 'day' and calculate the average tip for each day
grouped_data = df.groupby('day')['tip'].mean().reset_index()

# reset_index() already returns a DataFrame, so this wrapper is optional
grouped_df = pd.DataFrame(grouped_data)

# Show the DataFrame
print(grouped_df)

When you run this code, you’ll see something like this:

    day       tip
0  Thur  2.771452
1   Fri  2.734737
2   Sat  2.993103
3   Sun  3.255132

Let’s break it down:

grouped_data = df.groupby('day')['tip'].mean().reset_index(): We use .groupby() to put all the rows from the same day together. Then we calculate the average tip for each day using .mean(). Finally, .reset_index() turns day from the index back into a regular column, which gives us a tidy two-column table.

grouped_df = pd.DataFrame(grouped_data): Because .reset_index() already returns a DataFrame, this step is optional; it simply makes it explicit that the result is a DataFrame.
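
An average can mislead when a group has only a few rows, so it can help to report the row count next to it. The sketch below does that with .agg() and named aggregations. The DataFrame df_small and its values are hypothetical, made up for illustration, and are not the tips data.

```python
import pandas as pd

# Hypothetical tips, made up for illustration (not the real tips dataset)
df_small = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Sun"],
    "tip": [2.0, 4.0, 3.0, 5.0, 4.0],
})

# Named aggregation: each keyword becomes a column in the result
summary = (
    df_small.groupby("day")["tip"]
    .agg(mean_tip="mean", n_tables="count")
    .reset_index()
)

print(summary)
```

The same pattern should carry over to the tips DataFrame by swapping df_small for df.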

Conclusion

Well, there you have it, folks! Four handy tools that can help with the pre-processing and exploratory analysis stages of your data analysis and data science process! Whether you’re transforming, combining, counting, or organizing data, you’ve got the right tool for the job. Happy coding!
