It's no secret that data cleaning is a large portion of the data analysis process. When using pandas, there are multiple techniques for cleaning text fields to prepare for further analysis. As data sets grow large, it is important to find efficient methods that perform in a reasonable time and are maintainable, since the text cleaning process evolves over time.
This article will show examples of cleaning text fields in a large data file and illustrate tips for how to efficiently clean unstructured text fields using Python and pandas.
For the sake of this article, let's say you have a brand new craft whiskey that you would like to sell. Your territory includes Iowa and there just happens to be an open data set that shows all of the liquor sales in the state. This seems like a great opportunity for you to use your analysis skills to see who the biggest accounts are in the state. Armed with that data, you can plan your sales process for each of the accounts.
Excited about the opportunity, you download the data and realize it's pretty large. The data set for this case is a 565MB CSV file with 24 columns and 2.3M rows. This is not big data by any means but it is big enough that it can make Excel crawl. It's also big enough that some of the pandas approaches will be relatively slow on your laptop.
For this article, I'll be using data that includes all of the 2019 sales. Due to the size, you can download the data from the state site for a different time period if you prefer.
Let's get started by importing our modules and reading the data. I will also use the sidetable package to summarize the data. It's not required for the cleaning but I wanted to highlight how useful it can be for these data exploration scenarios.
Let's get our data:
import pandas as pd
import numpy as np
import sidetable

df = pd.read_csv('2019_Iowa_Liquor_Sales.csv')
Here's what the data looks like.
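If you are following along in a notebook, df.head() gives that same first look at the columns and rows:

df.head()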
The first thing we might want to do is see how much each store purchases and rank them from the largest to the smallest. We have limited resources so we should focus on those places where we get the best bang for the buck. It will be easier for us to call on a couple of big corporate accounts instead of a lot of mom and pop stores.
sidetable is a shortcut to summarize the data in a readable format. The alternative is doing a groupby plus additional manipulation; a rough sketch of that approach follows the sidetable call below.
df.stb.freq(['Store Name'], value='Sale (Dollars)', style=True, cum_cols=False)
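For reference, here is a rough sketch of that groupby alternative. The percentage formatting that sidetable provides is only approximated here:

# Total sales per store, largest first
sales = df.groupby('Store Name')['Sale (Dollars)'].sum().sort_values(ascending=False)

# Approximate the percentage column that sidetable shows
sales_pct = sales / sales.sum()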
One thing that's apparent is that the store names are unique per location in most cases. Ideally we would like to see all the sales for Hy-Vee, Costco, Sam's, etc. grouped together.
Looks like we need to clean the data.
Cleaning attempt #1
The first approach we can investigate is using .loc plus a boolean filter with the str accessor to search for the relevant string in the Store Name column.
df.loc[df['Store Name'].str.contains('Hy-Vee', case=False), 'Store_Group_1'] = 'Hy-Vee'
This code will search for the string "Hy-Vee" using a case insensitive search and store the value "Hy-Vee" in a new column called Store_Group_1. This code will effectively convert names like "Hy-Vee #3 / BDI / Des Moines" or "Hy-Vee Food Store / Urbandale" into a common "Hy-Vee".
%%timeit tells us about the performance:
1.43 s ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Normally we don't want to optimize too early in the process but one thing we can do is use the regex=False parameter to get a speedup:
df.loc[df['Store Name'].str.contains('Hy-Vee', case=False, regex=False), 'Store_Group_1'] = 'Hy-Vee'
804 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here are the counts in the new column:
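These counts can be generated with value_counts; passing dropna=False keeps the unmatched NaN rows visible (a sketch, not necessarily the exact call behind the output below):

df['Store_Group_1'].value_counts(dropna=False)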
NaN       1617777
Hy-Vee     762568
Name: Store_Group_1, dtype: int64
We've cleaned up Hy-Vee but now there are a lot of other values we need to tackle.
This .loc approach contains a lot of code and can be slow. We can use the same concept but look for some alternatives that are quicker to execute and easier to maintain.
Cleaning attempt #2
Another approach that is very performant and flexible is to use np.select to run multiple matches and apply a specified value upon match.
There are several good resources that I used to learn how to use np.select. This article from Dataquest is a good overview. I also found this presentation from Nathan Cheever very interesting and informative. I encourage you to check both of these out.
The simplest explanation for what np.select does is that it evaluates a list of conditions and applies a corresponding list of values if the condition is true.
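Here is a tiny, self-contained illustration using hypothetical data, not the liquor data set:

s = pd.Series(['apple pie', 'banana bread', 'cherry cake'])
conditions = [s.str.contains('apple'), s.str.contains('banana')]
choices = ['apple', 'banana']
np.select(conditions, choices, 'other')
# -> array(['apple', 'banana', 'other'], ...)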
In our case, our conditions will be different string lookups and the normalized string we want to use instead will be the value.
After looking through the data, here's the list of conditions and values in the store_patterns list. Each tuple in this list is a str.contains() lookup and the corresponding text value we want to use to group like accounts together.
store_patterns = [
    (df['Store Name'].str.contains('Hy-Vee', case=False, regex=False), 'Hy-Vee'),
    (df['Store Name'].str.contains('Central City', case=False, regex=False), 'Central City'),
    (df['Store Name'].str.contains("Smokin' Joe's", case=False, regex=False), "Smokin' Joe's"),
    (df['Store Name'].str.contains('Walmart|Wal-Mart', case=False), 'Wal-Mart'),
    (df['Store Name'].str.contains('Fareway Stores', case=False, regex=False), 'Fareway Stores'),
    (df['Store Name'].str.contains("Casey's", case=False, regex=False), "Casey's General Store"),
    (df['Store Name'].str.contains("Sam's Club", case=False, regex=False), "Sam's Club"),
    (df['Store Name'].str.contains('Kum & Go', regex=False, case=False), 'Kum & Go'),
    (df['Store Name'].str.contains('CVS', regex=False, case=False), 'CVS Pharmacy'),
    (df['Store Name'].str.contains('Walgreens', regex=False, case=False), 'Walgreens'),
    (df['Store Name'].str.contains('Yesway', regex=False, case=False), 'Yesway Store'),
    (df['Store Name'].str.contains('Target Store', regex=False, case=False), 'Target'),
    (df['Store Name'].str.contains('Quik Trip', regex=False, case=False), 'Quik Trip'),
    (df['Store Name'].str.contains('Circle K', regex=False, case=False), 'Circle K'),
    (df['Store Name'].str.contains('Hometown Foods', regex=False, case=False), 'Hometown Foods'),
    (df['Store Name'].str.contains("Bucky's", case=False, regex=False), "Bucky's Express"),
    (df['Store Name'].str.contains('Kwik', case=False, regex=False), 'Kwik Shop')
]
One of the big challenges when working with np.select is that it is easy to get the conditions and values mismatched. I've decided to combine each condition and value into a tuple to more easily keep track of the data matches.
Because of this data structure, we need to break the list of tuples into two separate lists. Using zip, we can take store_patterns and break it into store_criteria and store_values:
store_criteria, store_values = zip(*store_patterns)
df['Store_Group_1'] = np.select(store_criteria, store_values, 'other')
This code will fill in each match with the text value. If there is no match, we'll assign it the value "other".
Here's what it looks like now:
df.stb.freq(['Store_Group_1'], value='Sale (Dollars)', style=True, cum_cols=False)
This looks better but there is still 32.28% of our revenue in "other" accounts.
What might be nicer is that if an account doesn't match, we use the Store Name instead of lumping everything together in "other". Here's how we do that:
df['Store_Group_1'] = np.select(store_criteria, store_values, None)
df['Store_Group_1'] = df['Store_Group_1'].combine_first(df['Store Name'])
This uses the combine_first function to fill in all the None values with the Store Name. This is a handy trick to keep in mind when cleaning your data.
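If combine_first is new to you, this toy example (hypothetical values) shows the behavior: values from the first series win unless they are missing.

s1 = pd.Series([None, 'Hy-Vee', None])
s2 = pd.Series(['Store A', 'Store B', 'Store C'])
s1.combine_first(s2)
# 0    Store A
# 1     Hy-Vee
# 2    Store C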
Let's check our data:
df.stb.freq(['Store_Group_1'], value='Sale (Dollars)', style=True, cum_cols=False)
This looks better because we can continue to refine the groupings as needed. For instance, we may want to build a string lookup for Costco.
Performance is not too bad for a large data set:
13.2 s ± 328 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The benefit of this approach is that you can use np.select for numeric analysis as well as the text examples shown here. It is very flexible.
The one challenge with this approach is that there is a lot of code. If you had a large data set to clean, there's a lot of data and code intermixed in this solution.
Is there another approach that might have similar performance but be a little cleaner?
Cleaning attempt #3
The next solution is based on this excellent code example from Matt Harrison, who developed a generalize function that does the matching and cleaning for us. I've made some changes to make it consistent with this example but want to give Matt credit. I would never have thought of this solution without him doing 99% of the work!
def generalize(ser, match_name, default=None, regex=False, case=False):
    """Search a series for text matches.
    Based on code from https://www.metasnake.com/blog/pydata-assign.html

    ser: pandas series to search
    match_name: list of tuples containing the text to search for
                and the text to use for normalization
    default: if no match, use this to provide a default value,
             otherwise use the original text
    regex: Boolean to indicate if match_name contains a regular expression
    case: case sensitive search

    Returns a pandas series with the matched value
    """
    seen = None
    for match, name in match_name:
        mask = ser.str.contains(match, case=case, regex=regex)
        if seen is None:
            seen = mask
        else:
            seen |= mask
        ser = ser.where(~mask, name)
    if default:
        ser = ser.where(seen, default)
    else:
        ser = ser.where(seen, ser.values)
    return ser
This function can be called on a pandas series and expects a list of tuples. The first tuple item is the value to search for and the second is the value to fill in for the matched value.
Here's the equivalent pattern list:
store_patterns_2 = [
    ('Hy-Vee', 'Hy-Vee'),
    ("Smokin' Joe's", "Smokin' Joe's"),
    ('Central City', 'Central City'),
    ('Costco Wholesale', 'Costco Wholesale'),
    ('Walmart', 'Walmart'),
    ('Wal-Mart', 'Walmart'),
    ('Fareway Stores', 'Fareway Stores'),
    ("Casey's", "Casey's General Store"),
    ("Sam's Club", "Sam's Club"),
    ('Kum & Go', 'Kum & Go'),
    ('CVS', 'CVS Pharmacy'),
    ('Walgreens', 'Walgreens'),
    ('Yesway', 'Yesway Store'),
    ('Target Store', 'Target'),
    ('Quik Trip', 'Quik Trip'),
    ('Circle K', 'Circle K'),
    ('Hometown Foods', 'Hometown Foods'),
    ("Bucky's", "Bucky's Express"),
    ('Kwik', 'Kwik Shop')
]
A useful benefit of this solution is that it is much easier to maintain this list than the earlier store_patterns list.
The other change I made to the generalize function is that the original value will be preserved if there is no default value provided. Instead of using combine_first, the function will take care of it all. Finally, I turned off the regex match by default for a small performance improvement.
Now that the data is all set up, calling it is simple:
df['Store_Group_2'] = generalize(df['Store Name'], store_patterns_2)
How about performance?
15.5 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It is a little slower but I think it's a more elegant solution and is what I would use in the future if I had to do a similar text cleanup.
The downside to this approach is that it is designed for string cleaning. The np.select solution is more broadly useful since it can be applied to numeric values as well.
What about data types?
In recent versions of pandas there is a dedicated string type. I tried converting Store Name to the pandas string type to see if there was any performance improvement. I did not notice any change. However, it's possible there will be speed improvements in the future, so keep that in mind.
While the string type did not make a difference, the category type showed a lot of promise on this data set. Refer to my previous article for details on the category data type.
We can convert the data to a category type:
df['Store Name'] = df['Store Name'].astype('category')
Now re-run the np.select example exactly as we did earlier:
df['Store_Group_3'] = np.select(store_criteria, store_values, None)
df['Store_Group_3'] = df['Store_Group_3'].combine_first(df['Store Name'])
786 ms ± 108 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
We went from 13s to less than 1 second by making one simple change. Amazing!
The reason this works is pretty straightforward. When pandas converts a column to a categorical type, it will only call the expensive str.contains() function on each unique text value. Because this data set has a lot of repeated data, we get a huge performance boost.
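One way to confirm that intuition is to compare the number of unique categories to the number of rows; the exact counts will vary with the file you download:

df['Store Name'].cat.categories.size   # unique store names: a few thousand
len(df['Store Name'])                  # total rows: roughly 2.3M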
Let's see if this works for our generalize function:
df['Store_Group_4'] = generalize(df['Store Name'], store_patterns_2)
Unfortunately, we get this error:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
That error highlights some of the challenges I have had in the past when dealing with categorical data. When merging and joining categorical data, you can run into these types of issues.
I tried to figure out a good way to modify generalize() to work with categorical data but could not figure it out. Bonus points to any reader that figures it out.
However, there is a way we can replicate the category approach by building a lookup table.
As we learned with the categorical approach, this data set has a lot of duplicated data. We can build a lookup table and run the resource-intensive function only one time per unique string.
To illustrate how this works on strings, let's convert the value back to a string type instead of the category:
df['Store Name'] = df['Store Name'].astype('string')
First we build a lookup DataFrame that contains all of the unique values and run the generalize function on them:
lookup_df = pd.DataFrame()
lookup_df['Store Name'] = df['Store Name'].unique()
lookup_df['Store_Group_5'] = generalize(lookup_df['Store Name'], store_patterns_2)
We can then merge it back into a final DataFrame:
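A minimal version of that merge, assuming a left join on the shared Store Name column, looks like this:

df_final = pd.merge(df, lookup_df, how='left', on='Store Name')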
1.38 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It is slower than the np.select approach on categorical data, but the performance impact might be balanced by the easier readability of maintaining the lookup list.
Also, the intermediate lookup_df could be a great output to share with an analyst who can help you clean up more of the data. That savings could be measured in hours of work!
This newsletter by Randy Au is a good discussion of the importance of data cleaning and the love-hate relationship many data scientists have with this task. I agree with Randy's premise that data cleaning is analysis.
In my experience, you can learn a lot about your underlying data by taking on the kinds of cleaning activities outlined in this article.
I suspect you are going to find lots of cases in your day-to-day analysis where you need to do text cleaning similar to what I've shown in this article.
Here is a quick summary of the solutions we looked at:
| Solution | Execution time | Notes |
| --- | --- | --- |
| np.select | 13s | Can work for non-text analysis |
| Category data and np.select | 786ms | Categorical data can get tricky when merging and joining |
| Lookup table and generalize | 1.3s | A lookup table can be maintained by someone else |
For some data sets, performance is not an issue, so pick what clicks with your brain.
However, as the data grows in size (imagine doing this analysis for 50 states' worth of data), you will need to understand how to use pandas in an efficient manner for text cleaning. My hope is that you bookmark this article and come back to it when you face a similar problem.
As always, if you have some other tips that might be useful to folks, let me know in the comments. And if you figure out how to make my generalize function work with categorical data, let me know too.