18.12.2014       Issue 57 (14.12.2014 - 21.12.2014)       Articles

Merging multiple Excel files with python + pandas

Advantages and disadvantages of the direct approach and of the method proposed by the author

Read >>




Experimental feature:

Below is the text of the linked article, so you can quickly judge whether the link is worth reading.

Please note that the text here may differ from the text at the link.

Introduction

A common task for python and pandas is to automate the process of aggregating data from multiple files and spreadsheets.

This article will walk through the basic flow required to parse multiple Excel files, combine the data, clean it up and analyze it. The combination of python + pandas can be extremely powerful for these activities and can be a very useful alternative to the manual processes or painful VBA scripts frequently used in business settings today.

The Problem

Before I get into the examples, here is a simple diagram showing the challenges of the common process used in businesses all over the world to consolidate data from multiple Excel files, clean it up and perform some analysis.

If you’re reading this article, I suspect you have experienced some of the problems shown above. Cutting and pasting data or writing painful VBA code will quickly get old. There has to be a better way!

Python + pandas can be a great alternative that is much more scalable and powerful.

By using a python script, you can develop a more streamlined and repeatable solution to your data processing needs. The rest of this article will show a simple example of how this process works. I hope it will give you ideas of how to apply these tools to your unique situation.

Collecting the Data

If you are interested in following along, here are the Excel files and a link to the notebook:

The first step in the process is collecting all the data into one place.

First, import pandas and numpy
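The import cell is not shown in this extract; presumably it was along these lines (glob is also needed for the file matching further down):

```python
import glob           # file-pattern matching used to find the input files

import numpy as np    # aggregation functions used later (np.sum, np.mean, np.std)
import pandas as pd   # DataFrame handling
```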

Let’s take a look at the files in our input directory, using the convenient shell commands in ipython.

address-state-example.xlsx  report.xlsx                sample-address-new.xlsx
customer-status.xlsx            sales-feb-2014.xlsx    sample-address-old.xlsx
excel-comp-data.xlsx            sales-jan-2014.xlsx    sample-diff-1.xlsx
my-diff-1.xlsx                  sales-mar-2014.xlsx    sample-diff-2.xlsx
my-diff-2.xlsx                  sample-address-1.xlsx  sample-salesv3.xlsx
my-diff.xlsx                    sample-address-2.xlsx
pricing.xlsx                    sample-address-3.xlsx

There are a lot of files, but we only want to look at the sales .xlsx files.

../in/sales-feb-2014.xlsx  ../in/sales-jan-2014.xlsx  ../in/sales-mar-2014.xlsx

Use the python glob module to easily list out the files we need.
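The call itself is not shown here; assuming the input layout above, it was presumably something like:

```python
import glob

# Match only the monthly sales workbooks; the other .xlsx files in the
# directory are ignored.  glob returns paths in arbitrary order, which
# is why the resulting list is not sorted.
sales_files = glob.glob("../in/sales*.xlsx")
```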

['../in/sales-jan-2014.xlsx',
 '../in/sales-mar-2014.xlsx',
 '../in/sales-feb-2014.xlsx']

This gives us what we need. Let’s import each of our files and combine them into one file. pandas’ concat and append can do this for us. I’m going to use append in this example.

The code snippet below will initialize a blank DataFrame then append all of the individual files into the all_data DataFrame.

all_data = pd.DataFrame()
for f in glob.glob("../in/sales*.xlsx"):
    df = pd.read_excel(f)
    all_data = all_data.append(df, ignore_index=True)

Now we have all the data in our all_data DataFrame. You can use describe to look at it and make sure your data looks good.
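The describe call itself is not shown in this extract. A minimal, self-contained sketch with stand-in data (the real all_data has 1,742 rows):

```python
import pandas as pd

# Stand-in for the combined sales data; column names match the article
all_data = pd.DataFrame({
    "account number": [740150, 714466, 218895],
    "quantity": [39, -1, 23],
    "unit price": [86.69, 63.16, 90.70],
    "ext price": [3380.91, -63.16, 2086.10],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = all_data.describe()
print(summary.loc["count"])
```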

       account number     quantity   unit price    ext price
count     1742.000000  1742.000000  1742.000000  1742.000000
mean    485766.487945    24.319173    54.985454  1349.229392
std     223750.660792    14.502759    26.108490  1094.639319
min     141962.000000    -1.000000    10.030000   -97.160000
25%     257198.000000    12.000000    32.132500   468.592500
50%     527099.000000    25.000000    55.465000  1049.700000
75%     714466.000000    37.000000    77.607500  2074.972500
max     786968.000000    49.000000    99.850000  4824.540000

A lot of this data may not make much sense for this data set, but I’m most interested in the count row to make sure the number of data elements makes sense. In this case, I see all the data rows I expect. For reference, here are the first few rows of the combined data:

   account number                         name       sku  quantity  unit price  ext price                 date
0          740150                   Barton LLC  B1-20000        39       86.69    3380.91  2014-01-01 07:21:51
1          714466              Trantow-Barrows  S2-77896        -1       63.16     -63.16  2014-01-01 10:00:47
2          218895                    Kulas Inc  B1-69924        23       90.70    2086.10  2014-01-01 13:24:58
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05  2014-01-01 15:05:22
4          412290                Jerde-Hilpert  S2-34077         6       83.21     499.26  2014-01-01 23:26:55

It is not critical in this example, but the best practice is to convert the date column to a datetime object.

all_data['date'] = pd.to_datetime(all_data['date'])

Combining Data

Now that we have all of the data in one DataFrame, we can do any manipulations the DataFrame supports. In this case, the next thing we want to do is read in another file that contains the customer status by account. You can think of this as a company’s customer segmentation strategy or some other mechanism for identifying their customers.

First, we read in the data.

status = pd.read_excel("../in/customer-status.xlsx")
status
    account number                             name  status
0           740150                       Barton LLC    gold
1           714466                  Trantow-Barrows  silver
2           218895                        Kulas Inc  bronze
3           307599      Kassulke, Ondricka and Metz  bronze
4           412290                    Jerde-Hilpert  bronze
5           729833                        Koepp Ltd  silver
6           146832                     Kiehn-Spinka  silver
7           688981                      Keeling LLC  silver
8           786968         Frami, Hills and Schmidt  silver
9           239344                       Stokes LLC    gold
10          672390                  Kuhn-Gusikowski  silver
11          141962                       Herman LLC    gold
12          424914                    White-Trantow  silver
13          527099                 Sanford and Sons  bronze
14          642753                      Pollich LLC  bronze
15          257198  Cronin, Oberbrunner and Spencer    gold

We want to merge this data with our concatenated data set of sales. Use pandas’ merge function and tell it to do a left join, which is similar to Excel’s vlookup function.

all_data_st = pd.merge(all_data, status, how='left')
all_data_st.head()
   account number                         name       sku  quantity  unit price  ext price                 date  status
0          740150                   Barton LLC  B1-20000        39       86.69    3380.91  2014-01-01 07:21:51    gold
1          714466              Trantow-Barrows  S2-77896        -1       63.16     -63.16  2014-01-01 10:00:47  silver
2          218895                    Kulas Inc  B1-69924        23       90.70    2086.10  2014-01-01 13:24:58  bronze
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05  2014-01-01 15:05:22  bronze
4          412290                Jerde-Hilpert  S2-34077         6       83.21     499.26  2014-01-01 23:26:55  bronze

This looks pretty good but let’s look at a specific account.

all_data_st[all_data_st["account number"]==737550].head()
    account number                          name       sku  quantity  unit price  ext price                 date  status
9           737550  Fritsch, Russel and Anderson  S2-82423        14       81.92    1146.88  2014-01-03 19:07:37     NaN
14          737550  Fritsch, Russel and Anderson  B1-53102        23       71.56    1645.88  2014-01-04 08:57:48     NaN
26          737550  Fritsch, Russel and Anderson  B1-53636        42       42.06    1766.52  2014-01-08 00:02:11     NaN
32          737550  Fritsch, Russel and Anderson  S1-27722        20       29.54     590.80  2014-01-09 13:20:40     NaN
42          737550  Fritsch, Russel and Anderson  S1-93683        22       71.68    1576.96  2014-01-11 23:47:36     NaN

This account number was not in our status file, so we have a bunch of NaN’s. We can decide how we want to handle this situation. For this specific case, let’s label all missing accounts as bronze. Use the fillna function to easily accomplish this on the status column.

all_data_st['status'].fillna('bronze',inplace=True)
all_data_st.head()
   account number                         name       sku  quantity  unit price  ext price                 date  status
0          740150                   Barton LLC  B1-20000        39       86.69    3380.91  2014-01-01 07:21:51    gold
1          714466              Trantow-Barrows  S2-77896        -1       63.16     -63.16  2014-01-01 10:00:47  silver
2          218895                    Kulas Inc  B1-69924        23       90.70    2086.10  2014-01-01 13:24:58  bronze
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05  2014-01-01 15:05:22  bronze
4          412290                Jerde-Hilpert  S2-34077         6       83.21     499.26  2014-01-01 23:26:55  bronze

Check the data just to make sure we’re all good.

all_data_st[all_data_st["account number"]==737550].head()
    account number                          name       sku  quantity  unit price  ext price                 date  status
9           737550  Fritsch, Russel and Anderson  S2-82423        14       81.92    1146.88  2014-01-03 19:07:37  bronze
14          737550  Fritsch, Russel and Anderson  B1-53102        23       71.56    1645.88  2014-01-04 08:57:48  bronze
26          737550  Fritsch, Russel and Anderson  B1-53636        42       42.06    1766.52  2014-01-08 00:02:11  bronze
32          737550  Fritsch, Russel and Anderson  S1-27722        20       29.54     590.80  2014-01-09 13:20:40  bronze
42          737550  Fritsch, Russel and Anderson  S1-93683        22       71.68    1576.96  2014-01-11 23:47:36  bronze

Now we have all of the data along with the status column filled in. We can do our normal data manipulations using the full suite of pandas capability.

Using Categories

One of the relatively new features in pandas is support for categorical data. From the pandas documentation:

Categoricals are a pandas data type, which correspond to categorical variables in statistics: a variable, which can take on only a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender, social class, blood types, country affiliations, observation time or ratings via Likert scales.

For our purposes, the status field is a good candidate for a category type.

Version Warning

You must make sure you have a recent version of pandas (> 0.15) installed for this example to work.
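The check itself is not shown; it was presumably just printing the version string:

```python
import pandas as pd

# Categorical support matured in pandas 0.15; verify the installed version
print(pd.__version__)
```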

'0.15.2'

First, we typecast the column to a category using astype.

all_data_st["status"] = all_data_st["status"].astype("category")

This doesn’t immediately appear to change anything yet.

   account number                         name       sku  quantity  unit price  ext price                 date  status
0          740150                   Barton LLC  B1-20000        39       86.69    3380.91  2014-01-01 07:21:51    gold
1          714466              Trantow-Barrows  S2-77896        -1       63.16     -63.16  2014-01-01 10:00:47  silver
2          218895                    Kulas Inc  B1-69924        23       90.70    2086.10  2014-01-01 13:24:58  bronze
3          307599  Kassulke, Ondricka and Metz  S1-65481        41       21.05     863.05  2014-01-01 15:05:22  bronze
4          412290                Jerde-Hilpert  S2-34077         6       83.21     499.26  2014-01-01 23:26:55  bronze

But you can see that it is a new data type.

account number             int64
name                      object
sku                       object
quantity                   int64
unit price               float64
ext price                float64
date              datetime64[ns]
status                  category
dtype: object

Categories get more interesting when you assign order to the categories. Right now, if we call sort on the column, it will sort alphabetically.
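A minimal sketch of that default behavior, on a hypothetical three-value series rather than the article’s data:

```python
import pandas as pd

# With no explicit ordering, category order defaults to the sorted
# (alphabetical) unique values, so sorting puts bronze first
s = pd.Series(["gold", "bronze", "silver"], dtype="category")
print(s.sort_values().tolist())  # ['bronze', 'gold', 'silver']
```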

      account number              name       sku  quantity  unit price  ext price                 date  status
1741          642753       Pollich LLC  B1-04202         8       95.86     766.88  2014-02-28 23:47:32  bronze
1232          218895         Kulas Inc  S1-06532        29       42.75    1239.75  2014-09-21 11:27:55  bronze
579           527099  Sanford and Sons  S1-27722        41       87.86    3602.26  2014-04-14 18:36:11  bronze
580           383080          Will LLC  B1-20000        40       51.73    2069.20  2014-04-14 22:44:58  bronze
581           383080          Will LLC  S2-10342        15       76.75    1151.25  2014-04-15 02:57:43  bronze

We use set_categories to tell it the order we want to use for this category object. In this case, we use the Olympic medal ordering.

all_data_st["status"] = all_data_st["status"].cat.set_categories(["gold", "silver", "bronze"])

Now, we can sort it so that gold shows on top.
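A self-contained sketch of the effect, again on a hypothetical three-value series:

```python
import pandas as pd

s = pd.Series(["silver", "bronze", "gold"], dtype="category")
# Impose the Olympic-medal ordering on the categories
s = s.cat.set_categories(["gold", "silver", "bronze"])
print(s.sort_values().tolist())  # ['gold', 'silver', 'bronze']
```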

      account number                             name       sku  quantity  unit price  ext price                 date  status
0             740150                       Barton LLC  B1-20000        39       86.69    3380.91  2014-01-01 07:21:51    gold
1193          257198  Cronin, Oberbrunner and Spencer  S2-82423        23       52.90    1216.70  2014-09-09 03:06:30    gold
1194          141962                       Herman LLC  B1-86481        45       52.78    2375.10  2014-09-09 11:49:45    gold
1195          257198  Cronin, Oberbrunner and Spencer  B1-50809        30       51.96    1558.80  2014-09-09 21:14:31    gold
1197          239344                       Stokes LLC  B1-65551        43       15.24     655.32  2014-09-10 11:10:02    gold

Analyze Data

The final step in the process is to analyze the data. Now that it is consolidated and cleaned, we can see if there are any insights to be learned.
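The output below comes from describe on the status column; a stand-in sketch of the same call:

```python
import pandas as pd

# describe() on a categorical column reports count, unique, top and freq
status = pd.Series(["bronze", "bronze", "gold", "silver"], dtype="category")
print(status.describe())
```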

count       1742
unique         3
top       bronze
freq         764
Name: status, dtype: object

For instance, if you want to take a quick look at how your top tier customers are performing compared to the bottom, use groupby to get the average of the values.

all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].mean()
         quantity  unit price    ext price
status
gold    24.680723   52.431205  1325.566867
silver  23.814241   55.724241  1339.477539
bronze  24.589005   55.470733  1367.757736

Of course, you can run multiple aggregation functions on the data to get really useful information.

all_data_st.groupby(["status"])[["quantity", "unit price", "ext price"]].agg([np.sum, np.mean, np.std])
       quantity                          unit price                         ext price
            sum       mean        std          sum       mean        std          sum         mean          std
status
gold       8194  24.680723  14.478670     17407.16  52.431205  26.244516    440088.20  1325.566867  1074.564373
silver    15384  23.814241  14.519044     35997.86  55.724241  26.053569    865302.49  1339.477539  1094.908529
bronze    18786  24.589005  14.506515     42379.64  55.470733  26.062149   1044966.91  1367.757736  1104.129089

So, what does this tell you? Well, the data is completely random but my first observation is that we sell more units to our bronze customers than gold. Even when you look at the total dollar value associated with bronze vs. gold, it looks odd that we sell more to bronze customers than gold.

Maybe we should look at how many bronze customers we have and see what is going on?

What I plan to do is filter out the unique accounts and see how many gold, silver and bronze customers there are.

I’m purposely stringing a lot of commands together which is not necessarily best practice but does show how powerful pandas can be. Feel free to review my previous article here and here to understand it better. Play with this command yourself to understand how the commands interact.

# .ix was removed in modern pandas; .iloc selects the same columns by position
all_data_st.drop_duplicates(subset=["account number", "name"]).iloc[:, [0, 1, 7]].groupby(["status"])["name"].count()
status
gold      4
silver    7
bronze    9
Name: name, dtype: int64

Ok. This makes a little more sense. We see that we have 9 bronze customers and only 4 gold customers. That is probably why the volumes are so skewed towards our bronze customers. This result makes sense given the fact that we defaulted to bronze for many of our customers. Maybe we should reclassify some of them? Obviously this data is fake but hopefully this shows how you can use these tools to quickly analyze your own data.

Conclusion

This example only covered the aggregation of 4 simple Excel files containing random data. However, the principles can be applied to much larger data sets while keeping the code base very manageable. Additionally, you have the full power of python at your fingertips, so you can do much more than simply manipulate the data.

I encourage you to try some of these concepts out on your scenarios and see if you can find a way to automate that painful Excel task that hangs over your head every day, week or month.

Good luck!




