15.07.2017       Issue 186 (10.07.2017 - 16.07.2017)       Articles

Data Science: Performance of Python vs. Pandas vs. Numpy

A comparison of data processing in pure Python, Numpy and Pandas


Speed and time are key factors for any data scientist. In business, you do not usually work with toy datasets of a few thousand samples. It is more likely that your datasets will contain millions or hundreds of millions of samples. Customer orders, web logs, billing events, stock prices: datasets are now huge.

I assume you do not want to spend hours or days waiting for your data processing to complete. The biggest dataset I have worked with so far contained over 30 million records. When I first ran my data processing script on this dataset, the estimated time to complete was around 4 days! I do not have a very powerful machine (a MacBook Air with an i5 and 4 GB of RAM), but the most I could accept was running the script overnight, not for multiple days.

Thanks to some clever tricks, I was able to decrease this running time to a few hours. This post explains the first step towards good data processing performance: choosing the right library/framework for your dataset.

The graph below shows the result of my experiment (details below), expressed as processing speed relative to the processing speed of pure Python.

As you can see, Numpy is several times faster than Pandas. I personally love Pandas for simplifying many tedious data science tasks, and I use it wherever I can. But if the expected processing time runs to many hours, then, with regret, I switch from Pandas to Numpy.

I am well aware that actual performance may vary significantly, depending on the task and the type of processing. So please treat these results as indicative only. No single test can show an “overall” comparison of performance for any set of software tools.

The Notebook

Please review the notebook below for details of the performance test.
You can also review the Notebook on Github.



Data Science: Performance of Pure Python vs Pandas vs Numpy

Intro

In this notebook we will compare the data processing speed of pure Python, Pandas and Numpy.

In [1]:

# Python 3.6
import os
import sys
import gc
import random
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:

# clear memory
gc.collect()
# We will prepare an artificial dataset that contains product orders.
# Each order will have a random number of items ordered and a random price within the given range
dataset_python = []
products_ids = [1,10000] # range of product ids in the dataset (upper bound is excluded by range())
range_orders = [1,10] # range of orders per product from which to select one random value
range_quantity = [1,100] # range of quantity in an order from which to select one random value
range_price = [0.1,30.00] # range of product prices from which to select one random value

In [3]:

def generate_dummy_dataset(products_ids,range_orders,range_quantity,range_price):
    # parameter names now match the names used in the body, so the function
    # no longer silently reads range_quantity/range_price from global scope
    generated_dataset = []
    for idx in range(products_ids[0],products_ids[1]):
        random_orders = random.randint(range_orders[0],range_orders[1])
        for odx in range(random_orders):
            quantity = random.randint(range_quantity[0],range_quantity[1])
            price = random.uniform(range_price[0],range_price[1])
            generated_dataset.append([idx,odx,quantity,price])
    return generated_dataset
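
Because the dataset is drawn at random, the records (and therefore the timings) differ from run to run. The following is only a sketch of one way to make the generation repeatable, assuming a seeded local random.Random instance; the function name generate_seeded_dataset and its parameters are hypothetical and not part of the original notebook.

```python
import random

def generate_seeded_dataset(n_products, max_orders, seed=0):
    """Build rows of [product_id, order_num, quantity, price] reproducibly."""
    rng = random.Random(seed)  # local generator; global random state is untouched
    rows = []
    for idx in range(1, n_products + 1):
        # each product gets between 1 and max_orders orders
        for odx in range(rng.randint(1, max_orders)):
            quantity = rng.randint(1, 100)
            price = rng.uniform(0.1, 30.0)
            rows.append([idx, odx, quantity, price])
    return rows

first = generate_seeded_dataset(5, 10, seed=42)
second = generate_seeded_dataset(5, 10, seed=42)
print(first == second)  # True: the same seed yields the same dataset
```

With a fixed seed, repeated benchmark runs operate on identical data, which makes timing differences easier to attribute to the code rather than to the input.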

In [4]:

dataset_python = generate_dummy_dataset(products_ids,range_orders,range_quantity,range_price)
print ("Dataset size",len(dataset_python),"records")
dataset_python[0:3]

Out[4]:

[[1, 0, 17, 20.38398364947676],
 [1, 1, 52, 21.906741053915106],
 [2, 0, 85, 9.61975135761023]]

In [5]:

dataset_numpy = np.array(dataset_python)
dataset_numpy[0:3,:]

Out[5]:

array([[  1.        ,   0.        ,  17.        ,  20.38398365],
       [  1.        ,   1.        ,  52.        ,  21.90674105],
       [  2.        ,   0.        ,  85.        ,   9.61975136]])

In [6]:

dataset_pandas = pd.DataFrame(data=dataset_numpy,columns=['product_id','product_order_num','quantity','price'])
dataset_pandas.head(3)

Out[6]:

   product_id  product_order_num  quantity      price
0         1.0                0.0      17.0  20.383984
1         1.0                1.0      52.0  21.906741
2         2.0                0.0      85.0   9.619751

In [7]:

# function generating statistics using pure Python
def gen_stats_python(dataset_python):
    start = time.time()
    product_stats = []
    unique_products = set([x[0] for x in dataset_python])
    for product_id in unique_products:
        product_items = [x for x in dataset_python if x[0]==product_id ]
        num_orders = len(product_items)
        total_quantity = 0
        total_price = 0
        for row in product_items:
            total_quantity += row[2]
            total_price += row[3]
        avg_price = float(total_price/num_orders)
        product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
    end = time.time()
    working_time = end-start
    return product_stats,working_time
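
The pure Python function above rescans the whole dataset once per product, so its cost grows with (number of products) x (number of rows). As a point of comparison, and not as the author's method, here is a minimal sketch of a single-pass aggregation with a dictionary. The name gen_stats_single_pass and the sample rows are made up for illustration, and the result is sorted by product id, which the original function does not guarantee.

```python
from collections import defaultdict

def gen_stats_single_pass(rows):
    """Aggregate [product_id, order_num, quantity, price] rows in one pass."""
    # product_id -> [num_orders, total_quantity, total_price]
    totals = defaultdict(lambda: [0, 0, 0.0])
    for product_id, _order_num, quantity, price in rows:
        acc = totals[product_id]
        acc[0] += 1
        acc[1] += quantity
        acc[2] += price
    return [
        [int(pid), n, int(qty), round(total / n, 2)]
        for pid, (n, qty, total) in sorted(totals.items())
    ]

# illustrative rows, not taken from the generated dataset
sample = [[1, 0, 17, 20.0], [1, 1, 52, 22.0], [2, 0, 85, 9.5]]
print(gen_stats_single_pass(sample))  # [[1, 2, 69, 21.0], [2, 1, 85, 9.5]]
```

This changes the algorithm, not just the library, so it is not a like-for-like entry in the benchmark; it only shows that part of the gap measured here comes from the repeated filtering.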

In [8]:

# function generating statistics using Pandas
def gen_stats_pandas(dataset_pandas):
    start = time.time()
    product_stats = []
    for product_id in pd.unique(dataset_pandas['product_id']):
        product_items = dataset_pandas.loc[dataset_pandas['product_id'] == product_id]
        num_orders = product_items.shape[0]
        total_quantity = product_items['quantity'].sum()
        avg_price = float(product_items['price'].mean())
        product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
    end = time.time()
    working_time = end-start
    return product_stats,working_time
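
The Pandas function above filters the DataFrame once per product in a Python loop. Pandas also offers groupby, which computes the same per-product statistics without an explicit loop. The sketch below is an alternative technique, not what the notebook benchmarks; gen_stats_groupby and the sample frame are hypothetical names and values used only for illustration.

```python
import pandas as pd

def gen_stats_groupby(df):
    """Per-product order count, total quantity and mean price via groupby."""
    grouped = df.groupby('product_id')  # groups are sorted by product_id by default
    num_orders = grouped.size()
    total_quantity = grouped['quantity'].sum()
    avg_price = grouped['price'].mean()
    # all three Series share the same (sorted) product_id index
    return [
        [int(pid), int(n), int(qty), round(float(avg), 2)]
        for pid, n, qty, avg in zip(num_orders.index, num_orders,
                                    total_quantity, avg_price)
    ]

# illustrative rows, not taken from the generated dataset
sample_df = pd.DataFrame(
    [[1, 0, 17, 20.0], [1, 1, 52, 22.0], [2, 0, 85, 9.5]],
    columns=['product_id', 'product_order_num', 'quantity', 'price'],
)
print(gen_stats_groupby(sample_df))  # [[1, 2, 69, 21.0], [2, 1, 85, 9.5]]
```

Results come back ordered by product id. Whether this is faster on a given dataset has to be measured; no timing is claimed here.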

In [9]:

# function generating statistics using Numpy
def gen_stats_numpy(dataset_numpy):
    start = time.time()
    product_stats = []
    for product_id in np.nditer(np.unique(dataset_numpy[:,0]),op_dtypes=['float64']):
        product_items = dataset_numpy[dataset_numpy[:,0] == product_id]
        num_orders = product_items.shape[0]
        total_quantity = product_items[:,2].sum()
        avg_price = float(product_items[:,3].mean())
        product_stats.append([int(product_id),int(num_orders),int(total_quantity),round(avg_price,2)])
    end = time.time()
    working_time = end-start
    return product_stats,working_time
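
The Numpy function above still loops over products in Python and builds a boolean mask for each one. A fully vectorized variant is possible with np.unique and np.bincount. This is a sketch under the assumption that the data is a 2-D float array laid out like dataset_numpy; gen_stats_bincount and the sample array are hypothetical, and the notebook itself does not use this approach.

```python
import numpy as np

def gen_stats_bincount(data):
    """Per-product stats from columns [product_id, order_num, quantity, price]."""
    # inverse maps every row to the position of its product in `ids`
    ids, inverse, counts = np.unique(
        data[:, 0], return_inverse=True, return_counts=True)
    total_quantity = np.bincount(inverse, weights=data[:, 2])
    avg_price = np.bincount(inverse, weights=data[:, 3]) / counts
    return [
        [int(pid), int(n), int(qty), round(float(avg), 2)]
        for pid, n, qty, avg in zip(ids, counts, total_quantity, avg_price)
    ]

# illustrative rows, not taken from the generated dataset
sample_array = np.array([[1, 0, 17, 20.0], [1, 1, 52, 22.0], [2, 0, 85, 9.5]])
print(gen_stats_bincount(sample_array))  # [[1, 2, 69, 21.0], [2, 1, 85, 9.5]]
```

The mean is computed here as sum divided by count, so in rare cases the last floating-point digits may differ from ndarray.mean() before rounding.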

In [10]:

stats_python = gen_stats_python(dataset_python)[0]
stats_pandas = gen_stats_pandas(dataset_pandas)[0]
stats_numpy = gen_stats_numpy(dataset_numpy)[0]
print (stats_python[0:3])
print (stats_pandas[0:3])
print (stats_numpy[0:3])
print ("All results equal:",stats_python == stats_pandas == stats_numpy)

[[1, 2, 69, 21.15], [2, 9, 430, 10.13], [3, 1, 40, 2.34]]
[[1, 2, 69, 21.15], [2, 9, 430, 10.13], [3, 1, 40, 2.34]]
[[1, 2, 69, 21.15], [2, 9, 430, 10.13], [3, 1, 40, 2.34]]
All results equal: True

In [11]:

# clear memory
del stats_python,stats_pandas,stats_numpy

In [13]:

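# NUM_ITERATIONS is not defined in any cell shown here (cell In [12] is missing from this excerpt);
# set it to the desired number of timing runs before running this cell.
gc.collect()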
total_time = 0
for i in range(NUM_ITERATIONS):
    total_time += gen_stats_python(dataset_python)[1]
python_avg_time = total_time/NUM_ITERATIONS
print ("Python average time:",python_avg_time,"seconds")

In [14]:

gc.collect()
total_time = 0
for i in range(NUM_ITERATIONS):
    total_time += gen_stats_pandas(dataset_pandas)[1]
pandas_avg_time = total_time/NUM_ITERATIONS
print ("Pandas average time:",pandas_avg_time,"seconds")

In [15]:

gc.collect()
total_time = 0
for i in range(NUM_ITERATIONS):
    total_time += gen_stats_numpy(dataset_numpy)[1]
numpy_avg_time = total_time/NUM_ITERATIONS
print ("Numpy average time:",numpy_avg_time,"seconds")

In [16]:

python_vs_python = 1
pandas_vs_python = python_avg_time/pandas_avg_time
numpy_vs_python = python_avg_time/numpy_avg_time
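
The three timing cells and the ratio cell above repeat one pattern: run a function several times, average the elapsed time, then divide the pure Python time by the library time. A compact sketch of that pattern follows, assuming time.perf_counter as the clock; the helpers average_time and speedup are hypothetical and the numbers in the example are made up, not measured results.

```python
import time

def average_time(func, *args, iterations=3):
    """Mean wall-clock seconds of func(*args) over `iterations` runs."""
    total = 0.0
    for _ in range(iterations):
        start = time.perf_counter()  # monotonic, high-resolution clock
        func(*args)
        total += time.perf_counter() - start
    return total / iterations

def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds

# made-up timings: a 12 s baseline against a 3 s candidate
print(speedup(12.0, 3.0))  # 4.0
```

A value above 1 means the candidate is faster than pure Python, which matches how the chart in the last cell labels its bars.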

In [17]:

plt.style.use('fivethirtyeight')
objects = ('Python', 'Pandas', 'Numpy')
y_rng = np.arange(len(objects))
performance = [python_vs_python,pandas_vs_python,numpy_vs_python]
plt.barh(y_rng, performance, align='center', alpha=0.5)
plt.yticks(y_rng, objects)
plt.xlabel('Processing speed')
plt.title('Data processing speed (Python = 1x)')
for a,b in zip(performance, y_rng):
    plt.text(a, b, "  "+str(round(a))+"x")
plt.show()
