
home / writings / diary / archive / 2020 / 09 / 16 / faster_gzip_reading_in_python

Faster gzip reading in Python

In this essay I'll describe how I improved chemfp's gzip read performance by about 15% by replacing Python's built-in gzip module with a ctypes interface to libz. If you need faster gzip read performance, you might consider using zcat or similar tool in a subprocess - if so, look at the xopen module.

Gzip decompression overhead is enough that the 15% read speedup corresponds to a 5% overall speedup for chemfp's sdf2fps tool.

chemfp is a high-performance cheminformatics fingerprint similarity search package for Python. See its home page and documentation for details. Various licensing options are available, including the option to download a pre-compiled package that works on most Linux-based OSes so you can test most of the features right now.

Measuring gzip read performance

Gzip compression is ubiquitous. The gzip module from Python's standard library makes it easy for any Python program to read and write gzip'ed files. How fast is it?

I'll use Compound_000000001_000500000.sdf.gz from the PubChem SDF distribution, which is 320 MiB in size, as my test data. (1 MiB = 1024² = 1,048,576 bytes, which is slightly larger than 1 MB = mega = 1,000,000 bytes.)

Here's a timing program, which I ran on a Debian machine:

import gzip
import time
import os

def main():
    filename = "Compound_000000001_000500000.sdf.gz"
    gz_size = os.path.getsize(filename)
    n = 0
    with gzip.open(filename) as f:
        t1 = time.time()
        while 1:
            block = f.read(1_000_000)
            if not block:
                break
            n += len(block)
        t2 = time.time()
    gz_MiBps = gz_size/(t2-t1)/1024/1024
    MiBps = n/(t2-t1)/1024/1024
    print(f"dt: {t2-t1:.2f} sec gzin: {gz_MiBps:.1f} MiB/sec out: {MiBps:.1f} MiB/sec ({n} bytes)")

if __name__ == "__main__":
    main()

(The import line and the open() call are the parts of the code that will change when I benchmark two other gzip readers.)

Running it gives the following output:

dt: 7.60 sec gzin: 42.1 MiB/sec out: 343.6 MiB/sec (2738211097 bytes)

Not bad, but as I pointed out in my previous essay, it takes about twice as long for chemfp to read a gzip-compressed FPS file as an uncompressed one. (And that's with the faster gzio reader I'm about to discuss.)

zlib via ctypes

I looked into that overhead as part of the chemfp 3.4 release, thinking I could push more of the I/O code into C for better performance. I found that I could get about 15% better performance by using a ctypes-based Python module.

Python's gzip package, like many gzip implementations, builds on the zlib C library to handle compression and decompression. That zlib library is available on most systems as a shared library. Python's ctypes module gives Python code a way to talk directly to shared libraries like zlib. This mechanism is often called a foreign function interface.
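The essay doesn't show gzio.py itself, but the core of the ctypes approach can be sketched like this. (gzread_all is a hypothetical helper for illustration, not chemfp's actual API; it uses zlib's documented gzopen/gzread/gzclose functions.)

```python
import ctypes
import ctypes.util

# Load the zlib shared library (assumes libz is installed, as on most systems).
libz = ctypes.CDLL(ctypes.util.find_library("z"))

# Declare the signatures of zlib's gzFile API.
libz.gzopen.restype = ctypes.c_void_p
libz.gzopen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
libz.gzread.restype = ctypes.c_int
libz.gzread.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_uint]
libz.gzclose.restype = ctypes.c_int
libz.gzclose.argtypes = [ctypes.c_void_p]

def gzread_all(filename, block_size=1_000_000):
    # Open the gzip'ed file through libz; gzopen returns NULL on failure.
    handle = libz.gzopen(filename.encode(), b"rb")
    if not handle:
        raise IOError(f"could not open {filename!r}")
    buf = ctypes.create_string_buffer(block_size)
    chunks = []
    try:
        # gzread returns the number of decompressed bytes, 0 at EOF, <0 on error.
        while True:
            n = libz.gzread(handle, buf, block_size)
            if n < 0:
                raise IOError(f"gzread failed on {filename!r}")
            if n == 0:
                break
            chunks.append(buf.raw[:n])
    finally:
        libz.gzclose(handle)
    return b"".join(chunks)
```

The win comes from doing the decompression loop and buffering in C, with only one Python-level call per block instead of the layered file-object machinery in Python's gzip module.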

I wrote gzio.py, to read bytes from a gzip'ed file for use in chemfp. Using a variation of the above timing program:

import gzio
import time
import os

def main():
    filename = "Compound_000000001_000500000.sdf.gz"
    gz_size = os.path.getsize(filename)
    n = 0
    with gzio.gzopen_rb(filename) as f:
        t1 = time.time()
        while 1:
            block = f.read(1_000_000)
            if not block:
                break
            n += len(block)
        t2 = time.time()
    gz_MiBps = gz_size/(t2-t1)/1024/1024
    MiBps = n/(t2-t1)/1024/1024
    print(f"dt: {t2-t1:.2f} sec gzin: {gz_MiBps:.1f} MiB/sec out: {MiBps:.1f} MiB/sec ({n} bytes)")

if __name__ == "__main__":
    main()

I was able to process about 395 MiBytes/sec rather than about 340.

dt: 6.62 sec gzin: 48.3 MiB/sec out: 394.5 MiB/sec (2738211097 bytes)

A 15% improvement is nice, so I included gzio.py as an internal module for chemfp.

zcat

The zcat program (gzcat on a Mac; use gzip -dc to be portable) decompresses a gzip-compressed file and writes the results to stdout. Presumably this is well optimized and should let us know how much performance we can expect.

% time gzip -dc Compound_000000001_000500000.sdf.gz > /dev/null

real	0m10.905s
user	0m10.824s
sys	0m0.080s

It's very peculiar that zcat is slower than Python. I did the above timings with the system version, gzip 1.6. I also installed gzip 1.10, but the timings were about the same. In case it helps, /etc/debian_version says it's buster/sid.

I installed pigz 2.4, which was significantly faster:

% time pigz -dc Compound_000000001_000500000.sdf.gz > /dev/null

real	0m4.813s
user	0m7.515s
sys	0m0.282s

(The user time is higher than the real time, likely because of multithreading. You'll note that the overall user+sys time is still a couple of seconds faster than gzip's real time.)

(g)zcat on my Mac

I primarily work on a Mac but I don't tend to do timings on it because the background system activity (like web pages and media playing) can have a big effect on the timing. That's why I used a Debian machine for the above timings. I tried the system version of gzcat ("Apple gzip 272.250.1") and installed GNU gzip 1.10; the data file isn't quite the same size, but close enough:

% gzcat --version
Apple gzip 272.250.1
% gzcat Compound_000000001_000500000.sdf.gz | wc -c
 2738222236
% time gzcat Compound_000000001_000500000.sdf.gz > /dev/null
4.179u 0.154s 0:04.67 92.5%	0+0k 0+0io 0pf+0w
% ~/local/bin/zcat --version
zcat (gzip) 1.10
Copyright (C) 2007, 2011-2018 Free Software Foundation, Inc.
This is free software.  You may redistribute copies of it under the terms of
the GNU General Public License <https://www.gnu.org/licenses/gpl.html>.
There is NO WARRANTY, to the extent permitted by law.

Written by Paul Eggert.
% time ~/local/bin/zcat Compound_000000001_000500000.sdf.gz > /dev/null
4.193u 0.157s 0:04.69 92.5%	0+0k 0+0io 0pf+0w

That's twice as fast as the Debian machine, for reasons I still don't understand.

For what it's worth, I also tested pigz 2.4, which was a bit slower:

% time pigz -dc Compound_000000001_000500000.sdf.gz > /dev/null
4.835u 1.060s 0:05.13 114.8%	0+0k 0+0io 24pf+0w

Now back to the benchmark code using Python's gzip module and my gzio module:

% python time_gz.py
dt: 9.05 sec gzin: 35.3 MiB/sec out: 288.5 MiB/sec (2738222236 bytes)
% python time_gzio.py
dt: 7.02 sec gzin: 45.6 MiB/sec out: 372.0 MiB/sec (2738222236 bytes)

That means my gzio package is about 20% faster than Python's gzip on this machine, though still well short of Apple gzip or GNU gzip.

This is more like what I expected (which makes sense as I did most of my development on my laptop). I still don't know why GNU gzip is so much slower on that Debian machine.

xopen - use zcat as a subprocess

I am far from the first to point out that it's faster to use zcat than Python's gzip library. The xopen module, for example, can use the command-line pigz or gzip programs as a subprocess to decompress a file, then read from the program's stdout via a pipe. This approach provides a basic form of parallelization, as the decompression runs in a different process from the parser for the file contents.

Let's test it out with a simple variation of the benchmark code:

import xopen
import time
import os

def main():
    filename = "Compound_000000001_000500000.sdf.gz"
    gz_size = os.path.getsize(filename)
    n = 0
    with xopen.xopen(filename, "rb") as f:
        t1 = time.time()
        while 1:
            block = f.read(1_000_000)
            if not block:
                break
            n += len(block)
        t2 = time.time()
    gz_MiBps = gz_size/(t2-t1)/1024/1024
    MiBps = n/(t2-t1)/1024/1024
    print(f"dt: {t2-t1:.2f} sec gzin: {gz_MiBps:.1f} MiB/sec out: {MiBps:.1f} MiB/sec ({n} bytes)")

if __name__ == "__main__":
    main()

This gives:

dt: 4.67 sec gzin: 68.4 MiB/sec out: 558.7 MiB/sec (2738222236 bytes)

Nice!

It's not hard to roll your own using the subprocess module, but there are a few annoying details to get right. For example, what if the gzip process fails because the file isn't found, or because the file isn't in gzip format? The process will start successfully then quickly exit. So, when is that error reported?

What xopen does is wait 10 ms and then check for an exit failure.
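A bare-bones version of the subprocess approach might look like the following sketch. (open_gzip_subprocess is a made-up name for illustration; xopen handles many more details, such as preferring pigz when it's available and cleaning up the child process on close.)

```python
import subprocess
import time

def open_gzip_subprocess(filename):
    # Decompress in a separate process; the parent reads from the pipe.
    proc = subprocess.Popen(
        ["gzip", "-dc", filename],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # A missing or non-gzip file makes gzip exit almost immediately,
    # so wait briefly (as xopen does) and check for an early failure.
    time.sleep(0.01)
    if proc.poll() is not None and proc.returncode != 0:
        message = proc.stderr.read().decode(errors="replace").strip()
        raise IOError(f"gzip failed: {message}")
    return proc.stdout
```

The returned proc.stdout then behaves like any binary file object; a fuller implementation would also wait for the child to exit when the stream is closed.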

Bigger pipe size

An exciting new feature (added within the last month!) is that on Linux the xopen package will use fcntl to raise the pipe size from the default 64 KiB to the system maximum, 1024 KiB. If I read the benchmark correctly, on the test system the overall performance doubled.
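Resizing a pipe on Linux looks roughly like the following sketch. It assumes a Linux /proc filesystem, and spells out the F_SETPIPE_SZ/F_GETPIPE_SZ constants because the fcntl module only exposes them by name starting in Python 3.10.

```python
import fcntl

# Linux-specific fcntl commands; named constants exist in fcntl from Python 3.10.
F_SETPIPE_SZ = 1031
F_GETPIPE_SZ = 1032

def enlarge_pipe(fd):
    # Ask the kernel for the largest pipe an unprivileged process may request.
    with open("/proc/sys/fs/pipe-max-size") as f:
        max_size = int(f.read())
    fcntl.fcntl(fd, F_SETPIPE_SZ, max_size)
    # Report the size actually granted.
    return fcntl.fcntl(fd, F_GETPIPE_SZ)
```

A larger pipe means the decompressing child can run further ahead of the reader before blocking, which is where the benchmark improvement comes from.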

Ruben Vorderman (one of the xopen authors) then took the next step of submitting patches to Python itself - "Add pipesize parameter to subprocess" and "Add F_GETPIPE_SZ and F_SETPIPE_SZ to fcntl" - for inclusion (hopefully) into Python 3.10.

Reading gzip'ed files with chemfp

By default chemfp uses my gzio wrapper to libz. It can be configured to use Python's gzip library, or to use a subprocess. It does not use xopen - I rolled my own version using subprocess - though after looking at the xopen code I'm reconsidering that decision.

For example, the chemfp program sdf2fps parses an SD file to extract fingerprints encoded in one of the SD tags. PubChem stores the CACTVS/PubChem fingerprint in the PUBCHEM_CACTVS_SUBSKEYS field, with a special encoding. The --pubchem command-line flag tells sdf2fps to read and decode that tag value. By default sdf2fps generates a FPS file, which I'll ignore as I'm only interested in the timings.

First, here's the time using Python's gzip module (these timings are on my Mac):

% CHEMFP_USE_SYSTEM_GZIP=1 /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
       10.79 real        10.16 user         0.36 sys

Second, the default uses my 'gzio' ctypes wrapper, and you can see a clear performance gain:

% /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
        8.88 real         8.33 user         0.17 sys

Lastly, I'll use Apple and GNU gzcat, which are faster still:

% CHEMFP_GZCAT=/usr/bin/gzcat /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
        8.24 real         9.42 user         0.66 sys
% CHEMFP_GZCAT=/Users/dalke/local2/bin/zcat /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
        8.34 real         9.48 user         0.67 sys

The user+sys time is larger than the real time because the times for both processes are included.

These timings are not that precise because of background activity on my laptop, but the ranking is generally the same. It's definitely enough to show there are gzip reading options which are faster than Python's built-in module.

On the Debian machine

I ran similar timings on the Debian machine. pigz was the clear wall-clock winner, at 8.1 seconds elapsed instead of over 11. My gzio package was about 6% faster overall than Python's gzip module, and had the lowest overall user time.

    # Python's built-in gzip 
% CHEMFP_USE_SYSTEM_GZIP=1 /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
11.79user 0.08system 0:11.87elapsed 100%CPU (0avgtext+0avgdata 36676maxresident)k
0inputs+0outputs (0major+7852minor)pagefaults 0swaps

    # my gzio ctypes wrapper to libz
% /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
10.81user 0.36system 0:11.18elapsed 99%CPU (0avgtext+0avgdata 15748maxresident)k
0inputs+0outputs (0major+307474minor)pagefaults 0swaps

   # GNU gzip 1.10  
% CHEMFP_GZCAT=/home/andrew/local/bin/zcat /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
15.20user 1.33system 0:12.78elapsed 129%CPU (0avgtext+0avgdata 15692maxresident)k
0inputs+0outputs (0major+533095minor)pagefaults 0swaps

   # pigz 2.4
% CHEMFP_GZIP=/home/andrew/ftps/pigz-2.4/pigz /usr/bin/time sdf2fps --pubchem Compound_000000001_000500000.sdf.gz > /dev/null
11.92user 1.66system 0:08.10elapsed 167%CPU (0avgtext+0avgdata 15752maxresident)k
0inputs+0outputs (0major+533078minor)pagefaults 0swaps

fps.gz read performance on the Debian machine

I also tested chemfp's performance reading gzip-compressed FPS files to see how gzio compares to Python's gzip. First, reading raw records is about 6% faster:

     # Python's gzip module
% CHEMFP_USE_SYSTEM_GZIP=1 /usr/bin/time python3 -c 'import chemfp;print(sum(1 for _ in chemfp.open("chembl_27.fps.gz")))'
1941410
5.46user 0.07system 0:05.53elapsed 99%CPU (0avgtext+0avgdata 22588maxresident)k
0inputs+0outputs (0major+18089minor)pagefaults 0swaps

     # my gzio ctypes wrapper to libz
% /usr/bin/time python3 -c 'import chemfp;print(sum(1 for _ in chemfp.open("chembl_27.fps.gz")))'
1941410
5.12user 0.04system 0:05.16elapsed 99%CPU (0avgtext+0avgdata 13456maxresident)k
0inputs+0outputs (0major+16351minor)pagefaults 0swaps

And second, creating an arena (this test is not available under the chemfp base license) is also about 5% faster:

     # Python's gzip module
% CHEMFP_USE_SYSTEM_GZIP=1 /usr/bin/time python3 -c 'import chemfp;chemfp.load_fingerprints("chembl_27.fps.gz")'
7.20user 0.30system 0:07.51elapsed 99%CPU (0avgtext+0avgdata 1146340maxresident)k
0inputs+0outputs (0major+299456minor)pagefaults 0swaps

     # my gzio ctypes wrapper to libz
% /usr/bin/time python3 -c 'import chemfp;chemfp.load_fingerprints("chembl_27.fps.gz")'
6.84user 0.26system 0:07.10elapsed 99%CPU (0avgtext+0avgdata 1138216maxresident)k
0inputs+0outputs (0major+290749minor)pagefaults 0swaps

That is, while the actual gzip reader is about 15% faster, the rest of the code hasn't changed, so the overall speedup is only about 5%.
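This is just Amdahl's law: if gzip reading accounts for a fraction p of the total runtime and that part gets s times faster, the overall speedup is 1/((1-p) + p/s). A quick check with rough numbers (the one-third fraction is an assumption for illustration, not a measured value):

```python
def overall_speedup(p, s):
    # Amdahl's law: p = fraction of runtime affected, s = speedup of that part.
    return 1 / ((1 - p) + p / s)

# If roughly a third of the runtime were gzip reading, a 15% faster
# reader would give about a 5% overall speedup:
print(overall_speedup(1/3, 1.15))  # ≈ 1.045
```

The corollary is that no amount of gzip tuning can speed things up more than the fraction of time spent in gzip; larger wins need parallelism, as pigz and the subprocess approach show.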


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB




