Vectorization with NumPy arrays
At this point, we could choose to call it a day; vectorizing over Pandas series achieves the overwhelming majority of optimization needs for everyday calculations. However, if speed is of highest priority, we can call in reinforcements in the form of the NumPy Python library.
The NumPy library, which describes itself as a “fundamental package for scientific computing in Python”, performs operations under the hood in optimized, pre-compiled C code. Like Pandas, NumPy operates on array objects (referred to as ndarrays); however, it leaves out a lot of overhead incurred by operations on Pandas series, such as indexing, data type checking, etc. As a result, operations on NumPy arrays can be significantly faster than operations on Pandas series.
NumPy arrays can be used in place of Pandas series when the additional functionality offered by Pandas series isn’t critical. For example, the vectorized implementation of our Haversine function doesn’t actually use indexes on the latitude or longitude series, and so not having those indexes available will not cause the function to break. By comparison, had we been doing operations like DataFrame joins, which require referring to values by index, we might want to stick to using Pandas objects.
We convert our latitude and longitude arrays from Pandas series to NumPy arrays simply by using the
values method of the series. As with vectorization on the series, passing the NumPy array directly into the function will lead Pandas to apply the function to the entire vector.
370 µs ± 18 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Running the operation on NumPy array has achieved another four-fold improvement. All in all, we’ve refined the runtime from over half a second, via looping, to a third of a millisecond, via vectorization with NumPy!