
Высокоэффективная обработка данных в Python

High Performance Data Processing In Python

@donald_whyte

pycon_russia

[NEXT]

About Me

![small_portrait](images/donald.jpg)
  • Software Engineer
  • @ Engineers Gate
  • Scalable data infrastructure
  • Real-time trading systems
  • Python/C++/Rust developer

[NEXT]

Python is a hugely popular tool for data analysis.

[NEXT]

Data analysis is now as popular as web development with Python.

*JetBrains Python Developer Survey 2017* **[1]**

note https://www.jetbrains.com/research/python-developers-survey-2017/

[NEXT]

Why?

[NEXT]

High-level and easy to use.

Wealth of tools for processing/analysing data.

General-purpose language useful outside of data analysis.

[NEXT]

Great language for research.

[NEXT]

What about production?

[NEXT]

Isolated to Research?

research_use_case research_use_case research_use_case

Large data analysis/processing used to be isolated to research.

One-off batch jobs to produce insight for research and decision making.

note Data analysis used to only be active in the realm of research. Analysts would write one-off jobs that cleaned up data and analysed it. The findings would then be included in research papers, presentations to management in firms and so on.

It was very rare that you'd run such heavy data analysis frequently in live production systems.

[NEXT]

Things Have Changed

Exponential growth of data.

Need real-time insights into this data.

Machine learning/stats models are running in live production systems.

note Source: https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/

[NEXT]

Artificial Intelligence

Projected Global Revenue

note Artificial Intelligence software market projected to reach almost $90 billion by 2025.

[NEXT]

More data to process.

More numerical models being trained for live use.

Models larger and more complex.

[NEXT]

Strict time requirements.

[NEXT]

The Traditional Process

  1. Researcher builds model in their tech of choice
  2. Programmer takes research code and rewrites it in heavily optimised C/C++
  3. Production code is deployed
  4. Everything works fine

[NEXT]

Success!

[NEXT]

The Reality

bad_construction bad_construction

[NEXT]

The Reality

  1. Researcher builds model that works on their machine
  2. Programmer attempts to rewrite model for production
  3. Programmer can't replicate the researcher's results
  4. Everyone spends tons of time figuring out why

note Useful link discussing deploying models to prod: https://www.quora.com/How-do-you-take-a-machine-learning-model-to-production

[NEXT]

Results in...

anger

  • Deployment delays
  • Compromises on model accuracy to release it faster

[NEXT]

A Better Process

Research and production code is identical.

note A better process is to make the research and production code identical. They can be configured differently, but the code which pre-processes the data, trains the models and executes it in prod should be the same.

[NEXT]

Problem

Want to use Python.

Enables researchers to run experiments quickly.

But Pure Python is slow.

note But we like Python because it's easy to use for research.

[NEXT]

Python vs. C Performance

[NEXT]

Speedup using C

[NEXT]

Solution

Python's ecosystem for data science.

[NEXT]

[NEXT]

[NEXT]

NumPy

  • Heart of scientific computing in Python
  • Stores and operates on data in C structures
  • Avoids slowness of Python
numpy_coloured

[NEXT]

Foundation of most scientific computing packages.

scipy pandas sklearn matplotlib

[NEXT]

Our Focus

Showing how to use NumPy to process numerical data.

Exploring how NumPy leverages vectorisation to dramatically boost performance.

[NEXT]

Outline

  1. Analyse a large weather dataset
  2. Process dataset in pure Python
  3. Speed up processing using NumPy and vectorisation
  4. Speed up processing even more using Numba

[NEXT]

Final Optimised Solution

1145 times faster than pure Python.

[NEXT SECTION]

1. The Dataset

dataset

[NEXT]

Integrated Surface Database (ISD)

isd

note Global database of atmospheric weather data.

This map shows the spatial distribution of Integrated Surface Database stations. Data has been collected from 35,000 weather stations scattered across the globe.

Source: https://www.ncdc.noaa.gov/isd

[NEXT]

Measurements

wind speed and direction

temperature

sea level pressure

sky visibility

weather_measurements

note Detailed list of fields:

wind speed and direction, wind gust, temperature, dew point, cloud data, sea level pressure, altimeter setting, station pressure, present weather, visibility, precipitation amounts for various time periods, snow depth, and various other elements as observed by each station.

[NEXT]

Coverage

globe

7 continents

35,000 weather stations

1901 to now

from over 100 data sources

[NEXT]

Total Data Volume > 600GB

note ISD integrates data from over 100 original data sources, including numerous data formats that were key-entered from paper forms during the 1950s–1970s time frame

[NEXT]

Example

tabriz_airport

[NEXT]

timestamp station_id wind_speed_rate ...
1995-01-06 03:00:00 407060 50.0 ...
1995-01-06 06:00:00 407060 70.0 ...
1995-01-06 09:00:00 407060 null ...
1995-01-06 12:00:00 407060 60.0 ...
1995-01-06 16:00:00 407060 20.0 ...

note Wind speed rate = the rate of horizontal travel of air past a fixed point.

UNITS: meters per second
SCALING FACTOR: 10
MISSING VALUE: -9999

http://www.polmontweather.co.uk/windspd.htm

[NEXT]

Tabriz Wind Speed Rate

Over Two Days

tabriz_wind_speed_rate

(2011-12-29 to 2011-12-31)

[NEXT]

Research Goal

Use ISD data to detect extreme weather events that happen anywhere on the planet.

[NEXT]

Initial Goal

Detecting hurricanes

hurricanes

[NEXT]

ISD-Lite Dataset

Let's test our approach on a smaller dataset.

Dates 1991-01-01 to 2011-12-31
Measurement Wind Speed Rate
Stations ~6000
Rows ~400,000,000

note Total stations: 5,700. Total rows: 391,908,527.

[NEXT SECTION]

2. Pure Python

python

[NEXT] How do we detect hurricanes?

Finding data points with unusually low/high wind_speed_rate values.

[NEXT]

Detecting Outliers

outlier_detection

[NEXT]

Detecting Outliers

At each point i in the time series:

  1. Take values in time series between points i - 30 and i
  2. Calculate mean and standard deviation
  3. Value at i is an outlier if it's:
  • more than 6 standard deviations away from the mean
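
A rough pure-Python sketch of this check (illustrative only, not the exact code from the talk's repository):

def is_outlier(values, i, window=30, threshold=6):
    # Values in the time series between points (i - window) and i.
    window_values = values[max(0, i - window):i]
    if not window_values:
        return False
    mean = sum(window_values) / len(window_values)
    variance = sum((x - mean) ** 2 for x in window_values) / len(window_values)
    std = variance ** 0.5
    # Outlier if the point deviates from the moving mean by more than
    # `threshold` standard deviations.
    return std > 0 and abs(values[i] - mean) > threshold * std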

[NEXT] moving_mean_and_std

note

  1. Split full dataset into separate station time series
  2. For each weather station time series, detect outliers by:
  3. calculate moving mean and stdev at each point
  4. check if a point is > 6 stdevs away from its moving mean value
  5. if so, mark point as outlier
  6. generate CSV containing all outliers in each station's data

[NEXT]

The Input

HDF5 file containing three columns:

  • station_id
  • timestamp
  • wind_speed_rate

[NEXT]

HDF5

  • Hierarchical Data Format
  • Designed to store large amounts of binary data
  • No text parsing required
  • Efficient to load
[more information on this format here](https://www.hdfgroup.org/HDF5/)

note HDF5 is an open source file format for storing huge amounts of numerical data.

It’s typically used in research applications (meteorology, astronomy, genomics etc.) to distribute and access very large datasets without using a database.

It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
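
A minimal loading sketch for this dataset, assuming the file is read with h5py and stores one dataset per column (the production code later in the talk indexes it the same way):

import h5py

# Open the ISD-Lite HDF5 file and pull each column into NumPy arrays.
with h5py.File('isdlite.hdf5', 'r') as input_file:
    station_ids = input_file['station_id'][:]
    timestamps = input_file['timestamp'][:]
    wind_speed_rates = input_file['wind_speed_rate'][:]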

[NEXT]

Invariants

Rows sorted by (station_id, timestamp).

Each station's rows are grouped together.

Ordered by time.

[NEXT]

Problem: Null Values

timestamp station_id wind_speed_rate
1995-01-06 03:00:00 407060 50.0
1995-01-06 06:00:00 407060 70.0
1995-01-06 09:00:00 407060 null
1995-01-06 12:00:00 407060 70.0
1995-01-06 17:00:00 407060 20.0

[NEXT]

Solution: Fill Forward

timestamp station_id wind_speed_rate
1995-01-06 03:00:00 407060 50.0
1995-01-06 06:00:00 407060 70.0
1995-01-06 09:00:00 407060 70.0
1995-01-06 12:00:00 407060 70.0
1995-01-06 17:00:00 407060 20.0

[NEXT]

The Code

Source file on GitHub: find_outliers_purepython.py

[NEXT] complete_process

[NEXT]

Code Breakdown

station_ranges partition full dataset into per-station time series
fill_forward fill in missing data with previous values
moving_average computing moving average at every time point
moving_std computing moving stdev at every time point
find_outliers get indices of outliers using deviance from moving avg

[NEXT]

Running the Code

> python3 -m find_outliers_purepy \
    --input isdlite.hdf5 \
    --output outliers.csv \
    --measurement wind_speed_rate
Determining range of each station time series
Found time series for 5183 ranges
Removing time series that don't have enough data
Kept 4695 / 5183 station time series
Computing outliers
Computed outliers in 14499.84 seconds
Writing outliers to outliers.csv

[NEXT]

Output Outliers CSV

station_id timestamp wind_speed_rate
720346 1996-04-25 11:00:00 110.0
720358 1997-01-31 09:00:00 40.0
997375 1993-01-29 15:00:00 100.0
... ... ...

[NEXT] Some detected outliers:

997299,2006-09-01 09:00:00,400.0
997299,2006-09-01 12:00:00,400.0

The affected weather station is:

> grep 997299 stations.csv
"997299","99999","CHEASAPEAKE BRIDGE","US","VA","","+36.970","-076.120","+0016.0","20050217","20161231"

[NEXT] detected_hurricane

[NEXT] detected_hurricane

[NEXT] detected_hurricane

[NEXT]

Success!

🎉

[NEXT]

Time Taken

4 hours.

[NEXT]

The Ultimate Goal

Use ISD data to detect extreme weather events that happen anywhere on the planet.

[NEXT]

Detecting Outliers in the Full Dataset

All 8 measurements.

All 35,000 weather stations.

From 1901 to now.

note What if we ran the same outlier detection code on the full dataset?

[NEXT]

It would take 27 days.

waiting_skeleton

[NEXT]

What went wrong?

[NEXT]

Profiling the Code

Use cProfile to find out which steps were the performance bottlenecks.

> python3 -m cProfile -o profile_output \
     find_outliers_purepy.py \
     --input isdlite.hdf5 \
     --output outliers.csv \
     --measurement wind_speed_rate

[NEXT]

Visualising Performance Bottlenecks

snakeviz generates visualisations of profiling data.

> pip3 install snakeviz
> snakeviz profile_output

[NEXT]

Example

Calculating standard deviation of 100 million integers.

import math

def _main():
    a = list(range(100000000))
    _std(a)

def _std(a):
    mean = _mean(a)
    squared_differences = _square(_differences(a, mean))
    sum_of_sq_diffs = _sum(squared_differences)
    # Sample standard deviation: divide by (n - 1) before the square root.
    return math.sqrt(sum_of_sq_diffs / (len(a) - 1))

def _mean(a):
    return _sum(a) / len(a)

def _differences(a, mean):
    return [x - mean for x in a]

def _square(a):
    return [x * x for x in a]

def _sum(a):
    s = 0
    for x in a:
        s += x
    return s

def _divide(a, d):
    return [x / d for x in a]

[NEXT]

Snakeviz Output

snakeviz

[NEXT]

Finding Outliers

Execution Time Breakdown

[NEXT] Total time: 4 hours (14530 secs)

[NEXT] Total time: 4 hours (14530 secs)

[NEXT]

Why is Python so slow?

note Source for upcoming sections: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

[NEXT]

Reason 1

Dynamic Typing

[NEXT] When a Python program executes, the interpreter doesn't know the type of the variables that are defined.

python_slow_1

[NEXT] More instructions needed for any operation.

Primary reason Python is slower than C or other compiled languages for processing numerical data.
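
A small illustration: the same name can be rebound to a different type at any time, so the interpreter has to re-check types on every single operation.

x = 3
x = x + 4         # integer addition
x = "three"
x = x + " four"   # same + operator, now string concatenation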

[NEXT]

Reason 2

Interpreted, not Compiled

[NEXT] Python code is interpreted at runtime.

Quick to iterate, but gives less chance to optimise.

During compilation, a smart compiler can look ahead and optimise inefficient code.

note See section 5 to see how compiling Python code can dramatically speed it up.

[NEXT]

Reason 3

Fragmented Memory Access

[NEXT] python_slow_2

[NEXT] Bad for code that steps through data in sequence.

Iterating through a single list accesses completely different regions of memory.

Not cache friendly.
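
A quick way to see the indirection (a minimal sketch):

import sys

values = [1.5, 2.5, 3.5]
# The list object only stores pointers; each float is a separate boxed
# object somewhere else on the heap.
print(sys.getsizeof(values))               # size of the list (its pointer array)
print([sys.getsizeof(v) for v in values])  # size of each boxed float
print([id(v) for v in values])             # every element has its own address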

[NEXT SECTION]

3. Numpy

numpy

[NEXT]

The Foundation

Fundamental package for high performance computing in Python.

Many libraries/frameworks are built on top of NumPy.

[NEXT]

Features

  • multi-dimensional array objects
  • routines for fast operations on arrays
    • mathematical, logical, sorting, selecting
  • efficient loading/saving of numerical data to disk
    • including HDF5

[NEXT] numpy.ndarray

  • class encapsulating n-dimensional arrays
  • fixed size
  • elements must be the same type

note At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance.

[NEXT]

Examples

import numpy as np

[NEXT]

Creating an Array

>>> a = np.arange(9, dtype=np.float64)
>>> a
array([0., 1., 2., 3., 4., 5., 6., 7., 8.])
>>> a.shape
(9,)
>>> a.strides
(8,)

[NEXT]

Memory Layout

ndarray

note A NumPy array in its simplest form is a Python object built around a C array. That is, it has a pointer to a contiguous data buffer of values.

data is pointer indicating the memory address of the first byte in the array.

dtype indicates the type of elements stored in the array.

shape indicates the shape of the array. That is, it defines the dimensionality of the data in the array and how many elements the array stores for each dimension.

The strides are the number of bytes that should be skipped in memory to go to the next element. If your strides are (32, 8), you need to proceed 8 bytes to get to the next column and 32 bytes to move to the next row.

flags is a set of configurable flags we don't need to cover here.

Python View

python_view

[NEXT]

Reshape

>>> b = a.reshape(3, 3)
>>> b
array([[0., 1., 2.],
       [3., 4., 5.],
       [6., 7., 8.]])

ndarray
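
The strides change with the shape: for this 3x3 float64 view they become 24 bytes to step to the next row and 8 bytes to the next column, while the underlying buffer stays the one a points to.

>>> b.strides
(24, 8)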

[NEXT]

Slicing One Dimension

>>> b[:, :2]
array([[0., 1.],
       [3., 4.],
       [6., 7.]])

ndarray

[NEXT]

Slicing Multiple Dimensions

>>> b[:2, :2]
array([[0., 1.],
       [3., 4.]])

ndarray

[NEXT] Reshaping or slicing arrays creates a view.

No copies are made.

[NEXT]

Performance Benefits

  • Data stored contiguously
    • no memory overhead
    • cache locality
  • No copies for common reshaping/slicing operations
  • Fast logical and mathematical operations
    • executed in heavily optimised compiled code

[NEXT]

Benchmark

Adding 10,000,000 Numbers

[NEXT]

Pure Python

a = list(range(10000000))
b = list(range(10000000))

# 1. indexing
c = [a[i] + b[i] for i in range(len(a))]

[NEXT]

NumPy

import numpy as np

a = np.arange(10000000)
b = np.arange(10000000)

# 2. loop
c = np.zeros(len(a))
for i in range(len(a)):
    c[i] = a[i] + b[i]

# 3. built-in numpy addition operator
d = a + b
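
The timings on the next slides can be reproduced with something like the following (an illustrative harness, not necessarily the exact one used for the slides):

import timeit

numpy_setup = 'import numpy as np; a = np.arange(10000000); b = np.arange(10000000)'
python_setup = 'a = list(range(10000000)); b = list(range(10000000))'

# Average the built-in NumPy addition over 10 runs; one run is enough
# for the much slower pure Python version.
print(timeit.timeit('a + b', setup=numpy_setup, number=10) / 10)
print(timeit.timeit('[a[i] + b[i] for i in range(len(a))]',
                    setup=python_setup, number=1))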

[NEXT]

Timing (seconds)

[NEXT]

Speedup Factor

[NEXT] NumPy with loops is the slowest of all choices.

Takes 4x longer than pure Python!

[NEXT]

Explicitly Looping over Numpy Array

numpy_add_loop

note For every integer, we're making two __getitem__ calls, performing the addition in Python and copying each result into the output numpy array with a call to __setitem__.

This dramatically slows down the computation for three reasons:

  1. This adds function call overhead. We invoke four Python functions for each integer. That's 40,000,000 function calls.
  2. It performs three copies for each addition. It copies the ith element of a and b, then copies the addition into c.
  3. The overhead and copies destroy cache locality. The copies are likely in a very different part of the address space, meaning the CPU is having to do more work to fetch data from RAM, instead of just using its local cache.

[NEXT]

Using Built-in Addition

numpy_add_native

note The full addition logic is executed in native, compiled NumPy code. There are no function call overheads and no copies.

The memory buffers storing a and b are directly accessed when adding. Since those buffers are stored contiguously in memory, we're cache friendly. The CPU has to fetch less data from RAM.

[NEXT]

Keep it in NumPy!

Don't loop through np.ndarrays.

Move the computation to the NumPy/C/native code level where possible.

[NEXT]

A Problem...

For arrays with the same size, operations are performed element-by-element.

Sometimes we want to apply smaller scalars or vectors to larger arrays.

e.g. adding one to all elements in an array

note We want to use NumPy's built-in operations, but we don't want to perform loads of copies to match up the array sizes.

[NEXT]

Adding 1 to an Array

adding_one_to_array

Adding 1 to N elements would take N - 1 copies!

[NEXT]

Broadcasting

Allows us to apply smaller arrays to larger arrays.

Without copying.
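
In code (a minimal sketch, reusing the 3x3 array shape from earlier, here with integers):

>>> a = np.arange(9).reshape(3, 3)
>>> a + 1                        # scalar broadcast to every element
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> a + np.array([10, 20, 30])   # vector broadcast across each row
array([[10, 21, 32],
       [13, 24, 35],
       [16, 27, 38]])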

[NEXT]

Broadcasting Scalar to Array

broadcasting

[NEXT]

Broadcasting Scalar to Array

broadcasting

[NEXT]

Broadcasting Vector to Array

broadcasting

[NEXT]

Broadcasting Vector to Array

broadcasting

[NEXT]

Using NumPy for Outlier Detection

[NEXT]

Recap

station_ranges partition full dataset into per-station time series
fill_forward fill in missing data with previous values
moving_average computing moving average at every time point
moving_std computing moving stdev at every time point
find_outliers get indices of outliers using deviance from moving avg

[NEXT] complete_process

[NEXT] station_ranges()

def station_ranges(station_ids: np.ndarray) -> np.ndarray:
    is_end_of_series = station_ids[:-1] != station_ids[1:]
    indices_where_stations_change = np.where(
      is_end_of_series == True)[0] + 1
    series_starts = np.concatenate((
        np.array([0]),
        indices_where_stations_change
    ))
    series_ends = np.concatenate((
        indices_where_stations_change,
        np.array([len(station_ids) - 1])
    ))
    return np.column_stack((series_starts, series_ends))

[NEXT]

station_ids = np.array([123, 123, 124, 245, 999, 999])

station_ranges0

[NEXT]

is_end_of_series = station_ids[:-1] != station_ids[1:]

station_ranges1

[NEXT]

indices_where_stations_change = (
    np.where(is_end_of_series == True)[0] + 1)

station_ranges2

[NEXT]

series_starts = np.concatenate((
    [0],
    indices_where_stations_change
))

station_ranges3

[NEXT]

series_ends = np.concatenate((
    indices_where_stations_change,
    [len(station_ids) - 1]
))

station_ranges4

[NEXT]

np.column_stack((series_starts, series_ends))

station_ranges5

[NEXT] station_ranges6

[NEXT] Total time: 4 hours ⟶ 1.4 hours

Speedup: 2.85x

[NEXT]

Why NumPy is So Much Faster

  • No extra memory overhead
  • Minimal copying
  • Cache friendly
  • Operations executed in optimised compiled code

[NEXT] But also...

[NEXT SECTION]

4. Vectorisation

vectorisation

[NEXT]

Process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time.

note Source: https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus

[NEXT] Modern CPUs provide direct support for vector operations.

A single instruction is applied to multiple data points.

[NEXT]

Adding Two Vectors

Single Instruction Single Data (SISD)

sisd

Adding N numbers takes N instructions.

[NEXT]

Adding Two Vectors

Single Instruction Multiple Data (SIMD)

simd

Adding N numbers takes N / 4 instructions.

note Basically for you as a coder, SIMD allows you to perform four operations (reading/writing/calculating) for the price of one instruction. The cost reduction is enabled by vectorization and data-parallelism. You don’t even have to handle threads and race conditions to gain this parallelism.

[NEXT]

SIMD on CPUs

  • Most modern CPUs support vectorisation.
  • A CPU with a 512-bit register can hold eight 64-bit doubles.
  • One instruction for 8 doubles.
cpu

[NEXT]

Example in C

Adding 10,000,000 Numbers

[NEXT]

Non-Vectorised

void add(float* a, float* b, float* out, int len) {
    for (int i = 0; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
}

[NEXT]

Vectorised

void add_vectorised(float* a, float* b, float* out, int len) {
    int i = 0;
    for (; i < len - 4; i += 4) {
        out[i] = a[i] + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
}

[NEXT] Disable optimisations to prevent compiler auto-vectorising.

clang -O0 vectorised_timings.c

[NEXT]

Speedup: 1.4x

[NEXT]

Vectorised Definitions

Context
Native Code Apply single operations to multiple data items at once using special CPU registers.
Python Keeping as much computation in NumPy/native code as possible.

Both involve making algorithms use array/vector/matrix based computation (not iterative).

note Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

  • vectorized code is more concise and easier to read
  • fewer lines of code generally means fewer bugs
  • the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
  • vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.

[NEXT]

Vectorisation for Outlier Detection

[NEXT] Unvectorised fill_forward()

def fill_forward(arr: np.ndarray):
    prev_val = arr[0]
    for i in range(1, len(arr)):
        if np.isnan(arr[i]):
            arr[i] = prev_val
        else:
            prev_val = arr[i]

[NEXT] Vectorised fill_forward()

def fill_forward(arr: np.ndarray) -> np.ndarray:
    mask = ~np.isnan(arr)
    indices = np.arange(len(arr))
    indices_to_use = np.where(mask, indices, 0)
    np.maximum.accumulate(
        indices_to_use,
        out=indices_to_use)
    return arr[indices_to_use]

[NEXT]

# wind_speed_rate measurements for a single weather station.
arr = np.array([
    20, 5, 3, 8, np.nan, np.nan, 6, np.nan, 25, 5
])

vectorised_fill_forward0

[NEXT]

mask = ~np.isnan(arr)

vectorised_fill_forward1

[NEXT]

indices = np.arange(len(arr))

vectorised_fill_forward2

[NEXT]

indices_to_use = np.where(mask, indices, 0)

vectorised_fill_forward3

[NEXT]

np.maximum.accumulate(indices_to_use, out=indices_to_use)
return arr[indices_to_use]

vectorised_fill_forward4

[NEXT] Unvectorised moving_average()

def moving_average(arr: np.ndarray,
                    n: int) -> np.ndarray:
    avg = np.zeros(len(arr) - n + 1)
    for i in range(len(avg)):
        avg[i] = arr[i:i+n].sum() / n
    return avg

note Glance over this and the next slide. Just state that this one has a for loop. We can vectorise it to eliminate the loop.

[NEXT] Vectorised moving_average()

def moving_average(arr: np.ndarray,
                    n: int) -> np.ndarray:
    ret = np.cumsum(arr, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
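
A quick sanity check of the cumulative-sum trick with n = 3:

>>> moving_average(np.array([1., 2., 3., 4., 5.]), 3)
array([2., 3., 4.])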

[NEXT] Total time: 1.4 hours ⟶ 48 mins

Speedup: 2.85x ⟶ 5x

[NEXT SECTION]

5. Numba

numba

note see https://numba.pydata.org/ for examples

[NEXT]

Not all algorithms are vectorisable.

note Are these non-vectorisable Python functions doomed to be slow?

[NEXT]

Solution

Compile non-vectorisable Python code to native machine instructions.

[NEXT]

Numba

Annotate Python functions with decorators.

Compiles them to optimised machine code at runtime.

Just-in-time (JIT) compilation.

LLVM for compiling to machine instructions.

note Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

[NEXT] numba.jit

Decorator that tells Numba to compile a function to native instructions.

[NEXT]

Example

Summing an Array of Numbers

def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result

[NEXT]

Sprinkle Some Numba Magic

from numba import jit

@jit(nopython=True)
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result

[NEXT]

Timing (seconds)

[NEXT]

Speedup Factor

[NEXT]

Type Deduction

Numba automatically deduces the types of JIT-compiled functions.

Uses types of arguments in function's first invocation.
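
A minimal illustration (the first call is also when compilation happens, so it is noticeably slower than later calls):

import numpy as np

arr = np.arange(10000000, dtype=np.float64)
sum_array(arr)  # first call: Numba compiles a specialisation for float64 arrays
sum_array(arr)  # subsequent calls run the already-compiled code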

[NEXT]

Explicitly Set Types

from numba import int64, jit

@jit(int64(int64[:]), nopython=True)
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result

[NEXT]

Drawbacks

  • Numba type inference sometimes fails
  • You might need to specify types manually
    • arguably makes code more verbose / harder to read
  • Restricted language features using nopython=True
    • variable types are fixed
    • cannot use arbitrary classes

note Numba FAQ lists many of the drawbacks: https://numba.pydata.org/numba-doc/dev/user/faq.html

[NEXT]

Using Numba for Outlier Detection

Added @jit(nopython=True) to all functions.

Explicitly specified types.

[NEXT] Total time: 48 mins ⟶ 2.46 mins

Speedup: 5x ⟶ 98x

[NEXT] Total time: 48 mins ⟶ 2.46 mins

Speedup: 5x ⟶ 98x

[NEXT]

To Summarise

Use vectorised NumPy code where possible.

Fall back to Numba if code cannot be vectorised.

[NEXT SECTION]

6. Parallel Computation

parallel_computation

[NEXT]

Recap

  1. Moved data to contiguous buffers
  2. Run most computation in compiled/optimised machine instructions
  3. Vectorised computation to take advantage of CPU's SIMD feature

Current speedup: 98x

[NEXT]

Embarrassingly Parallel

Full dataset is partitioned into different station time series.

Outliers in each station time series are calculated independently.

Split stations into N groups.

Process each group on a different CPU core.

[NEXT] joblib

Library for building lightweight data pipelines.

[NEXT] joblib.Parallel

Parallelises Python loops.

Handles spawning of new Python processes and storing intermediate results for you.

[NEXT]

Before: Sequential For Loop

all_outliers = [
    compute_outliers(wind_speeds[start:end])
    for start, end in station_index_ranges
]

[NEXT]

After: Process Each Station Time Series in Parallel

from multiprocessing import cpu_count
from joblib import delayed, Parallel

processor = Parallel(n_jobs=cpu_count())
all_outliers = processor(
    delayed(compute_outliers)(wind_speeds[start:end])
    for start, end in station_index_ranges)

[NEXT] Total time: 2.46 mins ⟶ 1.37 mins

Speedup: 98x ⟶ 177x

[NEXT]

Problem: Excessive Copies

Worker processes receive copies of input data.

Means all station time series are copied.

Adds significant overhead to parallelisation.

note By default the workers of the joblib pool are real Python processes forked using the multiprocessing module of the Python standard library. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.

[NEXT]

Solution: Memmap

Map in-process memory to data stored on disk.

memmap

[NEXT] np.memmap

Write the measurement data to a memmap'd file.

# Open handle to temporary memmap file.
data = np.memmap(
    'wind_speeds',
    dtype=np.float64,
    shape=(len(input_file['wind_speed_rate']),),
    mode='w+')

# Load all wind_speed_rates into memmap file.
data[:] = input_file['wind_speed_rate'][:]

# Flushes contents to disk.
del data
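
Each worker can then open the same file as a read-only memmap and slice only its station's rows without copying the whole dataset (a sketch; n_rows is illustrative and stands for the total number of measurements):

# In each worker process: re-open the file read-only and slice into it lazily.
wind_speeds = np.memmap(
    'wind_speeds',
    dtype=np.float64,
    shape=(n_rows,),
    mode='r')
outliers = compute_outliers(wind_speeds[start:end])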

[NEXT] joblib_memmap

note mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.

[NEXT] Total time: 1.37 mins ⟶ 0.83 mins

Speedup: 177x ⟶ 291x

[NEXT SECTION]

7. The Final Timings

final_timings

[NEXT]

Current Timings

[NEXT]

Current Speedups

[NEXT]

Main Bottleneck

Computing moving standard deviation.

[NEXT]

Vectorised Rolling STD

Speedup: 10x
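
One way to vectorise it (a sketch, not necessarily the exact code behind the 10x figure) reuses the cumulative-sum moving_average from section 4, using var(x) = E[x^2] - E[x]^2:

def moving_std(arr: np.ndarray, n: int) -> np.ndarray:
    # Moving variance = moving mean of squares minus square of moving mean.
    mean = moving_average(arr, n)
    mean_of_squares = moving_average(arr * arr, n)
    # Clamp tiny negative values introduced by floating point rounding.
    variance = np.maximum(mean_of_squares - mean * mean, 0.0)
    return np.sqrt(variance)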

[NEXT]

Final Speedup

1145 times faster.

[NEXT]

The Ultimate Goal

Use ISD data to detect extreme weather events that happen anywhere on the planet.

[NEXT]

Detecting Outliers in the Full Dataset

All 8 measurements.

All 35,000 weather stations.

From 1901 to now.

note What if we ran the same outlier detection code on the full dataset?

[NEXT]

27 days ⟶ 38 minutes

[NEXT] On a single MacBook Pro.

[NEXT SECTION]

Fin

fin

[NEXT] Python is great for research.

Out of the box, Python is slow.

[NEXT] Increasing demands for faster/real-time data processing.

Processing large volumes of data or training complex machine learning models.

Standard Python in prod isn't viable for many use cases.

[NEXT] You could research in Python, then convert the code to a faster language.

Can cause more problems than it solves.

[NEXT] Use Python for research and production.

Possible by using Python's large ecosystem of scientific computing packages.

[NEXT] Keep computation in native code as much as possible.

Vectorise using NumPy where possible.

Use Numba to optimise unvectorisable code.

[NEXT] Identify opportunities to parallelise.

Parallelise using joblib to abstract parallelisation details from the code.

[NEXT] joblib abstracts the worker backend.

Workers can be CPU cores or a machine cluster.

Run on single machine with multiple CPU cores first.

Run on cluster of machines only when necessary or if you already have the infrastructure.

Either way, code is almost identical.

note This abstracts the worker backend. Workers can be CPU cores or machines. Either way, the code remains the same.

[NEXT] numpy/numba/joblib alone can yield 1000x speedup.

[NEXT]

Don't throw the problem to dev ops.

[NEXT] If RAM or disk is your bottleneck, parallelise using a cluster.

Otherwise, you can get very far with vectorisation and sprinkling @numba.jit magic.

[NEXT]

Спасибо!

[NEXT]

Links

[NEXT]

Get In Touch

![small_portrait](images/donald.jpg)

[NEXT SECTION]

Appendix

[NEXT]

References

[1] https://www.jetbrains.com/research/python-developers-survey-2017/

[NEXT]

Timing Specifications

All performance timings in these slides were produced by running the code on a machine with the following specs:

OS: macOS Sierra v10.12.6
Processor: 2.3 GHz Intel Core i5
Memory: 8 GB 2133 MHz LPDDR3

[NEXT]

Image Credits