
Высокоэффективная обработка данных в Python

High Performance Data Processing In Python

@donald_whyte

pycon_russia

[NEXT]

About Me

![small_portrait](images/donald.jpg)
  • Software Engineer
  • @ Engineers Gate
  • Scalable data infrastructure
  • Real-time trading systems
  • Python/C++/Rust developer

[NEXT]

Python is a hugely popular tool for data analysis.

[NEXT]

Data analysis is now as popular as web development with Python.

*JetBrains Python Developer Survey 2017* **[1]**

note https://www.jetbrains.com/research/python-developers-survey-2017/

[NEXT]

Why?

[NEXT]

High-level and easy to use.

Wealth of tools for processing/analysing data.

General-purpose language useful outside of data analysis.

[NEXT]

Great language for research.

[NEXT]

What about production?

[NEXT]

Isolated to Research?

research_use_case research_use_case research_use_case

Large data analysis/processing used to be isolated to research.

One-off batch jobs to produce insight for research and decision making.

note Data analysis used to only be active in the realm of research. Analysts would write one-off jobs that cleaned up data and analysed it. The findings would then be included in research papers, presentations to management in firms and so on.

It was very rare that you'd run such heavy data analysis frequently in live production systems.

[NEXT]

Things Have Changed

Exponential growth of data.

Need real-time insights into this data.

Machine learning/stats models are running in live production systems.

note Source: https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/

[NEXT]

Artificial Intelligence

Projected Global Revenue

note Artificial Intelligence software market projected to reach almost $90 billion by 2025.

[NEXT]

More data to process.

More numerical models being trained for live use.

Models larger and more complex.

[NEXT]

Strict time requirements.

[NEXT]

The Traditional Process

  1. Researcher builds model in their tech of choice
  2. Programmer takes research code and rewrites it in heavily optimised C/C++
  3. Production code is deployed
  4. Everything works fine

[NEXT]

Success!

[NEXT]

The Reality

bad_construction bad_construction

[NEXT]

The Reality

  1. Researcher builds model that works on their machine
  2. Programmer attempts to rewrite model for production
  3. Programmer can't replicate the researcher's results
  4. Everyone spends tons of time figuring out why

note Useful link discussing deploying models to prod: https://www.quora.com/How-do-you-take-a-machine-learning-model-to-production

[NEXT]

Results in...

anger

  • Deployment delays
  • Compromises on model accuracy to release it faster

[NEXT]

A Better Process

Research and production code is identical.

note A better process is to make the research and production code identical. They can be configured differently, but the code which pre-processes the data, trains the models and executes it in prod should be the same.

[NEXT]

Problem

Want to use Python.

Enables researchers to run experiments quickly.

But Pure Python is slow.

note But we like Python because it's easy to use for research.

[NEXT]

Python vs. C Performance

[NEXT]

Speedup using C

[NEXT]

Solution

Python's ecosystem for data science.

[NEXT]

[NEXT]

[NEXT]

NumPy

  • Heart of scientific computing in Python
  • Stores and operates on data in C structures
  • Avoids slowness of Python
numpy_coloured

[NEXT]

Foundation of most scientific computing packages.

scipy pandas sklearn matplotlib

[NEXT]

Our Focus

Showing how to use NumPy to process numerical data.

Exploring how NumPy leverages vectorisation to dramatically boost performance.

[NEXT]

Outline

  1. Analyse a large weather dataset
  2. Process dataset in pure Python
  3. Speed up processing using NumPy and vectorisation
  4. Speed up processing even more using Numba

[NEXT]

Final Optimised Solution

1145 times faster than pure Python.

[NEXT SECTION]

1. The Dataset

dataset

[NEXT]

Integrated Surface Database (ISD)

isd

note Global database of atmospheric weather data.

This map shows the spatial distribution of Integrated Surface Database stations. Data has been collected from 35,000 weather stations scattered across the globe.

Source: https://www.ncdc.noaa.gov/isd

[NEXT]

Measurements

wind speed and direction

temperature

sea level pressure

sky visibility

weather_measurements

note Detailed list of fields:

wind speed and direction, wind gust, temperature, dew point, cloud data, sea level pressure, altimeter setting, station pressure, present weather, visibility, precipitation amounts for various time periods, snow depth, and various other elements as observed by each station.

[NEXT]

Coverage

globe

7 continents

35,000 weather stations

1901 to now

from over 100 data sources

[NEXT]

Total Data Volume > 600GB

note ISD integrates data from over 100 original data sources, including numerous data formats that were key-entered from paper forms during the 1950s–1970s time frame

[NEXT]

Example

tabriz_airport

[NEXT]

timestamp station_id wind_speed_rate ...
1995-01-06 03:00:00 407060 50.0 ...
1995-01-06 06:00:00 407060 70.0 ...
1995-01-06 09:00:00 407060 null ...
1995-01-06 12:00:00 407060 60.0 ...
1995-01-06 16:00:00 407060 20.0 ...

note Wind speed rate = the rate of horizontal travel of air past a fixed point.

UNITS: meters per second
SCALING FACTOR: 10
MISSING VALUE: -9999

http://www.polmontweather.co.uk/windspd.htm

[NEXT]

Tabriz Wind Speed Rate

Over Two Days

tabriz_wind_speed_rate

(2011-12-29 to 2011-12-31)

[NEXT]

Research Goal

Use ISD data to detect extreme weather events that happen anywhere on the planet.

[NEXT]

Initial Goal

Detecting hurricanes

hurricanes

[NEXT]

ISD-Lite Dataset

Let's test our approach on a smaller dataset.

Dates 1991-01-01 to 2011-12-31
Measurement Wind Speed Rate
Stations ~6000
Rows ~400,000,000

note Total stations: 5,700. Total rows: 391,908,527.

[NEXT SECTION]

2. Pure Python

python

[NEXT] How do we detect hurricanes?

Finding data points with unusually low/high wind_speed_rate values.

[NEXT]

Detecting Outliers

outlier_detection

[NEXT]

Detecting Outliers

At each point i in the time series:

  1. Take values in time series between points i - 30 and i
  2. Calculate mean and standard deviation
  3. Value at i is an outlier if it's:
  • more than 6 standard deviations away from the mean
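
A rough pure-Python sketch of this check (illustrative only, not the exact code from the talk's repository):

def is_outlier(values, i, window=30, threshold=6):
    # Values in the time series between points (i - window) and i.
    window_values = values[max(0, i - window):i]
    if not window_values:
        return False
    mean = sum(window_values) / len(window_values)
    variance = sum((x - mean) ** 2 for x in window_values) / len(window_values)
    std = variance ** 0.5
    # Outlier if the point deviates from the moving mean by more than
    # `threshold` standard deviations.
    return std > 0 and abs(values[i] - mean) > threshold * std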

[NEXT] moving_mean_and_std

note

  1. Split full dataset into separate station time series
  2. For each weather station time series, detect outliers by:
  3. calculate moving mean and stdev at each point
  4. check if a point is > 6 stdevs away from its moving mean value
  5. if so, mark point as outlier
  6. generate CSV containing all outliers in each station's data

[NEXT]

The Input

HDF5 file containing three columns:

  • station_id
  • timestamp
  • wind_speed_rate

[NEXT]

HDF5

  • Hierarchical Data Format
  • Designed to store large amounts of binary data
  • No text parsing required
  • Efficient to load
[more information on this format here](https://www.hdfgroup.org/HDF5/)

note HDF5 is an open source file format for storing huge amounts of numerical data.

It’s typically used in research applications (meteorology, astronomy, genomics etc.) to distribute and access very large datasets without using a database.

It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
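
A minimal loading sketch for this dataset, assuming the file is read with h5py and stores one dataset per column (the production code later in the talk indexes it the same way):

import h5py

# Open the ISD-Lite HDF5 file and pull each column into NumPy arrays.
with h5py.File('isdlite.hdf5', 'r') as input_file:
    station_ids = input_file['station_id'][:]
    timestamps = input_file['timestamp'][:]
    wind_speed_rates = input_file['wind_speed_rate'][:]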

[NEXT]

Invariants

Rows sorted by (station_id, timestamp).

Each station's rows are grouped together.

Ordered by time.

[NEXT]

Problem: Null Values

timestamp station_id wind_speed_rate
1995-01-06 03:00:00 407060 50.0
1995-01-06 06:00:00 407060 70.0
1995-01-06 09:00:00 407060 null
1995-01-06 12:00:00 407060 70.0
1995-01-06 17:00:00 407060 20.0

[NEXT]

Solution: Fill Forward

timestamp station_id wind_speed_rate
1995-01-06 03:00:00 407060 50.0
1995-01-06 06:00:00 407060 70.0
1995-01-06 09:00:00 407060 70.0
1995-01-06 12:00:00 407060 70.0
1995-01-06 17:00:00 407060 20.0

[NEXT]

The Code

Source file on GitHub: find_outliers_purepython.py

[NEXT] complete_process

[NEXT]

Code Breakdown

station_ranges partition full dataset into per-station time series
fill_forward fill in missing data with previous values
moving_average computing moving average at every time point
moving_std computing moving stdev at every time point
find_outliers get indices of outliers using deviance from moving avg

[NEXT]

Running the Code

> python3 -m find_outliers_purepy \
    --input isdlite.hdf5 \
    --output outliers.csv \
    --measurement wind_speed_rate
Determining range of each station time series
Found time series for 5183 ranges
Removing time series that don't have enough data
Kept 4695 / 5183 station time series
Computing outliers
Computed outliers in 14499.84 seconds
Writing outliers to outliers.csv

[NEXT]

Output Outliers CSV

station_id timestamp wind_speed_rate
720346 1996-04-25 11:00:00 110.0
720358 1997-01-31 09:00:00 40.0
997375 1993-01-29 15:00:00 100.0
... ... ...

[NEXT] Some detected outliers:

997299,2006-09-01 09:00:00,400.0
997299,2006-09-01 12:00:00,400.0

The affected weather station is:

> grep 997299 stations.csv
"997299","99999","CHEASAPEAKE BRIDGE","US","VA","","+36.970","-076.120","+0016.0","20050217","20161231"

[NEXT] detected_hurricane

[NEXT] detected_hurricane

[NEXT] detected_hurricane

[NEXT]

Success!

🎉

[NEXT]

Time Taken

4 hours.

[NEXT]

The Ultimate Goal

Use ISD data to detect extreme weather events that happen anywhere on the planet.

[NEXT]

Detecting Outliers in the Full Dataset

All 8 measurements.

All 35,000 weather stations.

From 1901 to now.

note What if we ran the same outlier detection code on the full dataset?

[NEXT]

It would take 27 days.

waiting_skeleton

[NEXT]

What went wrong?

[NEXT]

Profiling the Code

Use cProfile to find out which steps were the performance bottlenecks.

> python3 -m cProfile -o profile_output \
     find_outliers_purepy.py \
     --input isdlite.hdf5 \
     --output outliers.csv \
     --measurement wind_speed_rate

[NEXT]

Visualising Performance Bottlenecks

snakeviz generates visualisations of profiling data.

> pip3 install snakeviz
> snakeviz profile_output

[NEXT]

Example

Calculating standard deviation of 100 million integers.

import math

def _main():
    a = list(range(100000000))
    _std(a)

def _std(a):
    mean = _mean(a)
    squared_differences = _square(_differences(a, mean))
    sum_of_sq_diffs = _sum(squared_differences)
    # Sample standard deviation: divide by (n - 1) before the square root.
    return math.sqrt(sum_of_sq_diffs / (len(a) - 1))

def _mean(a):
    return _sum(a) / len(a)

def _differences(a, mean):
    return [x - mean for x in a]

def _square(a):
    return [x * x for x in a]

def _sum(a):
    s = 0
    for x in a:
        s += x
    return s

def _divide(a, d):
    return [x / d for x in a]

[NEXT]

Snakeviz Output

snakeviz

[NEXT]

Finding Outliers

Execution Time Breakdown

[NEXT] Total time: 4 hours (14530 secs)

[NEXT] Total time: 4 hours (14530 secs)

[NEXT]

Why is Python so slow?

note Source for upcoming sections: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/

[NEXT]

Reason 1

Dynamic Typing

[NEXT] When a Python program executes, the interpreter doesn't know the type of the variables that are defined.

python_slow_1

[NEXT] More instructions needed for any operation.

Primary reason Python is slower than C or other compiled languages for processing numerical data.
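
A small illustration: the same name can be rebound to a different type at any time, so the interpreter has to re-check types on every single operation.

x = 3
x = x + 4         # integer addition
x = "three"
x = x + " four"   # same + operator, now string concatenation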

[NEXT]

Reason 2

Interpreted, not Compiled

[NEXT] Python code is interpreted at runtime.

Quick to iterate, but gives less chance to optimise.

During compilation, a smart compiler can look ahead and optimise inefficient code.

note See section 5 to see how compiling Python code can dramatically speed it up.

[NEXT]

Reason 3

Fragmented Memory Access

[NEXT] python_slow_2

[NEXT] Bad for code that steps through data in sequence.

Iterating through a single list accesses completely different regions of memory.

Not cache friendly.
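
A quick way to see the indirection (a minimal sketch):

import sys

values = [1.5, 2.5, 3.5]
# The list object only stores pointers; each float is a separate boxed
# object somewhere else on the heap.
print(sys.getsizeof(values))               # size of the list (its pointer array)
print([sys.getsizeof(v) for v in values])  # size of each boxed float
print([id(v) for v in values])             # every element has its own address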

[NEXT SECTION]

3. Numpy

numpy

[NEXT]

The Foundation

Fundamental package for high performance computing in Python.

Many libraries/frameworks are built on top of NumPy.

[NEXT]

Features

  • multi-dimensional array objects
  • routines for fast operations on arrays
    • mathematical, logical, sorting, selecting
  • efficient loading/saving of numerical data to disk
    • including HDF5

[NEXT] numpy.ndarray

  • class encapsulating n-dimensional arrays
  • fixed size
  • elements must be the same type

note At the core of the NumPy package is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations performed in compiled code for performance.

[NEXT]

Examples

import numpy as np

[NEXT]

Creating an Array

>>> a = np.arange(9, dtype=np.float64)
>>> a
array([0., 1., 2., 3., 4., 5., 6., 7., 8.])
>>> a.shape
(9,)
>>> a.strides
(8,)

[NEXT]

Memory Layout

ndarray

note A NumPy array in its simplest form is a Python object built around a C array. That is, it has a pointer to a contiguous data buffer of values.

data is pointer indicating the memory address of the first byte in the array.

dtype indicates the type of elements stored in the array.

shape indicates the shape of the array. That is, it defines the dimensionality of the data in the array and how many elements the array stores for each dimension.

The strides are the number of bytes that should be skipped in memory to go to the next element. If your strides are (32, 8), you need to proceed 8 bytes to get to the next column and 32 bytes to move to the next row.

flags is a set of configurable flags we don't need to cover here.

Python View

python_view

[NEXT]

Reshape

>>> b = a.reshape(3, 3)
>>> b
array([[0., 1., 2.],
       [3., 4., 5.],
       [6., 7., 8.]])

ndarray
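
The strides change with the shape: for this 3x3 float64 view they become 24 bytes to step to the next row and 8 bytes to the next column, while the underlying buffer stays the one a points to.

>>> b.strides
(24, 8)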

[NEXT]

Slicing One Dimension

>>> b[:, :2]
array([[0., 1.],
       [3., 4.],
       [6., 7.]])

ndarray

[NEXT]

Slicing Multiple Dimensions

>>> b[:2, :2]
array([[0., 1.],
       [3., 4.]])

ndarray

[NEXT] Reshaping or slicing arrays creates a view.

No copies are made.

[NEXT]

Performance Benefits

  • Data stored contiguously
    • no memory overhead
    • cache locality
  • No copies for common reshaping/slicing operations
  • Fast logical and mathematical operations
    • executed in heavily optimised compiled code

[NEXT]

Benchmark

Adding 10,000,000 Numbers

[NEXT]

Pure Python

a = list(range(10000000))
b = list(range(10000000))

# 1. indexing
c = [a[i] + b[i] for i in range(len(a))]

[NEXT]

NumPy

import numpy as np

a = np.arange(10000000)
b = np.arange(10000000)

# 2. loop
c = np.zeros(len(a))
for i in range(len(a)):
    c[i] = a[i] + b[i]

# 3. built-in numpy addition operator
d = a + b
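
The timings on the next slides can be reproduced with something like the following (an illustrative harness, not necessarily the exact one used for the slides):

import timeit

numpy_setup = 'import numpy as np; a = np.arange(10000000); b = np.arange(10000000)'
python_setup = 'a = list(range(10000000)); b = list(range(10000000))'

# Average the built-in NumPy addition over 10 runs; one run is enough
# for the much slower pure Python version.
print(timeit.timeit('a + b', setup=numpy_setup, number=10) / 10)
print(timeit.timeit('[a[i] + b[i] for i in range(len(a))]',
                    setup=python_setup, number=1))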

[NEXT]

Timing (seconds)

[NEXT]

Speedup Factor

[NEXT] NumPy with loops is the slowest of all choices.

Takes 4x longer than pure Python!

[NEXT]

Explicitly Looping over Numpy Array

numpy_add_loop

note For every integer, we're making two __getitem__ calls, performing the addition in Python and copying each result into the output numpy array with a call to __setitem__.

This dramatically slows down the computation for three reasons:

  1. This adds function call overhead. We invoke four Python functions for each integer. That's 40,000,000 function calls.
  2. It performs three copies for each addition. It copies the ith element of a and b, then copies the addition into c.
  3. The overhead and copies destroy cache locality. The copies are likely in a very different part of the address space, meaning the CPU is having to do more work to fetch data from RAM, instead of just using its local cache.

[NEXT]

Using Built-in Addition

numpy_add_native

note The full addition logic is executed in native, compiled NumPy code. There are no function call overheads and no copies.

The memory buffers storing a and b are directly accessed when adding. Since those buffers are stored contiguously in memory, we're cache friendly. The CPU has to fetch less data from RAM.

[NEXT]

Keep it in NumPy!

Don't loop through np.ndarrays.

Move the computation to the NumPy/C/native code level where possible.

[NEXT]

A Problem...

For arrays with the same size, operations are performed element-by-element.

Sometimes we want to apply smaller scalars or vectors to larger arrays.

e.g. adding one to all elements in an array

note We want to use NumPy's built-in operations, but we don't want to perform loads of copies to match up the array sizes.

[NEXT]

Adding 1 to an Array

adding_one_to_array

Adding 1 to N elements would take N - 1 copies!

[NEXT]

Broadcasting

Allows us to apply smaller arrays to larger arrays.

Without copying.
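
In code (a minimal sketch, reusing the 3x3 array shape from earlier, here with integers):

>>> a = np.arange(9).reshape(3, 3)
>>> a + 1                        # scalar broadcast to every element
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
>>> a + np.array([10, 20, 30])   # vector broadcast across each row
array([[10, 21, 32],
       [13, 24, 35],
       [16, 27, 38]])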

[NEXT]

Broadcasting Scalar to Array

broadcasting

[NEXT]

Broadcasting Scalar to Array

broadcasting

[NEXT]

Broadcasting Vector to Array

broadcasting

[NEXT]

Broadcasting Vector to Array

broadcasting

[NEXT]

Using NumPy for Outlier Detection

[NEXT]

Recap

station_ranges partition full dataset into per-station time series
fill_forward fill in missing data with previous values
moving_average computing moving average at every time point
moving_std computing moving stdev at every time point
find_outliers get indices of outliers using deviance from moving avg

[NEXT] complete_process

[NEXT] station_ranges()

def station_ranges(station_ids: np.ndarray) -> np.ndarray:
    is_end_of_series = station_ids[:-1] != station_ids[1:]
    indices_where_stations_change = np.where(
      is_end_of_series == True)[0] + 1
    series_starts = np.concatenate((
        np.array([0]),
        indices_where_stations_change
    ))
    series_ends = np.concatenate((
        indices_where_stations_change,
        np.array([len(station_ids) - 1])
    ))
    return np.column_stack((series_starts, series_ends))

[NEXT]

station_ids = np.array([123, 123, 124, 245, 999, 999])

station_ranges0

[NEXT]

is_end_of_series = station_ids[:-1] != station_ids[1:]

station_ranges1

[NEXT]

indices_where_stations_change = (
    np.where(is_end_of_series == True)[0] + 1)

station_ranges2

[NEXT]

series_starts = np.concatenate((
    [0],
    indices_where_stations_change
))

station_ranges3

[NEXT]

series_ends = np.concatenate((
    indices_where_stations_change,
    [len(station_ids) - 1]
))

station_ranges4

[NEXT]

np.column_stack((series_starts, series_ends))

station_ranges5

[NEXT] station_ranges6

[NEXT] Total time: 4 hours ⟶ 1.4 hours

Speedup: 2.85x

[NEXT]

Why NumPy is So Much Faster

  • No extra memory overhead
  • Minimal copying
  • Cache friendly
  • Operations executed in optimised compiled code

[NEXT] But also...

[NEXT SECTION]

4. Vectorisation

vectorisation

[NEXT]

Process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time.

note Source: https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus

[NEXT] Modern CPUs provide direct support for vector operations.

A single instruction is applied to multiple data points.

[NEXT]

Adding Two Vectors

Single Instruction Single Data (SISD)

sisd

Adding N numbers takes N instructions.

[NEXT]

Adding Two Vectors

Single Instruction Multiple Data (SIMD)

simd

Adding N numbers takes N / 4 instructions.

note Basically for you as a coder, SIMD allows you to perform four operations (reading/writing/calculating) for the price of one instruction. The cost reduction is enabled by vectorization and data-parallelism. You don’t even have to handle threads and race conditions to gain this parallelism.

[NEXT]

SIMD on CPUs

  • Most modern CPUs support vectorisation.
  • A CPU with a 512-bit register can hold eight 64-bit doubles.
  • One instruction for 8 doubles.
cpu

[NEXT]

Example in C

Adding 10,000,000 Numbers

[NEXT]

Non-Vectorised

void add(float* a, float* b, float* out, int len) {
    for (int i = 0; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
}

[NEXT]

Vectorised

void add_vectorised(float* a, float* b, float* out, int len) {
    int i = 0;
    for (; i < len - 4; i += 4) {
        out[i] = a[i] + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
}

[NEXT] Disable optimisations to prevent compiler auto-vectorising.

clang -O0 vectorised_timings.c

[NEXT]

Speedup: 1.4x

[NEXT]

Vectorised Definitions

Context
Native Code Apply single operations to multiple data items at once using special CPU registers.
Python Keeping as much computation in NumPy/native code as possible.

Both involve making algorithms use array/vector/matrix based computation (not iterative).

note Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

  • vectorized code is more concise and easier to read
  • fewer lines of code generally means fewer bugs
  • the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
  • vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.

[NEXT]

Vectorisation for Outlier Detection

[NEXT] Unvectorised fill_forward()

def fill_forward(arr: np.ndarray):
    prev_val = arr[0]
    for i in range(1, len(arr)):
        if np.isnan(arr[i]):
            arr[i] = prev_val
        else:
            prev_val = arr[i]

[NEXT] Vectorised fill_forward()

def fill_forward(arr: np.ndarray) -> np.ndarray:
    mask = ~np.isnan(arr)
    indices = np.arange(len(arr))
    indices_to_use = np.where(mask, indices, 0)
    np.maximum.accumulate(
        indices_to_use,
        out=indices_to_use)
    return arr[indices_to_use]

[NEXT]

# wind_speed_rate measurements for a single weather station.
arr = np.array([
    20, 5, 3, 8, np.nan, np.nan, 6, np.nan, 25, 5
])

vectorised_fill_forward0

[NEXT]

mask = ~np.isnan(arr)

vectorised_fill_forward1

[NEXT]

indices = np.arange(len(arr))

vectorised_fill_forward2

[NEXT]

indices_to_use = np.where(mask, indices, 0)

vectorised_fill_forward3

[NEXT]

np.maximum.accumulate(indices_to_use, out=indices_to_use)
return arr[indices_to_use]

vectorised_fill_forward4

[NEXT] Unvectorised moving_average()

def moving_average(arr: np.ndarray,
                    n: int) -> np.ndarray:
    avg = np.zeros(len(arr) - n + 1)
    for i in range(len(avg)):
        avg[i] = arr[i:i+n].sum() / n
    return avg

note Glance over this and the next slide. Just state that this one has a for loop. We can vectorise it to eliminate the loop.

[NEXT] Vectorised moving_average()

def moving_average(arr: np.ndarray,
                    n: int) -> np.ndarray:
    ret = np.cumsum(arr, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
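
A quick sanity check of the cumulative-sum trick with n = 3:

>>> moving_average(np.array([1., 2., 3., 4., 5.]), 3)
array([2., 3., 4.])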

[NEXT] Total time: 1.4 hours ⟶ 48 mins

Speedup: 2.85x ⟶ 5x

[NEXT SECTION]

5. Numba

numba

note see https://numba.pydata.org/ for examples

[NEXT]

Not all algorithms are vectorisable.

note Are these non-vectorisable Python functions doomed to be slow?

[NEXT]

Solution

Compile non-vectorisable Python code to native machine instructions.

[NEXT]

Numba

Annotate Python functions with decorators.

Compiles them to optimised machine code at runtime.

Just-in-time (JIT) compilation.

LLVM for compiling to machine instructions.

note Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.

[NEXT] numba.jit

Decorator that tells Numba to compile a function to native instructions.

[NEXT]

Example

Summing an Array of Numbers

def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result

[NEXT]

Sprinkle Some Numba Magic

from numba import jit

@jit(nopython=True)
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result

[NEXT]

Timing (seconds)

[NEXT]

Speedup Factor

[NEXT]

Type Deduction

Numba automatically deduces the types of JIT-compiled functions.

Uses types of arguments in function's first invocation.
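
A minimal illustration (the first call is also when compilation happens, so it is noticeably slower than later calls):

import numpy as np

arr = np.arange(10000000, dtype=np.float64)
sum_array(arr)  # first call: Numba compiles a specialisation for float64 arrays
sum_array(arr)  # subsequent calls run the already-compiled code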

[NEXT]

Explicitly Set Types

from numba import int64, jit

@jit(int64(int64[:]), nopython=True)
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result

[NEXT]

Drawbacks

  • Numba type inference sometimes fails
  • You might need to specify types manually
    • arguably makes code more verbose / harder to read
  • Restricted language features using nopython=True
    • variable types are fixed
    • cannot use arbitrary classes

note Numba FAQ lists many of the drawbacks: https://numba.pydata.org/numba-doc/dev/user/faq.html

[NEXT]

Using Numba for Outlier Detection

Added @jit(nopython=True) to all functions.

Explicitly specified types.

[NEXT] Total time: 48 mins ⟶ 2.46 mins

Speedup: 5x ⟶ 98x

[NEXT] Total time: 48 mins ⟶ 2.46 mins

Speedup: 5x ⟶ 98x

[NEXT]

To Summarise

Use vectorised NumPy code where possible.

Fall back to Numba if code cannot be vectorised.

[NEXT SECTION]

6. Parallel Computation

parallel_computation

[NEXT]

Recap

  1. Moved data to contiguous buffers
  2. Run most computation in compiled/optimised machine instructions
  3. Vectorised computation to take advantage of CPU's SIMD feature

Current speedup: 98x

[NEXT]

Embarrassingly Parallel

Full dataset is partitioned into different station time series.

Outliers in each station time series are calculated independently.

Split stations into N groups.

Process each group on a different CPU core.

[NEXT] joblib

Library for building lightweight data pipelines.

[NEXT] joblib.Parallel

Parallelises Python loops.

Handles spawning of new Python processes and storing intermediate results for you.

[NEXT]

Before: Sequential For Loop

all_outliers = [
    compute_outliers(wind_speeds[start:end])
    for start, end in station_index_ranges
]

[NEXT]

After: Process Each Station Time Series in Parallel

from multiprocessing import cpu_count
from joblib import delayed, Parallel

processor = Parallel(n_jobs=cpu_count())
all_outliers = processor(
    delayed(compute_outliers)(wind_speeds[start:end])
    for start, end in station_index_ranges)

[NEXT] Total time: 2.46 mins ⟶ 1.37 mins

Speedup: 98x ⟶ 177x

[NEXT]

Problem: Excessive Copies

Worker processes receive copies of input data.

Means all station time series are copied.

Adds significant overhead to parallelisation.

note By default the workers of the joblib pool are real Python processes forked using the multiprocessing module of the Python standard library. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.

[NEXT]

Solution: Memmap

Map in-process memory to data stored on disk.

memmap

[NEXT] np.memmap

Write the measurement data to a memmap'd file.

# Open handle to temporary memmap file.
data = np.memmap(
    'wind_speeds',
    dtype=np.float64,
    shape=(len(input_file['wind_speed_rate']),),
    mode='w+')

# Load all wind_speed_rates into memmap file.
data[:] = input_file['wind_speed_rate'][:]

# Flushes contents to disk.
del data
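
Each worker can then open the same file as a read-only memmap and slice only its station's rows without copying the whole dataset (a sketch; n_rows is illustrative and stands for the total number of measurements):

# In each worker process: re-open the file read-only and slice into it lazily.
wind_speeds = np.memmap(
    'wind_speeds',
    dtype=np.float64,
    shape=(n_rows,),
    mode='r')
outliers = compute_outliers(wind_speeds[start:end])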

[NEXT] joblib_memmap

note mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.

[NEXT] Total time: 1.37 mins ⟶ 0.83 mins

Speedup: 177x ⟶ 291x

[NEXT SECTION]

7. The Final Timings

final_timings

[NEXT]

Current Timings

[NEXT]

Current Speedups

[NEXT]

Main Bottleneck

Computing moving standard deviation.

[NEXT]

Vectorised Rolling STD

Speedup: 10x
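
One way to vectorise it (a sketch, not necessarily the exact code behind the 10x figure) reuses the cumulative-sum moving_average from section 4, using var(x) = E[x^2] - E[x]^2:

def moving_std(arr: np.ndarray, n: int) -> np.ndarray:
    # Moving variance = moving mean of squares minus square of moving mean.
    mean = moving_average(arr, n)
    mean_of_squares = moving_average(arr * arr, n)
    # Clamp tiny negative values introduced by floating point rounding.
    variance = np.maximum(mean_of_squares - mean * mean, 0.0)
    return np.sqrt(variance)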

[NEXT]

Final Speedup

1145 times faster.

[NEXT]

The Ultimate Goal

Use ISD data to detect extreme weather events that happen anywhere on the planet.

[NEXT]

Detecting Outliers in the Full Dataset

All 8 measurements.

All 35,000 weather stations.

From 1901 to now.

note What if we ran the same outlier detection code on the full dataset?

[NEXT]

27 days ⟶ 38 minutes

[NEXT] On a single MacBook Pro.

[NEXT SECTION]

Fin

fin

[NEXT] Python is great for research.

Out of the box, Python is slow.

[NEXT] Increasing demands for faster/real-time data processing.

Processing large volumes of data or training complex machine learning models.

Standard Python in prod isn't viable for many use cases.

[NEXT] You could research in Python, then convert the code to a faster language.

Can cause more problems than it solves.

[NEXT] Use Python for research and production.

Possible by using Python's large ecosystem of scientific computing packages.

[NEXT] Keep computation in native code as much as possible.

Vectorise using NumPy where possible.

Use Numba to optimise unvectorisable code.

[NEXT] Identify opportunities to parallelise.

Parallelise using joblib to abstract parallelisation details from the code.

[NEXT] joblib abstracts the worker backend.

Workers can be CPU cores or a machine cluster.

Run on single machine with multiple CPU cores first.

Run on cluster of machines only when necessary or if you already have the infrastructure.

Either way, code is almost identical.

note This abstracts the worker backend. Workers can be CPU cores or machines. Either way, the code remains the same.

[NEXT] numpy/numba/joblib alone can yield 1000x speedup.

[NEXT]

Don't throw the problem to dev ops.

[NEXT] If RAM or disk is your bottleneck, parallelise using a cluster.

Otherwise, you can get very far with vectorisation and sprinkling @numba.jit magic.

[NEXT]

Спасибо!

[NEXT]

Links

[NEXT]

Get In Touch

![small_portrait](images/donald.jpg)

[NEXT SECTION]

Appendix

[NEXT]

References

[1] https://www.jetbrains.com/research/python-developers-survey-2017/

[NEXT]

Timing Specifications

All performance timings in these slides were produced by running the code on a machine with the following specs:

OS: macOS Sierra v10.12.6
Processor: 2.3 GHz Intel Core i5
Memory: 8 GB 2133 MHz LPDDR3

[NEXT]

Image Credits