[NEXT]
- Software Engineer
- @ Engineers Gate
- Scalable data infrastructure
- Real-time trading systems
- Python/C++/Rust developer
[NEXT]
Python is a hugely popular tool for data analysis.
[NEXT]
Data analysis is now as popular as web development with Python.
note https://www.jetbrains.com/research/python-developers-survey-2017/
[NEXT]
[NEXT]
High-level and easy to use.
Wealth of tools for processing/analysing data.
General-purpose language useful outside of data analysis.
[NEXT]
Great language for research.
[NEXT]
What about production?
[NEXT]
Large data analysis/processing used to be isolated to research.
One-off batch jobs to produce insight for research and decision making.
note Data analysis used to only be active in the realm of research. Analysts would write one-off jobs that cleaned up data and analysed it. The findings would then be included in research papers, presentations to management in firms and so on.
It was very rare that you'd run such heavy data analysis frequently in live production systems.
[NEXT]
Exponential growth of data.
Need real-time insights into this data.
Machine learning/stats models are running in live production systems.
note Source: https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/
[NEXT]
Source: [Tractica December 2017](https://www.tractica.com/newsroom/press-releases/artificial-intelligence-software-market-to-reach-89-8-billion-in-annual-worldwide-revenue-by-2025/)
note Artificial Intelligence software market projected to reach almost $90 billion by 2025.
[NEXT]
More data to process.
More numerical models being trained for live use.
Models larger and more complex.
[NEXT]
Strict time requirements.
[NEXT]
- Researcher builds model in their tech of choice
- Programmer takes research code and rewrites it in heavily optimised C/C++
- Production code is deployed
- Everything works fine
[NEXT]
[NEXT]
[NEXT]
- Researcher builds model that works on their machine
- Programmer attempts to rewrite model for production
- Programmer can't replicate the researcher's results
- Everyone spends a ton of time figuring out why
note Useful link discussing deploying models to prod: https://www.quora.com/How-do-you-take-a-machine-learning-model-to-production
[NEXT]
- Deployment delays
- Compromises on model accuracy to release it faster
[NEXT]
Research and production code is identical.
note A better process is to make the research and production code identical. They can be configured differently, but the code which pre-processes the data, trains the models and executes it in prod should be the same.
[NEXT]
Want to use Python.
Enables researchers to run experiments quickly.
But pure Python is slow.
note But we like Python because it's easy to use for research.
[NEXT]
Source: [The Computer Language Benchmarks Game](https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/python3-gcc.html)
[NEXT]
Source: [The Computer Language Benchmarks Game](https://benchmarksgame-team.pages.debian.net/benchmarksgame/faster/python3-gcc.html)
[NEXT]
Python's ecosystem for data science.
[NEXT]
[NEXT]
[NEXT]
- Heart of scientific computing in Python
- Stores and operates on data in C structures
- Avoids slowness of Python
[NEXT]
Foundation of most scientific computing packages.
[NEXT]
Showing how to use NumPy to process numerical data.
Exploring how NumPy leverages vectorisation to dramatically boost performance.
[NEXT]
- Analyse a large weather dataset
- Process dataset in pure Python
- Speed up processing using NumPy and vectorisation
- Speed up processing even more using Numba
[NEXT]
1145 times faster than pure Python.
[NEXT SECTION]
[NEXT]
note Global database of atmospheric weather data.
This map shows the spatial distribution of Integrated Surface Database stations. Data has been collected from 35,000 weather stations scattered across the globe.
Source: https://www.ncdc.noaa.gov/isd
[NEXT]
wind speed and direction
temperature
sea level pressure
sky visibility
note Detailed list of fields:
wind speed and direction, wind gust, temperature, dew point, cloud data, sea level pressure, altimeter setting, station pressure, present weather, visibility, precipitation amounts for various time periods, snow depth, and various other elements as observed by each station.
[NEXT]
7 continents
35,000 weather stations
1901 to now
from over 100 data sources
[NEXT]
Total Data Volume > 600GB
note ISD integrates data from over 100 original data sources, including numerous data formats that were key-entered from paper forms during the 1950s–1970s time frame
[NEXT]
[NEXT]
timestamp | station_id | wind_speed_rate | ... |
---|---|---|---|
1995-01-06 03:00:00 | 407060 | 50.0 | ... |
1995-01-06 06:00:00 | 407060 | 70.0 | ... |
1995-01-06 09:00:00 | 407060 | null | ... |
1995-01-06 12:00:00 | 407060 | 60.0 | ... |
1995-01-06 16:00:00 | 407060 | 20.0 | ... |
note Wind speed rate = the rate of horizontal travel of air past a fixed point.
UNITS: meters per second SCALING FACTOR: 10 MISSING VALUE: -9999
http://www.polmontweather.co.uk/windspd.htm
[NEXT]
(2011-12-29 to 2011-12-31)
[NEXT]
Use ISD data to detect extreme weather events that happen anywhere on the planet.
[NEXT]
[NEXT]
Let's test our approach on a smaller dataset.
Dates | 1991-01-01 to 2011-12-31 |
Measurement | Wind Speed Rate |
Stations | ~6000 |
Rows | ~400,000,000 |
note Total stations: 5,700 Total rows: 391,908,527
[NEXT SECTION]
[NEXT] How do we detect hurricanes?
Find data points with unusually low/high wind_speed_rate values.
[NEXT]
[NEXT]
At each point i in the time series:
- Take the values in the time series between points i - 30 and i
- Calculate their mean and standard deviation
- The value at i is an outlier if it's more than 6 standard deviations away from the mean (sketched in code below)
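As a rough illustration, the check at a single point might look like this. This is a minimal sketch: the function name is_outlier and the exact window handling are illustrative, and the real pipeline shown later computes the moving statistics for every point at once.
import numpy as np

def is_outlier(series: np.ndarray, i: int,
               window: int = 30, n_stdevs: float = 6.0) -> bool:
    if i < window:
        return False  # not enough history to judge this point
    # Values in the window immediately before point i.
    past = series[i - window:i]
    mean = past.mean()
    std = past.std()
    # Outlier if the point deviates from the moving mean by more
    # than n_stdevs standard deviations.
    return abs(series[i] - mean) > n_stdevs * std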
note
- Split full dataset into separate station time series
- For each weather station time series, detect outliers by:
  - calculating the moving mean and stdev at each point
  - checking if a point is > 6 stdevs away from its moving mean value
  - if so, marking the point as an outlier
- Generate a CSV containing all outliers in each station's data
[NEXT]
HDF5 file containing three columns:
station_id
timestamp
wind_speed_rate
[NEXT]
- Hierarchical Data Format
- Designed to store large amounts of binary data
- No text parsing required
- Efficient to load
note HDF5 is an open source file format for storing huge amounts of numerical data.
It’s typically used in research applications (meteorology, astronomy, genomics etc.) to distribute and access very large datasets without using a database.
It lets you store huge amounts of numerical data, and easily manipulate that data from NumPy. For example, you can slice into multi-terabyte datasets stored on disk, as if they were real NumPy arrays. Thousands of datasets can be stored in a single file, categorized and tagged however you want.
[NEXT]
Rows sorted by (station_id, timestamp).
Each station's rows are grouped together, ordered by time.
[NEXT]
timestamp | station_id | wind_speed_rate |
---|---|---|
1995-01-06 03:00:00 | 407060 | 50.0 |
1995-01-06 06:00:00 | 407060 | 70.0 |
1995-01-06 09:00:00 | 407060 | null |
1995-01-06 12:00:00 | 407060 | 70.0 |
1995-01-06 17:00:00 | 407060 | 20.0 |
[NEXT]
timestamp | station_id | wind_speed_rate |
---|---|---|
1995-01-06 03:00:00 | 407060 | 50.0 |
1995-01-06 06:00:00 | 407060 | 70.0 |
1995-01-06 09:00:00 | 407060 | 70.0 |
1995-01-06 12:00:00 | 407060 | 70.0 |
1995-01-06 17:00:00 | 407060 | 20.0 |
[NEXT]
Source file on GitHub: find_outliers_purepython.py
[NEXT]
Function | Purpose |
---|---|
station_ranges | partition full dataset into per-station time series |
fill_forward | fill in missing data with previous values |
moving_average | compute moving average at every time point |
moving_std | compute moving stdev at every time point |
find_outliers | get indices of outliers using deviance from moving avg |
[NEXT]
> python3 -m find_outliers_purepy \
--input isdlite.hdf5 \
--output outliers.csv \
--measurement wind_speed_rate
Determining range of each station time series
Found time series for 5183 ranges
Removing time series that don't have enough data
Kept 4695 / 5183 station time series
Computing outliers
Computed outliers in 14499.84 seconds
Writing outliers to outliers.csv
[NEXT]
station_id | timestamp | wind_speed_rate |
---|---|---|
720346 | 1996-04-25 11:00:00 | 110.0 |
720358 | 1997-01-31 09:00:00 | 40.0 |
997375 | 1993-01-29 15:00:00 | 100.0 |
... | ... | ... |
[NEXT] Some detected outliers:
997299,2006-09-01 09:00:00,400.0
997299,2006-09-01 12:00:00,400.0
The affected weather station is:
> grep 997299 stations.csv
"997299","99999","CHEASAPEAKE BRIDGE","US","VA","","+36.970","-076.120","+0016.0","20050217","20161231"
[NEXT]
🎉
[NEXT]
4 hours.
[NEXT]
Use ISD data to detect extreme weather events that happen anywhere on the planet.
[NEXT]
All 8 measurements.
All 35,000 weather stations.
From 1901 to now.
note What if we ran the same outlier detection code on the full dataset?
[NEXT]
It would take 27 days.
[NEXT]
What went wrong?
[NEXT]
Use cProfile to find out which steps were the performance bottlenecks.
> python3 -m cProfile -o profile_output \
find_outliers_purepy.py \
--input isdlite.hdf5 \
--output outliers.csv \
--measurement wind_speed_rate
[NEXT]
snakeviz generates visualisations of profiling data.
> pip3 install snakeviz
> snakeviz profile_output
[NEXT]
Calculating standard deviation of 100 million integers.
import math

def _main():
    a = list(range(100000000))
    _std(a)

def _std(a):
    mean = _mean(a)
    squared_differences = _square(_differences(a, mean))
    sum_of_sq_diffs = _sum(squared_differences)
    return math.sqrt(sum_of_sq_diffs / (len(a) - 1))

def _mean(a):
    return _sum(a) / len(a)

def _differences(a, mean):
    return [x - mean for x in a]

def _square(a):
    return [x * x for x in a]

def _sum(a):
    s = 0
    for x in a:
        s += x
    return s

def _divide(a, d):
    return [x / d for x in a]
[NEXT]
[NEXT]
[NEXT] Total time: 4 hours (14530 secs)
[NEXT] Total time: 4 hours (14530 secs)
[NEXT]
Why is Python so slow?
note Source for upcoming sections: https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
[NEXT]
[NEXT] When a Python program executes, the interpreter doesn't know the type of the variables that are defined.
[NEXT] More instructions needed for any operation.
Primary reason Python is slower than C or other compiled languages for processing numerical data.
[NEXT]
[NEXT] Python code is interpreted at runtime.
Quick to iterate, but gives less chance to optimise.
During compilation, a smart compiler can look ahead and optimise inefficient code.
note See section 5 to see how compiling Python code can dramatically speed it up.
[NEXT]
[NEXT] Bad for code that steps through data in sequence.
Iterating through a single list accesses completely different regions of memory.
Not cache friendly.
[NEXT SECTION]
[NEXT]
Fundamental package for high performance computing in Python.
Many libraries/frameworks are built on top of NumPy.
[NEXT]
- multi-dimensional array objects
- routines for fast operations on arrays
- mathematical, logical, sorting, selecting
- efficient loading/saving of numerical data to disk
- including HDF5
[NEXT]
numpy.ndarray
- class encapsulating n-dimensional arrays
- fixed size
- elements must be the same type
note At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being performed in compiled code for performance.
[NEXT]
import numpy as np
[NEXT]
>>> a = np.arange(9, dtype=np.float64)
>>> a
array([0., 1., 2., 3., 4., 5., 6., 7., 8.])
>>> a.shape
(9,)
>>> a.strides
(8,)
[NEXT]
note A NumPy array in its simplest form is a Python object built around a C array. That is, it has a pointer to a contiguous data buffer of values.
data is a pointer indicating the memory address of the first byte in the array.
dtype indicates the type of elements stored in the array.
shape indicates the shape of the array. That is, it defines the dimensionality of the data in the array and how many elements the array stores for each dimension.
strides are the number of bytes that should be skipped in memory to go to the next element. If your strides are (32, 8), you need to proceed 8 bytes to get to the next column and 32 bytes to move to the next row.
flags is a set of configurable flags we don't need to cover here.
[NEXT]
>>> b = a.reshape(3, 3)
>>> b
array([[0., 1., 2.],
       [3., 4., 5.],
       [6., 7., 8.]])
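The strides reflect the new shape: for a 3x3 array of 8-byte floats, moving to the next row means skipping 24 bytes, and moving to the next column means skipping 8 bytes.
>>> b.strides
(24, 8)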
[NEXT]
>>> b[:, :2]
array([[0., 1.],
       [3., 4.],
       [6., 7.]])
[NEXT]
>>> b[:2, :2]
array([[0., 1.],
       [3., 4.]])
[NEXT] Reshaping or slicing arrays creates a view.
No copies are made.
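For example, writing through a slice modifies the original array, since both refer to the same underlying buffer (illustrative session):
>>> view = b[:2, :2]
>>> view[0, 0] = 99.0
>>> b
array([[99.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.]])
>>> np.shares_memory(b, view)
True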
[NEXT]
- Data stored contiguously
- no memory overhead
- cache locality
- No copies for common reshaping/slicing operations
- Fast logical and mathematical operations
- executed in heavily optimised compiled code
[NEXT]
[NEXT]
a = list(range(10000000))
b = list(range(10000000))
# 1. indexing
c = [a[i] + b[i] for i in range(len(a))]
[NEXT]
import numpy as np
a = np.arange(10000000)
b = np.arange(10000000)
# 2. loop
c = np.zeros(len(a))
for i in range(len(a)):
    c[i] = a[i] + b[i]
# 3. built-in numpy addition operator
d = a + b
[NEXT]
[NEXT]
[NEXT] NumPy with loops is the slowest of all choices.
Takes 4x longer than pure Python!
[NEXT]
note
For every integer, we're making two __getitem__ calls, performing the addition in Python and copying each result into the output numpy array with a call to __setitem__.
This dramatically slows down the computation for three reasons:
- It adds function call overhead. We invoke four Python functions for each integer. That's 40,000,000 function calls.
- It performs three copies for each addition. It copies the i-th element of a and b, then copies the addition into c.
- The overhead and copies destroy cache locality. The copies are likely in a very different part of the address space, meaning the CPU has to do more work to fetch data from RAM, instead of just using its local cache.
[NEXT]
note The full addition logic is executed in native, compiled NumPy code. There are no function call overheads and no copies.
The memory buffers storing a and b are directly accessed when adding. Since those buffers are stored contiguously in memory, we're cache friendly. The CPU has to fetch less data from RAM.
[NEXT]
Don't loop through np.ndarrays.
Move the computation to the NumPy/C/native code level where possible.
[NEXT]
For arrays with the same size, operations are performed element-by-element.
Sometimes we want to apply smaller scalars or vectors to larger arrays.
e.g. adding one to all elements in an array
note We want to use NumPy's built-in operations, but we don't want to perform loads of copies to match up the array sizes.
[NEXT]
Adding 1 to N elements would take N - 1 copies!
[NEXT]
Broadcasting allows us to apply smaller arrays to larger arrays.
Without copying.
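A quick illustration (variable names are just for the example): NumPy virtually stretches the smaller operand across the larger one, without materialising intermediate copies.
>>> m = np.arange(6).reshape(2, 3)
>>> m + 1                        # scalar broadcast to every element
array([[1, 2, 3],
       [4, 5, 6]])
>>> m + np.array([10, 20, 30])   # 1D array broadcast across each row
array([[10, 21, 32],
       [13, 24, 35]])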
[NEXT]
[NEXT]
[NEXT]
[NEXT]
[NEXT]
[NEXT]
Function | Purpose |
---|---|
station_ranges | partition full dataset into per-station time series |
fill_forward | fill in missing data with previous values |
moving_average | compute moving average at every time point |
moving_std | compute moving stdev at every time point |
find_outliers | get indices of outliers using deviance from moving avg |
[NEXT]
station_ranges()
def station_ranges(station_ids: np.ndarray) -> np.ndarray:
    is_end_of_series = station_ids[:-1] != station_ids[1:]
    indices_where_stations_change = np.where(
        is_end_of_series == True)[0] + 1
    series_starts = np.concatenate((
        np.array([0]),
        indices_where_stations_change
    ))
    series_ends = np.concatenate((
        indices_where_stations_change,
        np.array([len(station_ids) - 1])
    ))
    return np.column_stack((series_starts, series_ends))
[NEXT]
station_ids = np.array([123, 123, 124, 245, 999, 999])
[NEXT]
is_end_of_series = station_ids[:-1] != station_ids[1:]
[NEXT]
indices_where_stations_change = (
np.where(is_end_of_series == True)[0] + 1)
[NEXT]
series_starts = np.concatenate((
[0],
indices_where_stations_change
))
[NEXT]
series_ends = np.concatenate((
indices_where_stations_change,
[len(station_ids) - 1]
))
[NEXT]
np.column_stack((series_starts, series_ends))
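Putting the steps together for the example station_ids above, the intermediate and final values are (illustrative walkthrough):
is_end_of_series                 # array([False,  True,  True,  True, False])
indices_where_stations_change    # array([2, 3, 4])
series_starts                    # array([0, 2, 3, 4])
series_ends                      # array([2, 3, 4, 5])
np.column_stack((series_starts, series_ends))
# array([[0, 2],
#        [2, 3],
#        [3, 4],
#        [4, 5]])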
[NEXT] Total time: 4 hours ⟶ 1.4 hours
Speedup: 2.85x
[NEXT]
- No extra memory overhead
- Minimal copying
- Cache friendly
- Operations executed in optimised compiled code
[NEXT] But also...
[NEXT SECTION]
[NEXT]
Process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time.
note Source: https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus
[NEXT] Modern CPUs provide direct support for vector operations.
A single instruction is applied to multiple data points.
[NEXT]
Adding N numbers takes N instructions.
[NEXT]
Adding N numbers takes N / 4 instructions.
note Basically for you as a coder, SIMD allows to perform four operations (reading/writing/calculating) for the price of one instruction. The cost reduction is enabled by vectorization and data-parallelism. You don’t even have to handle threads and race conditions to gain this parallelism.
[NEXT]
- Most modern CPUs support vectorisation.
- CPU with a 512 bit register can hold 8 64-bit doubles.
- One instruction for 8 doubles.
[NEXT]
[NEXT]
void add(float* a, float* b, float* out, int len) {
    for (int i = 0; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
}
[NEXT]
void add_vectorised(float* a, float* b, float* out, int len) {
    int i = 0;
    for (; i < len - 4; i += 4) {
        out[i] = a[i] + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < len; ++i) {
        out[i] = a[i] + b[i];
    }
}
[NEXT] Disable optimisations to prevent compiler auto-vectorising.
clang -O0 vectorised_timings.c
[NEXT]
Speedup: 1.4x
[NEXT]
Context | Meaning |
---|---|
Native code | Apply single operations to multiple data items at once using special CPU registers. |
Python | Keep as much computation as possible in numpy/native code. |
Both involve making algorithms use array/vector/matrix based computation (not iterative).
note Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:
- vectorized code is more concise and easier to read
- fewer lines of code generally means fewer bugs
- the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
- vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.
[NEXT]
[NEXT]
Unvectorised fill_forward()
def fill_forward(arr: np.ndarray):
    prev_val = arr[0]
    for i in range(1, len(arr)):
        if np.isnan(arr[i]):
            arr[i] = prev_val
        else:
            prev_val = arr[i]
[NEXT]
Vectorised fill_forward()
def fill_forward(arr: np.ndarray) -> np.ndarray:
    mask = ~np.isnan(arr)
    indices = np.arange(len(arr))
    indices_to_use = np.where(mask, indices, 0)
    np.maximum.accumulate(
        indices_to_use,
        out=indices_to_use)
    return arr[indices_to_use]
[NEXT]
# wind_speed_rate measurements for a single weather station.
arr = np.array([
20, 5, 3, 8, np.nan, np.nan, 6, np.nan, 25, 5
])
[NEXT]
mask = ~np.isnan(arr)
[NEXT]
indices = np.arange(len(arr))
[NEXT]
indices_to_use = np.where(mask, indices, 0)
[NEXT]
np.maximum.accumulate(indices_to_use, out=indices_to_use)
return arr[indices_to_use]
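Applying these steps end-to-end to the example array gives (illustrative result):
indices_to_use       # array([0, 1, 2, 3, 3, 3, 6, 6, 8, 9])
arr[indices_to_use]  # array([20.,  5.,  3.,  8.,  8.,  8.,  6.,  6., 25.,  5.])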
[NEXT]
Unvectorised moving_average()
def moving_average(arr: np.ndarray,
                   n: int) -> np.ndarray:
    avg = np.zeros(len(arr) - n + 1)
    for i in range(len(avg)):
        avg[i] = arr[i:i+n].sum() / n
    return avg
note Glance over this and the next slide. Just state that this one has a for loop. We can vectorise it and eliminate the loop.
[NEXT]
Vectorised moving_average()
def moving_average(arr: np.ndarray,
                   n: int) -> np.ndarray:
    ret = np.cumsum(arr, dtype=float)
    ret[n:] = ret[n:] - ret[:-n]
    return ret[n - 1:] / n
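For example (illustrative), a window of n = 3 over a short series:
>>> moving_average(np.array([1., 2., 3., 4., 5.]), n=3)
array([2., 3., 4.])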
[NEXT] Total time: 1.4 hours ⟶ 48 mins
Speedup: 2.85x ⟶ 5x
[NEXT SECTION]
note see https://numba.pydata.org/ for examples
[NEXT]
Not all algorithms are vectorisable.
note Are these non-vectorisable Python functions doomed to be slow?
[NEXT]
Compile non-vectorisable Python code to native machine instructions.
[NEXT]
Annotate Python functions with decorators.
Numba compiles them to optimised machine code at runtime.
Just-in-time (JIT) compilation.
Uses LLVM to generate machine instructions.
note Numba gives you the power to speed up your applications with high performance functions written directly in Python. With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
[NEXT]
numba.jit
Decorator that tells Numba to compile a function to native instructions.
[NEXT]
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result
[NEXT]
from numba import jit

@jit(nopython=True)
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result
[NEXT]
[NEXT]
[NEXT]
Numba automatically deduces the types of JIT-compiled functions.
It uses the types of the arguments from the function's first invocation.
[NEXT]
from numba import int64, jit

@jit(int64(int64[:]), nopython=True)
def sum_array(arr):
    result = 0
    for i in range(len(arr)):
        result += arr[i]
    return result
[NEXT]
- Numba type inference sometimes fails
- You might need to specify types manually
  - arguably makes code more verbose / harder to read
- Restricted language features when using nopython=True
  - variable types are fixed
  - cannot use arbitrary classes
note Numba FAQ lists many of the drawbacks: https://numba.pydata.org/numba-doc/dev/user/faq.html
[NEXT]
Added @jit(nopython=True) to all functions.
Explicitly specified types.
[NEXT] Total time: 48 mins ⟶ 2.46 mins
Speedup: 5x ⟶ 98x
[NEXT] Total time: 48 mins ⟶ 2.46 mins
Speedup: 5x ⟶ 98x
[NEXT]
Use vectorised NumPy code where possible.
Fall back to Numba if code cannot be vectorised.
[NEXT SECTION]
[NEXT]
- Moved data to contiguous buffers
- Ran most computation in compiled/optimised machine code
- Vectorised computation to take advantage of CPU's SIMD feature
Current speedup: 98x
[NEXT]
Full dataset is partitioned into different station time series.
Outliers in each station time series are calculated independently.
Split stations into N groups.
Process each group on a different CPU core.
joblib: a library for building lightweight data pipelines.
[NEXT]
joblib.Parallel
Parallelises Python loops.
Handles spawning new Python processes and storing intermediate results for you.
[NEXT]
all_outliers = [
compute_outliers(wind_speeds[start:end])
for start, end in station_index_ranges
]
[NEXT]
from multiprocessing import cpu_count
from joblib import delayed, Parallel
processor = Parallel(n_jobs=cpu_count())
all_outliers = processor(
delayed(compute_outliers)(wind_speeds[start:end])
for start, end in station_index_ranges)
[NEXT] Total time: 2.46 mins ⟶ 1.37 mins
Speedup: 98x ⟶ 177x
[NEXT]
Worker processes receive copies of input data.
Means all station time series are copied.
Adds significant overhead to parallelisation.
note
By default, the workers of the joblib pool are real Python processes forked using the multiprocessing module of the Python standard library. The arguments passed as input to the Parallel call are serialized and reallocated in the memory of each worker process.
[NEXT]
Map in-process memory to data stored on disk.
[NEXT]
np.memmap
Write the measurement data to a memmap'd file.
# Open handle to temporary memmap file.
data = np.memmap(
'wind_speeds',
dtype=np.float64,
shape=(len(input_file['wind_speed_rate']),),
mode='w+')
# Load all wind_speed_rates into memmap file.
data[:] = input_file['wind_speed_rate'][:]
# Flushes contents to disk.
del data
note mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
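A sketch of how the parallel workers can then read from this shared file without copying. Assumptions: each worker re-opens the memmap read-only; compute_outliers, station_index_ranges and the Parallel processor are the ones defined earlier; the wrapper function name is illustrative.
import numpy as np
from multiprocessing import cpu_count
from joblib import delayed, Parallel

def compute_outliers_for_range(start: int, end: int) -> np.ndarray:
    # Re-open the memmap read-only in the worker process. The OS shares
    # the underlying file pages between processes instead of copying them.
    wind_speeds = np.memmap('wind_speeds', dtype=np.float64, mode='r')
    return compute_outliers(wind_speeds[start:end])

processor = Parallel(n_jobs=cpu_count())
all_outliers = processor(
    delayed(compute_outliers_for_range)(start, end)
    for start, end in station_index_ranges)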
[NEXT] Total time: 1.37 mins ⟶ 0.83 mins
Speedup: 177x ⟶ 291x
[NEXT SECTION]
[NEXT]
[NEXT]
[NEXT]
Computing moving standard deviation.
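One way to vectorise a moving standard deviation (a sketch reusing the vectorised moving_average above, via Var(X) = E[X^2] - E[X]^2; this gives the population standard deviation and is not necessarily the exact approach used here):
import numpy as np

def moving_std(arr: np.ndarray, n: int) -> np.ndarray:
    # Mean of squares minus square of means, both computed with the
    # vectorised moving_average, then square-rooted.
    avg = moving_average(arr, n)
    avg_of_squares = moving_average(arr * arr, n)
    return np.sqrt(avg_of_squares - avg * avg)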
[NEXT]
Speedup: 10x
[NEXT]
1145 times faster.
[NEXT]
Use ISD data to detect extreme weather events that happen anywhere on the planet.
[NEXT]
All 8 measurements.
All 35,000 weather stations.
From 1901 to now.
note What if we ran the same outlier detection code on the full dataset?
[NEXT]
27 days ⟶ 38 minutes
[NEXT] On a single MacBook Pro.
[NEXT SECTION]
[NEXT] Python is great for research.
Out of the box, Python is slow.
[NEXT] Increasing demands for faster/real-time data processing.
Processing large volumes of data or training complex machine learning models.
Standard Python in prod isn't viable for many use cases.
[NEXT] We could do research in Python, then convert the code to a faster language.
But this can cause more problems than it solves.
[NEXT] Use Python for research and production.
Possible by using Python's large ecosystem of scientific computing packages.
[NEXT] Keep computation in native code as much as possible.
Vectorise using NumPy where possible.
Use Numba to optimise unvectorisable code.
[NEXT] Identify opportunities to parallelise.
Parallelise using joblib to abstract parallelisation details from your code.
[NEXT]
joblib abstracts the worker backend.
Workers can be CPU cores or a machine cluster.
Run on single machine with multiple CPU cores first.
Run on a cluster of machines only when necessary, or if you already have the infrastructure.
Either way, code is almost identical.
note This abstracts the worker backend. Workers can be CPU cores or machines. Either way, the code remains the same.
[NEXT]
numpy/numba/joblib alone can yield a 1000x speedup.
[NEXT]
Don't throw the problem to dev ops.
[NEXT] If RAM or disk is your bottleneck, parallelise using a cluster.
Otherwise, you can get very far with vectorisation and sprinkling @numba.jit magic.
[NEXT]
Thank you!
[NEXT]
- these slides:
- example code from this talk:
[NEXT]
[[email protected]](mailto:[email protected])
[@donald_whyte](http://twitter.com/donald_whyte)
https://github.com/DonaldWhyte
[NEXT SECTION]
[NEXT]
[1] https://www.jetbrains.com/research/python-developers-survey-2017/
[NEXT]
All performance timings in these slides were produced by running the code on a machine with the following specs:
OS | macOS Sierra v10.12.6 |
Processor | 2.3 GHz Intel Core i5 |
Memory | 8 GB 2133 MHz LPDDR3 |
[NEXT]