
Enable parallelism #93

Merged: 5 commits merged into bmcfee:main on Jun 27, 2022
Conversation

avalentino (Contributor)

This PR enables the numba JIT parallel option by default.
The option can be disabled by setting an environment variable.
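
For illustration only, such a toggle could look like the following (RESAMPY_DISABLE_PARALLEL is a hypothetical name, not necessarily the one used in the diff):

import os

# Hypothetical variable name, for illustration only.
_PARALLEL = os.environ.get("RESAMPY_DISABLE_PARALLEL", "0") != "1"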

@codecov-commenter commented Jun 18, 2022

Codecov Report

Merging #93 (0603dea) into main (ee82e25) will decrease coverage by 1.35%.
The diff coverage is 77.77%.

@@            Coverage Diff             @@
##             main      #93      +/-   ##
==========================================
- Coverage   82.60%   81.25%   -1.36%     
==========================================
  Files           4        4              
  Lines          92       96       +4     
==========================================
+ Hits           76       78       +2     
- Misses         16       18       +2     
Flag        Coverage Δ
unittests   81.25% <77.77%> (-1.36%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files    Coverage Δ
resampy/core.py   80.76% <77.77%> (-2.57%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee82e25...0603dea.

@avalentino force-pushed the feature/parallel branch 2 times, most recently from f16c60c to 8eb418e, on June 18, 2022 09:27
@bmcfee (Owner) commented Jun 20, 2022

Thanks for getting this one going!

After thinking this one over a bit, I'm not sure that an environment variable is the best way to go here. (If for no other reason, it's kind of a pain to set up unit tests for.)

What do you think about making this more configurable via parameters to the user-facing API? E.g.

resampy.resample(..., parallel=True)  # or parallel=False

we could have two jit-wrapped versions of the core function on hand by replacing

@numba.jit(nopython=True, nogil=True)
def resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):

by something like

def _resample_f(x, y, t_out, interp_win, interp_delta, num_table, scale=1.0):
    ...

resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False)(_resample_f)

and then have the user-facing API select between the two. It's a little hacky, but should not present significant overhead and would certainly be easier to test and benchmark.

As an aside, we might want to consider adding cached compilation (to both) so that we don't have to pay the startup cost each time.
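
A minimal, self-contained sketch of this dispatch idea (the kernel body and the resample signature below are toy stand-ins, not resampy's actual code):

import numba
import numpy as np

def _resample_f(x, y):
    # Toy stand-in for the real kernel, which takes more arguments.
    for i in range(len(y)):
        y[i] = x[i % len(x)]

# Two pre-compiled variants of the same Python body.
resample_f_p = numba.jit(nopython=True, nogil=True, parallel=True)(_resample_f)
resample_f_s = numba.jit(nopython=True, nogil=True, parallel=False)(_resample_f)

def resample(x, n_out, parallel=True):
    # The user-facing API just picks the matching kernel.
    y = np.zeros(n_out, dtype=x.dtype)
    f = resample_f_p if parallel else resample_f_s
    f(x, y)
    return y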

@bmcfee added this to the 0.3.0 milestone on Jun 20, 2022
@avalentino (Contributor, Author)

Dear @bmcfee, I like your solution a lot.
I will try to implement it ASAP.
Just to finalize the details:

  • Can you please confirm that the default option for resample should be parallel=True?
  • Do you want to run all tests for both the sequential and parallel versions, or is a dedicated test for the sequential version enough?

@bmcfee (Owner) commented Jun 20, 2022

  • Can you please confirm that the default option for resample should be parallel=True?

I'd like to benchmark it on typical loads before making a decision on this.

  • Do you want to run all tests for both the sequential and parallel versions, or is a dedicated test for the sequential version enough?

Nah, I think a simple numerical equivalence check for one run would be sufficient. Note that we might not get exact floating-point equivalence here because fp math is not associative, but it shouldn't drift too far from the serial implementation.
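
One possible shape for that check (the test name, signal, and tolerances here are illustrative, not the test added in this PR):

import numpy as np
import resampy

def test_parallel_matches_sequential():
    rng = np.random.default_rng(0)
    x = rng.standard_normal(22050).astype(np.float32)
    y_s = resampy.resample(x, 22050, 16000, parallel=False)
    y_p = resampy.resample(x, 22050, 16000, parallel=True)
    # Exact equality is not guaranteed: the parallel kernel may reorder
    # floating-point additions, so compare with a tolerance.
    assert np.allclose(y_s, y_p, rtol=1e-5, atol=1e-6)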

(Review thread on tests/test_quality.py: outdated, resolved)
@bmcfee (Owner) commented Jun 23, 2022

Thanks @avalentino! This is looking good - any sense of how it benchmarks on typical loads?

@bmcfee (Owner) commented Jun 23, 2022

Quick follow-up, I tried benchmarking this on some typical loads (10s, 60s, 600s signals at 44100→32000, float32) and didn't see any consistent benefit. I also tried this on both my laptop (8 cores) and a 24-core server, with no clear difference in behavior.

A couple of thoughts:

  1. It may be necessary to enable prange for the inner loops as well as the outer loop (see the sketch below).
  2. We might have to do a bit of digging into the compiler optimization to see if it's missing something, or whether it's optimizing at all.
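
For reference, a toy example of what a prange loop looks like in numba (not resampy's actual kernel):

import numba
import numpy as np

@numba.jit(nopython=True, nogil=True, parallel=True)
def toy_kernel(x, win, y):
    n = len(win)
    # Outer loop parallelized with prange; inner loop left serial.
    for i in numba.prange(len(y)):
        acc = 0.0
        for k in range(n):
            acc += x[(i + k) % len(x)] * win[k]
        y[i] = acc

Note that when prange loops are nested, numba parallelizes only the outermost one, which may limit what inner-loop prange can buy here.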

@avalentino (Contributor, Author)

@bmcfee I did some benchmarking when I initially proposed activating the parallelism.
At the moment I do not remember the details, but the speedup on my quad-core laptop was around 3.1x, which is not bad.
As soon as I can access my laptop again, I will try to recover my benchmark script, re-run it on my fresh Ubuntu installation, and share it.

@avalentino (Contributor, Author)

Regarding the possible use of prange in the inner loops: at first glance I would say that it is not a good idea, but nothing is better than a direct test.

@avalentino (Contributor, Author) commented Jun 25, 2022

OK @bmcfee, it seems that the cache parameter has a behaviour that I did not expect.
I should read the documentation more carefully.
Meanwhile I have removed the option from the call to numba.jit, and now things seem to behave as expected.

CPU:  Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz (4 physical cores)
CPU count: 8
Input size 220500
Output size: 882000
best of 5: 1780.217 [ms]
best of 5: 471.998 [ms]
speedup: 3.77165920690257

I have used the following script to benchmark the resample function:

#!/usr/bin/env python3

import os
import timeit
import numpy as np
import resampy

# CPU info
# https://stackoverflow.com/questions/4842448/getting-processor-information-in-python
def get_processor_name():
    import platform
    import subprocess
    if platform.system() == "Windows":
        return platform.processor()
    elif platform.system() == "Darwin":
        os.environ['PATH'] = os.environ['PATH'] + os.pathsep + '/usr/sbin'
        # check_output needs an argument list (or shell=True) and
        # returns bytes, hence the split and decode.
        command = "sysctl -n machdep.cpu.brand_string"
        return subprocess.check_output(command.split()).decode().strip()
    elif platform.system() == "Linux":
        import re
        model_name = None
        cpu_cores = None
        with open('/proc/cpuinfo') as fd:
            cpuinfo = fd.read()
        for line in cpuinfo.split("\n"):
            if "model name" in line:
                model_name = re.sub(".*model name.*:", "", line, 1)
            elif "cpu cores" in line:
                cpu_cores = re.sub(".*cpu cores.*:", "", line, 1).strip()
            if None not in (cpu_cores, model_name):
                break
        return f'{model_name} ({cpu_cores} physical cores)'

    return ""

print('CPU:', get_processor_name())
print('CPU count:', os.cpu_count())


# setup
sr_orig = 22050.0
f0 = 440
T = 10
ovs = 4
t = np.arange(T * sr_orig) / sr_orig
x = np.sin(2 * np.pi * f0 * t)

sr_new = sr_orig * ovs
print('Input size', len(x))
print('Output size:', len(x) * ovs)


# compile
resampy.resample(x, sr_orig, sr_new, parallel=False)
resampy.resample(x, sr_orig, sr_new, parallel=True)

# timeit
debug = False
repeat = 5
results = {}
for label, parallel in zip(('sequential', 'parallel'), (False, True)):
    timer = timeit.Timer(
        f'resampy.resample(x, sr_orig, sr_new, parallel={parallel})',
        globals=globals().copy())
    
    number, _ = timer.autorange()
    number = max(3, number)
    # print('number:', number)

    times = timer.repeat(repeat, number)
    if debug:
        print(label, times)
    results[label] = min(times)/number * 1e3
    print(f'best of {repeat}: {results[label]:.3f} [ms]')

print('speedup:', results['sequential']/results['parallel'])

@bmcfee (Owner) commented Jun 25, 2022

Meanwhile I have removed the option from the call to numba.jit, and now things seem to behave as expected.

Nice - confirming that I also see some good speedups on my laptop with a basic test load (two runs each to eliminate compilation effects):

In [4]: sr = 44100

In [5]: sr_new = 32000

In [6]: x = np.random.randn(120 * sr).astype(np.float32)

In [7]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.14 s ± 16.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
3.41 s ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
723 ms ± 31.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
791 ms ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

and on the 24-core server:

In [9]: %timeit resampy.resample(x, sr, sr_new, parallel=True)
298 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [10]: %timeit resampy.resample(x, sr, sr_new, parallel=False)
5 s ± 3.06 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm curious about how the cache= parameter is affecting this, though, especially given this bit of the documentation:

The following features are explicitly verified to work with caching:

  • ufuncs and gufuncs for the cpu and parallel target
  • parallel accelerator features (i.e. parallel=True)

We should probably dig into this and get a better understanding of what the trouble is.

Otherwise, I'm happy for parallel=True to be the default going forward.

@avalentino (Contributor, Author)

It seems that numba caching gets somewhat confused if one applies the jit decorator twice to the same Python function.
I made a simple test consisting of defining two identical Python functions with different names (_resample_f_p and _resample_f_s).
When I apply the jit decorator with parallel=True/False and cache=True, everything seems to work.
It is probably better to leave caching out for the moment.
Maybe we could open an issue for numba regarding this strange caching behaviour.
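
A sketch of that workaround, with toy bodies standing in for the real kernel:

import numba

# Two textually identical functions with distinct names, so that each
# compiled variant gets its own cache entry.
def _resample_f_s(x, y):
    for i in range(len(y)):
        y[i] = x[i % len(x)]

def _resample_f_p(x, y):
    for i in range(len(y)):
        y[i] = x[i % len(x)]

resample_f_s = numba.jit(
    nopython=True, nogil=True, parallel=False, cache=True)(_resample_f_s)
resample_f_p = numba.jit(
    nopython=True, nogil=True, parallel=True, cache=True)(_resample_f_p)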

@bmcfee (Owner) commented Jun 25, 2022

It seems that numba caching gets somewhat confused if one applies the jit decorator twice to the same Python function.

Ah! That makes sense, and is silly. Maybe we could just do something like

resample_f_s = jit(..., parallel=False)(_resample_f)
resample_f_p = jit(..., parallel=True)(copy(_resample_f))

?

But otherwise agreed that caching is not necessary for this PR.

@avalentino (Contributor, Author)

The solution using copy.copy does not seem to work for me.
I have also tried modifying the __name__ of the copied function.
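
A likely explanation: the standard library copy module treats functions as immutable and returns the very same object, so numba still sees an identical function:

>>> import copy
>>> def f():
...     pass
...
>>> copy.copy(f) is f
True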

@bmcfee (Owner) commented Jun 25, 2022

The solution using copy.copy does not seem to work for me.

Alright, let's punt it then. Probably this is an issue to raise with the numba folks - caching should depend on the jit parameters as well as the function body, and it seems like a bug if that's not the case.

@avalentino (Contributor, Author)

There is indeed an open numba issue: numba/numba#6845

@bmcfee (Owner) left a comment

LGTM! Is there anything else to do here, or shall we merge and release?

@bmcfee mentioned this pull request on Jun 27, 2022
@avalentino (Contributor, Author)

For me it is complete.
Please go ahead.

@bmcfee (Owner) commented Jun 27, 2022

:shipit: thanks!

@bmcfee merged commit ff7f7a7 into bmcfee:main on Jun 27, 2022
@avalentino deleted the feature/parallel branch on June 27, 2022 22:51