
add BLAKE3 to hashlib #83479

Closed
larryhastings opened this issue Jan 11, 2020 · 62 comments
Labels: 3.11 (only security fixes), stdlib (Python modules in the Lib dir), type-feature (A feature request or enhancement)

Comments

@larryhastings
Contributor

BPO 39298
Nosy @malemburg, @gpshead, @larryhastings, @tiran, @mgorny, @jstasiak, @oconnor663, @corona10, @tirkarthi, @kmaork
PRs
  • bpo-39298: Add BLAKE3 bindings to hashlib. #31686
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2022-03-23.16:53:03.784>
    created_at = <Date 2020-01-11.04:27:40.080>
    labels = ['type-feature', 'library', '3.11']
    title = 'add BLAKE3 to hashlib'
    updated_at = <Date 2022-03-24.20:30:46.143>
    user = 'https://github.com/larryhastings'

    bugs.python.org fields:

    activity = <Date 2022-03-24.20:30:46.143>
    actor = 'oconnor663'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-03-23.16:53:03.784>
    closer = 'larry'
    components = ['Library (Lib)']
    creation = <Date 2020-01-11.04:27:40.080>
    creator = 'larry'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 39298
    keywords = ['patch']
    message_count = 62.0
    messages = ['359777', '359794', '359796', '359936', '359941', '360152', '360215', '360535', '360838', '360840', '361918', '361925', '363397', '391355', '391356', '391360', '391418', '401070', '401093', '410277', '410293', '410363', '410364', '410386', '410451', '410452', '410453', '410454', '410455', '413413', '413414', '413490', '413563', '414532', '414533', '414543', '414544', '414565', '414569', '414571', '415761', '415806', '415823', '415842', '415846', '415847', '415863', '415885', '415888', '415889', '415894', '415895', '415897', '415898', '415900', '415902', '415920', '415948', '415951', '415952', '415960', '415973']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'gregory.p.smith', 'larry', 'christian.heimes', 'mgorny', "Zooko.Wilcox-O'Hearn", 'jstasiak', 'oconnor663', 'corona10', 'xtreak', 'kmaork']
    pr_nums = ['31686']
    priority = 'normal'
    resolution = 'rejected'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue39298'
    versions = ['Python 3.11']

    @larryhastings
    Contributor Author

    From 3/4 of the team that brought you BLAKE2, now comes... BLAKE3!

    https://github.com/BLAKE3-team/BLAKE3

BLAKE3 is a brand-new hash function. It's fast, it's parallelizable, and unlike BLAKE2 there's only one variant.

    I've experimented with it a little. On my laptop (2018 Intel i7 64-bit), the portable implementation is kind of middle-of-the-pack, but with AVX2 enabled it's second only to the "Haswell" build of KangarooTwelve. On a 32-bit ARMv7 machine the results are more impressive--the portable implementation is neck-and-neck with MD4, and with NEON enabled it's definitely the fastest hash function I tested. These tests are all single-threaded and eliminate I/O overhead.

The above GitHub repo has a reference implementation in C which includes Intel and ARM SIMD drivers. Unsurprisingly, the interface looks roughly the same as the BLAKE2 interface(s), so if you took the existing BLAKE2 module and did s/blake2b/blake3/ you'd be nearly done. Not quite as close as blake2b and blake2s though ;-)
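To make the "same interface" point concrete: hashlib's existing BLAKE2 constructors look like the sketch below, and the proposal is that a blake3() constructor would be nearly identical (blake3 itself is hypothetical here, since it never landed in hashlib).

```python
import hashlib

# hashlib's existing BLAKE2 interface; a blake3() constructor would
# presumably mirror this shape (new(), update(), hexdigest()).
h = hashlib.blake2b(digest_size=32)
h.update(b"hello, ")
h.update(b"world")
digest = h.hexdigest()
print(len(digest))  # 32-byte digest -> 64 hex characters
```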

@larryhastings larryhastings added the 3.9 (only security fixes), stdlib (Python modules in the Lib dir), and type-feature (A feature request or enhancement) labels on Jan 11, 2020
    @tiran
    Member

    tiran commented Jan 11, 2020

    I've been playing with the new algorithm, too. Pretty impressive!

    Let's give the reference implementation a while to stabilize. The code has comments like: "This is only for benchmarking. The guy who wrote this file hasn't touched C since college. Please don't use this code in production."

    @larryhastings
    Contributor Author

    For what it's worth, I spent some time producing clean benchmarks. All these were run on the same laptop, and all pre-load the same file (406668786 bytes) and run one update() on the whole thing to minimize overhead. K12 and BLAKE3 are using a hand-written C driver, and compiled with both gcc and clang; all the rest of the algorithms are from hashlib.new, python3 configured with --enable-optimizations and compiled with gcc. K12 and BLAKE3 support several SIMD extensions; this laptop only has AVX2 (no AVX512). All these numbers are the best of 3. All tests were run in a single thread.

hash algorithm    | elapsed s  | mb/sec     | size | hash
------------------+------------+------------+------+-----------------------
K12-Haswell       | 0.176949   | 2298224495 | 64   | 24693954fa0dfb059f99...
K12-Haswell-clang | 0.181968   | 2234841926 | 64   | 24693954fa0dfb059f99...
BLAKE3-AVX2-clang | 0.250482   | 1623547723 | 64   | 30149a073eab69f76583...
BLAKE3-AVX2       | 0.256845   | 1583326242 | 64   | 30149a073eab69f76583...
md4               | 0.37684668 | 1079135924 | 32   | d8a66422a4f0ae430317...
sha1              | 0.46739069 | 870083193  | 40   | a7488d7045591450ded9...
K12-clang         | 0.498058   | 816509323  | 64   | 24693954fa0dfb059f99...
BLAKE3            | 0.561470   | 724292378  | 64   | 30149a073eab69f76583...
K12               | 0.569490   | 714093306  | 64   | 24693954fa0dfb059f99...
BLAKE3-clang      | 0.573743   | 708800001  | 64   | 30149a073eab69f76583...
blake2b           | 0.58276098 | 697831191  | 128  | 809ca44337af39792f8f...
md5               | 0.59936016 | 678504863  | 32   | 306d7de4d1622384b976...
sha384            | 0.64208886 | 633352818  | 96   | b107ce5d086e9757efa7...
sha512_224        | 0.66094102 | 615287556  | 56   | 90931762b9e553bd07f3...
sha512_256        | 0.66465768 | 611846969  | 64   | 27b03aacdfbde1c2628e...
sha512            | 0.6776549  | 600111921  | 128  | f0af29e2019a6094365b...
blake2s           | 0.86828375 | 468359318  | 64   | 02bee0661cd88aa2be15...
sha256            | 0.97720436 | 416155312  | 64   | 48b5243cfcd90d84cd3f...
sha224            | 1.0255457  | 396538907  | 56   | 10fb56b87724d59761c6...
shake_128         | 1.0895037  | 373260576  | 32   | 2ec12727ac9d59c2e842...
md5-sha1          | 1.1171806  | 364013470  | 72   | 306d7de4d1622384b976...
sha3_224          | 1.2059123  | 337229156  | 56   | 93eaf083ca3a9b348e14...
shake_256         | 1.3039152  | 311882857  | 64   | b92538fd701791db8c1b...
sha3_256          | 1.3417314  | 303092540  | 64   | 69354bf585f21c567f1e...
ripemd160         | 1.4846368  | 273918025  | 40   | 30f2fe48fec404990264...
sha3_384          | 1.7710776  | 229616579  | 96   | 61af0469534633003d3b...
sm3               | 1.8384831  | 221198006  | 64   | 1075d29c75b06cb0af3e...
sha3_512          | 2.4839673  | 163717444  | 128  | c7c250e79844d8dc856e...

    If I can't have BLAKE3, I'm definitely switching to BLAKE2 ;-)
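For reference, the single-update, best-of-3 methodology described above can be sketched with stdlib-only code (sha256 stands in here for any hashlib.new algorithm; the K12 and BLAKE3 numbers in the table came from hand-written C drivers):

```python
import hashlib
import time

def bench(name, data, runs=3):
    """Best-of-N timing of one update() over preloaded data,
    mirroring the methodology above (I/O excluded)."""
    best = float("inf")
    for _ in range(runs):
        h = hashlib.new(name)
        start = time.perf_counter()
        h.update(data)
        best = min(best, time.perf_counter() - start)
    return best, len(data) / best  # (seconds, bytes/sec)

# Stand-in buffer; the thread used a preloaded 406,668,786-byte file.
data = b"\x00" * (1 << 20)
elapsed, rate = bench("sha256", data)
print(f"sha256: {rate:,.0f} bytes/sec")
```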

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Jan 13, 2020

    I'm in the middle of adding some Rust bindings to the C implementation in github.com/BLAKE3-team/BLAKE3, so that cargo test and cargo bench can cover both. Once that's done, I'll follow up with benchmark numbers from my laptop (Kaby Lake i5-8250U, also AVX2 with no AVX-512). For benchmark numbers with AVX-512 support, see the Performance section of the BLAKE3 paper (https://github.com/BLAKE3-team/BLAKE3-specs/blob/master/blake3.pdf). Larry, what processor did you run your benchmarks on?

    Also, is there anything currently in CPython that does dispatch based on runtime CPU feature detection? Is this something that BLAKE3 should do for itself, or is there existing machinery that we'd want to integrate with?

    @larryhastings
    Contributor Author

    According to my order details it is a "8th Generation Intel Core i7-8650U".

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Jan 16, 2020

Ok, I've added Rust bindings to the BLAKE3 C implementation, so that I can benchmark it in a vaguely consistent way. My laptop is an i5-8250U, which should be very similar to yours. (Both are "Kaby Lake Refresh".) My end results do look similar to yours with TurboBoost on, but pretty different with TurboBoost off:

    with TurboBoost on
    ------------------
    K12 GCC | 2159 MB/s
    BLAKE3 Rust | 1787 MB/s
    BLAKE3 C Clang | 1588 MB/s
    BLAKE3 C GCC | 1453 MB/s

    with TurboBoost off
    -------------------
    BLAKE3 Rust | 1288 MB/s
    K12 GCC | 1060 MB/s
    BLAKE3 C Clang | 1094 MB/s
    BLAKE3 C GCC | 943 MB/s

    The difference seems to be that with TurboBoost on, the BLAKE3 benchmarks have my CPU sitting around 2.4 GHz, while for the K12 benchmarks it's more like 2.9 GHz. With TurboBoost off, both benchmarks run at 1.6 GHz, and BLAKE3 does better. I'm not sure what causes that frequency difference. Perhaps some high-power instruction that the BLAKE3 implementation is emitting?

    To reproduce these numbers you can clone these two repos (the latter is where I happen to have a K12 benchmark):

    https://github.com/BLAKE3-team/BLAKE3
    https://github.com/oconnor663/blake2_simd

    Then in both cases checkout the "bench_406668786" branch, where I've put some benchmarks with the same input length you used.

    For Rust BLAKE3, at the root of the BLAKE3 repo, run: cargo +nightly bench 406668786

    For C BLAKE3, the command is the same, but run it in the "./c/blake3_c_rust_bindings" directory. The build defaults to GCC, and you can "export CC=clang" to switch it.

    For my K12 benchmark, at the root of the blake2_simd repo, run: cargo +nightly bench --features=kangarootwelve 406668786

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Jan 17, 2020

    I plan to bring the C code up to speed with the Rust code this week. As part of that, I'll probably remove comments like the one above :) Otherwise, is there anything else we can do on our end to help with this?

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Jan 23, 2020

    Version 0.1.3 of the official BLAKE3 repo includes some significant performance improvements:

    • The x86 implementations include explicit prefetch instructions, which helps with long inputs. (commit b8c33e1)
    • The C implementation now uses the same parallel parent hashing strategy that the Rust implementation uses. (commit 163f522)

    When I repeat the benchmarks above with TurboBoost on, here's what I see now:

    BLAKE3 Rust 2578 MB/s
    BLAKE3 C (clang -O3) 2502 MB/s
    BLAKE3 C (gcc -O2) 2223 MB/s
    K12 C (gcc -O2) 2175 MB/s

    Larry, if you have time to repeat your benchmarks with the latest C code, I'd be curious to see if you get similar results.

    @larryhastings
    Contributor Author

    I gave it a go. And yup, I see a definite improvement: it jumped from 1,583,326,242 bytes/sec to 2,376,741,703 bytes/sec on my Intel laptop using AVX2. A 50% improvement!

    I also *think* I'm seeing a 10% improvement in ARM using NEON. On my DE10-Nano board, BLAKE3 portable gets about 50mb/sec, and now BLAKE3 using NEON gets about 55mb/sec. (Roughly.) I might have goofed up on the old benchmarks though, or just not written down the final correct numbers.

    I observed no statistically significant performance change in the no-SIMD builds on Intel and ARM.

    p.s. in my previous comment with that table of benchmarks I said "mb/sec". I meant "bytes/sec". Oops!

    @larryhastings
    Contributor Author

    I just tried it with clang, and uff-da! 2,737,446,868 bytes/sec!

    p.s. I compiled with -O3 for both gcc and clang

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Feb 12, 2020

    Version 0.2.0 of the BLAKE3 repo includes optimized assembly implementations. These are behind the "c" Cargo feature for the blake3 Rust crate, but included by default for the internal bindings crate. So the easiest way to rerun our favorite benchmark is:

    git clone https://github.com/BLAKE3-team/BLAKE3
    cd BLAKE3
    git fetch
    # I rebased this branch on top of version 0.2.0 today.
    git checkout origin/bench_406668786
    cd c/blake3_c_rust_bindings
    # Nightly is currently broken for unrelated reasons, so
    # we use stable with this internal bootstrapping flag.
    RUSTC_BOOTSTRAP=1 cargo bench 406668786

    Running the above on my machine, I get 2888 MB/s, up another 12% from the 0.1.3 numbers. As a bonus, we don't need to worry about the difference between GCC and Clang.

    These new assembly files are essentially drop-in replacements for the instruction-set-specific C files we had before, which are also still supported. The updated C README has more details: https://github.com/BLAKE3-team/BLAKE3/blob/master/c/README.md

    @larryhastings
    Contributor Author

    Personally I'm enjoying these BLAKE3 status updates, and I wouldn't mind at all being kept up-to-date during BLAKE3's development via messages on this issue. But, given the tenor of the conversation so far, I'm guessing Python is gonna hold off until BLAKE3 reaches 1.0.

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Mar 4, 2020

    I've just published some Python bindings for the Rust implementation on PyPI: https://pypi.org/project/blake3

    I'm guessing Python is gonna hold off until BLAKE3 reaches 1.0.

    That's very fair. The spec and test vectors are set in stone at this point, but the implementations are new, and I don't see any reason to rush things out. (Especially since early adopters can now use the library above.) That said, there aren't really any expected implementation changes that would be a natural moment for the implementations to tag 1.0. I'll probably end up tagging 1.0 as soon as a caller appears who needs it to be tagged to meet their own stability requirements.

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Apr 19, 2021

    An update a year later: I have a proof-of-concept branch that adds BLAKE3 support to hashlib: https://github.com/oconnor663/cpython/tree/blake3. That branch is API compatible with the current master branch of https://github.com/oconnor663/blake3-py. Both that module and the upstream BLAKE3 repo are ready to be tagged 1.0, just waiting to see whether any integrations like this one end up requesting changes.

    Would anyone be interested in moving ahead with this? One of the open questions would be whether CPython would vendor the BLAKE3 optimized assembly files, or whether we'd prefer to stick to C intrinsics.

    @larryhastings
    Contributor Author

I note that Python already ships with some #ifdefs around SSE and the like. So, yes, we already do this sort of thing, although I think this usually uses compiler intrinsics rather than actual assembly. A quick grep shows zero .s files and only one .asm file (./Modules/_decimal/libmpdec/vcdiv64.asm) in the Python tree. So it wouldn't be completely novel for Python, but it would be unusual.

    I assume there's a completely generic platform-agnostic C implementation, for build environments where the assembly won't work, yes?

    Disclaimer: I've been corresponding with Jack sporadically over the past year regarding the BLAKE3 Python API. I also think BLAKE3 is super duper cool neat-o, and I have uses for it. So I'd love to see it in Python 3.10.

One note, just to draw attention to it: the "blake3-py" module, also published by Jack, is written using the Rust implementation, which I understand is even more performant. Obviously there's no chance Python would ship that implementation. But by maintaining exact API compatibility between "blake3-py" and the "blake3" added to hashlib, code can use the fast one when it's available and fall back to the built-in one when it isn't, à la cStringIO:

    try:
        from blake3 import blake3
    except ImportError:
        from hashlib import blake3
    

@larryhastings larryhastings added the 3.10 (only security fixes) label and removed the 3.9 (only security fixes) label on Apr 19, 2021
    @tiran
    Member

    tiran commented Apr 19, 2021

3.10 feature freeze is in two weeks (May 3). I don't feel comfortable adding so much new C code shortly before beta 1. If I understand correctly, the code is new and hasn't been published on PyPI yet. I also don't have much time to properly review the code. OpenSSL 3.0.0 and PEP 644 are keeping me busy.

    I would prefer to postpone the inclusion of blake3. Could you please publish the C version on PyPI first and let people test it?

    Apropos OpenSSL, do you have plans to submit the algorithm to OpenSSL for inclusion in 3.1.0?

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Apr 20, 2021

    Hey Christian, yes these are new bindings, and also incomplete. See comments in oconnor663@dc6f616, but in short only x86-64 Unix is in working order. If 3.10 doesn't seem realistic, I'm happy to go the PyPI route. That said, this is my first time using the Python C API. (My code in that branch is going to make that pretty obvious.) Could you recommend any existing packages that I might be able to use as a model?

    For OpenSSL, I'm very interested in the abstract but less familiar with their project and their schedules. Who might be a good person to get in touch with?

    I assume there's a completely generic platform-agnostic C implementation, for build environments where the assembly won't work, yes?

    Yes, that's the vendored file blake3_portable.c. One TODO for my branch here is convincing the Python build system not to try to compile the x86-64-specific stuff on other platforms. The vendored file blake3_dispatch.c abstracts over all the different implementations and takes care of #ifdef'ing platform-specific function calls. (It also does runtime CPU feature detection on x86.)

    written using the Rust implementation, which I understand is even more performant

    A few details here: The upstream Rust and C implementations have been matched in single threaded performance for a while now. They share the same assembly files, and the rest is a direct port. The big difference is that Rust also includes multithreading support, using the Rayon work-stealing runtime. The blake3-py module based on the Rust crate exposes this with a simple boolean flag, though we've been thinking about ways to give the caller more control over the number of threads used.

    @mgorny
    Mannequin

    mgorny mannequin commented Sep 5, 2021

    Jack, are you still working on this? I was considering allocating the time to write the bindings for the C library myself but I've stumbled upon this bug and I suppose there's no point in duplicating work. I'd love to see it on pypi, so we could play with it a bit.

    @tiran
    Member

    tiran commented Mar 5, 2022

    I didn't consult the steering council in 2016, because I lost the keys to the time machine. The very first SC election was in 2019. :)

    @larryhastings
    Contributor Author

    Right, and I did say "(or BDFL)". Apparently you didn't bother to consult with the BDFL in advance, or at least not in the usual public venues--I haven't found a record of such a conversation on the bpo issue, nor in python-dev.

BTW you simultaneously proposed adding SHA3/SHAKE. The total line count for all this work was over 26k; you didn't mention any discomfort with the size of these patches at the time in public correspondence.

    In fact, quite the opposite. On 2016/05/28 you said:

    I also don't get your obsession with lines of code. The gzip and expat
    are far bigger than the KeccakCodePackage.

    https://mail.python.org/archives/list/[email protected]/message/3YHVN2I74UQC36AVY5BGRJJUE4PMU6GX/

    @larryhastings
    Contributor Author

    Jack: I've updated the PR, improving compatibility with the "blake3" package on PyPI. I took your notes, and also looked at the C module you wrote.

    The resulting commit is here:

    37ce72b

    Specifically:

    • derive_key_context is now a string, which I internally encode into UTF-8.

    • I added the AUTO member to the module, set to -1.

    • max_threads may not be zero; it can be >= 1 or AUTO.

    • I added the reset() method.

    Some additional thoughts, both on what I did and on what you did:

    • In your new() method, your error string says "keys must be 32 bytes". I went with "key must be exactly 32 bytes"; the name of the parameter is "key", and I think "exactly" makes the message clearer.

    • In my new() method, I complain if the derive_key_context is zero-length. In your opinion, is that a good idea, or is that overly fussy?

    • In your copy() method, you hard-code Blake3Type. It's considered good form to use type(self) here, in case the user is calling copy on a user-created subclass of Blake3Type.

    • In your copy() method, if the original has a lock created, you create a lock in the copy too. I'm not sure why you bother; I leave the lock member uninitialized, and let the existing logic create the lock in the copy on demand.

    • In the Blake3_methods array, you list the "update" method twice. I suspect this is totally harmless, but it's unnecessary.
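Since blake3 never landed in hashlib, the key-validation behavior discussed above can be illustrated with blake2b's existing keyed mode (note the constraints differ: blake2b accepts keys up to 64 bytes, while the blake3 PR required exactly 32):

```python
import hashlib

# Keyed hashing with the existing BLAKE2 API; the blake3 PR's
# "key must be exactly 32 bytes" check is the analogous validation.
h = hashlib.blake2b(b"message", key=b"k" * 32, digest_size=32)
print(h.hexdigest())

try:
    hashlib.blake2b(key=b"k" * 65)  # over blake2b's 64-byte key limit
except ValueError as err:
    print("rejected:", err)
```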

    @gpshead
    Member

    gpshead commented Mar 22, 2022

    hashlib creator and other maintainer here: I do not think it was a good idea for us to add blake2 to hashlib the way we did. So blake3 should not be presumed as a given, at least not done in the same manner.

    Background:

    While OpenSSL gained _some_ blake2 support in 2016, around when we were adding it to hashlib (not a coincidence), we made a mistake: We offered an overly complex API. OpenSSL's own support for blake2 is a subset, not sufficient to be used as a replacement for the API we exposed so we are stuck with our vendored copy with no easy way off. openssl/openssl#980

    OpenSSL is not going to gain native blake3 support. openssl/openssl#11613

Given that I don't want to see us gain new vendored copies of significant but non-critical third-party hash code in our tree (Modules/_blake3/impl/ in PR 31686) for anything but a known fixed-term need (ex: the sha2 libtomcrypt code is gone from our tree, as was clearly going to happen from the start), the only way I think we should include blake3 support is if there is already a plan for that code to leave our tree in the future, with a high probability of success.

    A configure.ac check for an installed blake3 library to build and link against would be appropriate.

    Along with updating relevant CI systems and Windows and macOS release build systems to have that available.

    That'd significantly shrink the PR to reasonable size.
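A rough sketch of what that configure-time probe could look like (illustrative autoconf only, not the actual configure.ac text; blake3_hasher_init is the C library's init function):

```m4
# Hypothetical configure.ac fragment: build the _blake3 module only
# when a system libblake3 is present.
AC_CHECK_HEADERS([blake3.h])
AC_CHECK_LIB([blake3], [blake3_hasher_init],
             [have_libblake3=yes], [have_libblake3=no])
```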

    This means blake3 support should be considered optional as anyone doing their own CPython build may not have it. This is primarily a documentation issue: list it as such and provide one official documented API to detect its availability. Our binary releases will include it as will most OS distro packagers. It also means implementation details, performance and platform tuning are **not our problem** but those of the OS distro or external library provider.

    Regarding setup.py, what Christian says is true, that is on its way out. Do not add new non-trivial statements to it as that just creates more work for those working to untangle the mess. Getting rid of the /impl/ build in favor of an autoconf detected library gets rid of that mess.

I'll file a separate issue to track moving blake2 in the same direction so we can lose its /impl/.

    @gpshead
    Member

    gpshead commented Mar 22, 2022

correction: our md5/sha1/sha2/sha3 code is not gone yet, but those are simple C implementations used as a fallback when the provider of the optimal versions (OpenSSL for those) is unavailable. That keeps the copies of code in our tree simple, and most people use the optimal library version.

    @larryhastings
    Contributor Author

    Given that I don't want to see us gain new vendored copies of
    significant but non-critical third party hash code in our tree
    (Modules/_blake3/impl/ in PR 31686) for anything but a known
    fixed term need (ex: the sha2 libtomcrypt code is gone from
    our tree as was clearly going to happen from the start),
    the only way I think we should include blake3 support is if
    there is already a plan for that code to leave our tree in
    the future with a high probability of success.

    You've said what you want, but not why. It sounds like you are against merging the BLAKE3 PR containing its own impl. Why?

    @gpshead
    Member

    gpshead commented Mar 23, 2022

    Because I don't think blake3 or blake2 _(though we've shipped it already so there's a challenge in making changes https://bugs.python.org/issue47095)_ are important enough to be _guaranteed_ present in all builds (our release binaries would include them). Depending on an external library for those to exist makes sense.

    I do not want CPython to get into the business of maintaining a complicated build process in-tree for third party architecture specific optimized code for non-core functionality purposes. That is best handled outside of the project & on CI and binary release hosts.

    I'm okay with blake3 in hashlib if we can avoid gaining another /impl/ tree that is a copy of large third party code and our own build system for it.

    Q: What benefits does having blake3 builtin vs getting it from PyPI bring?

    Q: Should we instead provide a way for third party provided hashes to be registered in hashlib similar to codecs.register()?

    I view the NIST standard hashes as important enough to attempt to guarantee as present (all the SHAs and MD5) as built-in. Others should really demonstrate practical application popularity to gain included battery status rather than just using PyPI.
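A toy sketch of that registration idea, modeled on codecs.register() (entirely hypothetical; hashlib has no such hook, and the names below are illustrative):

```python
import hashlib

# Hypothetical registry: third-party packages register constructors
# that a new()-style helper falls back to when hashlib lacks the name.
_registry = {}

def register(name, constructor):
    _registry[name.lower()] = constructor

def new(name, data=b""):
    try:
        return hashlib.new(name, data)
    except ValueError:  # not built in; try registered third-party hashes
        return _registry[name.lower()](data)

# A third-party package (e.g. blake3-py) would call register() on import.
register("myhash", lambda data=b"": hashlib.sha256(data))
print(new("myhash", b"abc").hexdigest() == hashlib.sha256(b"abc").hexdigest())
```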

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Mar 23, 2022

    maintaining a complicated build process in-tree

    For what it's worth, if you have any sort of "in a perfect world" vision for what the upstream BLAKE3 project could do to make it trivially easy for you to integrate, I'd be very interested in getting that done. Making integration easy would benefit all callers. We have some issues on the backburner about committing CMake build files, but I assume those would be useless for us here. Is there anything that would be more useful? If we provided autotools build files, could you call into them?

    Fundamentally, BLAKE3 wants to build some code on x86 and some other code on ARM, and also some code on Unix and some other code on Windows. Currently we just ask the caller to do that for us, for lack of a common standard. (And if we're building intrinsics rather than assembly, we also need the compiler flags that enable our intrinsics.) But maybe we could handle more of that upstream, using the preprocessor? If the build instructions said "compile this one giant file on all platforms and don't worry about what it does", would that be better? Or would that be gross? Is a header-only library the gold standard? Or too C++-ish? Has anyone ever done a really good job of this?

    @malemburg
    Member

    On 23.03.2022 02:12, Gregory P. Smith wrote:

    I view the NIST standard hashes as important enough to attempt to guarantee as present (all the SHAs and MD5) as built-in. Others should really demonstrate practical application popularity to gain included battery status rather than just using PyPI.

    +1 on this. I also think the topic deserves a wider discussion.

    IMO, Python's stdlib should only provide a basic set of hash algorithms
    and not try to add every single new algorithm out there.

    PyPI is a much better way to add support for new hash algorithms,
    can move much faster than the stdlib, provide specialized builds for
    added performance and also add exotic features, which are not always
    needed.

    Here's the list of Python 3.10 algos on a typical Linux system:

    >>> hashlib.algorithms_available
    {'sha512_256', 'mdc2', 'md5-sha1', 'md4', 'ripemd160', 'shake_128', 'sha3_384',
    'blake2s', 'sha3_512', 'sha3_256', 'sha256', 'sha1', 'sm3', 'sha512_224',
    'whirlpool', 'sha384', 'shake_256', 'sha224', 'sha512', 'sha3_224', 'md5',
    'blake2b'}

    This already is more than enough. Since we're using OpenSSL in Python
    anyway, exposing some of the often used algos from OpenSSL is fine,
    since it doesn't add much extra bloat. The above list already goes
    way beyond this, IMO.

    The longer the list gets, the more confusion it causes among users,
    since Python's stdlib doesn't provide any guidance on
    basic questions such as "Which hash algo should I use for my
    application".

    Most applications today will only need these basic hash algos:

    {'ripemd160', 'sha3_512', 'sha3_256', 'sha256', 'sha1', 'sha512', 'md5'}

    @larryhastings
    Contributor Author

    Ok, I give up.

    @malemburg
    Member

    On 23.03.2022 17:53, Larry Hastings wrote:

    Ok, I give up.

    Sorry to spoil the fun, but there's no need to throw
    everything in the bin ;-)

A lean and fast blake3 C package would still be a great thing
to have on PyPI, e.g. to provide support for platforms that
Jack's blake3 Rust package doesn't cover:

    Raspis:
    https://www.piwheels.org/project/blake3/

    Android (e.g. via termux):
    https://wiki.termux.com/wiki/Main_Page
    https://wiki.termux.com/wiki/Python

    etc.

    @larryhastings
    Contributor Author

    The Rust version is already quite "lean". And it can be much faster than the C version, because it supports internal multithreading. Even without multithreading I bet it's at least a hair faster.

    Also, Jack has independently written a Python package based around the C version:

    https://github.com/oconnor663/blake3-py/tree/master/c_impl

    so my making one would be redundant.

    I have no interest in building standalone BLAKE3 PyPI packages for Raspberry Pi or Android. My goal was for BLAKE3 to be one of the "included batteries" in Python--which would have meant it would, eventually, be available on the Raspberry Pi and Android builds that way.

    @malemburg
    Member

    With "lean" I meant: doesn't use much code and is easy to compile
    and install.

    I built a wheel from Jack's experimental package and it comes out to
    just under 100kB on Linux x64, compared to around the 1.1MB the
    Rust wheel needs:

Archive: blake3_experimental_c-0.0.1-cp310-cp310-linux_x86_64.whl
  Length  Date        Time   Name
--------  ----------  -----  ----
  348528  2022-03-23  18:38  blake3.cpython-310-x86_64-linux-gnu.so
    3183  2022-03-23  18:38  blake3_experimental_c-0.0.1.dist-info/METADATA
     105  2022-03-23  18:38  blake3_experimental_c-0.0.1.dist-info/WHEEL
       7  2022-03-23  18:38  blake3_experimental_c-0.0.1.dist-info/top_level.txt
     451  2022-03-23  18:38  blake3_experimental_c-0.0.1.dist-info/RECORD
--------                     -------
  352274                     5 files

Archive: blake3-0.3.1-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.whl
  Length  Date        Time   Name
--------  ----------  -----  ----
    3800  2022-01-13  01:26  blake3-0.3.1.dist-info/METADATA
     133  2022-01-13  01:26  blake3-0.3.1.dist-info/WHEEL
      48  2022-01-13  01:26  blake3/__init__.py
 4195392  2022-01-13  01:26  blake3/blake3.cpython-310-x86_64-linux-gnu.so
     382  2022-01-13  01:26  blake3-0.3.1.dist-info/RECORD
--------                     -------
 4199755                     5 files

    I don't know why there is such a significant difference in size. Perhaps
    the Rust version includes multiple variants for different CPU
    optimizations?

    @larryhastings
    Contributor Author

    I can't answer why the Rust one is so much larger--that's a question for Jack. But the blake3-py you built might (should?) have support for SIMD extensions. See the setup.py for how that works; it appears to at least try to use the SIMD extensions on x86 POSIX (32- and 64-bit), x86_64 Windows, and 64-bit ARM POSIX.

    If you were really curious, you could run some quick benchmarks, then hack your local setup.py so it doesn't attempt to add support for those (see "portable code only" in setup.py), rebuild, and run your benchmarks again. If BLAKE3 got a lot slower, then yes, you (initially) built it with SIMD extension support.
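    A benchmark along those lines only takes a few lines of Python. This sketch times stdlib hashes as stand-ins; with the blake3 wheel installed, the same loop can time `blake3.blake3` (that import comes from the third-party package, not the stdlib):

```python
import hashlib
import time

def throughput_mb_s(factory, total=64 * 1024 * 1024, chunk=64 * 1024):
    """Hash `total` bytes in `chunk`-sized updates and return MB/s."""
    data = b"\x00" * chunk
    h = factory()
    start = time.perf_counter()
    for _ in range(total // chunk):
        h.update(data)
    h.digest()
    return total / (time.perf_counter() - start) / 1e6

for name in ("sha1", "sha256", "blake2b"):
    print(f"{name}: {throughput_mb_s(getattr(hashlib, name)):.0f} MB/s")
# With the wheel installed: `from blake3 import blake3`, then pass `blake3`
# to throughput_mb_s the same way.
```

    Running it once against the SIMD build and once against the portable-only build would show the difference directly.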

    @gpshead
    Member

    gpshead commented Mar 23, 2022

    To anyone else who comes along with motivation:

    I'm fine with blake3 being in hashlib, but I don't want us to guarantee it by carrying the implementation of the algorithm in the CPython codebase itself unless it gains wide industry standard-like adoption status.

    We should feel free to link to both the Rust blake3 and C blake3-py packages from the hashlib docs regardless.

    @gpshead
    Member

    gpshead commented Mar 23, 2022

    Performance wise... The SHA series have hardware acceleration on modern CPUs and SoCs. External libraries such as OpenSSL are in a position to provide implementations that make use of that. Same with the Linux Kernel CryptoAPI (https://bugs.python.org/issue47102).

    Hardware accelerated SHAs are likely faster than blake3 single core. And certainly more efficient in terms of watt-secs/byte.

    @malemburg
    Member

    Here's a wheel which only includes the portable code (I disabled
    all the special cases as you suggested).

    Archive:  dist/blake3_experimental_c-0.0.1-cp310-cp310-linux_x86_64.whl
      Length        Date   Time   Name
    ---------  ----------  -----  ----
       297680  2022-03-23  19:26  blake3.cpython-310-x86_64-linux-gnu.so
         3183  2022-03-23  19:26  blake3_experimental_c-0.0.1.dist-info/METADATA
          105  2022-03-23  19:26  blake3_experimental_c-0.0.1.dist-info/WHEEL
            7  2022-03-23  19:26  blake3_experimental_c-0.0.1.dist-info/top_level.txt
          451  2022-03-23  19:26  blake3_experimental_c-0.0.1.dist-info/RECORD
    ---------                     -------
       301426                     5 files

    I didn't run any benchmarks, but it's clear that the SIMD code was
    used in my initial build; it adds some 50kB to the .so file.
    This is on an older Linux x64 box with an Intel i7-4770K CPU.

    It could be that the Rust version includes several such SIMD variants
    and then branches at runtime based on the CPU running the code.

    In any case, the C extension is indeed very easy to build and
    install with a standard compiler setup.

    @gpshead
    Member

    gpshead commented Mar 23, 2022

    Rust based anything comes with a baseline level of Rust code overhead. https://stackoverflow.com/questions/29008127/why-are-rust-executables-so-huge

    That seems expected.

    @larryhastings
    Contributor Author

    Performance wise... The SHA series have hardware acceleration on
    modern CPUs and SoCs. External libraries such as OpenSSL are in a
    position to provide implementations that make use of that. Same with
    the Linux Kernel CryptoAPI (https://bugs.python.org/issue47102).

    Hardware accelerated SHAs are likely faster than blake3 single core.
    And certainly more efficient in terms of watt-secs/byte.

    I don't know if OpenSSL currently uses the Intel SHA1 extensions.
    A quick Google search suggests support was added in 2017. And:

    • I'm using a recent CPU that AFAICT supports those extensions.
      (AMD 5950X)
    • My Python build with BLAKE3 support is using the OpenSSL implementation
      of SHA1 (_hashlib.openssl_sha1), which I believe is using the OpenSSL
      provided by the OS. (I haven't built my own OpenSSL or anything.)
    • I'm using a recent operating system release (Pop!_OS 21.10), which
      currently has OpenSSL version 1.1.1l-1ubuntu1.1 installed.
    • My Python build with BLAKE3 doesn't support multithreaded hashing.
    • In that Python build, BLAKE3 is roughly twice as fast as SHA1 for
      non-trivial workloads.

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Mar 24, 2022

    Hardware accelerated SHAs are likely faster than blake3 single core.

    Surprisingly, they're not. Here's a quick measurement on my recent ThinkPad laptop (64 KiB of input, single-threaded, TurboBoost left on), which supports both AVX-512 and the SHA extensions:

    OpenSSL SHA-256: 1816 MB/s
    OpenSSL SHA-1: 2103 MB/s
    BLAKE3 SSE2: 2109 MB/s
    BLAKE3 SSE4.1: 2474 MB/s
    BLAKE3 AVX2: 4898 MB/s
    BLAKE3 AVX-512: 8754 MB/s

    The main reason SHA-1 and SHA-256 don't do better is that they're fundamentally serial algorithms. Hardware acceleration can speed up a single instance of their compression functions, but there's just no way for it to run more than one instance per message at a time. In contrast, AES-CTR can easily parallelize its blocks, and hardware accelerated AES does beat BLAKE3.
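    The serial-vs-tree distinction can be illustrated in a few lines. This is a toy sketch using blake2b as a stand-in leaf hash (it is NOT BLAKE3's actual tree mode): a chained hash forces strictly sequential updates, while independent leaves can be hashed in any order, even concurrently, and then combined:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

data = bytes(range(256)) * 1024
chunks = [data[i:i + 4096] for i in range(0, len(data), 4096)]

# Chained (serial): each update depends on the state left by the previous
# one, so no two blocks of a single message can be processed at once.
serial = hashlib.blake2b()
for c in chunks:
    serial.update(c)

# Tree-style (parallelizable): the leaves are independent, so a pool can
# hash them concurrently; only the final combining step is sequential.
with ThreadPoolExecutor() as pool:
    leaves = list(pool.map(lambda c: hashlib.blake2b(c).digest(), chunks))
root = hashlib.blake2b(b"".join(leaves)).digest()
print(serial.hexdigest()[:16], root.hex()[:16])
```

    Hardware SHA instructions speed up one compression call, but they can't break the chain in the first half of this sketch; tree hashes like BLAKE3 are structured like the second half.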

    And certainly more efficient in terms of watt-secs/byte.

    I don't have any experience measuring power myself, so take this with a grain of salt: I think the difference in throughput shown above is large enough that, even accounting for the famously high power draw of AVX-512, BLAKE3 comes out ahead in terms of energy/byte. Probably not on ARM though.

    @tiran
    Member

    tiran commented Mar 24, 2022

    SHA-1 should be considered broken anyway, and SHA-256 does not perform well on 64-bit systems. Truncated SHA-512 (SHA-512/256) typically performs about 40% faster than SHA-256 on x86-64. It should get you close to the performance of BLAKE3 SSE4.1 on your system.

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Mar 24, 2022

    Truncated sha512 (sha512-256) typically performs 40% faster than sha256 on X86_64.

    Without hardware acceleration, yes. But because SHA-NI includes only SHA-1 and SHA-256, and not SHA-512, it's no longer a level playing field. OpenSSL's SHA-512 and SHA-512/256 both get about 797 MB/s on my machine.
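    For completeness, SHA-512/256 is exposed through hashlib when the underlying OpenSSL build provides it. Note that it uses different initial hash values than SHA-512, so it is not simply a truncated sha512 digest:

```python
import hashlib

# Probe first: whether sha512_256 is present depends on the OpenSSL build.
if "sha512_256" in hashlib.algorithms_available:
    print("sha512_256:", hashlib.new("sha512_256", b"abc").hexdigest())

print("sha256:    ", hashlib.sha256(b"abc").hexdigest())
# SHA-512/256 != sha512(...)[:32] -- the initialization vectors differ.
```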

    @gpshead
    Member

    gpshead commented Mar 24, 2022

    You missed the key "And certainly more efficient in terms of watt-secs/byte" part.

    @oconnor663
    Mannequin

    oconnor663 mannequin commented Mar 24, 2022

    I did reply to that point above with some baseless speculation, but now I can back up my baseless speculation with unscientific data :)

    https://gist.github.com/oconnor663/aed7016c9dbe5507510fc50faceaaa07

    According to whatever powerstat -R measures on my laptop, running hardware-accelerated SHA-256 in a loop for a minute or so takes 26.86 Watts on average. Doing the same with AVX-512 BLAKE3 takes 29.53 Watts, 10% more. Factoring in the 4.69x difference in throughput reported by those loops, the overall energy/byte for BLAKE3 is 4.27x lower than SHA-256. This is my first time running a power benchmark, so if this sounds implausible hopefully someone can catch my mistakes.
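    The arithmetic behind that comparison, using the numbers quoted above:

```python
# Energy-per-byte comparison from the powerstat measurements above.
sha256_watts = 26.86
blake3_watts = 29.53
throughput_ratio = 4.69  # BLAKE3 throughput relative to SHA-256

power_ratio = blake3_watts / sha256_watts          # extra power BLAKE3 draws
energy_advantage = throughput_ratio / power_ratio  # energy/byte advantage
print(f"{power_ratio:.2f}x power, {energy_advantage:.2f}x less energy per byte")
# → 1.10x power, 4.27x less energy per byte
```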
