build and process `.conda` artifacts #1586
cc @conda-forge/core (as Anaconda.org & CDN can now handle `.conda` artifacts)
edit: all of these items are done

Notes from core call: use … in the condarc
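The exact setting referenced above is cut off in this thread. As a rough sketch, and assuming the `conda_build:` section of `.condarc` with its `pkg_format` and `zstd_compression_level` keys (not confirmed by this thread), the stanza could look like:

```bash
# hypothetical sketch: have conda-build emit .conda artifacts at zstd level 16
cat >> ~/.condarc <<'EOF'
conda_build:
  pkg_format: 2
  zstd_compression_level: 16
EOF
```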
@conda-forge/core I went with compression level 16. LMK if you have any issues with that.
How long does it take to run 16 (or 19)? How much more compression does one see between the two? I understand we may not have benchmarks, but any info that can help guide us would be useful. Should we allow this to be overridable? For example, if compression takes too long on a feedstock close to a CI time limit, we may want to dial it down. Edit: Just realized PR ( #1852 ) shows this being configurable, so I think that answers the last question.
I don't have any of this info. I think we ship this PR and then figure out how it is working as things run.
Yeah, being able to configure it is more important, I think. From past experience with compressors, the last little bit tends to take a lot longer for minimal gain. So I was just trying to get a sense of how "flat" the curve was getting, to aid in decision making.
Here is a benchmark for numpy using the following script on my (old Intel) Mac:

```bash
#!/usr/bin/env bash
in_pkg=$1
out_pkg=${in_pkg/.tar.bz2/.conda}
bak_pkg=${in_pkg}.bak
cp ${in_pkg} ${bak_pkg}
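# sweep a range of zstd levels, timing both the transmute (compression) and the extraction at each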
for level in 1 4 10 16 17 18 19 20 21; do
cp ${bak_pkg} ${in_pkg}
rm -f ${out_pkg}
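# remove any directory left over from a previous extraction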
rm -rf ${out_pkg/.conda//}
start=`python -c "import time; print(time.time())"`
cph transmute --zstd-compression-level=${level} ${in_pkg} .conda
end=`python -c "import time; print(time.time())"`
ttime=$( echo "$end - $start" | bc -l )
start=`python -c "import time; print(time.time())"`
cph x ${out_pkg}
end=`python -c "import time; print(time.time())"`
runtime=$( echo "$end - $start" | bc -l )
size=$(ls -lah ${out_pkg} | cut -w -f 5)
echo "${level} ${size} ${runtime} ${ttime}"
done
cp ${bak_pkg} ${in_pkg}
rm -f ${bak_pkg}
rm -f ${out_pkg}
rm -rf ${out_pkg/.conda//}
```

Results (columns are zstd level, size, extraction time, transmute time):

```
$ ./bench.sh numpy-1.23.4-py39h9e3402d_1.tar.bz2
1 8.2M 2.182568 7.1936909
4 7.2M 1.803828 7.8562648
10 6.4M 1.9773452 8.359201
16 5.9M 1.975351 16.997171
17 5.8M 3.171298 20.3572858
18 5.7M 2.3847492 23.3421962
19 5.7M 2.237947 36.101651
20 5.2M 3.756540 35.1239249
21 5.2M 3.2139912 40.8598119
```

Things flatten out for this size around levels 10-16. This package is ~32M uncompressed.
Here is the start of a benchmark for a much bigger file (around 450 MB compressed):

```
$ ./bench.sh stackvana-afw-0.2022.46-py310hff52083_0.tar.bz2
1 464M 17.158145 161.082254
4 419M 14.839792 157.401964
10 375M 16.084401 199.193263
16 338M 13.825499 711.3774772
```
I think 16 will be fine for now. We can lower it as needed for big packages, and we only take a small hit on small ones.
We really could use an adaptive option in conda-build.
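For illustration only (not something conda-build does today), an "adaptive" choice could be as simple as picking the level from the uncompressed size before transmuting; the thresholds below are made up:

```bash
# hypothetical sketch: pick a zstd level based on uncompressed package size (thresholds are arbitrary)
usize_kb=$(du -sk "${pkg_dir}" | cut -f 1)        # ${pkg_dir}: extracted package contents (assumed)
if   [ "${usize_kb}" -gt 204800 ]; then level=10  # > 200 MB uncompressed: favor speed
elif [ "${usize_kb}" -gt 51200  ]; then level=14  # > 50 MB
else                                    level=16  # default agreed on in this thread
fi
cph transmute --zstd-compression-level="${level}" "${in_pkg}" .conda
```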
Thanks Matt! 🙏 Agreed, 16 seems like plenty. Also notably better than their …
On a different note (kind of related to adaptive), we may in the future want to leverage Zstandard's dictionary support to pretrain on the content of many packages. We could then package this constructed dictionary and use it to improve overall compression and cut down compression/decompression time. One question here is how compressible packages are in aggregate. There may be some things (like text) that compress really well and other things (like dynamic libraries) that may do less well. A somewhat related question is whether it is worth creating per-file-format dictionaries (though this would be a modification to the format). Given other packagers, filesystems, etc. have gone down the path of using Zstandard already, we may be able to glean results from their efforts.
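For anyone unfamiliar with dictionary training, the zstd CLI already exposes it; here is a minimal sketch (file paths and dictionary size are made up, and none of this is part of the `.conda` format):

```bash
# hypothetical sketch: train a shared dictionary on sample files from extracted packages,
# then use it when compressing and decompressing individual small files
zstd --train samples/*.json samples/*.txt -o pkg.dict --maxdict=112640
zstd -16 -D pkg.dict info/index.json -o index.json.zst      # compress with the dictionary
zstd -d  -D pkg.dict index.json.zst -o index.roundtrip.json # the same dictionary is needed to decompress
```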
Great to see this happening - I'm excited about the performance improvements this might bring! ❤️ I've been building my own … packages. With the old `.tar.bz2` format … With the `.conda` format … Is there an easy way to browse the contents of a `.conda` package?
We have made the same observation ( conda/conda-package-handling#5 ) 🙂
All of this information is in https://github.com/regro/libcfgraph and I have a local clone which I regularly use with … Longer term, I was already thinking this would be a great feature for https://prefix.dev/ if @wolfv is interested.
The idea is that this is useful for debugging broken builds - i.e. the build fails because of missing files in the package, so the new package version never gets published outside of CI or my local desktop. As a dev I want to know what the internal file/folder structure of the newly built (broken) package was, so I can compare it with my expectations. I don't know much about prefix.dev, but doesn't that just report on dependencies between published packages?
Related to this, I have filed an issue to create a CEP spelling out the …
@dhirschfeld - you can use …
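The tool named in that reply is cut off. One option, assuming `conda-package-handling` is installed (its `cph x` subcommand is used in the benchmark script earlier in this thread), is to extract the package and browse the resulting directory:

```bash
# extract a .conda package into the current directory and peek at its layout
cph x numpy-1.23.4-py39h9e3402d_1.conda
find numpy-1.23.4-py39h9e3402d_1 -maxdepth 2 | head
```

Since `.conda` is a zip container, `unzip -l <pkg>.conda` will also list the inner `info-*.tar.zst` / `pkg-*.tar.zst` members without extracting anything.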
zstd has built-in benchmarking
What does this mean?
If you run …
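The rest of that reply is missing; presumably it points at zstd's benchmark mode. A minimal sketch of that mode (the input file name is a placeholder; benchmarking makes the most sense on uncompressed data, since that is what transmuting actually compresses):

```bash
# zstd's built-in benchmark: -b<level> benchmarks a single level, -e<level> sweeps from -b up to that level
zstd -b1 -e19 some-package-payload.tar
```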
" copy-pasta" of some comments of mine from internal chat:
|
Thankfully the window size is also limited by the uncompressed archive size.
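As an aside (not from the thread): the window size a decompressor will need is recorded in each zstd frame header, and the CLI can report it; a sketch, assuming one of the inner `.tar.zst` members has been pulled out of a `.conda` archive:

```bash
# -lv lists frame metadata, including the window size required for decompression
zstd -lv pkg-numpy-1.23.4-py39h9e3402d_1.tar.zst
```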
Even though I have …
This issue is to track to-do items for building and handling `.conda` artifacts.

To dos:

- … where `.tar.bz2` is assumed to be at the tail end of packages … (see the sketch after this list)
- anaconda.org … (Conda servers don't seem to cache gzip-compressed indexes, limiting download speed to 2-3 MB/s compressed; see conda/infrastructure#637 (comment))
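The first to-do above is about code that assumes the `.tar.bz2` suffix; a minimal sketch of handling both extensions when stripping a package filename (the file name is reused from the benchmark above, purely for illustration):

```bash
# strip either recognized package extension instead of assuming .tar.bz2
fn="numpy-1.23.4-py39h9e3402d_1.conda"
base="${fn%.tar.bz2}"
base="${base%.conda}"
echo "${base}"   # -> numpy-1.23.4-py39h9e3402d_1
```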