Improve optimal parser performance on small data #2771
Conversation
This is less appropriate for this mode: a benchmark is about accuracy, and it's important to read the exact values.
As a library, the default shouldn't be to write anything to the console. `cover` and `fastcover` have a `g_displayLevel` variable to control this behavior. It's now set to 0 (no display) by default. Raising notifications to a higher level should be an explicit operation by a console application.
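For reference, here is a minimal sketch of the pattern being described; it is illustrative only, not the actual `cover.c` code, and the function name is hypothetical:

```c
#include <stdio.h>

/* Library side: silent by default */
static int g_displayLevel = 0;   /* 0 = no display, the default a library should keep */

#define DISPLAYLEVEL(l, ...) \
    do { if ((l) <= g_displayLevel) fprintf(stderr, __VA_ARGS__); } while (0)

/* Console-application side: raising notifications is an explicit opt-in */
static void setNotificationLevel(int level) { g_displayLevel = level; }

static void trainStep(unsigned epoch)
{
    DISPLAYLEVEL(2, "training, epoch %u...\r", epoch);   /* printed only if the app opted in */
}
```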
small general compression ratio improvement for btopt+ strategies
used to be necessary to counter-balance the fixed-weight frequency update, which has recently been changed to an adaptive rate (targeting stable starting frequency stats).
better for larger blocks, very small inefficiency on small blocks.
better for large files, and sources with relatively "stable" entropy, like silesia.tar. Slightly worse for files with rapidly changing entropy, like calgary.tar. Updated small-file tests in fuzzer.
notably within kernel space
programs/benchzstd.c
Outdated
DISPLAYLEVEL(2, "\r%70s\r", ""); /* blank line */ | ||
DISPLAYLEVEL(2, "%2s-%-17.17s : %.*f%s -> \r", marks[markNb], displayName, hr_isize.precision, hr_isize.value, hr_isize.suffix); | ||
assert(srcSize < UINT_MAX); | ||
DISPLAYLEVEL(2, "%2s-%-17.17s :%10u -> \r", marks[markNb], displayName, (unsigned)srcSize); |
With the existing code, you can see full sizes with `-vv`. Does that seem insufficient? In general I like that this is human-readable formatted.
Ah yes, I noticed this capability.
It wasn't enough unfortunately, as it was breaking scripts that depend on the previous format.
Moreover, and more importantly, I simply believe that this is not the right place for the condensed quantity format. Benchmarking is about accuracy. I typically always need the exact amount, in order to clearly appreciate changes, even when they are small.
The new "human readable" format is more suitable for general status information.
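As a purely illustrative compromise between the two preferences (this is not the benchzstd.c patch itself; the helper below is hypothetical), a benchmark line can always carry the exact byte count and only append the condensed form at higher verbosity:

```c
#include <stdio.h>

static void printBenchSrcSize(int displayLevel, const char* displayName,
                              unsigned long long srcSize)
{
    /* exact value first: scripts and careful eyes get the precise amount */
    fprintf(stderr, "%-17.17s :%12llu bytes", displayName, srcSize);
    if (displayLevel >= 3) {
        /* condensed human-readable rendering only as a bonus at higher verbosity */
        fprintf(stderr, " (~%.2f MB)", (double)srcSize / (1000.0 * 1000.0));
    }
    fprintf(stderr, " ->\r");
}
```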
Yeah, I will say that more often than not I have found myself benchmarking something and then having to re-do it once I realize I forgot `-vv`. At least for me, I'd always prefer to have it on.
Makes sense.
by employing parallel compilation of object files.
It seems so, probably a change to the
optPtr->litSum = ZSTD_scaleStats(optPtr->litFreq, MaxLit, 12);
optPtr->litLengthSum = ZSTD_scaleStats(optPtr->litLengthFreq, MaxLL, 11);
optPtr->matchLengthSum = ZSTD_scaleStats(optPtr->matchLengthFreq, MaxML, 11);
optPtr->offCodeSum = ZSTD_scaleStats(optPtr->offCodeFreq, MaxOff, 11);
Should the `logTarget` maybe be based on the block size?
I would expect smaller blocks to want to have a smaller history.
The update policy could be refined even further. But note that the topic of "block size" has many sub-cases, that probably deserve separate policies.
For example, is the "small block" also the "only block"?
(I presume that's what you meant.)
In which case, the first pass of `btultra2` would produce stats which are already scaled to the size of the block. If there are only a few sequences, then the stats will only contain a few elements, way below the `logTarget` threshold. In which case, they won't be up-scaled (stats are only down-scaled when they need to be). So there is a form of built-in adaptation for small inputs in this process.
But if the "small block" is the N-th block in a long stream, and is also expected to be followed by other blocks of variable size, the situation is different. I would guess that a "stream level" `logTarget` would feel more appropriate.
This can certainly be analyzed even more.
I believe the update policy is one of these places where some non-negligible gains can still be produced.
And my expectation is that a better update policy should be dynamic, reacting to the "probability" that historic statistics may or may not match the statistics of the following block.
As can be guessed though, this is a fairly complex topic, which would require a dedicated (and time-consuming) study.
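To make the mechanism concrete, here is a minimal sketch of such a down-scaling step. It is illustrative only: the name and the exact squashing rule are assumptions, not the actual `ZSTD_scaleStats` implementation.

```c
#include <stddef.h>

/* Down-scale a frequency table only when its total exceeds 2^logTarget,
 * so tiny first-pass stats from a small block pass through untouched. */
static unsigned scaleStatsSketch(unsigned* freqs, size_t lastSymbol, unsigned logTarget)
{
    unsigned const target = 1u << logTarget;
    unsigned total = 0;
    size_t s;
    for (s = 0; s <= lastSymbol; s++) total += freqs[s];

    if (total <= target) return total;   /* small history: left as-is (no up-scaling) */

    {   unsigned shift = 1;
        while ((total >> shift) > target) shift++;   /* squash amplitude, keep the shape */
        total = 0;
        for (s = 0; s <= lastSymbol; s++) {
            freqs[s] = (freqs[s] >> shift) + 1;      /* +1 keeps every symbol reachable */
            total += freqs[s];
        }
    }
    return total;   /* the caller stores this as litSum / offCodeSum / ... */
}
```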
I wonder if
It's a good question. It's probably worth an investigation. But note that it's not clear if it's obviously better. Extending the topic, I've been considering some form of "light" initial evaluation for
As can be guessed, the most important point here is that these investigations cost time. And time is the scarcest resource there is. This PR doesn't "terminate" this topic, it mainly brings it to light. I'm open to any complementary PR that would improve these initial results.
I guess that's because
A trivial replacement of
edit: indeed,
edit 2: get it, that's because the script is downloading and comparing several versions of
make it able to process text output sent into either stdout or stderr
All remaining issues fixed
{   unsigned const baseOFCfreqs[MaxOff+1] = {
        6, 2, 1, 1, 2, 3, 4, 4,
        4, 3, 2, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1
    };
This is the change that makes the majority of the difference
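For illustration, here is a minimal sketch of how such a prior table could seed the parser's starting offset-code statistics. The struct and helper names are hypothetical; only `baseOFCfreqs` and `MaxOff` come from the diff above.

```c
#define MaxOff 31   /* assumption for this sketch: 32 offset codes, matching the 32-entry table */

typedef struct {
    unsigned offCodeFreq[MaxOff+1];
    unsigned offCodeSum;
} OptStatsSketch;   /* hypothetical stand-in for the optimal parser's stats */

static void seedOffsetCodeStats(OptStatsSketch* stats)
{
    static unsigned const baseOFCfreqs[MaxOff+1] = {
        6, 2, 1, 1, 2, 3, 4, 4,
        4, 3, 2, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1
    };
    unsigned sum = 0;
    int oc;
    for (oc = 0; oc <= MaxOff; oc++) {
        stats->offCodeFreq[oc] = baseOFCfreqs[oc];   /* small offsets & repeat codes start ahead */
        sum += baseOFCfreqs[oc];
    }
    stats->offCodeSum = sum;   /* a non-zero starting history instead of a blank page */
}
```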
This investigation was driven by #2765.
In essence, this user sample contains a file which, depending on early choices made by the optimal parser, ends up compressing or not, resulting in large relative differences (+100%).
This is a case of a self-reinforcing local optimum. In order to find the shortest path (i.e. the one which leads to better compression), the algorithm needs to evaluate the "cost" of its choices. For this sample to work, the algorithm must consider small matches as better than literals. If it does, its weight evaluation will favor future choices of the same kind. If not, it will keep considering them too expensive.
One of the known issues of the optimal parser is that it doesn't know which stats to start from. Over time, the process tends to self-regulate, but in the early stages, there are not enough statistics collected yet.
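To make the feedback loop concrete with deliberately invented numbers (illustrative only, not measurements from zstd): suppose the parser currently prices a 3-byte match at 14 bits (offset code, length code, extra bits) while 3 literals are priced at 3 × 4 = 12 bits. It emits the literals, literal frequencies grow, literals become even cheaper, and similar small matches keep losing the comparison. If the starting statistics instead price that match below 12 bits, the early decisions flip, the match statistics get reinforced, and the parser settles into the better-compressing path.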
The initially selected approach was to start from a blank page, with all choices essentially equivalent.
This is what this PR modifies.
Initial statistics are modified this way:

- The first change improves compression of `silesia.tar` cut into small blocks. It's not clear why; I suspect it merely corrects another side effect which tends to favor more literals than it should. This change, all by itself, fixes #2765.
- The second change improves compression of `silesia.tar` by a few dozens of KB. That might be because, due to the total cost including supplementary bits, the algorithm might already be biased enough (in the general case) in favor of small offsets and repeat codes. Still, since the impact was consistently positive, this change was kept. It also fixes, all by itself, #2765, proving there is more than one way to fix that one.
- The third change concerns the statistics rescaling, which matters notably for `btultra2`, which compresses a small block twice, the first time merely to collect statistics for the second time. The issue is that the general update algorithm was presumed to be invoked once every 128 KB block. On small inputs, it's invoked twice in a much shorter timeframe, having had no time to collect a "full block" of statistics; yet the rescaling factor is static, and will squash down statistics the same way, no matter whether they are numerous or not. In this new method, the rescaling is a bit more dynamic, trying to maintain a certain baseline budget between consecutive blocks, and adapting the squashing to the actual runtime amounts. One issue is that there is no "one size fits all": files with homogeneous entropy prefer slow adaptation, hence larger inter-block correlations, while files with heterogeneous sections prefer fast adaptation. I presume a more optimal solution should adapt the update rate depending on how likely statistics are to be representative of the following section. It's a topic for a full paper. For the time being, the simple heuristics selected in this PR will do (see benchmarks below).

So what are the benefits? Let's run some benchmarks:
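(For reproduction, the block-split scenarios below can presumably be driven by the CLI's benchmark mode, e.g. `zstd -b12 -e19 -B2048 silesia.tar`, where `-B#` cuts the input into independent blocks; the exact invocations behind these numbers are not shown in this PR, so treat this command as an assumption.)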
`silesia.tar`, cut into 2 KB blocks:

This round is consistently positive. It's even better when small matches of length 3 can be selected (levels 12+). The benefit is less pronounced on reaching `ultra2` levels. The generous interpretation is that `ultra2` was already designed to compensate for the "early statistics" problem, so it benefits less from this new round of improvement. Still, it benefits a little, which is almost surprising, and points at a self-reinforcing effect.

OK, but maybe 2 KB is too favorable a scenario, and the new statistics were designed for this one use case? Let's try some bigger 16 KB blocks then.
`silesia.tar`, cut into 16 KB blocks:

Well, with bigger blocks, the impact seems a little less pronounced, but it's still there, proving initial stats remain useful after 2 KB.

OK, fine for small blocks. What about large blocks? Full 128 KB?
`silesia.tar`, cut into 128 KB blocks:

Note that, for these sizes, `zstd_btopt` only starts at level 13.

Anyway, the benefit is now clearly reduced. But reassuringly, it's still globally positive, so it's still worth it.

Does this change impact full-length files, and if yes, how?
`silesia.tar` whole file:

Well, barely. This time, size differences are barely significant. And note that they are not always positive. But whether they are or not, it doesn't matter, as the compression ratio is essentially equivalent.

OK, but let's try a different file now. Maybe `calgary.tar`, which is a collection of small files of different types appended together, resulting in rapidly changing statistics. How does it behave with this new update policy?

`calgary.tar` whole file:
Well, this time, it's rather negative. But thankfully, by very little.
This is a case where the statistics update would benefit from being changed more actively, since two consecutive internal files can be very different (text, image, db, etc.). It could be fixed with a faster update rate, but alas, other use cases (like `silesia.tar`) would then suffer. This is a situation where selecting a "good enough" middle-ground heuristic is all we can do before moving on to some more complex algorithm.

Anyway, an important point: it's not always a win. This change is more targeted at small blocks (~2 KB), for which it's generally a win. For larger files, it's less clear, but the impact is also less pronounced.