JpegDecoder: post-process baseline spectral data per MCU-row #1597
I can definitely get behind this. As I recall, other libraries work on a per-MCU basis, but I think per MCU row is fine. It's a shame this optimization is limited to sequential jpegs, but it's definitely worth the effort.
I hope to be able to have a look at this within ~3 weeks, unless someone else wants to take it earlier.
@br3aker is this something you'd be interested in? You've been doing amazing work in the encoder!
@JimBobSquarePants I can definitely get behind this after the PR with that deBruijn table I was talking about in the memory allocator PR.
Did a little study to understand the decoder architecture. Long story short: code explosion due to generic TPixel.
Converting spectral data to YCbCr and then to Rgba is a piece of cake: we know for sure that any supported jpeg contains spectral data yielding YCbCr colorspace values, which are then converted to Rgba for further colorspace conversion - no generics needed. Rgba -> TPixel is done via the existing bulk pixel conversion.
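For reference, the fixed YCbCr -> Rgb step in that chain is just the standard JFIF transform; a scalar sketch (the real code would be SIMD-vectorized):

static (float R, float G, float B) YCbCrToRgb(float y, float cb, float cr)
{
    // Chroma is stored with a +128 offset in JFIF
    cb -= 128f;
    cr -= 128f;
    float r = y + (1.402f * cr);
    float g = y - (0.344136f * cb) - (0.714136f * cr);
    float b = y + (1.772f * cb);
    return (r, g, b);
}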
There's no need to convert a full MCU row, as the PostProcessor converts them piece by piece, so it's unlikely to bring any performance benefits. Moreover, 4:2:0 needs to process two rows at the same time, so 2 full rows of spectral blocks must be buffered.

While machine code size can bloat at runtime for each pixel type decoded from jpeg, I don't think the majority of users would decode to more than 1-3 pixel types. @JimBobSquarePants @antonfirsov I might have overlooked something, but I'm almost confident this is the only way - can you elaborate on the final decision?
@br3aker I think we can solve this with a double-dispatch trick: a non-generic abstract converter with a generic implementation (see the sketch later in this thread). The recommendation to go MCU-row by MCU-row is mostly to avoid the overhead of virtual calls. Note that virtual methods can't be inlined by the JIT, so they should wrap reasonably large chunks of work.
Yep, was a bit delusional about the power of the JIT :D. Thanks for pointing that out. I thought that since TPixel is a struct, the JIT would compile the IL into an exclusive implementation for the exact TPixel type. Completely forgot that this specialization doesn't survive a non-generic virtual call boundary.

Won't be able to work for a couple of days but will definitely work on this, thanks for the double dispatch advice!
@JimBobSquarePants I meant one block at a time, but for a bulk of MCUs at the same time, so it would eliminate the virtual call overhead:
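Something like this (a sketch; ConvertBlock/ConvertBlocks are hypothetical names, not actual decoder API):

// One virtual call per 8x8 block - the overhead being argued against
foreach (Block8x8 block in decodedMcuBulk)
{
    converter.ConvertBlock(block);
}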
vs.
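// One virtual call per decoded bulk of MCUs - dispatch cost amortized
converter.ConvertBlocks(decodedMcuBulk);

Either way the per-block work is identical; batching just moves the virtual dispatch out of the inner loop.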
The only problem here is memory allocation for the MCU stride, especially for 4:2:0 subsampling, as it processes more blocks per decoding unit (4 luma + 2 chroma per MCU) per actual decode to spectral data. Maybe it's better to process in more granular bulks depending on some allocator size, like the 2MB limit from the last discussion, but it doesn't matter that much atm, so we can discuss it later when at least an MVP implementation is ready. The encoder actually has the same per-block call problem.
This is not how it works in our decoder. #1121 has a very detailed description of the pipeline, and refactor plans. I prefer doing things in smaller steps, and I think #1597 can be fixed without major changes to the pipeline. I strongly recommend the per-stride approach, because it's gonna be a less intrusive change, and the virtual methods will process data in larger bulks.

@JimBobSquarePants I assume libjpeg-turbo doesn't have many virtual calls in their pipeline; it's less modular, doing things in big, super-optimized, hardcoded steps. We can switch to this kind of approach in a future integer-only pipeline, but I'm quite sure that with our current architecture it's gonna be suboptimal.
I don't think it's a problem. We need to look at the big picture: the memory allocated for a trio of MCU rows (or even row-groups) is a small fraction of what the whole image buffer takes.
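Back-of-the-envelope, assuming a 4000x3000 4:2:0 jpeg and 128-byte Block8x8 blocks (numbers are mine, for illustration): the full-image spectral buffers hold 187,500 luma + 2 x 46,875 chroma blocks, about 36 MB, whereas a sliding window of one MCU row (two luma block rows plus one block row per chroma component) is about 1,500 blocks, roughly 0.2 MB - and the decoded Rgba32 frame alone is 48 MB.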
My example pseudo code was purely to illustrate why the suggested per-block conversion is bad. I've read through the entire current decoding pipeline and actually have a question: what's the point of the separate post-processing step? Don't get me wrong, I understand that it keeps concerns separated, but imo it would be clearer to do everything in the scan decoder.
If I understood you correctly, your approach is the same, but it quits stream parsing and fetches MCU strides during the post-process step, so there's no difference in theory. In practice, all of that spectral data is scattered among the per-component spectral buffers, while with my proposed solution everything is done within the scan decoder itself.

And one last thing - the jpeg encoder already works the way I've proposed for the decoder, so this would be a mirror copy of an already existing internal API.
The purpose is separation of concerns & a bit historical. In #274 we replaced the entire Huffman coding logic without touching the (already SIMD-optimized) color conversion, by porting part of another codebase. I'm fine with an architectural change here. We don't care about .NET Framework performance anymore, so if you have the motivation and the time to replace the dinosaur stuff with something better and faster, I'm not gonna be in the way. Although it might be worth considering doing things in more granular steps, and at least keeping the float conversion logic in the first refactor PR. I leave it up to you :)
Thanks for your time explaining how the current architecture was put together, interesting stuff! The transition could be something like this:

1. Rework the internal decoder API around per-MCU-row conversion, keeping the current float conversion logic
2. Replace the conversion logic itself in a follow-up
While step 1 might seem pointless, it would prepare the internal decoder API for step 2 without changing the actual spectral decoding logic.
Question: would we ever consider adding support for arithmetic encoding (rare and currently unsupported)?
Not sure if anyone would ever use arithmetic encoding; it's a lot slower and does not provide much storage optimization. For comparison, an image-specific huffman tree (the optimize_coding flag in libjpeg) produces a roughly 5% smaller image. Note that the decoder already supports any huffman tree; the only thing left is to implement it in the encoder, which is not that complicated and won't change anything architecture-wise (it would add a flag to the public API though). And it will affect only encoder performance, as you need to traverse the entire image before actually encoding it. The decoder should not suffer any performance penalty - I would even dare to say it should be a bit faster to decode an optimized image than a default one. I can't provide such numbers for arithmetic coding, but I've heard it gains somewhere around 10-12% in the best case compared to huffman.
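For context, optimize_coding is just a two-pass scheme; schematically (helper names here are hypothetical, not libjpeg or ImageSharp API):

// Pass 1: histogram the huffman symbols actually emitted by this image
int[] frequencies = new int[256];
foreach (Block8x8 block in quantizedBlocks)
{
    CountSymbols(block, frequencies);
}

// Build an image-specific code table from the histogram
// instead of using the spec's example tables
HuffmanTable optimized = BuildTableFromFrequencies(frequencies);

// Pass 2: encode the same coefficients with the tailored, shorter codes
foreach (Block8x8 block in quantizedBlocks)
{
    EncodeBlock(block, optimized);
}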
Makes sense to me, thanks! That's a good link you've shared. I keep meaning to look up the tables used in mozjpeg, as I'm sure I heard they deviated from the standard.
I heard that too; they even had some experiments on quantization tables, for example: mozilla/mozjpeg#76. Feeling like a kid in a candy store to be honest, I want to work on everything at the same time 😄
@br3aker regarding #1597 (comment), I want to see a more detailed plan on the way you are going to change the conversion pipeline. With a full reimplementation, we are talking about several dozens of working hours IMO. For example #1462 is a tiny step contributing to #1121, which alone took one evening + one full day. I really want to avoid blocking #1597 on an entire reimplementation of the conversion chain, and to keep these two things separate. (Not talking about the PostProcessor, HuffmanScanDecoder, PixelConverter or whatever classes that put the pieces together, but rather the theoretical pipeline and the individual pieces in the way described in #1121's opening comment.)
@antonfirsov sorry for the late response, had some really busy days. I won't alter the conversion pipeline with this issue fix. Let's define a pseudo workflow.

Now:
1. Parse the stream, huffman-decoding the entire image into a full spectral buffer
2. Post-process the full spectral buffer into the resulting pixel buffer
This is the workflow for any type of jpeg, either progressive or baseline, interleaved or not.

Proposal:
1. Parse the stream, huffman-decoding one MCU row at a time
2. For baseline jpegs, convert each decoded spectral row to pixels immediately, then continue parsing
Basically, nothing changed. Right now there are 2 steps: decode the whole stream to spectral data, then convert all of it to pixels.
The proposed approach does it in 1 single step for baseline: each MCU row is decoded to spectral data and converted to pixels on the spot.
For progressive nothing would change: all scans are decoded to the full spectral buffer first, and conversion happens at the end.
To be honest, I thought it would be a lot easier to do, but it's a lot of work. Not abandoning this, just saying it will take some more time :P
@br3aker thanks for clarifying! Looks like I misunderstood you because of one line in your comment.
I totally understand this; it was 35 ℃ here up until yesterday, I was frustrated and brainless all the time :)
Take whatever time is needed, the issue is officially in the "Future" milestone for a reason :) Your plan looks good to me now conceptually. However, when it comes to code style, I still think it's worth implementing the double-dispatch trick. Such a trick can also prevent the generic code explosion mentioned earlier:

abstract class SpectralToImageConverter
{
    // Add all the marker parameters which are needed for decoding in this type instead of HuffmanScanDecoder
public abstract void DecodeFromSpectral(spectralRow, rowIndex);
}
class SpectralToImageConverter<TPixel> : SpectralToImageConverter
{
private Image<TPixel> image;
public override void DecodeFromSpectral(spectralRow, rowIndex){
// You can use TPixel now
image[rowIndex] = DecodeImpl(spectralRow);
}
}
class HuffmanScanDecoder
{
// Use the non-generic base type:
private SpectralToImageConverter spectralConverter;
public DecodeBaseline()
{
spectralRow = new[RowSizeInBlocks]
foreach mcuRow in Image:
DecodeToSpectral(from: mcuRow, dest: spectralRow);
spectralConverter.DecodeFromSpectral(spectralRow, i);
}
}
class JpegDecoderCore
{
private SpectralToImageConverter spectralConverter;
public void Decode<TPixel>()
{
Image<TPixel> image = ...;
// ...
this.spectralConverter = new SpectralToImageConverter<TPixel>(image, ...);
ParseStream();
}
private void ProcessStartOfScan()
{
        HuffmanScanDecoder scanDecoder = new HuffmanScanDecoder(this.spectralConverter, ...);
scanDecoder.DecodeBaseline();
}
}

What do you think?
@antonfirsov actually I had almost the same plan in mind.
@antonfirsov encountered a little dilemma: for baseline jpeg files the proposed conversion is straightforward, as it's known exactly when the spectral stride of each component is ready. For progressive jpegs the spectral data is complete only after the entire stream is parsed, so the conversion has to be triggered afterwards:

public Image<TPixel> Decode<TPixel>(BufferedReadStream stream, CancellationToken cancellationToken)
where TPixel : unmanaged, IPixel<TPixel>
{
// this is still WIP but final variant would look somewhat like this
var specificConverter = new SpectralToImageConverter<TPixel>(this.Configuration);
this.spectralConverter = specificConverter;
this.ParseStream(stream, cancellationToken: cancellationToken);
this.InitExifProfile();
this.InitIccProfile();
this.InitIptcProfile();
this.InitDerivedMetadataProperties();
// this looks out of place to be honest
if (/* This jpeg is progressive */)
{
specificConverter.ConvertFullScan();
}
return new Image<TPixel>(this.Configuration, this.Metadata, new[] { specificConverter.ImageFrame });
}

Another solution is to commit spectral data to the converter before returning from ParseStream(). While both solutions look a bit 'ugly', this is the most performant way of checking whether a deferred full-scan conversion is needed.
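The second option would look roughly like this (a sketch; the commit hook is a hypothetical name):

private void ParseStream(BufferedReadStream stream, CancellationToken cancellationToken)
{
    // ... marker loop, scan decoding ...

    // No-op for baseline (already converted per MCU row),
    // full-scan conversion for progressive
    this.spectralConverter.CommitRemainingSpectral();
}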
@br3aker I like the plan with the pseudo code above.
@antonfirsov I will hide the if-check in the property getter for visual clarity then. Thanks for the response!
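Presumably something like this (a sketch; member names are illustrative):

class SpectralToImageConverter<TPixel> : SpectralToImageConverter
{
    private ImageFrame<TPixel> imageFrame;
    private bool fullScanConverted;

    public ImageFrame<TPixel> ImageFrame
    {
        get
        {
            // Hidden here instead of in Decode<TPixel>():
            // baseline jpegs set the flag during per-row decoding,
            // progressive jpegs convert the full scan on first access
            if (!this.fullScanConverted)
            {
                this.ConvertFullScan();
                this.fullScanConverted = true;
            }

            return this.imageFrame;
        }
    }
}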
A little update on this: I redid a lot of code, screwed something up, and couldn't find out why in a couple of hours + got a new idea which should be a little more understandable.

First of all, the resulting image would be constructed from the converter's internal frame. Second, I've decided to implement this as an enumerable collection of spectral strides:

// There would actually be some wrapping class for strides
// so it would store all components' spectral strides in a single object
foreach(Buffer2D<Block8x8> spectralStride in scanDecoder)
{
// spectral -> vector4
    Buffer2D<Vector4> colorBuffer = ConvertFromSpectralToVector4(spectralStride);
// vector4 -> TPixel
PixelBuffer[i] = ConvertFromVector4<TPixel>(colorBuffer);
}

For every decoding mode it won't change anything (except there won't be an extra virtual call for each stride conversion), but for baseline dct it would allow deferring stream parsing stride by stride.
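The deferred parsing could be expressed with a C# iterator, along these lines (a sketch; member names are illustrative):

class HuffmanScanDecoder : IEnumerable<Buffer2D<Block8x8>>
{
    public IEnumerator<Buffer2D<Block8x8>> GetEnumerator()
    {
        for (int row = 0; row < this.mcuRowCount; row++)
        {
            // Stream parsing happens lazily: the next MCU row is
            // huffman-decoded only when the consumer advances the enumerator
            this.DecodeMcuRowToSpectral(this.spectralStride);
            yield return this.spectralStride;
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => this.GetEnumerator();
}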
If you have a working PR, that would be a good trigger to push things into a decision; otherwise it's just endless back-and-forth.
I wonder how this works with multiple SOS markers?
We definitely need to cater for that since that's how progressive jpegs work. I would open a draft PR where we can discuss the actual implementation.
That's not that hard to determine actually. Multiple SOS markers can exist only in:

- progressive jpegs, where each scan refines the spectral data
- non-interleaved baseline jpegs, where each component is coded in its own scan
In other words:

if (!this.Frame.IsProgressive && this.Frame.ComponentCount == scanComponentCount)
{
// this SOS must be the only one, any extra is an error and can be checked after spectral decoding
this.scanDecoder.Baseline = true;
// we can return true to signal that we are ready for spectral conversion
return true;
}
// decodes current partial scan to the pre-allocated spectral buffer
this.scanDecoder.DecodeScan();
// for more consistent behaviour we can actually evaluate if multi-sos jpeg is done
// via spectralEnd == 63 for progressive jpegs
// via processedScans == this.Frame.ComponentCount for non-interleaved baseline jpegs
return lastScanCondition;

There's a problem if the given jpeg has anything after SOS except EOI, and we can actually check even that - we can call ParseStream() one more time.
Nevermind, the current architecture & code are not in good shape for my plan; I'll shelve it for now and focus on other priorities first.
Sorry for this rapid change of ideas & messages; yesterday's discard of almost-working code knocked me hard. Will try to work out an implementation in a couple of days.
No need to apologize, I feel you mate. At some point I need to replay months of optimization code I wrote for a zlib stream implementation, because somewhere along the way I broke it and I have no idea when. 😞 Really looking forward to seeing what you come up with!
@antonfirsov @JimBobSquarePants sorry for bothering, but I have a little problem. The PR is actually ready with almost all tests passing. The only problem is the spectral tests for baseline jpegs: these check that decoded spectral data is equal to libjpeg's spectral data for the given image. That approach is impossible now, as spectral data is discarded deep inside the scan decoding process. Progressive and multi-scan baseline jpegs can still be tested simply because they use the same technique as before the PR.

Question: do we even need to test spectral data? We can compare final colors, which would be invalid if the spectral data were invalid. Yes, it's a couple of layers 'higher', but right now it's impossible to test otherwise.
The good thing about verifying spectral data is that the spectral intermediate result is exact, while for color conversion small deviations are allowed. It's important to be able to catch cases when a small difference is the result of a Huffman-decoding bug and not a floating point inaccuracy. We faced and fixed such issues while finalizing #274, and while I don't expect a refactor of that volume anytime soon, the tests can still be handy for Huffman decoding optimizations, so I prefer to keep them in the long term.

On the other hand, this should not block progress, so my recommendation is to temporarily disable them, and re-enable them when the Enumerable refactor is done. @JimBobSquarePants agreed?
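The distinction in test terms, roughly (a sketch; the tolerant comparer mirrors the test suite's approach, exact helper names may differ):

// Spectral data must match the libjpeg reference bit-for-bit:
// any mismatch is a Huffman decoding bug, not float noise
Assert.Equal(referenceBlock, decodedBlock);

// Color output is only compared within a tolerance, so a subtle
// Huffman bug could hide behind legitimate float/SIMD deviations
ImageComparer.Tolerant(0.005f).VerifySimilarity(referenceImage, decodedImage);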
@antonfirsov skipping the tests now seems like an easy plan, but you know... I will slightly alter the current PR architecture to enable testing without major changes, so it won't rely on a 'somewhat possible enumerable implementation in the future'.
Not sure I follow what you mean here? Do you mean that this is not an issue now? I'm happy to temporarily disable the baseline tests for now simply to see the difference.
It is a problem, because I wanted to change as little code as possible for this PR so you guys won't spend too much time reviewing. Fixing these tests would result in a bigger change than necessary for the PR to work. I will disable them and push a draft PR then.
Currently, Huffman decoding (done by HuffmanScanDecoder) is strictly separated from postprocessing/color conversion (done by JpegImagePostProcessor) for simplicity. This means that JpegComponent.SpectralBlocks are allocated upfront for the whole image.

I did a second round of memory profiling using SimpleGcMemoryAllocator to get rid of pooling for more auditable results. This shows that SpectralBlocks are responsible for the majority of our memory allocation:

[profiler screenshot]

This can be eliminated with some non-trivial, but still limited refactoring:

- JpegDecoderCore and HuffmanScanDecoder need a mode where JpegComponent.SpectralBlocks is interpreted as a sliding window of blocks instead of the full set of decoded spectral blocks
- HuffmanScanDecoder can then push the rows of the sliding window, directly calling an instance of JpegComponentPostprocessor at the end of its MCU-row decoding loop (sketched below)