You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current implementation buffers records, and writes them based on a flush
A more appropriate approach is IMHO to stream the records to file
In this model the RecordGrouper would not collect the records in the buffer, but indicate the logical stream of where the records should be written, and the caller ( the task) manages how the records are handled - added to a "logical stream"
This opens up several possibilities and improvements
files can be written when the logical stream has sufficient buffered recordsl, rather than waiting for a flush
files could be written in parts when there is sufficient data to write, but the file isn't full
memory pressure is reduced when records are written (maybe only after the file is closed)
the IO load is more distributed, (not maybe several hundred files being written when a flush occurs)
other possibilities
files can be written asynchronously, (ideally in virtual threads when we are building on a JVM that supports that) which makes the system more responsive (reduction in latency)
we can make a better back pressure model, that is more graceful
back pressure can be graceful, rather than a "stop the world action"
We can use requestFlush() when files have been written to update the metrics, rather than relying on timeouts and flushing everything
Issues
We would need to ensure that records are eventually written, by having some timeout, depending on the preCommitaction
some of the documented behaviour WRT to batching would change
file overwrite needs to be handled specially (that isn't in the PR below, but is simple) - enough.g. they you have a filename based on key rather than unique files such as partition and offset
A simple implementation of this is in #319, but this also has some other changes
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
-
The current implementation buffers records, and writes them based on a flush
A more appropriate approach is IMHO to stream the records to file
In this model the
RecordGrouper
would not collect the records in the buffer, but indicate the logical stream of where the records should be written, and the caller ( the task) manages how the records are handled - added to a "logical stream"This opens up several possibilities and improvements
other possibilities
We can use
requestFlush()
when files have been written to update the metrics, rather than relying on timeouts and flushing everythingIssues
preCommit
actionA simple implementation of this is in #319, but this also has some other changes
Beta Was this translation helpful? Give feedback.
All reactions