Use SHA for BLOB update instead of modification time #3697
Conversation
@Override
public long getVersion() throws IOException {
    try (FileInputStream fis = new FileInputStream(path)) {
        byte[] bytes = DigestUtils.sha1(fis);
What do you think about using something such as SHA-256 or SHA-512 to avoid (unlikely) collisions?
Thank you for the suggestion, I've updated it to SHA-256.
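For reference, a minimal sketch of what a SHA-256-based `getVersion()` could look like using only the JDK's `MessageDigest` (the PR itself uses commons-codec `DigestUtils`). The class name and the choice of deriving the `long` version from the first 8 bytes of the digest are assumptions for illustration, not the PR's actual code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class BlobVersion {
    // Hypothetical sketch: derive a long version from the SHA-256 digest
    // of the file contents, so the version only changes when the content does.
    public static long getVersion(Path path) throws IOException {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            try (InputStream in = Files.newInputStream(path)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            }
            byte[] digest = md.digest();
            // Assumption: fold the first 8 bytes (big-endian) into the long version.
            long version = 0L;
            for (int i = 0; i < 8; i++) {
                version = (version << 8) | (digest[i] & 0xffL);
            }
            return version;
        } catch (NoSuchAlgorithmException e) {
            throw new IOException(e); // SHA-256 is guaranteed on all JVMs
        }
    }
}
```

Because the version is content-derived, two files with identical bytes report the same version regardless of when they were written.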
Do we have a feeling for how often getVersion(...) is called? Computing a SHA hash is rather expensive compared to reading the modification date (just think about whether we need caching after the first call, or the like)?
If it is called often, perhaps we can use something such as MurmurHash, which is used elsewhere in the code for the sharding of tuples.
This is run by the AsyncLocalizer at every interval defined by supervisor.localizer.update.blob.interval.secs, so it will impact the supervisor rather than the worker. We wouldn't need to cache it, but we can nevertheless add a cache.
Since this runs continuously, we can opt for a Murmur hash, which prioritises fast hashing (suggested by @reiabreu), and that way we can avoid caching.
Works for me
Hey, after a brief discussion we've decided to go with Checksum instead of Murmur, since the checksum computation is faster. The commit with the changes follows.
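A minimal sketch of a checksum-based version using the JDK's `java.util.zip.CRC32` (one concrete implementation of Java's `Checksum` interface; the class and method names here are hypothetical, and the PR's actual checksum choice may differ). CRC32 is typically hardware-accelerated on modern JVMs, which is why it can beat both cryptographic digests and Murmur on throughput:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

public class BlobChecksumVersion {
    // Hypothetical sketch: derive the blob version from a CRC32 checksum
    // of the file contents instead of the modification time.
    public static long getVersion(Path path) throws IOException {
        CRC32 crc = new CRC32();
        try (InputStream in = Files.newInputStream(path)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                crc.update(buf, 0, n);
            }
        }
        // CRC32 yields a 32-bit value, returned here as an unsigned long.
        return crc.getValue();
    }
}
```

A 32-bit checksum has a higher collision probability than SHA-256, but for detecting whether a locally stored blob changed between localizer polls it is generally a reasonable speed/robustness trade-off.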
Makes sense to use a checksum.
I've approved the changes.
Don't merge it yet, as I'm still running some tests in which some flakiness has surfaced.
Everything tested, we can proceed with the merge.
@rzo1 do you want to re-examine the PR?
lgtm. Thanks for the PR.
What is the purpose of the change
Issue[4077]
When deploying Nimbus or changing the leadership within a high-availability Nimbus cluster, we've verified that the topologies' workers are killed due to differing modification times.
Because the modification time is used as the version, we have found that, while using the LocalFsBlobStoreFile, every time the Nimbus leader goes down the BLOB modification times change, so the reported versions differ and the workers are killed.
In this PR, we've introduced a new feature that uses the SHA of the file instead of the modification time. With this feature, when a Nimbus loses the leadership the workers continue running, because the BLOB's version stays the same: the BLOB itself is unchanged and so is its corresponding SHA.
How was the change tested
Unit Tests
Tested locally with a specific jar on my local Storm: forced Nimbus to change leadership, and the workers on the topologies continued to work properly.