Recover corrupted GZIP files, setup of Java-Ruby mixed environment. #249
Conversation
…ments to public class/methods.
---
First pass through
} catch (IOException exception) {
    // raise an exception if expected exception is not end of ZLIB related.
    if (CORRUPTED_FILE_ERROR.equals(exception.getMessage()) == false) {
        logger.error("Error occurred while compressing the file, error={}", exception.getMessage());
---
Given the javadoc, is the intent to only log here? This will continue the upload to S3 in the event of an unrecoverable gzip file.
The question here is: what do we want to do in the event of a corrupt/unrecoverable gzip file? Upload it to S3 anyway, keep it locally (maybe with a .damaged extension?), or discard it?
---
Well, so far the only case where a GZIP file might be corrupted is the no-tail case, where the stream is left open when LS crashes. If other cases happen, LS currently leaves the file as it is, and users can find it under the temporary path; no delete/upload action is taken. Most importantly, we don't crash LS, and we let it continue.
Let me know your thoughts on whether we can improve the situation.
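For context, here is a minimal Ruby sketch of the failure mode being discussed (the helper name is hypothetical, not from this PR): a gzip file whose 8-byte CRC32/ISIZE trailer is missing - as happens when a crash leaves the stream open - fails validation when read back. The Java side reports the analogous condition as "Unexpected end of ZLIB input stream"; Ruby's zlib raises its own `Zlib::Error` subclasses.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper (illustration only): true if the gzip bytes
# cannot be decompressed all the way through a valid trailer.
def gzip_corrupted?(bytes)
  gz = Zlib::GzipReader.new(StringIO.new(bytes))
  begin
    gz.read   # inflate everything
    gz.close  # the CRC32/ISIZE trailer check completes by close at the latest
    false
  rescue Zlib::Error
    true
  end
end

intact   = Zlib.gzip("buffered events\n" * 50)
tailless = intact[0...-8] # a crash before close leaves no 8-byte trailer
```

Dropping only the last 8 bytes leaves the deflate stream itself complete, which is exactly the "no tail" shape described above.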
---
Unless I'm missing something, I don't think LS will leave the file as it is. With this change, the error will be logged, but processing will continue as normal - uploading the corrupt file to S3 and deleting the file from local storage.
---
The file path changes when a corrupted file is recovered (file_path = self.recover(file_path)) and a new temp file is returned (with create_from_existing_file). If LS does not recover the file, its size is still zero (in the temporary file) and it does not remove the source file.
---
If the file size is 0, the temporary file and the original "corrupt" file will be deleted - the process of deleting a "temporary file" also deletes the folder it is a part of.
---
Oh, it is because of the temp folder... yes, agreed that the temp folder will be removed. Thanks for letting me know - reproducing such a case is not easy. What do you think about replacing the size-check logic with the following?
# s3.rb:404L
if temp_file.size > 0
  @logger.debug? && @logger.debug("Recovering from crash and uploading", :path => temp_file.path)
  @crash_uploader.upload_async(temp_file,
                               :on_complete => method(:clean_temporary_file),
                               :upload_options => upload_options)
end
if @encoding != "gzip" && temp_file.size != 0
  clean_temporary_file(temp_file)
end
lib/logstash/outputs/s3.rb
.select { |file| ::File.file?(file) }
.each do |file|
  temp_file = TemporaryFile.create_from_existing_file(file, temp_folder_path)
.select { |file_path| ::File.file?(file_path) }
---
We will need to figure out a way, between the recovery code and this glob, to ignore files that are "recovery in progress". Currently, in the event of a failure/crash during recovery, this glob definition will pick up the original file to be recovered plus any files in mid-recovery - i.e. ls.s3<>part0.txt.gz and ls.s3<>.part0-recovered.txt.gz. This can lead to a few different outcomes: duplicate data being uploaded to S3; potentially multiple -recovered.txt and -recovered.txt.gz extensions; and, in the event one of these files is empty, clean_temporary_file(temp_file) will be called, which deletes the entire sub-directory that all of these files are stored in.
---
Is Dir.glob().select{} async for this case?
The possible cases would be:
- multiple LS instances accessing the same S3 temp path - but, as we discussed offline, the recommendation would be to provide a separate path for each instance.
- unrecoverable zip files which always stay.
So, I have added logic that figures out which files are under the recovery process (under_recovery_files) and skips a file if it is in this list.
 */
public static void recover(String sourcePath, String targetPath) {
    try {
        GzipUtil.decompress(Paths.get(sourcePath), Paths.get(targetPath));
---
If the recovery returns 0 bytes, then we should probably skip the rest of the recovery process, to avoid uploading an empty zip file.
---
This will be covered by s3.rb:L405~L413.
---
It is possible for an "empty" gzip file to be created which includes the header but no content. I don't think that will be covered by the above case, and it will create an empty recovery file and upload a gzip version.
(I could make this happen with input { generator { count => 250 }})
---
Thanks!
Technically, a gzip file is created as soon as input is received, which means the header is written at initial gzip creation. I am not sure how possible this is in our scenario, but I manually confirmed/tested the possible empty-zip cases and updated the PR.
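To make the reviewer's point concrete, here is a small sketch: a gzip stream that is opened and finalized without any payload still occupies roughly 20 bytes on disk (header, empty deflate block, trailer), so a size > 0 check on the compressed file cannot distinguish it from real data.

```ruby
require "zlib"
require "tmpdir"

# An "empty" gzip: opened and closed with no payload ever written.
path = ::File.join(Dir.mktmpdir, "ls.s3.empty.txt.gz")
Zlib::GzipWriter.open(path) { |gz| } # header + trailer only, no events

compressed_size = ::File.size(path)                         # nonzero despite no content
payload = Zlib::GzipReader.open(path) { |gz| gz.read.to_s } # decompresses to ""
```

This is why the decompressed size (or a check after recovery), rather than the on-disk size of the .gz file, has to drive the "skip empty upload" decision.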
… when recovery fails.
lib/logstash/outputs/s3.rb
                             :upload_options => upload_options)
end
# do not remove if Logstash tries to recover but fails
if @encoding != GZIP_ENCODING && temp_file.size != 0
---
I don't think this logic is correct for uploading non-gzip based recovery - I think you may end up cleaning the temporary file here, which may be before upload_async is complete.
---
I think it might be worthwhile to see some tests here - what happens with:
- standard plain text recovery
- standard gzip recovery
- empty plain text recovery
- empty gzip recovery
---
- Wooh, no idea why I removed the else case - thanks for the notice, that would have been a crash!
- Test cases: it seems the default (none or plain txt) encoding cases are covered in the integration test, restore_from_crash_spec.rb, and I have added the cases where Logstash restores the file, uploads it, and removes it locally.
- Run the integration test; as we synced offline, make sure to set the AWS ENV params, including AWS_SESSION_TOKEN:
  bundle exec rspec logstash-plugins/logstash-output-s3/spec/integration/restore_from_crash_spec.rb:69 -t integration:true
- Result logs:
  Finished in 1 minute 22.39 seconds (files took 9.97 seconds to load)
  3 examples, 0 failures
---
LGTM. Nice job getting this through!
Hello, I saw that the fix code #249 has been merged into the logstash-plugins/logstash-integration-aws repository. Could you let me know when this fix will be merged into the logstash-plugins/logstash-output-s3 repository, or if it has already been fixed in a particular version of Logstash?
Issue description
When using the GZIP encoding option with the AWS S3 output plugin, there are cases where Logstash may crash. When Logstash crashes, the GZIP stream is left open and no tail exists in the file. Logstash uploads the corrupted file to S3 at restart, but customers who download and use the S3 file discover that it is corrupted. This PR aims to recover the corrupted file at restart time and upload a healthy GZIP file to S3.
FYI: I have requested details on where/when/how Logstash crashed; I will investigate once I get a response.
Acceptance Criteria
Logstash should always upload healthy GZIP/plain text files to AWS S3.
Solution explanation
Look at the note section for details: option 4 is recommended and implemented.
Testing
Unit testing
The GzipUtilTest unit test class covers recover, compress, and decompress success and failure cases.
E2E testing
- Crash Logstash with kill -9 PID, where PID can be fetched with ps -la | grep logstash.
- Download the uploaded object: aws s3 cp s3://logstash-mashhur-test/path-to-file.txt.gz /local/path/to/download/
- Run gunzip downloaded-file.txt.gz and open the file.
Additional Notes
Discussed solutions:
1. [Gzip recovery] Pull out the GZIP header and decompress the healthy blocks until we find a tail, in plain Ruby. Ruby's GZIP interfaces (GzipReader, GzipWriter) do not allow access to GZIP blocks/headers byte-by-byte; they always validate the file on any read action. - NOT FEASIBLE.
2. Logstash writes plain text and zips the file only when uploading to S3. With output -> s3 -> size_file, the users' intention is the ZIP file size, not the size of the ongoing plain text. Since Logstash loses accuracy, this option would be a low priority.
3. Use an external tool: gz-recover works as expected and is OSS. However, for the long term (e.g. if security vulnerabilities are found) this option would cost a lot, since we may need to contribute to it or suffer from poor maintenance. - low priority.
4. Java's GZIPInputStream interface behaves the same whenever the stream starts, which means we don't need explicit header or tail lookups. Instead, capturing the "Unexpected end of ZLIB input stream" exception satisfies our condition. Some plugins (input-beats, input-http) have already taken the initiative to move to such an environment, so this change also includes the Java-Ruby mixin env setups.