gzip files with 'dX %' in it are not detected by mime type #14

Open
thijsvandergugten opened this issue Jun 7, 2021 · 1 comment
Labels
bug Something isn't working

thijsvandergugten commented Jun 7, 2021

Logstash information:

  1. Logstash version: 7.11.1
  2. Logstash installation source: docker (docker.elastic.co/logstash/logstash-oss:7.11.1)
  3. How is Logstash being run: docker
  4. How was the Logstash Plugin installed: with the line RUN logstash-plugin install logstash-input-google_cloud_storage in the Dockerfile

JVM (e.g. java -version): OpenJDK 64-Bit Server VM 11.0.8+10

OS version: Ubuntu 18.04 LTS

Description of the problem including expected versus actual behavior:

If a gzip file contains the magic string 'dX %', it is not unpacked: it is detected as audio/vnd.dts.hd instead of application/gzip, so it is read as plain text and the plugin fails with the error shown in the logs. In https://github.com/logstash-plugins/logstash-input-google_cloud_storage/blob/master/lib/logstash/inputs/cloud_storage/file_reader.rb#L26, the snippet

def self.gzip?(filename)
  magic = MimeMagic.by_magic(::File.open(filename))
  magic ? magic.subtype == "gzip" : false
end

uses code from https://github.com/mimemagicrb/mimemagic/blob/master/lib/mimemagic.rb#L84. As far as I can see, whenever a file contains the magic string 'dX %', it is recognized as audio/vnd.dts.hd, so magic.subtype is not "gzip" and gzip? returns false.
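One way to sidestep the ambiguity, sketched here only as an illustration, would be to check the two-byte gzip signature (0x1f 0x8b, per RFC 1952) directly instead of asking MimeMagic. The helper name `gzip_by_signature?` is hypothetical and is not part of the plugin; the demo writes a gzip file with compression disabled so that the literal bytes 'dX %' appear inside it.

```ruby
require "zlib"
require "tempfile"

# Hypothetical helper (not part of the plugin): treat a file as gzip
# when it starts with the gzip signature bytes 0x1f 0x8b.
GZIP_SIGNATURE = "\x1f\x8b".b

def gzip_by_signature?(filename)
  # Block form closes the file handle; read(2) returns nil for an empty file.
  ::File.open(filename, "rb") { |f| f.read(2) == GZIP_SIGNATURE }
end

# Demo: a valid gzip file whose body contains 'dX %' verbatim.
Tempfile.create(["sample", ".log.gz"]) do |tmp|
  tmp.binmode
  gz = Zlib::GzipWriter.new(tmp, Zlib::NO_COMPRESSION)
  gz.write("dX % some log line\n")
  gz.finish # finishes the gzip stream without closing tmp
  tmp.flush
  puts gzip_by_signature?(tmp.path) # prints "true"
end
```

This check only looks at the first two bytes, so it does not validate the rest of the stream; a truncated or corrupt gzip file would still fail later during decompression.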

Steps to reproduce:

  1. Try to parse a gzip file that contains the string 'dX %' (the magic string for the file type audio/vnd.dts.hd)
  2. Observe the logging below.

Logs:

[2020-10-02T16:16:53,500][ERROR][logstash.javapipeline    ][main][db1dc633a0e5eeb4e59aa152d277f51da22d98b38e631f12070531c57eaeabe8] A plugin had an unrecoverable error. Will restart this plugin.
  Pipeline_id:main
  Plugin: <LogStash::Inputs::GoogleCloudStorage bucket_id=>"...", json_key_file=>"...", codec=><LogStash::Codecs::JSONLines id=>"json_lines_b7600074-64a5-4ec4-b5b6-ab34acb20332", enable_metric=>true, charset=>"UTF-8", delimiter=>"\n">, interval=>300, id=>"db1dc633a0e5eeb4e59aa152d277f51da22d98b38e631f12070531c57eaeabe8", delete=>true, file_matches=>".*log.gz", enable_metric=>true, file_exclude=>"^$", metadata_key=>"x-goog-meta-ls-gcs-input", unpack_gzip=>true, temp_directory=>"/tmp/ls-in-gcs">
  Error: invalid byte sequence in UTF-8
  Exception: ArgumentError
  Stack: org/jruby/RubyString.java:4225:in `split'
org/logstash/common/BufferedTokenizerExt.java:78:in `extract'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-codec-json_lines-3.0.6/lib/logstash/codecs/json_lines.rb:40:in `decode'
/usr/share/logstash/logstash-core/lib/logstash/codecs/delegator.rb:62:in `block in decode'
org/logstash/instrument/metrics/AbstractSimpleMetricExt.java:65:in `time'
org/logstash/instrument/metrics/AbstractNamespacedMetricExt.java:64:in `time'
/usr/share/logstash/logstash-core/lib/logstash/codecs/delegator.rb:61:in `decode'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:111:in `extract_event'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:97:in `block in download_and_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/file_reader.rb:33:in `block in read_plain_lines'
org/jruby/RubyIO.java:3329:in `each'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/file_reader.rb:32:in `read_plain_lines'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/file_reader.rb:20:in `read_lines'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:96:in `block in download_and_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/blob_adapter.rb:72:in `with_downloaded'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:93:in `download_and_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:70:in `block in list_download_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:86:in `block in list_processable_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/client.rb:23:in `block in list_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/cloud_storage/client.rb:22:in `list_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:85:in `list_processable_blobs'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:68:in `list_download_process'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:61:in `block in run'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/stud-0.0.23/lib/stud/interval.rb:20:in `interval'
/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-google_cloud_storage-0.11.1-java/lib/logstash/inputs/google_cloud_storage.rb:60:in `run'
/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:346:in `inputworker'
/usr/share/logstash/logstash-core/lib/logstash/java_pipeline.rb:337:in `block in start_input'
@thijsvandergugten thijsvandergugten added the bug Something isn't working label Jun 7, 2021

daxxog commented Feb 4, 2022

FWIW I was able to produce a dirty Dockerfile patch which appears to resolve this issue.

RUN cat /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.rb | \
    sed 's/common_types = \[/common_types = \["application\/gzip",/g' | \
    tee /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.patched.rb \
    && mv /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.patched.rb \
    /usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/mimemagic-0.4.3/lib/mimemagic/tables.rb

Related to mimemagicrb/mimemagic#36. I think it has to do with the "priority" order in which mime magic checks are performed. As used in this Logstash plugin, a file is either gzip or it is not, so my patch just puts gzip at the top of the "common types" list. An issue should probably be opened in mimemagicrb/mimemagic about this.
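To illustrate why ordering could matter, here is a toy model of first-match detection. It is an assumption-laden sketch, not mimemagic's actual tables or matching code: the two rules and the `first_match` helper are made up for the example, and the real 'dX %' rule may be restricted to particular offsets.

```ruby
# Toy rules: [mime type, matcher]. Illustrative only, not mimemagic's tables.
DTS_RULE  = ["audio/vnd.dts.hd", ->(bytes) { bytes.include?("dX %".b) }]
GZIP_RULE = ["application/gzip", ->(bytes) { bytes.start_with?("\x1f\x8b".b) }]

# Return the type of the first rule whose matcher accepts the bytes,
# or nil when no rule matches.
def first_match(rules, bytes)
  rule = rules.find { |_type, matcher| matcher.call(bytes) }
  rule && rule.first
end

# A gzip-like header followed by data that happens to contain 'dX %'.
sample = "\x1f\x8b\x08\x00".b + "....dX %....".b

puts first_match([DTS_RULE, GZIP_RULE], sample) # prints "audio/vnd.dts.hd"
puts first_match([GZIP_RULE, DTS_RULE], sample) # prints "application/gzip"
```

When both rules match the same bytes, whichever is checked first wins, which is consistent with the patch above having an effect by moving gzip to the front of the list.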
