Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 object/bucket XML parsing issues 2.0.34 #764

Closed
DanielRedOak opened this issue Apr 6, 2015 · 8 comments
Closed

S3 object/bucket XML parsing issues 2.0.34 #764

DanielRedOak opened this issue Apr 6, 2015 · 8 comments

Comments

@DanielRedOak
Copy link
Contributor

Seeing some bad parsing with the changes from 2.0.34 coming from 2.0.33 when dealing with listing larger buckets. Reproduced with jruby 1.7.16.1 (1.9.3p392),which is by for Logstash where I am ultimately consuming this code.

require "aws-sdk-resources"
s3 = Aws::S3::Resource.new(:region => "us-east-1")
bucket = 'somebucket'
prefix = 'AWSLogs/accountnumber/CloudTrail/'
s3bucket = s3.bucket(bucket)
s3bucket.objects(:prefix => prefix).each do |log|
  puts log.key
end

Error:

RuntimeError: unhandled timestamp format `2015-04-'
         parse at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/frame.rb:277
         parse at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/frame.rb:275
      set_text at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/frame.rb:267
          text at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/stack.rb:31
    characters at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/engines/nokogiri.rb:29
    parse_with at nokogiri/XmlSaxParserContext.java:252
  parse_memory at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/nokogiri-1.6.6.2-java/lib/nokogiri/xml/sax/parser.rb:112
         parse at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/nokogiri-1.6.6.2-java/lib/nokogiri/xml/sax/parser.rb:84
         parse at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/engines/nokogiri.rb:13
         parse at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser.rb:28
    parse_body at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/rest_handler.rb:15
         parse at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/rest_body_handler.rb:91
  extract_data at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/rest_body_handler.rb:40
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/rest_body_handler.rb:10
          call at org/jruby/RubyProc.java:271
            on at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/response.rb:43
    on_success at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/http/response.rb:137
          call at org/jruby/RubyProc.java:271
      listener at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/http/response.rb:164
          call at org/jruby/RubyProc.java:271
       on_done at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/http/response.rb:128
    on_success at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/http/response.rb:135
            on at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/response.rb:42
    on_success at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/response.rb:51
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/rest_body_handler.rb:9
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/plugins/user_agent.rb:12
   manage_keys at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/plugins/s3_url_encoded_keys.rb:39
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/plugins/s3_url_encoded_keys.rb:31
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/plugins/restful_bindings.rb:13
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/plugins/endpoint.rb:35
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/plugins/param_validation.rb:22
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/plugins/raise_response_errors.rb:14
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/plugins/s3_sse_cpk.rb:18
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/plugins/param_conversion.rb:22
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/aws-sdk-core/plugins/response_paging.rb:10
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/plugins/response_target.rb:18
  send_request at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/request.rb:70
  list_objects at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-core-2.0.34/lib/seahorse/client/base.rb:216
          call at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-resources-2.0.34/lib/aws-sdk-resources/request.rb:24
   all_batches at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-resources-2.0.34/lib/aws-sdk-resources/operations.rb:127
          each at org/jruby/RubyEnumerator.java:274
          each at /Users/DanielRedOak/.rbenv/versions/jruby-1.7.16.1/lib/ruby/gems/shared/gems/aws-sdk-resources-2.0.34/lib/aws-sdk-resources/collection.rb:18
        (root) at s3_test.rb:6

Some debugging I've added to (aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser/stack.rb) to show what it's receiving.

def text(value)
  puts value.inspect
  @frame.set_text(value)
end

Output:

...
"AWSLogs/someaccount/CloudTrail/eu-west-1/2015/04/04/someaccount_CloudTrail_eu-west-1_20150404T1450Z_YPnaBuc2bGxhPjHQ.json.gz"
"2015-04-04T14:52:35.000Z"
"\""
"26a091c72b039fc7b28b90414ce01c21"
"\""
"484"
"STANDARD"
"AWSLogs/someaccount/CloudTrail/eu-west-1/2015/04/04/someaccount_CloudTrail_eu-west-1_20150404T1505Z_b1NDxxGvlx5qnaHM.json.gz"
"2015-0"
...(Exception here)

Added some debugging in aws-sdk-core-2.0.34/lib/aws-sdk-core/xml/parser.rb to validate the XML. Everything looks correct there so its somewhere down the line within the parsing.

def parse(xml, target = nil)
  xml = '<xml/>' if xml.nil? or xml.empty?
  puts "XML: " + xml + "ENDXML"
  stack = Stack.new(@shape, target)
  @engine.new(stack).parse(xml.to_s)
  stack.result
end

The error only shows up when trying to parse the timefields since there is validation on that field, but I see other inconsistencies where it seems to be chopping the key field into multiples as well causing issues.

Not sure whats going on at this point, it only seems to happen when listing a large bucket (which worked up until 2.0.34) which seems to make the parsing fail.

@trevorrowe
Copy link
Member

@DanielRedOak I'm really interested in determining the cause and providing a fix. Would you be able to provide a complete XML document for the list object response that causes the failure? I would like to reproduce this locally, and am currently unable to. I'd like to get this patched asap.

@DanielRedOak
Copy link
Contributor Author

Probably, I'll need to sanitize things first. I'll see about doing that shortly

@trevorrowe
Copy link
Member

I was able to produce a failure myself. No need to send the output. No clue what the cause is yet, but I'll update as I find out more.

@DanielRedOak
Copy link
Contributor Author

OK, was just about to ask for an email. Let me know if you need anything else!

@trevorrowe
Copy link
Member

If found the root cause of the issue. It appears that Nokogiri will send potentially multiple text events to the sax handler. For example, given the following XML element:

<LastModified>2012-12-06T18:33:01.000Z</LastModified>

It will typically trigger a single text event with the value 2012-12-06T18:33:01.000Z as a string. For whatever reason, larger XML documents will cause nokogiri to start emitting two text events, e.g. 2012-12- and 06T18:33:01.000Z, splitting the XML text element up into multiple parts.

The fix is to have the stack frame collect all of the yeilded text values and then have them parsed when the result is accessed, joining the parts first.

This explains why other string values might be truncated (as only the last one would be returned). This appears to only affect the Nokogiri backend, but I'm going to test the others. I need to add a text case that triggers this an then post a fix.

I hope to have this patched this afternoon.

Thank you for reporting the issue!

@trevorrowe
Copy link
Member

The fix was pretty straight-forward. In the sax-parser, when an element has a text value and that element was a known scalar value (e.g. string, timestamp, integer, boolean, etc) it was being type-cast in response to the text callback.

Now the text values are collected into a list and they are instead joined when the #result method is called on the stack frame. This allows xml parsers to emit the text in multiple parts and have them joined at the end when the full string is available.

I added the ability to inject a parsing engine and the tests now exercise a dummy parser engine that splits the sample text into individual characters emitting each as their own text element. This initially broke all of the tests until the related change as made, causing the tests to go green again.

I'm happy with the change and am inclined to ship this later today. Please feel free to share any feedback you have. Also, if you have a change to checkout the master branch to test the fix, that would be great.

@trevorrowe
Copy link
Member

This is available now in version v2.0.36.

@DanielRedOak
Copy link
Contributor Author

Just tested again with a very large data set and it seemed to work correctly now. Thanks for knocking this one out quickly and with a detailed explanation!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants