
[APPSEC-10967] Compress and encode schema information #3177

Merged
merged 6 commits into master from asm-compress-schema-information on Oct 4, 2023

Conversation


@GustavoCaso GustavoCaso commented Oct 3, 2023

What does this PR do?

The RFC for API Security Schema extraction recommends compressing and encoding schema extraction information before appending it to the root span to reduce the number of bytes sent over the wire. The RFC also defines a hard limit of the maximum size for individual schema values.

This PR adds the necessary code to compress and encode the schema information.
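At a high level, each schema value is dumped to JSON, gzipped, Base64-encoded, and only attached to the root span if it stays under the size limit. A minimal sketch of that flow (the helper name and the placeholder limit below are illustrative assumptions, not the PR's exact code):

require 'json'
require 'zlib'
require 'base64'
require 'stringio'

# Placeholder for the RFC's hard limit on an individual schema value.
MAX_ENCODED_SCHEMA_SIZE = 25_000

def compressed_schema_value(value)
  json = JSON.dump(value)

  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio)
  gz.write(json)
  gz.close

  encoded = Base64.encode64(sio.string)
  # Drop schema values that still exceed the hard limit after compression.
  encoded.size >= MAX_ENCODED_SCHEMA_SIZE ? nil : encoded
end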

Motivation:

Additional Notes:

The last commit f9af157 makes use of Zlib::GzipWriter instead of Zlib.gzip because Ruby 2.2 and 2.3 do not have that method.

Also, the RBS repo does not have up-to-date Zlib signatures, so I had to add those in vendor/rbs/zlib manually. I wanted to upgrade the rbs version, but unfortunately the RBS release that updated the Zlib signatures dropped support for Ruby 2.x (https://github.com/ruby/rbs/releases/tag/v3.2.0.pre.1), so I had to port them manually.

We decided to update the RBS version instead 😄

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@GustavoCaso GustavoCaso requested review from a team as code owners October 3, 2023 09:49
@github-actions github-actions bot added the appsec Application Security monitoring product label Oct 3, 2023

codecov-commenter commented Oct 3, 2023

Codecov Report

Merging #3177 (733d010) into master (62e6c52) will increase coverage by 0.06%.
Report is 17 commits behind head on master.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3177      +/-   ##
==========================================
+ Coverage   98.14%   98.20%   +0.06%     
==========================================
  Files        1247     1247              
  Lines       71713    71567     -146     
  Branches     3340     3338       -2     
==========================================
- Hits        70381    70283      -98     
+ Misses       1332     1284      -48     

see 71 files with indirect coverage changes


end
next
end
tags[key] = data

I would suggest adding an extra check to see whether data.size is larger than the size of JSON.dump(value): Base64 encoding typically produces a string of roughly 4 * n / 3 characters, so if the compression isn't effective it might be best to just provide the uncompressed, unencoded JSON string.

Alternatively, you could try and experimentally identify a minimum size below which compression + base64 encoding always results in a larger string and only compress above that threshold.


For example, consider the following schema: [{"hello" : 8}]

Compressing and encoding with echo -n '[{"hello" : 8}]' | gzip -f | base64 results in: H4sIAAAAAAAAA4uuVspIzcnJV1KwUrCojQUAqI9PNw8AAAA=

The schema in this example is 15 characters while the compressed and base64-encoded payload is 49, so you need to check which one is larger before adding it to the trace. For example (perhaps not valid Ruby but you get the idea):

waf_result.derivatives.each do |key, value|
  json = JSON.dump(value)
  data = Base64.encode64(gzip(json))

  if data.size >= MAX_ENCODED_SCHEMA_SIZE && json.size >= MAX_ENCODED_SCHEMA_SIZE
    Datadog.logger.debug do
      "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
    end
    next
  end

  # Keep whichever representation is smaller.
  if json.size > data.size
    tags[key] = data
  else
    tags[key] = json
  end
end


Oh and just for fun you can double check that everything is correct as follows:

$ echo -n '[{"hello" : 8}]' | gzip -f | base64 | base64 --decode | gzip --decompress
[{"hello" : 8}]


Alternatively, you could try and experimentally identify a minimum size below which compression + base64 encoding always results in a larger string and only compress above that threshold.

You could test different payloads to find the minimum size at which compression and encoding result in a smaller payload; once you find this magic value, you could have something along the lines of:

waf_result.derivatives.each do |key, value|
  parsed_value = json_parse(value)
  parsed_value_size = parsed_value.size

  if parsed_value_size >= MIN_SCHEMA_SIZE_FOR_COMPRESSION
    compressed_data = compressed_and_base64_encoded(parsed_value)
    compressed_data_size = compressed_data.size

    if compressed_data_size >= MAX_ENCODED_SCHEMA_SIZE && parsed_value_size >= MAX_ENCODED_SCHEMA_SIZE
      Datadog.logger.debug do
        "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
      end
      next
    end

    derivative_value = parsed_value_size > compressed_data_size ? compressed_data : parsed_value

    tags[key] = derivative_value
  else
    tags[key] = parsed_value
  end
end


if data.size >= MAX_ENCODED_SCHEMA_SIZE
  Datadog.logger.debug do
    "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"

Just a minor wording suggestion

Suggested change:
- "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
+ "Schema key: #{key} exceeds the max size value. It will not be included as part of the span tags"

end

tags
end
end

def self.gzip(value)

I think there is a decent chance that at some point, somewhere, file IO is going to fail. Should we be explicitly handling exceptions in this method?
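One way explicit handling could look (a sketch only; rescuing Zlib::Error and dropping the tag on failure is an assumption, not necessarily what the PR ends up doing):

def self.gzip(value)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
  gz.write(value)
  gz.close
  sio.string
rescue Zlib::Error, IOError => e
  Datadog.logger.debug { "Failed to compress schema payload: #{e.class}" }
  nil # caller would then skip the tag for this schema value
end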

Comment on lines +84 to +86
request.headers.each do |header, value|
tags["http.request.headers.#{header}"] = value if ALLOWED_REQUEST_HEADERS.include?(header.downcase)
end
@GustavoCaso GustavoCaso Oct 3, 2023

Reduce the number of times we iterate over request.headers

Comment on lines +94 to +96
response.headers.each do |header, value|
tags["http.response.headers.#{header}"] = value if ALLOWED_RESPONSE_HEADERS.include?(header.downcase)
end
Member Author

Reduce the number of times we iterate over response.headers

Comment on lines +127 to +128
appsec_events = json_parse({ triggers: waf_events })
entry_tags['_dd.appsec.json'] = appsec_events if appsec_events
Member Author

Move from record_via_span into build_service_entry_tags

tags
def gzip(value)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
@marcotc marcotc Oct 3, 2023

Could I trouble you to run some very simple benchmarks on your machine with a sample value payload and different combinations of compression and strategy arguments? (Measure performance and compressed size, so we can make a good trade-off decision.)

@anmarchenko
Member

@GustavoCaso did you run bundle exec rake appraisal:generate after making this change? It gives me a weird error message: Could not find gem 'rbs (~> 3.2.0)' with platform 'aarch64-linux' in rubygems repository https://rubygems.org/ or installed locally.


lloeki commented Oct 4, 2023

@anmarchenko That's odd, rbs is pure Ruby and has no binary gem.

@lloeki lloeki left a comment

Approved following internal discussion over pending changes.

@GustavoCaso

@marcotc, with regards to the benchmarks, I created a simple script that should let us decide what the best option is:

# frozen_string_literal: true

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'benchmark-ips'
  gem 'pry'
end

require 'benchmark/ips'
require 'zlib'
require 'json'
require 'stringio'

# Each compression / strategy combination gets its own array of compressed payloads

words = File.readlines('/usr/share/dict/words')
schema_values = [0, 1, 2, 4, 8, 16]

value = [{}]
100000.times do
  value[0][words.sample.chomp] = [schema_values.sample]
end

value_json = value.to_json

uncompressed_memory_usage = value_json.bytesize

def gzip(value, compression: Zlib::DEFAULT_COMPRESSION, strategy: Zlib::DEFAULT_STRATEGY)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, compression, strategy)
  gz.write(value)
  gz.close
  sio.string
end

gzip_normal = []
gzip_normal_filtered = []
gzip_normal_fixed = []
gzip_normal_huffman = []
gzip_best_speed = []
gzip_best_speed_filtered = []
gzip_best_speed_fixed = []
gzip_best_speed_huffman = []
gzip_best_compression = []
gzip_best_compression_filtered = []
gzip_best_compression_fixed = []
gzip_best_compression_huffman = []

# Compress time benchmark
puts 'Time to compress the files'
Benchmark.ips do |x|
  x.report('GZip (normal mode)') do
    gzip_normal << gzip(value_json)
  end

  x.report('GZip (normal mode) (filtered)') do
    gzip_normal_filtered << gzip(value_json, strategy: Zlib::FILTERED)
  end

  x.report('GZip (normal mode) (fixed)') do
    gzip_normal_fixed << gzip(value_json, strategy: Zlib::FIXED)
  end

  x.report('GZip (normal mode) (huffman)') do
    gzip_normal_huffman << gzip(value_json, strategy: Zlib::HUFFMAN_ONLY)
  end

  x.report('GZip (speed mode)') do
    gzip_best_speed << gzip(value_json, compression: Zlib::BEST_SPEED)
  end

  x.report('GZip (speed mode) (filtered)') do
    gzip_best_speed_filtered << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::FILTERED)
  end

  x.report('GZip (speed mode) (fixed)') do
    gzip_best_speed_fixed << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::FIXED)
  end

  x.report('GZip (speed mode) (huffman)') do
    gzip_best_speed_huffman << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::HUFFMAN_ONLY)
  end

  x.report('GZip (compression mode)') do
    gzip_best_compression << gzip(value_json, compression: Zlib::BEST_COMPRESSION)
  end

  x.report('GZip (compression mode) (filtered)') do
    gzip_best_compression_filtered << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::FILTERED)
  end

  x.report('GZip (compression mode) (fixed)') do
    gzip_best_compression_fixed << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::FIXED)
  end

  x.report('GZip (compression mode) (huffman)') do
    gzip_best_compression_huffman << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::HUFFMAN_ONLY)
  end

  x.compare!
end
puts

def average_size(array)
  array.map(&:bytesize).sum / array.size
end

# Gets the memory size (in bytes) of each compressed array
gzip_normal_compressed_average_size = average_size(gzip_normal)
gzip_normal_filtered_average_size = average_size(gzip_normal_filtered)
gzip_normal_fixed_average_size = average_size(gzip_normal_fixed)
gzip_normal_huffman_average_size = average_size(gzip_normal_huffman)

gzip_best_speed_compressed_average_size = average_size(gzip_best_speed)
gzip_best_speed_filtered_average_size = average_size(gzip_best_speed_filtered)
gzip_best_speed_fixed_average_size = average_size(gzip_best_speed_fixed)
gzip_best_speed_huffman_average_size = average_size(gzip_best_speed_huffman)

gzip_best_compression_compressed_average_size = average_size(gzip_best_compression)
gzip_best_compression_filtered_average_size = average_size(gzip_best_compression_filtered)
gzip_best_compression_fixed_average_size = average_size(gzip_best_compression_fixed)
gzip_best_compression_huffman_average_size = average_size(gzip_best_compression_huffman)

averages = [
  { name: 'GZip (normal mode)', value: gzip_normal_compressed_average_size },
  { name: 'GZip (normal mode) (filtered)', value: gzip_normal_filtered_average_size },
  { name: 'GZip (normal mode) (fixed)', value: gzip_normal_fixed_average_size },
  { name: 'GZip (normal mode) (huffman)', value: gzip_normal_huffman_average_size },

  { name: 'GZip (speed mode)', value: gzip_best_speed_compressed_average_size },
  { name: 'GZip (speed mode) (filtered)', value: gzip_best_speed_filtered_average_size },
  { name: 'GZip (speed mode) (fixed)', value: gzip_best_speed_fixed_average_size },
  { name: 'GZip (speed mode) (huffman)', value: gzip_best_speed_huffman_average_size },

  { name: 'GZip (best compression mode) ', value: gzip_best_compression_compressed_average_size },
  { name: 'GZip (best compression mode) (filtered)', value: gzip_best_compression_filtered_average_size },
  { name: 'GZip (best compression mode) (fixed)', value: gzip_best_compression_fixed_average_size },
  { name: 'GZip (best compression mode) (huffman)', value: gzip_best_compression_huffman_average_size },
].sort_by { |a| a[:value] }

# Prints the memory used after compression on each algorithm
puts 'Memory used by the compression mode'
puts "Original file size  #{uncompressed_memory_usage} bytes"

averages.each do |av|
  puts "#{av[:name]}  #{av[:value]} bytes (#{(100.0 * av[:value] / uncompressed_memory_usage).round(2)} %)"
end

puts

The results are:

Comparison:
GZip (speed mode) (huffman):      105.3 i/s
GZip (compression mode) (huffman):      103.3 i/s - same-ish: difference falls within error
GZip (normal mode) (huffman):      103.1 i/s - same-ish: difference falls within error
GZip (speed mode) (filtered):       85.0 i/s - 1.24x  (± 0.00) slower
GZip (speed mode):       84.5 i/s - 1.25x  (± 0.00) slower
GZip (speed mode) (fixed):       84.4 i/s - 1.25x  (± 0.00) slower
GZip (normal mode):       19.9 i/s - 5.30x  (± 0.00) slower
GZip (normal mode) (fixed):       19.9 i/s - 5.30x  (± 0.00) slower
GZip (normal mode) (filtered):       17.9 i/s - 5.88x  (± 0.00) slower
GZip (compression mode) (filtered):        5.2 i/s - 20.31x  (± 0.00) slower
GZip (compression mode):        4.8 i/s - 21.74x  (± 0.00) slower
GZip (compression mode) (fixed):        4.8 i/s - 21.98x  (± 0.00) slower


Memory used by the compression mode
Original file size  1365446 bytes
GZip (compression mode) (filtered)  498006 bytes (36.47 %)
GZip (compression mode)   499090 bytes (36.55 %)
GZip (normal mode)  509671 bytes (37.33 %)
GZip (normal mode) (filtered)  514454 bytes (37.68 %)
GZip (speed mode)  607377 bytes (44.48 %)
GZip (speed mode) (filtered)  607377 bytes (44.48 %)
GZip (compression mode) (fixed)  618894 bytes (45.33 %)
GZip (normal mode) (fixed)  628963 bytes (46.06 %)
GZip (speed mode) (fixed)  777802 bytes (56.96 %)
GZip (speed mode) (huffman)  808165 bytes (59.19 %)
GZip (normal mode) (huffman)  808165 bytes (59.19 %)
GZip (compression mode) (huffman)  808165 bytes (59.19 %)

Based on those values, we could use Zlib::BEST_SPEED and the default strategy, since that yields a good mix of speed and byte-size reduction.

@GustavoCaso

Answering @Anilm3 #3177 (comment)

I tested to find the value below which it is not worth compressing the payload, using a simple script:

# frozen_string_literal: true

require 'zlib'
require 'json'
require 'stringio'
require 'base64'

def gzip(value, compression = Zlib::BEST_SPEED, strategy = Zlib::DEFAULT_STRATEGY)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, compression, strategy)
  gz.write(value)
  gz.close
  sio.string
end

def json(value)
  JSON.dump(value)
end

def base64(value)
  Base64.encode64(value)
end

testing = true

value = [{}]
words = File.readlines('/usr/share/dict/words')
schema_values = [0, 1, 2, 4, 8, 16]
iteration = 0
while testing
  json_value = json(value)
  compressed_and_encoded_value = base64(gzip(json_value))

  if json_value.size >= compressed_and_encoded_value.size
    puts "json value size is #{json_value.size}"
    puts "compressed and encoded size is #{compressed_and_encoded_value.size}"
    puts "JSON #{json_value}"
    puts "Compressed value #{compressed_and_encoded_value}"
    testing = false
    next
  end

  iteration += 1
  puts "Iteration #{iteration}"

  key = words.sample.chomp
  value[0][key] = [schema_values.sample]
end

puts 'DONE'

The results showed that payloads below 260 bytes are not worth compressing.
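Plugging that finding back into the threshold idea suggested earlier could look roughly like this (a sketch reusing the gzip helper from the script above; the constant name comes from that earlier suggestion, and whether the PR adopts this exact guard is an assumption):

# Below roughly this size, compression + Base64 encoding tends to grow the payload.
MIN_SCHEMA_SIZE_FOR_COMPRESSION = 260

def derivative_tag_value(value)
  json = JSON.dump(value)
  return json if json.size < MIN_SCHEMA_SIZE_FOR_COMPRESSION

  compressed = Base64.encode64(gzip(json))
  compressed.size < json.size ? compressed : json
end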

@marcotc marcotc left a comment

Thank you so much, @GustavoCaso!

@GustavoCaso GustavoCaso merged commit de7400c into master Oct 4, 2023
176 checks passed
@GustavoCaso GustavoCaso deleted the asm-compress-schema-information branch October 4, 2023 20:24
@github-actions github-actions bot added this to the 1.15.0 milestone Oct 4, 2023