
[APPSEC-10967] Compress and encode schema information #3177

Merged
merged 6 commits into master from asm-compress-schema-information on Oct 4, 2023

Conversation


@GustavoCaso GustavoCaso commented Oct 3, 2023

What does this PR do?

The RFC for API Security Schema extraction recommends compressing and encoding schema extraction information before appending it to the root span to reduce the number of bytes sent over the wire. The RFC also defines a hard limit of the maximum size for individual schema values.

This PR adds the necessary code to compress and encode the schema information.
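At a high level, each schema value is dumped to JSON, gzipped, Base64-encoded, and only attached to the root span if it stays under the size limit. A minimal sketch of that flow (the helper name and the placeholder limit below are illustrative assumptions, not the PR's exact code):

require 'json'
require 'zlib'
require 'base64'
require 'stringio'

# Placeholder for the RFC's hard limit on an individual schema value.
MAX_ENCODED_SCHEMA_SIZE = 25_000

def compressed_schema_value(value)
  json = JSON.dump(value)

  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio)
  gz.write(json)
  gz.close

  encoded = Base64.encode64(sio.string)
  # Drop schema values that still exceed the hard limit after compression.
  encoded.size >= MAX_ENCODED_SCHEMA_SIZE ? nil : encoded
end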

Motivation:

Additional Notes:

The last commit f9af157 makes use of Zlib::GzipWriter instead of Zlib.gzip because Ruby 2.2 and 2.3 do not have that method.

Also, the RBS repo does not have up-to-date Zlib signatures, so I had to add those in vendor/rbs/zlib manually. I wanted to upgrade the rbs version, but unfortunately the RBS release that updated the Zlib signatures dropped support for Ruby 2.x (https://github.com/ruby/rbs/releases/tag/v3.2.0.pre.1), so I had to port them manually.

We decided to update the RBS version instead 😄

How to test the change?

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

Unsure? Have a question? Request a review!

@GustavoCaso GustavoCaso requested review from a team as code owners October 3, 2023 09:49
@github-actions github-actions bot added the appsec Application Security monitoring product label Oct 3, 2023

codecov-commenter commented Oct 3, 2023

Codecov Report

Merging #3177 (733d010) into master (62e6c52) will increase coverage by 0.06%.
Report is 17 commits behind head on master.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #3177      +/-   ##
==========================================
+ Coverage   98.14%   98.20%   +0.06%     
==========================================
  Files        1247     1247              
  Lines       71713    71567     -146     
  Branches     3340     3338       -2     
==========================================
- Hits        70381    70283      -98     
+ Misses       1332     1284      -48     

see 71 files with indirect coverage changes


end
next
end
tags[key] = data

I would suggest adding an extra check to see whether data.size is larger than the size of JSON.dump(value): Base64 encoding typically produces a string of roughly 4 * n / 3 characters, so if the compression isn't effective it might be best to just provide the uncompressed, unencoded JSON string.

Alternatively, you could try and experimentally identify a minimum size below which compression + base64 encoding always results in a larger string and only compress above that threshold.


For example, consider the following schema: [{"hello" : 8}]

Compressing and encoding with echo -n '[{"hello" : 8}]' | gzip -f | base64 results in: H4sIAAAAAAAAA4uuVspIzcnJV1KwUrCojQUAqI9PNw8AAAA=

The schema in this example is 15 characters while the compressed and base64-encoded payload is 49, so you need to check which one is larger before adding it to the trace. For example (perhaps not valid Ruby but you get the idea):

waf_result.derivatives.each do |key, value|
  json = JSON.dump(value)
  data = Base64.encode64(gzip(json))

  if data.size >= MAX_ENCODED_SCHEMA_SIZE && json.size >= MAX_ENCODED_SCHEMA_SIZE
    Datadog.logger.debug do
      "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
    end
    next
  end

  # Keep whichever representation is smaller.
  if json.size > data.size
    tags[key] = data
  else
    tags[key] = json
  end
end


Oh and just for fun you can double check that everything is correct as follows:

$ echo -n '[{"hello" : 8}]' | gzip -f | base64 | base64 --decode | gzip --decompress
[{"hello" : 8}]


Alternatively, you could try and experimentally identify a minimum size below which compression + base64 encoding always results in a larger string and only compress above that threshold.

You could test different payloads to find the minimum size at which compression and encoding result in a smaller payload; once you find this magic value, you could have something along the lines of:

waf_result.derivatives.each do |key, value|
  parsed_value = json_parse(value)
  parsed_value_size = parsed_value.size

  if parsed_value_size >= MIN_SCHEMA_SIZE_FOR_COMPRESSION
    compressed_data = compressed_and_base64_encoded(parsed_value)
    compressed_data_size = compressed_data.size

    if compressed_data_size >= MAX_ENCODED_SCHEMA_SIZE && parsed_value_size >= MAX_ENCODED_SCHEMA_SIZE
      Datadog.logger.debug do
        "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
      end
      next
    end

    derivative_value = parsed_value_size > compressed_data_size ? compressed_data : parsed_value

    tags[key] = derivative_value
  else
    tags[key] = parsed_value
  end
end


if data.size >= MAX_ENCODED_SCHEMA_SIZE
  Datadog.logger.debug do
    "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"

Just a minor wording suggestion

Suggested change:
- "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
+ "Schema key: #{key} exceeds the max size value. It will not be included as part of the span tags"

end

tags
end
end

def self.gzip(value)

I think there is a decent chance that at some point, somewhere, file IO is going to fail. Should we be explicitly handling exceptions in this method?
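One way explicit handling could look (a sketch only; rescuing Zlib::Error and dropping the tag on failure is an assumption, not necessarily what the PR ends up doing):

def self.gzip(value)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
  gz.write(value)
  gz.close
  sio.string
rescue Zlib::Error, IOError => e
  Datadog.logger.debug { "Failed to compress schema payload: #{e.class}" }
  nil # caller would then skip the tag for this schema value
end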

Comment on lines +84 to +86
request.headers.each do |header, value|
tags["http.request.headers.#{header}"] = value if ALLOWED_REQUEST_HEADERS.include?(header.downcase)
end
@GustavoCaso GustavoCaso Oct 3, 2023

Reduce the number of times we iterate over request.headers

Comment on lines +94 to +96
response.headers.each do |header, value|
tags["http.response.headers.#{header}"] = value if ALLOWED_RESPONSE_HEADERS.include?(header.downcase)
end
Member Author

Reduce the number of times we iterate over response.headers

Comment on lines +127 to +128
appsec_events = json_parse({ triggers: waf_events })
entry_tags['_dd.appsec.json'] = appsec_events if appsec_events
Member Author

Move from record_via_span into build_service_entry_tags

tags
def gzip(value)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
@marcotc marcotc Oct 3, 2023

Could I trouble you to run some very simple benchmarks on your machine with a sample value payload and different combinations of compression and strategy arguments? (Measure performance and compressed size, so we can make a good trade-off decision.)

@anmarchenko
Member

@GustavoCaso did you run bundle exec rake appraisal:generate after making this change? It gives me a weird error message: Could not find gem 'rbs (~> 3.2.0)' with platform 'aarch64-linux' in rubygems repository https://rubygems.org/ or installed locally.


lloeki commented Oct 4, 2023

@anmarchenko That's odd, rbs is pure Ruby and has no binary gem.

@lloeki lloeki left a comment

Approved following internal discussion over pending changes.

@GustavoCaso

@marcotc, with regards to the benchmarks, I created a simple script that should let us decide what the best option is:

# frozen_string_literal: true

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'benchmark-ips'
  gem 'pry'
end

require 'benchmark/ips'
require 'zlib'
require 'json'
require 'stringio'

# Each compression / strategy combination gets its own array of compressed payloads

words = File.readlines('/usr/share/dict/words')
schema_values = [0, 1, 2, 4, 8, 16]

value = [{}]
100000.times do
  value[0][words.sample.chomp] = [schema_values.sample]
end

value_json = value.to_json

uncompressed_memory_usage = value_json.bytesize

def gzip(value, compression: Zlib::DEFAULT_COMPRESSION, strategy: Zlib::DEFAULT_STRATEGY)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, compression, strategy)
  gz.write(value)
  gz.close
  sio.string
end

gzip_normal = []
gzip_normal_filtered = []
gzip_normal_fixed = []
gzip_normal_huffman = []
gzip_best_speed = []
gzip_best_speed_filtered = []
gzip_best_speed_fixed = []
gzip_best_speed_huffman = []
gzip_best_compression = []
gzip_best_compression_filtered = []
gzip_best_compression_fixed = []
gzip_best_compression_huffman = []

# Compress time benchmark
puts 'Time to compress the files'
Benchmark.ips do |x|
  x.report('GZip (normal mode)') do
    gzip_normal << gzip(value_json)
  end

  x.report('GZip (normal mode) (filtered)') do
    gzip_normal_filtered << gzip(value_json, strategy: Zlib::FILTERED)
  end

  x.report('GZip (normal mode) (fixed)') do
    gzip_normal_fixed << gzip(value_json, strategy: Zlib::FIXED)
  end

  x.report('GZip (normal mode) (huffman)') do
    gzip_normal_huffman << gzip(value_json, strategy: Zlib::HUFFMAN_ONLY)
  end

  x.report('GZip (speed mode)') do
    gzip_best_speed << gzip(value_json, compression: Zlib::BEST_SPEED)
  end

  x.report('GZip (speed mode) (filtered)') do
    gzip_best_speed_filtered << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::FILTERED)
  end

  x.report('GZip (speed mode) (fixed)') do
    gzip_best_speed_fixed << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::FIXED)
  end

  x.report('GZip (speed mode) (huffman)') do
    gzip_best_speed_huffman << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::HUFFMAN_ONLY)
  end

  x.report('GZip (compression mode)') do
    gzip_best_compression << gzip(value_json, compression: Zlib::BEST_COMPRESSION)
  end

  x.report('GZip (compression mode) (filtered)') do
    gzip_best_compression_filtered << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::FILTERED)
  end

  x.report('GZip (compression mode) (fixed)') do
    gzip_best_compression_fixed << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::FIXED)
  end

  x.report('GZip (compression mode) (huffman)') do
    gzip_best_compression_huffman << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::HUFFMAN_ONLY)
  end

  x.compare!
end
puts

def average_size(array)
  array.map(&:bytesize).sum / array.size
end

# Gets the memory size (in bytes) of each compressed array
gzip_normal_compressed_average_size = average_size(gzip_normal)
gzip_normal_filtered_average_size = average_size(gzip_normal_filtered)
gzip_normal_fixed_average_size = average_size(gzip_normal_fixed)
gzip_normal_huffman_average_size = average_size(gzip_normal_huffman)

gzip_best_speed_compressed_average_size = average_size(gzip_best_speed)
gzip_best_speed_filtered_average_size = average_size(gzip_best_speed_filtered)
gzip_best_speed_fixed_average_size = average_size(gzip_best_speed_fixed)
gzip_best_speed_huffman_average_size = average_size(gzip_best_speed_huffman)

gzip_best_compression_compressed_average_size = average_size(gzip_best_compression)
gzip_best_compression_filtered_average_size = average_size(gzip_best_compression_filtered)
gzip_best_compression_fixed_average_size = average_size(gzip_best_compression_fixed)
gzip_best_compression_huffman_average_size = average_size(gzip_best_compression_huffman)

averages = [
  { name: 'GZip (normal mode)', value: gzip_normal_compressed_average_size },
  { name: 'GZip (normal mode) (filtered)', value: gzip_normal_filtered_average_size },
  { name: 'GZip (normal mode) (fixed)', value: gzip_normal_fixed_average_size },
  { name: 'GZip (normal mode) (huffman)', value: gzip_normal_huffman_average_size },

  { name: 'GZip (speed mode)', value: gzip_best_speed_compressed_average_size },
  { name: 'GZip (speed mode) (filtered)', value: gzip_best_speed_filtered_average_size },
  { name: 'GZip (speed mode) (fixed)', value: gzip_best_speed_fixed_average_size },
  { name: 'GZip (speed mode) (huffman)', value: gzip_best_speed_huffman_average_size },

  { name: 'GZip (best compression mode) ', value: gzip_best_compression_compressed_average_size },
  { name: 'GZip (best compression mode) (filtered)', value: gzip_best_compression_filtered_average_size },
  { name: 'GZip (best compression mode) (fixed)', value: gzip_best_compression_fixed_average_size },
  { name: 'GZip (best compression mode) (huffman)', value: gzip_best_compression_huffman_average_size },
].sort_by { |a| a[:value] }

# Prints the memory used after compression on each algorithm
puts 'Memory used by the compression mode'
puts "Original file size  #{uncompressed_memory_usage} bytes"

averages.each do |av|
  puts "#{av[:name]}  #{av[:value]} bytes (#{(100.0 * av[:value] / uncompressed_memory_usage).round(2)} %)"
end

puts

The results are:

Comparison:
GZip (speed mode) (huffman):      105.3 i/s
GZip (compression mode) (huffman):      103.3 i/s - same-ish: difference falls within error
GZip (normal mode) (huffman):      103.1 i/s - same-ish: difference falls within error
GZip (speed mode) (filtered):       85.0 i/s - 1.24x  (± 0.00) slower
GZip (speed mode):       84.5 i/s - 1.25x  (± 0.00) slower
GZip (speed mode) (fixed):       84.4 i/s - 1.25x  (± 0.00) slower
GZip (normal mode):       19.9 i/s - 5.30x  (± 0.00) slower
GZip (normal mode) (fixed):       19.9 i/s - 5.30x  (± 0.00) slower
GZip (normal mode) (filtered):       17.9 i/s - 5.88x  (± 0.00) slower
GZip (compression mode) (filtered):        5.2 i/s - 20.31x  (± 0.00) slower
GZip (compression mode):        4.8 i/s - 21.74x  (± 0.00) slower
GZip (compression mode) (fixed):        4.8 i/s - 21.98x  (± 0.00) slower


Memory used by the compression mode
Original file size  1365446 bytes
GZip (compression mode) (filtered)  498006 bytes (36.47 %)
GZip (compression mode)   499090 bytes (36.55 %)
GZip (normal mode)  509671 bytes (37.33 %)
GZip (normal mode) (filtered)  514454 bytes (37.68 %)
GZip (speed mode)  607377 bytes (44.48 %)
GZip (speed mode) (filtered)  607377 bytes (44.48 %)
GZip (compression mode) (fixed)  618894 bytes (45.33 %)
GZip (normal mode) (fixed)  628963 bytes (46.06 %)
GZip (speed mode) (fixed)  777802 bytes (56.96 %)
GZip (speed mode) (huffman)  808165 bytes (59.19 %)
GZip (normal mode) (huffman)  808165 bytes (59.19 %)
GZip (compression mode) (huffman)  808165 bytes (59.19 %)

Based on those values, we could use Zlib::BEST_SPEED and the default strategy, since that yields a good mix of speed and byte-size reduction.

@GustavoCaso

Answering @Anilm3 #3177 (comment)

I tested to find the value below which it is not worth compressing the payload, using a simple script:

# frozen_string_literal: true

require 'zlib'
require 'json'
require 'stringio'
require 'base64'

def gzip(value, compression = Zlib::BEST_SPEED, strategy = Zlib::DEFAULT_STRATEGY)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, compression, strategy)
  gz.write(value)
  gz.close
  sio.string
end

def json(value)
  JSON.dump(value)
end

def base64(value)
  Base64.encode64(value)
end

testing = true

value = [{}]
words = File.readlines('/usr/share/dict/words')
schema_values = [0, 1, 2, 4, 8, 16]
iteration = 0
while testing
  json_value = json(value)
  compressed_and_encoded_value = base64(gzip(json_value))

  if json_value.size >= compressed_and_encoded_value.size
    puts "json value size is #{json_value.size}"
    puts "compressed and encoded size is #{compressed_and_encoded_value.size}"
    puts "JSON #{json_value}"
    puts "Compressed value #{compressed_and_encoded_value}"
    testing = false
    next
  end

  iteration += 1
  puts "Iteration #{iteration}"

  key = words.sample.chomp
  value[0][key] = [schema_values.sample]
end

puts 'DONE'

The results showed that payloads below 260 bytes are not worth compressing.
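Plugging that finding back into the threshold idea suggested earlier could look roughly like this (a sketch reusing the gzip helper from the script above; the constant name comes from that earlier suggestion, and whether the PR adopts this exact guard is an assumption):

# Below roughly this size, compression + Base64 encoding tends to grow the payload.
MIN_SCHEMA_SIZE_FOR_COMPRESSION = 260

def derivative_tag_value(value)
  json = JSON.dump(value)
  return json if json.size < MIN_SCHEMA_SIZE_FOR_COMPRESSION

  compressed = Base64.encode64(gzip(json))
  compressed.size < json.size ? compressed : json
end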

@marcotc marcotc left a comment

Thank you so much, @GustavoCaso!

@GustavoCaso GustavoCaso merged commit de7400c into master Oct 4, 2023
176 checks passed
@GustavoCaso GustavoCaso deleted the asm-compress-schema-information branch October 4, 2023 20:24
@github-actions github-actions bot added this to the 1.15.0 milestone Oct 4, 2023