[APPSEC-10967] Compress and encode schema information #3177
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #3177      +/-   ##
==========================================
+ Coverage   98.14%   98.20%   +0.06%
==========================================
  Files        1247     1247
  Lines       71713    71567     -146
  Branches     3340     3338       -2
==========================================
- Hits        70381    70283      -98
+ Misses       1332     1284      -48

see 71 files with indirect coverage changes
lib/datadog/appsec/event.rb
Outdated
    end
    next
  end
  tags[key] = data
I would suggest adding an extra check to see if data.size is larger than the size of JSON.dump(value), as Base64 encoding typically results in a string of size roughly 4 * n / 3, so if the compression isn't effective it might be best to just provide the uncompressed / unencoded JSON string.

Alternatively, you could try and experimentally identify a minimum size below which compression + Base64 encoding always results in a larger string, and only compress above that threshold.
For example, consider the following schema: [{"hello" : 8}]
Compressing and encoding with echo -n '[{"hello" : 8}]' | gzip -f | base64
results in: H4sIAAAAAAAAA4uuVspIzcnJV1KwUrCojQUAqI9PNw8AAAA=
The schema in this example is 15 characters while the compressed and base64-encoded payload is 49, so you need to check which one is larger before adding it to the trace. For example (perhaps not valid Ruby but you get the idea):
waf_result.derivatives.each do |key, value|
  json = JSON.dump(value)
  data = Base64.encode64(gzip(json))

  if data.size >= MAX_ENCODED_SCHEMA_SIZE && json.size >= MAX_ENCODED_SCHEMA_SIZE
    Datadog.logger.debug do
      "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
    end
    next
  end

  if json.size > data.size
    tags[key] = data
  else
    tags[key] = json
  end
end
Oh and just for fun you can double check that everything is correct as follows:
$ echo -n '[{"hello" : 8}]' | gzip -f | base64 | base64 --decode | gzip --decompress
[{"hello" : 8}]
Alternatively, you could try and experimentally identify a minimum size below which compression + base64 encoding always results in a larger string and only compress above that threshold.
You could test different payloads to find the minimum size at which compression and encoding yield a smaller payload; once you find that magic value, you could have something along the lines of:
waf_result.derivatives.each do |key, value|
  parsed_value = json_parse(value)
  parsed_value_size = parsed_value.size

  if parsed_value_size >= MIN_SCHEMA_SIZE_FOR_COMPRESSION
    compressed_data = compressed_and_base64_encoded(parsed_value)
    compressed_data_size = compressed_data.size

    if compressed_data_size >= MAX_ENCODED_SCHEMA_SIZE && parsed_value_size >= MAX_ENCODED_SCHEMA_SIZE
      Datadog.logger.debug do
        "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
      end
      next
    end

    derivative_value = parsed_value_size > compressed_data_size ? compressed_data : parsed_value
    tags[key] = derivative_value
  else
    tags[key] = parsed_value
  end
end
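For reference, a minimal sketch of the two helpers this snippet assumes; json_parse and compressed_and_base64_encoded are illustrative names carried over from the pseudocode above, not existing methods in the library:

require 'base64'
require 'json'
require 'stringio'
require 'zlib'

# Serialize the derivative value, returning nil if it cannot be dumped.
def json_parse(value)
  JSON.dump(value)
rescue StandardError
  nil
end

# Gzip the JSON string and Base64-encode the result for use in a span tag.
def compressed_and_base64_encoded(json_string)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio)
  gz.write(json_string)
  gz.close
  Base64.encode64(sio.string)
end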
lib/datadog/appsec/event.rb
Outdated
if data.size >= MAX_ENCODED_SCHEMA_SIZE
  Datadog.logger.debug do
    "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
Just a minor wording suggestion:

Current:   "Schema key: #{key} exceed max size value. We do not include it as part of the span tags"
Suggested: "Schema key: #{key} exceeds the max size value. It will not be included as part of the span tags"
lib/datadog/appsec/event.rb
Outdated
  end

  tags
end
end

def self.gzip(value)
I think there is a decent chance that at some point, somewhere, file IO is going to fail. Should we be explicitly handling exceptions in this method?
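As a sketch of one possible shape for that (the rescue granularity and the nil fallback are assumptions, not an agreed-upon design), the method could catch Zlib and IO errors and let the caller fall back to the uncompressed JSON:

def self.gzip(value)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
  gz.write(value)
  gz.close
  sio.string
rescue Zlib::Error, IOError => e
  # If compression fails, log and return nil so the caller can skip the tag
  # or use the uncompressed JSON string instead.
  Datadog.logger.debug { "Failed to gzip schema payload: #{e.message}" }
  nil
end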
… the smaller version between the JSON string version and the compressed version for schema key values
request.headers.each do |header, value|
  tags["http.request.headers.#{header}"] = value if ALLOWED_REQUEST_HEADERS.include?(header.downcase)
end
Reduce the number of times we iterate over request.headers
response.headers.each do |header, value|
  tags["http.response.headers.#{header}"] = value if ALLOWED_RESPONSE_HEADERS.include?(header.downcase)
end
Reduce the number of times we iterate over response.headers
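One way to read both header comments above, sketched under the assumption that the header collections support hash-style lookup by lowercased header name (which may not hold for the real request/response objects), is to look up only the allowlisted headers instead of scanning every header. The header_tags helper below is hypothetical:

# Hypothetical helper: fetch only the allowlisted headers rather than
# iterating the whole header collection.
def header_tags(headers, allowlist, prefix)
  allowlist.each_with_object({}) do |header, tags|
    value = headers[header]
    tags["#{prefix}#{header}"] = value if value
  end
end

# tags.merge!(header_tags(request.headers, ALLOWED_REQUEST_HEADERS, 'http.request.headers.'))
# tags.merge!(header_tags(response.headers, ALLOWED_RESPONSE_HEADERS, 'http.response.headers.'))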
appsec_events = json_parse({ triggers: waf_events })
entry_tags['_dd.appsec.json'] = appsec_events if appsec_events
Move this from record_via_span into build_service_entry_tags.
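A rough sketch of what that move could look like; the method signature below is assumed for illustration, not taken from the PR:

def self.build_service_entry_tags(waf_events, entry_tags)
  # Build the triggers payload here instead of in record_via_span, so the
  # service entry span tags are assembled in one place.
  appsec_events = json_parse({ triggers: waf_events })
  entry_tags['_dd.appsec.json'] = appsec_events if appsec_events
  entry_tags
end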
lib/datadog/appsec/event.rb
Outdated
tags

def gzip(value)
  sio = StringIO.new
  gz = Zlib::GzipWriter.new(sio, Zlib::DEFAULT_COMPRESSION, Zlib::DEFAULT_STRATEGY)
Could I trouble you to run some very simple benchmarks on your machine with a sample value payload and different combinations of compression and strategy arguments? (Measure performance and compressed size, so we can make a good trade-off decision.)
@GustavoCaso did you run …
@anmarchenko That's odd, …
Approved following internal discussion over pending changes.
@marcotc, with regards to the benchmarks, I created a simple script that should help us decide which option is best:

# frozen_string_literal: true
require 'bundler/inline'
gemfile do
source 'https://rubygems.org'
gem 'benchmark-ips'
gem 'pry'
end
require 'benchmark/ips'
require 'zlib'
require 'json'
require 'stringio'
# Each compression algorithm has its own array of compressed files
words = File.readlines('/usr/share/dict/words')
schema_values = [0, 1, 2, 4, 8, 16]
value = [{}]
100000.times do
value[0][words.sample.chomp] = [schema_values.sample]
end
value_json = value.to_json
uncompressed_memory_usage = value_json.bytesize
def gzip(value, compression: Zlib::DEFAULT_COMPRESSION, strategy: Zlib::DEFAULT_STRATEGY)
sio = StringIO.new
gz = Zlib::GzipWriter.new(sio, compression, strategy)
gz.write(value)
gz.close
sio.string
end
gzip_normal = []
gzip_normal_filtered = []
gzip_normal_fixed = []
gzip_normal_huffman = []
gzip_best_speed = []
gzip_best_speed_filtered = []
gzip_best_speed_fixed = []
gzip_best_speed_huffman = []
gzip_best_compression = []
gzip_best_compression_filtered = []
gzip_best_compression_fixed = []
gzip_best_compression_huffman = []
# Compress time benchmark
puts 'Time to compress the files'
Benchmark.ips do |x|
x.report('GZip (normal mode)') do
gzip_normal << gzip(value_json)
end
x.report('GZip (normal mode) (filtered)') do
gzip_normal_filtered << gzip(value_json, strategy: Zlib::FILTERED)
end
x.report('GZip (normal mode) (fixed)') do
gzip_normal_fixed << gzip(value_json, strategy: Zlib::FIXED)
end
x.report('GZip (normal mode) (huffman)') do
gzip_normal_huffman << gzip(value_json, strategy: Zlib::HUFFMAN_ONLY)
end
x.report('GZip (speed mode)') do
gzip_best_speed << gzip(value_json, compression: Zlib::BEST_SPEED)
end
x.report('GZip (speed mode) (filtered)') do
gzip_best_speed_filtered << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::FILTERED)
end
x.report('GZip (speed mode) (fixed)') do
gzip_best_speed_fixed << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::FIXED)
end
x.report('GZip (speed mode) (huffman)') do
gzip_best_speed_huffman << gzip(value_json, compression: Zlib::BEST_SPEED, strategy: Zlib::HUFFMAN_ONLY)
end
x.report('GZip (compression mode)') do
gzip_best_compression << gzip(value_json, compression: Zlib::BEST_COMPRESSION)
end
x.report('GZip (compression mode) (filtered)') do
gzip_best_compression_filtered << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::FILTERED)
end
x.report('GZip (compression mode) (fixed)') do
gzip_best_compression_fixed << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::FIXED)
end
x.report('GZip (compression mode) (huffman)') do
gzip_best_compression_huffman << gzip(value_json, compression: Zlib::BEST_COMPRESSION, strategy: Zlib::HUFFMAN_ONLY)
end
x.compare!
end
puts
def average_size(array)
array.map(&:bytesize).sum / array.size
end
# Gets the memory size (in bytes) of each compressed array
gzip_normal_compressed_average_size = average_size(gzip_normal)
gzip_normal_filtered_average_size = average_size(gzip_normal_filtered)
gzip_normal_fixed_average_size = average_size(gzip_normal_fixed)
gzip_normal_huffman_average_size = average_size(gzip_normal_huffman)
gzip_best_speed_compressed_average_size = average_size(gzip_best_speed)
gzip_best_speed_filtered_average_size = average_size(gzip_best_speed_filtered)
gzip_best_speed_fixed_average_size = average_size(gzip_best_speed_fixed)
gzip_best_speed_huffman_average_size = average_size(gzip_best_speed_huffman)
gzip_best_compression_compressed_average_size = average_size(gzip_best_compression)
gzip_best_compression_filtered_average_size = average_size(gzip_best_compression_filtered)
gzip_best_compression_fixed_average_size = average_size(gzip_best_compression_fixed)
gzip_best_compression_huffman_average_size = average_size(gzip_best_compression_huffman)
averages = [
{ name: 'GZip (normal mode)', value: gzip_normal_compressed_average_size },
{ name: 'GZip (normal mode) (filtered)', value: gzip_normal_filtered_average_size },
{ name: 'GZip (normal mode) (fixed)', value: gzip_normal_fixed_average_size },
{ name: 'GZip (normal mode) (huffman)', value: gzip_normal_huffman_average_size },
{ name: 'GZip (speed mode)', value: gzip_best_speed_compressed_average_size },
{ name: 'GZip (speed mode) (filtered)', value: gzip_best_speed_filtered_average_size },
{ name: 'GZip (speed mode) (fixed)', value: gzip_best_speed_fixed_average_size },
{ name: 'GZip (speed mode) (huffman)', value: gzip_best_speed_huffman_average_size },
{ name: 'GZip (best compression mode) ', value: gzip_best_compression_compressed_average_size },
{ name: 'GZip (best compression mode) (filtered)', value: gzip_best_compression_filtered_average_size },
{ name: 'GZip (best compression mode) (fixed)', value: gzip_best_compression_fixed_average_size },
{ name: 'GZip (best compression mode) (huffman)', value: gzip_best_compression_huffman_average_size },
].sort_by { |a| a[:value] }
# Prints the memory used after compression on each algorithm
puts 'Memory used by the compression mode'
puts "Original file size #{uncompressed_memory_usage} bytes"
averages.each do |av|
puts "#{av[:name]} #{av[:value]} bytes (#{(100.0 * av[:value] / uncompressed_memory_usage).round(2)} %)"
end
puts

The results are:
Based on those values, we could use the …
Answering @Anilm3 #3177 (comment): I tested to find the payload size below which it is not worth compressing, using a simple script:

# frozen_string_literal: true
require 'zlib'
require 'json'
require 'stringio'
require 'base64'
def gzip(value, compression = Zlib::BEST_SPEED, strategy = Zlib::DEFAULT_STRATEGY)
sio = StringIO.new
gz = Zlib::GzipWriter.new(sio, compression, strategy)
gz.write(value)
gz.close
sio.string
end
def json(value)
JSON.dump(value)
end
def base64(value)
Base64.encode64(value)
end
testing = true
value = [{}]
words = File.readlines('/usr/share/dict/words')
schema_values = [0, 1, 2, 4, 8, 16]
iteration = 0
while testing
json_value = json(value)
compressed_and_encoded_value = base64(gzip(json_value))
if json_value.size >= compressed_and_encoded_value.size
puts "json value size is #{json_value.size}"
puts "compressed and encoded size is #{compressed_and_encoded_value.size}"
puts "JSON #{json_value}"
puts "Compressed value #{compressed_and_encoded_value}"
testing = false
next
end
iteration += 1
puts "Iteration #{iteration}"
key = words.sample.chomp
value[0][key] = [schema_values.sample]
end
puts 'DONE'

The results showed that payloads below 260 bytes are not worth compressing.
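Tying that back to the earlier suggestion, here is a sketch of how the finding could be applied. MIN_SCHEMA_SIZE_FOR_COMPRESSION reuses the name from the pseudocode above and the 260 comes from this experiment, so treat it as illustrative rather than the final implementation:

# Below this size, gzip + Base64 tends to produce a larger payload than the
# raw JSON string, per the experiment above.
MIN_SCHEMA_SIZE_FOR_COMPRESSION = 260

json = JSON.dump(value)
tags[key] = if json.size >= MIN_SCHEMA_SIZE_FOR_COMPRESSION
              Base64.encode64(gzip(json))
            else
              json
            end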
…string exceeds certain byte size
Thank you so much, @GustavoCaso!
What does this PR do?
The RFC for API Security schema extraction recommends compressing and encoding the schema extraction information before appending it to the root span, to reduce the number of bytes sent over the wire. The RFC also defines a hard limit on the maximum size of individual schema values.
This PR adds the necessary code to compress and encode the schema information.
Motivation:
Additional Notes:
The last commit f9af157 makes use of Zlib::GzipWriter instead of Zlib.gzip because Ruby 2.2 and 2.3 do not have that method.
Also, the RBS repo does not have the Zlib library up to date, so I had to add those signatures in vendor/rbs/zlib manually. I wanted to upgrade the rbs version, but unfortunately the RBS version that updated the Zlib signatures dropped support for Ruby 2.x (https://github.com/ruby/rbs/releases/tag/v3.2.0.pre.1), so I had to manually port it. We decided to update the RBS version instead 😄
How to test the change?
For Datadog employees:
… credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
Unsure? Have a question? Request a review!