ScanTranscode - Convert New/Uncommon Image Formats #324

Merged: 11 commits, Feb 15, 2023

ryanohoro (Collaborator) commented Feb 14, 2023

Describe the change

HEIC/AVIF/HEIF images are appearing more frequently as platforms such as mobile devices adopt these newer codecs for better compression. Common tools and modules lack support for them, including the tesseract dependency that Strelka uses for OCR.

The transcode step is fast: the sample run below reports about 0.024 sec for the 49 KB test_qr.avif fixture.

Adds a scanner, ScanTranscode, that converts files such as HEIC/AVIF/HEIF images into other formats. The default output is JPEG at quality 90, which preserves OCR quality without inflating file sizes.

Adds tests for three new fixtures, each transcoded into all six available output formats, plus one test for a broken image.
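
As a rough illustration of how that test matrix adds up to the 19 results reported for tests/test_scan_transcode.py, here is a minimal sketch. The fixture names and the list of output formats are assumptions made for the example; the description only gives the counts.

```python
from itertools import product

# Hypothetical fixture names and format list: the PR description states only
# the counts (three fixtures, six output formats), not these exact values.
FIXTURES = ["test.avif", "test.heic", "test.heif"]
OUTPUT_FORMATS = ["gif", "webp", "jpeg", "bmp", "png", "tiff"]


def transcode_cases():
    """Return one (fixture, output_format) pair per fixture/format combination."""
    return list(product(FIXTURES, OUTPUT_FORMATS))


cases = transcode_cases()
total_tests = len(cases) + 1  # plus the single broken-image test
print(total_tests)
```

A list like this could feed a parametrized pytest test, so each combination is reported as its own result.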

Adds negative matches for ScanTranscode in ScanJpeg, ScanPngEof, ScanBmpEof, ScanNf, and ScanLsb, so those scanners skip files emitted by ScanTranscode.
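
To make the idea of a negative match concrete, the sketch below shows one way a source-based exclusion could be evaluated. The rule layout, the function name, and the flavor values are assumptions for illustration, not Strelka's actual assignment code.

```python
def should_assign(scanner_rules, file_flavors, file_source):
    """Decide whether a scanner applies to a file.

    scanner_rules is a hypothetical dict shaped like
    {"positive": {"flavors": [...]}, "negative": {"source": [...]}}.
    """
    # A negative source match wins over any positive flavor match.
    negative_sources = scanner_rules.get("negative", {}).get("source", [])
    if file_source in negative_sources:
        return False
    positive_flavors = scanner_rules.get("positive", {}).get("flavors", [])
    return any(flavor in positive_flavors for flavor in file_flavors)


# Hypothetical rules for ScanNf: run on PNG files unless ScanTranscode made them.
scan_nf_rules = {
    "positive": {"flavors": ["image/png", "png_file"]},
    "negative": {"source": ["ScanTranscode"]},
}

print(should_assign(scan_nf_rules, ["image/png"], None))
print(should_assign(scan_nf_rules, ["image/png"], "ScanTranscode"))
```

The first call models a PNG submitted directly; the second models a PNG produced by transcoding, which the rule excludes.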

Improves exception handling in emit_file().

Describe testing procedures

============================= test session starts ==============================
platform linux -- Python 3.10.6, pytest-7.2.0, pluggy-1.0.0
rootdir: /strelka
plugins: mock-3.10.0, unordered-0.5.2
collected 121 items

tests/test_scan_tar.py .
tests/test_scan_tlsh.py .
tests/test_scan_transcode.py ...................
tests/test_scan_upx.py .
tests/test_scan_url.py ..

====================== 122 passed, 28 warnings in 45.33s =======================

Sample output

Sample transcode:

./strelka-oneshot -l - -f src/python/strelka/tests/fixtures/test_qr.avif
{"file":{"depth":0,"flavors":{"mime":["image/avif"]},"name":"src/python/strelka/tests/fixtures/test_qr.avif","scanners":["ScanEntropy","ScanExiftool","ScanFooter","ScanHash","ScanHeader","ScanTlsh","ScanTranscode","ScanYara"],"size":49499,"tree":{"node":"d03763ca-52d3-4375-b1df-efeb62bfa1a1","root":"d03763ca-52d3-4375-b1df-efeb62bfa1a1"}},"request":{"attributes":{"filename":"src/python/strelka/tests/fixtures/test_qr.avif"},"client":"go-oneshot","id":"d03763ca-52d3-4375-b1df-efeb62bfa1a1","source":"ubuntu","time":1676416991},"scan":{"entropy":{"elapsed":0.000057,"entropy":7.995014927024274},"exiftool":{"elapsed":0.149846,"keys":[{"key":"ImageWidth","value":999},{"key":"ImageHeight","value":609}]},"footer":{"backslash":"\\x8c\\x98\\xd3\\xf0\\x0f\\x96\\r\\xce\\x14\\xcf\\xa0\\xda~P\\xdaOx\\\\A\\xa7br\\xd0A\\x04;S\\x87\\xe3 I\\xcdI\\x8c[b\\x80\\x85\\xcc\\x1f\\xdf]k\\xae\\x86\\xc5s\\x8dm\\xc0","elapsed":0.000036,"footer":"����\u000f�\r�\u0014Ϡ�~P�Ox\\A�br�A\u0004;S�� I�I�[b���\u001f�]k���s�m�"},"hash":{"elapsed":0.003897,"md5":"2609004bb9c371a869fa324b0fabdf5c","sha1":"9271dffbecf652db7803d0a930091c462428b18a","sha256":"f188590f54b3cbf60af59af4a797af5b6441a6de6b4bacb191417c2c8c71b1e9","ssdeep":"1536:DzNgGMG0z6S54BaP3phIkuboHbtwOeSwY:SB6aP5h6boHbtzr","tlsh":"T11D23027E3252DC22D96B857E5BD31316451210D901AE43AF3A7CF792432E227F9C793A"},"header":{"backslash":"\\x00\\x00\\x00\\x1cftypavif\\x00\\x00\\x00\\x00avifmif1miaf\\x00\\x00\\x00\\xeameta\\x00\\x00\\x00\\x00\\x00\\x00\\x00!hdlr\\x00\\x00","elapsed":0.000037,"header":"\u0000\u0000\u0000\u001cftypavif\u0000\u0000\u0000\u0000avifmif1miaf\u0000\u0000\u0000�meta\u0000\u0000\u0000\u0000\u0000\u0000\u0000!hdlr\u0000\u0000"},"tlsh":{"elapsed":0.001291},"transcode":{"elapsed":0.02379,"flags":["transcoded"]},"yara":{"elapsed":0.001789,"matches":["test"]}}}
{"file":{"depth":1,"flavors":{"mime":["image/jpeg"],"yara":["jpeg_file"]},"name":"src/python/strelka/tests/fixtures/test_qr.avif","scanners":["ScanEntropy","ScanExiftool","ScanFooter","ScanHash","ScanHeader","ScanJpeg","ScanOcr","ScanQr","ScanTlsh","ScanYara"],"size":152620,"source":"ScanTranscode","tree":{"node":"8320996b-8446-4a10-b4c2-4e6b5317f58d","parent":"d03763ca-52d3-4375-b1df-efeb62bfa1a1","root":"d03763ca-52d3-4375-b1df-efeb62bfa1a1"}},"request":{"attributes":{"filename":"src/python/strelka/tests/fixtures/test_qr.avif"},"client":"go-oneshot","id":"d03763ca-52d3-4375-b1df-efeb62bfa1a1","source":"ubuntu","time":1676416991},"scan":{"entropy":{"elapsed":0.000138,"entropy":7.904183260122295},"exiftool":{"elapsed":0.134133,"keys":[{"key":"ImageWidth","value":999},{"key":"ImageHeight","value":609}]},"footer":{"backslash":"(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80\\n(\\xa2\\x80?\\xff\\xd9","elapsed":0.000037,"footer":"(��\n(��\n(��\n(��\n(��\n(��\n(��\n(��\n(��\n(��\n(��\n(��?��"},"hash":{"elapsed":0.002865,"md5":"e22c8e84ca56433a9daca303d67d26f8","sha1":"93f040ddcf843092fd1583ff30c95045df3bd939","sha256":"4873a11562246dd8627e7688980934a76f0f4dd8223a58eee60dfab37ee926ce","ssdeep":"3072:2vqrjJqvKOpCGdZfF48LgIqo0VLEwMXs0OCPlj9AVBz1vv:2viOphFdLghosLFCPlSBxX","tlsh":"T19FE3E0138E658F93A5ADD3BCBE934E319F8C161CF5A232EA40660D8637A52264C4F51E"},"header":{"backslash":"\\xff\\xd8\\xff\\xe0\\x00\\x10JFIF\\x00\\x01\\x01\\x00\\x00\\x01\\x00\\x01\\x00\\x00\\xff\\xdb\\x00C\\x00\\x03\\x02\\x02\\x03\\x02\\x02\\x03\\x03\\x03\\x03\\x04\\x03\\x03\\x04\\x05\\x08\\x05\\x05\\x04\\x04\\x05\\n\\x07\\x07\\x06","elapsed":0.00002,"header":"����\u0000\u0010JFIF\u0000\u0001\u0001\u0000\u0000\u0001\u0000\u0001\u0000\u0000��\u0000C\u0000\u0003\u0002\u0002\u0003\u0002\u0002\u0003\u0003\u0003\u0003\u0004\u0003\u0003\u0004\u0005\u0008\u0005\u0005\u0004\u0004\u0005\n\u0007\u0007\u0006"},"jpeg":{"elapsed":0.007548,"flags":["no_trailer"]},"ocr":{"elapsed":0.900662,"text":["Lorem","Ipsum","Lorem","ipsum","dolor","sit","amet,","consectetur","adipiscing","elit.","Cras","lobortis","sem","dui.","Morbi","at","magna","quis","ligula","faucibus","consectetur","feugiat","at","purus.","Sed","nec","lorem","nibh.","Nam","vel","libero","odio.","Vivamus","tempus","non","enim","egestas","pretium.","Vestibulum","turpis","arcu,","maximus","nec","libero","quis,","imperdiet","suscipit","purus.","Vestibulum","blandit","quis","lacus","non","sollicitudin.","Nullam","non","convallis","dui,","et","aliquet","risus.","Sed","accumsan","ullamcorper","vehicula.","Proin","non","urna","facilisis,","condimentum","eros","quis,","suscipit","purus.","Morbi","euismod","imperdiet","neque","fermentum","dictum.","Integer","aliquam,","erat","sit","amet","fringilla","tempus,","mauris","ligula","blandit","sapien,","et","varius","sem","mauris","eu","diam.","Sed","fringilla","neque","est,","in","laoreet","felis","tristique","in.","Donec","luctus","velit","a","posuere","posuere.","Suspendisse","sodales","pellentesque","quam."]},"qr":{"data":"https://www.example.com/","elapsed":0.039702,"type":"url"},"tlsh":{"elapsed":0.001294},"yara":{"elapsed":0.000269,"matches":["test"]}}}
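
The two events above are linked through file.tree: the transcoded JPEG carries the AVIF's node id as its parent and names ScanTranscode as its source. A minimal sketch of recovering that relationship from newline-delimited events follows; the helper name is hypothetical, and the events are trimmed copies that keep only the fields used here.

```python
import json

# Trimmed copies of the two events above; only the fields used here are kept.
ROOT_ID = "d03763ca-52d3-4375-b1df-efeb62bfa1a1"
CHILD_ID = "8320996b-8446-4a10-b4c2-4e6b5317f58d"
EVENT_LINES = [
    json.dumps({"file": {"flavors": {"mime": ["image/avif"]},
                         "tree": {"node": ROOT_ID, "root": ROOT_ID}}}),
    json.dumps({"file": {"flavors": {"mime": ["image/jpeg"]},
                         "source": "ScanTranscode",
                         "tree": {"node": CHILD_ID, "parent": ROOT_ID,
                                  "root": ROOT_ID}}}),
]


def children_by_parent(lines):
    """Map each parent node id to the (mime, source) pairs of its child files."""
    children = {}
    for line in lines:
        file_info = json.loads(line)["file"]
        parent = file_info["tree"].get("parent")
        if parent is None:
            continue  # root file: no scanner emitted it
        mime = file_info["flavors"]["mime"][0]
        children.setdefault(parent, []).append((mime, file_info.get("source")))
    return children


print(children_by_parent(EVENT_LINES))
```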

Same image without transcoding:

./strelka-oneshot -l - -f src/python/strelka/tests/fixtures/test_qr.png
{"file":{"depth":0,"flavors":{"mime":["image/png"],"yara":["png_file"]},"name":"src/python/strelka/tests/fixtures/test_qr.png","scanners":["ScanEntropy","ScanExiftool","ScanFooter","ScanHash","ScanHeader","ScanLsb","ScanNf","ScanOcr","ScanPngEof","ScanQr","ScanTlsh","ScanYara"],"size":212521,"tree":{"node":"37d3fc95-af28-493a-8fcf-4de66445ebef","root":"37d3fc95-af28-493a-8fcf-4de66445ebef"}},"request":{"attributes":{"filename":"src/python/strelka/tests/fixtures/test_qr.png"},"client":"go-oneshot","id":"37d3fc95-af28-493a-8fcf-4de66445ebef","source":"ubuntu","time":1676417515},"scan":{"entropy":{"elapsed":0.000172,"entropy":7.992020010977977},"exiftool":{"elapsed":0.11114,"keys":[{"key":"ImageWidth","value":999},{"key":"ImageHeight","value":609}]},"footer":{"backslash":"\\x80\\xae\\x1d\\x02\\x02\\x02\\x02\\x02\\x02\\x02\\x02\\x02\\x82\\xd5\\x01];\\x04\\x04\\x04\\x04\\x04\\x04\\x04\\x04\\x04\\x04\\xab\\xe3\\xffsqE\\x8c\u003c\\x9d\\x8e\\xb7\\x00\\x00\\x00\\x00IEND\\xaeB`\\x82","elapsed":0.000041,"footer":"��\u001d\u0002\u0002\u0002\u0002\u0002\u0002\u0002\u0002\u0002��\u0001];\u0004\u0004\u0004\u0004\u0004\u0004\u0004\u0004\u0004\u0004���sqE�\u003c���\u0000\u0000\u0000\u0000IEND�B`�"},"hash":{"elapsed":0.004249,"md5":"a1c114a65b72bae0579b215130480f30","sha1":"f331db34ca64d1c99962ddcdb40219d3814818d0","sha256":"cf611050fff9dde3d80a701ada553c0cc55005b2e95ca1edfe2874afd210adf8","ssdeep":"3072:5fUCGz2YqZiZnKcw1VQyAhxeH8XmA/2u9c7CGVryDcyBrGaUL4YJdEKAj:5fZ62pbcteREOCtbm8IEXj","tlsh":"T11F24125C1A10D5F2BB19237870CD135292BC86188BFE8CEADB63784723D0C6EC2D9B45"},"header":{"backslash":"\\x89PNG\\r\\n\\x1a\\n\\x00\\x00\\x00\\rIHDR\\x00\\x00\\x03\\xe7\\x00\\x00\\x02a\\x08\\x02\\x00\\x00\\x00\\xac\\x02V\\x9a\\x00\\x00\\xff\\xffIDATx\\xda\\xec\\xbdw\\x9c\\x14U\\xf6","elapsed":0.000067,"header":"�PNG\r\n\u001a\n\u0000\u0000\u0000\rIHDR\u0000\u0000\u0003�\u0000\u0000\u0002a\u0008\u0002\u0000\u0000\u0000�\u0002V�\u0000\u0000��IDATx��w�\u0014U�"},"lsb":{"elapsed":0.010761,"lsb":false},"nf":{"elapsed":0.008657,"noise_floor":true,"percentage":0.0684806316990225,"threshold":0.25},"ocr":{"elapsed":0.509432,"text":["Lorem","Ipsum","Lorem","ipsum","dolor","sit","amet,","consectetur","adipiscing","elit.","Cras","lobortis","sem","dui.","Morbi","at","magna","quis","ligula","faucibus","consectetur","feugiat","at","purus.","Sed","nec","lorem","nibh.","Nam","vel","libero","odio.","Vivamus","tempus","non","enim","egestas","pretium.","Vestibulum","turpis","arcu,","maximus","nec","libero","quis,","imperdiet","suscipit","purus.","Vestibulum","blandit","quis","lacus","non","sollicitudin.","Nullam","non","convallis","dui,","et","aliquet","risus.","Sed","accumsan","ullamcorper","vehicula.","Proin","non","uma","facilisis,","condimentum","eros","quis,","suscipit","purus.","Morbi","euismod","imperdiet","neque","fermentum","dictum.","Integer","aliquam,","erat","sit","amet","fringilla","tempus,","mauris","ligula","blandit","sapien,","et","varius","sem","mauris","eu","diam.","Sed","fringilla","neque","est,","in","laoreet","felis","tristique","in.","Donec","luctus","velit","a","posuere","posuere.","Suspendisse","sodales","pellentesque","quam."]},"png_eof":{"elapsed":0.000043,"flags":["no_trailer"]},"qr":{"data":"https://www.example.com/","elapsed":0.041885,"type":"url"},"tlsh":{"elapsed":0.003085},"yara":{"elapsed":0.000657,"matches":["test"]}}}

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of and tested my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

Comment on lines 26 to 37
    def scan(self, data, file, options, expire_at):
        output_format = options.get("output_format", "jpeg")

        def convert(im):
            with io.BytesIO() as f:
                im.save(f, format=f"{output_format}", quality=90)
                return f.getvalue()

        # Send extracted file back to Strelka
        self.emit_file(convert(Image.open(io.BytesIO(data))), name=file.name)

        self.flags.append("transcoded")
Contributor

I know we discussed global exception handling - curious how you want to approach adding exceptions here. My first thought would be something like...

    def scan(self, data, file, options, expire_at):
        output_format = options.get("output_format", "jpeg")

        def convert(im):
            with io.BytesIO() as f:
                try:
                    im.save(f, format=f"{output_format}", quality=90)
                    return f.getvalue()
                except ValueError:
                    self.flags.append(f"{self.__class__.__name__} Exception:  Invalid format or quality.")
                except OSError:
                    self.flags.append(f"{self.__class__.__name__} Exception:  Unsupported format or invalid image file.")
                except AttributeError:
                    self.flags.append(f"{self.__class__.__name__} Exception:  Data is not a bytes-like object.")
                except Exception as e:
                    self.flags.append(f"{self.__class__.__name__} Exception: {str(e)[:50]}")

        # Send extracted file back to Strelka
        try:
            self.emit_file(convert(Image.open(io.BytesIO(data))), name=file.name)
        except Exception as e:
            self.flags.append(f"{self.__class__.__name__} Exception: Failed to emit file")
            return

        self.flags.append("transcoded")

Too much? Too specific to scanner?

ryanohoro (Collaborator, Author) commented Feb 15, 2023

I catch UnidentifiedImageError now, which is what Image.open raises when it is given a broken image. That seems fair to do if we expect a lot of exceptions from badly formatted or truncated image files. I'd like to add other specific exceptions as they come up while running Strelka.

Added a test for a broken image.
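
For readers following along, a narrow catch of this kind might look roughly like the sketch below. It is a standalone function, not the merged scanner code; the function name and the "unidentified_image" flag are assumptions. It uses only formats Pillow handles out of the box, since HEIC/AVIF input would additionally need a Pillow plugin to be registered.

```python
import io

from PIL import Image, UnidentifiedImageError


def transcode(data, output_format="jpeg", quality=90):
    """Return (converted_bytes, flags); converted_bytes is None for unreadable input."""
    flags = []
    try:
        image = Image.open(io.BytesIO(data))
    except UnidentifiedImageError:
        # Hypothetical flag name; only this specific failure is handled here.
        flags.append("unidentified_image")
        return None, flags

    with io.BytesIO() as buffer:
        # JPEG has no alpha channel, so normalise the mode before saving.
        if output_format == "jpeg" and image.mode != "RGB":
            image = image.convert("RGB")
        image.save(buffer, format=output_format, quality=quality)
        flags.append("transcoded")
        return buffer.getvalue(), flags


# Build a tiny PNG in memory, then transcode it and a deliberately broken input.
png = io.BytesIO()
Image.new("RGBA", (8, 8), "white").save(png, format="png")
good_bytes, good_flags = transcode(png.getvalue())
bad_bytes, bad_flags = transcode(b"not an image")
print(good_flags, bad_flags)
```

Any other exception still propagates, which matches the preference stated above for adding specific handlers only as real cases appear.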

I'm not keen on a broader catch. If emit_file() itself fails, I think that should be handled inside of emit_file(). I added code to add a flag when emit_file fails, and log the exception. If emit_file fails, it's likely due to a coordinator connectivity problem, and we shouldn't suppress those exceptions.
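
A rough sketch of that division of responsibility is shown below, assuming a simplified scanner class and a hypothetical send callable in place of the coordinator client. The flag name is also an assumption, and the thread does not say whether the real method re-raises after logging; this sketch does not.

```python
import logging


class SketchScanner:
    """Hypothetical stand-in for a scanner base class, not Strelka's real one."""

    def __init__(self, send):
        self.flags = []
        self._send = send  # hypothetical callable that ships bytes to the coordinator

    def emit_file(self, data, name=""):
        try:
            self._send(data, name)
        except Exception:
            # Record the failure on the event and keep the full traceback in the logs.
            logging.exception("failed to emit file")
            self.flags.append("failed_to_emit_file")


def unreachable_coordinator(data, name):
    """Simulate a coordinator connectivity problem."""
    raise ConnectionError("coordinator unavailable")


scanner = SketchScanner(unreachable_coordinator)
scanner.emit_file(b"\xff\xd8", name="example.jpg")
print(scanner.flags)
```

With the handling inside emit_file(), individual scanners such as ScanTranscode do not need their own try/except around the call.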

@phutelmyer phutelmyer merged commit f5a85c4 into target:master Feb 15, 2023