Updating ScanPDF to store Xref objects in a list #343

morriscode · 2023-03-01T23:08:21Z

Describe the change
Prior to this change, the output of PDF Scanner would show only the count of the xref objects within the file. This change adds each xref object to a list, which is then displayed as part of the output.
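
As a rough illustration of the change described above, here is a minimal sketch of collecting each xref object into a list next to the count. It is not the scanner's actual code: `collect_xref_objects`, `read_object`, and the dict standing in for a parsed PDF are all hypothetical, and only the `xrefs` and `xref_object` field names come from the sample output below.

```python
from typing import Callable, Dict, Iterable, List, Union


def collect_xref_objects(
    xref_numbers: Iterable[int],
    read_object: Callable[[int], str],
) -> Dict[str, Union[int, List[str]]]:
    """Return the xref count plus the source text of each readable object."""
    xref_numbers = list(xref_numbers)
    objects: List[str] = []
    for xref in xref_numbers:
        try:
            objects.append(read_object(xref))
        except (KeyError, ValueError):
            # Assumed failure modes: skip objects the reader cannot return.
            continue
    return {"xrefs": len(xref_numbers), "xref_object": objects}


# Stand-in for a parsed PDF (hypothetical): xref number -> object source.
fake_pdf = {
    1: "<</Type/Catalog/Pages 2 0 R>>",
    2: "<</Type/Pages/Count 1/Kids[3 0 R]>>",
}
result = collect_xref_objects(fake_pdf.keys(), fake_pdf.__getitem__)
```

In this sketch the count reflects every xref number seen, while the list holds only the objects that could be read.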

Describe testing procedures
Updated test_scan_pdf.py and confirmed a passing test.

tests/test_scan_pcap.py ..
tests/test_scan_pdf.py .
tests/test_scan_pe.py .
tests/test_scan_pgp.py ....

Sample output

            "pages": 1,
            "producer": "Microsoft® Word 2016",
            "repaired": false,
            "words": 418,
            "xref_object": ["<</Type/Catalog/Pages 2 0 R/Lang(en-US)/StructTreeRoot 15 0 R/MarkInfo<</Marked true>>>>", "<</Type/Pages/Count 1/Kids[3 0 R]>>", "<</Type/Page/Parent 2 0 R/Resources<</ExtGState<</GS5 5 0 R/GS8 8 0 R>>/Font<</F1 6 0 R/F2 10 0 R/F3 12 0 R>>/XObject<</Image9 9 0 R>>/ProcSet[/PDF/Text/ImageB/ImageC/ImageI]>>/MediaBox[0 0 612 792]/Contents 4 0 R/Group<</Type/Group/S/Transparency/CS/DeviceRGB>>/Tabs/S/StructParents 0>>", "<</Filter/FlateDecode/Length 4050>>", "<</Type/ExtGState/BM/Normal/ca 1>>", "<</Type/Font/Subtype/TrueType/Name/F1/BaseFont/TimesNewRomanPSMT/Encoding/WinAnsiEncoding/FontDescriptor 7 0 R/FirstChar 32/LastChar 117/Widths 36 0 R>>", "<</Type/FontDescriptor/FontName/TimesNewRomanPSMT/Flags 32/ItalicAngle 0/Ascent 891/Descent -216/CapHeight 693/AvgWidth 401/MaxWidth 2614/FontWeight 400/XHeight 250/Leading 42/StemV 40/FontBBox[-568 -216 2046 693]>>", "<</Type/ExtGState/BM/Normal/CA 1>>", "<</Type/XObject/Subtype/Image/Width 340/Height 245/ColorSpace/DeviceRGB/BitsPerComponent 8/Filter/DCTDecode/Interpolate true/Length 21001>>", "<</Type/Font/Subtype/TrueType/Name/F2/BaseFont/ABCDEE+Calibri/Encoding/WinAnsiEncoding/FontDescriptor 11 0 R/FirstChar 32/LastChar 32/Widths 37 0 R>>", "<</Type/FontDescriptor/FontName/ABCDEE+Calibri/Flags 32/ItalicAngle 0/Ascent 750/Descent -250/CapHeight 750/AvgWidth 521/MaxWidth 1743/FontWeight 400/XHeight 250/StemV 52/FontBBox[-503 -250 1240 750]/FontFile2 38 0 R>>", "<</Type/Font/Subtype/TrueType/Name/F3/BaseFont/ArialMT/Encoding/WinAnsiEncoding/FontDescriptor 13 0 R/FirstChar 32/LastChar 120/Widths 39 0 R>>", "<</Type/FontDescriptor/FontName/ArialMT/Flags 32/ItalicAngle 0/Ascent 905/Descent -210/CapHeight 728/AvgWidth 441/MaxWidth 2665/FontWeight 400/XHeight 250/Leading 33/StemV 44/FontBBox[-665 -210 2000 728]>>", 
"<</Author(Ryan.OHoro)/Creator<FEFF004D006900630072006F0073006F0066007400AE00200057006F0072006400200032003000310036>/CreationDate(D:20221216134852-06'00')/ModDate(D:20221216134852-06'00')/Producer<FEFF004D006900630072006F0073006F0066007400AE00200057006F0072006400200032003000310036>>>", "<</Type/StructTreeRoot/RoleMap 16 0 R/ParentTree 17 0 R/K[18 0 R]/ParentTreeNextKey 1>>", "<</Footnote/Note/Endnote/Note/Textbox/Sect/Header/Sect/Footer/Sect/InlineShape/Sect/Annotation/Sect/Artifact/Sect/Workbook/Document/Worksheet/Part/Macrosheet/Part/Chartsheet/Part/Dialogsheet/Part/Slide/Part/Chart/Sect/Diagram/Figure>>", "<</Nums[0 21 0 R]>>", "<</P 15 0 R/S/Part/Type/StructElem/K[19 0 R 25 0 R 28 0 R 29 0 R 30 0 R 31 0 R 32 0 R 33 0 R 34 0 R 35 0 R]>>", "<</P 18 0 R/S/H1/Type/StructElem/K[20 0 R 23 0 R 24 0 R]/Pg 3 0 R>>", "<</P 19 0 R/S/Span/Type/StructElem/Pg 3 0 R/K 0>>", "[20 0 R 23 0 R 24 0 R 27 0 R 26 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 28 0 R 29 0 R 29 0 R 29 0 R 29 0 R 29 0 R 29 0 R 29 0 R 29 0 R 29 0 R 29 0 R 30 0 R 30 0 R 30 0 R 30 0 R 30 0 R 30 0 R 30 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 31 0 R 32 0 R 32 0 R 32 0 R 32 0 R 32 0 R 32 0 R 32 0 R 32 0 R 32 0 R 32 0 R 33 0 R 33 0 R 34 0 R 34 0 R 35 0 R]", "<</Type/ObjStm/N 20/First 142/Filter/FlateDecode/Length 601>>", "<</P 19 0 R/S/Span/Type/StructElem/ActualText(Lorem Ipsum)/K[1]/Pg 3 0 R>>", "<</P 19 0 R/S/Span/Type/StructElem/ActualText( )/K[2]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[26 0 R 27 0 R]/Pg 3 0 R>>", "<</P 25 0 R/S/Span/Type/StructElem/Pg 3 0 R/K 4>>", "<</P 25 0 R/S/InlineShape/Alt()/Type/StructElem/K[3]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[5 6 7 8 9 10 11 12 13 14 15 16]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[17 18 19 20 21 22 23 24 25 26]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[27 28 29 30 31 32 33]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[34 35 36 37 38 39 40 41 42 43 44]/Pg 3 0 
R>>", "<</P 18 0 R/S/P/Type/StructElem/K[45 46 47 48 49 50 51 52 53 54]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[55 56]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[57 58]/Pg 3 0 R>>", "<</P 18 0 R/S/P/Type/StructElem/K[59]/Pg 3 0 R>>", "[250 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 333 0 0 611 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 444 0 0 0 0 0 0 0 778 0 500 500 0 333 389 0 500]", "[226]", "<</Filter/FlateDecode/Length 175850/Length1 537988>>", "[278 0 0 0 0 0 0 0 0 0 0 0 278 0 278 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 667 0 722 722 0 0 0 0 278 0 0 556 833 722 0 667 778 0 667 0 0 667 0 0 0 0 0 0 0 0 0 0 556 556 500 556 556 278 556 556 222 222 0 222 833 556 556 556 556 333 500 278 556 500 0 500]", "<</Type/XRef/Size 40/W[1 4 2]/Root 1 0 R/Info 14 0 R/ID[<996084F03FED2848AB7A00AD5BCAA8E6><996084F03FED2848AB7A00AD5BCAA8E6>]/Filter/FlateDecode/Length 132>>"],
            "xrefs": 40
        },

Checklist

My code follows the style guidelines of this project
I have performed a self-review of and tested my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings

phutelmyer · 2023-03-02T12:48:06Z

Thank you @morriscode! Reviewing today :)

phutelmyer · 2023-03-02T13:01:37Z

Everything looks good @morriscode. I went ahead and reformatted the two files.

I'm curious whether you ever saw events with excessive Xref arrays (more than 1k or 10k entries). We have limiters on some arrays in other scanners (for example, ScanJavascript tokens/keywords) because we've seen those arrays run into the tens of thousands.

Not sure if that applies here, but curious what your thoughts are.

morriscode · 2023-03-02T16:15:23Z

Thanks @phutelmyer!

Setting a limit on the array size would be a good idea! It's entirely possible that we could encounter files that generate thousands. I submitted a variety of files during my testing; the largest I encountered was a 10.8 MB PDF that produced 504 xref objects.

I also submitted several larger PDFs (90 MB+); however, processing seems to have timed out before reaching scan_pdf.

The limiter can likely be added to more than just xref.

phutelmyer · 2023-03-03T13:11:31Z

@morriscode - I've added that limiter functionality to this scanner and updated the associated tests. If you don't mind giving it a quick review, I'll merge it in if you're good with it 👍

Once again, appreciate the PR.
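
For readers following along, the limiter idea can be sketched roughly as below. This is an assumption-laden illustration, not the merged code: `build_xref_fields` and the `max_objects` parameter name are hypothetical (the name is borrowed loosely from the commit titles), and the exact cutoff the scanner applies may differ.

```python
from typing import Dict, List, Union


def build_xref_fields(
    all_objects: List[str],
    max_objects: int,
) -> Dict[str, Union[int, List[str]]]:
    """Cap the stored objects while still reporting the full count."""
    limit = max(0, max_objects)  # guard against a negative setting
    return {
        "xrefs": len(all_objects),            # full count, never truncated
        "xref_object": all_objects[:limit],   # at most `limit` entries kept
    }


# Five placeholder objects, capped at three for the example.
sample_objects = [f"<</Example {i}>>" for i in range(5)]
fields = build_xref_fields(sample_objects, max_objects=3)
```

Keeping the count separate from the capped list means the event still records how many xref objects the file had, even when only some of them are stored.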

morriscode · 2023-03-03T15:42:49Z

@phutelmyer Thank you, this looks great!

I verified it completed build checks. I also submitted a large PDF with an xref count of 597.
Output shows xref_objects is capped at 249 as expected and the xrefs count still shows the full 597. I'm good to merge!! Thank you for the help!

morriscode and others added 3 commits March 1, 2023 17:54

Updating ScanPDF to store Xref objects in a list

79e2217

Removing errant whitespace

a5182c4

Reformatting with Black

4815b03

phutelmyer self-requested a review March 2, 2023 15:30

phutelmyer added the enhancement New feature or request label Mar 2, 2023

phutelmyer added 4 commits March 3, 2023 07:57

Adding max objects to PDF scanner

b61878a

Adding max objects to XREF objects

dbff55b

Updating PDF test with XREF limiter

287ca49

Docstring updates

917f26c

phutelmyer merged commit befb6b1 into target:master Mar 7, 2023
