Updating ScanPDF to store Xref objects in a list #343
Thank you @morriscode! Reviewing today :)
Everything looks good @morriscode. I went ahead and reformatted the two files. I'm curious whether you ever saw events with excessive Xref arrays (>1k or >10k or so). We have limiters on some arrays in some of our scanners (ScanJavascript tokens/keywords) because we've seen those arrays reach the tens of thousands. Not sure if that applies here, but curious what your thoughts are.
Thanks @phutelmyer! Setting a limit on the array size would be a good idea! It's entirely possible that we could encounter files that generate thousands. I submitted a variety of files during my testing; the largest I encountered was a 10.8MB PDF that spawned 504 xref objects. I also submitted several larger PDFs (90MB+), but processing seems to have timed out before reaching scan_pdf.
Can likely be added to more than just xref
@morriscode - I've added that limiter functionality to this scanner and updated the associated tests. If you don't mind giving it a quick review, I'll merge it in if you're good with it 👍 Once again, appreciate the PR.
@phutelmyer Thank you, this looks great! I verified that it passed the build checks. I also submitted a large PDF with an xref count of 597.
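For readers following the limiter discussion in this thread, here is a minimal sketch of what capping a collected list could look like. It is not the code merged in this PR: the helper name `limit_items` and the limit of 250 are hypothetical and chosen only for illustration.

```python
from itertools import islice
from typing import Iterable, List


def limit_items(items: Iterable[str], limit: int) -> List[str]:
    """Return at most `limit` items, preserving their original order.

    Hypothetical helper for illustration; not part of the scanner itself.
    """
    if limit < 0:
        raise ValueError("limit must be non-negative")
    # islice stops consuming the iterable once `limit` items are taken,
    # so a very large input is never fully materialized.
    return list(islice(items, limit))


# Stand-in data sized like the 597-xref PDF mentioned in the thread.
xrefs = [f"xref-{i}" for i in range(597)]
capped = limit_items(xrefs, 250)  # 250 is an assumed limit, not the real one
```

The same helper would apply to any other list a scanner collects, which is in line with the comment that a limit can likely be added to more than just xref.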
Describe the change
Prior to this change, the output of PDF Scanner would show only the count of the xref objects within the file. This change adds each xref object to a list, which is then displayed as part of the output.
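As a rough illustration of the difference in output shape, the sketch below builds a result that carries both the count and the list. The function name `summarize_xrefs` and the field names `xref_count` and `xref_objects` are assumptions made for this example and may not match the scanner's actual output keys.

```python
from typing import Dict, Iterable, List, Union


def summarize_xrefs(xref_objects: Iterable[str]) -> Dict[str, Union[int, List[str]]]:
    """Build a result dict holding the xref count and the xref objects.

    Field names here are hypothetical, chosen only to show the shape.
    """
    objects = list(xref_objects)
    return {
        # Before the change, only a count like this was reported.
        "xref_count": len(objects),
        # After the change, the objects themselves are listed as well.
        "xref_objects": objects,
    }


# Stand-in values; real xref objects come from the parsed PDF.
result = summarize_xrefs(["12 0 obj", "15 0 obj", "21 0 obj"])
```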
Describe testing procedures
Updated test_scan_pdf.py and confirmed a passing test.
Sample output
Checklist