Type of Contribution
What does this implement/fix? Explain your changes.
PST encoding bugfix
Includes a bugfix: PST encoding did not use the first-priority encoding, which could cause encoding errors in PDF, HTML, and WARC derivatives.
Improve PST HTML body extraction
PST files often contain messages that have no HTML body yet still render as formatted text in Outlook. Outlook and other clients use the RTF body instead. Mailbagit previously ignored RTF bodies for PST sources; it now extracts HTML from the RTF body when an HTML body is not present. The extracted HTML is also used for PDF and WARC derivatives. Previously this was only done for MSG sources.
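The fallback order described above can be sketched roughly as follows. This is an illustration only, not Mailbagit's actual code: `pick_html_body` and the `rtf_to_html` callable are hypothetical names, and the real RTF-to-HTML extraction step is passed in rather than implemented here.

```python
from typing import Callable, Optional


def pick_html_body(
    html_body: Optional[str],
    rtf_body: Optional[bytes],
    rtf_to_html: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Return the HTML to use for derivatives (hypothetical helper).

    Prefers a real HTML body; otherwise tries to extract HTML from the
    RTF body using the caller-supplied rtf_to_html function.
    """
    # Use the message's own HTML body when it has any content.
    if html_body and html_body.strip():
        return html_body
    # Otherwise fall back to HTML extracted from the RTF body, if any.
    if rtf_body:
        return rtf_to_html(rtf_body)
    # Neither body is available.
    return None
```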
WARC URI improvement
Previously, WARC derivatives made a custom URI under http://mailbag for the important WARC-Target-URI header. This wasn't great, as these URIs were likely to create conflicts outside of a mailbag and did not denote a real location as the WARC-Target-URI is supposed to.
A better approach is to use the Message-ID header, as specified by RFC 2392. We didn't do this originally because Message-ID was thought to be unreliable: we had cases where the Message-ID headers were stripped. Yet just ignoring the field wasn't a great approach, so this change uses Message-ID for WARC-Target-URI when it is present, and only falls back to http://mailbag if it doesn't get a Message-ID that seems valid.
This approach uses the Message-ID header, but strips the leading and trailing brackets (<>) that typically wrap it. To make it a valid URI according to RFC 3986, it prepends the mailto: URI scheme.
Thus, the Message-ID header
<MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com>
becomes the WARC-Target-URI
mailto:MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com
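A minimal sketch of the conversion described above, for illustration only. The function name, the `fallback_id` parameter, the validity check, and the path layout of the fallback URI are all assumptions; the PR does not spell out what counts as a Message-ID that "seems valid" or what the http://mailbag URIs look like.

```python
import re
from typing import Optional

# Fallback base named in the description; the path appended below is hypothetical.
FALLBACK_BASE = "http://mailbag"

# Assumed, deliberately loose validity check: "left@right" with no
# whitespace, angle brackets, or extra "@" characters.
_MESSAGE_ID_RE = re.compile(r"^[^\s<>@]+@[^\s<>@]+$")


def warc_target_uri(message_id: Optional[str], fallback_id: int) -> str:
    """Build a WARC-Target-URI from a Message-ID header (hypothetical helper)."""
    if message_id:
        candidate = message_id.strip()
        # Strip the angle brackets that typically wrap a Message-ID.
        if candidate.startswith("<") and candidate.endswith(">"):
            candidate = candidate[1:-1]
        # Prepend the mailto: scheme when the value looks valid.
        if _MESSAGE_ID_RE.match(candidate):
            return "mailto:" + candidate
    # Missing or invalid Message-ID: fall back to the old custom URI.
    return f"{FALLBACK_BASE}/{fallback_id}"
```

Called with the Message-ID from the example above, this returns the same mailto: URI; called with `None` or a malformed value, it returns a URI under http://mailbag.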
Link to issue?
n/a
Pull Request Checklist
Please check if your PR fulfills the following requirements:
How has this been tested?
Operating System: Win10
Python Version: 3.9.16
Licensing