Improve PST parsing and WARC URIs #235

Merged
merged 8 commits into from
Jun 22, 2023

Conversation

gwiedeman
Collaborator

@gwiedeman gwiedeman commented Jun 22, 2023

Type of Contribution

  • Bugfix (non-breaking change which fixes an issue)
  • New component
  • Refactoring (no functional changes)
  • Documentation-only

What does this implement/fix? Explain your changes.

PST encoding bugfix

Fixes a bug where PST parsing did not use the first-priority encoding, which could cause encoding errors in PDF, HTML, and WARC derivatives.
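
The diff itself is not reproduced in this description, so the following is only a hypothetical sketch of what honoring a priority-ordered list of encodings can look like. The function name `decode_with_priority` and the UTF-8 last resort are assumptions for illustration, not mailbagit's actual code.

```python
def decode_with_priority(raw, encodings):
    """Decode bytes with the first encoding in `encodings` that succeeds.

    `encodings` is ordered from highest to lowest priority. Returns a tuple
    of (decoded text, name of the encoding that was used).
    """
    for name in encodings:
        if not name:
            continue  # skip missing or empty encoding declarations
        try:
            return raw.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or undecodable bytes: try the next candidate.
            continue
    # Assumed last resort: never raise, replace undecodable bytes instead.
    return raw.decode("utf-8", errors="replace"), "utf-8"
```

The order matters because a lower-priority encoding can decode the same bytes without raising while producing the wrong characters, and that kind of silent error then carries into every derivative.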

Improve PST HTML body extraction

PST files often contain messages that have no HTML body yet still render as formatted text in Outlook, because Outlook and other clients use the RTF body instead. Mailbagit previously ignored RTF bodies for PST sources; it now extracts HTML from the RTF body when no HTML body is present. The extracted HTML is then used for both PDF and WARC derivatives. Previously this was only done for MSG sources.
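
As a rough sketch of this fallback order (not the code in this PR), the selection logic can be written as below. The names `has_encapsulated_html`, `choose_html_body`, and the `extract_html` callback are hypothetical; the callback stands in for whatever RTF de-encapsulation routine is actually used. The check relies on the `\fromhtml` control word that Outlook writes into the RTF header when the RTF wraps HTML, and assumes it appears within the first 1024 bytes.

```python
def has_encapsulated_html(rtf):
    r"""Return True if an RTF body declares encapsulated HTML via \fromhtml."""
    if not rtf:
        return False
    # Assumption: the control word sits in the RTF header, so only the
    # start of the body is inspected.
    return b"\\fromhtml" in rtf[:1024]


def choose_html_body(html_body, rtf_body, extract_html):
    """Prefer a real HTML body; otherwise fall back to HTML inside the RTF.

    `extract_html` is a caller-supplied function taking RTF bytes and
    returning an HTML string. Returns None when no HTML can be obtained.
    """
    if html_body:
        return html_body
    if has_encapsulated_html(rtf_body):
        return extract_html(rtf_body)
    return None
```

Passing the extractor in as a function keeps the sketch independent of any particular RTF library.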

WARC URI improvement

Previously, WARC derivatives built a custom URI under http://mailbag for the important WARC-Target-URI header, for example:

http://mailbag/39/body.html
http://mailbag/39/headers.json
http://mailbag/39/attachmentFilename.pdf

This was not ideal: these URIs were likely to conflict once records were used outside of a mailbag, and they did not denote a real location, as a WARC-Target-URI is supposed to.

A better approach is to use the Message-ID header, as specified by RFC2392. We did not do this originally because Message-ID was thought to be unreliable: we had seen cases where Message-ID headers were stripped. Ignoring the field entirely was not a great approach either, so this change uses Message-ID for WARC-Target-URI when it is present, and falls back to http://mailbag only when it does not get a Message-ID that seems valid.

The Message-ID value is used with the leading and trailing angle brackets (<>) that typically wrap it stripped. To make it a valid URI according to RFC3986, the mailto: URI scheme is prepended.

Thus, the Message-ID header <MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com> becomes the WARC-Target-URI mailto:MN2PR04MB579157FAB038D851277E3908F9129@MN2PR04MB5791.namprd04.prod.outlook.com
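
The transformation and fallback described above can be sketched as follows. This is a minimal illustration rather than the code merged here: the function name `warc_target_uri`, its parameters, and the loose validity check are assumptions, since the description does not say what "seems valid" means in the implementation.

```python
import re

FALLBACK_BASE = "http://mailbag"  # legacy custom prefix used as the fallback

# Assumed plausibility check: exactly one "@" with non-empty text on both
# sides, and no whitespace or angle brackets.
_MESSAGE_ID_RE = re.compile(r"^[^\s<>@]+@[^\s<>@]+$")


def warc_target_uri(message_id, mailbag_message_id, filename="body.html"):
    """Return a WARC-Target-URI for a message.

    Uses "mailto:" + Message-ID when the header looks valid, otherwise the
    legacy http://mailbag/<id>/<filename> form.
    """
    if message_id:
        candidate = message_id.strip()
        # Strip the single pair of angle brackets that typically wraps the ID.
        if candidate.startswith("<") and candidate.endswith(">"):
            candidate = candidate[1:-1].strip()
        if _MESSAGE_ID_RE.match(candidate):
            return "mailto:" + candidate
    # Missing or implausible Message-ID: fall back to the old custom URI.
    return f"{FALLBACK_BASE}/{mailbag_message_id}/{filename}"
```

For example, `warc_target_uri(None, 39)` returns `http://mailbag/39/body.html`. How the header and attachment records of one message are told apart under a single Message-ID is not covered by this description, so the sketch returns the same mailto: URI regardless of `filename`.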

Link to issue?

n/a

  • Issue closed
  • Remain open

Pull Request Checklist

Please check if your PR fulfills the following requirements:

  • Make sure you are requesting to the develop branch. Don't PR to main!
  • This contribution has sufficient documentation
  • Tests for the changes have been added
  • All tests pass

How has this been tested?

Operating System: Win10
Python Version: 3.9.16

Licensing

  • I agree that the Mailbag Project and the University at Albany, SUNY can release this code under the MIT license.

@gwiedeman gwiedeman changed the title from "Use Message-ID for WARC URIs" to "Improve PST parsing and WARC URIs" Jun 22, 2023
@gwiedeman gwiedeman merged commit cc6803d into main Jun 22, 2023