Parse folder structure from Gmail MBOX exports #183

Open
3 of 9 tasks
gwiedeman opened this issue May 23, 2022 · 0 comments
Labels
Input Parsing input data, such as MBOX, IMAP, PST, EML, etc.

Comments

@gwiedeman
Collaborator

The problem the component solves

mailbagit should preserve the original arrangement of email exports, particularly the account's folder structure, and use that arrangement when writing derivatives. mailbagit currently supports this by parsing the internal folder structure of PST files and reading the X-Folder header when present for MBOX and EML sources. This information is stored in the model using the Message-Path field (which is then converted to Derivatives-Path, as documented in the Input-Output Examples doc).
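For reference, reading X-Folder from an MBOX can be sketched with Python's standard `mailbox` module. This is only an illustration of the idea, not mailbagit's actual parser; the function name `message_paths` is hypothetical.

```python
import mailbox


def message_paths(mbox_path):
    """Yield (Message-ID, X-Folder) pairs for each message in an MBOX file.

    Hypothetical helper for illustration only; the folder value is None
    when a message carries no X-Folder header.
    """
    # create=False avoids silently creating an empty file for a bad path.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            yield message.get("Message-ID"), message.get("X-Folder")
    finally:
        box.close()
```

A value such as Inbox/Listservs would then be what ends up in Message-Path.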

While X-Folder is often used to preserve email folder information, Gmail instead seems to use an X-Gmail-Labels header, which mailbagit does not currently parse. Unlike X-Folder (which could be Inbox/Listservs), X-Gmail-Labels appears to be a comma-separated list that includes both the folder of the email and labels like "Important" or "Unread," which we need to exclude. Common examples are:

  • X-Gmail-Labels: Unread,Inbox
  • X-Gmail-Labels: Unread,Important,Inbox
  • X-Gmail-Labels: Important,Inbox
  • X-Gmail-Labels: Inbox
  • X-Gmail-Labels: Inbox,Category Promotions,Unread
  • X-Gmail-Labels: Inbox,Important,Category Updates,Unread
  • X-Gmail-Labels: Spam,Category Personal,Unread
  • X-Gmail-Labels: Inbox,Opened,Category Updates
  • X-Gmail-Labels: Inbox,Category Social,Unread
  • X-Gmail-Labels: Inbox,Category Forums,Unread
  • X-Gmail-Labels: Inbox,Important,Opened,Category Personal
  • X-Gmail-Labels: Archived,Sent,Opened
  • X-Gmail-Labels: Sent,Notes
  • X-Gmail-Labels: Sent

The Spam label appears to be a folder, not a label:

  • X-Gmail-Labels: Unread,Spam

While the Category labels are arguably arrangement, I don't think we can reasonably treat them as such. Thus, I expect to ignore any label starting with "Category".

It looks like the X-GM-LABELS header is also used for this, so we should try to read both.

The folder does not appear to be consistently the first or last item in the list, so we can't rely on position to help with parsing.
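A rough sketch of the filtering approach described above, using only the standard library. The function name and the default exclusion set are assumptions: "Unread" and "Important" come from the examples in this issue, while "Opened" and "Archived" are guesses that would need confirming against real exports.

```python
from email import message_from_string

# Assumed defaults. "Opened" and "Archived" are guesses, not confirmed
# Gmail behavior; only "Unread" and "Important" are named in the issue.
DEFAULT_EXCLUDED = frozenset({"Unread", "Important", "Opened", "Archived"})

# Header names to try, in order.
LABEL_HEADERS = ("X-Gmail-Labels", "X-GM-LABELS")


def folders_from_gmail_labels(message, excluded=DEFAULT_EXCLUDED):
    """Return the labels left after filtering, in header order.

    Hypothetical helper: drops excluded labels and anything starting
    with "Category", without relying on position in the list.
    """
    for header in LABEL_HEADERS:
        raw = message.get(header)
        if raw:
            break
    else:
        return []
    # Collapse whitespace so a folded header line does not split a label.
    normalized = " ".join(str(raw).split())
    excluded_folded = {label.casefold() for label in excluded}
    kept = []
    for label in normalized.split(","):
        label = label.strip()
        if not label or label.casefold() in excluded_folded:
            continue
        if label.startswith("Category"):
            continue
        kept.append(label)
    return kept


sample = message_from_string(
    "X-Gmail-Labels: Inbox,Important,Category Updates,Unread\n\nbody\n"
)
print(folders_from_gmail_labels(sample))
```

Headers like "Sent,Notes" would leave more than one label after filtering. How to turn that into a single Message-Path is left open here.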

I can't find clear documentation on these headers to be confident that this is consistent practice. Thus, I'm thinking we need to make this user-overridable, probably along the lines of the plugins. Perhaps a user could put a list of labels to exclude in the plugin directory. This way, if there's another label we're not excluding, a user could put it in a Gmail-Labels.txt file and mailbagit would use those labels instead of the default excluded labels. A line-separated list should work across platforms just by relying on Python's .readlines() or similar. That way the line breaks can be whatever is native to the OS.
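The override could look something like the sketch below. The function name, the `plugin_dir` argument, and the default list are assumptions for illustration; only the Gmail-Labels.txt filename and the "replace the defaults" behavior come from the proposal above.

```python
from pathlib import Path

# Assumed defaults; the real list would need checking against Gmail exports.
DEFAULT_EXCLUDED = ("Unread", "Important", "Opened", "Archived")


def load_excluded_labels(plugin_dir, filename="Gmail-Labels.txt"):
    """Return the labels to exclude, one per line from the user's file.

    Hypothetical helper: if the file exists, its contents replace the
    defaults entirely; otherwise the defaults are returned.
    """
    path = Path(plugin_dir) / filename
    if not path.is_file():
        return list(DEFAULT_EXCLUDED)
    # Text mode translates \n, \r\n, and \r line endings on read.
    with path.open(encoding="utf-8") as handle:
        lines = handle.readlines()
    # Drop surrounding whitespace and skip blank lines.
    return [line.strip() for line in lines if line.strip()]
```

Because the file replaces the defaults rather than adding to them, a user who only wants to exclude one more label would need to list the defaults as well. That trade-off may be worth revisiting.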

Relevant part of mailbag spec?

N/A

Type of component

  • Core
  • Input
  • Attachments
  • Derivatives conversion
  • Reporting/Exporting
  • GUI
  • Distribution

Expected contribution

  • Pull Request
  • Comment with proposed solution

Major challenges or things to keep in mind

Inconsistent arrangement structure is the worst.

@gwiedeman gwiedeman added the Input Parsing input data, such as MBOX, IMAP, PST, EML, etc. label May 23, 2022
@gwiedeman gwiedeman added this to the MVP milestone May 23, 2022
@gwiedeman gwiedeman removed this from the MVP milestone Jul 5, 2022