#9 Implemented RTF to HTML conversion according to RTF spec #15

Merged
merged 1 commit into from
Oct 13, 2019



@fadeyev fadeyev commented Oct 12, 2019

As discussed in #9, the current implementation still wasn't perfect, so I reimplemented the RTF to HTML converter to parse RTF according to the RTF spec.

Had to change test data of two existing tests:

  1. Chinese email input html - it was broken, having extra line breaks where not needed. I carefully checked all differences: the current result seems to be the correct one. All Chinese characters match exactly.
  2. Removed the logic of wrapping with tags - I don't quite understand the purpose of that - if the RTF doesn't print tags, then they shouldn't be in the output.

Tested this implementation on multiple very complex emails - all emails exactly match the HTML source from Outlook (extracted by clicking View Source on an email).

@fadeyev
Author

fadeyev commented Oct 12, 2019

Please let me know if any changes are needed.

@bbottema bbottema merged commit f3943b8 into bbottema:master Oct 13, 2019
@bbottema
Owner

That's awesome!

In fact, I immediately moved out all RTF conversion related code to a new project and library: rtf-to-html.

I already released everything (rtf-to-html:1.0.0 and outlook-message-parser:1.4.0).

Again, thanks so much for your contribution.

@bbottema
Owner

I just noticed kschroeer/rtf-html-java, have you taken a look at that implementation? Any idea if it produces different output?

@fadeyev
Author

fadeyev commented Oct 13, 2019

I just noticed kschroeer/rtf-html-java, have you taken a look at that implementation?

Haven't seen it. I can have a look later - most probably next weekend.

@fadeyev
Author

fadeyev commented Oct 20, 2019

Hi Benny,

I had a look at kschroeer/rtf-html-java. It also does structured RTF parsing, however its purpose is a bit different. It converts a generic RTF document to HTML, while the code I've written is aimed specifically at RTF files which were created from HTML, i.e. RTF files having {\*\htmltagN } groups or, as they are called in the RTF spec, htmltag destinations.
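To illustrate what an htmltag destination looks like in practice, here is a minimal sketch that pulls the raw content out of every {\*\htmltagN } group. It is not the code from this pull request: the class and method names are made up for the example, and it only unescapes \{, \} and \\. A real converter also has to handle the text between the groups, \'hh escapes, Unicode escapes and control words such as \par.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the merged converter.
class HtmlTagSketch {

    private static final String PREFIX = "{\\*\\htmltag";

    /** Returns the raw content of every {\*\htmltagN ...} group, in document order. */
    static List<String> extractHtmlTags(String rtf) {
        List<String> result = new ArrayList<>();
        int pos = rtf.indexOf(PREFIX);
        while (pos >= 0) {
            int i = pos + PREFIX.length();
            // Skip the numeric parameter N of the control word.
            while (i < rtf.length() && Character.isDigit(rtf.charAt(i))) {
                i++;
            }
            // A single space after a control word is a delimiter, not content.
            if (i < rtf.length() && rtf.charAt(i) == ' ') {
                i++;
            }
            StringBuilder content = new StringBuilder();
            // Read up to the first unescaped closing brace (assumes no nested groups).
            while (i < rtf.length() && rtf.charAt(i) != '}') {
                char c = rtf.charAt(i);
                boolean escaped = c == '\\' && i + 1 < rtf.length()
                        && "{}\\".indexOf(rtf.charAt(i + 1)) >= 0;
                if (escaped) {
                    content.append(rtf.charAt(i + 1)); // keep the literal {, } or \
                    i += 2;
                } else {
                    content.append(c);
                    i++;
                }
            }
            result.add(content.toString());
            pos = rtf.indexOf(PREFIX, i);
        }
        return result;
    }

    public static void main(String[] args) {
        String rtf = "{\\rtf1\\ansi\\fromhtml1 "
                + "{\\*\\htmltag36 <span style=\"font-size:13.5pt\">}"
                + "Hello{\\*\\htmltag44 </span>}}";
        // prints [<span style="font-size:13.5pt">, </span>]
        System.out.println(extractHtmlTags(rtf));
    }
}
```

Note that the word "Hello" sits outside the htmltag groups, which is why it does not show up in the result: in these files the markup lives in the destinations and the visible text lives between them.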
Actually I looked into it a bit further and discovered that there are at least three types of emails (you can switch between them in Outlook on the FORMAT TEXT tab in the Format section when creating a new message):

  1. HTML (default and the most common one)
    If an email has this format, then, when parsed from the msg file, the RTF will have all those htmltag destinations - and this can be parsed perfectly using the class I wrote. The parser for such RTF files is simpler than the one for generic RTF, since it doesn't need to handle RTF formatting control words like \pard\plain \f0\b because the htmltag already has all the necessary formatting, for example {\*\htmltag36 <span style="font-size:13.5pt;font-family:&quot;Arial&quot;,sans-serif">}
  2. Plain Text
    Interestingly enough, such a msg file still has RTF, however it's very simple, so it can be parsed by the same class that is used for the HTML format.
  3. Rich Text
    RTF extracted from such a msg file is very different from RTF extracted from an HTML-format msg file. It won't have htmltags or any reference to anything HTML at all, so it's not so easy to convert it to HTML. Essentially you need to handle all RTF formatting like \pard\plain \f0\b and convert it to HTML tags (like <div>, <span>, etc.) and style attributes (like font-size, font-family, etc.). This is much harder to do correctly and moreover there is no standard or reference for such a conversion at all. There is no notion of HTML source for such emails: the body will contain only a generic RTF message (if you try to view the email source from Outlook, the View Source option will be grayed out) and it's up to the email client how to render it; most clients seem to support this format though. The question is whether a parsing library should convert such emails to HTML at all. As I see it, there are two options for what to do with such emails:
    a. Don't do RTF to HTML conversion at all: produce only the RTF body as it is, and the HTML body will be null. Then the question is how it will work with the simple-java-mail library - should simple-java-mail support an RTF body in addition to plain text and HTML? Probably not.
    b. Implement RTF to HTML conversion. Yes, there is no standard way to do that, it's hard to do correctly and requires careful testing, however we can do our best. Here we can enhance the code I've written or merge it with kschroeer/rtf-html-java, however the latter is pretty basic and needs revising/polishing: I tried to parse a couple of RTFs with that library - it failed with errors and it doesn't seem to handle escape sequences.
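A rough sketch of how a parser could tell the three cases apart before deciding what to do. It relies on the \fromhtml and \fromtext control words which, as far as I understand Microsoft's RTF encapsulation spec ([MS-OXRTFEX]), mark RTF that wraps HTML or plain text. That detail is not from the discussion above, and the class name, enum and 512-character window are all assumptions made for the example.

```java
// Hypothetical classifier for illustration only; not part of the merged converter.
class RtfBodyKind {

    enum Kind { ENCAPSULATED_HTML, ENCAPSULATED_PLAIN_TEXT, GENERIC_RTF }

    /** Guesses the kind of RTF body from control words near the start of the document. */
    static Kind classify(String rtf) {
        // Only inspect the beginning: the marker is expected in the RTF header.
        // The 512-character window is an arbitrary choice for this sketch.
        String head = rtf.substring(0, Math.min(rtf.length(), 512));
        if (head.contains("\\fromhtml")) {
            return Kind.ENCAPSULATED_HTML;       // case 1: htmltag destinations expected
        }
        if (head.contains("\\fromtext")) {
            return Kind.ENCAPSULATED_PLAIN_TEXT; // case 2: simple RTF wrapping plain text
        }
        return Kind.GENERIC_RTF;                 // case 3: needs a generic RTF converter
    }

    public static void main(String[] args) {
        System.out.println(classify("{\\rtf1\\ansi\\fromhtml1 {\\*\\htmltag19 <html>}}"));
        System.out.println(classify("{\\rtf1\\ansi\\fromtext Hello}"));
        System.out.println(classify("{\\rtf1\\ansi\\pard\\plain \\f0\\b Hello}"));
    }
}
```

With something like this, options (a) and (b) only differ in what happens for GENERIC_RTF: either leave the HTML body null, or hand the RTF to a generic converter.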

@bbottema
Owner

Thank you for looking into it.

The purpose of Simple Java Mail is to make life easier, not to be academically correct per se. Yes, its primary function, sending emails that are handled properly by all clients, requires it to structure emails properly according to some RFC spec. However, for its secondary functions, such as providing HTML from an RTF source when there is no native HTML body, it suffices to provide a rudimentary approach until a better one comes along.

I would love a merge of your approach and the generic approach by kschroeer/rtf-html-java.

@fadeyev
Author

fadeyev commented Oct 20, 2019

Ok, I raised an enhancement request #16 (not sure how to add the enhancement label, probably only you can do it). I'll probably implement it some time later, or someone else can.
From what I see it's low-priority anyway, since the majority of msg files are produced from HTML.
