Add integrity check for LaTeX special characters #8712

ThiloteE · 2022-04-22T14:31:49Z

This issue was inspired by inspecting and working on issues #8673, plurimath/unicode2latex#19, #8650, #8682 and #3644

Problem:

Special characters do special things in LaTeX (or when being compiled by Biber).
- See Page 15 (Table 1) for escapeable "special" characters in LaTeX
- See Page 15 (Table 2) for Predefined LATEX 2𝜀 Text-mode Commands
- See Page 16 (Table 3) for "Commands Defined to Work in Both Math and Text Mode"
Special characters are hard to convert from LaTeX to Unicode and vice versa, because context is relevant! (e.g. being in Text Mode or Math Mode or another special LaTeX environment). Using special characters may necessitate user action.

Desired solution:

Users should be able to easily find existing "special characters" in their database.
Create integrity check(s) for having the following symbols in your entries:

From Table 1: LaTeX escapable "special" characters:
- %, $, _, # and &
- Not necessary: Detect \%, \$, \_, \# and \&, because this should be handled at text paste.
From Table 2: Predefined LATEX 2𝜀 Text-mode Commands:
- See Table 2. I am too lazy to copy / paste them all here.
- Comment: I tried pasting a few of these symbols into JabRef. NOT ALL could be entered into the JabRef title field. Feel free to experiment and implement an integrity check for characters that can easily be used with JabRef. Would be nice to create a list of Symbols from Table 2 that JabRef can't easily deal with.
Table 3: Commands Defined to Work in Both Math and Text Mode:
- ... - PS. not yet sure about this one. Maybe the ... should be added to the Unicode2LaTeX converter instead. Please Exclude this character for now. Work on the others first!
- All the other characters in table 3 can be converted from Unicode to LaTeX and from LaTeX to Unicode (except _ and \_), so no need to implement an integrity check for them! :-)

Additional info:

I believe an integrity check for # is already implemented. Maybe this integrity check can be used as a template.

The text was updated successfully, but these errors were encountered:

ThiloteE · 2022-04-22T14:42:08Z

The workflow for users is as follows:

Decide if Database should be exported to LaTeX/OpenOffice/LibreOffice/MicrosoftOffice (Each of them may require entries to be formatted in a specific way (e.g. LaTeX, Unicode, HTML etc.) to be rendered correctly).
Use integrity check
Decide if characters found by integrity check are correct or not.
If not, change characters
- Either manually
- Or via "cleanup actions"

systemoperator · 2023-09-19T10:12:30Z

@Siedlerchr @koppor With this new feature, I receive now tons of LaTeX Parsing Errors, when using "Check integrity", which irritated me for a while and made me think:

Field: File
Message: LaTeX Parsing Error: Command \: cannot be used in PARAG...

This results because of using the JabRef Browser Extension, which adds linked files as follows:

file = {Full Text PDF:https\://www.jair.org/index.php/jair/article/download/11675/26513:application/pdf;Snapshot:https\://www.jair.org/index.php/jair/article/view/11675:text/html;:Jauhiainen2019 - Automatic Language Identification in Texts_ a Survey.pdf:PDF},

koppor · 2023-09-19T10:35:32Z

@systemoperator Thank you for letting us know! CC @Zylence

koppor · 2023-09-19T12:15:14Z

@systemoperator A new version with a fix be available in 30min at #10401. There should then be a comment appearing linking to the binaries.

systemoperator · 2023-09-19T12:33:59Z

@systemoperator A new version with a fix be available in 30min at #10401. There should then be a comment appearing linking to the binaries.

Awesome! I'm gonna check this out.

Zylence · 2023-09-19T19:17:32Z

This results because of using the JabRef Browser Extension

I am terribly sorry, I did not check for this usecase. ( Honestly I only tested manual input. 😐)

koppor · 2023-09-21T08:28:13Z

No worries at all - and thank @systemoperator for the quick report!

Background: This is one consequence of our "merge early" development strategy. Instead of asking users to try out PRs, we "just" merge and fix ASAP if something goes wrong. We get more velocity into JabRef - and I think, this good for the tool as whole. -- Otherwise, we had many PRs piling up, because some details were missing. Then, they got abandoned and never got revived. OK, most of the non-merged PRs were 20% solutions. These currently pile up at https://github.com/koppor/jabref/pulls

systemoperator · 2023-09-21T22:41:37Z

@koppor The mentioned error has gone now. :)

I've noticed other things for imported entries, using the JabRef Browser Extension:

For DOI fields:
LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode
Examples: 10.1007/0-387-34805-0_39, 10.1007/3-540-49264-X_24

For Comment, Comment-username, Abstract, Title and Journaltitle fields:
LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode
LaTeX Parsing Error: Undefined command \textgreater
LaTeX Parsing Error: Undefined command \textbackslash
LaTeX Parsing Error: Undefined command \textbar
LaTeX Parsing Error: Undefined command \textless

koppor · 2023-09-21T22:48:28Z

TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.

Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?

Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?

koppor · 2023-09-21T22:49:04Z

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

koppor · 2023-09-21T22:50:10Z

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

systemoperator · 2023-09-21T22:56:18Z

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores.
e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021

Zylence · 2023-09-21T22:58:12Z

Can you give an example text raising "LaTeX Parsing Error: Math superscript (^) and subscript (_) characters are not allowed in text mode"?

In my case, I pasted some URLs into the Comments or Comments-username field, which contained the underscores. e.g. https://www.bhu.ac.in/research_pub/jsr/Volumes/JSR_65_01_2021

In that case it would be expected to fail, as _ and ^ are only allowed in latex math environments.

systemoperator · 2023-09-21T23:02:09Z

TBH, I don't use the integrity check, too much to fix 🙈. Thus, I am relying on feedback here.

Sure. It's nothing, which breaks something, it's just informative.

Currently, we just exclude "verbatim" fields. I think, all identifier fields should also be classified as verbatim. On the other hand, DOI etc should be checked for HTML characters (shouldn't they)?

Or only allowing URL characters. (maybe excluding &, ? and #?)

Regarding comment and abstract. The should have proper LaTeX. Maybe, we need to allow the commands you mentioned?

I guess, there would be more commands to consider, which can show up.

systemoperator · 2023-09-21T23:04:25Z

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

Concerning the title, some authors use extraordinary characters there. 🤔

Zylence · 2023-09-21T23:12:45Z

LaTeX Parsing Error: Undefined command \textgreater
LaTeX Parsing Error: Undefined command \textbackslash
LaTeX Parsing Error: Undefined command \textbar
LaTeX Parsing Error: Undefined command \textless

I guess, there would be more commands to consider, which can show up.

The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: koppor#646 (comment) (takeaways section)

The latter is probably faster.

systemoperator · 2023-09-21T23:23:14Z

I have pasted URLs into the URL field, where also some parameters exist, which are concatenated with the &.
Resulting in: Found 2 unescaped '&'.

The same applies to the File field, where the JabRef Browser Extension sends the Links, which can also contain "&" characters.
Resulting in: Found 2 unescaped '&'.

Examples:

Zylence · 2023-09-21T23:30:20Z

We could maybe check only the fields which are going to be typeset using LaTeX in the common bibliography styles - to avoid noice? Comment, abstract, ... would not be checked, but title, author, ...

Ideas ranked from most to least favorite:

As you said, exclude fields (at least those that are filled in automatically by for example the browser extension)
Separate the latex integrity check (but that would be bad style)
Enforce url encoding as in anything other than a url you should simply be able to escape the problematic characters. (but that's not userfriendly and may cause other similar errors)

Can not really come up with a better idea then the one proposed first. May need a little more time to think.

Zylence · 2023-09-21T23:45:27Z

Found 2 unescaped '&'

Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier.

~~As the problem seems to be predominantly induced by urls, url-encoding as solution now ranks higher in my opinion. We could intercept paste and additionally warn the user.~~

~~So for example:~~
~~> https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115~~

~~becomes:~~
~~https%3A%2F%2Fen.wikipedia.org%2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115~~

~~which is manageable.~~

[Edit]: Just realized that won't make much sense for that case, since the ampersands are not part of the queryparameter.

Siedlerchr · 2023-09-22T04:43:44Z

Please don't mess with url fields. Rather exclude them. As far as I know url is also a verbatim field Julian Kirsch ***@***.***> schrieb am Fr., 22. Sept. 2023, 01:45:

…

Found 2 unescaped '&' Okay, but that's the "fault" of the AmpersandChecker which was introduced earlier <#9758>. As the problem seems to be predominantly induced by urls, url-encoding now ranks higher in my opinion. We could intercept *paste* and additionally warn the user. So for example: https://en.wikipedia.org/w/index.php?title=Information_extraction&oldid=948091115 becomes: https%3A%2F%2Fen.wikipedia.org %2Fw%2Findex.php%3Ftitle%3DInformation%5fextract%0Aion%26oldid%3D948091115 which is manageable. — Reply to this email directly, view it on GitHub <#8712 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AACOFZHM65TJEOFXUUIARODX3TGSFANCNFSM5UCQBSGQ> . You are receiving this because you were mentioned.Message ID: ***@***.***>

koppor · 2023-09-22T09:33:46Z

Please don't mess with url fields. Rather exclude them. As far as I know
url is also a verbatim field

Then the ampersand checker needs to be updated to exclude verbatim fields. #quickwin

koppor · 2023-09-27T07:24:18Z

The implementation of the underlying snuggletex is not complete. To add new commands, we'd either need to make PRs to snuggletex or to inject new commands as shown here: koppor#646 (comment) (takeaways section)

As usual, I would do both ^^ -- PR to https://github.com/davemckain/snuggletex (reference is more for me)

However, I am not 100% sure about the error "Undefined command". We can never be package-complete. On the one hand, I assume the sources come from publishers, which "should" use correct commands. On the other hand, we should add the most common commands to avoid "typos" (\textsupsrcipt vs. \textsuperscript (correct)).

systemoperator · 2023-10-05T14:04:15Z

@koppor I've checked the latest build.

I would recommend, allowing underscores in DOI fields, since they are not infrequent:

In the comment-{username} field, I would recommend allowing any character.

Furthermore, in my opinion, it would make sense to adapt the messages in the integrity check so that "LaTeX Parsing Error" is only printed, if there is REALLY some LaTeX parsing error. (Initially, I thought, something is broken in my file, but actually, everything was fine.) I think, a message like e.g. "LaTeX Parsing Warning" or something different (like "LaTeX Inconsistency", or even "LaTeX Improvement") would be more appropriate, so that a user does not panic, when reading it. Basically any message in the integrity check suggests improvements in terms of inconsistencies, without anything being broken. (Probably, if something is really broken, like a wrongly inserted "}" or "{", in the bib-file, I guess, even Jabref cannot properly parse it anymore.

koppor · 2023-10-09T09:19:48Z

@systemoperator I fully agree. Then, my PR #10436 does not work. I need to re-check. Sorry for that many iterations on this. I will also work on the text. I think, LaTeX Warning could be good.

On the one hand, we could classify the errors at https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/resources/uk/ac/ed/ph/snuggletex/core-error-messages.properties between warning and error.

On the other hand, always "Warning" feels currently right.

The issue with the missing } needs to be checked. JabRef currently drops too much entries if too much } are contained. Our parser needs to improved with that regard.

Siedlerchr · 2023-10-18T19:11:08Z

@systemoperator I was testing this again today and cannot reproduce your issue. For me, no error related to DOI is raised
e..g DOI: 10.1007/978-3-642-01190-0_17

koppor · 2023-10-20T07:59:11Z

@Siedlerchr After three iterations, we managed. 😅🙈🎉🎉

systemoperator · 2023-10-21T12:03:32Z

I've checked it with my reference library, with more than 500 entries. Everything is printed properly now. 🎉

ThiloteE added integrity-checker unicode unicode related issues labels Apr 22, 2022

ThiloteE mentioned this issue Apr 22, 2022

LaTeX to Unicode formatter should not replace \% with % #8490

Closed

2 tasks

JasonGross mentioned this issue Apr 23, 2022

Add cleanup action for "LaTeX to LaTeX aware Unicode" #8715

Open

ThiloteE mentioned this issue Apr 30, 2022

Connect Integrity Check with Cleanup Actions #8736

Open

Siedlerchr added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Apr 16, 2023

This was referenced Apr 24, 2023

New quality check and cleanup for & koppor/jabref#585

Closed

Quality check escaped ampersand #9758

Merged

koppor removed the good first issue An issue intended for project-newcomers. Varies in difficulty. label Apr 24, 2023

Zylence mentioned this issue Jun 13, 2023

Snuggletex Integration koppor/jabref#646

Closed

6 tasks

koppor mentioned this issue Sep 12, 2023

Adds LaTeX integrity check based on SnuggleTeX #10376

Merged

6 tasks

koppor mentioned this issue Sep 19, 2023

Fix LaTeX evaluation for verbatim fields #10401

Merged

6 tasks

koppor mentioned this issue Sep 26, 2023

Exclude check for ampersand (&) at verbatim fields #10419

Merged

6 tasks

koppor mentioned this issue Oct 1, 2023

Accept LaTeX errors in comment field #10436

Merged

6 tasks

koppor mentioned this issue Oct 9, 2023

Fix handling of comments and doi at LaTeX check #10470

Merged

6 tasks

koppor added the needs-refinement label Oct 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add integrity check for LaTeX special characters #8712

Add integrity check for LaTeX special characters #8712

ThiloteE commented Apr 22, 2022 •

edited by koppor

Loading

ThiloteE commented Apr 22, 2022 •

edited

Loading

systemoperator commented Sep 19, 2023

koppor commented Sep 19, 2023

koppor commented Sep 19, 2023

systemoperator commented Sep 19, 2023

Zylence commented Sep 19, 2023

koppor commented Sep 21, 2023

systemoperator commented Sep 21, 2023 •

edited

Loading

koppor commented Sep 21, 2023

koppor commented Sep 21, 2023

koppor commented Sep 21, 2023

systemoperator commented Sep 21, 2023

Zylence commented Sep 21, 2023

systemoperator commented Sep 21, 2023

systemoperator commented Sep 21, 2023

Zylence commented Sep 21, 2023

systemoperator commented Sep 21, 2023

Zylence commented Sep 21, 2023

Zylence commented Sep 21, 2023 •

edited

Loading

Siedlerchr commented Sep 22, 2023 via email

koppor commented Sep 22, 2023

koppor commented Sep 27, 2023

systemoperator commented Oct 5, 2023

koppor commented Oct 9, 2023

Siedlerchr commented Oct 18, 2023

koppor commented Oct 20, 2023

systemoperator commented Oct 21, 2023

Add integrity check for LaTeX special characters #8712

Add integrity check for LaTeX special characters #8712

Comments

ThiloteE commented Apr 22, 2022 • edited by koppor Loading

ThiloteE commented Apr 22, 2022 • edited Loading

systemoperator commented Sep 19, 2023

koppor commented Sep 19, 2023

koppor commented Sep 19, 2023

systemoperator commented Sep 19, 2023

Zylence commented Sep 19, 2023

koppor commented Sep 21, 2023

systemoperator commented Sep 21, 2023 • edited Loading

koppor commented Sep 21, 2023

koppor commented Sep 21, 2023

koppor commented Sep 21, 2023

systemoperator commented Sep 21, 2023

Zylence commented Sep 21, 2023

systemoperator commented Sep 21, 2023

systemoperator commented Sep 21, 2023

Zylence commented Sep 21, 2023

systemoperator commented Sep 21, 2023

Zylence commented Sep 21, 2023

Zylence commented Sep 21, 2023 • edited Loading

Siedlerchr commented Sep 22, 2023 via email

koppor commented Sep 22, 2023

koppor commented Sep 27, 2023

systemoperator commented Oct 5, 2023

koppor commented Oct 9, 2023

Siedlerchr commented Oct 18, 2023

koppor commented Oct 20, 2023

systemoperator commented Oct 21, 2023

ThiloteE commented Apr 22, 2022 •

edited by koppor

Loading

ThiloteE commented Apr 22, 2022 •

edited

Loading

systemoperator commented Sep 21, 2023 •

edited

Loading

Zylence commented Sep 21, 2023 •

edited

Loading