Encoding Issues with FF/Win10 #274

Closed
reox opened this issue Feb 9, 2021 · 23 comments

@reox

reox commented Feb 9, 2021

This is similar to issue #33 (I also tested with the link there, and get the same buggy result)
Running version 2.4 in Firefox 85.0 on Windows 10 with JabRef 5.2--2020-12-24--6a2a512

For example, with this article: https://www.sciencedirect.com/science/article/pii/S8756328220301976
All umlauts and special characters are mangled when importing; for example, μ becomes Î¼.

A quick check in python shows that there is indeed some latin1/utf8 mixup:

>>> 'Î¼'.encode('latin1').decode('utf8')
'μ'

My JabRef library is configured as UTF-8.
I'm not sure whether this bug comes from the extension or from JabRef itself, i.e. whether JabRef/jabref#2013 is the issue again.
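
For reference, the mix-up can be reproduced in both directions. This is only an illustrative sketch: the helper names `mangle` and `repair` are made up here and are not part of JabRef or the browser extension.

```python
def mangle(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the same bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text: str) -> str:
    """Undo the mix-up: recover the original bytes, then decode them as UTF-8."""
    return text.encode("latin-1").decode("utf-8")


greek_mu = "\u03bc"  # GREEK SMALL LETTER MU, the character used in the abstract
mangled = mangle(greek_mu)

# ascii() keeps the output readable regardless of the console encoding.
print(ascii(mangled))          # '\xce\xbc'
print(ascii(repair(mangled)))  # '\u03bc'
```

The two bytes of the UTF-8 encoded μ (0xCE 0xBC) each turn into one Latin-1 character, which is why a single character shows up as two after the import.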

@tobiasdiez
Member

It's working on Windows, and is correctly displayed in the browser console:

JabRef: Send BibTeX to JabRef:
@Article{newton_automated_2020,
title = {Automated {MicroCT}-based bone and articular cartilage analysis using iterative shape averaging and atlas-based registration},
volume = {137},
issn = {8756-3282},
url = {https://www.sciencedirect.com/science/article/pii/S8756328220301976},
doi = {10.1016/j.bone.2020.115417},
abstract = {Micro-computed tomography (μCT) and contrast-enhanced μCT are important tools for preclinical analysis of bone and articular cartilage (AC). Quantitative data from these modalities is highly dependent on the accuracy of tissue segmentations, which are often obtained via time-consuming manual contouring and are prone to inter- and intra-observer variability. Automated segmentation strategies could mitigate these issues, but few such approaches have been described in the context of μCT. Here, we validated a fully-automated strategy for bone and AC segmentation based on registration of an average tissue atlas. Femora from healthy and arthritic rats underwent μCT scanning, and epiphyseal trabecular bone and AC volumes were manually contoured by an expert. Average tissue atlases composed of 1, 3, 5, 10 and 20 pre-contoured training images (n = 10 atlases/group) were generated using iterative shape averaging and registered onto unknown images via affine and non-rigid registration. Atlas-based and expert-defined volumes for bone and AC were compared in terms of shape-based similarity metrics, as well as morphometric and densitometric parameters. Our results demonstrate that atlas-based registrations were capable of highly accurate and consistent segmentation. Atlases built from as few as 3 training images had no incidence of mal-registration and exhibited improved incidence of accurate registration, and higher sensitivity and specificity compared to atlases built from only one training image. Atlas-based segmentation of bone and AC from μCT images is a robust and accurate alternative to manual tissue segmentation, enabling faster, more consistent segmentation of pre-clinical datasets.},
language = {en},
urldate = {2021-02-12},
journal = {Bone},
author = {Newton, Michael D. and Junginger, Lucas and Maerz, Tristan},
month = aug,
year = {2020},
keywords = {Automated segmentation, Micro-computed tomography, Articular cartilage, Bone, Tissue atlas, Iterative shape averaging},
pages = {115417}
}

So I guess it is indeed a problem with the Python code. Just to make sure, can you try changing the (default) encoding in JabRef / your (test) library?

@reox since you apparently know Python, could you please play around with the "jabrefHost.py" script in the JabRef installation location? For example, the decoding is done at: https://github.com/JabRef/jabref/blob/master/buildres/linux/jabrefHost.py#L49.
@LyzardKing can you reproduce this?
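
For context, a native-messaging host receives each message from the browser as a 32-bit length prefix (native byte order) followed by UTF-8 encoded JSON. The following is a minimal sketch of that decoding step, not the actual code of jabrefHost.py; the function names and the `text` key are assumptions made for illustration.

```python
import io
import json
import struct


def read_native_message(stream) -> dict:
    """Read one length-prefixed JSON message from a binary stream."""
    raw_length = stream.read(4)
    if len(raw_length) < 4:
        raise EOFError("stream closed before a full length prefix arrived")
    (length,) = struct.unpack("=I", raw_length)  # unsigned 32-bit, native byte order
    payload = stream.read(length)
    # Decoding with anything other than UTF-8 here would mangle non-ASCII text.
    return json.loads(payload.decode("utf-8"))


def encode_native_message(message: dict) -> bytes:
    """Build the byte sequence a browser would send for the given message."""
    body = json.dumps(message, ensure_ascii=False).encode("utf-8")
    return struct.pack("=I", len(body)) + body


# Round trip with a non-ASCII character; "text" is a hypothetical key name.
wire = encode_native_message({"text": "\u03bcCT"})
print(ascii(read_native_message(io.BytesIO(wire))["text"]))
```

If the payload survives this step intact, the mangling has to happen later, when the text is handed over to JabRef.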

@LyzardKing
Collaborator

No, I cannot reproduce the issue. On Ubuntu both the snap and flatpak show the correct character.

@reox
Author

reox commented Feb 13, 2021

> It's working on Windows, and is correctly displayed in the browser console:

How can I see this? When I'm opening the browser console I just see:

JabRef: Got task to convert  
Array []
  to BibTeX bibtexConverter.js:240:11
JabRef: Converting item(s) to BibLaTeX:  
Array []
bibtexConverter.js:7:11

btw I'm on Windows - is the python script the correct one?

I also changed the default encoding to windows-1251 but I get the same result.

@tobiasdiez
Member

If you are on Windows, then the powershell script is used and not the python one. But I've double-checked it and the most recent version uses utf8 correctly before sending it to JabRef. So based on

>>> 'μ'.encode('utf8').decode('latin1')
'Î¼'

I guess that JabRef is for some reason trying to decode it in latin1 instead of utf8. What is the encoding of your library (Library preferences)?

@reox
Author

reox commented Mar 1, 2021

> What is the encoding of your library (Library preferences)?

It is set to UTF8 and mode biblatex and also the bib file itself is indeed utf8:

$ file references.bib
references.bib: UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators

I also resolved the issue of mixed CRLF/LF and set them all to LF now, but that did not change the import result :/

Can I somehow debug what is sent from the browser to JabRef? Is there a temp file I can watch?

@tobiasdiez
Member

Strange...

Currently, it is not written to a temporary file, but you can do this at the following point:
https://github.com/JabRef/jabref/blob/26433573032c72df99a7912cd4cf929f13891d53/buildres/windows/JabRefHost.ps1#L41 (the file should be in the installation directory of JabRef)
Adding something like Out-File -FilePath .\Dump.txt -InputObject $messageText should work. Maybe you need to change the path to something where you have write access.

@reox
Author

reox commented Mar 1, 2021

Okay thanks, I just did that:

$ file Dump.txt
Dump.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

but maybe that is because Out-file writes as UTF16 by default? Anyways, in the file the µ is correct.
I also tried setting the default and library encoding to UTF16 but that does not change the import...
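
To double-check what `file` reports, the leading bytes of the dump can be inspected directly. The snippet below is a small sketch of a byte-order-mark check; `sniff_bom` is a hypothetical helper, and the codec names it returns are Python's.

```python
import codecs

# UTF-32 has to be checked before UTF-16, because the UTF-32 LE mark
# starts with the same two bytes as the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return the codec implied by a leading byte-order mark, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


# A UTF-16 LE dump, as Out-File appears to produce by default in PowerShell 5.1.
sample = codecs.BOM_UTF16_LE + "@Article{x}".encode("utf-16-le")
print(sniff_bom(sample))        # utf-16-le
print(sniff_bom(b"@Article{"))  # None
```

A result of None only means there is no mark; the content could still be UTF-8 or any other encoding.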

@tobiasdiez
Member

And does it work correctly if you import it from the command line with jabref -i file? https://docs.jabref.org/advanced/commandline#import-file-i-filename-import-format

@reox
Author

reox commented Mar 1, 2021

that does not like it at all:

"C:\Program Files\JabRef\runtime\bin\JabRef.bat" -i "C:\Users\Sebastian\Dump.txt",bibtex
ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
Importing: C:\Users\Sebastian\Dump.txt
Error occurred when parsing entry: 'Error in line 1: Expected { or ( but received '. Skipped entry.

Dump.txt

@reox
Author

reox commented Jul 14, 2021

I recently updated to JabRef 5.3 but there is still this issue.
Is there anything else I can try?

If I download the Bibtex file from Elsevier directly (i.e. Cite -> Export citation to bibtex) I can import it into JabRef using --importToOpen.

I also found out that the error 'Error in line 1: Expected { or ( but received '. Skipped entry. comes from importing a BOM-marked UTF-16 file. If I convert it to UTF-8 without BOM, it works. So this is an issue with dumping the file using PowerShell.
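
The conversion to BOM-free UTF-8 can be sketched roughly as follows. This is an assumption-laden illustration, not something JabRef ships: `to_utf8_without_bom` is a made-up name, and input without any mark is simply assumed to be UTF-8 already.

```python
import codecs


def to_utf8_without_bom(data: bytes) -> bytes:
    """Re-encode a UTF-8 or UTF-16 dump as UTF-8 without a byte-order mark."""
    if data.startswith(codecs.BOM_UTF8):
        text = data.decode("utf-8-sig")  # strips the UTF-8 mark
    elif data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        text = data.decode("utf-16")     # consumes the mark and picks the byte order
    else:
        text = data.decode("utf-8")      # assumption: unmarked input is UTF-8
    return text.encode("utf-8")


dump = codecs.BOM_UTF16_LE + "@Article{x, title={\u03bcCT}}".encode("utf-16-le")
converted = to_utf8_without_bom(dump)
print(converted[:9])  # b'@Article{'
```

After the conversion the first character is the `@` of the entry, rather than a mark that comes before it.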

@tobiasdiez
Member

What happens if you use Out-File -Encoding utf8NoBOM https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/out-file?view=powershell-7.1#parameters to ensure that the dumped file is in utf8? (According to the documentation, this is actually the default...strange).

@reox
Author

reox commented Jul 19, 2021

I just tried that but then the script crashes.
I have added:

    Out-File -FilePath "C:\Users\Reox\Dump.bib" -InputObject $messageText 
    Out-File -Encoding utf8 -FilePath "C:\Users\Reox\Dump_utf8.bib" -InputObject $messageText 
    Out-File -Encoding utf8NoBom -FilePath "C:\Users\Reox\Dump_utf8nobom.bib" -InputObject $messageText 

The results:

$ file Dump.bib Dump_utf8.bib
Dump.bib:      Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
Dump_utf8.bib: UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators

As said, utf8NoBom crashes... Unfortunately, I can not see why - the firefox extension simply says "Error while sending to JabRef.
Please see the browsers error console for details." but I can not see anything there.
When I run that command on a shell, I get:

> Out-File -Encoding utf8NoBom -FilePath baz.txt -InputObject "Hello world"
Out-File : Cannot validate argument on parameter 'Encoding'. The argument "utf8NoBom" does not belong to the set
"unknown,string,unicode,bigendianunicode,utf8,utf7,utf32,ascii,default,oem" specified by the ValidateSet attribute.
Supply an argument that is in the set and then try the command again.
At line:1 char:20
+ Out-File -Encoding utf8NoBom -FilePath baz.txt -InputObject "Hello wo ...
+                    ~~~~~~~~~
    + CategoryInfo          : InvalidData: (:) [Out-File], ParameterBindingValidationException
    + FullyQualifiedErrorId : ParameterArgumentValidationError,Microsoft.PowerShell.Commands.OutFileCommand

It looks like I'm running PS5:

> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      5.1.19041.1023
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.19041.1023
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

I downloaded v7 now, and in the v7 shell it seems to work with -Encoding. However, how do I get the JabRefHost.ps1 to run with 7 instead of the system 5.1? As far as I understand, you can not really remove 5.1 from the system...
I changed powershell.exe to pwsh.exe in JabRefHost.bat, and indeed it runs now with v7 and I can dump now all three:

Dump.bib:           UTF-8 Unicode text, with very long lines, with CRLF line terminators
Dump_utf8.bib:      UTF-8 Unicode text, with very long lines, with CRLF line terminators
Dump_utf8NoBom.bib: UTF-8 Unicode text, with very long lines, with CRLF line terminators

Now, they are all the same :D

Unfortunately, the characters are still broken.

However, I can now import the dumped bib file and the characters are all correct there.

@reox
Author

reox commented Jul 19, 2021

As a workaround, I created a temporary file and import that one:

$ diff JabRefHost.bak.ps1 JabRefHost.ps1
41c41,45
<     $output = & $jabRefExe -importBibtex "$messageText" *>&1
---
>     $tempfile = New-TemporaryFile
>     # utf8NoBom just to be sure...
>     Out-File -Encoding utf8NoBom -FilePath $tempfile -InputObject $messageText
>     $output = & $jabRefExe -importToOpen $tempfile *>&1
>     Remove-Item $tempfile

This seems to work flawlessly!
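
The same idea, expressed in Python for anyone who wants to try it outside PowerShell. This is a rough sketch only: `import_via_temp_file` and its `run` parameter are hypothetical, and `--importToOpen` is the only JabRef option taken from this thread.

```python
import os
import subprocess
import tempfile


def import_via_temp_file(bibtex: str, jabref_exe: str, run=subprocess.run):
    """Write the entry to a BOM-free UTF-8 temp file and pass its path to JabRef."""
    fd, path = tempfile.mkstemp(suffix=".bib")
    try:
        # encoding="utf-8" (not "utf-8-sig") writes no byte-order mark.
        with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as handle:
            handle.write(bibtex)
        # Only the ASCII file path goes on the command line, never the entry itself.
        return run([jabref_exe, "--importToOpen", path], capture_output=True)
    finally:
        os.remove(path)
```

The `run` parameter exists only so the call can be replaced by a stub when JabRef is not installed; by default it is `subprocess.run`.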

@tobiasdiez
Member

Whooo, nice! I'm glad you found a workaround.

May I ask you to open a PR at the main jabref repo with the changes to the powershell script https://github.com/JabRef/jabref/blob/main/buildres/windows/JabRefHost.ps1? Your approach of writing to a temporary file also fixes JabRef/jabref#7374.

@reox
Author

reox commented Jul 19, 2021

The only issue is: it seems to only work with ps7...
Because in 5.1, the default was apparently unicode: https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/out-file?view=powershell-5.1#parameters
and because NoBom was not available back then, JabRef would not accept that file.

Thus, it would be necessary to check which version is running: use the old method if it is PS5, or the temp file if it is PS7.
Or is there some PowerShell guru out there who has a hint on how to do that properly?

@tobiasdiez
Member

You could use $PSVersionTable.PSVersion.Major to get the powershell version. Then for >= 7 one can use utf8NoBom, and for < 7 one can use utf8 for the encoding of the temporary file. Would that work?

@reox
Author

reox commented Jul 19, 2021

Could work, yes. However, for 5.1 there seems to be no way to write a file without BOM, or at least I can not make it work. Then you also have to switch to the correct interpreter in the bat file. The MSDN tells me that the 5.1 interpreter is called powershell.exe and the newer ones pwsh.exe. Thus you would have to check if pwsh.exe is installed and fall back to powershell.exe?
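
The lookup order I have in mind would be something like the sketch below. It is written in Python purely to illustrate the logic; the real change would have to live in JabRefHost.bat, and `pick_powershell` is a made-up name.

```python
import shutil


def pick_powershell(which=shutil.which) -> str:
    """Prefer PowerShell 7+ (pwsh) and fall back to Windows PowerShell."""
    for candidate in ("pwsh", "powershell"):
        found = which(candidate)
        if found:
            return found
    raise FileNotFoundError("neither pwsh nor powershell was found on PATH")
```

The `which` parameter defaults to `shutil.which` and is only there so the lookup can be simulated on a machine without either interpreter.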

@tobiasdiez
Member

What are the issues one encounters when using utf8 with BOM?

Yes, you are right the bat file also needs to be changed to run pwsh instead of powershell. The following might be helpful for this:
https://github.com/Jonathing/MCGradle-Scripts/blob/8dcccac6ba62e7e0fa5de4c2cfd500b50b1cc692/wrappers/MCGradle%20Scripts.bat#L6-L19
https://github.com/BlueBubblesApp/flutter2/blob/48c2a67b70c9a4c83e86d6f06a0b597c4dc3560d/bin/internal/shared.bat#L29-L40

@reox
Author

reox commented Jul 19, 2021

> What are the issues one encounters when using utf8 with BOM?

see #274 (comment)

then it can not be imported in jabref.

@tobiasdiez
Member

Ah ok, I thought it was a problem with utf16. This is a bit of a shot in the dark, but does it work if you use [IO.File]::WriteAllLines($tempfile, $messageText) as suggested here https://stackoverflow.com/a/32951824/873661 ?

reox pushed a commit to reox/jabref that referenced this issue Jul 19, 2021
This resolves an issue where the encoding somehow got lost when using
the Jabref Browser extension. It will now write a temporary file
with UTF-8 encoding rather than passing the bibtex on the commandline.

See JabRef/JabRef-Browser-Extension#274
@reox
Author

reox commented Jul 19, 2021

yes!
Using that I got it working in PS5.1.

I created a PR here: JabRef/jabref#7918

@tobiasdiez
Member

Nice! Thanks a lot for your continued work on this issue. Very much appreciated ❤️

@reox
Author

reox commented Jul 19, 2021

thank you for the powershell tricks ;)

btw I hope that it does not break other users' experience on other windows versions though 😅

Siedlerchr pushed a commit to JabRef/jabref that referenced this issue Jul 19, 2021
* Write temporary file on bib import

This resolves an issue where the encoding somehow got lost when using
the Jabref Browser extension. It will now write a temporary file
with UTF-8 encoding rather than passing the bibtex on the commandline.

See JabRef/JabRef-Browser-Extension#274

* adding changelog entry

Co-authored-by: Sebastian Bachmann <[email protected]>