Encoding Issues with FF/Win10 #274

reox · 2021-02-09T12:43:14Z

This is similar to issue #33 (I also tested with the link there, and get the same buggy result)
Running version 2.4 in Firefox 85.0 on Windows 10 with JabRef 5.2--2020-12-24--6a2a512

For example, with this article: https://www.sciencedirect.com/science/article/pii/S8756328220301976
All umlauts and special characters are mangled on import; for example, µ becomes Î¼.

A quick check in python shows that there is indeed some latin1/utf8 mixup:

>>> 'Î¼'.encode('latin1').decode('utf8')
'μ'
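
The same round trip can be wrapped in a small helper to try on other mangled strings. This is only a sketch: the name `repair_mojibake` is made up for illustration, and it assumes the text was UTF-8 that was decoded as Latin-1 exactly once.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1 mangling, if possible.

    Returns the input unchanged when it cannot be interpreted that way.
    """
    try:
        return text.encode("latin1").decode("utf8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or the bytes are not valid UTF-8:
        # the string was probably not mangled in this particular way.
        return text


print(repair_mojibake("Î¼CT"))  # mangled Greek mu followed by "CT"
print(repair_mojibake("Bone"))  # plain ASCII passes through unchanged
```

Text that was never mangled falls into the `except` branch or survives the round trip as-is, so the helper is safe to run over a whole field.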

My JabRef library is configured as UTF-8.
I'm not sure whether this bug comes from the extension or from JabRef itself; perhaps JabRef/jabref#2013 is the issue again.

tobiasdiez · 2021-02-12T22:45:25Z

It's working on Windows, and is correctly displayed in the browser console:

JabRef: Send BibTeX to JabRef:
@Article{newton_automated_2020,
title = {Automated {MicroCT}-based bone and articular cartilage analysis using iterative shape averaging and atlas-based registration},
volume = {137},
issn = {8756-3282},
url = {https://www.sciencedirect.com/science/article/pii/S8756328220301976},
doi = {10.1016/j.bone.2020.115417},
abstract = {Micro-computed tomography (μCT) and contrast-enhanced μCT are important tools for preclinical analysis of bone and articular cartilage (AC). Quantitative data from these modalities is highly dependent on the accuracy of tissue segmentations, which are often obtained via time-consuming manual contouring and are prone to inter- and intra-observer variability. Automated segmentation strategies could mitigate these issues, but few such approaches have been described in the context of μCT. Here, we validated a fully-automated strategy for bone and AC segmentation based on registration of an average tissue atlas. Femora from healthy and arthritic rats underwent μCT scanning, and epiphyseal trabecular bone and AC volumes were manually contoured by an expert. Average tissue atlases composed of 1, 3, 5, 10 and 20 pre-contoured training images (n = 10 atlases/group) were generated using iterative shape averaging and registered onto unknown images via affine and non-rigid registration. Atlas-based and expert-defined volumes for bone and AC were compared in terms of shape-based similarity metrics, as well as morphometric and densitometric parameters. Our results demonstrate that atlas-based registrations were capable of highly accurate and consistent segmentation. Atlases built from as few as 3 training images had no incidence of mal-registration and exhibited improved incidence of accurate registration, and higher sensitivity and specificity compared to atlases built from only one training image. Atlas-based segmentation of bone and AC from μCT images is a robust and accurate alternative to manual tissue segmentation, enabling faster, more consistent segmentation of pre-clinical datasets.},
language = {en},
urldate = {2021-02-12},
journal = {Bone},
author = {Newton, Michael D. and Junginger, Lucas and Maerz, Tristan},
month = aug,
year = {2020},
keywords = {Automated segmentation, Micro-computed tomography, Articular cartilage, Bone, Tissue atlas, Iterative shape averaging},
pages = {115417}
}

So I guess it is indeed a problem with the Python code. Just to make sure, could you try changing the (default) encoding in JabRef / your (test) library?

@reox since you apparently have Python knowledge, could you please play around with the "jabrefHost.py" script in the JabRef installation location? For example, the decoding is done at: https://github.com/JabRef/jabref/blob/master/buildres/linux/jabrefHost.py#L49.
@LyzardKing can you reproduce this?
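
For anyone poking at the host script: browser native messaging frames each message as a 4-byte length in native byte order followed by UTF-8 encoded JSON, so the body has to be decoded as UTF-8. The sketch below shows only that general framing, not the actual contents of jabrefHost.py; the function names and the `text` key are assumptions made for illustration.

```python
import json
import struct


def encode_message(payload: dict) -> bytes:
    """Frame a JSON payload the way native messaging expects it."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    # 4-byte unsigned length prefix in native byte order, then the JSON body.
    return struct.pack("=I", len(body)) + body


def decode_message(raw: bytes) -> dict:
    """Parse one framed message; the body must be decoded as UTF-8."""
    (length,) = struct.unpack("=I", raw[:4])
    body = raw[4:4 + length]
    return json.loads(body.decode("utf-8"))


framed = encode_message({"text": "μCT"})  # "text" is a hypothetical key
print(decode_message(framed))
```

If the body were decoded with a single-byte codec instead, every non-ASCII character would come out as two wrong characters, which matches the symptom in this issue.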

LyzardKing · 2021-02-13T08:15:24Z

No, I cannot reproduce the issue. On Ubuntu both the snap and flatpak show the correct character.

reox · 2021-02-13T11:01:14Z

> It's working on Windows, and is correctly displayed in the browser console:

How can I see this? When I'm opening the browser console I just see:

JabRef: Got task to convert  
Array []
  to BibTeX bibtexConverter.js:240:11
JabRef: Converting item(s) to BibLaTeX:  
Array []
bibtexConverter.js:7:11

btw I'm on Windows - is the python script the correct one?

I also changed the default encoding to windows-1251 but I get the same result.

tobiasdiez · 2021-02-28T20:20:15Z

If you are on Windows, then the powershell script is used and not the python one. But I've double-checked it and the most recent version uses utf8 correctly before sending it to JabRef. So based on

>>> 'μ'.encode('utf8').decode('latin1')
'Î¼'

I guess that JabRef is for some reason trying to decode it in latin1 instead of utf8. What is the encoding of your library (Library preferences)?
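
To narrow down which wrong decoding could produce the observed characters, one can decode the UTF-8 bytes with a few candidate codecs and compare. This is a sketch under assumptions: the helper name and the candidate list are chosen for illustration, not taken from JabRef.

```python
def codecs_producing(original: str, mangled: str, candidates: list) -> list:
    """Return the candidate codecs that turn the UTF-8 bytes of `original` into `mangled`."""
    raw = original.encode("utf-8")
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == mangled:
                matches.append(name)
        except UnicodeDecodeError:
            # Some code pages have no mapping for certain bytes.
            continue
    return matches


print(codecs_producing("μ", "Î¼", ["utf-8", "latin1", "cp1252", "cp1251", "cp437"]))
```

For this character both Latin-1 and Windows-1252 give the same result, so the mangled output alone does not tell the two apart.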

reox · 2021-03-01T07:11:19Z

> What is the encoding of your library (Library preferences)?

It is set to UTF-8 with mode biblatex, and the bib file itself is indeed UTF-8:

$ file references.bib
references.bib: UTF-8 Unicode text, with very long lines, with CRLF, LF line terminators

I also resolved the issue of mixed CRLF/LF and set them all to LF now, but that did not change the import behavior :/

Can I somehow debug what is sent from the browser to JabRef? Is there a temp file I can watch?

tobiasdiez · 2021-03-01T12:16:45Z

Strange...

Currently, it is not written to a temporary file, but you can do this at the following point:
https://github.com/JabRef/jabref/blob/26433573032c72df99a7912cd4cf929f13891d53/buildres/windows/JabRefHost.ps1#L41 (the file should be in the installation directory of JabRef)
Adding something like Out-File -FilePath .\Dump.txt -InputObject $messageText should work. Maybe you need to change the path to something where you have write access.

reox · 2021-03-01T13:09:31Z

Okay thanks, I just did that:

$ file Dump.txt
Dump.txt: Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators

but maybe that is because Out-File writes UTF-16 by default? Anyway, in the file the µ is correct.
I also tried setting the default and library encoding to UTF16 but that does not change the import...

tobiasdiez · 2021-03-01T13:12:09Z

And does it work correctly if you import it from the command line with jabref -i file? https://docs.jabref.org/advanced/commandline#import-file-i-filename-import-format

reox · 2021-03-01T13:13:40Z

that does not like it at all:

"C:\Program Files\JabRef\runtime\bin\JabRef.bat" -i "C:\Users\Sebastian\Dump.txt",bibtex
ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [d]
ERROR StatusLogger Unrecognized conversion specifier [d] starting at position 16 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [thread]
ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [level]
ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [logger]
ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [msg]
ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern.
ERROR StatusLogger Unrecognized format specifier [n]
ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern.
Importing: C:\Users\Sebastian\Dump.txt
Error occurred when parsing entry: 'Error in line 1: Expected { or ( but received '. Skipped entry.

Dump.txt

reox · 2021-07-14T11:51:10Z

I recently updated to JabRef 5.3 but there is still this issue.
Is there anything else I can try?

If I download the Bibtex file from Elsevier directly (i.e. Cite -> Export citation to bibtex) I can import it into JabRef using --importToOpen.

I also found out that the error 'Error in line 1: Expected { or ( but received '. Skipped entry. comes from importing a BOM-marked UTF-16 file. If I convert it to UTF-8 without BOM, it works. So this is an issue with dumping the file using PowerShell.
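
The conversion mentioned above can be done with a few lines of Python. This is a minimal sketch that assumes the input is either BOM-marked UTF-16 or UTF-8 (with or without BOM); other encodings such as UTF-32 are not handled, and the function name is made up for illustration.

```python
import codecs


def to_utf8_without_bom(raw: bytes) -> bytes:
    """Re-encode BOM-marked UTF-16/UTF-8 bytes as plain UTF-8 without a BOM."""
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        # "utf-8-sig" strips a leading UTF-8 BOM if present, else acts like UTF-8.
        text = raw.decode("utf-8-sig")
    return text.encode("utf-8")


sample = "@Article{key, abstract = {μCT}}".encode("utf-16")  # BOM is added on encode
print(to_utf8_without_bom(sample))
```

Reading the dump with `open(path, "rb").read()` and writing the returned bytes back gives a file that starts directly with `@`.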

tobiasdiez · 2021-07-18T10:31:25Z

What happens if you use Out-File -Encoding utf8NoBOM https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/out-file?view=powershell-7.1#parameters to ensure that the dumped file is in utf8? (According to the documentation, this is actually the default...strange).

reox · 2021-07-19T06:42:05Z

I just tried that but then the script crashes.
I have added:

    Out-File -FilePath "C:\Users\Reox\Dump.bib" -InputObject $messageText 
    Out-File -Encoding utf8 -FilePath "C:\Users\Reox\Dump_utf8.bib" -InputObject $messageText 
    Out-File -Encoding utf8NoBom -FilePath "C:\Users\Reox\Dump_utf8nobom.bib" -InputObject $messageText

The results:

$ file Dump.bib Dump_utf8.bib
Dump.bib:      Little-endian UTF-16 Unicode text, with very long lines, with CRLF line terminators
Dump_utf8.bib: UTF-8 Unicode (with BOM) text, with very long lines, with CRLF line terminators

As said, utf8NoBom crashes... Unfortunately, I cannot see why: the Firefox extension simply says "Error while sending to JabRef. Please see the browsers error console for details." but I cannot see anything there.
When I run that command on a shell, I get:

> Out-File -Encoding utf8NoBom -FilePath baz.txt -InputObject "Hello world"
Out-File : Cannot validate argument on parameter 'Encoding'. The argument "utf8NoBom" does not belong to the set
"unknown,string,unicode,bigendianunicode,utf8,utf7,utf32,ascii,default,oem" specified by the ValidateSet attribute.
Supply an argument that is in the set and then try the command again.
At line:1 char:20
+ Out-File -Encoding utf8NoBom -FilePath baz.txt -InputObject "Hello wo ...
+                    ~~~~~~~~~
    + CategoryInfo          : InvalidData: (:) [Out-File], ParameterBindingValidationException
    + FullyQualifiedErrorId : ParameterArgumentValidationError,Microsoft.PowerShell.Commands.OutFileCommand

It looks like I'm running PS5:

> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      5.1.19041.1023
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.19041.1023
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

I downloaded v7 now, and in the v7 shell it seems to work with -Encoding. ~~However, how do I get the JabRefHost.ps1 to run with 7 instead of the system 5.1? As far as I understand, you can not really remove 5.1 from the system...~~
I changed powershell.exe to pwsh.exe in JabRefHost.bat, and indeed it runs now with v7 and I can dump now all three:

Dump.bib:           UTF-8 Unicode text, with very long lines, with CRLF line terminators
Dump_utf8.bib:      UTF-8 Unicode text, with very long lines, with CRLF line terminators
Dump_utf8NoBom.bib: UTF-8 Unicode text, with very long lines, with CRLF line terminators

Now, they are all the same :D

Unfortunately, the characters are still broken.

However, I can now import the dumped bib file and the characters are all correct there.

reox · 2021-07-19T07:01:20Z

As a workaround, I created a temporary file and import that one:

$ diff JabRefHost.bak.ps1 JabRefHost.ps1
41c41,45
<     $output = & $jabRefExe -importBibtex "$messageText" *>&1
---
>     $tempfile = New-TemporaryFile
>     # utf8NoBom just to be sure...
>     Out-File -Encoding utf8NoBom -FilePath $tempfile -InputObject $messageText
>     $output = & $jabRefExe -importToOpen $tempfile *>&1
>     Remove-Item $tempfile

This seems to work flawlessly!
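
For comparison, the same idea (write the entry to a UTF-8 file without BOM and pass only the path) can be sketched in Python. This is an illustration, not a patch for any JabRef host script: the function names are hypothetical, `"jabref"` is a placeholder executable, and the flag is written as `--importToOpen` as used earlier in this thread.

```python
import os
import tempfile


def write_temp_bib(bibtex: str) -> str:
    """Write BibTeX text to a temporary .bib file as UTF-8 without a BOM."""
    fd, path = tempfile.mkstemp(suffix=".bib")
    # Python's "utf-8" codec never writes a BOM; newline="" avoids
    # translating line endings on Windows.
    with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
        handle.write(bibtex)
    return path


def build_import_command(jabref_exe: str, bib_path: str) -> list:
    """Build the argument list; only a file path crosses the command line."""
    return [jabref_exe, "--importToOpen", bib_path]


path = write_temp_bib("@Article{key, abstract = {μCT}}")
try:
    print(build_import_command("jabref", path))  # "jabref" is a placeholder
finally:
    os.remove(path)  # mirrors the Remove-Item step of the diff
```

The point of the design is that the non-ASCII text never passes through the command line, where the console code page can interfere.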

tobiasdiez · 2021-07-19T07:21:16Z

Whooo, nice! I'm glad you found a workaround.

May I ask you to open a PR at the main jabref repo with the changes to the powershell script https://github.com/JabRef/jabref/blob/main/buildres/windows/JabRefHost.ps1? Your approach with writing to a temporary file namely also fixes JabRef/jabref#7374.

reox · 2021-07-19T08:14:20Z

The only issue is: it seems to only work with ps7...
Because in 5.1, the default was apparently unicode: https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/out-file?view=powershell-5.1#parameters
and because NoBom was not available back then, JabRef would not accept that file.

Thus, it would be necessary to check whether PS5 is used and use the old method, or, if PS7 is used, the temp file.
Or is there some PowerShell guru out there who has a hint on how to do that properly?

tobiasdiez · 2021-07-19T09:20:24Z

You could use $PSVersionTable.PSVersion.Major to get the powershell version. Then for >= 7 one can use utf8NoBom, and for < 7 one can use utf8 for the encoding of the temporary file. Would that work?

reox · 2021-07-19T09:31:14Z

Could work, yes. However, for 5.1 there seems to be no way to write a file without a BOM, or at least I cannot make it work. You would also have to switch the bat file to the correct interpreter. The Microsoft documentation tells me that the 5.1 interpreter is called powershell.exe and the newer ones pwsh.exe. So you would have to check whether pwsh.exe is installed and fall back to powershell.exe?
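
The selection logic discussed here is small enough to sketch. The real change would live in the bat file and the PowerShell script; the Python below only illustrates the decision, with hypothetical function names and an injectable lookup so it can be tried without PowerShell installed.

```python
import shutil


def pick_powershell(which=shutil.which) -> str:
    """Prefer pwsh (newer PowerShell) and fall back to powershell (5.1)."""
    for candidate in ("pwsh", "powershell"):
        if which(candidate):
            return candidate
    raise FileNotFoundError("no PowerShell interpreter found on PATH")


def temp_file_encoding(major_version: int) -> str:
    """Pick the Out-File encoding name proposed above for a major version."""
    return "utf8NoBOM" if major_version >= 7 else "utf8"


# Simulate a machine that only has Windows PowerShell 5.1 installed.
only_legacy = lambda name: name if name == "powershell" else None
print(pick_powershell(which=only_legacy))
print(temp_file_encoding(5), temp_file_encoding(7))
```

Note that the `utf8` branch still writes a BOM on 5.1, which is exactly the open problem raised in this comment.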

tobiasdiez · 2021-07-19T10:41:33Z

What are the issues one encounters when using utf8 with BOM?

Yes, you are right, the bat file also needs to be changed to run pwsh instead of powershell. The following might be helpful for this:
https://github.com/Jonathing/MCGradle-Scripts/blob/8dcccac6ba62e7e0fa5de4c2cfd500b50b1cc692/wrappers/MCGradle%20Scripts.bat#L6-L19
https://github.com/BlueBubblesApp/flutter2/blob/48c2a67b70c9a4c83e86d6f06a0b597c4dc3560d/bin/internal/shared.bat#L29-L40

reox · 2021-07-19T11:11:31Z

> What are the issues one encounters when using utf8 with BOM?

see #274 (comment)

then it can not be imported in jabref.
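
One plausible reason a BOM trips up a parser is that the BOM survives decoding as an invisible first character, so the input no longer starts with `@`. Whether JabRef's importer fails for exactly this reason is an assumption; the sketch only shows the effect in Python.

```python
import codecs

entry = "@Article{key, title = {Example}}"
with_bom = codecs.BOM_UTF8 + entry.encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as an invisible first character.
naive = with_bom.decode("utf-8")
print(repr(naive[0]))  # the BOM code point, not "@"

# The "utf-8-sig" codec strips the BOM, so the entry starts with "@" again.
tolerant = with_bom.decode("utf-8-sig")
print(repr(tolerant[0]))
```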

tobiasdiez · 2021-07-19T11:27:56Z

Ah ok, I thought it was a problem with utf16. This is a bit of a shot in the dark, but does it work if you use [IO.File]::WriteAllLines($tempfile, $messageText) as suggested here https://stackoverflow.com/a/32951824/873661 ?

reox · 2021-07-19T11:59:49Z

yes!
Using that I got it working in PS5.1.

I created a PR here: JabRef/jabref#7918

tobiasdiez · 2021-07-19T12:16:47Z

Nice! Thanks a lot for your continued work on this issue. Very much appreciated ❤️

reox · 2021-07-19T12:25:49Z

thank you for the powershell tricks ;)

btw I hope that it does not break other users' experience on other windows versions though 😅

* Write temporary file on bib import

  This resolves an issue where the encoding somehow got lost when using the Jabref Browser extension. It will now write a temporary file with UTF-8 encoding rather than passing the bibtex on the commandline. See JabRef/JabRef-Browser-Extension#274

* adding changelog entry

Co-authored-by: Sebastian Bachmann <[email protected]>

tobiasdiez added env:linux status:accepted status:help-wanted type:bug labels Feb 12, 2021

tobiasdiez added status:waiting-for-feedback and removed status:accepted labels Feb 13, 2021

reox mentioned this issue Jul 19, 2021

Write temporary file on bib import in jabrefhost.ps1 JabRef/jabref#7918

Merged

tobiasdiez closed this as completed Jul 19, 2021
