improve UTF8 support #71

Closed
robertoostenveld opened this issue Nov 24, 2020 · 1 comment

Comments

@robertoostenveld

With http://github.com/fieldtrip/fieldtrip we are using jsonlab in the data2bids function. @DidiLamers identified a problem when converting a dataset with an author (which goes in the dataset_description.json) that had a non-ascii character in her name (in fact, it was the letter "í", with an acute accent). The BIDS validator subsequently complained that the json is not UTF8; also opening the json file in the Atom editor resulted in the character being shown correctly.

It might be that this is solved already in version 2.0; we are now shipping FieldTrip with jsonlab version 1.5.

If the limitation is still there in 2.0, I would like to discuss whether this can be solved on the jsonlab-side.

An alternative that I see is that around https://github.com/fieldtrip/fieldtrip/blob/2a67cb59746eb81fe05bc118c96d5f257a63f51c/data2bids.m#L2208
we introduce some extra code to deal with this, like reading the json file and writing it back with UTF8 encoding.
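The read-and-rewrite workaround described above could look roughly like the sketch below. It is only an illustration, not code from FieldTrip or jsonlab: the function name `rewrite_json_as_utf8` and its `srcEncoding` argument are hypothetical, and it assumes the encoding the file was originally written in is known. If the file is already UTF-8 and is decoded as something else, the result would be double-encoded.

```matlab
function rewrite_json_as_utf8(filename, srcEncoding)
% REWRITE_JSON_AS_UTF8 re-encodes a text file as UTF-8 in place.
% Hypothetical helper for illustration; srcEncoding is the encoding the
% file was written in (for example 'ISO-8859-1'). If omitted, MATLAB's
% current default encoding is assumed.
if nargin < 2
    srcEncoding = feature('DefaultCharacterSet');
end

% Read the raw bytes without any character decoding.
fid = fopen(filename, 'r');
assert(fid > 0, 'could not open %s for reading', filename);
bytes = fread(fid, [1 inf], 'uint8=>uint8');
fclose(fid);

% Decode the bytes to MATLAB characters, then encode them as UTF-8.
txt = native2unicode(bytes, srcEncoding);
utf8Bytes = unicode2native(txt, 'UTF-8');

% Write the UTF-8 bytes back to the same file.
fid = fopen(filename, 'w');
assert(fid > 0, 'could not open %s for writing', filename);
fwrite(fid, utf8Bytes, 'uint8');
fclose(fid);
end
```

Working on raw bytes with `native2unicode` and `unicode2native` keeps the conversion explicit, so it does not depend on which encoding `fopen` picks by default on a given system.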

@fangq
Member

fangq commented Mar 13, 2022

@robertoostenveld, sorry for leaving this issue open for such a long time. I have not yet been able to figure out an environment to reproduce this issue, but I can vaguely see why this happens.

In JSONLab, strings are saved in MATLAB's "native" encoding, so they are not guaranteed to be utf-8. Strictly speaking this is not JSON compliant, but it is common practice in many JSON parsers (especially in non-strict mode).

MATLAB's default text encoding can be queried and set via feature('DefaultCharacterSet'). In my case (Ubuntu Linux), this has been utf-8, but other users may have a different default (or may have manually set it to something other than utf-8). In that case, a string typed in MATLAB may not be utf-8, which could cause the issue.
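To make the encoding difference concrete, here is a small sketch that queries the default character set and compares the bytes produced for a name containing "í" in the native encoding and in UTF-8. The sample name is made up for illustration. On a system whose default is already utf-8 the two byte sequences will be identical, which may be why the problem is hard to reproduce there.

```matlab
% Query MATLAB's current default text encoding.
enc = feature('DefaultCharacterSet');

% Hypothetical author name containing "í" (code point 237).
name = ['Mar' char(237) 'a'];

% Encode the same characters in the native encoding and in UTF-8.
nativeBytes = unicode2native(name, enc);
utf8Bytes   = unicode2native(name, 'UTF-8');

fprintf('default encoding: %s\n', enc);
fprintf('native bytes: %s\n', sprintf('%d ', nativeBytes));
fprintf('UTF-8 bytes:  %s\n', sprintf('%d ', utf8Bytes));

if ~isequal(nativeBytes, utf8Bytes)
    disp('Text written in the native encoding would not be valid UTF-8 here.');
end
```

In UTF-8, "í" takes two bytes (195 173), whereas in a single-byte encoding such as ISO-8859-1 it is the single byte 237, which a strict UTF-8 validator rejects.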

I suppose this can be resolved by adding

if(varargin{1}.strictunicode)
    val=native2unicode(val);
end

after this line, and adding a new option, StrictUnicode, at the beginning.

However, I could not figure out how to test this. Using a sample Unicode file, I was able to print the Unicode text properly regardless of my DefaultCharacterSet setting. See the screenshot below.

[Screenshot: bids_unicode_jsonlab_71]

If you can help me find a system/MATLAB configuration in which I can reproduce this issue, I will test the fix proposed above.
