Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When reading edf, decode all strings using 'iso8859-1'. #429

Merged
merged 1 commit into from
Oct 28, 2022

Conversation

Ivorforce
Copy link
Contributor

@Ivorforce Ivorforce commented Oct 26, 2022

The edf specifications notes all strings being 'ascii'.

By default, decode() uses the 'utf-8' encoding. In some cases, this can cause decoding issues:

File ~/Library/Caches/pypoetry/virtualenvs/workspace-lukas-kum-5zRav6QH-py3.9/lib/python3.9/site-packages/wfdb/io/convert/edf.py:262, in read_edf(record_name, pn_dir, header_only, verbose, rdedfann_flag)
    259 physical_dims = []
    260 for _ in range(n_sig):
    261     physical_dims.append(
--> 262         struct.unpack("<8s", edf_file.read(8))[0].decode().strip()
    263     )
    264 if verbose:
    265     print("Physical Dimensions: {}".format(physical_dims))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 0: invalid start byte

Arguably, the encoding ascii should be used. However, in my case, this didn't completely solve the problem:

File ~/Library/Caches/pypoetry/virtualenvs/workspace-lukas-kum-5zRav6QH-py3.9/lib/python3.9/site-packages/wfdb/io/convert/edf.py:262, in read_edf(record_name, pn_dir, header_only, verbose, rdedfann_flag)
    259 physical_dims = []
    260 for _ in range(n_sig):
    261     physical_dims.append(
--> 262         struct.unpack("<8s", edf_file.read(8))[0].decode("ascii").strip()
    263     )
    264 if verbose:
    265     print("Physical Dimensions: {}".format(physical_dims))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb5 in position 0: ordinal not in range(128)

Using the admittedly peculiar unicode_escape encoding solved the problem for me. Using the verbose printing mode, the culprit can be seen (µ):

Physical Dimensions: ['', 'mV', 'mV', 'mV', 'mV', 'mV', 'mV.s', 'µV', 'mV', 'mV.s', 'µV', '', 's', '', '', 'V']

The file in question was produced by LabView. Unfortunately, I cannot share it. If need be, it might be possible to reproduce the error otherwise.

@bemoody
Copy link
Collaborator

bemoody commented Oct 26, 2022

Is decode("unicode_escape") somehow different from decode("iso8859-1")? It looks to me like it does the same thing.

I think if you find a random EDF file, it's more likely to be encoded in ISO-8859-1 (or Windows-1252) than UTF-8, but it'd be nice for the wfdb package to be "locale-neutral".

Could we add 'encoding' and maybe 'errors' arguments to read_edf, to make this explicit? Maybe with ISO-8859-1 or Windows-1252 as default.

@Ivorforce
Copy link
Contributor Author

Ivorforce commented Oct 26, 2022

Yep, seems like iso8859-1 works too. I admit i blindly copied that encoding name from some Stackoverflow issue - eager to move forward - but reading up on it being python specific, it really doesn't make sense to use it.

I like your suggestion to make encoding a parameter. I'm updating the PR accordingly.

@bemoody
Copy link
Collaborator

bemoody commented Oct 26, 2022

It looks good to me. Thanks! I don't know why the test is failing :/

@bemoody
Copy link
Collaborator

bemoody commented Oct 26, 2022

Oh wait, it looks like you need to add the 'encoding' parameter to rdedfann() as well.

…(defaulting to iso8859-1 rather than utf-8). While edf specifications require pure ascii strings, files may ununiformly choose another encoding.
@Ivorforce
Copy link
Contributor Author

Right, makes sense. Fixed the issue.

@bemoody
Copy link
Collaborator

bemoody commented Oct 28, 2022

Looks great!

@bemoody bemoody merged commit 7b361e2 into MIT-LCP:main Oct 28, 2022
@Ivorforce Ivorforce deleted the edf-encoding branch October 28, 2022 18:39
@Ivorforce Ivorforce changed the title When reading edf, decode all strings using 'unicode_escape'. When reading edf, decode all strings using 'iso8859-1'. Nov 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants