PHI-base CSV releases should use UTF-8 encoding #13

jseager7 · 2021-12-09T13:52:00Z

Trying to open the PHI-base 4.12 CSV file as UTF-8 (in Python) throws an error because the file is not valid UTF-8.

I'm not completely sure what encoding the files use, but using cp1252 encoding doesn't throw any errors (that's the Windows-1252 encoding, a legacy default for many Windows components).

Windows-1252 isn't appropriate for PHI-base now (if it ever was) because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters outside of the Windows-1252 encoding range, such as the delta symbol (Δ). These symbols end up replaced with question marks. Here's an example from the PHI-base 4.12 CSV:

Record ID                        Record 11248
Pathogen strain    CA14 (?ku70 ?pyrG::AfpyrG)
Name: 11247, dtype: object
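
The question marks are what a lossy conversion to Windows-1252 would produce. Below is a minimal Python sketch of that effect. The pre-conversion strain name is an assumption, inferred from the position of the question marks in the output above; it was not checked against the original dataset.

```python
# Assumed original value, inferred from the '?' positions in the output above.
strain = "CA14 (Δku70 ΔpyrG::AfpyrG)"

# Δ (U+0394) has no Windows-1252 code point, so strict encoding fails...
try:
    strain.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError as error:
    strict_ok = False
    print(f"cannot encode {error.object[error.start]!r}")

# ...while a lossy encoder substitutes '?' for each unencodable character.
lossy = strain.encode("cp1252", errors="replace").decode("cp1252")
print(lossy)
```

Once the substitution has happened, the original character cannot be recovered from the CSV alone, because a substituted '?' is indistinguishable from a genuine one.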

@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.

Here's the list of files that fail to load as UTF-8:

phi-base_v4-01_2016-05-01.csv
phi-base_v4-03_2017-05-01.csv
phi-base_v4-05_2018-05-15.csv
phi-base_v4-11_2021-05-05.csv
phi-base_v4-12_2021-09-02.csv

We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible).
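
As a rough illustration of what regenerating a file in UTF-8 could look like, here is a sketch using Python's standard csv module. The rows and the output filename are hypothetical placeholders, not values taken from the PHI-base spreadsheets, and the real export process may differ.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical rows standing in for values read from the original dataset.
rows = [
    ["Record ID", "Pathogen strain"],
    ["Record 1", "strain (Δgene)"],
]

# Hypothetical output path; a temporary directory keeps the sketch side-effect free.
path = Path(tempfile.mkdtemp()) / "example_release.csv"

# Passing the encoding explicitly avoids falling back to the platform default,
# which can be a legacy code page on Windows.
with open(path, "w", encoding="utf-8", newline="") as file:
    csv.writer(file).writerows(rows)

# Read the file back as UTF-8 to confirm that nothing was lost.
with open(path, encoding="utf-8", newline="") as file:
    round_tripped = list(csv.reader(file))

print(round_tripped == rows)
```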

For completeness, here's the list of valid files: those that are either UTF-8 encoded, or contain no characters outside of the ASCII character set:

phi-base_v4-00_2015-09-09.csv
phi-base_v4-02_2016-10-03.csv
phi-base_v4-04_2017-11-10.csv
phi-base_v4-06_2018-12-05.csv
phi-base_v4-07_2019-05-27.csv
phi-base_v4-08_2019-09-16.csv
phi-base_v4-09_2020-05-25.csv
phi-base_v4-10_2020-11-02.csv
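
The distinction between "pure ASCII" and "UTF-8 with non-ASCII characters" can be checked on the raw bytes. The helper below is a hypothetical sketch (the function name is not part of this repository) and the byte strings are made-up samples rather than contents of the release files.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for raw file contents."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


print(classify_encoding(b"W284"))           # only ASCII bytes
print(classify_encoding(b"W\xc3\xbc284"))   # ü encoded as UTF-8
print(classify_encoding(b"W\xfc284"))       # ü encoded as Windows-1252
```

Every ASCII file is also valid UTF-8, which is why ASCII-only releases load without errors.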

martin2urban · 2021-12-09T14:09:21Z

@jseager7 That is an interesting observation. Could you please share the Python code snippet that throws the error?

jseager7 · 2021-12-09T14:22:20Z

@martin2urban Here's the code. You'll need to run it in the 'releases' folder of this repository. It will print the filenames for all files that fail to decode as UTF-8.

import glob

# Match release filenames such as phi-base_v4-12_2021-09-02.csv
release_files = sorted(glob.glob(
    'phi-base_v[0-9]-[0-9][0-9]'
    '_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].csv'
))

for filename in release_files:
    with open(filename, encoding='utf8') as file:
        try:
            file.read()
        except UnicodeDecodeError:
            # Report every file that is not valid UTF-8
            print(filename)

jseager7 · 2022-03-01T14:27:41Z

Just an update on this: the following two files fail to load even with Windows-1252 encoding:

phi-base_v4-01_2016-05-01.csv
phi-base_v4-05_2018-05-15.csv

phi-base_v4-01_2016-05-01.csv fails because a lowercase u with umlaut (ü) in Record 6166 is encoded incorrectly:

... Candida dubliniensis,no data found,W�284,Erythematous candidiasis,Birds, ...

(It should be Wü284)
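
A file can fail to decode as Windows-1252 because Python's cp1252 codec leaves a few byte values undefined. The sketch below lists them. The thread does not say which byte actually appears in Record 6166, so this only shows the set of candidates.

```python
# Collect every single byte that Python's cp1252 codec refuses to decode.
undefined = []
for value in range(256):
    try:
        bytes([value]).decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(value)

print([hex(value) for value in undefined])

# For comparison, ü correctly encoded as Windows-1252 is the single byte 0xFC.
print(b"\xfc".decode("cp1252"))
```

One possible, unconfirmed explanation is that the value came from an older DOS code page: byte 0x81, for instance, is ü in code page 437 but undefined in Python's cp1252 codec.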

phi-base_v4-05_2018-05-15.csv fails because it contains a mix of invalid UTF-8 bytes and valid UTF-8 characters. For example, byte 0xC2 at position 14829 looks like a leftover byte from a non-breaking space (encoded in UTF-8 as the two bytes 0xC2 0xA0).
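
Byte offsets like the one above can be read from the attributes of UnicodeDecodeError. The helper name and the sample byte strings below are hypothetical; they only illustrate how a lone 0xC2 differs from a complete non-breaking space.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte) of the first UTF-8 decoding failure, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as error:
        return error.start, data[error.start]
    return None


complete = b"a\xc2\xa0b"   # 0xC2 0xA0 is a complete non-breaking space
truncated = b"a\xc2 b"     # 0xC2 followed by an ordinary space (made-up sample)

print(first_invalid_utf8(complete))
print(first_invalid_utf8(truncated))
```

To apply this to a release file, the bytes would have to be read in binary mode (open(filename, 'rb')) so that no decoding happens before the check.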

martin2urban · 2022-11-21T09:34:39Z

I removed invalid UTF-8 characters from all files mentioned above.

jseager7 · 2023-08-01T10:59:51Z

@martin2urban Sorry about the late reply but I only just noticed this. It's not safe to simply remove the invalid characters: by doing this, you're changing the names of some strains.

For example, removing the ü from Wü284 makes it W284, which isn't the same strain name. Either the character ü should be replaced with u (without umlaut) or the files should be properly re-encoded as UTF-8.
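
The three options can be compared in a short Python sketch. The transliteration step uses Unicode decomposition from the standard unicodedata module; this is one possible approach, not necessarily how the files were or will be processed.

```python
import unicodedata

name = "Wü284"

# Dropping non-ASCII characters silently changes the strain name.
removed = name.encode("ascii", errors="ignore").decode("ascii")

# Decomposing (NFKD) and discarding combining marks keeps the base letter.
decomposed = unicodedata.normalize("NFKD", name)
transliterated = "".join(
    char for char in decomposed if not unicodedata.combining(char)
)

# Encoding as UTF-8 preserves the name exactly.
preserved = name.encode("utf-8").decode("utf-8")

print(removed, transliterated, preserved)
```

Transliteration only helps for accented letters. A character such as Δ has no ASCII base letter to fall back on, so it survives only if the file is encoded as UTF-8.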

I'll fix the problem with PHI-base version 4.1 but I'll also need to make sure that removing characters from version 4.5 hasn't caused problems.

jseager7 · 2023-08-01T11:24:31Z

I've looked at some of the other files modified by removing non-ASCII characters and there are too many data errors introduced by this change, so I've reverted the commit.

As I mentioned before, the only way we can fix the encoding errors properly is by copying values from the original PHI-base spreadsheets for these versions.

jseager7 added the bug label Dec 9, 2021

jseager7 assigned martin2urban Dec 9, 2021

martin2urban closed this as completed Nov 21, 2022

jseager7 reopened this Aug 1, 2023

jseager7 added a commit that referenced this issue Aug 1, 2023

Fix typo in PHI-base v4.1

ecdd97e

Correct W284 to Wü284 See #13

jseager7 added a commit that referenced this issue Aug 1, 2023

Revert "invalid UTF-8 characters removed from files"

ef9396e

This reverts commit 34a54c5. Simply removing non-ASCII characters doesn't work because it introduces typos in many strain names. See #13
