PHI-base CSV releases should use UTF-8 encoding #13

Open
jseager7 opened this issue Dec 9, 2021 · 6 comments

@jseager7
Contributor

jseager7 commented Dec 9, 2021

Trying to open the PHI-base 4.12 CSV file as UTF-8 (in Python) throws an error because the file is not valid UTF-8.

I'm not completely sure what encoding the files use, but using cp1252 encoding doesn't throw any errors (that's the Windows-1252 encoding, a legacy default for many Windows components).

Windows-1252 isn't appropriate for PHI-base now (if it ever was) because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters that Windows-1252 cannot represent, such as the delta symbol (Δ). These characters end up replaced with question marks. Here's an example from the PHI-base 4.12 CSV:

Record ID                        Record 11248
Pathogen strain    CA14 (?ku70 ?pyrG::AfpyrG)
Name: 11247, dtype: object
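
The replacement with question marks can be reproduced in Python. This is a minimal sketch of the mechanism only: it assumes the question marks in the record above stand for Δ, and it says nothing about which program actually produced the release files.

```python
# Assumed original value: the '?' characters in the record are taken to be Δ.
strain = 'CA14 (Δku70 ΔpyrG::AfpyrG)'

# Δ (U+0394) has no code point in Windows-1252, so strict encoding fails.
try:
    strain.encode('cp1252')
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# With errors='replace', every unencodable character becomes '?'.
lossy = strain.encode('cp1252', errors='replace').decode('cp1252')

print(strict_ok)
print(lossy)
```

Once the file has been written this way, the original characters cannot be recovered from the file alone, because every unencodable character maps to the same byte.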

@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.

Here's the list of files that fail to load as UTF-8:

  • phi-base_v4-01_2016-05-01.csv
  • phi-base_v4-03_2017-05-01.csv
  • phi-base_v4-05_2018-05-15.csv
  • phi-base_v4-11_2021-05-05.csv
  • phi-base_v4-12_2021-09-02.csv

We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible).
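
If the files are regenerated with a script rather than exported from a spreadsheet program, the encoding can be set explicitly. The following is a rough sketch using only the standard library; `write_utf8_csv` and the example rows are hypothetical, standing in for data read from the original datasets.

```python
import csv
import tempfile
from pathlib import Path


def write_utf8_csv(path, rows):
    """Write rows to a CSV file, always encoding the output as UTF-8."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', encoding='utf-8', newline='') as file:
        csv.writer(file).writerows(rows)


# Hypothetical rows standing in for values from the original dataset.
rows = [
    ['Record ID', 'Pathogen strain'],
    ['Record 11248', 'CA14 (Δku70 ΔpyrG::AfpyrG)'],
]

# Round trip through a temporary file to check that nothing is lost.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / 'example.csv'
    write_utf8_csv(out_path, rows)
    with open(out_path, encoding='utf-8', newline='') as file:
        round_trip = list(csv.reader(file))

print(round_trip == rows)
```

Python also provides a `utf-8-sig` codec that prepends a byte order mark, which may help if the files need to open cleanly in Excel; whether that is desirable for PHI-base releases is a separate decision.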


For completeness, here's the list of valid files: those that are either UTF-8 encoded, or contain no characters outside of the ASCII character set:

  • phi-base_v4-00_2015-09-09.csv
  • phi-base_v4-02_2016-10-03.csv
  • phi-base_v4-04_2017-11-10.csv
  • phi-base_v4-06_2018-12-05.csv
  • phi-base_v4-07_2019-05-27.csv
  • phi-base_v4-08_2019-09-16.csv
  • phi-base_v4-09_2020-05-25.csv
  • phi-base_v4-10_2020-11-02.csv
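
To tell the two kinds of valid file apart (pure ASCII versus UTF-8 with non-ASCII characters), something like the following could be used. It is a sketch, not the check that produced the lists above; the function names are made up for illustration.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'invalid' for the given bytes."""
    try:
        data.decode('ascii')
        return 'ascii'
    except UnicodeDecodeError:
        pass
    try:
        data.decode('utf-8')
        return 'utf-8'
    except UnicodeDecodeError:
        return 'invalid'


def classify_file(path) -> str:
    """Classify a file by reading it as raw bytes."""
    with open(path, 'rb') as file:
        return classify_encoding(file.read())


print(classify_encoding(b'W284'))
print(classify_encoding('Wü284'.encode('utf-8')))
print(classify_encoding(b'W\xfc284'))
```

Every ASCII file is also valid UTF-8, so a file reported as 'ascii' would load without errors under either encoding.
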
@jseager7 jseager7 added the bug Something isn't working label Dec 9, 2021
@martin2urban
Member

@jseager7 That is an interesting observation. Could you please make available the python code snippet throwing the error?

@jseager7
Contributor Author

jseager7 commented Dec 9, 2021

@martin2urban Here's the code. You'll need to run it in the 'releases' folder of this repository. It will print the filenames for all files that fail to decode as UTF-8.

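import glob

# Match release filenames such as phi-base_v4-12_2021-09-02.csv
release_files = glob.glob(
    'phi-base_v[0-9]-[0-9][0-9]'
    '_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].csv'
)

for filename in release_files:
    with open(filename, encoding='utf8') as file:
        try:
            # Reading the whole file forces every byte to be decoded
            file.read()
        except UnicodeDecodeError:
            # Print only the files that are not valid UTF-8
            print(filename)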

@jseager7
Contributor Author

jseager7 commented Mar 1, 2022

Just an update on this: the following two files fail to load even with Windows-1252 encoding:

  • phi-base_v4-01_2016-05-01.csv
  • phi-base_v4-05_2018-05-15.csv

phi-base_v4-01_2016-05-01.csv fails because a lowercase u with umlaut (ü) in Record 6166 is encoded incorrectly:

... Candida dubliniensis,no data found,W�284,Erythematous candidiasis,Birds, ...

(It should be Wü284)
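
For reference, here is how ü is represented in the two encodings, and why a file can fail even as Windows-1252. The actual bytes in Record 6166 aren't shown in this thread, so this sketch only illustrates the general behaviour of Python's codecs; `fails_cp1252` is a helper name made up for the example.

```python
name = 'Wü284'

utf8_bytes = name.encode('utf-8')      # ü becomes two bytes: 0xC3 0xBC
cp1252_bytes = name.encode('cp1252')   # ü becomes one byte: 0xFC

# Windows-1252 bytes read as UTF-8 are invalid; with errors='replace'
# the bad byte is shown as U+FFFD, the replacement character.
mangled = cp1252_bytes.decode('utf-8', errors='replace')

# Windows-1252 leaves five byte values unassigned, and Python's cp1252
# codec raises an error when it meets any of them.
UNDEFINED_CP1252 = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])


def fails_cp1252(data: bytes) -> bool:
    """Return True if the bytes cannot be decoded as Windows-1252."""
    try:
        data.decode('cp1252')
        return False
    except UnicodeDecodeError:
        return True


print(utf8_bytes, cp1252_bytes, mangled)
print([fails_cp1252(bytes([b])) for b in UNDEFINED_CP1252])
```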

phi-base_v4-05_2018-05-15.csv fails because it contains a mix of invalid UTF-8 bytes and valid UTF-8 characters. For example, the byte 0xC2 at position 14829 looks like the leftover first byte of a UTF-8 non-breaking space (0xC2 0xA0).
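
Byte positions like this can be found from the `start` attribute of `UnicodeDecodeError`. The sketch below lists every offset where decoding fails; `find_invalid_utf8` is a hypothetical helper, and it re-slices the data on each failure, so it is only suited to files with a modest number of bad bytes.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) pairs where UTF-8 decoding fails."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the rest of the data is valid
        except UnicodeDecodeError as error:
            # error.start is relative to the slice, so add pos back on.
            offset = pos + error.start
            problems.append((offset, data[offset]))
            pos = offset + 1  # skip the bad byte and keep scanning
    return problems


# A lone 0xC2 followed later by a complete non-breaking space (0xC2 0xA0).
sample = b'a\xc2b\xc2\xa0c'
print(find_invalid_utf8(sample))
```

To apply it to a release file, read the file in binary mode (`open(filename, 'rb')`) and pass the resulting bytes to the function.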

@martin2urban
Member

I removed invalid UTF-8 characters from all files mentioned above.

@jseager7
Contributor Author

jseager7 commented Aug 1, 2023

@martin2urban Sorry about the late reply, but I only just noticed this. It's not safe to simply remove the invalid characters: doing so changes the names of some strains.

For example, removing the ü from Wü284 makes it W284, which isn't the same strain name. Either the character ü should be replaced with u (without umlaut) or the files should be properly re-encoded as UTF-8.
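
The difference between removing a character and replacing it with its unaccented form can be shown in a few lines. This is a sketch using the standard `unicodedata` module; the two helper names are hypothetical, and neither is the method that was used on the release files.

```python
import unicodedata


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character (this is what changes strain names)."""
    return text.encode('ascii', errors='ignore').decode('ascii')


def transliterate(text: str) -> str:
    """Split accented letters into base letter + accent, then drop the accent."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', errors='ignore').decode('ascii')


print(strip_non_ascii('Wü284'))
print(transliterate('Wü284'))
print(transliterate('Δku70'))
```

Transliteration keeps the u in Wü284, but it still drops characters with no ASCII base letter, such as Δ, so it is not a substitute for re-encoding the files as UTF-8.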

I'll fix the problem with PHI-base version 4.1 but I'll also need to make sure that removing characters from version 4.5 hasn't caused problems.

@jseager7 jseager7 reopened this Aug 1, 2023
jseager7 added a commit that referenced this issue Aug 1, 2023
Correct W284 to Wü284
See #13
jseager7 added a commit that referenced this issue Aug 1, 2023
This reverts commit 34a54c5.

Simply removing non-ASCII characters doesn't work because it
introduces typos in many strain names.

See #13
@jseager7
Contributor Author

jseager7 commented Aug 1, 2023

I've looked at some of the other files modified by removing non-ASCII characters and there are too many data errors introduced by this change, so I've reverted the commit.

As I mentioned before, the only way we can fix the encoding errors properly is by copying values from the original PHI-base spreadsheets for these versions.
