Data issues in Data_S3.csv (BA predictor training data) #238

Open
elonp opened this issue Jun 21, 2024 · 4 comments

elonp commented Jun 21, 2024

Hello!

I analysed Data_S3.csv, described in your publication as the training data for your BA predictor, and found some data errors.
Here's my notebook: https://gist.github.com/elonp/bcfbc4b417552d01b2b3d11896a19129
I also attach it as a PDF:
inspect-mhcflurry-ba-training-data.pdf

Hopefully the analysis is useful for others.
It would be wonderful if you could corroborate my analysis, guesswork and conclusions!


timodonnell commented Jun 21, 2024

Thanks for reporting this!

I haven't gone through it in detail, but one thought: the "measurement_source" column unfortunately does not indicate an exact study or sample ID. It's something I did quickly to track where measurements come from, and it's fairly inexact - I believe it's just the last author plus the measurement type (what IEDB calls "Method"). So one possible contributor here is that we have separate measurements of the same peptide/HLA pair, from different studies by the same last author, that end up with an identical measurement_source. In general my approach has been to show the predictor conflicting values for the same peptide/HLA whenever those occur in the training data, i.e. I have not tried to collapse duplicates or find consensus values.

Curious if this makes sense to you and, if so, how much of your observations it might explain.
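
As a rough sketch (assuming Data_S3.csv has `allele`, `peptide`, `measurement_source`, and `measurement_value` columns), something like this could flag peptide/HLA pairs that disagree on value even within a single measurement_source label:

```python
# Rough sketch, not verified against Data_S3.csv: the column names
# (allele, peptide, measurement_source, measurement_value) are assumed.
import pandas as pd

df = pd.read_csv("Data_S3.csv")

# Within each peptide/allele/measurement_source group, count distinct values.
# More than one distinct value inside a single measurement_source would be
# consistent with separate studies sharing the same last-author label.
values_per_group = df.groupby(
    ["allele", "peptide", "measurement_source"]
)["measurement_value"].nunique()

conflicting = values_per_group[values_per_group > 1]
print(f"{len(conflicting)} peptide/allele/source groups with more than one value")
print(conflicting.sort_values(ascending=False).head(10))
```

A large count here would be consistent with distinct studies by the same last author being folded into one measurement_source.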

@timodonnell

Also, just to note, all of the curation code is on GitHub (see the directories starting with data_ under downloads-generation), and the bulk of it, I believe, is in this file:

https://github.com/openvax/mhcflurry/blob/master/downloads-generation/data_curated/curate.py

If you are able to identify places where we are introducing any of the issues you are seeing, please let me know 🙏

The MS measurements with the wrong inequality seem pretty important to fix if the problem is common.
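
One quick way to gauge how common it is, assuming the column names below and that MS-derived rows can be spotted by their measurement_source text (both are assumptions, not verified against Data_S3.csv):

```python
# Rough sketch, assumptions: MS-derived rows mention "mass spec" in
# measurement_source, and the usual column names are present in Data_S3.csv.
import pandas as pd

df = pd.read_csv("Data_S3.csv")

is_ms = df["measurement_source"].str.contains("mass spec", case=False, na=False)
print(df.loc[is_ms, "measurement_inequality"].value_counts(dropna=False))
```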


elonp commented Jun 21, 2024

I think your approach of allowing conflicting measurements in the training data makes sense.

Thanks for the pointer to the curation code. I will try to review it to see if it explains some of my findings.

I wonder if some of the issues are actually upstream of your curation. I find the definition of measurement_inequality a bit confusing: does < mean measurement_value is an upper bound (i.e. real_value < measurement_value) or a lower bound (i.e. measurement_value < real_value)? I had to go over your loss code to verify it's the former, and I'm still not sure. So I wonder whether some of the upstream data authors interpreted it the wrong way around. There is evidence for this in duplicate sets of affinity measurements reported with both < and >, and my concern is that the same thing may have happened in non-duplicate entries, where we cannot detect it.
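
As a rough sketch of that duplicate check (column names assumed; the reading of < as an upper bound is my inference from the loss code, as above), something like this would count peptide/allele/value triples reported with both signs:

```python
# Rough sketch, assumptions: columns allele, peptide, measurement_value, and
# measurement_inequality exist in Data_S3.csv, and "<" is meant as
# "true affinity is below measurement_value".
import pandas as pd

df = pd.read_csv("Data_S3.csv")

# Find peptide/allele/value triples that were reported with both "<" and ">".
signs = df.groupby(["allele", "peptide", "measurement_value"])[
    "measurement_inequality"
].agg(set)
flipped = signs[signs.apply(lambda s: {"<", ">"} <= s)]

print(f"{len(flipped)} peptide/allele/value triples reported with both < and >")
print(flipped.head(10))
```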


elonp commented Jun 24, 2024

I updated the notebook with:
a. the number of records affected by each of the data issues reported in the summary;
b. further analysis leading to the retraction of one of the minor issues.
