Data issues in Data_S3.csv (BA predictor training data) #238

Open
elonp opened this issue Jun 21, 2024 · 4 comments

elonp commented Jun 21, 2024

Hello!

I analysed Data_S3.csv, described in your publication as the training data for your BA predictor, and found some data errors.
Here's my notebook: https://gist.github.com/elonp/bcfbc4b417552d01b2b3d11896a19129
I also attach it as a PDF:
inspect-mhcflurry-ba-training-data.pdf

Hopefully the analysis is useful for others.
It would be wonderful if you could corroborate my analysis, guesswork and conclusions!


timodonnell commented Jun 21, 2024

Thanks for reporting this!

I haven't gone through it in detail, but one thought: the "measurement_source" column unfortunately does not indicate an exact study or sample ID. It's something I did quickly to track where measurements come from, and it's fairly inexact - I believe it's just the last author plus the measurement type (what IEDB calls "Method"). So one possible contributor here is that we have separate measurements of the same peptide/HLA pair, from different studies by the same last author, that end up with an identical measurement_source. In general my approach has been to show the predictor conflicting values for the same peptide/HLA whenever those occur in the training data, i.e. I have not tried to collapse duplicates or find consensus values.

Curious if this makes sense to you and, if so, how much of your observations it might explain.
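
As a rough sketch (assuming Data_S3.csv has `allele`, `peptide`, `measurement_source`, and `measurement_value` columns), something like this could flag peptide/HLA pairs that disagree on value even within a single measurement_source label:

```python
# Rough sketch, not verified against Data_S3.csv: the column names
# (allele, peptide, measurement_source, measurement_value) are assumed.
import pandas as pd

df = pd.read_csv("Data_S3.csv")

# Within each peptide/allele/measurement_source group, count distinct values.
# More than one distinct value inside a single measurement_source would be
# consistent with separate studies sharing the same last-author label.
values_per_group = df.groupby(
    ["allele", "peptide", "measurement_source"]
)["measurement_value"].nunique()

conflicting = values_per_group[values_per_group > 1]
print(f"{len(conflicting)} peptide/allele/source groups with more than one value")
print(conflicting.sort_values(ascending=False).head(10))
```

A large count here would be consistent with distinct studies by the same last author being folded into one measurement_source.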

@timodonnell

Also, just to note, all of the curation code is on GitHub (see the directories starting with data_ under downloads-generation), and the bulk of it, I believe, is in this file:

https://github.com/openvax/mhcflurry/blob/master/downloads-generation/data_curated/curate.py

If you are able to identify places where we are introducing any of the issues you are seeing, please let me know 🙏

The MS measurements with the wrong inequality seem pretty important to fix if the problem is common.
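
One quick way to gauge how common it is, assuming the column names below and that MS-derived rows can be spotted by their measurement_source text (both are assumptions, not verified against Data_S3.csv):

```python
# Rough sketch, assumptions: MS-derived rows mention "mass spec" in
# measurement_source, and the usual column names are present in Data_S3.csv.
import pandas as pd

df = pd.read_csv("Data_S3.csv")

is_ms = df["measurement_source"].str.contains("mass spec", case=False, na=False)
print(df.loc[is_ms, "measurement_inequality"].value_counts(dropna=False))
```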


elonp commented Jun 21, 2024

I think your approach of allowing conflicting measurements in the training data makes sense.

Thanks for the pointer to the curation code. I will try to review it to see if it explains some of my findings.

I wonder if some of the issues are actually upstream of your curation. I find the definition of measurement_inequality a bit confusing: does < mean measurement_value is an upper bound (i.e. real_value < measurement_value) or a lower bound (i.e. measurement_value < real_value)? I had to go over your loss code to verify it's the former, and I'm still not sure. So I wonder whether some of the upstream data authors interpreted it the wrong way around. There is evidence for this in duplicate sets of affinity measurements reported with both < and >, and my concern is that the same thing may have happened in non-duplicate entries, where we cannot detect it.
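
As a rough sketch of that duplicate check (column names assumed; the reading of < as an upper bound is my inference from the loss code, as above), something like this would count peptide/allele/value triples reported with both signs:

```python
# Rough sketch, assumptions: columns allele, peptide, measurement_value, and
# measurement_inequality exist in Data_S3.csv, and "<" is meant as
# "true affinity is below measurement_value".
import pandas as pd

df = pd.read_csv("Data_S3.csv")

# Find peptide/allele/value triples that were reported with both "<" and ">".
signs = df.groupby(["allele", "peptide", "measurement_value"])[
    "measurement_inequality"
].agg(set)
flipped = signs[signs.apply(lambda s: {"<", ">"} <= s)]

print(f"{len(flipped)} peptide/allele/value triples reported with both < and >")
print(flipped.head(10))
```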


elonp commented Jun 24, 2024

I updated the notebook with:
a. the number of records affected by each of the data issues reported in the summary;
b. further analysis leading to the retraction of one of the minor issues.
