Data issues in Data_S3.csv (BA predictor training data) #238
Comments
Thanks for reporting this! I haven't gone through it in detail, but one thought: the "measurement_source" column unfortunately does not indicate an exact study or sample ID. It's something I did quickly to track where measurements are coming from, and it's pretty inexact - I believe it's just the last author plus the measurement type (what IEDB calls "Method"). So one possible contributor here is that we may have separate measurements of the same peptide/HLA from different studies by the same last author, all sharing an identical measurement_source. In general my approach has been to show the predictor conflicting values for the same peptide/HLA whenever those occur in the training data, i.e. I have not tried to collapse duplicates or find consensus values. Curious if this makes sense to you and, if so, how much of your observations it might explain.
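To get a rough sense of how much same-source duplicates could explain, a quick check along these lines might help (a minimal sketch, not part of the curation code; the column names `allele`, `peptide`, `measurement_source`, and `measurement_value` are assumed to match Data_S3.csv and may need adjusting):

```python
import pandas as pd

# Rough check: how many (allele, peptide) pairs carry more than one distinct
# value under the same measurement_source? Column names are assumed.
df = pd.read_csv("Data_S3.csv")

per_source = df.groupby(
    ["allele", "peptide", "measurement_source"])["measurement_value"].nunique()
conflicting = (per_source > 1).sum()
print(f"{conflicting} (allele, peptide, measurement_source) groups have conflicting values")
```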
Also just to note, all of the curation code is on GitHub (see the dirs starting with …). If you are able to identify places where we are introducing any of the issues you are seeing, please let me know 🙏 The MS measurements with the wrong inequality seem pretty important to fix if that is common.
I think your approach of allowing conflicting measurements in the training data makes sense. Thanks for the pointer to the curation code; I will try to review it to see whether it explains some of my findings. I wonder if some of the issues are actually upstream of your curation. I find the definition of measurement_inequality a bit confusing: does < mean that measurement_value is an upper bound (real_value < measurement_value) or a lower bound (measurement_value < real_value)? I had to go over your loss code to verify it's the former, and I'm still not sure. So I wonder whether some of the upstream data authors interpreted it the other way. There is evidence for this in duplicate sets of affinity measurements reported with both < and >, and my concern is that the same may have happened in non-duplicate entries, where we cannot detect it.
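For what it's worth, the interpretation I settled on (< meaning measurement_value is an upper bound on the true value) corresponds to a censored loss along these lines. This is only an illustration of the convention, not MHCflurry's actual loss code:

```python
import numpy as np

def censored_squared_error(pred, value, inequality):
    """Squared error under the "<" = upper bound convention discussed above.

    inequality is one of "=", "<" (true value is at most `value`), or
    ">" (true value is at least `value`). Illustration only.
    """
    if inequality == "<":
        # True value <= reported value: only predictions above it are penalized.
        err = np.maximum(pred - value, 0.0)
    elif inequality == ">":
        # True value >= reported value: only predictions below it are penalized.
        err = np.maximum(value - pred, 0.0)
    else:
        err = pred - value
    return err ** 2
```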
I updated the notebook with: |
Hello!
I analysed Data_S3.csv, described in your publication as the training data for your BA predictor, and found some data errors.
Here's my notebook: https://gist.github.com/elonp/bcfbc4b417552d01b2b3d11896a19129
I also attach it as a PDF:
inspect-mhcflurry-ba-training-data.pdf
Hopefully the analysis is useful for others.
It would be wonderful if you could corroborate my analysis, guesswork and conclusions!
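As one example, the duplicate-inequality issue discussed in this thread (the same peptide/HLA affinity reported with both < and >) can be spot-checked roughly like this (a sketch only, with column names assumed rather than taken from the notebook):

```python
import pandas as pd

# Rough spot-check: find (allele, peptide) pairs whose measurements carry
# both "<" and ">" inequalities. Column names are assumed.
df = pd.read_csv("Data_S3.csv")

ineqs = df.groupby(["allele", "peptide"])["measurement_inequality"].agg(set)
contradictory = ineqs[ineqs.apply(lambda s: "<" in s and ">" in s)]
print(f"{len(contradictory)} (allele, peptide) pairs have both < and > measurements")
```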