-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added io_commons.read_csv to address issues with formatting of sample names in gCNV. #5811
Conversation
bc65b8d
to
6e4e827
Compare
Codecov Report
@@ Coverage Diff @@
## master #5811 +/- ##
===============================================
- Coverage 87.043% 80.309% -6.735%
+ Complexity 32153 30515 -1638
===============================================
Files 1975 1975
Lines 147415 147415
Branches 16225 16225
===============================================
- Hits 128315 118387 -9928
- Misses 13185 23315 +10130
+ Partials 5915 5713 -202
|
6e4e827
to
33990be
Compare
src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_consts.py
Outdated
Show resolved
Hide resolved
33990be
to
fd973d7
Compare
@mwalker174 @droazen I think we should try to get this in before the release next Tuesday. |
…forced sort in all config JSON output, and removed some dead code.
93f4022
to
b5456f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine to me. I have a minor suggestion and need some clarification - maybe I'm misunderstanding something.
input_pd = pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict) # dtypes_dict keys may not be present | ||
found_columns_set = {str(column) for column in input_pd.columns.values} | ||
assert dtypes_dict is not None or mandatory_columns_set is None, \ | ||
"Invalid combination of dtypes_dict and mandatory_columns_set." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specify in the message why they are not valid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
if not line.startswith(comment): | ||
fh.seek(pos) | ||
break | ||
input_pd = pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict) # dtypes_dict keys may not be present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I'm completely following this. It looks like this reads the csv using pandas starting from the beginning of the column header line. I see that you provide the expected datatypes for the columns, but how does this avoid the midline comment character issue? That is, what happens if sample ids containing the comment character are present in the column header line? Or is that never the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call pd.read_csv
without specifying the comment
parameter, in which case it defaults to None
, so there's no checking for comments performed when reading the column header and rows.
In any case, we currently don't encounter the situation you describe in any of our files (but it's not too hard to imagine that we might in a future file format).
Thanks @mwalker174, back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Fixes failures caused by 1) sample names containing
@
, which pandas interprets as a midline comment character, and 2) cohorts where all samples have numerical names, in which case leading zeros may be stripped when pandas infers the column dtype to be int.Added some commits to address a few more issues. Closes #5778. Closes #5809.