Added io_commons.read_csv to address issues with formatting of sample names in gCNV. #5811

Merged
merged 6 commits into from
Mar 22, 2019

Conversation

Contributor

@samuelklee samuelklee commented Mar 18, 2019

Fixes failures caused by 1) sample names containing @, which pandas interprets as a midline comment character, and 2) cohorts where all samples have numerical names, in which case leading zeros may be stripped when pandas infers the column dtype to be int.

Added some commits to address a few more issues. Closes #5778. Closes #5809.
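
The two failure modes described above can be reproduced in isolation. The following is a minimal sketch, not code from this PR: the column names and values are made up, and it assumes the earlier reader passed `comment="@"` to `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical tab-separated tables; the names and values are illustrative only.
with_at = "SAMPLE_NAME\tCOUNT\nA@1\t10\nB@2\t20\n"
numeric = "SAMPLE_NAME\tCOUNT\n007\t10\n042\t20\n"

# 1) With comment="@", pandas drops everything from the first '@' on each line,
#    so the sample names are truncated.
truncated = pd.read_csv(io.StringIO(with_at), delimiter="\t", comment="@")
truncated_names = truncated["SAMPLE_NAME"].tolist()

# 2) With no dtype given, all-numeric names are inferred as integers
#    and lose their leading zeros.
inferred = pd.read_csv(io.StringIO(numeric), delimiter="\t")
inferred_names = inferred["SAMPLE_NAME"].tolist()

# Reading without a comment character and with an explicit str dtype
# keeps both kinds of name intact.
kept_at = pd.read_csv(io.StringIO(with_at), delimiter="\t", dtype={"SAMPLE_NAME": str})
kept_zeros = pd.read_csv(io.StringIO(numeric), delimiter="\t", dtype={"SAMPLE_NAME": str})
```

Passing an explicit `dtype` for the name column is what prevents the integer inference; leaving `comment` unset is what prevents the truncation.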


codecov-io commented Mar 19, 2019

Codecov Report

Merging #5811 into master will decrease coverage by 6.735%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5811       +/-   ##
===============================================
- Coverage     87.043%   80.309%   -6.735%     
+ Complexity     32153     30515     -1638     
===============================================
  Files           1975      1975               
  Lines         147415    147415               
  Branches       16225     16225               
===============================================
- Hits          128315    118387     -9928     
- Misses         13185     23315    +10130     
+ Partials        5915      5713      -202
Impacted Files                                         Coverage Δ              Complexity Δ
...umber/gcnv/GermlineCNVIntervalVariantComposer.java  98.462% <100%> (ø)      11 <0> (ø)  ⬇️
...kers/filters/VariantFiltrationIntegrationTest.java  0.826% <0%> (-99.174%)  1% <0%> (-25%)
...dorientation/CollectF1R2CountsIntegrationTest.java  0.917% <0%> (-99.083%)  1% <0%> (-12%)
.../walkers/bqsr/BaseRecalibratorIntegrationTest.java  1.031% <0%> (-98.969%)  1% <0%> (-7%)
...ers/vqsr/FilterVariantTranchesIntegrationTest.java  1.053% <0%> (-98.947%)  1% <0%> (-5%)
...s/variantutils/VariantsToTableIntegrationTest.java  1.205% <0%> (-98.795%)  1% <0%> (-20%)
...on/FindBreakpointEvidenceSparkIntegrationTest.java  1.754% <0%> (-98.246%)  1% <0%> (-6%)
...bender/tools/spark/PileupSparkIntegrationTest.java  2.041% <0%> (-97.959%)  2% <0%> (-13%)
...tute/hellbender/tools/FlagStatIntegrationTest.java  2.083% <0%> (-97.917%)  1% <0%> (-5%)
...rs/variantutils/SelectVariantsIntegrationTest.java  0.25% <0%> (-97.75%)    1% <0%> (-70%)
... and 153 more

@samuelklee
Contributor Author

@mwalker174 @droazen I think we should try to get this in before the release next Tuesday.

@samuelklee
Contributor Author

Added some commits to address a few more issues. Not sure if tests still pass, or if I'll need to fix up more test resources---we'll see!

Closes #5778.
Closes #5809.

Contributor

@mwalker174 mwalker174 left a comment

Looks fine to me. I have a minor suggestion and need some clarification - maybe I'm misunderstanding something.

input_pd = pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict) # dtypes_dict keys may not be present
found_columns_set = {str(column) for column in input_pd.columns.values}
assert dtypes_dict is not None or mandatory_columns_set is None, \
    "Invalid combination of dtypes_dict and mandatory_columns_set."
Contributor

Specify in the message why they are not valid

Contributor Author

Done.
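
One way to follow this suggestion is to have the message name the combination that is rejected. The wording and the wrapper function below are hypothetical and are not the text that was merged; only the asserted condition comes from the snippet quoted above.

```python
def check_column_arguments(dtypes_dict=None, mandatory_columns_set=None):
    """Hypothetical wrapper around the assertion quoted in the review."""
    # Same condition as the quoted snippet: mandatory columns may only be
    # requested when dtypes_dict is also provided.
    assert dtypes_dict is not None or mandatory_columns_set is None, \
        "mandatory_columns_set was given but dtypes_dict is None; " \
        "provide dtypes_dict whenever mandatory columns are specified."
```

With this wording, a caller who passes only `mandatory_columns_set` sees which argument is missing instead of a generic "invalid combination" message.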

    if not line.startswith(comment):
        fh.seek(pos)
        break
input_pd = pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict) # dtypes_dict keys may not be present
Contributor

Not sure I'm completely following this. It looks like this reads the csv using pandas starting from the beginning of the column header line. I see that you provide the expected datatypes for the columns, but how does this avoid the midline comment character issue? That is, what happens if sample ids containing the comment character are present in the column header line? Or is that never the case?

Contributor Author

@samuelklee samuelklee Mar 22, 2019

We call pd.read_csv without specifying the comment parameter, in which case it defaults to None, so there's no checking for comments performed when reading the column header and rows.

In any case, we currently don't encounter the situation you describe in any of our files (but it's not too hard to imagine that we might in a future file format).
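
The approach described in this reply can be sketched as follows. This is a simplified illustration based on the snippet quoted in the review, not the merged `io_commons.read_csv`; the function name, the default arguments, and the demo file contents are assumptions.

```python
import os
import tempfile
import pandas as pd

def read_table_after_header_comments(path, dtypes=None, comment="@", delimiter="\t"):
    """Skip leading lines that start with `comment`, then parse the rest with pandas."""
    with open(path, "r") as fh:
        while True:
            pos = fh.tell()       # remember where this line starts
            line = fh.readline()
            if not line.startswith(comment):
                fh.seek(pos)      # rewind to the first non-comment line (the column header)
                break
        # No `comment` argument is passed, so '@' inside the header or rows is kept as data.
        return pd.read_csv(fh, delimiter=delimiter, dtype=dtypes)

# Hypothetical file: two '@' header lines, then a table with '@' in a sample name.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("@HD\tdemo\n@RG\tdemo\nSAMPLE_NAME\tCOUNT\n00@7\t10\n")
demo = read_table_after_header_comments(tmp.name, dtypes={"SAMPLE_NAME": str})
os.remove(tmp.name)
```

Only lines at the top of the file are treated as comments here; once the column header is reached, every character is handed to pandas unchanged.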

@samuelklee
Contributor Author

Thanks @mwalker174, back to you!

Contributor

@mwalker174 mwalker174 left a comment

Looks good!

@samuelklee samuelklee merged commit 022800c into master Mar 22, 2019
@samuelklee samuelklee deleted the sl_gcnv_io_fixes branch March 22, 2019 15:42