Added io_commons.read_csv to address issues with formatting of sample names in gCNV. #5811

samuelklee · 2019-03-18T20:33:50Z

Fixes failures caused by 1) sample names containing @, which pandas interprets as a midline comment character, and 2) cohorts where all samples have numerical names, in which case leading zeros may be stripped when pandas infers the column dtype to be int.

Added some commits to address a few more issues. Closes #5778. Closes #5809.
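
The two failure modes described above can be reproduced on toy data. This is only a sketch: the column names are made up, and passing `comment="@"` is an assumption about how the earlier reading code behaved, since the old call is not shown in this thread.

```python
import io

import pandas as pd

# Toy table; the column names are illustrative, not taken from a real gCNV file.
numeric_names = "SAMPLE_NAME\tPLOIDY\n007\t2\n042\t2\n"

# With dtype inference, the all-numeric name column is parsed as integers,
# so the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(numeric_names), delimiter="\t")

# Pinning the column to str keeps the names exactly as written.
explicit = pd.read_csv(io.StringIO(numeric_names), delimiter="\t",
                       dtype={"SAMPLE_NAME": str})

print(inferred["SAMPLE_NAME"].tolist())  # [7, 42]
print(explicit["SAMPLE_NAME"].tolist())  # ['007', '042']

at_names = "SAMPLE_NAME\tPLOIDY\nsample@1\t2\n"

# Assumption: with comment="@", pandas discards the rest of a line after "@",
# so the sample name is cut short.
truncated = pd.read_csv(io.StringIO(at_names), delimiter="\t", comment="@",
                        dtype={"SAMPLE_NAME": str})

# Without the comment argument, "@" is an ordinary character.
intact = pd.read_csv(io.StringIO(at_names), delimiter="\t",
                     dtype={"SAMPLE_NAME": str})

print(truncated["SAMPLE_NAME"].tolist())
print(intact["SAMPLE_NAME"].tolist())
```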

codecov-io · 2019-03-19T02:55:46Z

Codecov Report

Merging #5811 into master will decrease coverage by 6.735%.
The diff coverage is 100%.

@@               Coverage Diff               @@
##              master     #5811       +/-   ##
===============================================
- Coverage     87.043%   80.309%   -6.735%     
+ Complexity     32153     30515     -1638     
===============================================
  Files           1975      1975               
  Lines         147415    147415               
  Branches       16225     16225               
===============================================
- Hits          128315    118387     -9928     
- Misses         13185     23315    +10130     
+ Partials        5915      5713      -202

Impacted Files	Coverage Δ	Complexity Δ
...umber/gcnv/GermlineCNVIntervalVariantComposer.java	`98.462% <100%> (ø)`	`11 <0> (ø)`	⬇️
...kers/filters/VariantFiltrationIntegrationTest.java	`0.826% <0%> (-99.174%)`	`1% <0%> (-25%)`
...dorientation/CollectF1R2CountsIntegrationTest.java	`0.917% <0%> (-99.083%)`	`1% <0%> (-12%)`
.../walkers/bqsr/BaseRecalibratorIntegrationTest.java	`1.031% <0%> (-98.969%)`	`1% <0%> (-7%)`
...ers/vqsr/FilterVariantTranchesIntegrationTest.java	`1.053% <0%> (-98.947%)`	`1% <0%> (-5%)`
...s/variantutils/VariantsToTableIntegrationTest.java	`1.205% <0%> (-98.795%)`	`1% <0%> (-20%)`
...on/FindBreakpointEvidenceSparkIntegrationTest.java	`1.754% <0%> (-98.246%)`	`1% <0%> (-6%)`
...bender/tools/spark/PileupSparkIntegrationTest.java	`2.041% <0%> (-97.959%)`	`2% <0%> (-13%)`
...tute/hellbender/tools/FlagStatIntegrationTest.java	`2.083% <0%> (-97.917%)`	`1% <0%> (-5%)`
...rs/variantutils/SelectVariantsIntegrationTest.java	`0.25% <0%> (-97.75%)`	`1% <0%> (-70%)`
... and 153 more

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_consts.py

samuelklee · 2019-03-21T14:40:36Z

@mwalker174 @droazen I think we should try to get this in before the release next Tuesday.

samuelklee · 2019-03-21T15:06:48Z

Added some commits to address a few more issues. Not sure if tests still pass, or if I'll need to fix up more test resources---we'll see!

Closes #5778.
Closes #5809.

mwalker174

Looks fine to me. I have a minor suggestion and need some clarification - maybe I'm misunderstanding something.

mwalker174 · 2019-03-21T22:10:38Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_commons.py

+        input_pd = pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict)  # dtypes_dict keys may not be present
+    found_columns_set = {str(column) for column in input_pd.columns.values}
+    assert dtypes_dict is not None or mandatory_columns_set is None, \
+        "Invalid combination of dtypes_dict and mandatory_columns_set."


Specify in the message why the combination is not valid.
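
One way to act on this suggestion is to have the message spell out the condition that failed. The sketch below is not the wording that was merged; `check_column_arguments` is a hypothetical helper name, and the message simply restates the negation of the asserted condition.

```python
def check_column_arguments(dtypes_dict, mandatory_columns_set):
    """Hypothetical helper: reject mandatory columns given without dtypes."""
    # The only rejected combination is: dtypes_dict is None while
    # mandatory_columns_set is not None. The message says exactly that.
    assert dtypes_dict is not None or mandatory_columns_set is None, (
        "mandatory_columns_set was provided but dtypes_dict is None; "
        "pass dtypes_dict as well, or omit mandatory_columns_set."
    )


# Accepted combinations run silently.
check_column_arguments(None, None)
check_column_arguments({"SAMPLE_NAME": str}, {"SAMPLE_NAME"})
```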

mwalker174 · 2019-03-21T22:52:29Z

src/main/python/org/broadinstitute/hellbender/gcnvkernel/io/io_commons.py

+            if not line.startswith(comment):
+                fh.seek(pos)
+                break
+        input_pd = pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict)  # dtypes_dict keys may not be present


Not sure I'm completely following this. It looks like this reads the csv using pandas starting from the beginning of the column header line. I see that you provide the expected datatypes for the columns, but how does this avoid the midline comment character issue? That is, what happens if sample ids containing the comment character are present in the column header line? Or is that never the case?

samuelklee · 2019-03-22

We call pd.read_csv without specifying the comment parameter, which then defaults to None, so no comment checking is performed when reading the column header and rows.

In any case, we currently don't encounter the situation you describe in any of our files (but it's not too hard to imagine that we might in a future file format).
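
The approach discussed in this exchange can be sketched as follows: skip leading comment lines by hand, rewind to the start of the column header, then let pandas read the rest with no `comment` argument. This is a reconstruction from the diff fragments quoted above, not the merged function; the name `read_table_after_comments` and the default `comment="@"` are assumptions.

```python
import io

import pandas as pd


def read_table_after_comments(fh, comment="@", delimiter="\t", dtypes_dict=None):
    """Hypothetical reader: skip leading comment lines, then parse with pandas."""
    while True:
        pos = fh.tell()        # remember where this line starts
        line = fh.readline()
        if not line:           # reached end of file: nothing but comments
            break
        if not line.startswith(comment):
            fh.seek(pos)       # rewind to the start of the column header line
            break
    # No comment argument here, so "@" inside the header or the rows is kept.
    return pd.read_csv(fh, delimiter=delimiter, dtype=dtypes_dict)


# Toy input with "@" in a column name; the contents are illustrative only.
example = "@leading comment\n@another comment\nCONTIG\t0@7\nchr1\t2\n"
table = read_table_after_comments(io.StringIO(example),
                                  dtypes_dict={"CONTIG": str})
print(list(table.columns))
```

Only lines at the top of the file are treated as comments in this sketch, which is why a sample ID containing the comment character in the header line passes through unchanged.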

samuelklee · 2019-03-22T14:07:47Z

Thanks @mwalker174, back to you!

mwalker174

Looks good!

samuelklee force-pushed the sl_gcnv_io_fixes branch from bc65b8d to 6e4e827 Compare March 19, 2019 02:18

samuelklee force-pushed the sl_gcnv_io_fixes branch from 6e4e827 to 33990be Compare March 19, 2019 03:04

samuelklee requested a review from mwalker174 March 19, 2019 14:45

samuelklee assigned mwalker174 Mar 19, 2019

samuelklee force-pushed the sl_gcnv_io_fixes branch from 33990be to fd973d7 Compare March 21, 2019 14:40

samuelklee added 5 commits March 21, 2019 11:05

Added io_commons.read_csv to address issues with formatting of sample names in gCNV. (b685a40)

Cleaned up PEP8 violations. (f7b64a1)

Added some minor edits to the dtype dictionaries. (5e4ab0d)

Removed unordered contig set from ploidy-model config JSON output, enforced sort in all config JSON output, and removed some dead code. (eaeadd5)

Fixed Number entry of CNLP field in gCNV intervals VCF. (b5456f8)

samuelklee force-pushed the sl_gcnv_io_fixes branch from 93f4022 to b5456f8 Compare March 21, 2019 15:05

mwalker174 reviewed Mar 21, 2019

Addressed PR comments. (0259617)

mwalker174 approved these changes Mar 22, 2019

samuelklee merged commit 022800c into master Mar 22, 2019

samuelklee deleted the sl_gcnv_io_fixes branch March 22, 2019 15:42
