
More concise (text-based) columnar data definitions #579

Closed
manzt opened this issue Nov 8, 2021 · 5 comments · Fixed by #613
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments


manzt commented Nov 8, 2021

Apologies for the lack of context behind the choice, but I'm not sure I understand the motivation for including quantitativeFields, genomicFields, and chromosomeField in the CSV data definition. These fields aren't marked as required in the docs, but every example I've seen uses them.

Motivation

To my knowledge, these columns inform how the CSV is parsed, but this interpretation is also captured elsewhere in the track definition (type: genomic, quantitative, categorical, etc.), so really it's an abstraction leak. I see why chromosomeField is currently necessary, but I'm curious whether that information could also be captured in the "genomic" type rather than the data definition.

As a motivating example, what is the expected behavior if I use a field type that differs from the data definition? I assume the track type takes precedence, and if so, we don't need the data definition, since a type is required on all tracks.

import gosling as gos

data = gos.csv('./data.csv', genomicFields=['start', 'end'], chromosomeField='chr', quantitativeFields=['value'])

gos.Track(data).encode(x=gos.Channel('start:G'), y=gos.Channel('value:N')) # value doesn't match data definition

Proposal

Remove quantitativeFields, genomicFields, and maybe chromosomeField from CSV definition. This would make specifying CSV data more concise and avoid the case where data definition does not match track definition.

import gosling as gos
import pandas as pd

data = gos.csv('./data.csv')
data = pd.read_csv('./data.csv').gos.csv() # no arguments needed

Approach

Use built-in (d3?) auto-parsing of the CSV into memory. We can coerce any data types that are misinterpreted based on the track definition. Perhaps as an extension to #575, we can think about whether there is a way to include the chromosome field in the X encoding definition.

interface X {
  type: 'genomic';
  // tuples represent chromosome position OR chromosome region
  field: [chrom: string, start: string] | [chrom: string, start: string, end: string];
  // ...
}
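The auto-parsing idea above could look something like the following sketch, which mimics the behavior of d3's autoType: parse each value as a number when it parses cleanly, and keep it as a string otherwise. The function names and example fields here are illustrative, not Gosling internals.

```python
import csv
import io


def auto_type(value: str):
    """Coerce a raw CSV string to int/float when it parses cleanly."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value  # leave non-numeric values as strings


def parse_csv(text: str) -> list[dict]:
    """Parse CSV text into a list of row dicts with auto-typed values."""
    return [
        {key: auto_type(val) for key, val in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]


rows = parse_csv("chrom,start,value\nchr1,0,100\nchr1,2,5\n")
# rows[0] == {"chrom": "chr1", "start": 0, "value": 100}
```

With something like this, quantitative columns would become numbers without any hint from the data definition.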
@manzt added the enhancement label Nov 8, 2021

sehilyi commented Nov 16, 2021

Thank you @manzt for opening the discussion! I agree that it would be great if we can infer field types from the spec so that users do not need to specify details about their CSV files. We had a short discussion on this but did not come up with a clear solution.

I think it is a good idea to infer the quantitative fields in a smarter way (e.g., by looking at the first few rows) so that users do not need to specify quantitativeFields. We currently parse values to numbers only if their field names are specified in quantitativeFields.

For the genomic fields, there are several things that we need to consider.

(1) Data fields can be used multiple times in the spec. For example, they can be used for multiple channels (e.g., x channels of multiple tracks that are overlaid)

// gene annotation example
"tracks": [
  {
    "mark": "text",
    "x": {"field": "start", "type": "genomic"}, ...
  },
  {
    "mark": "triangleLeft",
    "x": {"field": "start", "type": "genomic"}, ...
  },
  {
    "mark": "rect",
    "x": {"field": "start", "type": "genomic"}, ...
  }, ...
]

and in data transformations

{
  "type": "displace",
  "method": "pile",
  "boundingBox": {"startField": "start", "endField": "end"},
  "newField": "row",
  "maxRows": 15
},

(2) Genomic fields are parsed together with a chromosome field, and there can be multiple chromosome fields in a single CSV file (e.g., {chrField1, gField1, gField2}, {chrField2, gField3, gField4}).

Considering these, I think letting users specify field types in a single place (i.e., Data) still has several advantages. (A) Users do not need to specify chromosomeField multiple times in the rest of the spec, and (B) it prevents users from accidentally using different chromosome fields for the same genomic fields.

Also, please be aware that fields parsed as quantitative values can still be used as nominal values in the encoding, and genomic positions can be used as quantitative/nominal ones:

data: { ..., genomicFields: ['position'] },
encoding: { ..., 
   x: { field: 'position', type: 'genomic' }, // genomic field is mapped to the x-axis
   color: { field: 'position', type: 'quantitative', range: 'grey' } // fill color based on genomic location
}


manzt commented Nov 16, 2021

Thank you for the detailed write-up. I think I have been convinced about keeping chromosomeField (since this information needs to be defined somewhere, and I'm convinced data is a better option than encoding). However, I'd like to push back slightly on genomicFields and quantitativeFields.

Also, please be aware that fields parsed as quantitative values can still be used as nominal values in the encoding, and genomic positions can be used as quantitative/nominal ones:

data: { ..., genomicFields: ['position'] },
encoding: { ..., 
   x: { field: 'position', type: 'genomic' }, // genomic field is mapped to the x-axis
   color: { field: 'position', type: 'quantitative', range: 'grey' } // fill color based on genomic location
}

I think this is a perfect example for why genomicFields and quantitativeFields are not necessary. It is ultimately the encoding type that determines how the value should be treated (as a number or string in this case), not genomicFields.

The issue is that we need to load the text (the CSV) into memory so that it is easy to work with, and we don't want to re-parse the whole CSV for each encoding. I think the assumption is that since we are parsing the CSV upfront, we should have information for how to do that parsing. But this isn't necessary. The expensive bit we want to avoid repeating is the creation of a useful data structure for iterating over entries and accessing values by field name, not ensuring that the values have the correct data types at this point.

As your example highlights, we can easily cast a value to a different primitive depending on the encoding type that is used. For the sake of example, pretend we have the following CSV:

chrom,start,end,value
chr1,0,10,100
chr1,2,200,5

and we initially parse the CSV without asking for or inferring any data types, leaving each value as a string.

const data = [
  { chrom: 'chr1', start: '0', end: '10', value: '100' },
  { chrom: 'chr1', start: '2', end: '200', value: '5' },
];

At this stage we can access any value by index and field name, and cast to another primitive data type if required by the encoding (which we already do). Now, we could just parse everything as strings initially, but using something like d3.autoType when creating our in-memory data structure will likely mean the values already have appropriate types and won't need to be cast for the encoding.
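A minimal sketch of this cast-on-access idea, assuming a simple lookup driven by the encoding type (the field names and encoding-type strings below are illustrative, not Gosling's actual API):

```python
# Rows parsed with every value left as a string (as in the example above).
data = [
    {"chrom": "chr1", "start": "0", "end": "10", "value": "100"},
    {"chrom": "chr1", "start": "2", "end": "200", "value": "5"},
]


def get_value(row: dict, field: str, encoding_type: str):
    """Return a row's value, cast according to the encoding that uses it."""
    raw = row[field]
    if encoding_type in ("quantitative", "genomic"):
        return float(raw)  # numeric encodings receive numbers
    return raw  # nominal encodings keep the raw string


# The same column can serve different encodings without re-parsing the CSV:
assert get_value(data[0], "value", "quantitative") == 100.0
assert get_value(data[0], "value", "nominal") == "100"
```

The point is that the cast happens per encoding at access time, so no field-type hints are needed in the data definition itself.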


sehilyi commented Nov 16, 2021

I agree that quantitativeFields is not necessary, and neither is the conversion between numeric and string types at data-parsing time. I think we can remove the quantitativeFields property entirely from our schema.

The main reason we currently have genomicFields is that a pair of a chromosome field and a genomic field is used to calculate cumulative (absolute) genomic positions when parsing the data. For example, say we have the following file:

chrom,start,end,value
chr1,0,10,100
chr2,0,10,20
...

The genomic fields, like start and end, should be converted to contain absolute positions (instead of the current relative position within each chromosome) so that they can be mapped to the genomic axis in Gosling, which concatenates chromosomes end-to-end into a single linear axis:

// Data Spec
{ type: 'csv',
  url: 'http://...',
  chromosomeField: 'chrom',
  genomicFields: ['start', 'end']
}

// Converted Data
[ 
  { chrom: 'chr1', start: 0, end: 10, value: '100' },
  { chrom: 'chr2', start: 248956422, end: 248956432, value: '20' }, // <-- size of chr1 + relative position on chr2
];
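The relative-to-absolute conversion described above can be sketched with cumulative chromosome offsets. The chr1 size below matches the 248956422 offset in the example; the helper names are hypothetical, not Gosling internals.

```python
# Chromosome sizes (chr1's hg38 length matches the example above; chr2 is
# included only to show ordering). A real implementation would use the
# full assembly.
CHROM_SIZES = {"chr1": 248_956_422, "chr2": 242_193_529}


def cumulative_offsets(sizes: dict) -> dict:
    """Map each chromosome to the total length of all chromosomes before it."""
    offsets, total = {}, 0
    for chrom, size in sizes.items():
        offsets[chrom] = total
        total += size
    return offsets


OFFSETS = cumulative_offsets(CHROM_SIZES)


def to_absolute(row: dict, chrom_field: str, genomic_fields: list) -> dict:
    """Shift the genomic fields of one row by its chromosome's offset."""
    offset = OFFSETS[row[chrom_field]]
    return {
        key: (int(val) + offset if key in genomic_fields else val)
        for key, val in row.items()
    }


row = to_absolute(
    {"chrom": "chr2", "start": "0", "end": "10", "value": "20"},
    "chrom",
    ["start", "end"],
)
# row["start"] == 248_956_422 (size of chr1 + relative position on chr2)
```

This is why the parser needs to know which columns are genomic and which column names the chromosome: the offset lookup depends on both.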

And, like in BEDPE, multiple chromosome fields can exist in a single file, so we currently allow users to connect between chromosome fields and corresponding genomic fields:

data: {
    type: 'csv',
    url: '...',
    genomicFieldsToConvert: [
        { chromosomeField: 'chrom1', genomicFields: ['p1s', 'p1e'] },
        { chromosomeField: 'chrom2', genomicFields: ['p2s', 'p2e'] }
    ], ...
},
chrom1,p1s,p1e,chrom2,p2s,p2e
chr1,37923244,37923245,chr21,37941652,37941653
...
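The multi-pair case could be handled by looping over the genomicFieldsToConvert entries for each row. A sketch under that assumption (the offsets here are placeholder values, not real assembly coordinates, and the helper is hypothetical):

```python
# Placeholder cumulative offsets; a real implementation would derive these
# from the assembly's chromosome sizes.
OFFSETS = {"chr1": 0, "chr21": 2_000_000_000}


def convert_row(row: dict, conversions: list) -> dict:
    """Apply each chromosomeField/genomicFields pair to a single row."""
    out = dict(row)
    for conv in conversions:
        offset = OFFSETS[row[conv["chromosomeField"]]]
        for field in conv["genomicFields"]:
            out[field] = int(row[field]) + offset
    return out


row = convert_row(
    {"chrom1": "chr1", "p1s": "37923244", "p1e": "37923245",
     "chrom2": "chr21", "p2s": "37941652", "p2e": "37941653"},
    [
        {"chromosomeField": "chrom1", "genomicFields": ["p1s", "p1e"]},
        {"chromosomeField": "chrom2", "genomicFields": ["p2s", "p2e"]},
    ],
)
# Each genomic field is shifted by its own chromosome's offset:
# row["p1s"] == 37_923_244, row["p2s"] == 2_000_000_000 + 37_941_652
```

Each genomic field pair is shifted by the offset of its own chromosome field, which is why a flat genomicFields list would be ambiguous for BEDPE-style data.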

I am not sure if we can remove genomicFields considering this mandatory parsing process.


manzt commented Nov 16, 2021

The genomic fields, like start and end, should be converted to contain absolute positions (instead of the current relative position in each chr)

Ah, I did not know this. Thank you for clarifying. How do lazily accessed formats like BAM or BigWig work, then? I assume the stored coordinates are relative and must be converted to absolute positions when dynamically loaded.

Shouldn't the absolute position be based on the assembly as well?


sehilyi commented Nov 16, 2021

Shouldn't the absolute position be based on the assembly as well?

Yes, it is. When parsing data in Gosling, a view-level property, i.e., assembly, is used as the reference, though we sometimes also refer to the header of BAM files, which contains such information.
