# More concise (text-based) columnar data definitions #579
---

Thank you @manzt for opening the discussion! I agree that it would be great if we could infer field types from the spec so that users do not need to specify details about their CSV files. We had a short discussion on this but did not come up with a clear solution. I think it is a good idea to infer the quantitative fields in a smarter way (e.g., by looking at the first few rows) so that users do not need to specify `quantitativeFields`.

For the genomic fields, there are several things that we need to consider.

(1) Data fields can be used multiple times in the spec. For example, they can be used for multiple channels:

```javascript
// gene annotation example
"tracks": [
  {
    "mark": "text",
    "x": {"field": "start", "type": "genomic"}, ...
  },
  {
    "mark": "triangleLeft",
    "x": {"field": "start", "type": "genomic"}, ...
  },
  {
    "mark": "rect",
    "x": {"field": "start", "type": "genomic"}, ...
  }, ...
]
```

and in data transformations:

```javascript
{
  "type": "displace",
  "method": "pile",
  "boundingBox": {"startField": "start", "endField": "end"},
  "newField": "row",
  "maxRows": 15
},
```

(2) Genomic fields are parsed together with a chromosome field, and there can be multiple chromosome fields in a single CSV file (e.g., BEDPE files).

Considering these, I think letting users specify field types in a single place (i.e., the data definition) makes sense.

Also, please be aware that fields that are parsed as quantitative values can still be used as nominal values in the encoding, or genomic positions can be used as quantitative/nominal ones:

```javascript
data: { ..., genomicFields: ['position'] },
encoding: { ...,
  x: { field: 'position', type: 'genomic' },  // genomic field is mapped to the x-axis
  color: { field: 'position', type: 'quantitative', range: 'grey' }  // fill color based on genomic location
}
```
---

Thank you for the detailed write-up. I think I have been convinced about keeping `chromosomeField`:

```javascript
data: { ..., genomicFields: ['position'] },
encoding: { ...,
  x: { field: 'position', type: 'genomic' },  // genomic field is mapped to the x-axis
  color: { field: 'position', type: 'quantitative', range: 'grey' }  // fill color based on genomic location
}
```

I think this is a perfect example for why the field types may not be needed in the data definition. The issue is that we need to load text (the CSV) into memory in a form that is easy to work with, and we don't want to re-parse the whole CSV for each encoding. I think the assumption is that since we are parsing the CSV upfront, we should have the information for how to do that parsing. But this isn't necessary. The expensive bit we want to avoid repeating is the creation of a useful data structure for iterating over entries and accessing values by field name, not ensuring that the values have the correct data type at that point. As your example highlights, we can easily cast a value to a different primitive depending on the encoding type that is used.

For the sake of example, pretend we have a small CSV with `chrom`, `start`, `end`, and `value` columns, and we initially parse it without asking for or inferring any data types, leaving each value as a string:

```javascript
const data = [
  { chrom: 'chr1', start: '0', end: '10', value: '100' },
  { chrom: 'chr1', start: '2', end: '200', value: '5' },
];
```

At this stage we can access any value by index and field name, and cast it to another primitive data type if required by the encoding (which we already do). Now, we could just parse everything as strings initially, but using something like d3's `autoType` when creating our in-memory data structure will likely mean the values already have appropriate types and won't need to be cast for the encoding.
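The casting idea described here can be sketched in a few lines (a minimal illustration only; `castForEncoding` is a hypothetical helper, not Gosling's actual API):

```javascript
// Rows parsed straight from the CSV: every value is still a string.
const rows = [
  { chrom: 'chr1', start: '0', end: '10', value: '100' },
  { chrom: 'chr1', start: '2', end: '200', value: '5' },
];

// Hypothetical helper: cast a raw string value based on the encoding
// type a channel asks for, instead of fixing types at parse time.
function castForEncoding(raw, encodingType) {
  switch (encodingType) {
    case 'quantitative':
    case 'genomic':
      return Number(raw);  // numeric channels get a number
    default:
      return String(raw);  // nominal channels keep the string form
  }
}

// The same field can feed channels with different types.
const x = castForEncoding(rows[0].end, 'genomic');      // 10 (number)
const label = castForEncoding(rows[0].end, 'nominal');  // '10' (string)
```

The cast is cheap relative to re-parsing the file, which is why deferring it to encoding time costs little.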
---

I agree that this would work for quantitative fields. The main reason why we currently have `genomicFields` is different: the genomic fields, like `start` and `end` below, are not simply cast but converted into absolute positions along the entire genome using the chromosome field:

```javascript
// Data Spec
{ type: 'csv',
  url: 'http://...',
  chromosomeField: 'chrom',
  genomicFields: ['start', 'end']
}

// Converted Data
[
  { chrom: 'chr1', start: 0, end: 10, value: '100' },
  { chrom: 'chr2', start: 248956422, end: 248956432, value: '5' }, // <-- size of chr1 + relative position on chr2
];
```

And, like in BEDPE, multiple chromosome fields can exist in a single file, so we currently allow users to connect chromosome fields with their corresponding genomic fields:

```javascript
data: {
  type: 'csv',
  url: '...',
  genomicFieldsToConvert: [
    { chromosomeField: 'chrom1', genomicFields: ['p1s', 'p1e'] },
    { chromosomeField: 'chrom2', genomicFields: ['p2s', 'p2e'] }
  ], ...
},
```

I am not sure if we can remove these properties.
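The conversion described above amounts to a cumulative-offset lookup. A minimal sketch (the sizes are the hg38 lengths of chr1 and chr2; `toAbsolute` is a hypothetical helper, not Gosling's internal function):

```javascript
// hg38 chromosome sizes; chr1 is 248,956,422 bp, matching the example above.
const CHROM_SIZES = { chr1: 248956422, chr2: 242193529 };

// Hypothetical helper: the absolute position is the sum of the sizes of
// all preceding chromosomes plus the relative position on this one.
function toAbsolute(chrom, relativePos) {
  let offset = 0;
  for (const [name, size] of Object.entries(CHROM_SIZES)) {
    if (name === chrom) return offset + relativePos;
    offset += size;
  }
  throw new Error(`unknown chromosome: ${chrom}`);
}

toAbsolute('chr1', 10); // 10
toAbsolute('chr2', 10); // 248956432 (size of chr1 + relative position on chr2)
```

This is why the parser needs to know at load time which fields are genomic: the conversion depends on a second column (the chromosome field), not just the field's own value.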
---

Ah, I did not know this. Thank you for clarifying. How do lazily accessed formats like BAM or BigWig work then? I assume the stored coordinates are relative and must be converted to absolute positions when dynamically loaded. Shouldn't the absolute position be based on the assembly as well?
---

Yes, it is. When parsing the data in Gosling, a view-level property, i.e., `assembly`, is used to determine the absolute positions.
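As an illustration of how the assembly enters that conversion (a sketch with a hypothetical lookup structure; the chr1/chr2 lengths are the actual hg38 and hg19 values, other chromosomes omitted for brevity), the view-level assembly selects which size table drives the relative-to-absolute mapping for dynamically loaded records:

```javascript
// Per-assembly chromosome sizes (only chr1 and chr2 shown).
const ASSEMBLY_SIZES = {
  hg38: { chr1: 248956422, chr2: 242193529 },
  hg19: { chr1: 249250621, chr2: 243199373 },
};

// Hypothetical conversion for a lazily loaded record (e.g., a BAM read):
// the same relative coordinate maps to a different absolute position
// depending on the view-level assembly.
function recordToAbsolute(assembly, chrom, relativePos) {
  let offset = 0;
  for (const [name, size] of Object.entries(ASSEMBLY_SIZES[assembly])) {
    if (name === chrom) return offset + relativePos;
    offset += size;
  }
  throw new Error(`unknown chromosome: ${chrom}`);
}

recordToAbsolute('hg38', 'chr2', 10); // 248956432
recordToAbsolute('hg19', 'chr2', 10); // 249250631
```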
---

Apologies for the lack of context behind the choice, but I don't know if I understand the motivation for including `quantitativeFields`, `genomicFields`, and `chromosomeField` in the CSV data definition. These fields aren't marked as required in the docs, but every example I've seen includes their use.

### Motivation

To my knowledge, these columns inform how the CSV is parsed, but this interpretation is also captured elsewhere in the track definition (type: genomic, quantitative, categorical, etc.), so really it's an abstraction leak. I see how the `chromosomeField` is currently necessary, but I'm curious if that information could also be captured in the "genomic" type rather than the data definition.

As a motivating example, what is the expected behavior if I use a field type that differs from the data definition? I assume the track type takes precedence, and if so we don't need the data definition since a type is required on all tracks.

### Proposal

Remove `quantitativeFields`, `genomicFields`, and maybe `chromosomeField` from the CSV definition. This would make specifying CSV data more concise and avoid the case where the data definition does not match the track definition.

### Approach

Use built-in (d3?) auto-parsing of the CSV into memory. We can coerce any data types that are misinterpreted based on the track definition. Perhaps as an extension to #575, we can think about whether there is a way to include the chromosome field in the `X` encoding definition.