
More concise (text-based) columnar data definitions #579

Closed
manzt opened this issue Nov 8, 2021 · 5 comments · Fixed by #613
Labels
documentation Improvements or additions to documentation enhancement New feature or request

Comments


manzt commented Nov 8, 2021

Apologies for the lack of context behind the choice, but I'm not sure I understand the motivation for including quantitativeFields, genomicFields, and chromosomeField in the CSV data definition. These fields aren't marked as required in the docs, but every example I've seen uses them.

Motivation

To my knowledge, these columns inform how the CSV is parsed, but this interpretation is also captured elsewhere in the track definition (type: genomic, quantitative, categorical, etc.), so really it's an abstraction leak. I see why chromosomeField is currently necessary, but I'm curious whether that information could also be captured in the "genomic" type rather than the data definition.

As a motivating example, what is the expected behavior if I use a field type that differs from the data definition? I assume the track type takes precedence, and if so, we don't need the data definition, since a type is required on all tracks.

import gosling as gos

data = gos.csv('./data.csv', genomicFields=['start', 'end'], chromosomeField='chr', quantitativeFields=['value'])

gos.Track(data).encode(x=gos.Channel('start:G'), y=gos.Channel('value:N')) # value doesn't match data definition

Proposal

Remove quantitativeFields, genomicFields, and maybe chromosomeField from CSV definition. This would make specifying CSV data more concise and avoid the case where data definition does not match track definition.

import gosling as gos
import pandas as pd

data = gos.csv('./data.csv')
data = pd.read_csv('./data.csv').gos.csv() # no arguments needed

Approach

Use built-in (d3?) auto-parsing of the CSV into memory. We can coerce any data types that are misinterpreted based on the track definition. Perhaps as an extension to #575, we can think about whether there is a way to include the chromosome field in the X encoding definition.

interface X {
  type: 'genomic';
  // tuples represent chromosome position OR chromosome region
  field: [chrom: string, start: string] | [chrom: string, start: string, end: string];
  // ...
}
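The auto-parsing idea above could look something like the following sketch, which mimics the behavior of d3's autoType: parse each value as a number when it parses cleanly, and keep it as a string otherwise. The function names and example fields here are illustrative, not Gosling internals.

```python
import csv
import io


def auto_type(value: str):
    """Coerce a raw CSV string to int/float when it parses cleanly."""
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        return value  # leave non-numeric values as strings


def parse_csv(text: str) -> list[dict]:
    """Parse CSV text into a list of row dicts with auto-typed values."""
    return [
        {key: auto_type(val) for key, val in row.items()}
        for row in csv.DictReader(io.StringIO(text))
    ]


rows = parse_csv("chrom,start,value\nchr1,0,100\nchr1,2,5\n")
# rows[0] == {"chrom": "chr1", "start": 0, "value": 100}
```

With something like this, quantitative columns would become numbers without any hint from the data definition.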
@manzt added the enhancement label Nov 8, 2021

sehilyi commented Nov 16, 2021

Thank you @manzt for opening the discussion! I agree that it would be great if we can infer field types from the spec so that users do not need to specify details about their CSV files. We had a short discussion on this but did not come up with a clear solution.

I think it is a good idea to infer the quantitative fields in a smarter way (e.g., by looking at the first few rows) so that users do not need to specify quantitativeFields. We currently parse values to numbers only if their field names are specified in quantitativeFields.

For the genomic fields, there are several things that we need to consider.

(1) Data fields can be used multiple times in the spec. For example, they can be used for multiple channels (e.g., x channels of multiple tracks that are overlaid)

// gene annotation example
"tracks": [
  {
    "mark": "text",
    "x": {"field": "start", "type": "genomic"}, ...
  },
  {
    "mark": "triangleLeft",
    "x": {"field": "start", "type": "genomic"}, ...
  },
  {
    "mark": "rect",
    "x": {"field": "start", "type": "genomic"}, ...
  }, ...
]

and in data transformations

{
  "type": "displace",
  "method": "pile",
  "boundingBox": {"startField": "start", "endField": "end"},
  "newField": "row",
  "maxRows": 15
},

(2) Genomic fields are parsed together with a chromosome field, and there can be multiple chromosome fields in a single CSV file (e.g., {chrField1, gField1, gField2}, {chrField2, gField3, gField4}).

Considering these, I think letting users specify field types in a single place (i.e., Data) still has several advantages. (A) Users do not need to specify chromosomeField multiple times in the rest of the spec, and (B) it prevents users from accidentally using different chromosome fields for the same genomic fields.

Also, please be aware that fields parsed as quantitative values can still be used as nominal values in the encoding, and genomic positions can be used as quantitative/nominal ones:

data: { ..., genomicFields: ['position'] },
encoding: { ..., 
   x: { field: 'position', type: 'genomic' }, // genomic field is mapped to the x-axis
   color: { field: 'position', type: 'quantitative', range: 'grey' } // fill color based on genomic location
}


manzt commented Nov 16, 2021

Thank you for the detailed write-up. I think I have been convinced about keeping chromosomeField (since this information needs to be defined somewhere, and I'm convinced data is a better option than encoding). However, I'd like to push back slightly on genomicFields and quantitativeFields.

Also, please be aware that fields parsed as quantitative values can still be used as nominal values in the encoding, and genomic positions can be used as quantitative/nominal ones:

data: { ..., genomicFields: ['position'] },
encoding: { ..., 
   x: { field: 'position', type: 'genomic' }, // genomic field is mapped to the x-axis
   color: { field: 'position', type: 'quantitative', range: 'grey' } // fill color based on genomic location
}

I think this is a perfect example for why genomicFields and quantitativeFields are not necessary. It is ultimately the encoding type that determines how the value should be treated (as a number or string in this case), not genomicFields.

The issue is that we need to load the text (the CSV) into memory so that it is easy to work with, and we don't want to re-parse the whole CSV for each encoding. I think the assumption is that since we are parsing the CSV upfront, we should have information for how to do that parsing. But this isn't necessary. The expensive bit we want to avoid repeating is the creation of a useful data structure for iterating over entries and accessing values by field name, not ensuring that the values have the correct data types at this point.

As your example highlights, we can easily cast a value to a different primitive depending on the encoding type that is used. For the sake of example, pretend we have the following CSV:

chrom,start,end,value
chr1,0,10,100
chr1,2,200,5

and we initially parse the CSV without asking for or inferring any data types, leaving each value as a string.

const data = [
  { chrom: 'chr1', start: '0', end: '10', value: '100' },
  { chrom: 'chr1', start: '2', end: '200', value: '5' },
];

At this stage we can access any value by index and field name, and cast to another primitive data type if required by the encoding (which we already do). Now, we could just parse everything as strings initially, but using something like d3.autoType when creating our in-memory data structure will likely mean the values already have appropriate types and won't need to be cast for the encoding.
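A minimal sketch of this cast-on-access idea, assuming a simple lookup driven by the encoding type (the field names and encoding-type strings below are illustrative, not Gosling's actual API):

```python
# Rows parsed with every value left as a string (as in the example above).
data = [
    {"chrom": "chr1", "start": "0", "end": "10", "value": "100"},
    {"chrom": "chr1", "start": "2", "end": "200", "value": "5"},
]


def get_value(row: dict, field: str, encoding_type: str):
    """Return a row's value, cast according to the encoding that uses it."""
    raw = row[field]
    if encoding_type in ("quantitative", "genomic"):
        return float(raw)  # numeric encodings receive numbers
    return raw  # nominal encodings keep the raw string


# The same column can serve different encodings without re-parsing the CSV:
assert get_value(data[0], "value", "quantitative") == 100.0
assert get_value(data[0], "value", "nominal") == "100"
```

The point is that the cast happens per encoding at access time, so no field-type hints are needed in the data definition itself.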


sehilyi commented Nov 16, 2021

I agree that quantitativeFields is not necessary, and neither is the conversion between numeric and string types at data-parsing time. I think we can remove the quantitativeFields property entirely from our schema.

The main reason we currently have genomicFields is that a pair of a chromosome field and a genomic field is used to calculate cumulative (absolute) genomic positions when parsing the data. For example, say we have the following file:

chrom,start,end,value
chr1,0,10,100
chr2,0,10,20
...

The genomic fields, like start and end, should be converted to contain absolute positions (instead of the current relative position within each chromosome) so that they can be mapped to the genomic axis in Gosling, which concatenates chromosomes end-to-end into a single linear axis:

// Data Spec
{ type: 'csv',
  url: 'http://...',
  chromosomeField: 'chrom',
  genomicFields: ['start', 'end']
}

// Converted Data
[ 
  { chrom: 'chr1', start: 0, end: 10, value: '100' },
  { chrom: 'chr2', start: 248956422, end: 248956432, value: '20' }, // <-- size of chr1 + relative position on chr2
];
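The relative-to-absolute conversion described above can be sketched with cumulative chromosome offsets. The chr1 size below matches the 248956422 offset in the example; the helper names are hypothetical, not Gosling internals.

```python
# Chromosome sizes (chr1's hg38 length matches the example above; chr2 is
# included only to show ordering). A real implementation would use the
# full assembly.
CHROM_SIZES = {"chr1": 248_956_422, "chr2": 242_193_529}


def cumulative_offsets(sizes: dict) -> dict:
    """Map each chromosome to the total length of all chromosomes before it."""
    offsets, total = {}, 0
    for chrom, size in sizes.items():
        offsets[chrom] = total
        total += size
    return offsets


OFFSETS = cumulative_offsets(CHROM_SIZES)


def to_absolute(row: dict, chrom_field: str, genomic_fields: list) -> dict:
    """Shift the genomic fields of one row by its chromosome's offset."""
    offset = OFFSETS[row[chrom_field]]
    return {
        key: (int(val) + offset if key in genomic_fields else val)
        for key, val in row.items()
    }


row = to_absolute(
    {"chrom": "chr2", "start": "0", "end": "10", "value": "20"},
    "chrom",
    ["start", "end"],
)
# row["start"] == 248_956_422 (size of chr1 + relative position on chr2)
```

This is why the parser needs to know which columns are genomic and which column names the chromosome: the offset lookup depends on both.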

And, like in BEDPE, multiple chromosome fields can exist in a single file, so we currently allow users to connect between chromosome fields and corresponding genomic fields:

data: {
    type: 'csv',
    url: '...',
    genomicFieldsToConvert: [
        { chromosomeField: 'chrom1', genomicFields: ['p1s', 'p1e'] },
        { chromosomeField: 'chrom2', genomicFields: ['p2s', 'p2e'] }
    ], ...
},
chrom1,p1s,p1e,chrom2,p2s,p2e
chr1,37923244,37923245,chr21,37941652,37941653
...
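The multi-pair case could be handled by looping over the genomicFieldsToConvert entries for each row. A sketch under that assumption (the offsets here are placeholder values, not real assembly coordinates, and the helper is hypothetical):

```python
# Placeholder cumulative offsets; a real implementation would derive these
# from the assembly's chromosome sizes.
OFFSETS = {"chr1": 0, "chr21": 2_000_000_000}


def convert_row(row: dict, conversions: list) -> dict:
    """Apply each chromosomeField/genomicFields pair to a single row."""
    out = dict(row)
    for conv in conversions:
        offset = OFFSETS[row[conv["chromosomeField"]]]
        for field in conv["genomicFields"]:
            out[field] = int(row[field]) + offset
    return out


row = convert_row(
    {"chrom1": "chr1", "p1s": "37923244", "p1e": "37923245",
     "chrom2": "chr21", "p2s": "37941652", "p2e": "37941653"},
    [
        {"chromosomeField": "chrom1", "genomicFields": ["p1s", "p1e"]},
        {"chromosomeField": "chrom2", "genomicFields": ["p2s", "p2e"]},
    ],
)
# Each genomic field is shifted by its own chromosome's offset:
# row["p1s"] == 37_923_244, row["p2s"] == 2_000_000_000 + 37_941_652
```

Each genomic field pair is shifted by the offset of its own chromosome field, which is why a flat genomicFields list would be ambiguous for BEDPE-style data.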

I am not sure if we can remove genomicFields considering this mandatory parsing process.


manzt commented Nov 16, 2021

The genomic fields, like start and end, should be converted to contain absolute positions (instead of the current relative position in each chr)

Ah, I did not know this. Thank you for clarifying. How do lazily accessed formats like BAM or BigWig work, then? I assume the stored coordinates are relative and must be converted to absolute positions when dynamically loaded.

Shouldn't the absolute position be based on the assembly as well?


sehilyi commented Nov 16, 2021

Shouldn't the absolute position be based on the assembly as well?

Yes, it is. When parsing data in Gosling, a view-level property, i.e., assembly, is used as the reference, though we sometimes also refer to the header of BAM files, which contains such information.
