
Create new field: original_input_data that doesn't need to conform to a consistent schema (because the input schema can change) #3262

Open
theosanderson opened this issue Nov 20, 2024 · 4 comments

Comments

@theosanderson
Member

theosanderson commented Nov 20, 2024

Currently we distinguish between raw_data and processed_data, which is great: it gives us a fair amount of latitude to change the format in which data are output by bumping the preprocessing pipeline version, which changes the processed_data but not the raw_data. But sometimes we want to change the format in which we accept data, and doing that requires us to go in and edit the raw_data (in a carefully designed way, to maintain data integrity).

It could be good to also create a second copy of the raw data at submission (original_input_data) which is:

  • never edited
  • never directly accessed by the preprocessing pipeline, so the schema the preprocessing pipeline expects can change

Then, when edits were needed, we would edit the raw_data, but we would be able to manually check whether the raw_data semantically matches the original_input_data.
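To make the two-copy idea concrete, here is a minimal sketch (not the actual Loculus schema; the function names and field names are invented for illustration) of storing an immutable original copy alongside the editable raw copy at submission time:

```python
import copy
import json


def submit(entry: dict) -> dict:
    """Hypothetical: store two copies of the submitted data.

    original_input_data is written once and never edited; raw_data starts
    as an identical copy and may later be edited/migrated in place.
    """
    return {
        "original_input_data": copy.deepcopy(entry),  # never edited
        "raw_data": copy.deepcopy(entry),             # may be edited later
    }


def matches_original(raw: dict, original: dict) -> bool:
    """Naive check: compare normalised JSON. A real semantic check would
    need to understand the field renames applied by each migration."""
    return json.dumps(raw, sort_keys=True) == json.dumps(original, sort_keys=True)


record = submit({"geoLocCity": "Basel"})
# A later schema change edits raw_data but leaves the original untouched:
record["raw_data"]["city"] = record["raw_data"].pop("geoLocCity")
```

The semantic-match check is the hard part; a byte-for-byte comparison like the one above only works before any migration has been applied.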

@chaoran-chen
Member

With raw_data, do you mean what we currently call original_data in the database?

@theosanderson
Member Author

Yes, sorry, I do.

@chaoran-chen
Member

@corneliusroemer
Contributor

This would work well with a set of migration scripts that can be applied to the original data to deterministically reshape it into whatever we want the preprocessing pipeline to see. That would be less hacky: rather than doing one-off DB surgery, we apply scripts to the original data:

  • Script 1 for changing author format: Overwrite these author fields with those values
  • Hypothetical script 2 for renaming a field: rename all field keys geoLocCity -> city
  • ...

This would contain the complexity rather than letting it infect the processing pipeline. The pipeline could in theory accept all previous input formats (e.g. by dispatching on submission timestamps), but that would become unwieldy.
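The migration-script idea above could be sketched like this (a hypothetical illustration, not existing Loculus code; the script names, the `authors` splitting rule, and the `MIGRATIONS` registry are all invented):

```python
def fix_author_format(entry: dict) -> dict:
    """Script 1: normalise the author field (placeholder transformation:
    split a semicolon-separated string into a list)."""
    out = dict(entry)
    if isinstance(out.get("authors"), str):
        out["authors"] = [a.strip() for a in out["authors"].split(";")]
    return out


def rename_geo_loc_city(entry: dict) -> dict:
    """Script 2: rename the field key geoLocCity -> city."""
    out = dict(entry)
    if "geoLocCity" in out:
        out["city"] = out.pop("geoLocCity")
    return out


# Ordered registry of migrations; new schema changes append a new script.
MIGRATIONS = [fix_author_format, rename_geo_loc_city]


def migrate(original: dict) -> dict:
    """Deterministically reshape original data into the current input schema
    expected by the preprocessing pipeline. The original is not mutated."""
    for script in MIGRATIONS:
        original = script(original)
    return original
```

Because each script is a pure function and the list is ordered, the same original_input_data always produces the same pipeline input, and the pipeline itself only ever needs to understand the latest schema.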
