
Create new field: original_input_data that doesn't need to conform to a consistent schema (because the input schema can change) #3262

Open
theosanderson opened this issue Nov 20, 2024 · 4 comments

Comments

@theosanderson
Member

theosanderson commented Nov 20, 2024

Currently we distinguish between raw_data and processed_data, which is great: it gives us a fair amount of latitude to change the format in which data are output by bumping the preprocessing pipeline version, which changes the processed_data but not the raw_data. But sometimes we want to change the format in which we accept data, and doing that requires us to go in and edit the raw_data (in a carefully designed way, to maintain data integrity).

It could be good to also create a second copy of the raw data at submission (original_input_data) which is:

  • never edited
  • never directly accessed by the preprocessing pipeline, so the schema the preprocessing pipeline expects can change

Then, when edits were needed, we would edit the raw_data, but we would be able to manually check whether the raw_data semantically matches the original_input_data.
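To make the two-copy idea concrete, here is a minimal sketch (not the actual Loculus schema; the function names and field names are invented for illustration) of storing an immutable original copy alongside the editable raw copy at submission time:

```python
import copy
import json


def submit(entry: dict) -> dict:
    """Hypothetical: store two copies of the submitted data.

    original_input_data is written once and never edited; raw_data starts
    as an identical copy and may later be edited/migrated in place.
    """
    return {
        "original_input_data": copy.deepcopy(entry),  # never edited
        "raw_data": copy.deepcopy(entry),             # may be edited later
    }


def matches_original(raw: dict, original: dict) -> bool:
    """Naive check: compare normalised JSON. A real semantic check would
    need to understand the field renames applied by each migration."""
    return json.dumps(raw, sort_keys=True) == json.dumps(original, sort_keys=True)


record = submit({"geoLocCity": "Basel"})
# A later schema change edits raw_data but leaves the original untouched:
record["raw_data"]["city"] = record["raw_data"].pop("geoLocCity")
```

The semantic-match check is the hard part; a byte-for-byte comparison like the one above only works before any migration has been applied.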

@chaoran-chen
Member

With raw_data, do you mean what we currently call original_data in the database?

@theosanderson
Member Author

Yes, sorry, I do.

@chaoran-chen
Member

@corneliusroemer
Contributor

This would work well with a set of migration scripts that can be applied to the original data to deterministically reshape it into whatever we want the preprocessing pipeline to see. That would be less hacky: rather than doing one-off DB surgery, we apply scripts to the original data:

  • Script 1 for changing author format: Overwrite these author fields with those values
  • Hypothetical script 2 for renaming a field: rename all field keys geoLocCity -> city
  • ...

This would contain the complexity rather than letting it infect the processing pipeline. The pipeline could in theory accept all previous input formats (e.g. by dispatching on submission timestamps), but that would become unwieldy.
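The migration-script idea above could be sketched like this (a hypothetical illustration, not existing Loculus code; the script names, the `authors` splitting rule, and the `MIGRATIONS` registry are all invented):

```python
def fix_author_format(entry: dict) -> dict:
    """Script 1: normalise the author field (placeholder transformation:
    split a semicolon-separated string into a list)."""
    out = dict(entry)
    if isinstance(out.get("authors"), str):
        out["authors"] = [a.strip() for a in out["authors"].split(";")]
    return out


def rename_geo_loc_city(entry: dict) -> dict:
    """Script 2: rename the field key geoLocCity -> city."""
    out = dict(entry)
    if "geoLocCity" in out:
        out["city"] = out.pop("geoLocCity")
    return out


# Ordered registry of migrations; new schema changes append a new script.
MIGRATIONS = [fix_author_format, rename_geo_loc_city]


def migrate(original: dict) -> dict:
    """Deterministically reshape original data into the current input schema
    expected by the preprocessing pipeline. The original is not mutated."""
    for script in MIGRATIONS:
        original = script(original)
    return original
```

Because each script is a pure function and the list is ordered, the same original_input_data always produces the same pipeline input, and the pipeline itself only ever needs to understand the latest schema.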
