Currently we distinguish between `raw_data` and `processed_data`, which is great: it gives us a fair amount of latitude to change the format in which data are output by bumping the preprocessing pipeline version, which changes the `processed_data` but not the `raw_data`. But sometimes we want to change the format in which we *accept* data, and doing that requires us to go in and edit the `raw_data` (in a carefully designed way to maintain data integrity).
It could be good to also create a second copy of the raw data at submission (`original_input_data`) which is:

- never edited
- never directly accessed by the preprocessing pipeline, so the schema the preprocessing pipeline expects can change
Then, when edits were needed, we would edit the `raw_data` but be able to manually check that the `raw_data` still semantically matches the `original_input_data`.
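
A minimal sketch of what submission could then store, assuming one JSON column per field (the `db.insert` API and the table/column names other than `raw_data` and `original_input_data` are hypothetical):

```python
import json

def submit_sequence_entry(db, accession: str, input_data: dict) -> None:
    """Store two copies of the submitted data: one immutable, one editable."""
    db.insert(
        "sequence_entries",  # hypothetical table name
        {
            "accession": accession,
            # Immutable snapshot: never edited and never read by the
            # preprocessing pipeline, so its schema is free to drift.
            "original_input_data": json.dumps(input_data),
            # Working copy: edited (carefully) when the accepted input
            # schema changes; this is what preprocessing reads.
            "raw_data": json.dumps(input_data),
        },
    )
```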
**theosanderson** changed the title to *Create new field: original_input_data that doesn't need to conform to a consistent schema (because the input schema can change)* on Nov 20, 2024
This would work well with a set of migration scripts that can be applied to the original data to deterministically reshape it into whatever we want the preprocessing pipeline to see. That would be a less hacky way: rather than doing one-off DB surgery, apply scripts to the original data (see the sketch after this list):
- Script 1, for changing the author format: overwrite these author fields with those values
- Hypothetical script 2, for renaming a field: rename all field keys `geoLocCity` -> `city`
- ...
We could contain the complexity this way, as opposed to letting it infect the preprocessing pipeline, which could theoretically accept all previous versions based on submission timestamps etc. but would become unwieldy.
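
A minimal sketch of that replay mechanism, assuming the data are plain JSON-like dicts (the function names and migration list are hypothetical):

```python
def migrate_rename_geo_loc_city(data: dict) -> dict:
    """Hypothetical script 2: rename all field keys geoLocCity -> city."""
    return {("city" if key == "geoLocCity" else key): value
            for key, value in data.items()}

# Ordered list of migrations; each deterministically reshapes the data
# one step closer to the schema the current preprocessing pipeline expects.
MIGRATIONS = [
    # migrate_overwrite_author_fields,  # hypothetical script 1
    migrate_rename_geo_loc_city,
]

def derive_raw_data(original_input_data: dict) -> dict:
    """Replay every migration over the untouched original copy, so the
    preprocessing pipeline only ever sees the current input schema."""
    data = original_input_data
    for migration in MIGRATIONS:
        data = migration(data)
    return data
```

Because the scripts are deterministic and always run against the never-edited `original_input_data`, the derived `raw_data` can be regenerated (and checked) at any time instead of being patched in place.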