Dataset versioning #235
Comments
Three approaches thus far:

1. Store a link to the parent dataset in metadata. Each dataset can store a link to the parent dataset in metadata.
2. Store links to all previous datasets in metadata. A slight modification to 1. I prefer this as it provides a less brittle way to specify dataset concatenation.
3. Store all datasets in a single directory. Each dataset is stored in a sub-directory of a parent directory. The advantage of this approach is that there's less conceptual fragmentation compared to the first two approaches, where URL links may or may not exist.
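To make the difference between approaches 1 and 2 concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: `DatasetMeta`, the `parent` and `ancestors` fields, and the `catalog` dictionary are assumed names, not part of dask-ms or of any proposed metadata schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DatasetMeta:
    """Hypothetical metadata record for one dataset version."""
    url: str
    # Approach 1: only the immediate parent is recorded.
    parent: Optional[str] = None
    # Approach 2: every previous dataset is recorded, oldest first.
    ancestors: List[str] = field(default_factory=list)


def chain_via_parent(url: str, catalog: Dict[str, DatasetMeta]) -> List[str]:
    """Approach 1: follow parent links one hop at a time.

    Raises KeyError if any intermediate dataset's metadata is unavailable.
    """
    chain = []
    current: Optional[str] = url
    while current is not None:
        meta = catalog[current]
        chain.append(current)
        current = meta.parent
    return chain[::-1]  # oldest first


def chain_via_ancestors(url: str, catalog: Dict[str, DatasetMeta]) -> List[str]:
    """Approach 2: the newest dataset alone describes the whole history."""
    return catalog[url].ancestors + [url]
```

With approach 2 the full chain is readable from the newest dataset's metadata alone, whereas approach 1 needs the metadata of every intermediate dataset. That is one possible reading of why approach 2 is described as less brittle.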
Something to watch out for: if we're creating multiple Measurement Sets, we may end up creating empty required columns (TIME, ANTENNA1, ANTENNA2, UVW) when we only want to store a new version of, e.g., CORRECTED_DATA and FLAG.
I vote #2!
Well that's exactly the idea, isn't it... new columns live in the "new" version, and all other columns live in the "old" version. So the "new" version is not self-contained without the old dataset also available, but that's by design. There's another use case for this that feels very related, and that is SSD caching. We've garnered new appreciation recently for how slow MS reads are (see https://github.com/ratt-ru/systems/issues/82), and I'll add SSD scratch filesystems to a few nodes to address this. The problem then is managing the scratch space, since it is necessarily much smaller. I think dask-ms-based tools can be made to be much more cache-friendly by implementing the following logic:
I think this can provide a very smooth user experience for dask-ms-based tools. If you have access to fast disk, you create your "fast MS" linking back to your original slow MS, and as long as you're actively using it, it stays on fast disk -- and if you stop touching the data for a while, it makes its way back to slow storage eventually without you or the sysadmin worrying too much about it. (This will also be very useful in the cloud context -- I think things like S3 storage come in hierarchies of access speed...) Thoughts?
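The "new columns live in the new version, everything else lives in the old version" idea can be sketched as a lookup that walks the chain from newest to oldest. This is only an illustration under assumptions: `resolve_column` and the `(url, columns)` chain representation are hypothetical, and the dataset names are made up.

```python
from typing import List, Set, Tuple


def resolve_column(column: str, chain: List[Tuple[str, Set[str]]]) -> str:
    """Return the URL of the newest dataset in the chain holding `column`.

    `chain` is a list of (url, columns) pairs ordered oldest first, so it is
    searched in reverse: a newer version shadows an older one.
    """
    for url, columns in reversed(chain):
        if column in columns:
            return url
    raise KeyError(f"{column} not found in any dataset of the chain")


# Hypothetical chain: a slow original MS and a fast delta holding two columns.
chain = [
    ("slow.ms", {"TIME", "ANTENNA1", "ANTENNA2", "UVW", "DATA", "FLAG"}),
    ("fast.ms", {"CORRECTED_DATA", "FLAG"}),
]
```

Here `resolve_column("FLAG", chain)` picks the fast delta, while `resolve_column("UVW", chain)` falls back to the original dataset, so the delta does not need its own copy of the required columns.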
Yep that's my vote too
I'm more concerned by the nature of the CASA Measurement Set in the sense that if you create an MS, say with To work around this, it should be possible to just create the new versions as plain CASA tables (which won't have required columns). Or perhaps modify the MS descriptor to just require the new columns... This is all workable but I also don't want things to get too complex.
The zarr and parquet formats don't have these issues because there is no distinction between a plain dataset and an MS dataset.
Yep, these are all exciting suggestions. In fact, one thought that comes to mind is datasets composed of CASA tables for the original and zarr/parquet tables for the deltas (fast versions).
Yep, this is all sensible stuff. I would point out that this is the kind of functionality that a data lake implements. I'm somewhat wary of reinventing the wheel here, even though it may be fun to roll our own stuff.
One strategy that occurred to me was to introduce a new MS indexing column (PROVENANCE_ID or repurposing an existing column) storing an integer for each row. Each entry would index a list of previous on-disk datasets, stored in a separate PROVENANCE subtable or in table/column metadata.
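A rough sketch of how such an indexing column might be used, in plain Python lists rather than real CASA tables. The dataset names and the `rows_by_source` helper are hypothetical, and the PROVENANCE subtable is modelled as a simple list of URLs.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical PROVENANCE subtable: one entry per previous on-disk dataset.
provenance = ["orig.ms", "cal_v1.ms", "cal_v2.ms"]

# Hypothetical PROVENANCE_ID column: one integer per row of the main table,
# each indexing into the provenance list above.
provenance_id = [0, 0, 1, 2, 1, 0]


def rows_by_source(ids: List[int], sources: List[str]) -> Dict[str, List[int]]:
    """Group main-table row numbers by the dataset they originate from."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for row, pid in enumerate(ids):
        groups[sources[pid]].append(row)
    return dict(groups)
```

Grouping rows this way would let a reader issue one read per source dataset instead of one per row.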
I also think that averaging should result in discarding previous provenance information, because a one-to-one mapping no longer exists.
Posting @o-smirnov's chat discussion here:
Here are my current thoughts for a row-granularity data provenance mechanism:
Provenance Sub-table
PROVENANCE_ID column
Add
I'm leaning towards ditching the
If a new data size is produced (by averaging, for example), a completely new provenance
Yes. A very common procedure is to split out calibrators and targets (different FIELD_IDs). It would be very nice if chains supported this, making this split-out effectively a no-op (in terms of storage used).
I think averaging makes a completely new dataset anyway, no, so there's no chain to speak of?
Yes I see that concern. My current feeling is that a per-row I think what would be sufficient would be for a delta-column to represent a fixed (row-based) subset of the parent column. Can we do it without full-on row-level granularity somehow? Let's keep on thinking...
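One way to picture a delta-column that represents a fixed row subset of its parent is to store the subset once per dataset rather than a provenance entry per row. The sketch below is an assumption-laden illustration, not a proposed design: `split_rows` and `read_child_column` are hypothetical helpers and the columns are plain Python lists.

```python
from typing import List, Optional, Sequence


def split_rows(field_id: Sequence[int], wanted: int) -> List[int]:
    """Parent row indices whose FIELD_ID equals `wanted` (e.g. one calibrator)."""
    return [row for row, fid in enumerate(field_id) if fid == wanted]


def read_child_column(
    parent_column: Sequence,
    row_subset: Sequence[int],
    delta: Optional[Sequence] = None,
) -> list:
    """Materialise a column of the split-out (child) dataset.

    If the child has written its own values (`delta`), those are returned;
    otherwise the values are read through from the parent at `row_subset`,
    so the split itself stores nothing but the row subset.
    """
    if delta is not None:
        if len(delta) != len(row_subset):
            raise ValueError("delta must have one value per row in the subset")
        return list(delta)
    return [parent_column[row] for row in row_subset]
```

Under this scheme, splitting out a field only records which parent rows belong to it, and new values are stored only for columns the child actually rewrites.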
Yes, we'd need the data subsets you describe below to support the above case, because the split data would be derived from completely different rows.
Will do. In fact, I am doing some reading of the Wikipedia article on Data Lineage, which seems to fall squarely into what we're planning.
If performance in pathological edge cases is the only issue, then I wouldn't worry about it too much...
TileDB supports dataset versioning and time travelling natively. https://docs.tiledb.com/main/background/key-concepts-and-data-format
Agreed that this is unlikely
Support Dataset Versioning. Briefly: