Using ERDDAP to build an archival package for NCEI to pickup #28
The problem we run into when using ERDDAP to submit data to NCEI is file fixity. Every time NCEI downloads a file from an ERDDAP endpoint, the resulting file has a different checksum, even if the contents are the same. The stated reason is that some of the metadata ERDDAP includes in the file is generated dynamically and changes on every download, which changes the checksum. The ArchiveADataset tool resolves the fixity problem, but it does not resolve the transfer mechanism problem: how will NCEI pick up a package that has been generated by that tool?
These are questions to explore. |
@BobSimons, FYI. |
I'll add: If the dataset is available via ERDDAP's "files" system and the files (once they are made available) don't change (e.g., yearly, monthly, or daily files that never change), then having NCEI scan the "files" directory for new files and download them is a good solution. ArchiveADataset presumes that the ERDDAP admin knows when the dataset's data for a given time range is changing or has finished changing. Once a chunk of data (e.g., the data for a given year or month) has finished changing, the admin can run ArchiveADataset and send the resulting file to NCEI. The presumption is that NCEI knows less (or nothing) about the dataset and about whether the data for a given time range is still changing, and also wouldn't know if data that was thought to be unchanging later changed. How will NCEI pick up the ArchiveADataset result? That's for the data provider and NCEI to work out. One option is to put the archives in a directory and make an EDDTableFromFileNames dataset, which makes the archive files publicly accessible. You could then tell NCEI to scan that directory and pick up any new files. If these options are insufficient and some other feature should be added to ERDDAP to facilitate archiving to NCEI, please let me know. |
The problem I have seen in the past is that when ERDDAP provides the file for download it updates the ":history" attribute to show when the file was "created" and provided for download. This also happens when ArchiveADataset is called to build an archive package out of a set of data files. Thus, every time the file is packaged it is different. Bob's solution works if and only if the admin knows exactly when a dataset has changed and only calls ArchiveADataset when the dataset data and/or metadata have been updated. This makes it hard to automate. Ideally, in my view, the system should provide a way that a set of packages could be automatically generated (using ArchiveADataset) but those packages would only be updated when actual data and/or metadata have changed, and possibly only once the admin has indicated those data are ready for archival. For example, let's say a dataset "A" is created. Once the admin is satisfied that "A" is ready for archival, they would set the "ready for archival" flag on the dataset. The automatic archival package generation routine would fire each day, and it would notice the "ready for archival" flag is set, and if a package was not already in the WAF for "A", a new package would be generated (using ArchiveADataset). After that, unless changes were made to the data and/or metadata, the package would remain unchanged. But if changes were made to the data and/or metadata, the automatic archival package generation routine would notice that "A" had been changed more recently than the package in the WAF, and would trigger the building of an updated package to the WAF. In this way, the admin would not have to remember to trigger the generation of a package manually. |
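The daily check proposed above can be sketched as a small decision function. This is only an illustration of the proposal, not an existing ERDDAP feature: the name `needs_new_package`, the "ready for archival" flag, and the two timestamps are all hypothetical inputs that a provider's own script would have to supply.

```python
from datetime import datetime
from typing import Optional


def needs_new_package(ready_for_archival: bool,
                      dataset_last_changed: datetime,
                      package_built_at: Optional[datetime]) -> bool:
    """Decide whether the daily routine should (re)build the archive package.

    All three inputs are hypothetical: ERDDAP does not expose a
    "ready for archival" flag, so the calling script must track it.
    """
    # Never build a package until the admin has flagged the dataset.
    if not ready_for_archival:
        return False
    # No package in the WAF yet: build the first one.
    if package_built_at is None:
        return True
    # Rebuild only if the data or metadata changed after the last build.
    return dataset_last_changed > package_built_at
```

When the function returns true, the script would run ArchiveADataset and place the result in the WAF; otherwise the existing package is left untouched, so its checksum stays stable.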
ERDDAP only puts the request URL and the date of the request in the "history" attribute in .nc files (where there are attributes). Other file types (e.g., .jsonl, .csv) don't have attributes, so they don't have a "history" attribute.
So there is a way to determine whether the data for a given request (e.g., a given time period) has changed: make a non-.nc request and see if the response differs from the previous response to that request.
Thus, you could automate the creation of a new archive package based on whether the non-.nc response (e.g., .jsonl) has changed.
I hope that helps.
|
Started a Code Sprint project page here: https://ioos.github.io/ioos-code-sprint/2022/topics/06-using-erddap-for-ncei-archive.html |
Thanks @BobSimons and @relphj. Unfortunately, the agreed-upon specification for most of the datasets to be archived through this pathway is netCDF, as NCEI needs all the associated metadata to build the archive metadata records. I'm wondering whether it is possible to run ArchiveADataset non-interactively, by supplying all the answers on the command line. I can envision the data provider building a system similar to @relphj's recommendation. Flow would be:
|
Regarding 1ii) I again offer this solution: The admin can run a script which requests a non-.nc version of the data (e.g., .jsonl) and calculates the md5 (or sha256 or...) of that data file. Whenever that md5 changes from the previous value, the dataset has changed and is ready to be archived. Regarding 3i) Yes. Make an EDDTableFromFileNames dataset which points to all the files in a directory (and subdirectories if needed). Any files the administrator puts in that directory which match the dataset's file name regex will be available via ERDDAP's files system. |
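A minimal Python sketch of that change check, assuming the provider keeps the previous digest in a small state file. The function names and the state-file convention are illustrative assumptions, not part of ERDDAP; SHA-256 is used here, but MD5 would work the same way.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def fetch(url: str) -> bytes:
    """Download a non-.nc response (e.g., a .jsonl request) from ERDDAP."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def changed_since_last_check(payload: bytes, state_file: Path) -> bool:
    """Compare the payload's digest with the one saved by the previous run.

    The state file is a hypothetical convention: one hex digest per dataset.
    A missing state file counts as "changed", so the first run reports True.
    """
    new_digest = sha256_hex(payload)
    old_digest = state_file.read_text().strip() if state_file.exists() else None
    if new_digest == old_digest:
        return False
    # Remember the new digest so the next run compares against it.
    state_file.write_text(new_digest + "\n")
    return True
```

A scheduled script would call `changed_since_last_check(fetch(url), Path("mydataset.sha256"))`, where `url` is the provider's own .jsonl request for the dataset (both the URL and the file name are placeholders), and run ArchiveADataset only when the result is true.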
Ahh, so you propose that the data provider make an intermediary non-.nc file (e.g., csv or json) to check for changes. Thank you for the clarification. Here's an example in Windows PowerShell of calculating the hash of an ERDDAP csv endpoint:
|
jsonl is the preferred format. There has been a request for ArchiveADataset to accept an external directory. Whatever is in that external directory would be included in the BagIt file. Also add an optional yes/no to include the ISO metadata record from ERDDAP. |
Run as one liner:
|
Java file to work on: https://github.com/BobSimons/erddap/blob/master/WEB-INF/classes/gov/noaa/pfel/erddap/ArchiveADataset.java TODO:
|
@iamchrisser Did ATN end up using this pathway to generate the files for submission to NCEI? |
@iamchrisser I wanted to pull the information I found out of an email and into something we can summarize. Feel free to add your experiences in this ticket too.
There's probably some way we can use this functionality to accomplish what we're after with the ATN automation. Maybe there could be something at NCEI that monitors http://erddap.ioos.us/erddap/files/ for new directories/changes to checksums in directories.
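One way such a monitor could decide what to pick up is to compare two manifests that map file paths to checksums. The sketch below is an assumption about how that comparison might look; how the manifests are built (for example, by listing a "files" directory and hashing each download) is left to the monitoring side, and `diff_manifests` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def diff_manifests(previous: Dict[str, str],
                   current: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (new_files, changed_files) between two path -> checksum maps."""
    # Files that appear in the current scan but not in the previous one.
    new_files = sorted(name for name in current if name not in previous)
    # Files present in both scans whose checksum differs.
    changed_files = sorted(
        name for name in current
        if name in previous and current[name] != previous[name]
    )
    return new_files, changed_files
```

Files that disappear between scans are deliberately ignored here; an archive would normally keep what it already holds.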