
Using ERDDAP to build an archival package for NCEI to pickup #28

Open
MathewBiddle opened this issue Feb 1, 2022 · 14 comments
Labels
question Further information is requested

Comments

@MathewBiddle (Collaborator)

@iamchrisser I wanted to pull the information I found out of an email and into something we can summarize. Feel free to add your experiences in this ticket too.

There's probably some way we can use this functionality to accomplish what we're after with the ATN automation. Maybe there could be something at NCEI that monitors http://erddap.ioos.us/erddap/files/ for new directories/changes to checksums in directories.

MathewBiddle added the "question" label Feb 1, 2022
@MathewBiddle (Collaborator, Author)

The problem we run into with using ERDDAP to submit data to NCEI is file fixity. Every time NCEI goes to an ERDDAP endpoint and downloads a file, the resultant file has a different checksum, even if the contents are the same. It's been stated that this is because some of the metadata ERDDAP includes in the resultant file is dynamically generated, and thus changes every time someone downloads the dataset, resulting in a different checksum.
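A toy illustration of why dynamic metadata breaks fixity (the byte strings below are invented stand-ins, not real ERDDAP output): two "downloads" with identical data but different `:history` timestamps hash differently.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of a byte string, as hex."""
    return hashlib.sha256(data).hexdigest()

# Two hypothetical downloads of the same dataset: the data payload is
# identical, but a different :history timestamp is stamped into each file.
download_1 = b"lat,lon\n32.1,-117.3\n:history 2022-02-01T10:00:00Z ERDDAP request"
download_2 = b"lat,lon\n32.1,-117.3\n:history 2022-02-01T11:30:00Z ERDDAP request"

# Only metadata differs, yet the checksums do not match.
print(sha256_hex(download_1) == sha256_hex(download_2))  # prints False
```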

The ArchiveADataset tool resolves the fixity problem, however, it does not resolve the transfer mechanism problem. How will NCEI pick up a package that has been generated by that tool?

  • Put it on a WAF?
  • Use ERDDAP to re-serve the package? (as described above)

These are questions to explore.

@MathewBiddle (Collaborator, Author)

@BobSimons, FYI.

@BobSimons

I'll add:

If the data set is available via ERDDAP "files" system and the files (once they are made available) don't change (e.g., for yearly, monthly, or daily files that never change), then having NCEI scan the "files" directory for new files and downloading them is a good solution.

ArchiveADataset presumes that the ERDDAP admin has knowledge of when the dataset's data for a given time range is changing or has finished changing. Once a chunk of data (e.g., the data for a given year or month) has finished changing, then the admin can run ArchiveADataset and send the resulting file to NCEI. The presumption is that NCEI knows less (or nothing) about the dataset and when the data for a given time range is still changing, and also wouldn't know if the data was thought to be unchanging but then changed.

How will NCEI pick up the ArchiveADataset result? That's for the data provider and NCEI to work out. One option is to put the archives in a directory and make an EDDTableFromFileNames dataset which makes the archive files publicly accessible. You could then tell NCEI to scan that directory and pick up any new files.
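A rough sketch of what such an EDDTableFromFileNames entry in datasets.xml could look like (the datasetID, directory path, and attribute values here are invented; in practice GenerateDatasetsXml should be used to produce the full set of required elements):

```xml
<dataset type="EDDTableFromFileNames" datasetID="nceiArchiveBags" active="true">
  <fileDir>/data/archive/bags/</fileDir>
  <fileNameRegex>.*\.tar\.gz</fileNameRegex>
  <recursive>true</recursive>
  <addAttributes>
    <att name="title">ArchiveADataset Packages for NCEI Pickup</att>
    <att name="summary">BagIt packages generated by ArchiveADataset, served via ERDDAP's files system for NCEI to scan and download.</att>
  </addAttributes>
</dataset>
```

Any archive file placed in that directory (and matching the regex) would then appear under ERDDAP's `files/` URL for NCEI to scan.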

If these options are insufficient and some other feature should be added to ERDDAP to facilitate archiving to NCEI, please let me know.

@relphj commented Mar 16, 2022

The problem I have seen in the past is that when ERDDAP provides the file for download it updates the ":history" attribute to show when the file was "created" and provided for download. This also happens when ArchiveADataset is called to build an archive package out of a set of data files. Thus, every time the file is packaged it is different.

Bob's solution works if and only if the admin knows exactly when a dataset has changed and only calls ArchiveADataset when the dataset data and/or metadata have been updated. This makes it hard to automate.

Ideally, in my view, the system should provide a way for a set of packages to be generated automatically (using ArchiveADataset), with those packages updated only when actual data and/or metadata have changed, and possibly only once the admin has indicated the data are ready for archival.

For example, say a dataset "A" is created. Once the admin is satisfied that "A" is ready for archival, they set the "ready for archival" flag on the dataset. The automatic archival package generation routine fires each day, notices the flag is set, and, if a package is not already in the WAF for "A", generates a new package (using ArchiveADataset). After that, unless changes are made to the data and/or metadata, the package remains unchanged. But if changes are made, the routine notices that "A" has been changed more recently than the package in the WAF and triggers the building of an updated package. In this way, the admin does not have to remember to trigger package generation manually.
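The decision logic in that proposal can be sketched as follows (an illustrative sketch only; none of these names are an ERDDAP API, and how the dataset's last-change time is obtained is left open):

```python
from pathlib import Path

# Hypothetical sketch of the daily routine described above: rebuild a
# dataset's package only when the dataset has changed more recently than
# the package already in the WAF.

def needs_rebuild(dataset_mtime: float, package: Path) -> bool:
    """True if no package exists yet, or the dataset changed after it was built."""
    return not package.exists() or dataset_mtime > package.stat().st_mtime

def datasets_to_package(flagged: dict[str, float], waf_dir: Path) -> list[str]:
    """flagged maps 'ready for archival' dataset IDs to their last-change times."""
    return [ds for ds, mtime in flagged.items()
            if needs_rebuild(mtime, waf_dir / f"{ds}.tar.gz")]
```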

@BobSimons commented Mar 16, 2022 via email

@MathewBiddle (Collaborator, Author)

Thanks @BobSimons and @relphj. Unfortunately, the agreed-upon specification for most of the datasets to be archived through this pathway is netCDF, as NCEI needs all the associated metadata to build the archive metadata records.

Is it possible to run ArchiveADataset non-interactively, by listing all the answers on the command line?

I can envision the data provider building a system similar to @relphj's recommendation. Flow would be:

  1. Create/edit a configuration file that lists the dataset IDs in the host ERDDAP to be archived at NCEI (setting the "ready for archival" flag).
    i. The data provider would have to manage which ones are new/updated.
    ii. Question: how would the ERDDAP admin know when a previously shared dataset has been updated? They would need to manage that piece somehow.
  2. A script (run at some frequency TBD by the provider) uses the config file to run ArchiveADataset for each dataset. This assumes you can run it by listing all the answers on the command line:
    $ ArchiveADataset.sh Bagit tar.gz [contact] [datasetID] all .nc SHA-256
  3. The resultant BagIt package is put in the appropriate WAF for NCEI to pick up.
    i. It would be nice to use ERDDAP's files system to share the packages, but that clutters up the ERDDAP with duplicate data. Is it possible to have a package available via the files system but not through the rest of ERDDAP's services?
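The steps above could be driven by a small script like this sketch (hypothetical: it assumes ArchiveADataset accepts its answers on the command line in the same order as its interactive prompts, and the config-file format is invented):

```python
# Hypothetical driver for the flow above: read dataset IDs from a config
# file and build the ArchiveADataset command line for each one.

def read_config(text: str) -> list[str]:
    """One dataset ID per line; '#' starts a comment, blank lines are skipped."""
    ids = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            ids.append(line)
    return ids

def build_archive_cmd(dataset_id: str, contact: str = "default") -> list[str]:
    """Argument order mirrors the example invocation above."""
    return ["bash", "ArchiveADataset.sh", "Bagit", "tar.gz",
            contact, dataset_id, "all", ".nc", "SHA-256"]

# Usage sketch:
# for ds in read_config(open("archive_datasets.cfg").read()):
#     subprocess.run(build_archive_cmd(ds), check=True)
```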

@BobSimons

Regarding 1ii) I again offer this solution: The admin can run a script which requests a non-.nc version of the data (e.g., .jsonl) and calculates the md5 (or sha256 or...) of that data file. Whenever that md5 changes from the previous value, the dataset has changed and is ready to be archived.
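That change-detection loop can be sketched as below (the state-file location is invented; fetching the .jsonl rendering, e.g. with urllib.request.urlopen, is left as a comment so the hashing logic stands alone):

```python
import hashlib
from pathlib import Path

def changed_since_last(data: bytes, state_file: Path) -> bool:
    """Hash the downloaded bytes and compare with the previously stored digest.
    Returns True (and updates the state file) when the digest has changed."""
    digest = hashlib.sha256(data).hexdigest()
    previous = state_file.read_text().strip() if state_file.exists() else None
    state_file.write_text(digest)
    return digest != previous

# Usage sketch:
# data = urllib.request.urlopen(jsonl_url).read()
# if changed_since_last(data, Path("raw_asset_inventory.sha256")):
#     ...dataset changed: run ArchiveADataset...
```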

Regarding 3i) Yes. Make an EDDTableFromFileNames dataset which points to all the files in a directory (and subdirectories if needed). Any files the administrator puts in that directory which match the dataset's file name regex will be available via ERDDAP's files system.

@MathewBiddle (Collaborator, Author)

> Regarding 1ii) I again offer this solution: The admin can run a script which requests a non-.nc version of the data (e.g., .jsonl) and calculates the md5 (or sha256 or...) of that data file. Whenever that md5 changes from the previous value, the dataset has changed and is ready to be archived.

Ahh, so you propose the data provider make some intermediary csv/json (non-nc) file to check for changes. Thank you for the clarification.

Here's an example in Windows PowerShell of calculating the hash of an erddap csv endpoint:

C:\Users> $wc = [System.Net.WebClient]::new()
C:\Users> Get-FileHash -InputStream ($wc.OpenRead("http://erddap.ioos.us/erddap/tabledap/raw_asset_inventory.csv"))

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
SHA256          44CD532AD8B1381557DE5252E88428FA1574FA3B04411B974666131E31808174
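For a cross-platform equivalent, the same streaming hash can be computed in Python; the helper below hashes any file-like object in chunks, returning uppercase hex to match Get-FileHash's output style (the commented usage targets the same ERDDAP csv endpoint as the PowerShell snippet):

```python
import hashlib

def sha256_of_stream(stream, chunk_size: int = 65536) -> str:
    """Hash a file-like object in chunks; uppercase hex matches Get-FileHash."""
    h = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest().upper()

# Usage sketch:
# import urllib.request
# with urllib.request.urlopen(
#         "http://erddap.ioos.us/erddap/tabledap/raw_asset_inventory.csv") as resp:
#     print(sha256_of_stream(resp))
```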

@MathewBiddle (Collaborator, Author)

jsonl is the preferred format.

There has been request for the ArchiveADataset to specify an external directory. Whatever is in that external directory would be included in the BagIt file.

Add optional yes/no to include ISO metadata record from ERDDAP.

@MathewBiddle (Collaborator, Author) commented Apr 26, 2022

Run as one liner:

docker run --rm -it \
  -v "$(pwd)/datasets:/datasets" \
  -v "$(pwd)/logs:/erddapData/logs" \
  -v "$(pwd)/erddap/content:/usr/local/tomcat/content/erddap" \
  -v "$(pwd)/erddap/data:/erddapData" \
  axiom/docker-erddap:latest \
  bash -c 'cd webapps/erddap/WEB-INF/ && bash ArchiveADataset.sh -verbose BagIt tar.gz default raw_asset_inventory default "" "" .nc SHA-256'

@MathewBiddle (Collaborator, Author)

Java file to work on: https://github.com/BobSimons/erddap/blob/master/WEB-INF/classes/gov/noaa/pfel/erddap/ArchiveADataset.java

TODO:

  • @BobSimons to add the capability to include external files into the BagIt package. One of those external files will be the ISO metadata record, if available.
  • @MathewBiddle to engage with other partners on what their requirements might be.

@MathewBiddle (Collaborator, Author)

@iamchrisser Did ATN end up using this pathway to generate the files for submission to NCEI?
