
Ingest Archiver

The archiver service is an ingest component that:

  • Submits metadata to the appropriate external accessioning authorities. Currently these are only EBI authorities (e.g. BioSamples).
  • Converts metadata into the format accepted by each external authority.

In the future it will:

  • Update HCA metadata with accessions provided by external authorities

At the moment it consists of three stages:

  1. Running the metadata archiver (MA) script (the one in this repository), which archives the metadata of a submission through the DSP. This script also checks the submission of the files by the file uploader (see below).
  2. Running the file uploader (FIU), which uploads the archive data to the DSP and runs on the EBI cluster. It needs access to the file submission JSON instructions generated by the metadata archiver.
  3. Running the metadata archiver (MA) again to validate and submit the entire submission.

This component is currently invoked manually after an HCA submission.

How to run

Step 1: Docker Run Command

docker run -v $PWD:/output \
--env INGEST_API_URL=http://api.ingest.archive.data.humancellatlas.org/ \
--env INGEST_API_GCP='{ "type": "service_account", "project_id": "...", "private_key_id": "...", "private_key": "...", "client_email": "...", "client_id": "...", "auth_uri": "https://accounts.google.com/o/oauth2/auth", "token_uri": "https://oauth2.googleapis.com/token", "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs", "client_x509_cert_url": "..." }' \
--env DSP_API_URL=https://submission.ebi.ac.uk \
--env AAP_API_URL=https://api.aai.ebi.ac.uk/auth \
--env ONTOLOGY_API_URL=https://www.ebi.ac.uk/ols \
--env AAP_API_DOMAIN=<aap_domain> \
--env AAP_API_USER=<aap_user> \
--env AAP_API_PASSWORD=<aap_password> \
--env VALIDATION_POLL_FOREVER=False \
--env SUBMISSION_POLL_FOREVER=False \
quay.io/ebi-ait/ingest-archiver \
--alias_prefix=HCA \
--project_uuid=<project_uuid>

Environment variables

INGEST_API_URL
# The ingest environment to pull metadata for submission. 
Production: INGEST_API_URL=http://api.ingest.archive.data.humancellatlas.org/
Staging: INGEST_API_URL=http://api.ingest.staging.archive.data.humancellatlas.org/

INGEST_API_GCP (OPTIONAL)
# The service account token to use when connecting to the ingest api.
This is required when completing submissions, to post accessions back to ingest, but is otherwise optional.
Search the AWS Secrets Manager for gcp-credentials.json
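Since INGEST_API_GCP must hold the service account key as literal JSON, it is easy to break the docker command with unescaped quotes. A minimal Python sketch for sanity-checking the value before launching the container (the helper function name is made up for illustration):

```python
import json


def check_gcp_credentials(raw: str) -> dict:
    """Parse the INGEST_API_GCP value and confirm it looks like a
    Google service-account key (which always declares its type)."""
    creds = json.loads(raw)  # raises ValueError if the JSON is malformed
    if creds.get("type") != "service_account":
        raise ValueError(f"unexpected credential type: {creds.get('type')!r}")
    return creds
```

If this raises, fix the quoting of the --env INGEST_API_GCP value before running the archiver.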

DSP_API_URL
# The DSP service on which to create the new submission to archives.
# The DSP's old name was USI; the old variable name USI_API_URL is also supported
Production: https://submission.ebi.ac.uk
Test: https://submission-test.ebi.ac.uk

# The DSP uses an EBI Authentication and Authorization Profile (AAP) account.
AAP_API_URL
Production: https://api.aai.ebi.ac.uk/auth
Test: https://explore.api.aai.ebi.ac.uk/auth

AAP_API_DOMAIN
Test: subs.test-team-21
Production: subs.team-2

AAP_API_USER, AAP_API_PASSWORD
# Specify the AAP user and password. Either create your own in the AAP domain above or use the common AAP user if archiving on behalf of ingest:
hca-ingest

Runtime Variables

--alias_prefix=HCA
--project_uuid=2a0faf83-e342-4b1c-bb9b-cf1d1147f3bb
The --alias_prefix above is prefixed to every DSP entity created by the Archiver.
The --project_uuid is used to download assay manifests from the Ingest API.

Execution

You should get output like:

GETTING MANIFESTS FOR PROJECT: 2a0faf83-e342-4b1c-bb9b-cf1d1147f3bb
Processing 6 manifests:
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/0d172fd7-f5af-4307-805b-3a421cdabd76
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/9526f387-bb5a-4a1b-9fd1-8ff977c62ffd
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/4d07290e-8bcc-4060-9b67-505133798ab0
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/b6d096f4-239a-476d-9685-2a03c86dc06b
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/985a9cb6-3665-4c04-9b93-8f41e56a2c71
https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/19f1a1f8-d563-43a8-9eb3-e93de1563555

* PROCESSING MANIFEST 1/6: https://api.ingest.staging.archive.data.humancellatlas.org/bundleManifests/0d172fd7-f5af-4307-805b-3a421cdabd76
Finding project entities in bundle...
1
Finding study entities in bundle...
1
Finding sample entities in bundle...
17
Finding sequencingExperiment entities in bundle...
1
Finding sequencingRun entities in bundle...
1
...
Entities to be converted: {
    "project": 1,
    "study": 1,
    "sample": 19,
    "sequencingExperiment": 6,
    "sequencingRun": 6
}
Saving Report file...
Saved to /output/ARCHIVER_2019-01-04T115615/REPORT.json!
##################### FILE ARCHIVER NOTIFICATION
Saved to /output/ARCHIVER_2019-01-04T115615/FILE_UPLOAD_INFO.json!

Step 2 - Check REPORT.json

In your current directory, the MA will have generated a directory named ARCHIVER_<timestamp> containing two files: REPORT.json and FILE_UPLOAD_INFO.json. Inspect REPORT.json for errors. If there are any data files to upload, you will always see FileReference dsp_validation_errors in the submission_errors field. These can be ignored: the files will be uploaded in the following steps. For example:

    "completed": false,
    "submission_errors": [
        {
            "error_message": "Failed in DSP validation.",
            "details": {
                "dsp_validation_errors": [
                    {
                        "FileReference": [
                            "The file [306982e4-5a13-4938-b759-3feaa7d44a73.bam] referenced in the metadata is not exists on the file storage area."
                        ]
                    },
                    {
                        "FileReference": [
                            "The file [988de423-1543-4a84-be9a-dd81f5feecff.bam] referenced in the metadata is not exists on the file storage area."
                        ]
                    },
                    {
                        "FileReference": [
                            "The file [fd226091-9a8f-44a8-b49e-257fffa2b931.bam] referenced in the metadata is not exists on the file storage area."
                        ]
                    }
                ]
            }
        }
    ],

If you see entities added to the submission with non-empty errors or warnings fields, please report them to the ingest development team. This is a small snippet showing a successful entity addition:

    "entities": {
        "HCA_2019-01-07-13-53__project_2a0faf83-e342-4b1c-bb9b-cf1d1147f3bb": {
            "errors": [], 
            "accession": null,
            "warnings": [], 
            "entity_url": "https://submission-dev.ebi.ac.uk/api/projects/c26466cd-9551-46c9-b760-72e05cfc51ac"
        },
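The inspection above can be automated. A minimal Python sketch (the function name is invented; it assumes the REPORT.json structure shown in the snippets above) that flags submission errors other than the ignorable FileReference ones, plus any entity with non-empty errors or warnings:

```python
import json


def find_report_problems(report: dict) -> list:
    """Return a list of problems found in a parsed REPORT.json,
    ignoring the FileReference errors that are expected before
    the file uploader has run."""
    problems = []
    for error in report.get("submission_errors", []):
        dsp_errors = error.get("details", {}).get("dsp_validation_errors", [])
        # An error is only ignorable if every entry is a FileReference error.
        if not dsp_errors or any("FileReference" not in e for e in dsp_errors):
            problems.append(error.get("error_message", "unknown submission error"))
    for name, entity in report.get("entities", {}).items():
        if entity.get("errors") or entity.get("warnings"):
            problems.append(f"{name}: errors={entity.get('errors')} warnings={entity.get('warnings')}")
    return problems
```

An empty result means the report only contains the expected pre-upload FileReference errors.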

Step 3 - Copy FILE_UPLOAD_INFO.json to cluster

FILE_UPLOAD_INFO.json contains the instructions the file uploader needs to convert and upload submission data to the DSP. You need to copy this file to the HCA NFS directory accessible by the cluster, and give it a unique name so that it doesn't clash with any existing JSON files.

Therefore, prepend something to the filename to make it unique. This can be anything, but we suggest your username and the dataset name, for example mfreeberg_rsatija_FILE_UPLOAD_INFO.json
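The suggested naming convention can be sketched in a couple of lines of Python (the helper name is made up; it simply follows the username_dataset pattern suggested above):

```python
import getpass


def unique_upload_info_name(dataset: str) -> str:
    """Build a unique FILE_UPLOAD_INFO filename from the current
    user's login name and the dataset name."""
    return f"{getpass.getuser()}_{dataset}_FILE_UPLOAD_INFO.json"
```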

Copy the file using the secure copy (scp) command. This will prompt for your EBI password and is equivalent to copying a file over ssh. For example:

scp FILE_UPLOAD_INFO.json ebi-cli.ebi.ac.uk:/nfs/production/hca/mfreeberg_rsatija_FILE_UPLOAD_INFO.json

Step 4 - Login to cluster

Log in to EBI CLI to access the cluster, using your EBI password:

ssh ebi-cli.ebi.ac.uk

Step 5 - Run the file uploader

Run the file uploader with the bsub command below. Each component is explained underneath.

bsub 'singularity run -B /nfs/production/hca:/data docker://quay.io/humancellatlas/ingest-file-archiver -d=/data -f=/data/mfreeberg_rsatija_FILE_UPLOAD_INFO.json -l=https://explore.api.aai.ebi.ac.uk/auth -p=<ebi-aap-password> -u=hca-ingest'

  • bsub - the command for submitting a job to the cluster
  • singularity - the cluster runs jobs using Singularity containers.
  • -B /nfs/production/hca:/data - this binds the /nfs/production/hca directory to /data inside the container.
  • docker://quay.io/humancellatlas/ingest-file-archiver - Singularity can run Docker images directly. This is the image for the file uploader.
  • -d=/data - workspace used to store downloaded files, metadata and conversions.
  • -f=/data/mfreeberg_rsatija_FILE_UPLOAD_INFO.json - the location of the FILE_UPLOAD_INFO.json you copied in a previous step.
  • -l=https://explore.api.aai.ebi.ac.uk/auth - The AAP API url, same as the AAP_API_URL environmental variable. As above, this will need to be -l=https://api.aai.ebi.ac.uk/auth instead if you are submitting to production DSP.
  • -p=<ebi-aap-password> - Test or production AAP password as used previously
  • -u=hca-ingest - The DSP user to use. This will always be hca-ingest right now.

On submitting you will see a response along the lines of:

Job <894044> is submitted to default queue <research-rh7>.

This shows that the job has been submitted to the cluster. To see the status of the job, run

bjobs -W

The job should be reported as running but may also be pending if the cluster is busy.

If you want to see the job's current stdout/stderr then run the bpeek command

bpeek <job-id>

Once the job is running, processing may take a long time (many days when a dataset has many data file conversions to perform). It will continue running after you log out and will e-mail you the results on completion or failure. Wait until you receive this e-mail before proceeding with the next step.


Step 6 - Check the cluster job results e-mail

The e-mail you receive will have a title similar to Job %JOB-ID%: <singularity run -B /nfs/production/hca/mfreeberg:/data docker://quay.io/humancellatlas/ingest-file-archiver -d=/data -f=/data/FILE_UPLOAD_INFO.json -l=https://explore.api.aai.ebi.ac.uk/auth -p=%PW% -u=hca-ingest> in cluster <EBI> Done

This will contain a lot of detail about the job run. Scroll down to the bottom and you should see INFO messages such as

INFO:hca:File process_15.json: GET SUCCEEDED. Stored at fd226091-9a8f-44a8-b49e-257fffa2b931/process_15.json.

and

INFO:hca:File PBMC_RNA_R1.fastq.gz: GET SUCCEEDED. Stored at fd226091-9a8f-44a8-b49e-257fffa2b931/PBMC_RNA_R1.fastq.gz.

If you see any WARNING or ERROR messages, re-run the singularity command from the previous step (it will retry the failed steps) and notify the ingest development team.

Alternative to Steps 3-6: Running the file uploader outside of Singularity

For test purposes you can run the file uploader outside of Singularity with the command:

docker run --rm -v $PWD:/data quay.io/humancellatlas/ingest-file-archiver -d=/data -f=/data/FILE_UPLOAD_INFO.json -l=https://api.aai.ebi.ac.uk/auth -p=<password> -u=hca-ingest

Step 7 - Validate submission and submit

To do this you need to run the metadata archiver again:

docker run -v $PWD:/output \
--env INGEST_API_URL=http://api.ingest.archive.data.humancellatlas.org/ \
--env DSP_API_URL=https://submission.ebi.ac.uk \
--env AAP_API_URL=https://api.aai.ebi.ac.uk/auth \
--env ONTOLOGY_API_URL=https://www.ebi.ac.uk/ols \
--env AAP_API_DOMAIN=<aap_domain> \
--env AAP_API_USER=<aap_user> \
--env AAP_API_PASSWORD=<aap_password> \
--env VALIDATION_POLL_FOREVER=False \
--env SUBMISSION_POLL_FOREVER=False \
quay.io/ebi-ait/ingest-archiver \
--alias_prefix=HCA \
--project_uuid=<project_uuid> \
--submission_url=https://submission.ebi.ac.uk/api/submissions/<submission-uuid>

You can get the submission UUID from either the output of the initial metadata archiver run, e.g.

DSP SUBMISSION: https://submission-dev.ebi.ac.uk/api/submissions/b729f228-d587-440c-ae5b-d0c1f34b8766

or from REPORT.json, in the submission_url field (there will be several). For example,

"submission_url": "https://submission-dev.ebi.ac.uk/api/submissions/b729f228-d587-440c-ae5b-d0c1f34b8766"

On success you will get the message SUCCESSFULLY SUBMITTED. You're done!

How to run the tests

python -m unittest discover -s tests -t tests

Versioning

For the versions available, see the tags on this repository.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
