OpenCitations Meta contains bibliographic metadata associated with the documents involved in the citations stored in the OpenCitations infrastructure. The OpenCitations Meta software performs two main actions: the curation of the provided CSV files and the generation of new RDF files compliant with the OpenCitations Data Model.
An example of a raw CSV input file can be found in example.csv.
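Purely for illustration, and assuming a column layout along the lines of id, title, author, pub_date, venue, volume, issue, page, type, publisher and editor (example.csv remains the authoritative reference), a row of a raw input file might look like the following; every value below is invented:

```csv
"id","title","author","pub_date","venue","volume","issue","page","type","publisher","editor"
"doi:10.1000/example","A Hypothetical Article","Doe, Jane [orcid:0000-0000-0000-0000]","2020","Journal Of Examples [issn:0000-0000]","1","2","1-10","journal article","Example Press","Roe, Richard"
```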
The Meta process is launched through the meta_process.py file via the following command:
python -m oc_meta.run.meta_process -c <PATH>
Where:
- -c --config: path to the configuration file.
The configuration file is a YAML file with the following keys (an example can be found in config/meta_config.yaml).
Setting | Mandatory | Description |
---|---|---|
triplestore_url | ✓ | Endpoint URL to load the output RDF |
input_csv_dir | ✓ | Directory where raw CSV files are stored |
base_output_dir | ✓ | The path to the base directory to save all output files |
resp_agent | ✓ | A URI string representing the provenance agent which is considered responsible for the RDF graph manipulation |
base_iri | ☓ | The base URI of entities on Meta. This setting can be safely left as is |
context_path | ☓ | URL where the namespaces and prefixes used in the OpenCitations Data Model are defined. This setting can be safely left as is. |
dir_split_number | ☓ | Number of files per folder. dir_split_number's value must be a multiple of items_per_file's value. This parameter is useful only if you choose to return the output in JSON-LD |
items_per_file | ☓ | Number of items per file. This parameter is useful only if you choose to return the output in JSON-LD |
default_dir | ☓ | This value is used as the default prefix if no prefix is specified. It is a deprecated parameter, valid only for backward compatibility and can safely be ignored |
supplier_prefix | ☓ | A prefix for the sequential number in entities’ URIs. This setting can be safely left as is |
rdf_output_in_chunks | ☓ | If True, save all the graphset and provset in one file, and save all the graphset on the triplestore. If False, the graphs are saved according to the usual OpenCitations strategy (the "complex" hierarchy of folders and subfolders for each type of entity) |
zip_output_rdf | ☓ | If True, the folder specified in output_rdf_dir must contain zipped JSON files, and the output will be zipped |
source | ☓ | Data source URL. This setting can be safely left as is |
use_doi_api_service | ☓ | If True, use the DOI API service to check if DOIs are valid |
workers_number | ☓ | Number of cores to devote to the Meta process |
blazegraph_full_text_search | ☓ | True if Blazegraph was used as a provenance triplestore, and a textual index was built to speed up queries. For more information, see https://github.com/blazegraph/database/wiki/Rebuild_Text_Index_Procedure |
fuseki_full_text_search | ☓ | True if Fuseki was used as a provenance triplestore, and a textual index was built to speed up queries. For more information, see https://jena.apache.org/documentation/query/text-query.html |
virtuoso_full_text_search | ☓ | True if Virtuoso was used as a provenance triplestore, and a textual index was built to speed up queries. For more information, see https://docs.openlinksw.com/virtuoso/rdfsparqlrulefulltext/ |
graphdb_connector_name | ☓ | The name of the Lucene connector if GraphDB was used as a provenance triplestore and a textual index was built to speed up queries. For more information, see https://graphdb.ontotext.com/documentation/free/general-full-text-search-with-connectors.html |
cache_endpoint | ☓ | Specifies the provenance triplestore URL to use as a cache to make queries on provenance faster |
cache_update_endpoint | ☓ | If your cache provenance triplestore uses different endpoints for reading and writing (e.g. GraphDB), specify the endpoint for writing in this parameter |
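For reference, a minimal configuration limited to the mandatory keys might look like the sketch below; the endpoint URL and paths are hypothetical placeholders, and config/meta_config.yaml remains the authoritative template:

```yaml
# Minimal sketch of a Meta configuration file (hypothetical values).
# Only the mandatory keys are set; every optional setting keeps its default.
triplestore_url: "http://localhost:9999/blazegraph/sparql"  # endpoint where the output RDF is loaded
input_csv_dir: "/data/meta/input_csv"                       # directory containing the raw CSV files
base_output_dir: "/data/meta/output"                        # base directory for all output files
resp_agent: "https://orcid.org/0000-0000-0000-0000"         # URI of the agent responsible for the RDF manipulation
```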
orcid_process.py generates an index between DOIs and authors' ORCIDs using the ORCID Summaries Dump (e.g. ORCID_2019_summaries). The output is a folder containing CSV files with two columns, 'id' and 'value', where 'id' is a DOI or None and 'value' is an ORCID. This process can be run via the following command:
python -m oc_meta.run.orcid_process -s <PATH> -out <PATH> -t <INTEGER> -lm -v
Where:
- -s --summaries: ORCID summaries dump path; subfolders will be considered too.
- -out --output: the directory where the output CSV files (i.e. the ORCID-DOI index) will be stored.
- -t --threshold: threshold after which to update the output (a new file is generated each time), not mandatory.
- -lm --low-memory: specify this argument if the available RAM is insufficient to accomplish the task. Warning: the processing time will increase.
- -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.
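As a purely illustrative sketch of the resulting ORCID-DOI index (the DOIs and ORCIDs below are invented), each output CSV simply pairs the two identifiers:

```csv
id,value
10.1000/example1,0000-0000-0000-0001
10.1000/example2,0000-0000-0000-0002
None,0000-0000-0000-0003
```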
crossref_publishers_extractor.py generates an index between Crossref members' IDs, names and DOI prefixes. The output is a CSV file with three columns, 'id', 'name', and 'prefix'.
This process can be run via the following command:
python -m oc_meta.run.crossref_publishers_extractor -o <PATH>
Where:
- -o --output: the output CSV file where the relevant information will be stored.
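For illustration only (the member id, name and prefix below are invented), the resulting index associates each Crossref member with its name and DOI prefix:

```csv
id,name,prefix
1,Example Publishing House,10.1000
2,Another Hypothetical Press,10.2000
```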
The crossref_process.py script generates raw CSV files using the JSON files from the Crossref data dump (e.g. Crossref Works Dump - August 2019), enriching them with ORCID IDs from the ORCID-DOI index generated by orcid_process.py. This process is launched via the command:
python -m oc_meta.run.crossref_process -cf <PATH> -o <PATH> -out <PATH> -w <PATH> -v
Where:
- -cf --crossref: Crossref JSON files directory (input files).
- -p --publishers: CSV file path containing information about publishers (id, name, prefix). This file can be generated via crossref_publishers_extractor.py.
- -o --orcid: ORCID-DOI index filepath, generated by orcid_process.py.
- -out --output: directory where the output CSVs will be stored.
- -w --wanted: path of a CSV file containing the DOIs to process, not mandatory.
- -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.
Since the parameters are numerous, you can also specify them via a YAML configuration file. In this case, the process is launched via the command:
python -m oc_meta.run.crossref_process -c <PATH>
Where:
- -c --config: path to the configuration file.
The configuration file is a YAML file with the following keys (an example can be found in config/crossref_config.yaml).
Setting | Mandatory | Description |
---|---|---|
crossref_json_dir | ✓ | Crossref JSON files directory (input files) |
output | ✓ | Directory where output CSVs will be stored |
orcid_doi_filepath | ☓ | ORCID-DOI index directory. It can be generated via oc_meta.run.orcid_process |
wanted_doi_filepath | ☓ | Path of a CSV file containing the DOIs to process. This file can be generated via oc_meta.run.coci_process, if COCI's DOIs are needed |
verbose | ☓ | Show a loading bar, elapsed time and estimated time. This setting can be safely left as is. |
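As with the Meta configuration, a minimal sketch of the Crossref configuration might look like the following; all paths are hypothetical placeholders, and config/crossref_config.yaml remains the authoritative template:

```yaml
# Minimal sketch of a Crossref-processing configuration (hypothetical paths).
crossref_json_dir: "/data/crossref/json_dump"   # input Crossref JSON files
output: "/data/crossref/output_csv"             # directory where the output CSVs will be stored
orcid_doi_filepath: "/data/orcid_doi_index"     # optional: ORCID-DOI index generated by oc_meta.run.orcid_process
wanted_doi_filepath: "/data/wanted_dois.csv"    # optional: CSV listing the DOIs to process
verbose: True
```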
You can get a CSV file containing all the IDs from citation data organized in the CSV format accepted by OpenCitations. This CSV file can be passed as input to the -w --wanted argument of crossref_process.py. You can obtain this file by using the get_ids_from_citations.py script, in the following way:
python -m oc_meta.run.get_ids_from_citations -c <PATH> -out <PATH> -t <INTEGER> -v
Where:
- -c --citations: the directory containing the citation files, either in CSV or ZIP format.
- -out --output: directory where the output CSV files will be stored.
- -t --threshold: number of files to process before saving the output, not mandatory.
- -v --verbose: show a loading bar, elapsed time and estimated time, not mandatory.
This plugin generates CSVs from the Meta triplestore. You can run the csv_generator.py script in the following way:
python -m oc_meta.run.csv_generator -c <PATH>
Where:
- -c --config: path to the configuration file.
The configuration file is a YAML file with the following keys (an example can be found in config/csv_generator_config.yaml).
Setting | Mandatory | Description |
---|---|---|
triplestore_url | ✓ | URL of the endpoint where the data are located |
output_csv_dir | ✓ | Directory where the output CSV files will be stored |
info_dir | ✓ | The folder where the counters of the various types of entities are stored. |
base_iri | ☓ | The base IRI of entities on the triplestore. This setting can be safely left as is |
supplier_prefix | ☓ | A prefix for the sequential number in entities’ URIs. This setting can be safely left as is |
dir_split_number | ☓ | Number of files per folder. dir_split_number's value must be multiple of items_per_file's value. This setting can be safely left as is |
items_per_file | ☓ | Number of items per file. This setting can be safely left as is |
verbose | ☓ | Show a loading bar, elapsed time and estimated time. This setting can be safely left as is |
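A minimal sketch of the CSV-generator configuration, again with hypothetical values (config/csv_generator_config.yaml is the authoritative template), might look like this:

```yaml
# Minimal sketch of a CSV-generator configuration (hypothetical values).
triplestore_url: "http://localhost:9999/blazegraph/sparql"  # endpoint where the Meta data are located
output_csv_dir: "/data/meta/csv_dump"                       # directory where the output CSV files will be stored
info_dir: "/data/meta/info_dir"                             # folder storing the counters of the various entity types
```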
Before running Meta in multiprocess mode, it is necessary to prepare the input files. In particular, the CSV files must be divided by publisher, and venues and authors that have an identifier must be loaded onto the triplestore, so that no duplicates are generated during multiprocessing. These operations can be done by simply running the following script:
python -m oc_meta.run.prepare_multiprocess -c <PATH>
Where:
- -c --config: path to the same configuration file you want to use for Meta.
Afterwards, launch Meta in multiprocess mode by specifying the same configuration file. All the required modifications are applied automatically.
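In other words, assuming a single configuration file at config/meta_config.yaml (a hypothetical path), the whole multiprocess workflow reduces to running the two commands in sequence with the same -c argument:
python -m oc_meta.run.prepare_multiprocess -c config/meta_config.yaml
python -m oc_meta.run.meta_process -c config/meta_config.yaml
The number of parallel workers is controlled by the workers_number setting in that same configuration file.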