The tasks performed by this software are:
- to create a mapping between the bibliographic resources (BRs) already existing in OpenCitations Meta (OC Meta) and their equivalents in OpenAlex
- to analyse the output of the mapping process with ad hoc processes and functions, with a focus on the provenance of OC Meta BRs that are not mapped to any BR in OpenAlex (unmapped BRs) and on OC Meta BRs that are mapped to more than one BR in OpenAlex (multi-mapped BRs).
The primary goal of the mapping is to add OpenAlex IDs to the metadata of the BRs in OC Meta. Analysing the mapping's output also makes it possible to detect potential inconsistencies in the input collections (i.e. OC Meta and OpenAlex) and to better understand their causes.
The present guide concerns the mapping process. For the analysis of the mapping results, see the README.md documentation file inside "oc_alignoa/analytics".
The process can be launched from the CLI with the following command, executed from inside the omid-openalex directory:
python -m oc_alignoa.main -c <PATH>
Where:
- `-c`, `--config`: path to the configuration file.
A guide on how to write the configuration file is provided here.
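As an illustration of the command-line interface described above, the sketch below shows one way a `-c`/`--config` option could be parsed with `argparse`. It is not taken from the project's source: the `build_parser` helper and the choice to make the option required are assumptions made for the example.

```python
import argparse


def build_parser():
    # Hypothetical helper: mirrors the documented -c / --config option.
    parser = argparse.ArgumentParser(prog="oc_alignoa.main")
    parser.add_argument(
        "-c",
        "--config",
        required=True,  # assumption: the option is treated as mandatory here
        help="path to the configuration file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["-c", "config.yaml"])
print(args.config)
```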
The mapping between the BRs in the two collections is performed by linking them according to the presence, in both collections, of a common external persistent identifier (PID). For example, if a journal article in OC Meta has in its metadata the same DOI as a BR in OpenAlex, the two entities are aligned (Fig 1).
Figure 1. Visual representation of the mapping between a BR in OC Meta and the corresponding BR in OpenAlex.
The output of this mapping process is divided into three separate groups of BRs:
- BRs mapped in a 1:1 ratio, i.e. OC Meta BRs that have been mapped to exactly one BR in OpenAlex each and their corresponding BRs in OpenAlex.
- Multi-mapped BRs, i.e. OC Meta BRs that have been mapped to more than one BR in OpenAlex each and their corresponding BRs in OpenAlex.
- Unmapped BRs, i.e. OC Meta BRs that have not been mapped to any BR in OpenAlex.
The mapping process takes as input the zipped CSV files of an OC Meta dump and the compressed JSON-Lines files of a snapshot of the OpenAlex catalog. The OC Meta CSV dump can be downloaded from Figshare. The OpenAlex snapshot can be downloaded from Amazon S3 following the dedicated guide by the OpenAlex team.
Each row of the OC Meta CSV tables stores basic metadata about a BR. For the purpose of the mapping and the subsequent analysis, we are interested in the `id` field (storing the BR's OMID and, if present, external PIDs such as DOIs or ISSNs) and the `type` field, which specifies the BR type (e.g. journal article, book, journal, dataset).
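The following is a minimal sketch of how the `id` cell of an OC Meta row could be split into the OMID and the external PIDs. The `split_meta_ids` helper is hypothetical (it is not part of the software), and the sample row is a shortened version of the one shown in this guide.

```python
import csv
import io


def split_meta_ids(id_field):
    """Split an OC Meta `id` cell into (omid, list of external PIDs)."""
    omid = None
    external = []
    for token in id_field.split():
        if token.startswith("omid:"):
            omid = token
        else:
            external.append(token)
    return omid, external


# Shortened sample: only the columns relevant to the mapping are kept.
sample = io.StringIO(
    '"id","title","type"\n'
    '"omid:br/060924 doi:10.4230/lipics.fun.2021.11",'
    '"Efficient Algorithms For Battleship","report"\n'
)

for row in csv.DictReader(sample):
    omid, pids = split_meta_ids(row["id"])
    # BRs without external PIDs would be skipped at this point.
    if pids:
        print(omid, pids, row["type"])
```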
# a sample CSV row of OC Meta dump
"id","title","author","issue","volume","venue","page","pub_date","type","publisher","editor"
"omid:br/060924 doi:10.4230/lipics.fun.2021.11","Efficient Algorithms For Battleship","Crombez, Loïc [omid:ra/062503694315 orcid:0000-0002-9542-5276]; Fonseca, Guilherme D. Da [omid:ra/0612046462 orcid:0000-0002-9807-028X]; Gerard, Yan [omid:ra/0612046189 orcid:0000-0002-2664-0650]","","","LIPIcs : Leibniz International Proceedings In Informatics [omid:br/060182 issn:1868-8969]","","2020","report","Schloss Dagstuhl - Leibniz-Zentrum Für Informatik [omid:ra/0607497]","Farach, Martin [omid:ra/0614045842 orcid:0000-0003-3616-7788]; Prencipe, Giuseppe [omid:ra/0625025102 orcid:0000-0001-5646-7388]; Uehara, Ryuhei [omid:ra/0617014675 orcid:0000-0003-0895-3765]"
In the snapshot of the OpenAlex collection, BRs and their metadata are divided into two entity types: Works (e.g. journal articles, books, reports) and Sources (i.e. where Works are contained, e.g. journals, conferences, repositories). Each Work or Source is represented inside the JSON-Lines files as a JSON object; for the mapping we only need to consider its `ids` field, which is itself a JSON object whose keys are PID schemes (e.g. `doi`, `issn`) and whose values are the identifiers for those schemes.
# a sample from the JSON object representing a Source entity in the OpenAlex snapshot
{
"id":"https://openalex.org/S168707975",
"issn_l":"1941-7012",
"issn":[
"1941-7012"
],
"display_name":"Journal of Renewable and Sustainable Energy",
...
"ids":{
"openalex":"https://openalex.org/S168707975",
"issn_l":"1941-7012",
"issn":[
"1941-7012"
],
"mag":"168707975",
"wikidata":"https://www.wikidata.org/entity/Q6295857",
"fatcat":"https://fatcat.wiki/container/xgsd2zrqkzhgpmbd3ofxlhuxci"
},
...
"type":"journal",
...
}
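To make the role of the `ids` object concrete, here is a rough sketch that turns one JSON-Lines record into rows pairing a single external PID with the OpenAlex ID. The `iter_id_rows` function, the `SUPPORTED` table and the `scheme:value` string format are assumptions made for illustration; any normalisation of the identifier values (e.g. stripping URL prefixes) is deliberately left out.

```python
import json

# Assumption: schemes considered for each entity type in this sketch.
SUPPORTED = {"work": ("doi", "pmid", "pmcid"), "source": ("issn", "wikidata")}


def iter_id_rows(json_line, entity_type):
    """Yield one row per supported external PID found in the `ids` object."""
    entity = json.loads(json_line)
    ids = entity.get("ids", {})
    openalex_id = ids.get("openalex", entity.get("id"))
    for scheme in SUPPORTED[entity_type]:
        value = ids.get(scheme)
        if value is None:
            continue
        # Some schemes (e.g. issn) hold a list of values, others a string.
        values = value if isinstance(value, list) else [value]
        for v in values:
            yield {"supported_id": f"{scheme}:{v}", "openalex_id": openalex_id}


line = json.dumps({
    "id": "https://openalex.org/S168707975",
    "ids": {
        "openalex": "https://openalex.org/S168707975",
        "issn_l": "1941-7012",
        "issn": ["1941-7012"],
        "mag": "168707975",
        "wikidata": "https://www.wikidata.org/entity/Q6295857",
    },
})

for row in iter_id_rows(line, "source"):
    print(row)
```

An entity that yields no rows has no mutually supported PID and would simply be absent from the resulting table.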
First, the input data (CSV files of the OC Meta dump and JSON-Lines files of the OpenAlex snapshot) is pre-processed to create new tables:
- a table storing the OMID and the external PIDs of each BR in the OC Meta CSV dump, and optionally its type (`omid`, `ids` and `type` fields, respectively). In this step, OC Meta BRs without external PIDs are discarded, meaning that they will not be included in the produced table and therefore will not be considered for the mapping step.
- two tables, one for OpenAlex Works and the other for OpenAlex Sources, storing the OpenAlex IDs and the external PIDs of each Work/Source in the OpenAlex snapshot, limited to the ID schemes that are also supported by OC Meta. In both tables, each row stores an association between a single, mutually supported external PID (`supported_id` field) and the OpenAlex ID of the resource it is attributed to (`openalex_id` field). In this step, OpenAlex BRs without external PIDs that are also supported by OC Meta are discarded, meaning that they will not be included in the produced tables and therefore will not be considered for the mapping step.
Then, the two tables storing the IDs of the BRs in OpenAlex are converted into an SQLite database to enable faster queries on this data. Each external PID is a primary key, to which a single OpenAlex ID is linked (Fig 2).
Figure 2. Diagram representing the SQLite database storing OpenAlex IDs and the external PIDs associated with them.
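The idea of a table keyed on the external PID can be sketched with Python's built-in `sqlite3` module. The table name `WorksDoi`, the `INSERT OR IGNORE` policy for repeated PIDs and the identifier values below are all made up for the example and do not necessarily reflect the actual database schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Each external PID is a primary key linked to a single OpenAlex ID.
conn.execute(
    "CREATE TABLE WorksDoi ("
    "supported_id TEXT PRIMARY KEY, "
    "openalex_id TEXT NOT NULL)"
)

# Made-up identifiers; the third row repeats a PID and is ignored.
rows = [
    ("doi:10.1234/example.1", "W0000000001"),
    ("doi:10.1234/example.2", "W0000000002"),
    ("doi:10.1234/example.1", "W0000000003"),
]
conn.executemany("INSERT OR IGNORE INTO WorksDoi VALUES (?, ?)", rows)
conn.commit()


def lookup(pid):
    """Return the OpenAlex ID linked to a PID, or None if it is absent."""
    cur = conn.execute(
        "SELECT openalex_id FROM WorksDoi WHERE supported_id = ?", (pid,)
    )
    row = cur.fetchone()
    return row[0] if row else None


print(lookup("doi:10.1234/example.1"))
print(lookup("doi:10.9999/missing"))
```

Because the PID is the primary key, each lookup is an index search rather than a scan of the whole table, which is the reason given above for moving the data into SQLite.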
Finally, the actual mapping step takes place. For each row in the table storing OC Meta IDs (i.e. for each OC Meta BR with external PIDs), every external PID is looked up in the database: when a PID is found, the OMID linked to it in the CSV table and the OpenAlex ID linked to it in the database are aligned. The outcome for each OC Meta BR depends on all of its external PIDs taken together: if all of them point to the same OpenAlex ID, the resource in OC Meta is mapped to a single resource in OpenAlex; if they point to different OpenAlex IDs, the result is a multi-mapped case; if none of the external PIDs is found in the database, no mapping is found for that BR.
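The decision logic of the mapping step can be summarised in a few lines. This is a simplified sketch, not the code of `Mapping.map_omid_openalex_ids`: the `classify` function and the dictionary standing in for the database are hypothetical, and PIDs that are absent from the database are assumed to be simply ignored when the other PIDs of the same BR are found.

```python
def classify(external_pids, lookup):
    """Return (category, sorted OpenAlex IDs) for one OC Meta BR.

    `lookup` is any callable returning the OpenAlex ID for a PID,
    or None when the PID is not in the database.
    """
    found = {lookup(pid) for pid in external_pids}
    found.discard(None)  # assumption: PIDs not found do not affect the result
    if not found:
        return "unmapped", []
    if len(found) == 1:
        return "mapped", sorted(found)
    return "multi_mapped", sorted(found)


# A plain dict with made-up identifiers stands in for the SQLite database.
db = {
    "doi:10.1234/example.1": "W0000000001",
    "pmid:111": "W0000000001",
    "pmid:222": "W0000000002",
}

print(classify(["doi:10.1234/example.1", "pmid:111"], db.get))
print(classify(["doi:10.1234/example.1", "pmid:222"], db.get))
print(classify(["doi:10.9999/missing"], db.get))
```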
The three groups of BRs that make up the output of the mapping process are stored in three separate sets of CSV tables. In the tables for BRs mapped in a 1:1 ratio and in the tables for multi-mapped BRs, each row constitutes an alignment between the BR in OC Meta (represented by its OMID, in the `omid` field) and the BR(s) in OpenAlex (represented by one or multiple OpenAlex IDs, in the `openalex_id` field); the optional `type` field stores the type of the BR as registered in OC Meta. In the tables storing unmapped BRs, each row contains only the OMID and, optionally, the BR type, in the `omid` and `type` fields respectively.
The `utils` module contains the `read_csv_tables` function and the `MultiFileWriter` class, which are used across the whole software to read and write tables.
The source code for the mapping process can be found inside the `mapping` module. There are three classes (Fig 3):
- The `MetaProcessor` class deals with creating the tables storing OC Meta BRs that have external IDs (`process_meta_tables` method).
- The `OpenAlexProcessor` class deals with creating the tables storing OpenAlex BRs that have external IDs also supported by OC Meta (`create_openalex_ids_tables` method) and with converting these tables into SQLite database tables (`create_id_db_table` method).
- The `Mapping` class has only one method, `map_omid_openalex_ids`, which implements the mapping step.
As of now, some software features concern the pre-processing of responsible agents (authors, publishers, editors) and the separate pre-processing of venues (container publications such as journals, conferences, etc.). For example:
- the `MetaProcessor.process_meta_tables` function saves separate tables for all bibliographic resources (in a directory named `primary_ents`), for venues only (in a directory named `venues`) and for responsible agents (in a directory named `resp_ags`)
- the `OpenAlexProcessor` class includes methods for extracting PIDs of other OpenAlex entity types besides Works and Sources (namely Institutions, Publishers, Funders and Authors)

These features are to be considered experimental and are not used for the mapping process.
The `main` module is the executable that runs the whole process (except the result analysis process, which is run separately). The following section illustrates how to write the configuration file storing the arguments passed to the functions called inside main.py.
Function calls in the `main` module (which stores the code to execute the mapping process) take their parameters from a YAML configuration file, whose path is specified in the `--config` argument of the launching command.
The YAML file is read as a nested dictionary, and the arguments in it are grouped under different keys according to the purpose they serve. An example of the configuration file can be found in the config.yaml file. The following illustrates how to fill in the configuration file.
The `meta_tables` key groups the parameters to pass to `MetaProcessor.process_meta_tables()` for creating CSV tables of OC Meta BRs with external PIDs.
- `meta_dump_zip` (str): path to the ZIP file of the OC Meta dump
- `meta_ids_out` (str): path to the directory where to save the CSV tables. Here, the tables will be stored in a subdirectory named `primary_ents`
- `all_rows` (bool): if True, processes all the BRs in the OC Meta CSV dump, regardless of whether a BR already has an OpenAlex ID. If False, only BRs for which the OpenAlex ID is missing are processed.
The `openalex_works` key groups the parameters to pass to `OpenAlexProcessor.create_openalex_ids_tables()` for creating CSV tables of OpenAlex Works with external PIDs also supported in OC Meta.
- `inp_dir` (str): the path to the directory storing the data of OpenAlex Works
- `out_dir` (str): the path to the directory where to store the CSV tables with OpenAlex Works
- `entity_type` (str): the OpenAlex entity type. Since the directory to process contains Works, it must be set to "work".
Groups the parameters to pass to `OpenAlexProcessor.create_openalex_ids_tables()` for creating CSV tables of OpenAlex Sources with external PIDs also supported in OC Meta. The parameters are named the same as in `openalex_works`, as they are used by the same function, but they must be set to deal with OpenAlex Sources instead of Works. The `entity_type` parameter must be set to "source".
The `db_works_doi` key groups the parameters to pass to `OpenAlexProcessor.create_id_db_table()` for creating a database table storing the DOIs of Work entities in OpenAlex.
- `inp_dir` (str): path to the directory containing the CSV tables storing OpenAlex Works with external PIDs (i.e. the same as `openalex_works.out_dir`)
- `db_path` (str): the path to the SQLite database where to store the data
- `id_type` (str): the ID scheme of the IDs to store in the table. Since we want to store Works' DOIs, it must be set to "doi".
- `entity_type` (str): the OpenAlex entity type for the database table to produce. Since we want to store DOIs for Works, it must be set to "work".
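Since the YAML file is read as a nested dictionary, the groups described so far can be pictured as the Python dictionary below. Only the groups whose key names appear in this guide are included; all paths are placeholders, and the `check_config` helper is a hypothetical consistency check written for this example, not a function of the software.

```python
# Placeholder values: replace the paths with real locations.
config = {
    "meta_tables": {
        "meta_dump_zip": "path/to/oc_meta_dump.zip",
        "meta_ids_out": "path/to/meta_ids",
        "all_rows": False,
    },
    "openalex_works": {
        "inp_dir": "path/to/openalex/works",
        "out_dir": "path/to/openalex_works_ids",
        "entity_type": "work",
    },
    "db_works_doi": {
        "inp_dir": "path/to/openalex_works_ids",
        "db_path": "path/to/openalex.db",
        "id_type": "doi",
        "entity_type": "work",
    },
}


def check_config(cfg):
    """Return a list of violations of the constraints stated in this guide."""
    problems = []
    if cfg["openalex_works"]["entity_type"] != "work":
        problems.append("openalex_works.entity_type must be 'work'")
    if cfg["db_works_doi"]["inp_dir"] != cfg["openalex_works"]["out_dir"]:
        problems.append("db_works_doi.inp_dir must equal openalex_works.out_dir")
    if cfg["db_works_doi"]["id_type"] != "doi":
        problems.append("db_works_doi.id_type must be 'doi'")
    if cfg["db_works_doi"]["entity_type"] != "work":
        problems.append("db_works_doi.entity_type must be 'work'")
    return problems


print(check_config(config))
```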
These group the parameters to pass to `OpenAlexProcessor.create_id_db_table()` for creating database tables for PMIDs and PMCIDs of Works, and for ISSNs and Wikidata IDs of Sources. They follow the same logic as the parameters in `db_works_doi`, but the argument values must be adapted (except `db_path`).
When processing Works, `entity_type` must be set to "work" and `id_type` must be set to "pmid" or "pmcid", depending on the table to produce; when processing Sources, `entity_type` must be set to "source" and `id_type` must be set to "issn" or "wikidata".
Groups the parameters to pass to `Mapping.map_omid_openalex_ids()` for creating the mapping.
- `inp_dir` (str): the directory where the tables storing OC Meta BRs with external PIDs are saved, i.e. the directory "primary_ents" inside `meta_tables.meta_ids_out`
- `db_path` (str): the path to the database storing OpenAlex BRs
- `out_dir` (str): the directory where to save the tables storing BRs mapped in a 1:1 ratio
- `multi_mapped_dir` (str): the directory where to save the table storing multi-mapped BRs
- `non_mapped_dir` (str): the directory where to save the table storing unmapped BRs
- `type_field` (bool): if True, always write the `type` field in the tables.
- `all_rows` (bool): if True, processes all the BRs in the input table, regardless of whether a BR already has an OpenAlex ID. If False, only BRs for which the OpenAlex ID is missing are processed.