This is a harvester for software metadata. It actively attempts to detect and convert software metadata in source code repositories and converts this to a unified codemeta representation.
The tool is implemented as a simple POSIX shell script that in turn invokes a number of tools to do the actual work:
- codemetapy - Conversion to codemeta from various other metadata formats
- cffconvert - Conversion from CITATION.cff to codemeta
- Software Metadata Extraction Framework (SOMEF) - Analyzes and extracts metadata from README.md and some other sources (optional dependency)
A few simple additional metadata extractions methods, as simple shell scripts, have been implemented alongside the main script.
This harvester can be used for two purposes:
- to harvest a possibly large number of software projects, for instance to make them available in some kind of search portal.
- as a means to produce a
codemeta.json
file for your own project
A docker container can be build as follows:
make docker
A pre-built container image can also be pulled from Docker Hub:
docker pull proycon/codemeta-harvester
Alternatively if you prefer not to use containers, you can also install the software as follows:
- Run
make env
to build a Python virtual environment in theenv
directory with the needed dependencies. This assumes you have a Python installation on your system. - Activate the environment with
. env/bin/activate
whenever you want to use it. - You will need to also ensure to install the following dependencies using your system's package manager
You can use make devenv
if you want to rely on the latest development release of codemetapy, rather than the latest
stable version (this will create a devenv/
dir instead of env/
)
In your project directory, which ideally should be a git clone, you can just run codemeta-harvester to create a codemeta.json
file based on the files in your repository:
codemeta-harvester
You probably use the docker container, then the syntax is as follows:
docker run -v $(pwd):/data proycon/codemeta-harvester
The -v
argument mounts your current working directory in the container, you may adapt it according to your needs.
If you want to regenerate an existing codemeta.json
, rather than use it as input which would be the default
behaviour, then add the --regen
parameter. This overwrites any existing codemeta.json
.
The harvester can make use of the Github/GitLab API to query metdata from GitHub/GitLab, but this allows only limited anonymous requests. Please set the
environment variable $GITHUB_TOKEN
/$GITLAB_TOKEN
to a github personal access token / gitlab personal access token, if you use Docker you should pass it to the container using --env-arg GITHUB_TOKEN=$GITHUB_TOKEN
/--env-arg GITLAB_TOKEN=$GITLAB_TOKEN
.
To harvest and collect metadata from various projects, you need to create configuration files that tells the harvester
where to look. These are simple yaml
configuration files, one for each tool to harvest. They are put into a directory of your choice, and take the following format:
source: https://github.com/user/repo
services:
- https://example.org
The source
property specifies a single source code repository where the source code of the tool lives. This must be git repository that is publicly accessible. Note that you can specify only one repository here, choose the one that is representative for the software as a whole.
The services
property lists zero or more URLs where the tool can be accessed as a service. This may be a web application, simple webpage, or some other form of webservice. For webservices, rather than enumerate all service endpoints individually, this should be pointed to a URL that provides itself provides a specification of endpoints, for example a URL serving a OpenAPI specification. The information provided here will be expressed in the resulting codemeta.json
through the targetProduct
schema.org property as described in issue codemeta/codemeta#271. This links the source code to specific instantiations of the software.
Additional properties you may specify:
root
- The root path in the source code repository where to look for metadata. This can be set if the tool lives as a sub-part of a larger repository. Defaults to the repository root.scandirs
- Sub directories to scan for metadata, in case not everything lives in the root directory.ref
- The git reference (a branch name of tag name) to use. You can set this if you want to harvest one particular version. If not set, codemeta-harvester will check out the latest version tag by default (this assumes you use some kind of semantic versioning for your tags). Only if no tags are present at all, it falls back to using themaster
ormain
branch directly.tagprefix
- A prefix used for the git tags (only applicable in edge cases), the last part of the tag must still comply to a semantic versioning scheme.tagignore
- A regular expression (grep -E
) of git tags to ignore (only applicable in edge cases), by default tags with "alpha", "beta" and like "rc1" will be ignored.
Pass the directory where you put your configurations (or a single configuration file) to codemeta-harvester as follows:
codemeta-harvester /path/to/your/configdir/
Or for Docker:
docker run -v /path/to/your/configdir/:/config -v $(pwd):/data proycon/codemeta-harvester /config
Codemeta-harvester relies codemetapy to combine different input sources into one codemeta.json
, we call this composition.
When a certain input source defines a property (on schema:SoftwareSourceCode
), it will overwrite any values that were set earlier by previous sources. This entails that there is a certain order of precedence in which sources codemeta-harvester considers more important than others. The priority is roughly the following:
codemeta.json
, if this file is provided, the harvest won't look at anything else (aside from the three exceptions mentioned at the end).codemeta-harvest.json
CITATION.cff
- Language specific metadata from
setup.py
,pyproject.toml
,pom.xml
,package.json
and similar. - files such as
LICENSE
,CONTRIBUTORS
,AUTHORS
,README
- Information from git (e.g. contributors, git tag for version, date of first/last commit)
- Information from the github or gitlab API (e.g. project name/description)
Three notable exceptions are:
- For development status, repostatus badge in the
README.md
in the git master/main branch takes precendence over all else (overriding whatever is in codemeta.json!) - For maintainers, the parsing of
MAINTAINERS
in the git master/main branch is always taken into account (merged with anything in codemeta.json) - If the harvester finds a version-specific DOI at Zenodo for your software, it will always use that (overriding whatever is in codemeta.json)
This software was funded in the scope of the CLARIAH-PLUS project.