Developers Italia provides a catalog of Free and Open Source software aimed at Public Administrations.

This crawler finds and retrieves the `publiccode.yml` files from the organizations publishing the software that have registered through the onboarding procedure.

The generated YAML files are then used by the developers.italia.it build to generate its static pages.

The crawler can either run manually on the target machine or be deployed as a Docker container with its Helm chart in Kubernetes.

Elasticsearch 6.8 is used to store the data and must be ready to accept connections before the crawler is started.
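You can verify that Elasticsearch is up and accepting connections before launching the crawler; a minimal probe, assuming a local instance on the default port 9200:

```
curl -s http://localhost:9200/_cluster/health?pretty
```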
- `cd crawler`
- Save the auth tokens to `domains.yml`.
- Rename `config.toml.example` to `config.toml` and set the variables.

  NOTE: The application also supports environment variables in substitution to the `config.toml` file. Remember: environment variables get higher priority than the ones in the configuration file.

- Build the crawler binary with `make`.
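Put together, a minimal setup session might look like the following sketch (the commented last lines illustrate the environment variable override mechanism with a hypothetical variable name, not one taken from `config.toml.example`):

```
cd crawler
cp domains.yml.example domains.yml   # then fill in your auth tokens
cp config.toml.example config.toml   # then set the variables
make
# environment variables override the configuration file, e.g.:
# SOME_VARIABLE=value bin/crawler crawl whitelist/*.yml
```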
The repository has a `Dockerfile`, used to build the production image, and a `docker-compose.yml` file to set up the development environment.
- Copy the `.env.example` file into `.env` and edit the environment variables as it suits you. `.env.example` has detailed descriptions for each variable.

  ```
  cp .env.example .env
  ```

- Save your auth tokens to `domains.yml`:

  ```
  cp crawler/domains.yml.example crawler/domains.yml
  editor crawler/domains.yml
  ```

- Start the environment:

  ```
  docker-compose up
  ```
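Once the containers are up, the usual docker-compose commands apply; for example, you can follow the output with:

```
docker-compose logs -f
```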
Gets the list of organizations in `whitelist/*.yml` and starts to crawl their repositories. If it finds a blacklisted repository, it removes it from Elasticsearch, if present.
It also generates:

- `amministrazioni.yml`, containing all the Public Administrations with their name, website URL and iPA code.
- `softwares.yml`, containing all the software that the crawler scraped, validated and saved into Elasticsearch. The structure is similar to the publiccode data structure, with some additional fields like vitality and vitality score.
- `software-riuso.yml`, containing all the software in `softwares.yml` having an iPA code.
- `software-open-source.yml`, containing all the software in `softwares.yml` with no iPA code.
- `https://crawler.developers.italia.it/HOSTING/ORGANIZATION/REPO/log.json`, containing the logs of the scraping for that particular `REPO` (e.g. `https://crawler.developers.italia.it/github.com/italia/design-scuole-wordpress-theme/log.json`).
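A full crawl might be launched like this sketch (assuming the binary built with `make` and the default whitelist layout):

```
bin/crawler crawl whitelist/*.yml
```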
In this mode a single repository at a time will be evaluated. If the organization is present, its iPA code will be matched with the ones in the whitelist; otherwise it will be set to null and the slug will have a random code at the end (instead of the iPA code). Furthermore, the iPA code validation, which is a simple check within the whitelists (to ensure that the code belongs to the selected PA), will be skipped. If it finds a blacklisted repository, it will exit immediately.
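A single-repository run might look like the following sketch (the argument order is an assumption based on the description above; the URL is reused from the log example):

```
bin/crawler one https://github.com/italia/design-scuole-wordpress-theme whitelist/*.yml
```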
- `bin/crawler updateipa` downloads iPA data and writes it into Elasticsearch.
- `bin/crawler delete [URL]` deletes software from Elasticsearch using the code hosting URL specified in its `publiccode.url`.
- `bin/crawler download-whitelist` downloads organizations and repositories from the onboarding portal repository and saves them to a whitelist file.
The whitelist directory contains the list of organizations to crawl from. `whitelist/manual-reuse.yml` is a list of Public Administration repositories that for various reasons were not onboarded with developers-italia-onboarding, while `whitelist/thirdparty.yml` contains the non-PA repositories.
Here's an example of how the files might look:

```yaml
- id: "Comune di Bagnacavallo" # generic name of the organization.
  codice-iPA: "c_a547" # the iPA code of the organization.
  organizations: # list of organization URLs.
    - "https://github.com/gith002"
```
Blacklists are needed to exclude individual repositories that are not in line with our guidelines. You can set `BLACKLIST_FOLDER` in `config.toml` to point to a directory where blacklist files are located. Blacklisting is currently supported by the `one` and `crawl` commands.
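As a sketch, pointing the crawler at a custom blacklist directory could look like this (the directory name is a placeholder, and passing `BLACKLIST_FOLDER` as an environment variable relies on the override mechanism described above):

```
BLACKLIST_FOLDER=blacklists/ bin/crawler crawl whitelist/*.yml
```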
- publiccode-parser-go: the Go package for parsing `publiccode.yml` files
Developers Italia is a project by AgID and the Italian Digital Team, which developed the crawler and maintains this repository.