In Italy, there is no official Open Data available about the performance (delays, cancellations, ...) of Italian public rail transport. This project offers a tool that allows anyone to gather this data and run statistics and visualizations on it.
flowchart TB
S[Scraper] --> |Downloads data| D("ViaggiaTreno and Trenord APIs")
S -->|Produces| P[(Daily .pickle dumps)]
E[Extractor] -->|Reads| P
E[Extractor] -->|Produces| C[(Daily .CSV dumps)]
A2["(BYOD Analyzer)"] -.->|Reads| C
A[Analyzer] -->|Reads| C
A[Analyzer] -->|Produces| K(Stats, visualizations, etc...)
The application is composed of multiple modules, accessible via the CLI:
- scraper: unattended script that incrementally downloads and preserves the current status of the Italian railway network. If run constantly (e.g. ~every hour using cron), all trains will be captured and saved in data/%Y-%m-%d/trains.pickle.
- train-extractor and station-extractor: convert raw scraped data to usable .csv files;
- analyze: shows reproducible stats and visualizations.
The project is written in Python and it uses modern typing annotations, so Python >= 3.11 is needed.
A Dockerfile is available to avoid installing the dependencies manually. You can use the automatically updated ghcr.io/marcobuster/railway-opendata:latest Docker image if you want the latest version available on the master branch.
For instance, the following command will start the scraper on your machine.
$ docker run -v ./data:/app/data ghcr.io/marcobuster/railway-opendata:latest scraper
⚠️ WARNING: this project currently uses the built-in hash(...) function to quickly index objects. To ensure reproducibility between runs, you need to disable Python's hash seed randomization by setting the PYTHONHASHSEED=0 environment variable. If you fail to do so, the software will refuse to start.
$ export PYTHONHASHSEED=0
$ virtualenv venv
$ source ./venv/bin/activate
$ pip install -r requirements.txt
$ python main.py ...
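The start-up check mentioned in the warning above can be pictured with a few lines of Python. This is only a minimal sketch under stated assumptions, not the project's actual code: the function names `hash_seed_is_fixed` and `require_fixed_hash_seed`, and the error message, are hypothetical.

```python
import os
from collections.abc import Mapping


def hash_seed_is_fixed(environ: Mapping[str, str] = os.environ) -> bool:
    """Return True only when PYTHONHASHSEED is set to "0"."""
    return environ.get("PYTHONHASHSEED") == "0"


def require_fixed_hash_seed(environ: Mapping[str, str] = os.environ) -> None:
    """Raise an error when hash randomization has not been disabled.

    Hypothetical helper: the real tool may report the problem differently.
    """
    if not hash_seed_is_fixed(environ):
        raise RuntimeError("PYTHONHASHSEED=0 must be set before starting")
```

A program would call `require_fixed_hash_seed()` once, before any value returned by hash(...) is stored or compared across runs.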
-
Start the scraper. For continuous data collection, it should be run roughly every hour.
$ python main.py scraper
-
Extract train data from a pickle file and save it in CSV.
$ python main.py train-extractor -o data/2023-04-29/trains.csv data/2023-04-29/trains.pickle
-
Extract station data from a pickle file and save it in GeoJSON.
$ python main.py station-extractor -f geojson data/stations.pickle
-
Describe a dataset and filter observations by date.
$ python main.py analyze --start-date 2023-05-01 --end-date today data/stations.csv data/2023-05-*/trains.csv --stat describe
-
Show delay stats of the last stop.
$ python main.py analyze --group-by train_hash --agg-func last [..]/stations.csv [..]/trains.csv --stat delay_box_plot
-
Show daily train count grouped by railway companies.
$ python main.py analyze --group-by client_code [..]/stations.csv [..]/trains.csv --stat day_train_count
-
Display an interactive map and open it in the web browser.
$ python main.py analyze [..]/stations.csv [..]/trains.csv --stat trajectories_map
-
Display a timetable graph.
$ python main.py analyze [..]/stations.csv [..]/trains.csv --stat timetable --timetable-collapse
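If you prefer to feed the daily CSV dumps to your own scripts, the data/%Y-%m-%d/ directory layout used in the commands above makes it easy to enumerate the files for a date range. The sketch below is an assumption-laden illustration: `daily_csv_paths` is a hypothetical helper, it assumes each day's trains.csv was written next to its pickle, and it does not reproduce the analyzer's own --start-date / --end-date logic.

```python
from datetime import date, timedelta
from pathlib import Path


def daily_csv_paths(root: Path, start: date, end: date) -> list[Path]:
    """List <root>/%Y-%m-%d/trains.csv for each day from start to end inclusive."""
    days = (end - start).days
    return [
        root / (start + timedelta(days=i)).strftime("%Y-%m-%d") / "trains.csv"
        for i in range(days + 1)
    ]


paths = daily_csv_paths(Path("data"), date(2023, 5, 1), date(2023, 5, 3))
for path in paths:
    print(path)
```

The function only builds paths; checking that each file exists (for days on which the scraper did not run) is left to the caller.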
Column | Data type | Description | Notes |
---|---|---|---|
code | String | Station code | This field is not actually unique. One station can have multiple codes |
region | Integer | Region code | If zero, unknown. Used in API calls |
long_name | String | Station long name | |
short_name | String | Station short name | Can be empty |
latitude | Float | Station latitude | Can be empty |
longitude | Float | Station longitude | Can be empty |
In the extracted trains CSV, each row is a train stop (not a station or a train), so many train-level fields are duplicated across rows.
Column | Data type | Description | Notes |
---|---|---|---|
train_hash | MD5 hash | Unique identifier for a particular train | |
number | Integer | Train number | Can't be used to uniquely identify a train (see footnote 1) |
day | Date | Train departure date | |
origin | Station (code) | Train absolute origin | |
category | String | Train category | See the table in footnote 2 |
destination | Station (code) | Train final destination | |
client_code | Integer | Railway company | See the table in footnote 3 |
phantom | Boolean | True if the train was only partially fetched | Trains with this flag can be safely ignored |
trenord_phantom | Boolean | True if the train was only partially fetched using Trenord APIs | Trains with this flag can be safely ignored (see footnote 4) |
cancelled | Boolean | True if the train is marked as cancelled | Not all cancelled trains are marked as cancelled: for more accuracy, you should always check stop_type |
stop_number | Integer | Stop progressive number (starting at 0) | |
stop_station_code | Station (code) | Stop station code | |
stop_type | Char | Stop type | P if first, F if intermediate, A if last, C if cancelled |
platform | String | Stop platform | Can be empty |
arrival_expected | ISO 8601 | Stop expected arrival time | Can be empty |
arrival_actual | ISO 8601 | Stop actual arrival time | Can be empty |
arrival_delay | Integer | Stop arrival delay in minutes | Empty if arrival_expected or arrival_actual is empty |
departure_expected | ISO 8601 | Stop expected departure time | Can be empty |
departure_actual | ISO 8601 | Stop actual departure time | Can be empty |
departure_delay | Integer | Stop departure delay in minutes | Empty if departure_expected or departure_actual is empty |
crowding | Integer | Train crowding in percentage | Reported by Trenord |
See CONTRIBUTING.md.
The ViaggiaTreno APIs are known to be buggy and unreliable.
As stated before, many fields (like departure_expected and arrival_expected) are not always guaranteed to be present and some concepts are counter-intuitive (a train number is not a unique identifier, nor are station codes).
ViaggiaTreno is the main source of truth for many end-user applications (like Trenìt! or Orario Treni) and is itself linked on the official Trenitalia website.
For instance, if the API does not return information for a train stop, no other application will display it: the data simply does not exist online.
The scraper always tries to save as much data as possible ("best effort"), even when it is probably incomplete; in those cases, proper flags (like phantom and trenord_phantom) are set so developers can decide for themselves how to handle the affected trains.
Copyright (c) 2023 Marco Aceti. Some rights reserved (see LICENSE).
The terms and conditions of the ViaggiaTreno web portal state that copying is prohibited (except for personal use) as all rights for the content are reserved to the original owner (Trenitalia or Gruppo FS). In July 2019 Trenitalia sued Trenìt for using train data in its app, but partially lost. I think data about the performance of public transport should be open as well, but I'm not a lawyer and I'm not willing to risk lawsuits by redistributing data; if someone wants to, the tool is now available.
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
Footnotes
1. In Italy, two different trains can share the same number. A train is only uniquely identified by the triple (number, origin, day).
2. Known categories are listed below.

   Category | Description |
   ---|---|
   REG | Regional trains |
   MET | Metropolitan trains |
   FR | Frecciarossa (red arrow) |
   IC | Intercity |
   ICN | Intercity Night |
   EC | Eurocity |
   FB | Frecciabianca (white arrow) |
   FA | Frecciargento (silver arrow) |
   EN | EuroNight |
   EC ER | Eurocity |

3. Known client codes are listed below.

   Client code | Railway company |
   ---|---|
   1 | TRENITALIA_AV |
   2 | TRENITALIA_REG |
   4 | TRENITALIA_IC |
   18 | TPER |
   63 | TRENORD |
   64 | OBB |

4. This flag is activated when a train is seen on ViaggiaTreno APIs and marked as Trenord's but it can't be fetched on Trenord's APIs.