
Populate and use started_at column in trip table for disambiguating trips #125

Open
cedarbaum opened this issue Aug 24, 2023 · 3 comments

@cedarbaum
Collaborator

cedarbaum commented Aug 24, 2023

The started_at field in the trip table is currently neither populated nor used. There are two reasons it could be useful:

  1. Per this article, the start date of a trip can be necessary when joining realtime data with static data.
  2. It can disambiguate duplicate trip IDs that are active at the same time (e.g., Trip_1 starts at 11:30PM and then an overlapping Trip_1 starts at 12:01AM the next day).

(1) is a longer term concern associated with the work described in #11, but (2) could prove useful for general data integrity with the existing API.

To accomplish this, transiter.public.trip.trip_route_pk_id_key will have to be changed to incorporate the started_at field as well (e.g., transiter.public.trip.trip_started_at_route_pk_id_key). This will break the assumption used throughout the system that, at any given point, there is a unique trip ID per route. For example, /systems/{system_id}/routes/{route_id}/trips/{trip_id} always returns a single trip. I believe this can be mostly solved with the changes below:

  1. Whenever a Trip or Trip.Reference is returned by the API, also return the started_at field.
  2. For the .../trips/{trip_id} endpoint, add an optional query parameter ?started_at={date} to disambiguate multiple trips with the same ID. If this query parameter is not provided, always return the earliest trip. I believe the default case matches what would happen today, since the later trip could not be added to the table until the earlier trip ends.
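
A minimal sketch of the lookup in (2), written in Python for illustration rather than taken from Transiter's code. The `Trip` class, the `find_trip` function and the in-memory list are hypothetical stand-ins for the trip table; only the started_at field and the "earliest trip by default" rule come from the proposal above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Trip:
    # Hypothetical stand-in for a row in the trip table.
    route_id: str
    trip_id: str
    started_at: datetime


def find_trip(
    trips: List[Trip],
    route_id: str,
    trip_id: str,
    started_at: Optional[datetime] = None,
) -> Optional[Trip]:
    """Return the trip for (route_id, trip_id), or None if nothing matches.

    If started_at is given, it must match exactly. Otherwise the earliest
    matching trip is returned, mirroring the proposed default.
    """
    matches = [t for t in trips if t.route_id == route_id and t.trip_id == trip_id]
    if started_at is not None:
        matches = [t for t in matches if t.started_at == started_at]
    return min(matches, key=lambda t: t.started_at, default=None)


# Two overlapping trips that share a trip ID, as in reason (2) above.
trips = [
    Trip("A", "Trip_1", datetime(2023, 8, 25, 0, 1)),
    Trip("A", "Trip_1", datetime(2023, 8, 24, 23, 30)),
]
```

With this data, `find_trip(trips, "A", "Trip_1")` picks the 11:30PM trip, and passing `started_at=datetime(2023, 8, 25, 0, 1)` selects the later one.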

@jamespfennell please let me know if you agree with the above problem statement and if you think this sounds like a workable solution.

@jamespfennell
Owner

Ah this is gnarly :) I think your solution would certainly work, though it may make the API somewhat confusing because it kind of breaks the REST semantics.

I wonder as an alternative could we transform the trip ID for every realtime trip that comes into Transiter, and use that as the "trip ID" that we store in the database? We could then persist the regular trip ID as original_trip_id or something like that. The transformed trip ID could be <trip_id>_<started_at_date>_<started_at_time>, with defaults of 00-00-000 and 00:00 if the started at fields are not provided.

@cedarbaum
Collaborator Author

Good points! I think changing the trip id format is definitely nicer from a REST perspective, but my concern is that it loses the 1-to-1 mapping with the source GTFS content. A couple other solutions I was thinking about:

  1. Change the .../trips/{trip_id} endpoint to return a list of trips with further references to individual trip URIs: .../trips/{trip_id}/{start_time}.
  2. Allow both .../trips/{trip_id} and .../trips/{trip_id}_{start_time} to match the same resource in cases where there is no ambiguity. If multiple trips are running with the same (GTFS) trip ID, then only allow .../trips/{trip_id}_{start_time} to work and return 404 otherwise.
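
Option (2) could look roughly like the sketch below. It is in Python for illustration only; `resolve_trip` and the `active` dictionary are hypothetical, and returning `None` stands in for a 404 response.

```python
from typing import Dict, Optional, Tuple


def resolve_trip(active: Dict[Tuple[str, str], str], requested: str) -> Optional[str]:
    """Resolve the path segment after .../trips/ to a single trip.

    `active` maps (trip_id, start_time) to a trip (a plain string here).
    Returns None, standing in for a 404, when nothing matches or when a
    bare trip ID is ambiguous.
    """
    # The qualified form {trip_id}_{start_time} always identifies one trip.
    for (trip_id, start_time), trip in active.items():
        if requested == f"{trip_id}_{start_time}":
            return trip
    # The bare {trip_id} form only works if exactly one such trip is running.
    matches = [trip for (trip_id, _), trip in active.items() if trip_id == requested]
    return matches[0] if len(matches) == 1 else None
```

This sketch ignores the corner case where one trip's bare ID happens to equal another trip's qualified form.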

Curious to know your thoughts! Also I should mention this isn't, from my view, an urgent issue, just something I was thinking about while reading GTFS documentation. My intuition is it's relatively rare in the wild, but I've neither run into it nor gone out of my way to look for such cases as of yet.

@jamespfennell
Owner

I believe we could maintain the 1-1 mapping. Suppose we had the following convention for the "normalized trip ID":

  • If start date and time are set, make trip ID <trip_id>_YYYYMMDD_HHMMSS
  • If only start date is set, make trip ID <trip_id>_YYYYMMDD_
  • If only start time is set, make trip ID <trip_id>__HHMMSS
  • If neither is set, make trip ID <trip_id>__

In this case given a normalized trip ID, you can split on the last two _ to get the original trip ID, start date and time back, irrespective of the structure of trip ID (which may itself contain _ characters).
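
As a sketch of that round trip, in Python for illustration and not taken from Transiter: the function names are hypothetical, and the start date and time are assumed to already be formatted as YYYYMMDD and HHMMSS strings that contain no _ characters.

```python
from typing import Optional, Tuple


def normalize_trip_id(
    trip_id: str,
    start_date: Optional[str] = None,
    start_time: Optional[str] = None,
) -> str:
    # A missing date or time becomes an empty segment, so the result always
    # ends with exactly two "_"-separated fields after the original trip ID.
    return f"{trip_id}_{start_date or ''}_{start_time or ''}"


def parse_normalized_trip_id(normalized: str) -> Tuple[str, Optional[str], Optional[str]]:
    # Split on the last two "_" only, so "_" inside the original trip ID survives.
    # Raises ValueError if the input has fewer than two "_" characters.
    trip_id, start_date, start_time = normalized.rsplit("_", 2)
    return trip_id, start_date or None, start_time or None
```

For example, `normalize_trip_id("a_b", "20230824")` gives `a_b_20230824_`, and parsing that string returns `("a_b", "20230824", None)`.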

As you say in option (2), we could also have options where this other trip ID is an alias for the regular trip ID, and so the API would work with either option as long as there is no ambiguity.

Our conversation so far has been very theoretical :) I think it would be interesting to find examples of systems that have this issue (maybe Amtrak?). It also seems to be related to the GTFS frequencies.txt file, which is a way to define many trips with the same trip ID but offset from each other.
