-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle vehicle GTFS entities #110
Conversation
321975c
to
d317efc
Compare
Thank you for this PR - it looks like it was a lot of work!! The scope was bigger than I thought: I didn't realize we could support geographic search for vehicles, but it's a great idea! Also I appreciate the effort you put into the hugely helpful PR description. Because this is a large PR, I'll just make bigger picture comments in this round of review and then in a subsequent round can look at more detail at the code, if that sounds okay. First, addressing the 3 things you highlighted at the start:
Given the scope of the feature, I think it's the right decision to ignore non-id vehicles for the moment. Adding support later will probably be fine. There is also the possibility we never will support these - it's not clear that these vehicles exist in the wild?
I'm personally a big fan of the data integrity guarantees that foreign keys give, so instinctively I don't love the idea of the Btw, just you mentioned insert/update errors and deadlocks so I just want to clarify what Postgres gives us here. If we have two concurrent updates that are inconsistent (say a vehicle is being inserted that foreign keys into a preexisting trip, but we're concurrently deleting that trip), Postgres will fail one of the transactions, the associated feed update will fail, and the other feed update will succeed. The end state in the database will never be inconsistent. It will be equivalent to as if just one of the updates was never started at all. This is a guarantee we get from Postgres's implementation of transactions. For realtime feeds, where by default we update every 5 seconds, this happening very occasionally is totally fine in my opinion. We'll just retry the failing update, and given the new state of the database the retry will succeed. Finally, feed updates in Transiter are fast (~400ms for the 123456 NYC subway feed, which is the slowest one) so overlapping feed updates are not super common? But maybe the bus is worse.
This is tricky! When thinking about adding vehicle support I never realized this would happen! If I understand correctly, the problem is that we may get a trip update that has Two thoughts. Firstly, adding the new config seems like the simplest solution and so I'm totally happy to do that for the moment, especially given the scope of this work. In general though I don't love these kinds of configs because to use them correctly you have to learn something about the feed dynamics. The ideal case is that Transiter works out of the box and handles whatever is thrown at it. But again, it seems like a good thing to do for the moment. Second, I wonder is the following a general solution to this issue. We could add the
If this sounded good we could do this in a follow-up. |
Thanks for the initial feedback! Adding some follow-ups below:
Sounds like we're aligned here!
Your understanding is correct here. While running updates using a strict foreign key requirement, there would occasionally be transaction failures and deadlocks at the DB level. However, I unfortunately don't have a good sense for how often this occurs with the MTA buses feed. My initial implementation was based on the initial schema design with a strict foreign key requirement, so I can try to rerun this and get a better sense for how often it happens. I am happy to go ahead with either design and I may have gotten too caught up in trying to get to 0 transaction conflicts 😅. If you have time and would also like to experiment with the different implementations in practice, I have a branch for each:
Aside from how these implementations relate vehicles to trips, they should behave the same.
Definitely agree that this is not a great long-term solution. That being said, I think it is a sufficiently complex problem that a follow-up would be better (especially since this change is already a bit large). Please let me know if there's anything from my end needed for this currently (e.g., adding comments or creating another task/issue). |
That's interesting about seeing the transaction failures on the bus system! Do you happen to have the logs that you see in that case? Would definitely be good to look out for when doing future iterations of this work. Overall, everything here sounds good. I left a few comments so just waiting for those small changes I guess! |
I ran the https://github.com/cedarbaum/transiter/tree/vehicles_trip_pk branch for ~2.5 hours and it looks like it encountered deadlocks 9 times. Below are the transiter logs as well as an example deadlock error description from the DB. So it's not too bad, but also not quite as rare as I initially hoped. At this point I have both implementations ( Transiter logs (updating trip and vehicle entities every 5 seconds for
An example error from the DB logs:
|
Thanks for the logs! Super interesting. I filed #114 as a follow-up for this issue. As I mention there, there is a correctness concern that is perhaps even worse than the deadlocks. If we have a trip, and then we update the vehicles feed that references the trip but with Because of this I definitely think we should be conservative here and tackle the full solution separately. For the moment your new setting |
Just read through that issue again and I think we're in agreement for the necessity of To clarify the above experiment, this was done with with the We then also have the design where So I would argue each approach, in conjunction with the |
Ah, thanks for the clarification! So looking more closely at the error, it seems that we're concurrently deleting the trip (because it has been removed from the trips feed) but updating a vehicle to point at the trip (because the trip is still referenced in the vehicles feed). And this just keep repeating until one feed update wins due to some timing luck. To me, this is a bug that we should fix (but maybe in a follow-up because it seems pretty delicate?). We should be able to handle concurrent updates like this. If there is a conflict, the resolution should be one of the transactions failing, and the other succeeding, not both failing due to a deadlock. Or maybe both succeeding if we're clever enough. Doing some searching, I think we may be able to solve it by using In terms of how to move forward on the PR, I really don't mind either option ( |
Make sense, thanks! Given the discussion so far, I went ahead and went with the I've sent a revision that uses the original trip FK design. The main changes from the original revision will be at the query layer and the |
api/public.proto
Outdated
optional double latitude = 3; | ||
optional double longitude = 4; | ||
optional float bearing = 5; | ||
optional int64 updated_at = 6; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific reason to have these in the reference type? In general I've tried to keep the reference type as small as possible for API simplicity and to keep the code simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought here was these would be the properties that a consumer of trips would be most likely interested in. For example, if you get a trip, you can quickly also pinpoint the vehicle on a map without another API call. This is sort of similar to trip references, which also has some commonly needed properties.
I am totally fine either removing these fields or only including them with a query param option if you think this is too much data for a reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would prefer not to have them if that was okay? We could always add them back in down the line if it turned out to make consuming the API much easier?
Another thing we could do is add a vehicle reference as a field to the trip reference - thus in the GetStops view, say, we could go directly to GetVehicle rather than having to poll GetTrip first. (I understand I'm somewhat contradicting myself by proposing putting another field on the trip reference while arguing against fields on the vehicle reference...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good - I simplified vehicle references to just be the id/resource of the vehicle in the latest iteration.
SGTM! Just left one last question about the API and then can submit :) |
Thank you very much for the PR Sam! Including the detailed information in the PR description and your patience going through multiple rounds of review! It's great to have this feature and was a decent chunk of work. By the way, just a note that when I submit PRs I generally squash merge the branch. So writing a PR with multiple commits is totally fine, if that makes life easier for you! |
Overview
This change allows
Vehicle
entities parsed from theGTFS
library to be added to the DB and also implements several API endpoints for retrieving vehicles. There are 3 significant changes/design decisions that are worth calling out and may require further discussion:trip_pk
), thetrip_pk
column is replaced withtrip_id
and the associated trip doesn't have to exist in thetrip
table. This is to avoid cases where concurrent updates totrip
andvehicle
tables interfere with each other and cause insert/update errors (e.g., the associated trip was deleted) or deadlocks.onlyProcessFullEntities
option is added for realtime feeds, which allows entity references (e.g., the minimal trip info in a vehicle entity) to be ignored. This is to stop feeds from conflicting with each other in updates.Schema changes
current_status
andoccupancy_status
to benull
latitude
andlongitude
types tonumeric
to match other GPS typesbearing
andspeed
types toreal
to match GTFS source typesstop_pk
FK tonull
if associated stop is deletedtrip_pk
FK with unique/indexedtrip_id
column (more details in overview)Realtime update changes
updateVehicles
inrealtime
moduleonlyProcessFullEntities
GTFS realtime feed option in bothupdateTrips
andupdateVehicles
API changes
ListVehicles
, which is paginated and supports search by geolocation (similar toListStops
).MaxVehiclesPerRequest
launch option to limit results sizeGetVehicle
Vehicle
references to trips and stop timesVehicle
referencesFeed config changes
onlyProcessFullEntities
GTFS realtime feed option (more details in overview)us-ny-buses
system config changesvehiclePositions
GTFS realtime feedNYCT_BUS_TRIPS
GTFS extension as the system now appears more resilient to duplicate trip IDs. If this is needed again in the future,NYCT_BUS_TRIPS
will have to be modified to also transform trip IDs ofVehicle
entities so they can properly be associated to trips.Testing
realtime
modulestest_vehicles.py