Add concept of "run" to trips and stop times #195

Closed
wants to merge 1 commit into from


thzinc

@thzinc thzinc commented Dec 18, 2019

Overview

The GTFS specification does not currently represent the work of a driver or operator of a vehicle (a "run", colloquially also called a "paddle"). The current GTFS spec does permit an optional "block_id" to be specified on a trip, which denotes a group of work that a vehicle will accomplish. This pull request amends the spec to allow an optional "run_id" to be specified on a trip, as well as on an individual stop time, to denote the group of work a driver/operator will accomplish.

Background

Syncromatics has been using a nonstandard representation of runs based directly on Schedule Masters' "runcut.txt" file, which uses a cumbersome trip and stop key to identify the starts and ends of runs. This has been very difficult to update manually, which is antithetical to the GTFS spec's "Feeds should be easy to create and edit" guiding principle.

* Added a description of the run_id field.
* Added the run_id field to both trips.txt and stop_times.txt.
* Added "Mid-trip driver changes are indicated in the stop_times.txt file."
* Added "A run_id value specified for one stop_time does not apply to subsequent stop_times in the same trip. The run_id value must be repeated in each subsequent row for the remainder of the trip following a driver switch. If the entire trip is performed by a single driver, use the run_id in trips.txt instead."
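
As a rough illustration of the rule quoted above, here is a minimal Python sketch of how a consumer might work out which run covers each stop time. It assumes that a blank run_id in stop_times.txt falls back to the run_id in trips.txt (by analogy with how trip_headsign and stop_headsign behave); that fallback, the function name, and the sample rows are assumptions made for illustration, not part of the proposed spec text.

```python
import csv
import io

# Hypothetical sample data: trip T1 has a driver switch at stop_sequence 3.
TRIPS_TXT = """trip_id,run_id
T1,R1
T2,
"""

STOP_TIMES_TXT = """trip_id,stop_sequence,stop_id,run_id
T1,1,A,
T1,2,B,
T1,3,C,R2
T1,4,D,R2
T2,1,A,
"""


def effective_run_ids(trips_csv, stop_times_csv):
    """Map (trip_id, stop_sequence) to a run_id, or None when no run is given."""
    trip_runs = {
        row["trip_id"]: (row.get("run_id") or "").strip()
        for row in csv.DictReader(io.StringIO(trips_csv))
    }
    resolved = {}
    for row in csv.DictReader(io.StringIO(stop_times_csv)):
        own = (row.get("run_id") or "").strip()
        # A stop_times value covers that row only, so it is never carried
        # forward; a blank value falls back to the trip-level run_id (assumed).
        run = own or trip_runs.get(row["trip_id"], "")
        resolved[(row["trip_id"], int(row["stop_sequence"]))] = run or None
    return resolved


RESOLVED = effective_run_ids(TRIPS_TXT, STOP_TIMES_TXT)
```

In this sample, the first two stop times of T1 resolve to R1 from trips.txt, and the rows after the driver switch carry R2 explicitly on every row, as the proposed wording requires.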
@googlebot
Collaborator

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.



ℹ️ Googlers: Go here for more info.

@thzinc
Author

thzinc commented Dec 18, 2019

To the maintainers,

We (@Sassy-Tester and I) have been trying to find recent information on how to properly propose this spec change. The third point in CHANGES.md indicates we should announce this in the GTFS-changes group, but that group appears to be private. Any help in properly directing this proposal is greatly appreciated.

@thzinc
Author

thzinc commented Dec 18, 2019

@googlebot I signed it

@googlebot
Collaborator

We found a Contributor License Agreement for you (the sender of this pull request), but were unable to find agreements for all the commit author(s) or Co-authors. If you authored these, maybe you used a different email address in the git commits than was used to sign the CLA (login here to double check)? If these were authored by someone else, then they will need to sign a CLA as well, and confirm that they're okay with these being contributed to Google.
In order to pass this check, please resolve this problem and then comment @googlebot I fixed it.. If the bot doesn't comment, it means it doesn't think anything has changed.


@Sassy-Tester

@googlebot I fixed it.

@googlebot
Collaborator

All (the pull request submitter and all commit authors) CLAs are signed, but one or more commits were authored or co-authored by someone other than the pull request submitter.

We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that by leaving a comment that contains only @googlebot I consent. in this pull request.

Note to project maintainer: There may be cases where the author cannot leave a comment, or the comment is not properly detected as consent. In those cases, you can manually confirm consent of the commit author(s), and set the cla label to yes (if enabled on your project).


@thzinc
Author

thzinc commented Dec 18, 2019

@googlebot I consent.

@skinkie
Contributor

skinkie commented Dec 19, 2019

The way you model it would duplicate your schedule if your trips remain the same, but your run_id would change. In addition, why does this improve travel information to the end-user? Or more specifically: why aren't you using operational data exchange standards where this kind of information is perfectly modelled in the first place?

@amirmatalon

For us at Optibus, this addition to the GTFS schema makes a lot of sense.
It allows us to include the entire crew and vehicle schedule in one place instead of dispersing it in multiple protocols and schemas.

GTFS as a data schema and protocol has outgrown Google and is now used worldwide to exchange public transit data, so any improvement like this is more than welcome and will lead to better cooperation between public transit agencies, operators and local authorities.

@stevenmwhite
Contributor

For us at GMV Syncromatics -- we've been using this format with Optibus, Hastus, and some agencies who build their GTFS files manually and it's dramatically cut down on integration headaches on the operational side of things. GTFS has proven to be a capable and consistently understood format between various players in the industry.

As far as rider-facing information goes, I will admit that the majority of benefits here are related to operational data exchange, but as a daily transit rider myself I will say that particularly in the case of mid-trip relief, driver changes do cause unexpected breaks in the trip and behavior that riders witness. It's not "normal" for your bus driver to pull over and get out of the bus and I've seen an entire bus full of people think there was an emergency situation, when in reality it was a planned driver change in the middle of a trip. Visibility into this on trip planners (for those agencies who choose to make it optionally available and those consumers who choose to optionally utilize it) will help give riders a fuller picture of what to expect throughout their journey.

@tsherlockcraig

At @trilliumtransit we support the reasoning by @amirmatalon and @stevenmwhite above. This information could be effectively utilized by trip planners to provide information important to riders just as blocks are used, and GTFS is increasingly utilized effectively as a standard way to pass information about transit networks between systems, as well as for network analysis and comparison where this information could be useful.

I would also be interested in hearing different approaches that may exist to conveying this information within GTFS by other producers, if those exist.

If GTFS were to incorporate run information, which I believe it should, I think it would be necessary to consider how runs are meant to relate to service days. This was an issue with blocks previously, which was addressed through https://github.com/google/transit/pull/44/files, so this might be an opportunity for the community to comment on that previous change and what effect it may have had in practice on data production and consumption.

@skinkie
Contributor

skinkie commented Dec 19, 2019 via email

@gcamp
Contributor

gcamp commented Dec 19, 2019

I'm mixed on the fact that this information is really about operation and I agree that GTFS generally shouldn't include operational information (even if there's already a bit of information in there).

I'm also not sure if there's any overlap with the #32 proposition.

@barbeau
Collaborator

barbeau commented Jan 30, 2020

I have mixed feelings about including this in GTFS as well. Historically, official GTFS has been strictly about passenger-facing information.

That being said, I think it’s great if producers/consumers can standardize on a format rather than using proprietary methods.

I'd suggest the following:

  1. Have all existing producers/consumers comment on this thread saying they are producing or consuming, and the use case it solves for them
  2. Clearly specify if there is a direct impact on passenger facing info

Also, as mentioned above, note that this doesn't need to be officially adopted into the spec in order to be widely used by producers/consumers. It can live nicely in GTFS feeds next to the officially-defined fields.

@dimnioras

Hi everyone,

Although GTFS was created primarily for service communication to the public, in reality it has also evolved into a data-exchange tool within an agency. Scheduling platforms provide their own proprietary ways of adding operational data to GTFS, and that defeats the purpose of having a single standard.

My opinion is that GTFS should be more than what data we exchange with the public. It is very easy to omit some fields during creation of the feeds, so I don't see any issue having those operational characteristics inside the feed.

Coming to the run_id issue, how about we approach it the same way we approach the headsign? That way, when street reliefs are not used, each trip will have the corresponding run_id; otherwise, it can be part of the stop_times file.

@antrim
Contributor

antrim commented Sep 30, 2020

GTFS was originally for passenger-facing information (@barbeau's earlier comment).

GTFS is being used (and is useful!) for purposes other than traveler-facing information. There is a discussion in Slack about the scope of GTFS. I encourage everyone following along to add to that conversation.

What other uses are there for the "run" concept?

  • Are there other real-time information providers or vendors that would ingest this information?
  • Would this be useful to evaluate and compare scheduling approaches?
  • Would this be useful for configuring APC or other on-board systems?

@e-lo

e-lo commented Sep 30, 2020

What are the pros/cons of separating out:

  1. schedule system --> consumer
  2. schedule system --> operations-oriented software

Note that the Cal-ITP project is working on some background research on this topic and requested that MobilityData form a working group. Please contact us (me) if you would like to be on the research circuit

@colemccarren

Speaking for Central Maryland RTA, we've come to the conclusion that small, local agencies are far more likely to produce accurate GTFS - and maintain that GTFS - when it's not just another duplicative process.

I agree that the concept of "runs" in GTFS would be used mostly in an operational context. But, it would help us immensely in limiting those "duplicative" processes, which in turn helps us achieve that end goal of accurate GTFS information everywhere, including the passenger-facing applications.

I think I read that some larger agencies get around the particular issue this PR attempts to address by creating different GTFS feeds for different consumers. For example, a spec-defined block_id (strictly describing vehicle behavior) might be applied to trip_ids on a data set used by a trip planner, but that same block_id field will be populated by "run_id" values within a data set used by a CAD/AVL system. Nearly everything else in both those two example feeds would be exactly the same.

While that solution may work great for some, I don't see it being a sustainable solution for most smaller agencies. I think there would be great value in attempting to standardize this data within GTFS.

@stevenmwhite
Contributor

@colemccarren Thanks for your input! For what it's worth, we (GMV Syncromatics) serve mostly small and mid-sized agencies (generally 20 - 300 buses) and the main conclusion of our customers is similar to yours. In fact, they don't want to "manage their GTFS" or even think about GTFS specifically -- it should just be a byproduct of managing the operational software systems they use.

As far as this proposal for adding runs to GTFS goes... we've adopted it and are currently using it for data transfer with Optibus, Hastus, Trapeze (customized by a local agency), and Trillium. So whether it's officially in the spec or not, it's in use as a de facto standard by a number of us in the industry. For consumers that don't know or care what the run_id is used for, they simply ignore it if present in the feed.

@skinkie
Contributor

skinkie commented May 5, 2021

@stevenmwhite So whether it's officially in the spec or not, it's in use as a de facto standard by a number of us in the industry.

As you are aware, I am on board with Transmodel specifically for the exchange of information from vendors, for a reason that is not directly evident from this discussion. I imagine that many vendors do things in similar ways and in different ways, but I would like to prevent the information from their systems from having to be "projected" (as in the database term for a transformation) into different semantics than the original system has. To be clear: it is not about exporting the information from a system, but about representing the exact situation the source system stores. Hence: no simplification, inference or remodelling because the target standard states so.

Even if we introduce the concept of run in a csv format, it is not what has been stored in these systems; it is at best a derivative. If you care about your data (and many informational science people do) you don't settle for that. You ask your vendor: give me access to the database. What is lacking in that case is the discussion of vocabulary (this is where Transmodel comes in).

For consumers that don't know or care what the run_id is used for, they simply ignore it if present in the feed.

It will still require that we make the right choice based on cardinality, because, with respect, there are "software architects" that have no clue about the consequences of a choice they make.

@tsherlockcraig

Even if we introduce the concept of run in a csv format, it is not what has been stored in these systems; it is at best a derivative.

I'm not sure that this is exactly true for all such systems, in the context of software that provides some 'runcutting' capabilities for small transit agencies. Have you performed a survey of scheduling systems and have evidence to demonstrate this statement?

Runs are sometimes simply a subset of vehicle blocks in the small urban and rural systems that I'm most familiar with, and while I've not seen the details of the backend databases of all software applications that service these systems, I've seen multiple software applications that don't provide any functionality which couldn't be represented faithfully in this manner, and @stevenmwhite has confirmed above that this model is successfully being used to transfer data between Syncromatics' system and multiple scheduling systems.

If you care about your data (and many informational science people do) you don't settle for that.

This is an opinion. Clearly, some folk don't think this is a matter of settling. For many use cases, the proposed spec is sufficient. Some vendors/agencies have chosen this approach over Transmodel, so apparently in some cases this approach is superior. Their opinions and approaches which are working for them are not invalid.

It will still require that we make the right choice based on cardinality, because, with respect, there are "software architects" that have no clue about the consequences of a choice they make.

I'm unsure of what your technical complaint is in this statement, @skinkie . GTFS data sets regularly include all sorts of extra fields and optional fields. Any "software architect" that works with GTFS has experience with choosing which optional fields will be included in their system or not. Can you explain how this is different than another optional field which might be ignored by a system? For example, block_id. Why is this field harder to deal with than block_id, another field which a system may very well choose to ignore?

@stevenmwhite
Contributor

stevenmwhite commented May 6, 2021

For what it’s worth, we’re not against Transmodel (our parent company is featured with an entire page on the Transmodel website: http://www.transmodel-cen.eu/implementations/spain ) — it just doesn’t have an immediate use to our product in the context of partners we integrate with right now, whereas this solution is simple, easy for people to understand, and has already proven useful in practice.

We prefer not to adopt things that aren’t already accepted into the spec (they may change and thus create more work for us), but this is a simple case where we (along with others) have an actual need to fill right now (this isn’t theoretical) that is solved by this proposal and thus we’ve gone ahead and adopted it regardless of its status in the official spec. It’s been working for us for the past year and a half and the CAD/AVL and scheduling industries don’t seem closer to accepting any other standard, at least here in the US. If there was a big movement towards selecting another standard we’d be happy to participate.

@skinkie
Contributor

skinkie commented May 6, 2021

Even if we introduce the concept of run in a csv format, it is not what has been stored in these systems; it is at best a derivative.

I'm not sure that this is exactly true for all such systems, in the context of software that provides some 'runcutting' capabilities for small transit agencies. Have you performed a survey of scheduling systems and have evidence to demonstrate this statement?

I am aware of the database formats of the key players, and have reverse engineered two of them over the years. None look like a single value attached to a trip. They use independent structures to refer to trips.

This is an opinion. Clearly, some folk don't think this is a matter of settling. For many use cases, the proposed spec is sufficient. Some vendors/agencies have chosen this approach over Transmodel, so apparently in some cases this approach is superior. Their opinions and approaches which are working for them are not invalid.

Sufficient gives me the feeling of an ever-growing product that is joined with glue and staples, not a well thought out plan to model and exchange. This addition in particular is an example of how something that started as "travel information" becomes "operational exchange" without rethinking whether the model aligns with those objectives. I can fit on-demand transport perfectly in GTFS; that the dataset size of a small isle is 1/3 of an entire country... is "suboptimal", or can we state "it was never intended to be used this way". The same will be true for blocks, runs, and what comes after that: duties.

I'm unsure of what your technical complaint is in this statement, @skinkie . GTFS data sets regularly include all sorts of extra fields and optional fields. Any "software architect" that works with GTFS has experience with choosing which optional fields will be included in their system or not. Can you explain how this is different than another optional field which might be ignored by a system? For example, block_id. Why is this field harder to deal with than block_id, another field which a system may very well choose to ignore?

I have already described it in this discussion: #195 (comment)

@tsherlockcraig

tsherlockcraig commented May 7, 2021

I have already described it in this discussion: #195 (comment)

Got it. I'm not sure I fully understand your response (sorry I never followed up, if I remember right I was on my way to a vacation that day). I think what it amounts to is "if runs are added for different days, we're multiplying the number of stop times by the number of days". I'm not sure how this is really an argument against this proposed extension, because we already allow datasets that have a different service_id for each day, and accept the related expansion of stop_times. I suppose having the capability to add runs could feasibly encourage some producers to add more service_ids. Is that the concern?

I am aware of the database formats of the key players

I hear this and agree with you on "the key players" but I'm not honestly concerned about the key players. Personally, I'm concerned about Guadalupe, CA, Pendleton, OR, Ellensburg, WA, with populations of 8, 16, and 21k respectively (and pathetic US transit budgets), and the vendors who serve them now and could feasibly serve them in the future. In today's market, they don't get to have fully interoperable scheduling systems, because Transmodel doesn't make sense for them and GTFS isn't capable, for lack of a column. We want an open market of scheduling applications for many different use cases, not a market dominated by a few key players who only focus on large urban systems.

glue and staples

Fair, but

  1. to be quite honest, GTFS may very well stand for Glue, Tape, and F*ing Staples. It's never been an optimal spec or "a well thought out plan to model and exchange". It's always been a practical tool focused on moving the industry forward in a way that's approachable by a wide variety of agencies and technical competencies. I very much appreciate your and all contributors' considerations of what would make it best, but if you're not proposing a superior alternative other than to adopt Transmodel for operations exchange, which is not a feasible option for Ellensburg, then I think --

  2. there's another nightmare scenario, which is that GTFS breaks apart because it tried to be pure rather than adapt to the times. No, we can't accept every change, but yes, we should encourage GTFS to adapt to the use cases that agencies want to use it for. If we don't, they'll stop using GTFS and adopt proprietary systems that keep their costs low in the short term, or create a GTFS spin-off that adds confusion to the industry.

Interested to hear from @gcamp @barbeau who earlier expressed some hesitancy, but did not state they were opposed to this change. I think the first of @barbeau 's requests feels met to me by Maryland and Syncro's comments. Perhaps it would help to hear more from @amirmatalon .

Sean with your second bullet in #195 (comment) were you looking for a proponent to demonstrate a connection to passenger info (not sure this would be needed, but I'd highlight driver breaks as potentially relevant to customers) or were you looking for an opponent of the extension to demonstrate that it would have an adverse effect on passenger info?

@barbeau
Collaborator

barbeau commented May 18, 2021

I think the first of @barbeau 's requests feels met to me by Maryland and Syncro's comments.

Yes - so if I understand correctly it sounds like this has already been adopted by (at least):

  • Central Maryland RTA
  • Optibus
  • Hastus
  • Trapeze (customized by a local agency)
  • Trillium

IMHO that seems like a strong argument for standardization.

Sean with your second bullet in #195 (comment) were you looking for a proponent to demonstrate a connection to passenger info (not sure this would be needed, but I'd highlight driver breaks as potentially relevant to customers) or were you looking for an opponent of the extension to demonstrate that it would have an adverse effect on passenger info?

I was looking for a proponent to demonstrate a connection to passenger info. If there is a direct connection to passenger info then IMHO there isn't a philosophical obstacle to adopting the feature based on GTFS tradition, which has historically had a strong focus on passenger info features. I think @stevenmwhite makes a good argument that there is a potential rider-facing benefit in #195 (comment) for driver breaks in stop_times.txt. Given our experience with https://github.com/TheTransitClock/transitime I could also see this being useful when generating real-time arrival predictions from vehicle positions as an indicator that the vehicle might pause at a stop and you shouldn't interpret this as a congestion-based delay. Ideally an existing consumer would already be showing this passenger info to end users - not sure if that's currently the case?

Even if there isn't a direct connection to passenger info, though, I think there is still a strong argument for adopting this type of feature if it reduces friction for creating and maintaining GTFS, as mentioned by @colemccarren in #195 (comment) and @stevenmwhite in #195 (comment). It's an indirect but important contributor to good passenger info in that case.

@skinkie
Contributor

skinkie commented May 18, 2021

@barbeau can you please put on your academic hat and review the cardinality issue that everyone wants to avoid?

@ehowington

In my discussions with operators and producers, I have found the following to be helpful when describing how blocks and runs can effectively work within GTFS. Many operators and software developers use different terms, but if run can be codified as block has been in the context of GTFS, that would be helpful.

Blocks = vehicles
Runs = drivers/operators

They are effectively independent and often cross over - neither is a subset of the other. The intersection of a block and a run is a 'piece of work' - sometimes called a job - and is the combo of a driver and a vehicle. A driver will often use more than one vehicle in a given shift/paddle, and a vehicle will almost always have more than one driver in a given service day.

As such, each block_id will likely interact with multiple run_ids, and each run_id will likely be represented on multiple block_ids. (the below diagram is calling runs "paddles")

[Image: blocks_runs_work]
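
To make the block/run relationship described above concrete, here is a small Python sketch that groups trips into "pieces of work", one per (block_id, run_id) combination. The trip list and function names are hypothetical and only meant to show that neither identifier nests inside the other.

```python
from collections import defaultdict

# Hypothetical trips as (trip_id, block_id, run_id).
# Driver R1 works on vehicles B1 and B2; vehicle B1 is driven by R1 and R2.
TRIPS = [
    ("t1", "B1", "R1"),
    ("t2", "B1", "R1"),
    ("t3", "B1", "R2"),
    ("t4", "B2", "R1"),
    ("t5", "B2", "R2"),
]


def pieces_of_work(trips):
    """Group trip_ids by the (block_id, run_id) pair that operates them."""
    pieces = defaultdict(list)
    for trip_id, block_id, run_id in trips:
        pieces[(block_id, run_id)].append(trip_id)
    return dict(pieces)


def runs_per_block(trips):
    """Collect the set of run_ids that appear on each block_id."""
    runs = defaultdict(set)
    for _, block_id, run_id in trips:
        runs[block_id].add(run_id)
    return dict(runs)


PIECES = pieces_of_work(TRIPS)
RUNS_BY_BLOCK = runs_per_block(TRIPS)
```

With this sample data there are four pieces of work, and each block is associated with both runs, which is the many-to-many relationship the comment describes.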

Also, in terms of benefitting public information as found in the GTFS, I think addressing runs would be helpful because GTFS producers are making sacrifices by using block_id to describe drivers instead of vehicles for ingestion into CAD/AVL systems. In particular, trip plans end up broken at driver changes that a rider should not be exposed to, since the vehicle stays the same (e.g., a loop route whose blocking is cut up to account for driver changes, which prevents riders from seeing that no transfer is required).

@dimnioras

This information is very valuable, also thanks for the clear graph. The initial usefulness of block_id in GTFS was so that customers would know if a trip is supposed to leave on time based on the status of a bus on the previous trip, and that makes it more complicated when trying to include scheduling information that is not needed by customers. The beauty of GTFS is the abundance of information in a very simple format.

The real question about the run_id (which is also part of the block_id discussion) is: "Can we make sure that a specific trip (or trip segment for runs) will only have one block_id and one run_id, so that we don't end up with unnecessary duplication?"

@skinkie
Contributor

skinkie commented May 20, 2021

What is the reason that the person that proposes this standard does not want to exchange:

runs.txt
service_date,run_id,trip_id

blocks.txt
service_date,block_id,trip_id
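
A rough sketch of how a consumer might read files shaped like the two headers above, in Python. Only the column names come from the comment; the sample rows, the function name, and the choice to index by (service_date, trip_id) are assumptions for illustration. Values are kept in a list because nothing in this layout prevents several rows for the same trip on the same date.

```python
import csv
import io

# Hypothetical contents; only the header rows are taken from the comment.
RUNS_TXT = """service_date,run_id,trip_id
20210520,R1,t1
20210520,R2,t2
20210521,R3,t1
"""

BLOCKS_TXT = """service_date,block_id,trip_id
20210520,B1,t1
20210520,B1,t2
20210521,B1,t1
"""


def load_assignments(text, id_column):
    """Index an assignment file by (service_date, trip_id)."""
    index = {}
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["service_date"], row["trip_id"])
        index.setdefault(key, []).append(row[id_column])
    return index


RUNS = load_assignments(RUNS_TXT, "run_id")
BLOCKS = load_assignments(BLOCKS_TXT, "block_id")
```

In this sample, trip t1 keeps the same block on both dates while its run changes from R1 to R3, without any change to trips.txt or stop_times.txt.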

@roystonvasey

Yes, but this is unacceptable, and unfeasible for real time matching.

Isn't this already the case with blocks?

Runs are subsets of blocks, which means that one block will have at least one run assigned and runs consist of one (if regular run) or 2 block parts (swing). Technically speaking, just adding the run_id will not create any issues with duplication, as operators pick the runs they will work on (so the run numbers are not based on the operator). So, each block_id will have the same run_id's throughout the schedule and vice-versa.

I am sorry in advance for those of you that already know this information, but I personally don't know of any case that a specific run_id will overlap with multiple blocks at the same time. Personally, I think that working with run_id in the feed in a similar way the headsign info works (you have it in the trips file if it's the same for the whole trip or in the stop_times file every time there's a street relief). Talking about customer information, having the run_id in the stop_times file would be helpful in cases there is a long waiting time at a stop because of that.

We have many instances where trips would have 2 runs, drivers will change mid route as they pass the depot or at a transit centre with rest facilities.
This would break the proposal to append a run_id to a trip and to append a run_id to a stop time.
A solution would be to have an inbound_run_id and an outbound_run_id rather than a single run_id.

@dimnioras

We have many instances where trips would have 2 runs, drivers will change mid route as they pass the depot or at a transit centre with rest facilities.
This would break the proposal to append a run_id to a trip and to append a run_id to a stop time.
A solution would be to have an inbound_run_id and an outbound_run_id rather than a single run_id.

That's why I was proposing that the run_id would behave the same way as the trip_headsign/stop_headsign: attached to the trips file when it is the same for the whole trip or attached to the stop_times file when the run changes mid-route.

@skinkie
Contributor

skinkie commented May 20, 2021

That's why I was proposing that the run_id would behave the same way as the trip_headsign/stop_headsign: attached to the trips file when it is the same for the whole trip or attached to the stop_times file when the run changes mid-route.

This is the reason why in a professional exchange these operations are independent and make a reference to a specific "stop time".

The problem with "the people" here is that they propose a simple addition "because for the simple case it works". Objections are made that this is not a simple thing. Replies become "But it is convenient". Elephant in the room: "But we want driver changes at stop times level"... and then everyone agrees it is not simple.

@tsherlockcraig

tsherlockcraig commented May 21, 2021

My preference for run_id as a field rather than runs.txt as a file is primarily about human readability in small files and ease of maintenance by agencies using spreadsheets. I agree with @skinkie that introducing run_id into stop_times does somewhat concede that we're only focusing on simple use cases, even if I disagree with the apocalyptic tenor of what allowing stop_time level runs would cause.

We have many instances where trips would have 2 runs,

Potential 'compromise' proposal:

  • only add run_id to trips.txt
  • force a producer with 2 runs in a trip to break that trip in 2. trip_short_name and block_id still exist to let the consumer know that the two trips are operated under the same trip name or by the same vehicle.

The 'truly simple' use case would be allowed, the 'slightly less simple' use case would be allowed but the producer would take on responsibility for handing the consumer trips with only one block/one run.
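
For what it's worth, the producer-side split could look roughly like the Python sketch below. The "_a"/"_b" trip_id suffixes, the function name, and the choice to repeat the relief stop in both halves are assumptions for illustration; the compromise itself only says that the trip is broken in two and that trip_short_name and block_id stay available to tie the halves together.

```python
# Hypothetical trip with a driver change at stop_sequence 3.
TRIP = {"trip_id": "t1", "trip_short_name": "101", "block_id": "B1"}
STOP_TIMES = [
    {"trip_id": "t1", "stop_sequence": 1, "stop_id": "A"},
    {"trip_id": "t1", "stop_sequence": 2, "stop_id": "B"},
    {"trip_id": "t1", "stop_sequence": 3, "stop_id": "C"},
    {"trip_id": "t1", "stop_sequence": 4, "stop_id": "D"},
]


def split_trip_at_relief(trip, stop_times, relief_sequence, run_before, run_after):
    """Split one trip into two single-run trips at the relief stop."""
    ordered = sorted(stop_times, key=lambda st: st["stop_sequence"])
    # The relief stop ends the first half and starts the second (assumed).
    first_half = [st for st in ordered if st["stop_sequence"] <= relief_sequence]
    second_half = [st for st in ordered if st["stop_sequence"] >= relief_sequence]
    halves = []
    for suffix, run_id, rows in (("a", run_before, first_half),
                                 ("b", run_after, second_half)):
        # trip_short_name and block_id are copied unchanged to both halves.
        new_trip = dict(trip, trip_id=f"{trip['trip_id']}_{suffix}", run_id=run_id)
        new_rows = [dict(st, trip_id=new_trip["trip_id"]) for st in rows]
        halves.append((new_trip, new_rows))
    return halves


HALVES = split_trip_at_relief(TRIP, STOP_TIMES, 3, "R1", "R2")
```

Each resulting trip carries exactly one run_id in trips.txt, while the shared block_id and trip_short_name let a consumer recognise that the same vehicle continues through the relief point.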

@skinkie is this less objectionable? I'm also not opposed to your proposal (#195 (comment)) I just think it introduces unnecessary complexity for small producers.

@thzinc
Author

thzinc commented May 21, 2021

@dimnioras, indeed the proposed spec change is written this way. (i.e., Mid-trip driver changes are indicated with stop_times.run_id)

@skinkie
Contributor

skinkie commented May 21, 2021

@tsherlockcraig I would still love to hear your opinion on runs.txt / blocks.txt and potentially duty.txt

@tsherlockcraig

tsherlockcraig commented May 21, 2021

@tsherlockcraig I would still love to hear your opinion on runs.txt / blocks.txt and potentially duty.txt

I mentioned in my last comment that I'm not opposed to this approach, even potentially changing blocks to this approach, but that I prefer the approach of this PR.

Specifically:

My preference for run_id as a field rather than runs.txt as a file is primarily about human readability in small files and ease of maintenance by agencies using spreadsheets.

I'm also not opposed to your proposal (#195 (comment)); I just think it introduces unnecessary complexity for small producers.

No one is proposing duty, and I'm not advocating for it. I don't see it as implied by adding runs, as you seem to. Duties fall squarely in active driver management, whereas runs are a precursor to driver management, accounted for in the scheduling process at smaller agencies.

@tsherlockcraig

@skinkie have a thought on my counter-compromise?

Potential 'compromise' proposal:

  • only add run_id to trips.txt
  • force a producer with 2 runs in a trip to break that trip in 2. trip_short_name and block_id still exist to let the consumer know that the two trips are operated under the same trip name or by the same vehicle.

@skinkie
Contributor

skinkie commented May 26, 2021

I think your proposal limits the expressiveness of what people would like to exchange, forcing data to fit a data model as opposed to gradually adding more information. I really fail to understand why a separate runs.txt is perceived as more complex.

@tsherlockcraig

Many small producers build data via literal spreadsheets. A column in a spreadsheet is very easy to keep track of. A separate table must be kept in sync more laboriously. Having the information in the same file is also easier to view within the trips file. I'm not opposed to runs.txt and have used a similar model in the past, but it's definitely more complex for small producers.

@skinkie
Contributor

skinkie commented May 26, 2021

Many small producers build data via literal spreadsheets. A column in a spreadsheet is very easy to keep track of. A separate table must be kept in sync more laboriously. Having the information in the same file is also easier to view within the trips file. I'm not opposed to runs.txt and have used a similar model in the past, but it's definitely more complex for small producers.

If a producer can afford Hastus or Optibus I consider this just one OIG export, and it is trivial to make.

Whether you can make your data via a spreadsheet should not be an argument for standardizing on a poor design. If a producer can fit their trips with run_id, then one slight change could be suggested: using service_id (as opposed to date), trip_id, run_id, which can still be exported from a single spreadsheet. That would still allow 'bigger' agencies to use a different service_id than was used in trips.txt, without duplicating their stop_times.txt.
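For readers trying to picture that suggestion, a minimal Python sketch follows. It assumes a hypothetical runs.txt with exactly the three columns named above; the file name, the sample rows, and the lookup function are illustrative assumptions, not part of any adopted spec.

```python
import csv
import io

# Hypothetical runs.txt content: one row per (service_id, trip_id) pair.
RUNS_TXT = """service_id,trip_id,run_id
WKDY,T1,R101
WKDY,T2,R101
HOLIDAY,T1,R201
"""


def run_for_trip(runs_text, service_id, trip_id):
    """Return the run_id assigned to a trip under a given service_id, or None."""
    for row in csv.DictReader(io.StringIO(runs_text)):
        if row["service_id"] == service_id and row["trip_id"] == trip_id:
            return row["run_id"]
    return None
```

In this sketch the same trip_id (T1) maps to a different run under a different service_id without any change to trips.txt or stop_times.txt, which is the property the comment describes.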

@tsherlockcraig

If a producer can afford Hastus or Optibus

And if a producer cannot afford Hastus or Optibus, then their riders don't deserve trip planning? Cool.

As I've said, I'm not opposed to runs.txt, and if you open up that PR I'd vote for it. But we should be as open to considering the socio-economic impacts of our decisions as we are open to technical impacts.

@skinkie
Contributor

skinkie commented May 26, 2021

If a producer can afford Hastus or Optibus

And if a producer cannot afford Hastus or Optibus, then their riders don't deserve trip planning? Cool.

There are quite a few free open source and affordable tools available to create GTFS feeds. I haven't come across an open source Hastus, but that one is on my bucket list.

But we should be as open to considering the socio-economic impacts of our decisions as we are open to technical impacts.

That is why we should have a best-practice implementation, not have people spend their working life on repeating lines in stop_times.txt.

@tsherlockcraig

tsherlockcraig commented May 26, 2021

There are quite a few free open source and affordable tools

Open source =/= affordable, and some of the cheap tools that have existed provide only very minimal files. I think you should talk with the city of Guadalupe with me and expand your perspective of the tools that agencies can afford.

That is why we should have a best-practice implementation, not have people spend their working life on repeating lines in stop_times.txt.

I'm unsure of what you mean to say here. You're saying that by allowing people to add run_id to trips.txt, we'd be wasting people's time and thus doing something bad from a socio-economic perspective? If so, you're addressing efficiency, not equity, and I'm making a point about the latter.

--edit

I think you should talk with the city of Guadalupe with me and expand your perspective of the tools that agencies can afford.

I'll grant that my perspective here is driven by the incredibly poor approaches to transit funding that we have in the US, but rural US-folk are a constituency I'm inclined to try to design for in the spec, because the US federal government doesn't look like it's planning to fix the issue any time soon.

I don't know the degree to which these issues exist elsewhere in the world where an agency may still want to import into a CAD/AVL, but I'd hazard that there are some agencies in similar situations in Central/South America, Africa, and parts of Asia. The market is built around a distinction between scheduling and CAD/AVL, which is sensible, but means some agencies will try to purchase CAD/AVL before purchasing scheduling, so spreadsheets are a legitimate approach to scheduling.

@stevenmwhite
Contributor

For some context about other possibilities... before adopting the run_id as detailed here, we used a separate format that was originally developed by a scheduling provider. This format included a file added to GTFS by the title of Runcut.txt.

The Runcut file had the following columns:
runs_id,service_id,run_number,piece_number,start_trip_id,start_stop_id,end_trip_id,end_stop_id

We can currently import this no problem and we continue to use file sets from a small set of agencies with the Runcut file included. If this was adopted as a standard today it'd have no impact to us because our software supports it fully, but our customers found it generally unwieldy for two reasons:

  1. It required custom development in their scheduling (runcutting) software that they were not prepared to do themselves and/or their software provider had no interest in doing because it was seen as the format of a competing vendor.
  2. It was extremely difficult to follow or create manually for agencies that don't use scheduling software for a few reasons, one being its use of trips as a range attribute of a run, which was different than the model already used in GTFS for blocks (presented as an attribute of a trip).
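To make the "trips as a range attribute of a run" point concrete, here is a small Python sketch that reads rows in the column layout listed above and groups them into runs. The sample rows, the identifiers in them, and the assumption that piece_number orders the pieces within a run are all hypothetical; only the header line is taken from this comment.

```python
import csv
import io
from collections import defaultdict

# Header as listed above; the three data rows are invented for illustration.
RUNCUT = """runs_id,service_id,run_number,piece_number,start_trip_id,start_stop_id,end_trip_id,end_stop_id
1,WKDY,101,1,T1,S1,T3,S9
2,WKDY,101,2,T5,S1,T6,S4
3,WKDY,102,1,T3,S9,T4,S1
"""


def pieces_by_run(text):
    """Group Runcut rows by (service_id, run_number), ordered by piece_number."""
    runs = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        runs[(row["service_id"], row["run_number"])].append(row)
    for pieces in runs.values():
        # Assumption: piece_number is an integer giving the order of pieces.
        pieces.sort(key=lambda r: int(r["piece_number"]))
    return dict(runs)
```

Each piece only names where a range starts and ends, so resolving which trips fall inside it requires walking the schedule, whereas a run_id column on trips.txt states the assignment directly on each trip.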

Since adopting the run_id instead, the creation of schedules for import into our CAD/AVL system has involved a massively reduced amount of friction for our customer agencies -- which we believe is the goal of standardization in general.

We're not opposed to exploring other ways of standardizing this information but I would like to ask if anyone else is actually interested in adopting (or already has) run information in GTFS in a different format. We can come up with the best data standard design all we want, but if it's developed by people who won't use it and not adopted by those who would get the interoperability benefits, then it doesn't matter.

It seems to me the conversation has sort of boiled down to two camps: one that says "this would help us greatly today, let's do it" and one that says "no, there are other better ways" but that the no camp isn't actually in need of this information in GTFS at all. (@skinkie please correct me if I'm wrong, as I don't want to misinterpret you, but my understanding is that run information is of little to no consequence for your work).

My main point is: Let's say we all agree on a new runs.txt file -- is there anyone here who would want to produce or consume it immediately?

@jamespfennell

jamespfennell commented May 26, 2021 via email

@stevenmwhite
Contributor

stevenmwhite commented May 26, 2021

please consider the consumer's perspective here too and not just producers.

Absolutely! I should point out that while we are typically a producer of GTFS feeds, in this case we're actually acting as a consumer of the feed.

Agencies and/or scheduling software are the producers, and as a CAD/AVL provider we are consuming the feed for import into our software. We don't actually currently pass this data through to the feeds we publish for public consumption (though we could if public consumers saw a benefit).

On that note, while we currently ask our customers to split trips in some specific scenarios, I completely agree that this is not great, so I'd prefer not adopting anything that makes splitting trips the preferred way to share data.

@tsherlockcraig

I see a suggestion above to take single logical trips and break them into multiple trips to support this runs feature (joining on trip short name).

Clarification since I made that suggestion - I'm not actually encouraging it, and this already happens in the spec because there's no 'pattern' concept and consumers already have to link trips through blocks to provide accurate customer info. The purpose of the comment was to demonstrate that adding the field only to trips.txt would still allow for all requested producer use cases, but put the labor on them rather than the consumer.

But I agree it would be preferred to add the field to stop_times as well as the proposal suggests.

Also I should just note since a previous comment on this thread identifies me as representing a software producer that I'm now representing CALACT (ie, an agency association that focuses on smaller and rural agencies).

@safrazier17

safrazier17 commented May 26, 2021

Hi all,

Given the active discussion on this thread, I would like to note that, beginning in the next several weeks, the state of California's Integrated Travel Project (Cal-ITP) will be sending out invitations to convene a working group around these particular gaps in GTFS.

The goal of the working group is to develop a standard data specification for conveying operational scheduling data from schedule producers to consumers (CAD/AVL companies and other app developers). We are trying to get a wide range of stakeholders involved in this working group so that v1 of the spec (which we are hoping to have published by the end of 2021) will be broadly reflective of the range of use cases that exist among operators of scheduled transit.

With what we know about the friction that exists between the production and consumption of schedules from various sources, we believe this is a high priority for improving the overall quality of mobility data in CA and beyond. We will also be looking for commitments from participating stakeholders to develop support for v1 of the schedule spec after conclusion of the working group's activities.

As I said, invitations have not yet gone out; however, we do have a prospective list of invitees. If you or your org would like to be contacted or want more information, please feel free to reach out to me directly at scott [at] compiler [dot] la. Thank you!

Edit: Worth noting as well that the bounds of what we will be looking at as part of this working group includes operational concepts like runs but is also intended to be more broadly encompassing of the information needed to plan transit operations. Potentially, this means extending existing specifications to cover personnel information, driver activities (layovers or deadheads, e.g.), planned service modifications, and so on.

@skinkie
Contributor

skinkie commented May 26, 2021

There are quite a few free open source and affordable tools

Open source =/= affordable, and some of the cheap tools that have existed provide only very minimal files. I think you should talk with the city of Guadalupe with me and expand your perspective of the tools that agencies can afford.

Are you here publicly stating you have found an agency that can operate a realtime AVL system, that has run_id requirements but cannot afford operating a timetable generator beyond a spreadsheet?

I'm unsure of what you mean to say here. You're saying that by allowing people to add run_id to trips.txt, we'd be wasting people's time and thus doing something bad from a socio-economic perspective? If so, you're addressing efficiency, not equity, and I'm making a point about the latter.

Someone working from spreadsheets has to generate stop_times.txt as well. By hand.

I'll grant that my perspective here is driven by the incredibly poor approaches to transit funding that we have in the US, but rural US-folk are a constituency I'm inclined to try to design for in the spec, because the US federal government doesn't look like it's planning to fix the issue any time soon.

I am currently producing and have been producing GTFS for many local regions, starting from much worse starting points. Hence, I really don't see the economic or technological risks you are trying to sell me. But maybe we should have this chat and come up with a no-brainer solution that anyone can use.

I don't know the degree to which these issues exist elsewhere in the world where an agency may still want to import into a CAD/AVL, but I'd hazard that there are some agencies in similar situations in Central/South America, Africa, and parts of Asia. The market is built around a distinction between scheduling and CAD/AVL, which is sensible, but means some agencies will try to purchase CAD/AVL before purchasing scheduling, so spreadsheets are a legitimate approach to scheduling.

Do they buy buses too, without knowing they need a diesel supply? Sadly I know these things happen... but someone is making a profit without informing the client. You just cannot support stupidity without consequences.

(@skinkie please correct me if I'm wrong, as I don't want to misinterpret you, but my understanding is that run information is of little to no consequence for your work).

Sadly it is. We had a discussion with a big consumer two weeks ago that basically uses "commitment on completeness of data in current and future GTFS" as a KPI for our software. So while in about two months all our source information will be in NeTEx, it does mean we are converting the NeTEx features into GTFS, which will include blocks (and if available: runs).

My main point is: Let's say we all agree on a new runs.txt file -- is there anyone here who would want to produce or consume it immediately?

We may produce it, but have no purpose for consuming it.

@tsherlockcraig

Are you here publicly stating you have found an agency that can operate a realtime AVL system, that has run_id requirements but cannot afford operating a timetable generator beyond a spreadsheet?

Slightly more complicated than saying "yes", but: generally yes. I would be willing to share the specifics in voice.

We may produce it, but have no purpose for consuming it.

@amirmatalon would Optibus be willing to produce runs.txt?

My main concern by far is getting run information into the GTFS spec. If we can get a producer and consumer for the runs.txt approach, I'll support it just as vigorously.

@barbeau
Collaborator

barbeau commented May 26, 2021

FWIW, the first GTFS Guiding Principle is:

Feeds should be easy to create and edit
We chose CSV as the basis for the specification because it's easy to view and edit using spreadsheet programs and text editors, which is helpful for smaller agencies. It's also straightforward to generate from most programming languages and databases, which is good for publishers of larger feeds.

@stale

stale bot commented Aug 21, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale label Aug 21, 2021
@stale

stale bot commented Aug 28, 2021

This pull request has been closed due to inactivity. Pull requests can always be reopened after they have been closed. See the Specification Amendment Process.

@stale stale bot closed this Aug 28, 2021