Move Dataset Publishing and General Practices from Best Practices to the spec #375

Closed
emmambd opened this issue Apr 5, 2023 · 12 comments · Fixed by #406
Labels
Change: Best Practice (Changes focusing on recommendations for optimal use of the specification.)
GTFS Schedule (Issues and Pull Requests that focus on GTFS Schedule)

Comments

@emmambd
Collaborator

emmambd commented Apr 5, 2023

Problem

MobilityData has heard about several pain points from the community caused by the GTFS best practices and the spec’s SHOULD statements living in two different places:

  • Producers do not always refer to the best practices, and so moving these into the official spec would give the best practices greater visibility and improve data quality for everyone

  • Merging the best practices in the spec would make it easier for regulators to point producers to one place to get the information they need to create their GTFS feeds

Solution

Based on this feedback, MobilityData is interested in working to merge the best practices into the official spec. We started this work by removing best practices already covered by the spec in 2021. We recognize this is a big effort that will require much community discussion, so we want to start with a small section as a pilot to see if we can gain momentum on this and make some incremental changes that will be high value.

We’ve seen some initial public interest in merging the Dataset Publishing and General Practices section of the Best Practices into the spec (Slack conversation here - you can join the MobilityData slack to see the discussion). Based on this, we’re seeking more feedback to assess interest and get insight.

Recommended Scope

MobilityData suggests keeping the scope to the minimum updates required for the Dataset Publishing and General Practices section to be merged into the official spec. This would include

There would be no changes to the severity of any statements - they would all remain SHOULD statements.

We do recognize there are outstanding issues and PRs related to this section, which we recommend resolving in future iterations. If a PR was voted on and passed, MobilityData would create issues in google/transit that link to the original best practices discussions so the work can continue.

List of outstanding discussions related to the Dataset Publishing and General Practices section (feel free to add others):

MobilityData/GTFS_Schedule_Best-Practices#49
MobilityData/GTFS_Schedule_Best-Practices#48
MobilityData/GTFS_Schedule_Best-Practices#40

@emmambd emmambd added the GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule label Apr 5, 2023
@westontrillium
Contributor

I think the community would definitely benefit from an increased visibility of Best Practices. Including them in the transit/gtfs GH repo seems to me to be the most obvious way to do that–in fact I can’t seem to find them referenced anywhere at all there. However, I am not sure embedding them in the reference.md file is the right approach, if that is what is being suggested. I worry about making that page more bloated and inaccessible with such a significant addition of text (longterm… I understand the scope of this proposal is just the Dataset Publishing & General Practices section, but that would presumably just be the first wave of merging all Best Practices).

Another consideration: What value do we get from having essentially four tiers of requirement–Required, Conditionally Required, Should/Recommended, and Optional? At this point, why not instead move to make Shoulds/Recommended into Conditionally Required (e.g., “should include agency phone if one available” becomes Conditionally Required if such a number is available)?

We should think globally and consider how these kinds of changes may increase the barrier to entry for producers.

Are there other ways of increasing visibility of Best Practices in this space? Could the Best Practices live in a dedicated .md page? Can we actually reference the Best Practices in the reference.md page (e.g., “see Best Practice on X component here”)?

@emmambd
Collaborator Author

emmambd commented Apr 21, 2023

Thanks for providing this feedback @westontrillium! I definitely agree we should think about this change globally, and I appreciate the prompts to consider different solutions.

However, I am not sure embedding them in the reference.md file is the right approach, if that is what is being suggested. I worry about making that page more bloated and inaccessible with such a significant addition of text

We want the spec to be accessible, but the spec is also expanding more and more over time (e.g., fares adoption, the community’s pending work on flex). The spec is going to get longer regardless of where the best practices live. Because of this, I think we need to separate the spec itself (reference.md) from how the spec is rendered. We can find more accessible ways to render the spec on https://gtfs.org while still allowing the spec to expand.

One solution MobilityData is exploring to render the spec in an easier-to-use way:

Defining components of GTFS (e.g., core GTFS for required files, pathways, translations) so documentation readers can more easily find the use cases they care about and how to model them in GTFS.

One result of this could be a dynamic interface on https://gtfs.org/ where I can choose what use cases I want to represent (e.g., I want GTFS basic requirements, text-to-speech and pathways), and then everything irrelevant to me is filtered out.

Working on improving the rendering would be out of scope for #375, but MobilityData would expect to work on it in parallel to adding the best practices to the spec so we can improve the spec’s accessibility.

There are also some usability issues with keeping the best practices and spec separate. For example:

I want to create a GTFS dataset. Let’s start with agency.txt.

I go to https://gtfs.org/schedule/reference/#agencytxt and read the agency_id description

Conditionally Required:

  • Required when the dataset contains data for multiple transit agencies.

Cool, I only have one agency so I am not including this, moving on.

I notice that the Best Practices exist

I go to https://gtfs.org/schedule/best-practices/#agencytxt and I read, in the agency_id description

Should be included, even if there is only one agency in the feed.

Ugh, I guess I should change that!

Since there are so many per-file best practices, this flow seems unintuitive, even for linking to best practices in the spec. Ideally the vision is that we’d have one source of truth (reference.md) that we can render in different ways for user types that want to see a simpler version of the spec.

@isabelle-dr
Collaborator

One thought regarding the requirement tiers:

What value do we get from having essentially four tiers of requirement–Required, Conditionally Required, Should/Recommended, and Optional? At this point, why not instead move to make Shoulds/Recommended into Conditionally Required (e.g., “should include agency phone if one available” becomes Conditionally Required if such a number is available)?

My understanding is that Must/Required and Conditionally Required are essentially one tier, which relates to data validity and they trigger ERRORS in the Canonical GTFS Schedule Validator.
There are two additional tiers that relate to data quality: Should/Recommended and Optional:

  • The Should/Recommended tier deals with minor issues (e.g., route_desc should not be a duplicate of route_short_name) and it triggers WARNINGS.
  • The Optional tier is related to completeness (e.g., adding wheelchair accessibility fields makes the data more complete), and it's also used for fields that are not always applicable.

I recognize this is a generalization and there are some exceptions.
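
To make the tiers above concrete, here is a minimal sketch of how each requirement level could map to a notice severity. The names `Severity`, `TIER_SEVERITY`, and `severity_for` are hypothetical and chosen for illustration only; this is not the Canonical GTFS Schedule Validator's actual API, and it carries the same generalization caveat as the description above.

```python
from enum import Enum


class Severity(Enum):
    ERROR = "ERROR"      # data validity problem
    WARNING = "WARNING"  # data quality problem
    NONE = "NONE"        # nothing is reported


# Hypothetical mapping (an assumption for illustration): the severity a
# validator would report when a field of the given tier is missing.
TIER_SEVERITY = {
    "Required": Severity.ERROR,
    "Conditionally Required": Severity.ERROR,
    "Recommended": Severity.WARNING,
    "Optional": Severity.NONE,
}


def severity_for(tier: str) -> Severity:
    """Return the severity reported when a requirement of this tier is unmet."""
    return TIER_SEVERITY[tier]
```

Under this reading, Required and Conditionally Required collapse into a single validity tier, while Recommended and Optional differ only in whether a notice is raised at all.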

Migrating the Should to Conditionally Required would have significant implications for existing datasets and we should probably discuss this in a separate issue.

@ehowington

@emmambd I think your example is the reason why we ought to be thinking about why the shoulds aren’t considered required or conditional, and why simply adding them to the spec itself (instead of siloed in Best Practices) creates more confusion.

In the case of wanting to build agency.txt, the spec is explicit that agency_id is optional for single agency datasets. So I don’t understand why we want a producer to then experience “oh, I am supposed to use this.” Effectively, that means the spec’s optional descriptor is meaningless if the shoulds override it.

@isabelle-dr I definitely agree this is outside this discussion, but I think this is why it’s worth considering this question before simply adding more confusing elements to the spec. Again, referencing agency_id for a single agency dataset, 4.0 validator does not flag this as a warning, so it does not fall into the realm of minor issues and would not warrant a should.

I imagine there are other scenarios where the breakdown of R/CR, Shoulds (to avoid warnings), and Optional (for data completeness) falls apart besides the example of agency_id (oh hey, blocks). I think having this kind of clear, unambiguous routing of requirements makes sense and ought to be pursued instead of merging best practices into the spec itself. Doing so will likely surface some other examples where we can clean up wording and better define requirements to create a holistic, accessible, navigable spec that works in tandem with the validator (there's also a huge written gap between the spec and the validator that we can address).

@emmambd
Collaborator Author

emmambd commented Apr 25, 2023

@evantrillium Thanks for expanding on the agency_id example and clarifying why retaining the shoulds would be confusing in this instance. You're right that this is an example where it's important to change the severity from Conditionally Required to Required, since doing so 1) significantly improves the readability of the spec and 2) wouldn't cause backwards compatibility issues.

I think having this kind of clear, unambiguous routing of requirements makes sense and ought to be pursued instead of merging best practices into the spec itself.

I'm confused why we wouldn't want to pursue both, since merging them increases visibility and clarifying the requirement level increases readability. Doing both seems to help improve data quality (though I agree reviewing requirement levels is a must for merging the per-file requirements). The next iteration of this work after #375 could focus on discrepancies between requirement tiers across the best practices and the spec.

Are there any examples of where having best practices separated from the spec reduces confusion for producers? Or perhaps where we should reconsider the requirement level of any best practices within the scope of Dataset Publishing and General Practices?

@emmambd
Collaborator Author

emmambd commented Apr 25, 2023

After looking into some other examples (like the timepoint best practice in stop_times.txt or the block_id best practice in frequencies.txt), we're envisioning that a part of this merging work would be to add a Recommended requirement level to the Presences in cases where modifying the best practice to be Conditionally Required or Required is not possible/easy to understand.

This would look like the following for the agency_id case:

agency_id (Conditionally Required): Identifies a transit brand which […]. Should be included, even if there is only one agency in the feed.
Conditionally Required: Required when the dataset contains data for multiple transit agencies.
Recommended otherwise
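
As a rough sketch of how the presence rule above could be checked, the function below reads the contents of agency.txt and reports a missing agency_id as an ERROR when the dataset has multiple agencies and as a WARNING otherwise. The function name `check_agency_id` and the notice format are assumptions made for illustration; they do not come from the spec or from any existing validator.

```python
import csv
import io


def check_agency_id(agency_txt: str) -> list[tuple[str, str]]:
    """Return (severity, message) notices for rows lacking an agency_id."""
    rows = list(csv.DictReader(io.StringIO(agency_txt)))
    # Line numbers assume a header on line 1 and one record per line.
    missing = [
        line
        for line, row in enumerate(rows, start=2)
        if not (row.get("agency_id") or "").strip()
    ]
    if not missing:
        return []
    if len(rows) > 1:
        # Conditionally Required: the dataset contains multiple agencies.
        severity, reason = "ERROR", "required when there are multiple agencies"
    else:
        # Recommended otherwise: a single-agency dataset.
        severity, reason = "WARNING", "recommended even with a single agency"
    return [
        (severity, f"agency.txt line {line}: agency_id is {reason}")
        for line in missing
    ]
```

For a single-agency file without an agency_id column this yields one WARNING, so the dataset would still be considered valid GTFS.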

Regardless, we agree that this kind of harmonization work between the best practices and the spec is key to improving readability, and that merging best practices into the spec without it would just increase confusion. If there are places where this is relevant to #375, we would include the Recommended requirement level now. If not, it would wait until the next iteration of this work.

@antrim
Contributor

antrim commented Apr 26, 2023

Hi @emmambd and @isabelle-dr ! Thanks for moving this all forward! We (Trillium/Optibus) discussed this internally more. First, as I led the GTFS Best Practices working group, I thought it might be helpful to provide a bit of extra-institutional memory around the Best Practices.

The purpose of the Best Practices was to align industry interpretation around the Spec. As GTFS was acquiring broader use, we were encountering situations where trip planners, CAD/AVL vendors, and others had some different expectations and interpretations of the spec. This created problems for everyone: data consumers, GTFS producers (vendors and agencies), and transit riders. So we assembled some prominent GTFS consumers and producers to agree on Best Practices.

It was always the vision that some of these Best Practices would make their way into the Spec reference and be subject to the larger governance process.

In the early days of MobilityData, we also discussed transitioning some of the Best Practices to a “how to” guide. So, BP that define what correct GTFS is would go into the reference. BP that define how to use the spec and provide examples would go to this “How To Guide”.

Trillium (@trilliumtransit) & Optibus (@Optibus) support the goal of aligned expectations/specs across data consumers. Misalignment creates issues for our business, other data producers, data consumers, and transit passengers. We support a process that moves obvious and well-supported Best Practices into the Reference document, discards Best Practices that have gone out of date or disagree with the reference, and also revisits more complex Best Practices in discussion with the community, to see if they should be reformulated or moved to a different document.

@emmambd
Collaborator Author

emmambd commented Apr 26, 2023

Thanks for providing that critical context @antrim! We definitely want to start this process by focusing on increasing clarity and removing duplication. Based on all this feedback, I'm thinking that we alter the proposed scope for this first iteration in #375 to cover:

  • Harmonizing/merging best practices that conflict with the field requirement level severity in the spec — meaning that the requirement level in the spec should be updated to "Recommended" where it's currently Optional (or sometimes not in the spec at all).
  • Merging any individual best practices from the Dataset Publishing section into the spec that we know are widely used. The main one we've heard about from the community so far is adding “At any time, the published GTFS dataset should be valid for at least the next 7 days.”
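
For illustration, the 7-day validity recommendation could be checked with something like the sketch below. It assumes the dataset's service period is already known as a start and end date in the GTFS `YYYYMMDD` format (for example, taken from feed_info.txt or derived from the calendar files; which source to use is an assumption here). The helper names `parse_gtfs_date` and `is_valid_for_next_days` are hypothetical.

```python
from datetime import date, datetime, timedelta


def parse_gtfs_date(value: str) -> date:
    """Parse a GTFS date string in YYYYMMDD format."""
    return datetime.strptime(value, "%Y%m%d").date()


def is_valid_for_next_days(start: date, end: date, today: date, days: int = 7) -> bool:
    """True if the service period covers today through today + `days`."""
    return start <= today and end >= today + timedelta(days=days)
```

A producer could run a check like this on every publish, passing the current date as `today`, and treat a `False` result as a warning rather than an error, since the statement remains a SHOULD.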

I started an audit of best practices that fit this scope and suggested improvements that the community is welcome to give feedback on. We'll be talking about it internally and planning to move forward with a PR based on it in the next few weeks.

@emmambd emmambd added the Change: Best Practice Changes focusing on recommendations for optimal use of the specification. label May 10, 2023
@e-lo

e-lo commented May 10, 2023

From @evantrillium

In the case of wanting to build agency.txt, the spec is explicit that agency_id is optional for single agency datasets. So I don’t understand why we want a producer to then experience “oh, I am supposed to use this.” Effectively, that means the spec’s optional descriptor is meaningless if the shoulds override it.

I'd argue that having the shoulds in a completely separate place is even more confusing.

The user experience is essentially:

  1. Ah, I don't need to build agency.txt so I'm going to leave it out of my software implementation
  2. Client is upset that agency.txt is not there b/c its use is widespread...points them to the best practices documentation
  3. Ohhh, there is a whole other set of suggestions that I should have been doing all along?

I'd argue that replacing a MAY with a SHOULD (which is ideally synchronous with adding a warning to the validator) is backwards compatible because it does not produce an ERROR and would still be considered valid GTFS.

Put another way - you shouldn't get a warning in the validator for something that isn't a SHOULD in the GTFS spec.

@ehowington

Hi @e-lo

We are in alignment on a consolidated reference - my example was specifically about agency_id (not the entirety of agency.txt) being an Optional element for single-agency feeds, but a Recommended element for all feeds per the BPs. @emmambd’s recommendations in #376 are a great step towards bringing the spec, the BPs, and the validator all in tighter sync which Trillium fully supports.

@isabelle-dr
Collaborator

Closing, solved in #406

@isabelle-dr
Collaborator

Follow the rest of the Best Practice and Spec reconciliation in issue #396.

6 participants