Language agnostic programmatic definition of GTFS-Static #127

Open
Subzidion opened this issue Dec 9, 2018 · 36 comments
Labels
GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule Status: Pull Request Created Issues that have been transferred to the Pull Request stage.

Comments

@Subzidion

I was curious whether there is a language-agnostic programmatic definition for GTFS-Static, similar to the proto definition for GTFS-Realtime, that can be used with an ORM to create class representations of the text files. This would simplify the coding process and eliminate the need to write code to represent each text file, and to update that code if/when the spec changes.

@skinkie
Contributor

skinkie commented Dec 9, 2018

It is called SQL ;-) and I think it does exist via GTFSdb.

@Subzidion
Author

GTFSdb is all Python, nothing that's just SQL. I was thinking something more along the lines of what OneBusAway does with Hibernate, but usable in any language.

@skinkie
Contributor

skinkie commented Dec 9, 2018

GTFSdb produces SQL tables in different flavors. You could use that in your agnostic definition.

@Subzidion
Author

Using GTFSdb for this would still require updating the Python code to reflect any changes to the spec, running it to create your database schema, then using some other tool to map from the schema to classes in whatever language you want to use. Feels a bit cumbersome. I understand there's probably some class representation already created for most languages, but needing to check and hope they get updated if the spec changes seems annoying. If there were some definition, similar to the Hibernate one, that could be used in any language, it would make any of those language-specific GTFS-Static ORMs a lot easier.

@barbeau
Collaborator

barbeau commented Dec 11, 2018

@Subzidion There isn't anything official, but the closest thing I'm aware of in concept to what you're looking for is this Data Package specification:

I started generalizing this to any GTFS:
https://github.com/CUTR-at-USF/GTFS

I think I have some work stashed somewhere beyond what's currently in the above branch...

@Subzidion
Author

This Data Package Specification is exactly what I was looking for. Is there any way we can make this specification a part of the main GTFS package? I would think defining GTFS in terms of the JSON schema would help clarify ambiguity instead of attempting to define the JSON schema from markdown.

@barbeau
Collaborator

barbeau commented Dec 13, 2018

@Subzidion It's certainly possible.

Is there anyone else interested in this type of programmatically-readable schema definition for GTFS?

@barbeau barbeau added the GTFS Schedule Issues and Pull Requests that focus on GTFS Schedule label Mar 25, 2019
@barbeau
Collaborator

barbeau commented Sep 28, 2020

Note that there is a proposal and discussion related to GTFS schemas happening at #244.

@devadvance

Similar to the discussion in #244, I strongly bias towards achieving consensus on the problem statement before proposing a standard.

It sounds like the core objective is to further codify GTFS to meet these criteria (broken into bullets for easier visual parsing):

  • Machine-readable instructions that specify
  • in a language-agnostic, storage-agnostic manner
  • the correct structure, syntax, and relationships
  • of GTFS static data
  • such that GTFS data can be processed and stored
  • in a backwards and forwards compatible manner
  • that minimizes the need for implementation updates by individuals.

It would be helpful to know if that's an accurate characterization, or if criteria like human-readable or CSV-specific need to be appended.

@e-lo

e-lo commented Sep 28, 2020

@devadvance:

criteria like human-readable

I think human readability is an important consideration for transparency and maintainability since changes to the spec will be represented and discussed within the context of a pull-request and vote.

such that GTFS data can be processed and stored

AND

  • validated...adding in the potentiality for conditions beyond Type.
  • documented...reducing errors and friction.

@wesleyi23

I am new to this community, so there are likely many nuances that I am missing. However, I have some questions and comments about this issue. For some background, I am working, in collaboration with CALTRANS, on an application to implement software to support V1 of the GTFS "Grading Standard."

A canonical, machine-readable version of the standard would be most helpful. I would like to be able to abstract away as much of the standard as possible from this application, so I don't need to make updates to this application each time the standard changes. Having a machine-readable version of the standard, with at least an agreed-upon format/structure, defined types, and enumerations, would be central to this goal.

So I have a couple of questions:

  • Generally, is the version of the schema in the feature/json-schema branch suitable to begin development from? Or are there potential issues with it I should be aware of?
  • If I were to start developing from the JSON-Schema file, would it be maintained in the future?
  • How much might this file change in the future?

I also wanted to voice my support for developing a canonical JSON-Schema for the following reasons: it has already been developed (assuming the draft is up to date and there aren't any significant issues with it), it is the most widely used of the proposed standards, and to my knowledge it supports most of the objectives that have been mentioned so far. However, as I said, there are likely many nuances that I am missing.

More importantly, I wanted to express that not having a machine-readable version of the standard creates a significant issue right at the beginning of any new development effort: how do I model the standard, and how do I keep that model up to date as it evolves? Providing a schema file would help alleviate many of these issues and free up developer time for other work.

@e-lo

e-lo commented Nov 19, 2020

@wesleyi23 do you have any additions/mods to @devadvance's summary of the problem statement?

@wesleyi23

@e-lo and @devadvance the only thing I would add is that it would be nice if the solution included not only the correct structure, syntax, and relationships of GTFS static data, but also the file and field descriptions. I would make a pitch that these be represented in an HTML format, because there are some ordered lists, paragraphs, and other similar items. I think this would provide a more or less complete reproduction of the current standard documents.

@e-lo

e-lo commented Nov 30, 2020

Does anyone have any additions/edits to the following problem statement?

  1. Machine-readable instructions that specify
  2. in a language-agnostic, storage-agnostic manner
  3. that is relatively standardized itself (such that there are existing tools and a potential ecosystem for testing as well as rendering in a "front-end" form)
  4. is human legible in its native form (to allow for easy git-diffs + increase likelihood of catching errors)
  5. which articulates the correct structure, syntax, bounds, and relationships
  6. of GTFS static data
  7. as well as the file and field descriptions
  8. such that GTFS data can be processed and stored
  9. and validated
  10. in a backwards and forwards compatible manner
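
As an illustration of point 4 only, here is a minimal sketch of how a definition kept as indented JSON produces a small, reviewable diff when one optional field is added. The descriptor layout (`file`, `fields`, `constraints`) and the helper names `render` and `spec_diff` are assumptions made for this example, not part of any proposal in this thread.

```python
import difflib
import json

# Hypothetical, heavily simplified definition of agency.txt. The key names
# ("file", "fields", "constraints") are illustrative assumptions.
agency_v1 = {
    "file": "agency.txt",
    "fields": [
        {"name": "agency_id", "type": "string"},
        {"name": "agency_name", "type": "string",
         "constraints": {"required": True}},
    ],
}

# A later revision of the same definition that adds one optional field.
agency_v2 = {
    "file": "agency.txt",
    "fields": agency_v1["fields"] + [{"name": "agency_email", "type": "string"}],
}


def render(definition: dict) -> list[str]:
    """Serialize with fixed indentation so diffs only show real changes."""
    return json.dumps(definition, indent=2).splitlines()


def spec_diff(old: dict, new: dict) -> list[str]:
    """Return a unified diff between two revisions of a definition."""
    return list(
        difflib.unified_diff(render(old), render(new), "v1", "v2", lineterm="")
    )


print("\n".join(spec_diff(agency_v1, agency_v2)))
```

The printed diff touches only the lines around the new `agency_email` entry, which is the property a pull-request review would rely on.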

@e-lo

e-lo commented Nov 30, 2020

BTW - I saw that @Stephen-Gates started developing a Frictionless data package for GTFS and would be curious why it seems to have been abandoned?

@barbeau
Collaborator

barbeau commented Nov 30, 2020

@e-lo My understanding is that https://github.com/Stephen-Gates/GTFS was created specifically for validating the South East Queensland GTFS data and was never intended to be a canonical schema for the general GTFS spec. For example, some of the location constraints defined for stop location lat/longs are specific to Queensland.

I started expanding Stephen's work in this branch to represent the entire spec a while back, but other priorities pulled my attention away:
https://github.com/CUTR-at-USF/GTFS/tree/full-spec

You can see my changes in these two commits:

Here were the remaining TODOs I noted in 2016:

  • Review TODOs and FIXMEs - some constraints will break extensibility
  • Add missing files

If someone would want to pick up this work I'd certainly welcome the contribution.

@e-lo

e-lo commented Nov 30, 2020

@barbeau Awesome, and thanks for the background. In your opinion, is frictionless "the right" spec for achieving the objectives above? My main hesitation is the lack of recent progress/movement in the organization.

@wesleyi23 It seems like Sean's repo is a good place to start.

@barbeau
Collaborator

barbeau commented Nov 30, 2020

@e-lo It looked very promising to me, and the above work was mainly an experiment to see if it panned out. Unfortunately I don't have any experience with frictionless outside of the above so I can't say for sure.

@wesleyi23

@e-lo @barbeau At the moment I have a need for a schema document, so I am happy to put time into developing one further.

Sean, I reviewed your repo and I agree it could be a good place to start. There is also @LeoFrachet's JSON-Schema file referenced in #244, which would also make a good starting place. As a new community member, I have no context or background to weigh the pros or cons of JSON-Schema vs Frictionless.

From a technical perspective they both appear to satisfy the identified problem statement, unless there is something I am missing.

Any guidance on which path to follow would be greatly appreciated.

@jamespfennell

jamespfennell commented Nov 30, 2020 via email

@skinkie
Contributor

skinkie commented Nov 30, 2020

@jamespfennell one reason could be that some constraints are "OR".

@e-lo

e-lo commented Nov 30, 2020

@jamespfennell: To my knowledge SQL definitions aren't designed to be 'read in' as data other than for SQL, but I would be curious if somebody more familiar with the various options could evaluate this one against the 10 points above.

@barbeau
Collaborator

barbeau commented Nov 30, 2020

As a new community member, I have no context or background to weigh the pros or cons of JSON-Schema vs Frictionless.

My concern with the JSON-Schema is that we'd be introducing an entirely new encoding-specific concept to GTFS that doesn't currently exist there. I think it would also tempt some producers and consumers to "JSON-ize" GTFS data, and I see that further complicating an already complex ecosystem.

The Frictionless Table Schema format was designed to represent tabular data, which is the current representation/encoding of static GTFS data (CSV files in a ZIP file). IMHO it seems a better fit to the existing GTFS spec, unless there is a limitation that I don't know of.
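
To make the "tabular fit" concrete, here is a minimal sketch of a descriptor loosely modeled on Frictionless Table Schema (`fields`, `primaryKey`, `foreignKeys`) applied directly to CSV text. The descriptor is a simplified assumption written for this example and is not taken from any of the repositories linked in this thread; `read_table` and `dangling_references` are hypothetical helpers, and only the Python standard library is used.

```python
import csv
import io

# Hypothetical descriptor for a simplified routes.txt, in the style of a
# Table Schema document. Field list and constraints are abridged assumptions.
ROUTES_SCHEMA = {
    "fields": [
        {"name": "route_id", "type": "string", "constraints": {"required": True}},
        {"name": "agency_id", "type": "string"},
        {"name": "route_type", "type": "integer", "constraints": {"required": True}},
    ],
    "primaryKey": "route_id",
    "foreignKeys": [
        {
            "fields": "agency_id",
            "reference": {"resource": "agency", "fields": "agency_id"},
        }
    ],
}

AGENCY_TXT = "agency_id,agency_name\nA1,Example Transit\n"
ROUTES_TXT = "route_id,agency_id,route_type\nR1,A1,3\nR2,A9,3\n"


def read_table(text: str) -> list[dict]:
    """Parse CSV text into a list of row dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))


def dangling_references(schema: dict, rows: list[dict], resources: dict) -> list[tuple]:
    """Return (row_number, field, value) for foreign keys with no match."""
    problems = []
    for fk in schema.get("foreignKeys", []):
        ref = fk["reference"]
        known = {r[ref["fields"]] for r in resources[ref["resource"]]}
        for number, row in enumerate(rows, start=2):  # row 1 is the header
            value = row.get(fk["fields"], "")
            if value and value not in known:
                problems.append((number, fk["fields"], value))
    return problems


resources = {"agency": read_table(AGENCY_TXT)}
print(dangling_references(ROUTES_SCHEMA, read_table(ROUTES_TXT), resources))
# prints [(3, 'agency_id', 'A9')]
```

The data stays as CSV rows throughout; the JSON-like descriptor only describes it, which is the distinction drawn above between a table schema and "JSON-izing" the feed.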

Why not just use a SQL database definition file? This can include unique constraints, foreign key constraints, enums, and so on.

@jamespfennell Could you give me an example for a table in GTFS?

As @skinkie says there are some situations that won't be easy to model, like service_id in calendar_dates.txt, which in some cases is a primary key but in others is a foreign key (potentially within the same GTFS dataset, as evidenced by MobilityData/gtfs-validator#397):
https://github.com/google/transit/blob/master/gtfs/spec/en/reference.md#calendar_datestxt
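
For reference, a minimal sketch of what a SQL definition for two GTFS tables might look like, run through SQLite from Python. The column lists are abridged, and the DDL and the helpers `open_feed_db` and `try_insert` are assumptions made for this example rather than anything proposed in this thread. It covers the easy cases (primary key, foreign key, enumeration via `CHECK`); a rule such as the `service_id` one described above, where a value may either reference another table or stand on its own, does not map onto a plain `REFERENCES` clause.

```python
import sqlite3

# Hypothetical, simplified DDL for two GTFS files; column lists are abridged.
DDL = """
CREATE TABLE agency (
    agency_id   TEXT PRIMARY KEY,
    agency_name TEXT NOT NULL
);
CREATE TABLE routes (
    route_id   TEXT PRIMARY KEY,
    agency_id  TEXT REFERENCES agency (agency_id),
    route_type INTEGER NOT NULL
        CHECK (route_type IN (0, 1, 2, 3, 4, 5, 6, 7, 11, 12))
);
"""


def open_feed_db() -> sqlite3.Connection:
    """Create an in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(DDL)
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


conn = open_feed_db()
print(try_insert(conn, "INSERT INTO agency VALUES (?, ?)", ("A1", "Example Transit")))
print(try_insert(conn, "INSERT INTO routes VALUES (?, ?, ?)", ("R1", "A1", 3)))
print(try_insert(conn, "INSERT INTO routes VALUES (?, ?, ?)", ("R2", "A9", 3)))   # unknown agency
print(try_insert(conn, "INSERT INTO routes VALUES (?, ?, ?)", ("R3", "A1", 99)))  # invalid route_type
```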

@wesleyi23

wesleyi23 commented Dec 8, 2020

After talking with folks, I am starting work to update and expand the Frictionless schema Stephen created for Queensland. I have forked @barbeau's branch and will be working on it here: https://github.com/wesleyi23/GTFS-Frictionless.

@e-lo

e-lo commented Jan 4, 2021

Thank you to @wesleyi23 for creating a fairly complete definition of GTFS here: https://github.com/wesleyi23/GTFS-Frictionless

It would be great if all who are interested could add issues, contribute to, and improve this definition.

I'm also interested in whether the community would be amenable to using this type of definition as the canonical GTFS definition, such that we can generate Markdown/HTML from the programmatic definition in JSON rather than vice versa.

@MuckT

MuckT commented Jan 6, 2021

I found that the old version of the schemas did not work well with popular schema validators, so I've started converting them to JSON schema v7. I've created a simple Nx app that converts .txt or .csv files into JSON objects and then validates them in the browser. My UI abilities are a bit lacking, so currently the results are in console logs, but you can see my progress here: MuckT/gtfs-tools

As of writing this I have only rewritten the agency.txt schema; any help in the UI or schema development would be appreciated.

@e-lo

e-lo commented Jan 6, 2021

@MuckT - @LeoFrachet developed a full JSON Schema for GTFS which is in the PR linked to this issue. See the discussion in that PR and the above problem statement for why frictionless seemed to fit the bill better.

Note that validators are fairly easy to create once the schema is in a parsable format. You can also use goodtables.io to do data validation "as a service" in frictionless' format.
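
As a rough illustration of that claim, here is a minimal validator sketch driven entirely by a parsed schema. It uses only the Python standard library rather than goodtables or any Frictionless tooling, and the descriptor, the `CASTS` table, and the `validate` function are assumptions written for this example.

```python
import csv
import io

# Hypothetical rules for a simplified stops.txt, written in the style of a
# Table Schema descriptor. Not taken from any published schema.
STOPS_SCHEMA = {
    "fields": [
        {"name": "stop_id", "type": "string", "constraints": {"required": True}},
        {"name": "stop_lat", "type": "number",
         "constraints": {"minimum": -90, "maximum": 90}},
        {"name": "stop_lon", "type": "number",
         "constraints": {"minimum": -180, "maximum": 180}},
    ]
}

# Map schema type names to Python callables that parse a CSV cell.
CASTS = {"string": str, "integer": int, "number": float}


def validate(schema: dict, text: str) -> list[str]:
    """Check every row of CSV text against the schema; return error messages."""
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    for number, row in enumerate(reader, start=2):  # row 1 is the header
        for field in schema["fields"]:
            name, rules = field["name"], field.get("constraints", {})
            raw = (row.get(name) or "").strip()
            if not raw:
                if rules.get("required"):
                    errors.append(f"row {number}: {name} is required")
                continue
            try:
                value = CASTS[field["type"]](raw)
            except ValueError:
                errors.append(f"row {number}: {name}={raw!r} is not a {field['type']}")
                continue
            if "minimum" in rules and value < rules["minimum"]:
                errors.append(f"row {number}: {name}={value} is below the minimum")
            if "maximum" in rules and value > rules["maximum"]:
                errors.append(f"row {number}: {name}={value} is above the maximum")
    return errors


SAMPLE = "stop_id,stop_lat,stop_lon\nS1,47.6,-122.3\n,95.0,abc\n"
for message in validate(STOPS_SCHEMA, SAMPLE):
    print(message)
```

The validator contains no GTFS-specific logic; adding a field or tightening a bound is a change to the descriptor only.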

@github-actions

github-actions bot commented Jan 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Status: Stale Issues and Pull Requests that have remained inactive for 30 calendar days or more. label Jan 7, 2022
@derhuerst

@github-actions Don't close.

@github-actions github-actions bot removed the Status: Stale Issues and Pull Requests that have remained inactive for 30 calendar days or more. label Jan 10, 2022
@pietercolpaert

Really interesting discussion!

Does anyone have any additions/edits to the following problem statement #127 (comment)

I’d like to add something that was in the original issue as well: that it should be easy, once the spec was processed, for a system using the spec to keep in sync with the latest additions to the spec. If an optional field was added for example, I’d want my codebase to create that new class on the next run, or I’d want a JSON schema I defined to add that property.
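
A minimal sketch of that "picked up on the next run" behaviour, assuming the spec is available as a parsed descriptor: the class is built from the descriptor at start-up, so a newly added optional field appears without any hand-written code. The descriptor shape, `PY_TYPES`, and `build_class` are hypothetical names for this example.

```python
import dataclasses
from typing import Optional

# Map schema type names to Python types (assumed type vocabulary).
PY_TYPES = {"string": str, "integer": int, "number": float}

# Hypothetical descriptor for a simplified agency.txt, after a spec revision
# that added the optional agency_email field.
SCHEMA_V2 = {
    "fields": [
        {"name": "agency_name", "type": "string", "constraints": {"required": True}},
        {"name": "agency_email", "type": "string"},
    ]
}


def build_class(class_name: str, schema: dict) -> type:
    """Create a dataclass whose attributes mirror the schema's fields."""
    required, optional = [], []
    for field in schema["fields"]:
        py_type = PY_TYPES[field["type"]]
        if field.get("constraints", {}).get("required"):
            required.append((field["name"], py_type))
        else:
            optional.append(
                (field["name"], Optional[py_type], dataclasses.field(default=None))
            )
    # Fields without defaults must come before fields with defaults.
    return dataclasses.make_dataclass(class_name, required + optional)


Agency = build_class("Agency", SCHEMA_V2)
print(Agency(agency_name="Example Transit"))
```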

My own case: I created an RDF/Linked Data vocabulary for GTFS back in 2015. Today it’s horribly out of date, but we just started updating it to the latest spec: OpenTransport/linked-gtfs#20

I wonder whether we should restate the problem scope as: what programmatic description should we use so that everyone can keep their own technology-specific schema, the one they use to import or validate GTFS static files, up to date? If this problem were solved, we could also have automatic translations towards commonly used schema languages like JSON and XML Schema, SQL, protobuf, RDF/SHACL/ShEx, etc.

The problem is thus not choosing the one schema language to rule them all, but it is choosing the one that will best express the things decided in the GTFS specification, so that it can be automatically translated to all others.
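
A minimal sketch of that idea, assuming a single neutral descriptor as the source of truth and two small translators, one emitting SQL DDL and one emitting a JSON Schema fragment. The descriptor layout and the functions `to_sql` and `to_json_schema` are hypothetical; a real translator would also need to carry enumerations, bounds, and cross-file relationships.

```python
# Assumed type vocabulary of the source descriptor and its target equivalents.
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER", "number": "REAL"}
JSON_TYPES = {"string": "string", "integer": "integer", "number": "number"}

# Hypothetical source-of-truth descriptor for a simplified routes.txt.
ROUTES = {
    "name": "routes",
    "fields": [
        {"name": "route_id", "type": "string", "constraints": {"required": True}},
        {"name": "route_short_name", "type": "string"},
        {"name": "route_type", "type": "integer", "constraints": {"required": True}},
    ],
}


def is_required(field: dict) -> bool:
    return bool(field.get("constraints", {}).get("required"))


def to_sql(table: dict) -> str:
    """Translate the descriptor into a CREATE TABLE statement."""
    columns = []
    for field in table["fields"]:
        not_null = " NOT NULL" if is_required(field) else ""
        columns.append(f'    {field["name"]} {SQL_TYPES[field["type"]]}{not_null}')
    return f'CREATE TABLE {table["name"]} (\n' + ",\n".join(columns) + "\n);"


def to_json_schema(table: dict) -> dict:
    """Translate the descriptor into a JSON Schema object for one row."""
    return {
        "type": "object",
        "properties": {
            field["name"]: {"type": JSON_TYPES[field["type"]]}
            for field in table["fields"]
        },
        "required": [f["name"] for f in table["fields"] if is_required(f)],
    }


print(to_sql(ROUTES))
print(to_json_schema(ROUTES))
```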

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the Status: Stale Issues and Pull Requests that have remained inactive for 30 calendar days or more. label Jan 28, 2023
@github-actions

This issue has been closed due to inactivity. Issues can always be reopened after they have been closed.

@derhuerst

This is still relevant.

@pietercolpaert

pietercolpaert commented Feb 13, 2023

Ack! Still relevant.

I wonder if we could use https://linkml.io for this. It seems to do what I described above.

@isabelle-dr
Collaborator

Re-opening :)

@isabelle-dr isabelle-dr reopened this Feb 13, 2023
@github-actions github-actions bot removed the Status: Stale Issues and Pull Requests that have remained inactive for 30 calendar days or more. label Feb 14, 2023
@eliasmbd
Collaborator

📢 The participants in this conversation might want to look at issue #391 to discuss adding the GeoJSON format in GTFS as part of the GTFS-Flex extension proposal.
