Identify your chosen source of input data #1

Open
MikeRalphson opened this issue Mar 10, 2023 · 13 comments
Labels
good first issue Good for newcomers

Comments

@MikeRalphson
Contributor

From https://github.com/schemaorg/schemaorg/tree/main/data/releases/15.0

Examine the various formats in which the schema.org Types are described.

Which format would be best to work from programmatically? Are the various formats equivalent? What leads you to those conclusions?

Can you identify the data which describes the Book Type?

@MikeRalphson MikeRalphson added the good first issue Good for newcomers label Mar 10, 2023
@Himanshu-Dedha

In https://github.com/schemaorg/schemaorg/tree/main/data/releases/15.0, the schema.org Types are described in various formats, such as JSON-LD, RDFa, Microdata, Turtle, N-Triples and N-Quads.

The best format to work from programmatically depends on the use case and the tooling available. In my opinion, JSON-LD is the best suited from a programming perspective: it expresses semantic information in standard JSON syntax, so its key-value pairs are easy to read and any JSON parser can deserialize it. Microdata and RDFa, on the other hand, are syntaxes for embedding semantic information in HTML documents. They can represent the same data, but extracting it means parsing the surrounding HTML first, which makes deserialization more complex and less intuitive than with JSON-LD.

The various formats are equivalent in terms of the information they contain, but they differ in how that information is represented. We can use whichever format suits our needs.

Also, I'm a little confused by "the data which describes the Book Type". Does this mean the expected data types of the properties of Book?
Assuming yes, the data types of the properties defined directly on the Book Type are shown in the table below:

| Property | Data Type | Description |
| --- | --- | --- |
| abridged | Boolean | Indicates whether the book is an abridged edition or not |
| bookEdition | Text | The edition of the book |
| bookFormat | BookFormatType | The format of the book |
| illustrator | Person | The illustrator of the book |
| isbn | Text | The ISBN of the book |
| numberOfPages | Integer | The number of pages in the book |

Note
The data types of the properties inherited from CreativeWork and Thing have not been included in the table above.
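As a rough sketch of how the Book Type could be located programmatically in a JSON-LD release file, the snippet below scans a graph for the class node and for the properties whose domain includes it. The inline `SAMPLE` is a hand-written stand-in, and its node shape (`@graph`, `schema:domainIncludes`, `schema:rangeIncludes`) is an assumption about the release file, not a copy of it. The helper names `ids` and `describe_type` are hypothetical.

```python
import json

# Hand-written stand-in for the "@graph" array of a schema.org JSON-LD
# release file. The node shapes are assumed, not copied from the release.
SAMPLE = json.loads("""
{
  "@graph": [
    {"@id": "schema:Book", "@type": "rdfs:Class",
     "rdfs:subClassOf": {"@id": "schema:CreativeWork"}},
    {"@id": "schema:isbn", "@type": "rdf:Property",
     "schema:domainIncludes": {"@id": "schema:Book"},
     "schema:rangeIncludes": {"@id": "schema:Text"}},
    {"@id": "schema:numberOfPages", "@type": "rdf:Property",
     "schema:domainIncludes": {"@id": "schema:Book"},
     "schema:rangeIncludes": {"@id": "schema:Integer"}},
    {"@id": "schema:author", "@type": "rdf:Property",
     "schema:domainIncludes": [{"@id": "schema:CreativeWork"},
                               {"@id": "schema:Rating"}],
     "schema:rangeIncludes": [{"@id": "schema:Organization"},
                              {"@id": "schema:Person"}]}
  ]
}
""")


def ids(value):
    """Normalise one {"@id": ...} node, a list of them, or None to a list of ids."""
    if value is None:
        return []
    nodes = value if isinstance(value, list) else [value]
    return [node["@id"] for node in nodes]


def describe_type(graph, type_id):
    """Return the class node for type_id and a {property id: [range ids]} map.

    Only properties that name type_id directly in their domain are included;
    inherited properties (e.g. from CreativeWork) are not.
    """
    cls = next(node for node in graph if node["@id"] == type_id)
    props = {
        node["@id"]: ids(node.get("schema:rangeIncludes"))
        for node in graph
        if node.get("@type") == "rdf:Property"
        and type_id in ids(node.get("schema:domainIncludes"))
    }
    return cls, props


book, book_props = describe_type(SAMPLE["@graph"], "schema:Book")
```

To pick up inherited properties as well, the same lookup could be repeated for each id found by following `rdfs:subClassOf` upwards.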

@MikeRalphson
Contributor Author

@Himanshu-Dedha thank you for your detailed response! I would encourage you to formally apply to GSoC for this project. https://summerofcode.withgoogle.com/

@Himanshu-Dedha

Hello @MikeRalphson! I had a question. Some Schema.org types seem to have mandatory fields, for example name and image for the Organization type. So when we map the schema.org properties, we'll have to map the required fields as well, right?

@MikeRalphson
Contributor Author

If you identify required fields in schema.org Types, yes the required array should be populated in the output.
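A minimal sketch of what "populating the required array" might look like in a generated schema, assuming the list of required names has already been identified by some other means. The function name `to_object_schema` is hypothetical, and the name/image pair is only the example raised earlier in this thread.

```python
def to_object_schema(properties, required=()):
    """Build a JSON Schema style object definition.

    `required` is emitted only when non-empty, because some schema dialects
    (for example JSON Schema draft-04) do not allow an empty required array.
    """
    missing = [name for name in required if name not in properties]
    if missing:
        raise ValueError(f"required names not in properties: {missing}")

    schema = {"type": "object", "properties": dict(properties)}
    if required:
        # Sorted so the generated output is stable between runs.
        schema["required"] = sorted(required)
    return schema


organization = to_object_schema(
    {
        "name": {"type": "string"},
        "image": {"type": "string", "format": "uri"},
    },
    required=["name", "image"],
)
```

Rejecting a required name that has no matching property is a design choice of this sketch. It catches mapping mistakes early, but JSON Schema itself does not demand it.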

@Himanshu-Dedha

Also, one more thing I wanted to ask: in this project, when we convert schema.org Types from JSON-LD to an OpenAPI specification, we're losing all the semantic information, right? So would it be right to interpret this as filtering out the semantics, i.e. converting the JSON-LD to JSON, then developing a JSON Schema for validation, and then converting this JSON Schema to an OpenAPI specification?

@MikeRalphson
Contributor Author

What do you mean by all the semantic information?

@Himanshu-Dedha

By semantic information, I mean the context that is provided in JSON-LD which is used to map terms used in a JSON-LD document to a vocabulary of terms, i.e. schema.org here.

@Himanshu-Dedha

So what I meant by converting JSON-LD to JSON was removing the context from JSON-LD. Is that right? Or am I making a mistake?

@MikeRalphson
Contributor Author

Can you provide an example of a property you think will be lost?

@Himanshu-Dedha

Himanshu-Dedha commented Apr 1, 2023

Sure,
So here's an example of the Book type in JSON-LD format:
    {
      "@context": "http://schema.org",
      "@type": "Book",
      "name": "Random Book",
      "author": {
        "@type": "Person",
        "name": "Random Person"
      },
      "publisher": {
        "@type": "Organization",
        "name": "The Random Books"
      },
      "isbn": "9780330508567537"
    }

So JSON-LD is a superset of JSON, i.e. it contains some extra information that plain JSON doesn't, which here are the @context and @type fields. If we removed the @context field from the JSON-LD, the @type field would lose the information about which type it actually refers to. So the resulting JSON from the above JSON-LD would be:
    {
      "name": "Random Book",
      "author": {
        "name": "Random Person"
      },
      "publisher": {
        "name": "The Random Books"
      },
      "isbn": "9780330508567537"
    }
The @context and @type fields have been removed because they are specific to JSON-LD format. However, I do think that @type can have a separate field to implement modularity in JSON schema and the OpenAPI specifications.
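The stripping step described above can be sketched as a small recursive function. This only illustrates the transformation in this comment, not a recommended step for the project. Note the assumption that every key starting with `@` should go, which would also drop keywords such as `@id` if they were present. The function name is hypothetical.

```python
def strip_jsonld_keywords(node):
    """Return a copy of node with every key that starts with '@' removed.

    Works recursively through nested objects and arrays. Scalars are
    returned unchanged.
    """
    if isinstance(node, dict):
        return {
            key: strip_jsonld_keywords(value)
            for key, value in node.items()
            if not key.startswith("@")
        }
    if isinstance(node, list):
        return [strip_jsonld_keywords(item) for item in node]
    return node


book_jsonld = {
    "@context": "http://schema.org",
    "@type": "Book",
    "name": "Random Book",
    "author": {"@type": "Person", "name": "Random Person"},
    "publisher": {"@type": "Organization", "name": "The Random Books"},
    "isbn": "9780330508567537",
}

book_plain = strip_jsonld_keywords(book_jsonld)
```

After this step nothing in `book_plain` says that `author` is a Person rather than an Organization, which is the loss of information the comment is concerned about.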

@MikeRalphson
Contributor Author

There are ways to express @context and @type in JSON schema native concepts. Please continue to think on these points!

@Himanshu-Dedha

So I've been going through the JSON Schema documentation for hours now, and I think I understand the mistake I was making before. Since JSON-LD is a superset of JSON and JSON Schema validates JSON objects, I assumed that validating JSON-LD with JSON Schema would lose data, i.e. the @context or the semantics of the data. But I've just realized that @context can be validated like any other key.
We can describe @context in the following way:

    "@context": {
      "type": "string",
      "format": "regex",
      "pattern": "http://schema.org"
    }
Note: Used Schema available for Schema.org on Schema Store
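One possible reading of "JSON Schema native concepts" for @context and @type is to pin their values with `enum` and `const`. That reading is an assumption here, not something the maintainer confirmed. The sketch below builds such a schema and checks instances with a deliberately tiny hand-written checker that understands only the keywords it uses. A real project would more likely use a full JSON Schema validator. Accepting the `https://schema.org` spelling is also an assumption.

```python
# Schema fragment pinning the JSON-LD keywords to fixed values.
BOOK_SCHEMA = {
    "type": "object",
    "properties": {
        "@context": {"enum": ["http://schema.org", "https://schema.org"]},
        "@type": {"const": "Book"},
        "name": {"type": "string"},
    },
    "required": ["@context", "@type"],
}


def check(instance, schema):
    """Return a list of error strings; an empty list means the instance passed.

    Supports only `required`, `const`, `enum` and `type: string`, which is
    all that BOOK_SCHEMA needs. This is not a general validator.
    """
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property {name}")
    for name, rule in schema.get("properties", {}).items():
        if name not in instance:
            continue
        value = instance[name]
        if "const" in rule and value != rule["const"]:
            errors.append(f"{name} must equal {rule['const']!r}")
        if "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name} must be one of {rule['enum']!r}")
        if rule.get("type") == "string" and not isinstance(value, str):
            errors.append(f"{name} must be a string")
    return errors


good = {"@context": "http://schema.org", "@type": "Book", "name": "Random Book"}
bad = {"@context": "http://example.com", "@type": "Person"}

good_errors = check(good, BOOK_SCHEMA)
bad_errors = check(bad, BOOK_SCHEMA)
```

Where `const` is not available in the target dialect, a single-value `enum` expresses the same constraint.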

@pragya-20
Contributor

Hello @MikeRalphson, you can find my inputs for this issue below:
I have gone through https://github.com/schemaorg/schemaorg/tree/main/data/releases/15.0 and found 5 file formats used to define the schema, which are:

  • JSON-LD (JSON for Linked Data)
  • N-Quads files
  • N-Triples files
  • RDF/XML files
  • Turtle files

More broadly, there are 3 formats for writing schema markup in web pages: JSON-LD, Microdata, and RDFa.
Frankly, while exploring schema.org I found JSON-LD easier to understand and implement than other formats such as Turtle and RDF/XML. After reading a lot of documentation, I still think JSON-LD is the better option to use programmatically, for the following reasons:

  1. It does not affect the performance of the page, because it can be loaded asynchronously.
  2. Flexible: JSON-LD can be used in various places, such as APIs and data exchange platforms, while Microdata and RDFa are tied to HTML markup, which makes them more complex to use and limits their use in other contexts.
  3. Widely used: many developers use it, so resources and tools to work with are easy to find.
  4. Interoperable: it provides a way to link different data sources by allowing you to define relationships between entities.

The file formats are not exactly equivalent. All of them represent structured data using the Schema.org vocabulary, but they differ in syntax, structure, and characteristics, and each suits specific use cases:
JSON-LD is well suited to web applications that need to exchange structured data over the internet, as it is easy to parse and generate using JavaScript. Microdata and RDFa are more closely tied to HTML and are often used to add information to the HTML code of a web page, giving its content more context and meaning.

As for the data which describes the Book type, its properties are what describe a book's data. They are listed below:

| Property | Data Type | Description |
| --- | --- | --- |
| bookEdition | Text | The edition of the book |
| bookFormat | BookFormatType | The format of the book |
| abridged | Boolean | Indicates whether the book is an abridged edition |
| illustrator | Person | The illustrator of the book |
| isbn | Text | The ISBN of the book |
| numberOfPages | Integer | The number of pages in the book |
