Algorithm for an Event Structured Data Score #120

fjjulien · 2024-07-30T14:27:02Z

fjjulien
Jul 30, 2024
Maintainer

Context

CAPACOA hired three consultants to develop a multi-modal evaluation framework for digital discoverability. During phase 1, consultants used a software called Hercules to run Google searches and analyze results pages (see the report). This research method has proven useful to tell us what kind of queries are likely to yield what kind of pages on a Search Engine Results Page (SERP). However, it does not provide any clue as to why, among similar pages, certain pages are ranking better than others.
During the next phase of the project, consultants wish to measure correlations between Schema.org structured data inputs and rankings on SERP. They therefore need a method to easily assign a structured data quality score to hundreds of event pages, and then correlate these scores with SERP ranking data. This method should enable the research team to answer questions such as:
“Is there a minimum amount of structured data - a floor - that must be met in order for Google to react to the presence of structured data?”
“Is there a threshold - a ceiling - beyond which more structured data makes no difference whatsoever in Google search results?”
“Are some structured data errors more critical than others, from Google’s point of view?”
Meanwhile, the Artsdata team developed a method to “qualify” event structured data quality, that is, to assess whether an event meets required property requirements, has recommended properties, etc. This method however requires a human to analyze the structured data and to report the information in a spreadsheet (see the template). This method is however too labour intensive for the digital discoverability measurement project.
Work order associated with this project: 2425-W-007.

Goal

Overall goal: Quantify the quality of performing arts events’ structured data.

Specific goal: Design an algorithm that will assign numeric value (i.e. a “score) to a given performing arts event’s structured data. The score should be a reflection of the extent to which the structured data is suited for:

Disambiguating the event and the nested entities associated with it;
Supporting the main use cases for event data in the performing arts industry: ticketing platforms, search engines, recommender systems, traditional media, territorial marketing (i.e. tourism), and, ultimately, consumption by arts goers (see Estermann and Julien, 2019, p. 29-34).

In the context of the Artsdata project, the reuse of structured data in events listings such as Tout Culture and La Vitrine represents a particularly important use case

Specifications

The algorithm should assign a point value to each attribute-value pair found within an event object. The point value (or ‘weighting’) should consider the importance of each attribute-value pair, relative to the above specific goals (disambiguation, and consumption-driven use cases). Those attribute-value pairs that are deemed critically important for these objectives should be given a higher weight. For example, some properties are “required” in Artsdata because they are essential for disambiguation. These properties should be given the highest weight, because disambiguation is a prerequisite for most use cases.

The algorithm’s assessment for each attribute-value pair should ideally be more than a simple true/false based on the presence or absence of a property. For those properties that are deemed important (for example, “required” or “recommended” status in Artsdata), the algorithm should also consider if the value is an expected object for the property (for example, the value for location should be an @type Place) or if it is in the right format (for example, the value for startDate should follow the syntax of ISO 8601). It may also need to verify errors, such as the @id and url properties having the same value.

Initial weighting proposal

As a starting point for discussion, I would like to propose the following point values / weightings to specific attribute-value pairs:

Weight 12

startDate with a proper xsd:startTime with no error in the timezone offset (see note 5)

Weight 6

name
location.type with expected object value (i.e. Place object or subtype)
location.name
location.sameAs with a URI constituting a unique identifier for the object

Weight: 3

location.address with expected object value for the property
location.address.postalCode with valid postal code.
id with a proper URI constituting a unique identifier for the Event (within the website domain, but distinct from the url value)
url
additionalType
description
image with a proper url value OR nested ImageObject with a proper image.url value
organizer.type with expected object value for the property
organizer.sameAs with a URI constituting a unique identifier for the object
performer.type with expected object value for the property
performer.sameAs with a URI constituting a unique identifier for the object
offers.type with expected object value for the property (Offer or AggregateOffer)
offers.url

Weight: 1

Any other property with an expected value.

Notes:

“Weight 12” and “Weight 6” properties correspond to Artsdata-required properties. “Weight 3” properties correspond to Artsdata recommended properties (see this spreadsheet for details).
I attributed the weight values so that the sum of required properties ( 12+(4x6)=36 ) is more or less equivalent to the sum of recommended properties ( 13x3=39 ).
Based on my experience of reviewing event structured data, I do not anticipate the sum of “weight 1” properties to ever exceed 20. The influence of “weight 1” properties will therefore be minimal on the total score.
Under “weight 3”, I decided not to include the @id properties for nested objects. Currently, very few sites assign an @id to nested objects, and, of those who do, the value is rarely a valid URI. Unless the algorithm can easily assess the validity of the values, it is better not to assign a nested @id a weighting higher than 1.
Validating the timezone offset value will be challenging. This would require building a capacity to call an external database to retrieve the proper timezone offset based on the location. However, in most structured data sources, there is not enough location data to disambiguate the place.
For performer.type, website CMSs often automatically assign the same type to all performer objects (for example, “Person” or “PerformingGroup”), regardless of the actual nature of the performer entity. While it would be technically possible to design an algorithm that would be able to guess the performer type based on a reconciliation of the performer.name string, I believe this would be too much work for the potential benefit. For version 1 of the algorithm, we should blindly accept any expected value (i.e. Person or Organization), and not attempt to make a judgement on the validity of the value.

Liverace · 2024-07-31T15:40:10Z

Liverace
Jul 31, 2024

You should perhaps offer links to "Tout Culture and Le Vitrine". Le Vitrine should be spelled La Vitrine.

1 reply

Liverace Jul 31, 2024

Note 1 should spell Arsdata : Artsdata

Liverace · 2024-07-31T15:51:47Z

Liverace
Jul 31, 2024

Since "Validating the timezone offset value will be challenging" should you carry on assigning it a weight of "12" at the risq of having many ERRORS in generating the score ? While startDate is considered required shouldn't it be left aside in a first phase and still the exercice allow to generate an indicative score to make your proof of concept ?

6 replies

Liverace Aug 1, 2024

Aren't Event disambiguation and identifier for the Event both linked to the issue of allocating a correct Timezone and why is it so difficult to get the venues to do it properly ? Shouldn't we find a solution to this in the first place ? Just asking...

fjjulien Aug 5, 2024
Maintainer Author

I created an issue #123, in which I describe several potential causes for inaccurate or incomplete time zone offset values. These kinds of errors are very common and we don't have enough capacity to provide one-on-one coaching to help each arts organization fix their time zone errors.

In this issue, I propose attempting to infer the local time zone based on the source of a data set rather than on location entities within Event structured data. I believe this could be a cost-efficient quick fix while the performing arts sector adopts better data practices with their startDate and location structured data.

fjjulien Aug 5, 2024
Maintainer Author

Autrement, que penses-tu de l'idée d'attribuer 10 points pour une bonne syntaxe dateTime et un autre 6 points si la valeur de fuseau horaire est correcte?

christianroy Aug 6, 2024

Je n'ai pas de réponse claire à la difficulté sur les fuseaux horaires, mais je souligne que ces problèmes sont pour moi de nature et de cause un peu différente que ce qu'on tente de mesurer en terme de qualité des données.

La présence ou l'absence d'une paire clé-valeur relève d'une erreur de spécifications, de ne pas avoir bien lu une spécification, de ne pas utiliser les gabarits d'Artsdata, d'avoir un CMS ou une base de données mal structurés, etc. «C'est de la faute au product owner!». Et je comprends que c'est ce qu'on veut mesurer.

Le bon format de date de façon générale, et le timezone en particulier, relèvent plus d'une erreur de programmation, de ne pas comprendre comment ça fonctionne, d'avoir oublié de configurer le fuseau horaire sur son serveur ou dans son WordPress, etc. «C'est de la faute au programmeur!» Et c'est moins ça qu'on veut mesurer, je pense (de la même façon qu'on ne valide pas que le JSON est valide, que les guillemets sont correctement "escapés" dans le texte de description, etc.)

Peut-être qu'on voudrait simplement disqualifier (zéro points) les données où le format ISO-8601 n'est pas respecté, et c'est tout (et considérer qu'un mauvais fuseau horaire est un "bogue", pas un niveau de qualité de données).

fjjulien Aug 7, 2024
Maintainer Author

@christianroy Tu soulèves de très bon points. Je suis d'accord avec toi que les erreurs de fuseau horaire sont dans la quasi-totalité des cas imputables au système et à son programmeur, plutôt qu'à l'utilisateur du CMS. Considérant que la détection programmatique des fuseaux horaires incorrects est très difficiles (voir #123), mieux vaut ne pas tenter d'accorder des points selon l'exactitude des valeurs de fuseau horaire.

Peut-être qu'on voudrait simplement disqualifier (zéro points) les données où le format ISO-8601 n'est pas respecté...

Si l'erreur dans la syntaxe est détectable et corrigeable (par exemple, lorsqu'il manque des secondes), nous devrions peut-être quand même accepter les valeurs (voir la même question soulevée par Gregory plus bas).

dlh28 · 2024-07-31T18:54:52Z

dlh28
Jul 31, 2024
Maintainer

I think that location.address.postalCode should be assigned a higher weight value than 3. Location is a required (not a recommended value) and from my understanding of Artsdata, it can be used just as easily as a sameAs URI to automatically disambiguate location values. It is also much more likely that an organization will have a postal code than a sameAs value in their structured data.

3 replies

fjjulien Aug 5, 2024
Maintainer Author

I get your point and I agree with your assumption that organizations are more likely to "have a postal code than a sameAs value in their structured data".

My proposed weight value of 3 for location.address.postalCode is based on another assumption: that an organization who bothers populating a postal code in the structured data would normally also populate other address properties.

If this assumption is true, then a complete address with a postal code would be worth 10 points under the proposed algorithm:

"type": "PostalAddress": 3 points
"streetAddress": 1 point
"addressLocality": 1 point
"addressRegion": 1 point
"postalCode": 3 points
"addressCountry": 1 point
Total: 10 points

This is considerably more than the 6-point weight assigned to location.sameAs.

Do you agree with this assumption? And is a 10-point total value for a full location.address object too low, too high or good enough?

And do you share my concern that a fully described location object could have a disproportionate weight compared to the rest of attribute-value pairs? If so, we could shift the weight of location.sameAs from 6 to 3 points.

fjjulien Aug 8, 2024
Maintainer Author

@dlh28 What do you think of the suggestion to shift the weight of location.sameAs from 6 to 3 points?

Or would you prefer that we create a distinct weighting category for location.sameAs and location.address.postalCode? We could give each one a 4-point value.

dlh28 Aug 27, 2024
Maintainer

If we score "streetAddress", "address Locality", etc., separately as 1-point values, then I think it would make sense to give location.address.postalCode and location.sameAs the exact same numerical values (since we have been treating them both on equal standing in our evaluation of minimum requirements for event discoverability).

saumier · 2024-08-05T21:11:20Z

saumier
Aug 5, 2024
Maintainer

startDate with a proper xsd:startTime with no error in the timezone offset (see note 5)

There is a typo in the datatype. In Artsdata the datatypes are xsd:dateTime or xsd:date. The Artsdata pipeline sets the datatype to either xsd:date for dates (2024-08-02) or xsd:dateTime (2024-08-02T20:00:00-04:00) depending if there is time. This is important in Artsdata for searching and filtering by date/times.

However, the JSON-LD @context of schema.org sets the datatype of all Event startDates and endDates to schema:Date, and not xsd:dateTime nor schema:DateTime.

Since, in the wild, the vast majority of JSON-LD uses the schema.org @context, we really only need to check the syntax, and not the datatype.

See my comments on detecting errors in timezone offset #123 (comment). In this project I will only be able to fix missing timezones, and not incorrect timezones.

@fjjulien My questions:

Do we also give 12 points when the startDate has no time and only date? Example: 2024-08-02 (which is a valid ISO 8601 syntax)?
If not, then is there a different score for date only?
Do we fix missing timezone (as per the solution of providing a timezone for all events on a specific website) before applying the algorithm?
Do we fix errors in the syntax (see table below) before applying the algorithm, or do we remain strict and not fix anything? Artsdata typically fixes basic errors such as missing seconds, missing double digits for hours...

Syntax should be CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm]
Some common syntax errors

value	problem	solution
2023-09-15T10:38:22-0400	Missing colon in timezone offset	Add colon
2023-9-15T10:38:22-04:00	Month is single digit	Prefix 0 to month
2023-09-1T10:38:22-04:00	Day is single digit	Prefix 0 to day
2023-09-15T10:38-04:00	Missing seconds	Add :00 to time

5 replies

fjjulien Aug 7, 2024
Maintainer Author

@saumier

Do we also give 12 points when the startDate has no time and only date? Example: 2024-08-02 (which is a valid ISO 8601 syntax)?
If not, then is there a different score for date only?

On one hand, for simplicity's sake, I'm inclined to treat date only and date-time values on equal footing. It's entirely legitimate for a festival or a day-long event to have just a date and no start time. While we could design the algorithm to detect a festival type and adapt the score accordingly, it would add complexity.

On another hand, we could design the algorithm on the assumption that there will be few festival events in the dataset, and decide to only give full points to start dates with a time.

What do others think?

Do we fix missing timezone (as per the solution of providing a timezone for all events on a specific website) before applying the algorithm?

My proposition about time zones was meant primarily to "validate" the correctness a time zone offset value. However, as you wrote in #123, wrong time zone values are difficult to detect programmatically. A startDate without a time zone is much less of a problem than a startDate with a wrong time zone. If we proceed with implementing a solution to fix missing time zone, I would run this process prior to running the algorithm. And, if we to decide to accept all startDate values that have a valid ISO 8601 syntax (with or without time), then we should logically also accept values without time zones.

Do we fix errors in the syntax (see table below) before applying the algorithm, or do we remain strict and not fix anything? Artsdata typically fixes basic errors such as missing seconds, missing double digits for hours...

The algorithm is meant to assess to what extent structured data lends itself to supporting consumption-based use cases. Does Google fix basic errors? We would need to check with Google Rich Results. If Google fixes some errors, then algorithm she be applied after errors are fixed.

Liverace Aug 8, 2024

J'avais l'impression que la présence de la chaîne complète CCYY-MM-DDThh:mm:ss[Z|(+|-)hh:mm] était ce qui vous permettait d'assigner un identifiant unique à un événement. Est-ce toujours le cas et cela représente-t-il le gain majeur qu'il me semble ? Auquel cas le boulot de "fixer" les manques et erreurs me semble être un pré-requis intéressant à réaliser dans le cadre de ce projet de production d'un algorithme. Lorsque les informations sont impossibles à obtenir, ne pourrait-on pas, comme moindre mal, insérer une valeur par défaut comme 2023-09-15T00:00:00-00:00 ?

christianroy Aug 8, 2024

Lorsque les informations sont impossibles à obtenir, ne pourrait-on pas, comme moindre mal, insérer une valeur par défaut comme 2023-09-15T00:00:00-00:00 ?

Objection, votre honneur! 😅 J'aime mieux des données inutilisables (et donc inutilisées) que des données erronées qui se propagent d'un système interconnecté à l'autre et qui deviennent pratiquement impossible à corriger.

fjjulien Aug 8, 2024
Maintainer Author

@Liverace Je suis d'accord avec Christian. Lorsqu'il est question de données de type dateTime, une valeur 0 (zéro) n'égale pas une valeur non-spécifiée. Dans ton exemple 2023-09-15T00:00:00-00:00, la valeur correspond exactement le 15 septembre à minuit, heure de Greenwich. Si la valeur était 2023-09-15 alors ça désigne le 15 septembre à une heure non-spécifiée. Il peut par exemple s'agir d'un événement d'une journée entière, dans quel cas l'ajout de valeurs nulles créerait une startDate erronée.

saumier Aug 15, 2024
Maintainer

Does Google fix basic errors? We would need to check with Google Rich Results. If Google fixes some errors, then algorithm she be applied after errors are fixed.

@fjjulien To you question, I know that validator.schema.org fixes some errors. Although the Schema.org validator is not Google, I will proceed with the idea that "fixable" errors should be fixed before applying the algorithm.

Do we also give 12 points when the startDate has no time and only date? Example: 2024-08-02 (which is a valid ISO 8601 syntax)?

YES - I will use the syntax CCYY-MM-DD[Thh:mm:ss[Z|(+|-)hh:mm]]

Do we fix missing timezone (as per the solution of providing a timezone for all events on a specific website) before applying the algorithm?

YES

Do we fix errors in the syntax (see table below) before applying the algorithm, or do we remain strict and not fix anything? Artsdata typically fixes basic errors such as missing seconds, missing double digits for hours...

YES - as per table of common syntax errors at the top of this thread.

christianroy · 2024-08-06T18:06:00Z

christianroy
Aug 6, 2024

Je comprends l'idée d'assigner plus de points aux propriétés recommandées, mais je me demande si c'est utile de le faire pour les propriétés obligatoires... Je m'explique: est-ce qu'il ne faudrait pas plutôt donner automatiquement zéro si les propriétés obligatoires ne sont pas remplies correctement? C'est la «barrière à l'entrée». Et on pourrait assigner un nombre fixe de points (ex: 30) pour l'ensemble des propriétés obligatoires. Ensuite, les propriétés supplémentaires ajoutent de la valeur.

Dans le scénario actuel, si j'interprète bien:

un événement qui aurait les 18 propriétés obligatoires et recommandées SAUF les numéros 4 et 5 (qui sont essentielles pour connaître le lieu) aurait 12 + 26 + 133 = 63 points, alors qu'il est invalide (parce que ne contient pas les propriétés obligatoires)
un autre événement avec seulement les propriétés obligatoires aurait 12 + 4*6 = 36 points

Avec ma proposition, le premier événement aurait 0 (zéro) et le deuxième le nombre fixe (par exemple 30). On pourrait utiliser un autre approche, mais je pense qu'il faut absolument que les événements qui ne respectent pas les obligations ne puissent pas avoir plus de points que ceux qui les respectent.

6 replies

christianroy Aug 8, 2024

Sur quelles bases qualifierait-on ou disqualifierait-on un objet imbriqué sous location?

J'avais compris que type, name et sameAs étaient les sous-propriétés obligatoires (parce qu'elles ont un poids de 6), donc ce serait sur cette base là (dès qu'une des trois n'est pas présente, disqualification).

Pour ces raisons, il me semble qu'il serait préférable de conserver un pointage variable pour les propriétés obligatoires.

Je n'ai pas d'objection à ça. Mon propos est qu'il ne faudrait pas que le système de pointage permette à des événements invalides (au sens où ils ne respectent pas les obligations) mais enrichis aient plus de points que des événements valides mais sans enrichissements, comme dans l'exemple fictif que je donnais. Une fois de s'en assurer est simplement d'annuler tous les points dès que les obligations ne sont pas rencontrées.

fjjulien Aug 8, 2024
Maintainer Author

J'avais compris que type, name et sameAs étaient les sous-propriétés obligatoires (parce qu'elles ont un poids de 6), donc ce serait sur cette base là (dès qu'une des trois n'est pas présente, disqualification).

Dans le modèle d'Artsdata, location.sameAs n'est pas une propriété obligatoire. C'est par contre propriété qui rend possible une désambiguïsation automatisée du lieu (lorsque la valeur est un identifiant pérenne) et on lui accorde conséquemment plus d'importance qu'aux propriétés recommandées. Disons qu'elle a un statut que je qualifierais de envoye donc siouplait ça serait vraiment vraiment le fun.

Si on voulait être cohérants avec le modèle d'Artsdata, il faudrait déplacer les propriétés 4 et 5 dans le groupe de pondération 3. Mais il me semble qu'un triplet location.sameAs a beaucoup de valeur qu'un triplet organizer.sameAs, n'est-ce pas?

fjjulien Aug 8, 2024
Maintainer Author

Dans le scénario actuel, si j'interprète bien:

un événement qui aurait les 18 propriétés obligatoires et recommandées SAUF les numéros 4 et 5 (qui sont essentielles pour connaître le lieu) aurait 12 + (2x6) + (15x3) = 57 points, alors qu'il est invalide (parce que ne contient pas les propriétés obligatoires)

un autre événement avec seulement les propriétés obligatoires aurait 12 + (4*6) = 36 points

C'est théoriquement vrai, quoi qu'assez improbable. Je doute qu'une source de données structurées puisse manquer location.name mais se donne la peine de renseigner location.address.postalCode. Même si cela était le cas, de telles données auraient quand même assez de profondeur pour être désambiguisables par un humain.

fjjulien Aug 8, 2024
Maintainer Author

@christianroy Au-delà de tous mes commentaires précédents (qui se voulaient d'abord et avant tout des clarifications), je serais d'accord à mettre en place une logique de disqualification. Cependant, je proposerais une logique légèrement différente de la tienne :

Si un événement n'a pas les propriétés 1, 2 et 3 (startDate, name, location.type), alors l'événement est disqualifié et son pointage est 0 (zéro).
Autrement, on applique la pondération de l'algorithme sur chaque pair attribut valeur et on calcule le score.

Selon ce scénario, il serait théoriquement possible qu'un événement sans location.name ait plus de points qu'un événement ayant toutes les propriétés de valeur 6. Toutefois, la probabilité serait assez faible que cela n'aurait pas d'incidence significative sur les données du projet de mesure de la découvrabilité.

Pour contrer cette probabilité, nous pourrions augmenter les pointages des propriétés 1, 2 et 3 pour qu'ils aient collectivement une valeur un plus élevée. Par exemple, je proposais plus haut d'augmenter la valeur de startDate à 15.

On pourrait par ailleurs retirer la propriété 6 (location.address) de la pondération de niveau 3 et lui accorder un seul point. Cela aurait pour effet à la fois de diminuer la valeur totale des propriétés recommandées. De plus, ça équilibrerait le poids relatif de startDate comparativement à la valeur totale d'un objet imbriqué de type Place. Selon ma proposition initiale, un objet imbriqué de type Place, pourrait avoir une valeur totale de 22 (24 en incluant la longitude et la latitude), ce qui est deux fois plus que la valeur de 12 accordées à startDate.

Si on montait startDate de 12 à 15 et qu'on diminuait location.address @type de 3 à 1, alors :

Les priorités de pondération 15 et 6 auraient une poids combiné de 15 + 4x6 = 39 et les propriétés de pondération 3 auraient un poids combiné de 12x3=36.
startDate aurait un poids de 15 et un objet de type Place complet aurait un poids de 20 (22 en incluant la longitude et la latitude).

On peut jouer avec le poids de startDate pour arriver à un beau chiffre rond de 40, mais il me semble qu'on est en train d'arriver à un équilibre intéressant. Est-ce que ça te conviendrait?

christianroy Aug 12, 2024

Oui! Je n'ai pas de souhait particulier sur telle ou telle propriété et sur les pointages, je pense que ça doit simplement être cohérent avec ce qui est exigé et documenté pour Artsdata. Mon point était surtout d'éviter des résultats incongrus comme le scénario fictif que j'évoquais et je pense que la simple disqualification mentionnée dans ta proposition révisée règle le problème!

fjjulien · 2024-08-10T16:04:57Z

fjjulien
Aug 10, 2024
Maintainer Author

Revised weighting proposal

Taking into account all the excellent feedback provided so far, I would like to propose this revised weighting. It introduces a new category worth 4 points for properties that are deemed useful for disambiguation. Required properties are given a weight of 8 points and recommended properties are brought down to a weight of 2 points to address concerns that recommended properties may collectively have a higher cumulative value than required properties. I also propose to integrate @christianroy's proposal to give a null score if an event does not have all three required properties.

Required + disambiguation property: 8+4= 12 points

startDate with a value that passes Artsdata SHACL validation (proper ISO-8601 syntax or minimal errors that are tolerated by the SHACL validation)

Required properties: 8 points

name
location.type with expected object value (i.e. Place object or subtype)

Disambiguation properties: 4 points

location.name
location.address.postalCode with valid postal code.
location.sameAs with a URI constituting a unique identifier for the object

Recommended properties: 2 points

id with a proper URI constituting a unique identifier for the Event (within the website domain, but distinct from the url value)
url
additionalType
description
image with a proper url value OR nested ImageObject with a proper image.url value
organizer.type with expected object value for the property
organizer.sameAs with a URI constituting a unique identifier for the object
performer.type with expected object value for the property
performer.sameAs with a URI constituting a unique identifier for the object
offers.type with expected object value for the property (Offer or AggregateOffer)
offers.url

Other properties: 1 point

Any other property with an expected value (including "location.address.type", which I'm proposing to bump down for simplicity's sake and to balance the total weighting of space attributes compared to time attributes).

Under this proposal:

An event with all three required properties would have a score of 12 + (2 x 8) = 28.
- If any of the three required properties is missing, the score would be 0 (zero), no matter how good the rest of the structured data is.
An event with all three disambiguation properties would have 12 additional points, for a sub-total of 40.
An event with all 11 recommended properties would have 22 additional points. If other contributors wished to keep the weight of recommended properties to 3, the total would be 33, which is in the same ball park as the value of required properties.

1 reply

saumier Sep 5, 2024
Maintainer

@fjjulien @christianroy @dlh28 @Liverace This weighting proposal has been implemented with the exception of "Other properties: 1 point" .

https://github.com/culturecreates/artsdata-score

@fjjulien I propose removing "Other properties: 1 point" from scope because it is too open ended.

To try it out

You can test this on individual webpages by going to artsdata.ca, pasting a webpage url into the top right search box, then in the options for "External resources" click dereference, and then clickthe link compute score. This will reload the webpage with the score added into the Event data (keep scrolling down).

The next task is to enable a batch of webpage urls.

fjjulien · 2024-08-30T17:20:41Z

fjjulien
Aug 30, 2024
Maintainer Author

Although the scores are meant primarily for internal needs, they will sooner or later start to circulate. We may for example include the score in feedback to Digital Discoverability Program participants. However, a score on its own, without any scale or interpretation note is meaningless.

In order to help external users make sense of the structured data scores, I think we should be able to provide qualitative interpretations base on score tiers. I would like to propose these suggested interpretations:

<40: The event structured data is good. An event type is detected and all fundamental attributes – name, date-time, and location – are documented.
40-59: The event structured data is very good. In addition to clearly identifying and describing the event, the structured data also includes some information about related entities, for example the performer(s), the organizer (i.e. the presenting organization) and/or the offer(s) (i.e. information about tickets).
<60: The event structured data is excellent. The structured data is detailed enough to clearly identify, describe and categorize the event and includes riche information about the related entities. The structured data may also include links to external knowledge bases identifying the same event or related entities.

Please share your comments, and propose edits.

2 replies

dlh28 Aug 30, 2024
Maintainer

I don't believe that sharing this score with participants would be helpful. Especially for the overachievers who are thinking on a academic-esque percentage scale and may think anything less than an 80(%) is a "failure". It would just be confusing at best and cause some unnecessary anxiety at worst.

What I find has been helpful instead is providing participants with a checklist (see this example for Aurora Cultural Centre). Using Tammy's initial template that breaks down minimum, recommended and advanced properties, I check off whatever is present in the structured data. I highlight whatever properties are missing/incorrectly formatted that make the data unusable (in orange) and whatever properties aren't present/properly formatted, where fixing them would be extremely helpful to meet that particular organization's needs (in green).

This way, it's less of a "scholarly" judgment on their structured data (where we're assigning "grades") and instead, simply providing a list of available properties and guiding them to make whatever changes fit their personalized needs.

fjjulien Aug 30, 2024
Maintainer Author

@dlh28 Thanks for your feedback. After having read your comment, I'm changing my mind and agreeing with you: we should not share these scores with participants in the Digital Discoverability Program.

fjjulien · 2024-09-05T20:25:03Z

fjjulien
Sep 5, 2024
Maintainer Author

That's unfortunate, because properties such as endDate are more than mere nice-to-have. Would it push the project behind if you at least included the "optional" properties from the Artsdata-specific instructions? If it does significantly impact the critical path, we will be okay with the algorithm in the current format. fj Le jeu. 5 sept. 2024, 2 h 45 p.m., Gregory Saumier-Finch < ***@***.***> a écrit :

…

@fjjulien <https://github.com/fjjulien> @christianroy <https://github.com/christianroy> @dlh28 <https://github.com/dlh28> @Liverace <https://github.com/Liverace> This weighting proposal has been implemented with the exception of "Other properties: 1 point" . https://github.com/culturecreates/artsdata-score @fjjulien <https://github.com/fjjulien> I propose removing "Other properties: 1 point" from scope because it is too open ended. To try it out You can test this on individual webpages by going to artsdata.ca, pasting a webpage url into the top right search box, then in the options for "External resources" click *dereference*, and then clickthe link *compute score*. This will reload the webpage with the score added into the Event data (keep scrolling down). The next task is to enable a batch of webpage urls. — Reply to this email directly, view it on GitHub <#120 (reply in thread)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ASW5GOUY3URKCL4WKAQEY6LZVCRDNAVCNFSM6AAAAABLWQBN2KVHI2DSMVQWIX3LMV43URDJONRXK43TNFXW4Q3PNVWWK3TUHMYTANJWGEYTSMA> . You are receiving this because you were mentioned.Message ID: <culturecreates/artsdata-data-model/repo-discussions/120/comments/10561190 @github.com>

3 replies

saumier Sep 7, 2024
Maintainer

@fjjulien It was the "any other property" that was problematic. If we make a list then I can add them.

I propose the following properties for 1 point each:

schema:location.address.type
schema:location.url
schema:location.address.streetAddress
schema:location.address.addressLocality
schema:location.address.addressRegion
schema:location.address.addressCountry
schema:endDate
schema:eventStatus
schema:eventAttendanceMode
schema:doorTime
schema:duration
schema:mainEntityOfPage.url

Let me know if this works for the 1 point properties.

saumier Sep 10, 2024
Maintainer

@fjjulien I have added the 12 properties for 1 point each.

fjjulien Sep 10, 2024
Maintainer Author

This is good.

By restricting the number of 1-point properties, the score is no longer open-ended: it has a maximum value.

Required properties: 28
Reconciliation properties: 12
Recommended properties: 22
Optional properties: 12
Maximum score: 74

fjjulien · 2024-09-10T19:02:51Z

fjjulien
Sep 10, 2024
Maintainer Author

Representing the structured data score over a 100-point scale

An absolute score with a value between 0 and 74 meets all the research needs for the digital discoverability measurement project. However, since the discoverability measurement project team is planning to correlate the structured data score with a SERP score, which will likely have a different scale, it may become difficult for a human to make sense of scores on different scale. In order to make it easier for humans to interpret the structured data score and to compare it with a SERP score, we would like both scores to be transformed into matching 100-point scales.

@saumier Could you implement a rule of three to transform the 74-point score into a 100-point score for human interpretation? No decimals are needed (you may round it to the integer). Rather than giving the minimal qualification score (28 for all three required property) a percentage of 39 (i.e. 28/74), we would like this qualification score to be ascribed a percentage value of 50 (and all higher scores distributed between 50 and 100).

Note: This request falls in the "nice-to-have" category. If it is too difficult to display this percentage score in Nebula and to deliver it through the batch process (on top of the absolute score), we could easily run the rule of three outside of Artsdata. If you stumble upon a block, please do not waste countless hours on it this request.

1 reply

saumier Sep 13, 2024
Maintainer

@fjjulien No problem. I added this (((score-28)/46)*50) + 50 and added it to the property ex:scorePercent.
Took 15 min.

saumier · 2024-09-25T16:33:56Z

saumier
Sep 25, 2024
Maintainer

@fjjulien @christianroy The batch tool is ready to be used. I'd like to make a demo to see if it meets the needs of this project.

Is there a time you would like to meet for a demo?

Note: the tool can also load websites with injected JSON-LD by rendering the javascript in a headless browser (this was not available at the start of this project but has been developed since).

0 replies

Algorithm for an Event Structured Data Score #120

fjjulien Jul 30, 2024 Maintainer

Context

Goal

Specifications

Initial weighting proposal

Weight 12

Weight 6

Weight: 3

Weight: 1

Notes:

Replies: 10 comments · 28 replies

fjjulien Aug 5, 2024 Maintainer Author

fjjulien Aug 5, 2024 Maintainer Author

fjjulien Aug 7, 2024 Maintainer Author

dlh28 Jul 31, 2024 Maintainer

fjjulien Aug 5, 2024 Maintainer Author

fjjulien Aug 8, 2024 Maintainer Author

dlh28 Aug 27, 2024 Maintainer

saumier Aug 5, 2024 Maintainer

fjjulien Aug 7, 2024 Maintainer Author

fjjulien Aug 8, 2024 Maintainer Author

saumier Aug 15, 2024 Maintainer

fjjulien Aug 8, 2024 Maintainer Author

fjjulien Aug 8, 2024 Maintainer Author

fjjulien Aug 8, 2024 Maintainer Author

fjjulien Aug 10, 2024 Maintainer Author

Revised weighting proposal

Required + disambiguation property: 8+4= 12 points

Required properties: 8 points

Disambiguation properties: 4 points

Recommended properties: 2 points

Other properties: 1 point

saumier Sep 5, 2024 Maintainer

To try it out

fjjulien Aug 30, 2024 Maintainer Author

dlh28 Aug 30, 2024 Maintainer

fjjulien Aug 30, 2024 Maintainer Author

fjjulien Sep 5, 2024 Maintainer Author

saumier Sep 7, 2024 Maintainer

saumier Sep 10, 2024 Maintainer

fjjulien Sep 10, 2024 Maintainer Author

fjjulien Sep 10, 2024 Maintainer Author

Representing the structured data score over a 100-point scale

saumier Sep 13, 2024 Maintainer

saumier Sep 25, 2024 Maintainer

fjjulien
Jul 30, 2024
Maintainer

Replies: 10 comments 28 replies

fjjulien Aug 5, 2024
Maintainer Author

fjjulien Aug 5, 2024
Maintainer Author

fjjulien Aug 7, 2024
Maintainer Author

dlh28
Jul 31, 2024
Maintainer

fjjulien Aug 5, 2024
Maintainer Author

fjjulien Aug 8, 2024
Maintainer Author

dlh28 Aug 27, 2024
Maintainer

saumier
Aug 5, 2024
Maintainer

fjjulien Aug 7, 2024
Maintainer Author

fjjulien Aug 8, 2024
Maintainer Author

saumier Aug 15, 2024
Maintainer

fjjulien Aug 8, 2024
Maintainer Author

fjjulien Aug 8, 2024
Maintainer Author

fjjulien Aug 8, 2024
Maintainer Author

fjjulien
Aug 10, 2024
Maintainer Author

saumier Sep 5, 2024
Maintainer

fjjulien
Aug 30, 2024
Maintainer Author

dlh28 Aug 30, 2024
Maintainer

fjjulien Aug 30, 2024
Maintainer Author

fjjulien
Sep 5, 2024
Maintainer Author

saumier Sep 7, 2024
Maintainer

saumier Sep 10, 2024
Maintainer

fjjulien Sep 10, 2024
Maintainer Author

fjjulien
Sep 10, 2024
Maintainer Author

saumier Sep 13, 2024
Maintainer

saumier
Sep 25, 2024
Maintainer