Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TG2-VALIDATION_DAY_STANDARD #147

Open
Tracked by #24
ArthurChapman opened this issue Aug 26, 2018 · 39 comments
Open
Tracked by #24

TG2-VALIDATION_DAY_STANDARD #147

ArthurChapman opened this issue Aug 26, 2018 · 39 comments
Labels
CODED Conformance CORE TG2 CORE tests Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT TG2 TIME Validation

Comments

@ArthurChapman
Copy link
Collaborator

ArthurChapman commented Aug 26, 2018

TestField Value
GUID 47ff73ba-0028-4f79-9ce1-ee7008d66498
Label VALIDATION_DAY_STANDARD
Description Is the value of dwc:day an integer between 1 and 31 inclusive?
TestType Validation
Darwin Core Class dwc:Event
Information Elements ActedUpon dwc:day
Information Elements Consulted
Expected Response INTERNAL_PREREQUISITES_NOT_MET if dwc:day is bdq:Empty; COMPLIANT if the value of the field dwc:day is an integer between 1 and 31 inclusive; otherwise NOT_COMPLIANT.
Data Quality Dimension Conformance
Term-Actions DAY_STANDARD
Parameter(s)
Source Authority
Specification Last Updated 2023-09-18
Examples [dwc:day="15": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:day is in range"]
[dwc:day="32": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:day is not in range"]
Source TDWG2018 DQIG Meeting; TG2-Gainesville
References
Example Implementations (Mechanisms) FilteredPush/Kurator:event_date_qc 10.5281/zenodo.596795.
Link to Specification Source Code event_date_qc DwCEventDQ.validationDayStandard()
Notes This is part of a group of similar tests (VALIDATION_DAY_INRANGE (8d787cb5-73e2-4c39-9cd1-67c7361dc02e), VALIDATION_STARTDAYOFYEAR_INRANGE (85803c7e-2a5a-42e1-b8d3-299a44cafc46), VALIDATION_ENDDAYOFYEAR_INRANGE (9a39d88c-7eee-46df-b32a-c109f9f81fb8)).
@ArthurChapman
Copy link
Collaborator Author

We have changed the description of this from "is not an unambiguous value" to "is not an integer between 1 and 31 inclusive.

@chicoreus
Copy link
Collaborator

Isn't this a duplicate of #125 ?

@Tasilee Tasilee added the Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT label Aug 30, 2018
@chicoreus
Copy link
Collaborator

Added guid.

@Tasilee Tasilee added Test and removed Test labels Sep 25, 2018
chicoreus added a commit to FilteredPush/event_date_qc that referenced this issue Mar 12, 2019
…DESCRIPTION: VALIDATION_DAY_NOTSTANDARD now returns NOT_COMPLIANT for any dwc:day value except an integer in the range 1-31, prerequisites not met no longer thrown internally to this implementation as that is only returned if term is absent.
@chicoreus
Copy link
Collaborator

Field not present as a prerequisite is a significant barrier for implemenations. An implementation of an atomic test as a method which takes values as parameters e.g. validateDayNotStandard(day), is very likely to be isolated from a larger framework which understands whether or not a term is present in an input data set or not. Bringing "is present" into the test definition conflates areas of concern.

This gets back to the need for a definition of isEmpty - if we define isEmpty to include null values, empty strings, and cases where the term isn't present in the data set then we can consistently define prerequisites as being not empty.

I thus propose that we change the element of the definition "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:day is not present; " to "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:day is EMPTY;", reference a standard definition of EMPTY, and propagate this change to all other test definitions.

@tucotuco
Copy link
Member

@chicoreus I see what you mean about being able to abstract the test in code be independent of the record structure that it might act on. Defining EMPTY to include when a field isn't present seems an elegant simplification that I don't anticipate would cause any issues. I mean, once it comes to the test implementation, it doesn't much matter of there was a source value that was null or if there was no source field. Making the change would likely simplify all the descriptions that reference 'not present'. Downside: it means a review and rewrite of a bunch of tests - it looks like there might be 67 tests affected. I would not wish that on Lee or Arthur.

@Tasilee
Copy link
Collaborator

Tasilee commented Mar 27, 2019

If setting up a definition of EMPTY will consolidate intent, so be it. I am happy to work through the tests and change the Expected response wording.

But how best to do it? My initial reaction is to add a new Issue for EMPTY that can therefore be referenced, otherwise, in relation to this work, it will be hanging?

What would the definition be?

EMPTY:
Field not present
Field contains a string of one or more null characters
Field contains a string of one or more blank characters
...

@tucotuco
Copy link
Member

tucotuco commented Mar 27, 2019 via email

@Tasilee
Copy link
Collaborator

Tasilee commented Mar 27, 2019

Thanks @tucotuco. I like the definition but where do we put it so that it is at least as referenceable as the tests and comments?

@ArthurChapman
Copy link
Collaborator Author

ArthurChapman commented Mar 28, 2019

It goes into the Vocabulary #152

@Tasilee
Copy link
Collaborator

Tasilee commented Mar 28, 2019

Thanks @ArthurChapman. Duh, I forgot about that :|. Will do it now.

@ArthurChapman
Copy link
Collaborator Author

But note we already have a definition given by @chicoreus. Perhaps we need to discuss more - see previous discussion in #111

@Tasilee
Copy link
Collaborator

Tasilee commented Mar 28, 2019

I like @tucotuco 's :) as I can easily understand it, and added it to #111 and the vocab #152. Lets see what @chicoreus says. I'm sure he will find an issue? Ha.

@ArthurChapman
Copy link
Collaborator Author

Problem is, @tucotuco doesn't cover everything - like 9999, etc. and some other things that were in @chicoreus version. That is why we need to discuss more fully. @tucotuco and I will look at it more tomorrow.

@chicoreus
Copy link
Collaborator

I think a general definition of EMPTY on the line of @tucotuco's would work, if accompanied by notes about implementations in different languages (empty string, null (if the language supports it e.g in java, javascript, sql), undefined (e.g. in javascript), a char array of size 0 (e.g. in C)). I'm not quite sure, however, about non-printing characters (if using either an ascii or unicode definition), as those are characters and constitute data, and could represent mistyped values (e.g. something that's supposed to be a small integer that is being represented as a low ascii character). I'd be inclined to treat a non-printing character as non-empty and then let tests downstream fail. Based on the data I've seen in the wild (and @tucotuco has seen plenty more), I'd propose including several strings that are typical string serializations of a database null ("\N", "null", "NULL"). I also put on the table, but am not particularly happy with other common serializations of empty value markers ("**", "9999", etc.). The trouble with these is that some of them might be valid data values for some fields, so they don't seem to fit a generalized definition of empty.

@tucotuco
Copy link
Member

tucotuco commented Mar 28, 2019 via email

@tucotuco
Copy link
Member

Following the discussions arising from the event date case study for the BISS paper, I believe that this test should be updated to use the expected response and notes of TG2-VALIDATION_DAY_OUTOFRANGE (#125), which should then be deprecated. Everything else about this test is already up to date with the proposed change.

@Tasilee
Copy link
Collaborator

Tasilee commented Aug 14, 2019

@chicoreus: I'm presuming you are suggesting the removal of all phrases like "dwc:xxx is not present"? It does in retrospect seem odd that we have defined EMPTY but not PRESENT.

@chicoreus
Copy link
Collaborator

@Tasilee Yes, i would very strongly advocate the removal of "is not present" everywhere it occurs. I would further advocate that we ensure that the definition of isEmpty()/EMPTY allows test implementations to treat the absence of a term in a single record as an empty value without making a distinction between the term being present with no value and the term being absent. This allows tests to be implemented as methods with signatures that take a list of terms (e.g. validationDayNotStandard(day, month, year), rather than as methods that take Darwin Core records and internally extract the terms from them (which is what we would be forced into by assessing the presence/absence of terms at the level of each test).

Presence/absence of terms related to the core in a MultiRecord feels like a distinct test related to only a particular subset of data preparation pipelines, and not being a question about the utility of the data for consumers (e.g. all research consumers would care about is the presence of values for dwc:eventDate, while year, month, and day are primarily concerns of upstream providers and aggregators attempting to map disparate schemas onto Darwin Core and then to aggregate disparate mappings onto a common provision of dwc:eventDate).

I will make two claims: (1) Including a test of terms being present as part of the assessment of emptyness of terms at the level of individual tests puts severe engineering constraints on the implementations of those tests (e.g. @tucotuco couldn't have assessed the difference between term present/absent and term value = null in his sql implementation of the subset of TIME tests). And (2) assessing whether terms are present or absent in a particular SingleRecord or MultiRecord is out of scope for the CORE tests as it relates to a subset of use cases around mapping the data that are out of scope of the set of uses for the data that we examined for CORE.

(1) Says we can define a separate test for presence of terms in a data serialization. (2), if accepted, says we can ignore this issue and just drop is present, (2) if rejected says we should define a separate test for the presence/absence of the information elements/darwin core terms we identified as the set relevant to the core tests.

@tucotuco
Copy link
Member

I think it simplifies everything to include in our EMPTY definition the case where a term is not present and remove what would then be superfluous wording.

@Tasilee
Copy link
Collaborator

Tasilee commented Aug 14, 2019

Thanks @chicoreus. I always need to read your comments 4 times to get my head around the concepts. Yes, you all know I am slow by now. Anyway, I think I agree with the conclusions and with @tucotuco. I will need to await clarification of the EXTERNAL_PREREQUISITES question before editing. I don't want to do two passes! Again. I'll leave the definitions to @ArthurChapman and @tucotuco

@ArthurChapman
Copy link
Collaborator Author

OK - I will attempt to change the definition of Empty in #152 (and in the BISS Paper)

@ArthurChapman
Copy link
Collaborator Author

I have changed the definition of EMPTY to:
A field that is either not present or does not contain any characters or values. Note: A field containing non-printing or other invalid characters or values (including NULL values) may be separately detected.

@Tasilee
Copy link
Collaborator

Tasilee commented Aug 15, 2019

Do we need a note about @chicoreus comment on difficulty of implementing a concept of fields that are 'not present'

@chicoreus
Copy link
Collaborator

That is probably very worth putting into an informative document as part of the standard.

@chicoreus
Copy link
Collaborator

This issue is still conflating standardization (unambiguously interpreting text strings as integers) with a test of range (does the day exist for the given month and year).

We should have a test validation_day_outofrange(day, month, year) testing the latter (day exists for month and year), and this test #147 validation_day_notstandard(day) paired with #127 amendment_day_standardized(day) handling cases where values like "1st" can be unambigously interpreted as 1 (and for such, we should use the language unambigously interpreted instead of cast. Cast has a specific meaning in typed programing languages. @ArthurChapman somewhere mentioned cast of the string "1.240" representing a decimal number to the integer 1, which is likely to fail when presented with a european format "1,1240" string, and in many implementations of cast of the string to an integer will fail as the string doesn't contain an integer "1" might work, "1.0" is unlikely to. Actual implementations are likely to use parsing of numbers from strings rather than cast, and the expected case for any Darwin Core input values is as data typed as strings rather than as numeric types (given the absence of numeric types in the formal definitions, and the widespread presence of untyped values in csv files in Darwin Core Archives...)..

@ArthurChapman
Copy link
Collaborator Author

@chicoreus Check out #125 which was deprecated by @tucotuco . Are you suggesting it be resurected? Check out discussions there

@chicoreus
Copy link
Collaborator

We need both this and #125. The validation #125 is needed to test whether a value for day exists for the given month and year. It must treat non-integer values as internal prerequisites not met. This test validation #147 (which should take only dwc:day as a parameter and test whether the value of day is an integer) forms a couplet with the amendment #127 (which can propose the replacement of a non standard non-integer value with an integer (e.g. 2nd to 2)). This pairing of a validation and an amendment where the validation tests for exactly what the amendment can correct is critical to measuring improvement of quality. The test (#125) for the (not able to be corrected from these three fields alone, and thus not paired with an amendment) existence of a day value given the month and year is a distinct test, and we need both.

@chicoreus
Copy link
Collaborator

This test needs to be VALIDATION_DAY_NOTSTANDARD(dwc:day).
Test #125 needs to be VALIDATION_DAY_OUTOFRANGE(dwc:day, dwc:month: dwc:year). Another name for #125 is validation_day_existsformonthyear (not suggesting we adopt that, but it is less confusing).

@ArthurChapman
Copy link
Collaborator Author

In the Expected Response. we have replaced "cast" with either "ambiguously interpreted" or "interpreted" depending on contect

@chicoreus
Copy link
Collaborator

chicoreus commented Aug 16, 2019

I propose (1) Removing year and month from the information elements. (2) changing the specification from:

INTERNAL_PREREQUISITES_NOT_MET if a) dwc:day is EMPTY, or b) dwc:day can not be unambiguously interpreted as an integer, or c) dwc:day can be interpreted as an integer between 29 and 31 inclusive and dwc:month can not be interpreted as an integer between 1 and 12, or d) dwc:month can be interpreted as the integer 2 and dwc:month can be interpreted as the integer 29 and dwc:year can not be unambiguously interpreted as a valid ISO 8601 year; COMPLIANT e) if the value of the field dwc:day can be interpreted as an integer between 1 and 28 inclusive, or f) dwc:day can be interpreted as an integer between 29 and 30 and dwc:month can be interpreted as one of (4,6,9,11), or g) dwc:day can be interpreted as an integer between 29 and 31 and dwc:month can be interpreted as one of (1,3,5,7,8,10,12), or h) dwc:day can be interpreted as the integer 29 and dwc:month can be interpreted as the integer 2 and dwc:year is a valid leap year (evenly divisible by 400 or (evenly divisible by 4 but not evenly divisible by 100)); otherwise NOT_COMPLIANT

to:

INTERNAL_PREREQUISITES_NOT_MET if dwc:day is EMPTY; COMPLIANT if the value of the field dwc:day is an integer in the between 1 and 31 inclusive; otherwise NOT_COMPLIANT.

@chicoreus
Copy link
Collaborator

Contrast proposed specification with #125 VALIDATION_DAY_OUTOFRANGE (which checks if the day exists given the month and the year), and compare with the parallel specification in the amendment #127 AMENDMENT_DAY_STANDARDIZED.

@Tasilee
Copy link
Collaborator

Tasilee commented Aug 18, 2019

@chicoreus: Your logic makes sense. #125 is doing the complex range checking, while this one is detecting a simpler scenario. I'll edit the Expected Response accordingly as suggested.

@chicoreus
Copy link
Collaborator

Removing year and month from information elements. Minor edit of specification (s/II/I/ s/in the between/between) from:

IINTERNAL_PREREQUISITES_NOT_MET if dwc:day is EMPTY; COMPLIANT if the value of the field dwc:day is an integer in the between 1 and 31 inclusive; otherwise NOT_COMPLIANT.

to:

INTERNAL_PREREQUISITES_NOT_MET if dwc:day is EMPTY; COMPLIANT if the value of the field dwc:day is an integer between 1 and 31 inclusive; otherwise NOT_COMPLIANT.

chicoreus added a commit to FilteredPush/event_date_qc that referenced this issue Aug 19, 2019
…c:day based on latest changes in issues. DESCRIPTION: Implementation of validation day notstandard to match amendment_day_standardized testing only dwc:day and checking for integer in range 1-31, and validation_day_inrange to use day month year and return internal prerequisites not met on month or year only when those values are required to evaluate a day value (29-31).
@tucotuco
Copy link
Member

@timrobertson100 Noted that this test still showed a dependency on year and month, which we explicitly removed after long deliberations. I have updated the Notes from

"This test must take into account the given month and year, if present, to account for leap years. This is part of a group of similar tests (#125, #130, #131)."

to

"This is part of a group of similar tests (#125, #130, #131)."

to reflect this.

chicoreus added a commit to FilteredPush/event_date_qc that referenced this issue Mar 11, 2022
…sts to current specifications. DESCRIPTION: Updating implementation of VALIDATION_STARTDAYOFYEAR_OUTOFRANGE to updated (2022-03-10) specification, adding and fixing unit tests to confirm with current specification. Removing instead of deprecating and using as wrapper previous startDayOfYearInRangeForYear() method, as it takes year instead of eventDate and can't conform to the current standard. Updating implementation and fixing unit tests for VALIDATION_DAY_NOTSTANDARD to conform with current specification.
@Tasilee Tasilee changed the title TG2-VALIDATION_DAY_NOTSTANDARD TG2-VALIDATION_DAY_STANDARD Mar 22, 2022
@ArthurChapman
Copy link
Collaborator Author

Changed Note from:

"This is part of a group of similar tests (#125 #130, #131)."
to

"This is part of a group of similar tests (VALIDATION_DAY_INRANGE (8d787cb5-73e2-4c39-9cd1-67c7361dc02e), VALIDATION_STARTDAYOFYEAR_INRANGE )85803c7e-2a5a-42e1-b8d3-299a44cafc46, VALIDATION_ENDDAYOFYEAR_INRANGE (9a39d88c-7eee-46df-b32a-c109f9f81fb8))."

To removed internal GitHub links

@chicoreus chicoreus added the CODED label Jul 5, 2023
@Tasilee
Copy link
Collaborator

Tasilee commented Sep 18, 2023

Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted".

Also changed "Field" to "TestField", "Output Type" to "TestType" and updated "Specification Last Updated"

@chicoreus chicoreus added the CORE TG2 CORE tests label Sep 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CODED Conformance CORE TG2 CORE tests Test Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT TG2 TIME Validation
Projects
None yet
Development

No branches or pull requests

4 participants