-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TG2-VALIDATION_DAY_STANDARD #147
Comments
We have changed the description of this from "is not an unambiguous value" to "is not an integer between 1 and 31 inclusive. |
Isn't this a duplicate of #125 ? |
Added guid. |
…DESCRIPTION: VALIDATION_DAY_NOTSTANDARD now returns NOT_COMPLIANT for any dwc:day value except an integer in the range 1-31, prerequisites not met no longer thrown internally to this implementation as that is only returned if term is absent.
Field not present as a prerequisite is a significant barrier for implemenations. An implementation of an atomic test as a method which takes values as parameters e.g. validateDayNotStandard(day), is very likely to be isolated from a larger framework which understands whether or not a term is present in an input data set or not. Bringing "is present" into the test definition conflates areas of concern. This gets back to the need for a definition of isEmpty - if we define isEmpty to include null values, empty strings, and cases where the term isn't present in the data set then we can consistently define prerequisites as being not empty. I thus propose that we change the element of the definition "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:day is not present; " to "INTERNAL_PREREQUISITES_NOT_MET if the field dwc:day is EMPTY;", reference a standard definition of EMPTY, and propagate this change to all other test definitions. |
@chicoreus I see what you mean about being able to abstract the test in code be independent of the record structure that it might act on. Defining EMPTY to include when a field isn't present seems an elegant simplification that I don't anticipate would cause any issues. I mean, once it comes to the test implementation, it doesn't much matter of there was a source value that was null or if there was no source field. Making the change would likely simplify all the descriptions that reference 'not present'. Downside: it means a review and rewrite of a bunch of tests - it looks like there might be 67 tests affected. I would not wish that on Lee or Arthur. |
If setting up a definition of EMPTY will consolidate intent, so be it. I am happy to work through the tests and change the Expected response wording. But how best to do it? My initial reaction is to add a new Issue for EMPTY that can therefore be referenced, otherwise, in relation to this work, it will be hanging? What would the definition be? EMPTY: |
Hmm. Not sure a label is necessary since we can find them in a string
search using 'EMPTY'.
As for a definition, how about:
EMPTY: A field that is needed as input is not present, or, the input field
is present but there is no value in the field, or the field is present and
the value of the field consists entirely of non-printing characters.
…On Wed, Mar 27, 2019 at 4:51 PM Lee Belbin ***@***.***> wrote:
If setting up a definition of EMPTY will consolidate intent, so be it. I
am happy to work through the tests and change the Expected response wording.
But how best to do it? My initial reaction is to add a new Issue for EMPTY
that can therefore be referenced, otherwise, in relation to this work, it
will be hanging?
What would the definition be?
EMPTY:
Field not present
Field contains a string of one or more null characters
Field contains a string of one or more blank characters
...
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#147 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAcP6w4RqyQm7kSnAAbfFyJg23Q-eX39ks5va8vGgaJpZM4WMu13>
.
|
Thanks @tucotuco. I like the definition but where do we put it so that it is at least as referenceable as the tests and comments? |
It goes into the Vocabulary #152 |
Thanks @ArthurChapman. Duh, I forgot about that :|. Will do it now. |
But note we already have a definition given by @chicoreus. Perhaps we need to discuss more - see previous discussion in #111 |
I like @tucotuco 's :) as I can easily understand it, and added it to #111 and the vocab #152. Lets see what @chicoreus says. I'm sure he will find an issue? Ha. |
Problem is, @tucotuco doesn't cover everything - like 9999, etc. and some other things that were in @chicoreus version. That is why we need to discuss more fully. @tucotuco and I will look at it more tomorrow. |
I think a general definition of EMPTY on the line of @tucotuco's would work, if accompanied by notes about implementations in different languages (empty string, null (if the language supports it e.g in java, javascript, sql), undefined (e.g. in javascript), a char array of size 0 (e.g. in C)). I'm not quite sure, however, about non-printing characters (if using either an ascii or unicode definition), as those are characters and constitute data, and could represent mistyped values (e.g. something that's supposed to be a small integer that is being represented as a low ascii character). I'd be inclined to treat a non-printing character as non-empty and then let tests downstream fail. Based on the data I've seen in the wild (and @tucotuco has seen plenty more), I'd propose including several strings that are typical string serializations of a database null ("\N", "null", "NULL"). I also put on the table, but am not particularly happy with other common serializations of empty value markers ("**", "9999", etc.). The trouble with these is that some of them might be valid data values for some fields, so they don't seem to fit a generalized definition of empty. |
I agree, non-printing characters are not best included for prerequisites
not being met. It would not allow us to report that the content (and there
is content) is not compliant, and that is what we really want the recipient
to know. I think the same applies for all the string serializations of a
database null, now that I think more about it. That is something being
shared that is not compliant also.
…On Thu, Mar 28, 2019 at 12:56 AM Paul J. Morris ***@***.***> wrote:
I think a general definition of EMPTY on the line of @tucotuco
<https://github.com/tucotuco>'s would work, if accompanied by notes about
implementations in different languages (empty string, null (if the language
supports it e.g in java, javascript, sql), undefined (e.g. in javascript),
a char array of size 0 (e.g. in C)). I'm not quite sure, however, about
non-printing characters (if using either an ascii or unicode definition),
as those are characters and constitute data, and could represent mistyped
values (e.g. something that's supposed to be a small integer that is being
represented as a low ascii character). I'd be inclined to treat a
non-printing character as non-empty and then let tests downstream fail.
Based on the data I've seen in the wild (and @tucotuco
<https://github.com/tucotuco> has seen plenty more), I'd propose
including several strings that are typical string serializations of a
database null ("\N", "null", "NULL"). I also put on the table, but am not
particularly happy with other common serializations of empty value markers
("**", "9999", etc.). The trouble with these is that some of them might be
valid data values for some fields, so they don't seem to fit a generalized
definition of empty.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#147 (comment)>, or mute
the thread
<https://github.com/notifications/unsubscribe-auth/AAcP65YJtOEQLgYsSQlH83w6ncge3yeyks5vbD1tgaJpZM4WMu13>
.
|
Following the discussions arising from the event date case study for the BISS paper, I believe that this test should be updated to use the expected response and notes of TG2-VALIDATION_DAY_OUTOFRANGE (#125), which should then be deprecated. Everything else about this test is already up to date with the proposed change. |
@chicoreus: I'm presuming you are suggesting the removal of all phrases like "dwc:xxx is not present"? It does in retrospect seem odd that we have defined EMPTY but not PRESENT. |
@Tasilee Yes, i would very strongly advocate the removal of "is not present" everywhere it occurs. I would further advocate that we ensure that the definition of isEmpty()/EMPTY allows test implementations to treat the absence of a term in a single record as an empty value without making a distinction between the term being present with no value and the term being absent. This allows tests to be implemented as methods with signatures that take a list of terms (e.g. validationDayNotStandard(day, month, year), rather than as methods that take Darwin Core records and internally extract the terms from them (which is what we would be forced into by assessing the presence/absence of terms at the level of each test). Presence/absence of terms related to the core in a MultiRecord feels like a distinct test related to only a particular subset of data preparation pipelines, and not being a question about the utility of the data for consumers (e.g. all research consumers would care about is the presence of values for dwc:eventDate, while year, month, and day are primarily concerns of upstream providers and aggregators attempting to map disparate schemas onto Darwin Core and then to aggregate disparate mappings onto a common provision of dwc:eventDate). I will make two claims: (1) Including a test of terms being present as part of the assessment of emptyness of terms at the level of individual tests puts severe engineering constraints on the implementations of those tests (e.g. @tucotuco couldn't have assessed the difference between term present/absent and term value = null in his sql implementation of the subset of TIME tests). And (2) assessing whether terms are present or absent in a particular SingleRecord or MultiRecord is out of scope for the CORE tests as it relates to a subset of use cases around mapping the data that are out of scope of the set of uses for the data that we examined for CORE. (1) Says we can define a separate test for presence of terms in a data serialization. (2), if accepted, says we can ignore this issue and just drop is present, (2) if rejected says we should define a separate test for the presence/absence of the information elements/darwin core terms we identified as the set relevant to the core tests. |
I think it simplifies everything to include in our EMPTY definition the case where a term is not present and remove what would then be superfluous wording. |
Thanks @chicoreus. I always need to read your comments 4 times to get my head around the concepts. Yes, you all know I am slow by now. Anyway, I think I agree with the conclusions and with @tucotuco. I will need to await clarification of the EXTERNAL_PREREQUISITES question before editing. I don't want to do two passes! Again. I'll leave the definitions to @ArthurChapman and @tucotuco |
OK - I will attempt to change the definition of Empty in #152 (and in the BISS Paper) |
I have changed the definition of EMPTY to: |
Do we need a note about @chicoreus comment on difficulty of implementing a concept of fields that are 'not present' |
That is probably very worth putting into an informative document as part of the standard. |
This issue is still conflating standardization (unambiguously interpreting text strings as integers) with a test of range (does the day exist for the given month and year). We should have a test validation_day_outofrange(day, month, year) testing the latter (day exists for month and year), and this test #147 validation_day_notstandard(day) paired with #127 amendment_day_standardized(day) handling cases where values like "1st" can be unambigously interpreted as 1 (and for such, we should use the language unambigously interpreted instead of cast. Cast has a specific meaning in typed programing languages. @ArthurChapman somewhere mentioned cast of the string "1.240" representing a decimal number to the integer 1, which is likely to fail when presented with a european format "1,1240" string, and in many implementations of cast of the string to an integer will fail as the string doesn't contain an integer "1" might work, "1.0" is unlikely to. Actual implementations are likely to use parsing of numbers from strings rather than cast, and the expected case for any Darwin Core input values is as data typed as strings rather than as numeric types (given the absence of numeric types in the formal definitions, and the widespread presence of untyped values in csv files in Darwin Core Archives...).. |
@chicoreus Check out #125 which was deprecated by @tucotuco . Are you suggesting it be resurected? Check out discussions there |
We need both this and #125. The validation #125 is needed to test whether a value for day exists for the given month and year. It must treat non-integer values as internal prerequisites not met. This test validation #147 (which should take only dwc:day as a parameter and test whether the value of day is an integer) forms a couplet with the amendment #127 (which can propose the replacement of a non standard non-integer value with an integer (e.g. 2nd to 2)). This pairing of a validation and an amendment where the validation tests for exactly what the amendment can correct is critical to measuring improvement of quality. The test (#125) for the (not able to be corrected from these three fields alone, and thus not paired with an amendment) existence of a day value given the month and year is a distinct test, and we need both. |
In the Expected Response. we have replaced "cast" with either "ambiguously interpreted" or "interpreted" depending on contect |
I propose (1) Removing year and month from the information elements. (2) changing the specification from:
to:
|
@chicoreus: Your logic makes sense. #125 is doing the complex range checking, while this one is detecting a simpler scenario. I'll edit the Expected Response accordingly as suggested. |
Removing year and month from information elements. Minor edit of specification (s/II/I/ s/in the between/between) from:
to:
|
…c:day based on latest changes in issues. DESCRIPTION: Implementation of validation day notstandard to match amendment_day_standardized testing only dwc:day and checking for integer in range 1-31, and validation_day_inrange to use day month year and return internal prerequisites not met on month or year only when those values are required to evaluate a day value (29-31).
@timrobertson100 Noted that this test still showed a dependency on year and month, which we explicitly removed after long deliberations. I have updated the Notes from "This test must take into account the given month and year, if present, to account for leap years. This is part of a group of similar tests (#125, #130, #131)." to "This is part of a group of similar tests (#125, #130, #131)." to reflect this. |
…sts to current specifications. DESCRIPTION: Updating implementation of VALIDATION_STARTDAYOFYEAR_OUTOFRANGE to updated (2022-03-10) specification, adding and fixing unit tests to confirm with current specification. Removing instead of deprecating and using as wrapper previous startDayOfYearInRangeForYear() method, as it takes year instead of eventDate and can't conform to the current standard. Updating implementation and fixing unit tests for VALIDATION_DAY_NOTSTANDARD to conform with current specification.
Changed Note from: "This is part of a group of similar tests (#125 #130, #131)." "This is part of a group of similar tests (VALIDATION_DAY_INRANGE (8d787cb5-73e2-4c39-9cd1-67c7361dc02e), VALIDATION_STARTDAYOFYEAR_INRANGE )85803c7e-2a5a-42e1-b8d3-299a44cafc46, VALIDATION_ENDDAYOFYEAR_INRANGE (9a39d88c-7eee-46df-b32a-c109f9f81fb8))." To removed internal GitHub links |
Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted". Also changed "Field" to "TestField", "Output Type" to "TestType" and updated "Specification Last Updated" |
The text was updated successfully, but these errors were encountered: