TG2-VALIDATION_EVENTDATE_INRANGE #36

Open
iDigBioBot opened this issue Jan 5, 2018 · 69 comments
Labels: CODED; Conformance; CORE (TG2 CORE tests); Parameterized (Test requires a parameter); Test (Tests created by TG2, either CORE, Supplementary or DO NOT IMPLEMENT); TG2; TIME; Validation

Comments

@iDigBioBot
Collaborator

iDigBioBot commented Jan 5, 2018

| TestField | Value |
|---|---|
| GUID | 3cff4dc4-72e9-4abe-9bf3-8a30f1618432 |
| Label | VALIDATION_EVENTDATE_INRANGE |
| Description | Is the value of dwc:eventDate entirely within the Parameter Range? |
| TestType | Validation |
| Darwin Core Class | dwc:Event |
| Information Elements ActedUpon | dwc:eventDate |
| Information Elements Consulted | |
| Expected Response | INTERNAL_PREREQUISITES_NOT_MET if dwc:eventDate is bdq:Empty or if the value of dwc:eventDate is not a valid ISO 8601 date; COMPLIANT if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive, otherwise NOT_COMPLIANT |
| Data Quality Dimension | Conformance |
| Term-Actions | EVENTDATE_INRANGE |
| Parameter(s) | bdq:earliestValidDate, bdq:latestValidDate |
| Source Authority | bdq:earliestValidDate default = "1582-11-15"; bdq:latestValidDate default = "{current year}" |
| Specification Last Updated | 2024-09-16 |
| Examples | [dwc:eventDate="1962-11-01T10:00-0600": Response.status=RUN_HAS_RESULT, Response.result=COMPLIANT, Response.comment="dwc:eventDate is IN_RANGE"]; [dwc:eventDate="2300-11-01T10:00": Response.status=RUN_HAS_RESULT, Response.result=NOT_COMPLIANT, Response.comment="dwc:eventDate is NOT_IN_RANGE"] |
| Source | VertNet |
| References | |
| Example Implementations (Mechanisms) | Kurator:event_date_qc |
| Link to Specification Source Code | FilteredPush event_date_qc DwCEventDQ.validationEventdateInrange() |
| Notes | This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this. Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold. |
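
The Expected Response can be read as a small decision procedure. What follows is a minimal sketch of it, not the FilteredPush or Kurator implementation. The function names are hypothetical, the parser accepts only `YYYY`, `YYYY-MM`, `YYYY-MM-DD` and `/`-separated pairs of those (a small subset of ISO 8601), any time part after `T` is dropped without being checked, and "{current year}" is assumed here to mean the last day of the current year. Only the result string is returned, without the Response.status or Response.comment.

```python
import calendar
import re
from datetime import date

# Simplified date pattern: YYYY, YYYY-MM or YYYY-MM-DD (not the full ISO 8601 grammar).
_DATE = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def _bounds(text):
    """Return (first_day, last_day) covered by one date; a time part after 'T' is ignored."""
    match = _DATE.fullmatch(text.split("T")[0])
    if match is None:
        raise ValueError(f"not a supported ISO 8601 date: {text!r}")
    year, month, day = (int(g) if g else None for g in match.groups())
    if month is None:
        return date(year, 1, 1), date(year, 12, 31)
    if day is None:
        return date(year, month, 1), date(year, month, calendar.monthrange(year, month)[1])
    single = date(year, month, day)  # raises ValueError for impossible dates such as February 30
    return single, single


def event_date_interval(event_date):
    """Expand a date or a start/end pair into the (start, end) days it covers."""
    parts = event_date.split("/")
    if len(parts) > 2:
        raise ValueError("too many '/' separators")
    start, _ = _bounds(parts[0])
    _, end = _bounds(parts[-1])
    if start > end:
        raise ValueError("interval ends before it starts")
    return start, end


def validate_eventdate_inrange(event_date, earliest=date(1582, 11, 15), latest=None):
    """Return the result string described in the Expected Response."""
    if latest is None:
        # Assumption: "{current year}" is taken as the end of the current year.
        latest = date(date.today().year, 12, 31)
    if event_date is None or not event_date.strip():
        return "INTERNAL_PREREQUISITES_NOT_MET"
    try:
        start, end = event_date_interval(event_date.strip())
    except ValueError:
        return "INTERNAL_PREREQUISITES_NOT_MET"
    if earliest <= start and end <= latest:
        return "COMPLIANT"
    return "NOT_COMPLIANT"


print(validate_eventdate_inrange("1962-11-01T10:00-0600"))
print(validate_eventdate_inrange("2300-11-01T10:00"))
```

Comparing whole days keeps the sketch short; an implementation that honours times and time zones would need to compare instants instead.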
@iDigBioBot
Collaborator Author

Comment by Lee Belbin (@Tasilee) migrated from spreadsheet:
Was thinking of adding a lower bound to make it a more comprehensive test, but could we have fossil eventDate?

@tucotuco changed the title from TG2-VALIDATION_EVENTDATE_OUTOFRANGE to TG2-VALIDATION_EVENTDATE_INFUTURE on Jan 19, 2018
@tucotuco changed the title from TG2-VALIDATION_EVENTDATE_INFUTURE to TG2-VALIDATION_EVENTDATE_OUTOFRANGE on Jan 19, 2018
@ArthurChapman added the Test label and removed the VOCABULARY label on Jan 19, 2018
@chicoreus
Collaborator

Needs clarification for eventDate values which are ranges and which span the oldest/youngest boundaries. For example, 1700-01-01/2100-01-10 is an entirely valid eventDate value with a range which includes all likely specimen collecting dates extant, or for some time into the future. Under the current definition, this value (which is in essence a placeholder for "we don't know what the date was") fails the test. Similarly 1650-01/1850-02 would be expected to fail, simply because it places a lower bound to the uncertainty earlier than the default 1700. Framing the test to mark as problems any range which extends outside the 1700-present range will potentially encourage people to frame uncertainty about dates too narrowly, instead of setting reasonable uncertainty values for their situation. I'd prefer to just flag eventDate values which fall entirely outside the specified range. Another potential failure case produced by treating ranges that span the boundaries as problems is an eventDate whose value is the current date, without a time. This is a time interval that extends into the future, and a reasonable implementation of the test as stated would mark any record with an eventDate consisting of the current date without a time as an error, which is not desirable when the quality control processes are placed upstream, close to initial data capture.

@ArthurChapman
Collaborator

@chicoreus I don't see a problem here - we are not saying it is wrong - just a warning that it is out of range. What is done with that is up to the user, but it flags a possible problem. With annotations - a followup annotation may be that this is OK, because ...

@chicoreus
Collaborator

The problem is again one of different interpretations of how to represent uncertainty in eventDate values. A European institution with old collections might very reasonably decide to set 1400-01-01 and 2100-01-10 as end boundaries for any events where the collecting date is not known (the 2100 date making these records very easy to find and distinguish from ones which have had the date narrowed based on some additional interpretation), and would then have all of these flagged as problems, binned in with real problem records such as the typical typo 190-10-01. It is very rational from a database perspective to set an end date at some distant future point for all records with uncertainty (this makes them easy to find and collect). I'm not at all in favor of a position that declares that ranges that fall outside the likely bounds are problems. I'd much rather see a narrower test for intervals that entirely fall outside the range of plausible collecting event dates; that should get a much smaller set of false positives and more effectively identify problematic data that needs to be fixed.

The "today's date will fail" issue (today's date to a resolution of one day in an ISO date is a temporal interval that extends into the future, so it fails unless special case handling is added for today's date) also makes this test highly problematic for upstream uses near the point of observation.
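
To make the today's-date concern concrete, here is a small sketch under the assumption that a date given to one-day resolution is treated as the interval from midnight to the following midnight, and that the upper bound is the instant at which the test runs. The function names and the run time are hypothetical.

```python
from datetime import date, datetime, time, timedelta


def day_interval(day):
    """Interval covered by a date with no time part: [00:00, next day 00:00)."""
    start = datetime.combine(day, time.min)
    return start, start + timedelta(days=1)


def extends_into_future(day, now):
    """True if any part of the day's interval lies after 'now'."""
    _, end = day_interval(day)
    return end > now


now = datetime(2018, 2, 5, 14, 30)  # hypothetical moment at which the test runs
print(extends_into_future(date(2018, 2, 5), now))  # today: True
print(extends_into_future(date(2018, 2, 4), now))  # yesterday: False
```

Under a strict "entirely within the range" reading with "now" as the upper bound, every record dated today without a time would therefore be flagged until midnight passes.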

@ArthurChapman
Collaborator

ArthurChapman commented Feb 5, 2018

I can understand that at the dataset level, but would expect it to be very rare at the record level. The earliest date can be a designated date for the run as well if you need to set an earlier date for some reason - or particular dataset. I don't see it as a big issue.

@Tasilee
Collaborator

Tasilee commented Feb 6, 2018

I'm a simple soul. I side with @ArthurChapman. We have to be careful that we don't erect obstacles that everyone is then forced to climb over. KISS. Others?

@chicoreus
Collaborator

Another way of putting the problem I am seeing: treating any range that extends beyond 1700-today as an error conflates two classes of problems: (1) errors in accuracy (e.g. 198-10-15), and (2) broad statements about uncertainty (1500/2100). Broad statements about uncertainty are already captured separately with a measure of event duration. I will argue that it is important to be able to identify the first class of error in isolation, by implementing this test (in the easier way) by flagging records whose range falls entirely outside the range 1700-present. The current statement of the test is more complex, as it raises the specter of special case handling of records with today's date. I also like KISS, and argue that the current description isn't the simple one.

About 10% of the MCZ data has an unknown event date, recorded in the database (which enforces a start and end date as Oracle date fields) as 1700-01-01/2100-01-01. From a database perspective, this is a very useful pair: it is very easy to extract those 183136 records on the basis of those values, whereas narrowing by any inference makes these harder to locate as a single sort of data quality issue.
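
As a rough illustration of why a fixed placeholder pair is convenient, the sketch below separates records carrying it from all others with a single equality check. The record layout (dicts with `id` and `eventDate` keys) and the function name are assumptions for the example, not the MCZ schema.

```python
PLACEHOLDER = "1700-01-01/2100-01-01"  # the "date unknown" pair described above


def partition_by_placeholder(records, placeholder=PLACEHOLDER):
    """Split records into (unknown_date, everything_else) by exact match on eventDate."""
    unknown = [r for r in records if r.get("eventDate") == placeholder]
    others = [r for r in records if r.get("eventDate") != placeholder]
    return unknown, others


records = [
    {"id": 1, "eventDate": "1700-01-01/2100-01-01"},
    {"id": 2, "eventDate": "1962-11-01"},
    {"id": 3, "eventDate": "1850/1900"},  # narrowed by inference, no longer matches
]
unknown, others = partition_by_placeholder(records)
print([r["id"] for r in unknown])
print([r["id"] for r in others])
```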

@Tasilee
Collaborator

Tasilee commented Feb 7, 2018

OK, I'll buy it (range outside 1700-present) @chicoreus , but I would like to hear from the rest of the team.

@ArthurChapman
Collaborator

How many institutions do this other than MCZ? It does seem to be a problem. Under your reasoning @chicoreus - we can't only do "not in future". It would appear to me that the field is being used in ways it was never meant to be used, but I can't see any simple way around it other than to remove this test altogether.

@Tasilee
Collaborator

Tasilee commented Feb 11, 2018

Re-examining this validation, I cannot see a problem with flagging a suspicious date (or date range) that is before 1700 or after the day the test is run. A "NOT COMPLIANT" would seem useful information to follow up on. A false positive flag seems to me preferable to a false negative where one end of a range is totally outside 1700-today.

Considering #66, I'd be inclined to include invalid dates (e.g., Feb 30) under this test as they are not in the possible range of dates, and they may well be formatted to ISO standard. This would make this validation dependent on #61.

@chicoreus
Collaborator

I'll suggest that we split this test into two separate tests, one of which tests whether or not the event date extends outside the boundaries 1700-present, and the other to test whether or not the event date falls entirely outside the boundaries 1700-present. The first test (crosses out of bounds) may represent problematic data or it may represent a large uncertainty. The second test (falls entirely out of bounds) likely flags data that contains errors (e.g. typos that leave a digit out of the year 190-05-18), but can potentially also flag rare but valid older material, and certain representations of zooarcheological material. This fits a principle of keeping tests simple and focused on particular potential problems.
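
The two proposed tests differ only in the comparison they make against the bounds. A minimal sketch, assuming the event date has already been expanded to a start and end day; the function names and the run date standing in for "present" are hypothetical.

```python
from datetime import date


def extends_outside(start, end, earliest, latest):
    """First test: some part of the interval lies outside the bounds."""
    return start < earliest or end > latest


def entirely_outside(start, end, earliest, latest):
    """Second test: no part of the interval lies inside the bounds."""
    return end < earliest or start > latest


# "1700-present", with a hypothetical run date standing in for "present".
earliest, latest = date(1700, 1, 1), date(2018, 2, 12)

placeholder = (date(1700, 1, 1), date(2100, 1, 10))  # broad statement of uncertainty
typo = (date(190, 5, 18), date(190, 5, 18))          # year with a dropped digit

print(extends_outside(*placeholder, earliest, latest), entirely_outside(*placeholder, earliest, latest))
print(extends_outside(*typo, earliest, latest), entirely_outside(*typo, earliest, latest))
```

The placeholder range is flagged only by the first test, while the dropped-digit year is flagged by both, which is the separation of the two classes of problem argued for above.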

chicoreus added a commit to FilteredPush/event_date_qc that referenced this issue Mar 1, 2018
…terpreations of event date in range. DESCRIPTION: Marking one implementation as 3cff4dc4-72e9-4abe-9bf3-8a30f1618432 in what seems to be the desired consensus. Marking the other implementation with a new uuid as a potential alternative (or additional test).
@tucotuco added the Parameterized label (Test requires a parameter) on Aug 26, 2018
@chicoreus
Collaborator

chicoreus commented Jun 12, 2023

From TG2 call per @tucotuco If you care about dates affected by unknown calendar use the start date 1918... Add note: (here and in #84), if your use requires knowledge of date to a precision of finer than one year and ten days, use 1918-02-14 as the earliestValidDate (as the calendar isn't certain).

@chicoreus
Collaborator

Slightly edited notes from an email, with added notes in italics from TG2 call:

I'll suggest we switch to 1582-11-15. General agreement on this in TG2 call. That date is supportable on the basis of ISO 8601-1 asserting that the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data.

Since Darwin Core is mute on whether proleptic Gregorian dates are allowed, no prior agreement exists, and we can argue that dates prior to this are automatically suspect.

In practice, dates prior to 1752 in the British empire, 1700 in various European protestant countries, 1918 in Russian territories (1918-02-14), are suspect, as those are the years of adoption of the Gregorian calendar in those areas, and a reported date may not have the metadata needed to determine if it was a Julian date as originally asserted, or has been converted to a Gregorian date. So any analysis that depends on date precision of less than 10 days can't simply use any date prior to 1918 without thinking harder about the sources of the date data...
_Proposal above from @tucotuco to specify in this test (and in #84) the value for bdq:earliestValidDate=1918-02-14 provides education that this may be a concern, and provides a means for users where this may be a concern to identify records where it may be a concern._

The difference between the Gregorian and Julian calendars has typically been around 10 days, but see the comparison on https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar where there is no difference in most of years 100 to 200... Also year 0 may or may not exist...
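
For a sense of the size of the day offset at the adoption dates mentioned in this thread, here is a sketch that converts a Julian calendar date to its proleptic Gregorian equivalent by way of the Julian Day Number. The function name is hypothetical and the arithmetic is the widely used integer formula, not code from any of the implementations linked in this issue. It cannot tell whether a recorded date was Julian in the first place, and it does not address differing start-of-year conventions.

```python
from datetime import date


def julian_to_gregorian(year, month, day):
    """Convert a Julian calendar date (CE years) to a proleptic Gregorian datetime.date."""
    # Julian calendar date -> Julian Day Number (integer arithmetic).
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python's ordinal 1 is 0001-01-01 (proleptic Gregorian), which is JDN 1721426.
    return date.fromordinal(jdn - 1721425)


# First Julian dates replaced at three adoption points: 1582 reform, British Empire, Russia.
for julian in [(1582, 10, 5), (1752, 9, 3), (1918, 2, 1)]:
    gregorian = julian_to_gregorian(*julian)
    offset = (gregorian - date(*julian)).days
    print(julian, "->", gregorian.isoformat(), f"({offset} days)")
```

The offset grows from 10 days in 1582 to 13 days in 1918, so a fixed 10-day correction would itself be wrong for later material.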

But, it gets worse... there is the issue of what the start day of the year was, e.g. with the British civil year starting on March 25 instead of January 1. So dates from the British empire, or from British collectors from prior to 1752 may be off by 10 days and off by one year, depending.

Looks like a good explication on https://www.cree.name/genuki/dates.htm

Wikipedia cites this for the text "The best practice for citation of historically contemporary documents is to cite the date as expressed in the original text and to notate any contextual implications and conclusions regarding the calendar used and equivalents in other calendars. This practice permits others to re-evaluate the original evidence"

We expect dwc:eventDate to contain a Gregorian date. dwc:verbatimEventDate allows for capture of a date as found in the original text, and eventRemarks does allow for the capture of metadata about the translation of a local Julian date into a Gregorian date. So the capability exists within Darwin Core to document transformations between calendars and the related evidence for so doing.

We'll likely also need to consider this in #86, at least by including metadata that the assumed calendar for verbatim date is Gregorian.

@Tasilee
Collaborator

Tasilee commented Jun 12, 2023

Changed Parameter(s) to "bdq:earliestValidDate, bdq:latestValidDate".

I'll leave the outcomes of the 1500 and calendar discussions to @chicoreus to decide and implement. My conclusion to @tucotuco (loose and strict implementation) is to document (Notes) our Parameter(s) accordingly?

@ArthurChapman
Collaborator

I have updated the default earliest date to 1582-11-15 and added to the Notes: "That date is supportable on the basis of ISO 8601-1 asserting that 'the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data'."

@ArthurChapman
Collaborator

Suggest we add to the notes (also in #84, #76):

If setting a Parameter for this test be aware that prior to 1918, there may be issues associated with the use of the Julian calendar versus the Gregorian calendar in some countries. Difference between the Gregorian and Julian calendar has typically been around 10 days, but see the comparison on https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar where "there is no difference in most of years 100 to 200... Also year 0 may or may not exist...". See also, the explanation on https://www.cree.name/genuki/dates.htm

@Tasilee
Collaborator

Tasilee commented Jun 13, 2023

Restructured Parameter(s) and Source authority.

@ArthurChapman
Collaborator

ArthurChapman commented Jun 13, 2023

Thumbs up if you agree to this change

Change Notes to

The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1582-11-15 by convention. That date is supportable on the basis of ISO 8601-1 asserting that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data".

If setting a Parameter for this test, be aware that prior to 1918 there may be issues associated with the use of the Gregorian calendar versus the Julian calendar in some countries. The difference between the Gregorian and Julian calendars has typically been around 10 days (but can be as great as 1 year and 10 days); see the comparison on https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar where "there is no difference in most of years 100 to 200... Also year 0 may or may not exist...". If your use requires knowledge of date to a precision of finer than one year and ten days, and you are not certain of the use of the Gregorian calendar, use 1919-01-01 as the earliestValidDate.

@chicoreus
Collaborator

@ArthurChapman change: "That date is supportable on the basis of ISO 8601-1 asserting" to "That date was chosen because ISO 8601-1 asserts", and then add, ", and Darwin Core does not specify such." to the end of the sentence. The second paragraph needs some work too.

Suggest changing notes to:

The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not specify such.

If setting a Parameter for this test, be aware that prior to about 1918 different countries (and researchers from those countries) switched from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But, the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. If your use requires knowledge of date to a precision of finer than one year and ten days, and you are not certain of the use of the Gregorian calendar, use 1923-03-01 (when Greece adopted the Gregorian Calendar) as the earliestValidDate.

@chicoreus
Collaborator

We probably also need to add text to the notes on the order of "If temporal resolution of one year or better is important, be aware that different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to identify such records."

@tucotuco
Member

I reiterate that I would not enumerate some transition dates while leaving out others. I would definitely not portray 1923 as if it was the latest transition. Transitions are still ongoing, and some may never happen. It would be discriminatory if any transition comes after any we chose and we can't have that. Better to cite Wikipedia and not give a cut-off date.

@chicoreus
Collaborator

@tucotuco I agree, different uses are likely to have different needs. I would advocate listing a few dates, as examples, to remind people that this may be an important concern for dates present in historical biodiversity collections data, and that the absence of clear metadata about interpretations of those dates may make any quality assurance approach using this test as a threshold impractical.

@chicoreus
Collaborator

How about:

The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not specify such.

If temporal resolution of one year or better is important, be aware that different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to identify such records. Different countries (and researchers from those countries) have changed from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But, the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by simply selecting a transition date and using it as a threshold.

@Tasilee
Collaborator

Tasilee commented Jun 15, 2023

That looks useful, with a few minor edits and one query-

This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this.

Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Countries and researchers have changed from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But, the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold.

But that is what we are currently doing aren't we?

@tucotuco
Member

tucotuco commented Jun 16, 2023

We aren't doing anything except providing the test. We aren't using the test for quality assurance. The user has to decide for what purpose it is appropriate to use the test. The bolded text is just guidance about that.

@ArthurChapman
Collaborator

I am happy with the last version. After all, we are just checking if a date is in a range, and the Calendar dates are only an issue if one is setting a different date to the defaults. The majority of the tests will just test for the default, but if someone had a different start date (e.g. 1900) then they just need to be aware of the issues and that is now covered in the notes. They could probably get around any problems in their parameter by setting the date a year earlier (for bdq:earliestValidDate) or a year later (for bdq:latestValidDate).

@chicoreus
Collaborator

Following up on #36 (comment) by @tucotuco: inherent in the framework is that any test may be used either for quality control (finding data, or process improvements, that could be changed to improve the quality of data for some use, in our case CORE) or for quality assurance (filtering data in a MultiRecord to just those data that conform with the needs of some use, in our case CORE). By using the framework, the tests are by design agnostic to their use.

It does still make some sense to provide some non-normative (notes) guidance for research users who might want to parameterize this test for quality assurance (that they will quickly get into the morass that we have in this discussion, and we advise looking at other approaches to meet their needs).

@Tasilee
Collaborator

Tasilee commented Jun 21, 2023

OK, how about this for Notes:

This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this.

Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold.

We place this text into the Standard document:

Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records.

Countries and researchers have changed from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But, the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold.

@chicoreus
Collaborator

We should note, and specify in the validation data, whether or not imprecise event dates that span the boundary should be considered compliant or not, that is, are eventDate = "1582", or eventDate = "1582-11" compliant or not (I suspect they are).

@chicoreus
Collaborator

Missing a word: (I suspect they are not). They are reduced precision dates, so they aren't explicit about range, but they don't sound like they match the clause: "if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive".
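
One way to see why these reduced precision values do not match the clause is to expand each to the days it covers. A sketch under the assumption that a year or year-month value stands for its whole year or month; the function names and the fixed upper bound are hypothetical, and input is assumed to be well-formed.

```python
import calendar
from datetime import date


def expand(value):
    """Expand YYYY, YYYY-MM or YYYY-MM-DD to the (first_day, last_day) it covers."""
    parts = [int(p) for p in value.split("-")]
    if len(parts) == 1:
        return date(parts[0], 1, 1), date(parts[0], 12, 31)
    if len(parts) == 2:
        last = calendar.monthrange(parts[0], parts[1])[1]
        return date(parts[0], parts[1], 1), date(parts[0], parts[1], last)
    return date(*parts), date(*parts)


def entirely_within(value, earliest=date(1582, 11, 15), latest=date(2023, 12, 31)):
    """True if every day covered by the value lies in [earliest, latest]."""
    start, end = expand(value)
    return earliest <= start and end <= latest


for value in ["1582", "1582-11", "1582-11-15", "1582-12"]:
    print(value, expand(value)[0].isoformat(), entirely_within(value))
```

Both "1582" and "1582-11" begin before 1582-11-15, so under this reading neither is entirely within the default range, while "1582-11-15" and "1582-12" are.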

chicoreus added a commit to FilteredPush/event_date_qc that referenced this issue Jun 23, 2023
…q#36 VALIDATION_EVENTDATE_INRANGE to match change of default value for the earlyest date parameter to the start date for the Gregorian calendar 1582-11-15. Updating unit tests.
@tucotuco
Member

tucotuco commented Jun 23, 2023 via email

@Tasilee
Collaborator

Tasilee commented Sep 16, 2023

Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted"

@Tasilee
Collaborator

Tasilee commented Sep 16, 2024

Changed reference in Expected Response from ISO 8601-1 to ISO 8601
