TG2-VALIDATION_EVENTDATE_INRANGE #36
Comments
Comment by Lee Belbin (@Tasilee) migrated from spreadsheet: |
Needs clarification for eventDate values which are ranges and which span the oldest/youngest boundaries. For example, 1700-01-01/2100-01-10 is an entirely valid eventDate value with a range which includes all likely specimen collecting dates extant, or for some time into the future. Under the current definition, this value (which is in essence a placeholder for "we don't know what the date was") fails the test. Similarly, 1650-01/1850-02 would be expected to fail, simply because it places a lower bound to the uncertainty earlier than the default 1700. Framing the test to mark as problems any range which extends outside the 1700-present range will potentially encourage people to frame uncertainty about dates too narrowly, instead of setting reasonable uncertainty values for their situation. I'd prefer to just flag eventDate values which fall entirely outside the specified range. Another potential failure case produced by treating ranges that span the boundaries as problems is an eventDate whose value is the current date, without a time. This is a time interval that extends into the future, and a reasonable implementation of the test as stated would mark any record with an eventDate consisting of the current date without a time as an error. That is not desirable when the quality control processes are placed upstream, close to initial data capture. |
@chicoreus I don't see a problem here - we are not saying it is wrong - just a warning that it is out of range. What is done with that is up to the user, but it flags a possible problem. With annotations - a followup annotation may be that this is OK, because ... |
The problem is again one of different interpretations of how to represent uncertainty in eventDate values. A European institution with old collections might very reasonably decide to set 1400-01-01 and 2100-01-10 as end boundaries for any events where the collecting date is not known (the 2100 date making these records very easy to find and distinguish from ones which have had the date narrowed based on some additional interpretation), and would then have all of these flagged as problems, binned in with real problem records such as the typical typo 190-10-01. It is very rational from a database perspective to set an end date at some distant future point for all records with uncertainty (this makes them easy to find and collect). I'm not at all in favor of a position that declares that ranges that fall outside the likely bounds are problems. I'd much rather see a narrower test for intervals that fall entirely outside the range of plausible collecting event dates. That should produce a much smaller set of false positives and more effectively identify problematic data that needs to be fixed. The issue that today's date will fail (because today's date, to a resolution of one day in an ISO date, is a temporal interval that extends into the future unless special case handling is added for today's date) also makes this test highly problematic for upstream uses near the point of observation. |
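The "today's date will fail" point can be illustrated with a minimal sketch. It assumes a date given to day resolution is treated as the half-open interval covering that whole day; the function names and the fixed `now` value are hypothetical and are not taken from any TG2 implementation.

```python
from datetime import date, datetime, time, timedelta


def day_interval(d: date) -> tuple[datetime, datetime]:
    """Return the half-open interval [start, end) covered by a day-resolution date."""
    start = datetime.combine(d, time.min)
    return start, start + timedelta(days=1)


def extends_into_future(d: date, now: datetime) -> bool:
    """True if any part of the day-long interval lies after `now`."""
    _, end = day_interval(d)
    return end > now


# Hypothetical moment at which the test is run.
now = datetime(2023, 6, 23, 17, 7)

print(extends_into_future(date(2023, 6, 23), now))  # True: today still has hours left
print(extends_into_future(date(2023, 6, 22), now))  # False: yesterday is entirely past
```

Under a strict "no part of the range may lie in the future" reading, a record captured today would therefore be flagged unless today's date is special-cased.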
I can understand that at the dataset level, but would expect it to be very rare at the record level. The earliest date can be a designated date for the run as well if you need to set an earlier date for some reason - or particular dataset. I don't see it as a big issue. |
I'm a simple soul. I side with @ArthurChapman. We have to be careful that we don't erect obstacles that everyone is then forced to climb over. KISS. Others? |
Another way of putting the problem I am seeing: treating any range that extends beyond 1700-today as an error conflates two classes of problems: (1) errors in accuracy (e.g. 198-10-15), and (2) broad statements about uncertainty (1500/2100). Broad statements about uncertainty are already captured separately with a measure of event duration. I will argue that it is important to be able to identify the first class of error in isolation, by implementing this test (in the easier way) to flag records whose range falls entirely outside the range 1700-present. The current statement of the test is more complex, as it raises the specter of special case handling of records with today's date. I also like KISS, and argue that the current description isn't the simple one. About 10% of the MCZ data has an unknown event date, recorded in the database (which enforces a start and end date as Oracle date fields) as 1700-01-01/2100-01-01. From a database perspective, this is a very useful pair: it is very easy to extract those 183136 records on the basis of those values, whereas narrowing by any inference makes these harder to locate as a single sort of data quality issue. |
OK, I'll buy it (range outside 1700-present) @chicoreus , but I would like to hear from the rest of the team. |
How many institutions do this other than MCZ? It does seem to be a problem. Under your reasoning @chicoreus, we can't only do "not in future". It would appear to me that the field is being used in ways it was never meant to be used, but I can't see any simple way around it other than to remove this test altogether. |
Re-examining this validation, I cannot see a problem with flagging a suspicious date (or date range) that is before 1700 or after the day the test is run. A "NOT COMPLIANT" would seem useful information to follow up on. A false positive flag seems preferable to me to a false negative where one end of a range is totally outside 1700-today. Considering #66, I'd be inclined to include invalid dates (e.g., Feb 30) under this test, as they are not in the possible range of dates, and they may well be formatted to ISO standard. This would make this validation dependent on #61. |
I'll suggest that we split this test into two separate tests, one of which tests whether or not the event date extends outside the boundaries 1700-present, and the other to test whether or not the event date falls entirely outside the boundaries 1700-present. The first test (crosses out of bounds) may represent problematic data or it may represent a large uncertainty. The second test (falls entirely out of bounds) likely flags data that contains errors (e.g. typos that leave a digit out of the year 190-05-18), but can potentially also flag rare but valid older material, and certain representations of zooarcheological material. This fits a principle of keeping tests simple and focused on particular potential problems. |
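The two proposed tests can be sketched as follows. This is an illustration only: the function names, the representation of an eventDate as a `(start, end)` pair of dates, and the fixed `latest` value are assumptions, and 1700-01-01 is used as the lower bound because that is the default under discussion at this point in the thread.

```python
from datetime import date

EARLIEST = date(1700, 1, 1)  # lower bound under discussion at this point in the thread


def crosses_bounds(start, end, earliest=EARLIEST, latest=None):
    """Test 1: some part of the interval lies outside [earliest, latest]."""
    latest = latest or date.today()
    return start < earliest or end > latest


def entirely_outside(start, end, earliest=EARLIEST, latest=None):
    """Test 2: no part of the interval lies inside [earliest, latest]."""
    latest = latest or date.today()
    return end < earliest or start > latest


latest = date(2018, 1, 1)  # hypothetical run date, fixed so the results are repeatable

examples = {
    "placeholder 1700-01-01/2100-01-01": (date(1700, 1, 1), date(2100, 1, 1)),
    "wide range 1650-01/1850-02": (date(1650, 1, 1), date(1850, 2, 28)),
    "year typo 0190-10-01": (date(190, 10, 1), date(190, 10, 1)),
    "ordinary 1905-05-18": (date(1905, 5, 18), date(1905, 5, 18)),
}

for label, (start, end) in examples.items():
    print(label,
          "| crosses:", crosses_bounds(start, end, latest=latest),
          "| entirely outside:", entirely_outside(start, end, latest=latest))
```

In this sketch the placeholder range and the wide range are flagged only by the first test, while the year typo is flagged by both, which is the separation between uncertainty and likely error that the split is meant to achieve.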
…terpreations of event date in range. DESCRIPTION: Marking one implementation as 3cff4dc4-72e9-4abe-9bf3-8a30f1618432 in what seems to be the desired consensus. Marking the other implementation with a new uuid as a potential alternative (or additional test).
From TG2 call, per @tucotuco: if you care about dates affected by unknown calendar use, use the start date 1918... Add note (here and in #84): if your use requires knowledge of date to a precision finer than one year and ten days, use 1918-02-14 as the earliestValidDate (as the calendar isn't certain). |
Slightly edited notes from an email, with added notes in italics from TG2 call: I'll suggest we switch to 1582-11-15. General agreement on this in TG2 call. That date is supportable on the basis of ISO 8601-1 asserting that the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data. Since Darwin Core is mute on whether proleptic Gregorian dates are allowed, no prior agreement exists, and we can argue that dates prior to this are automatically suspect. In practice, dates prior to 1752 in the British Empire, 1700 in various European Protestant countries, 1918 in Russian territories (1918-02-14), are suspect, as those are the years of adoption of the Gregorian calendar in those areas, and a reported date may not have the metadata needed to determine if it was a Julian date as originally asserted, or has been Difference between the Gregorian and Julian calendar has typically been around 10 days, but see the comparison on But, it gets worse... there is the issue of what the start day of the year was, e.g. with the British civil year starting on March 25 instead of January 1. So dates from the British Empire, or from British collectors from prior to 1752, may be off by 10 days and off by one year, depending. Looks like a good explication on https://www.cree.name/genuki/dates.htm Wikipedia cites this for the text "The best practice for citation of historically contemporary documents is to cite the date as expressed in the original text and to notate any contextual implications and conclusions regarding the calendar used and equivalents in other calendars. This practice permits others to re-evaluate the original evidence" We expect dwc:eventDate to contain a Gregorian date. dwc:verbatimEventDate allows for capture of a date as found in the We'll likely also need to consider this in #86, at least by including metadata that the assumed calendar for verbatim date is Gregorian. |
Changed Parameter(s) to "bdq:earliestValidDate, bdq:latestValidDate". I'll leave the outcomes of the 1500 and calendar discussions to @chicoreus to decide and implement. My conclusion to @tucotuco (loose and strict implementation) is to document (Notes) our Parameter(s) accordingly? |
I have updated the default earliest date to 1582-11-15 and added to the Notes: "That date is supportable on the basis of ISO 8601-1 asserting that 'the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data'." |
Suggest we add to the notes (also in #84, #76): If setting a Parameter for this test be aware that prior to 1918, there may be issues associated with the use of the Julian calendar versus the Gregorian calendar in some countries. Difference between the Gregorian and Julian calendar has typically been around 10 days, but see the comparison on https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar where "there is no difference in most of years 100 to 200... Also year 0 may or may not exist...". See also, the explanation on https://www.cree.name/genuki/dates.htm |
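The size of the Julian/Gregorian offset at the adoption dates mentioned in this thread can be checked with a short sketch. It uses the standard integer formula for the Julian Day Number of a Julian-calendar date, together with the fact that Python's `date` ordinals are proleptic Gregorian; the function names are hypothetical. It does not model the separate start-of-year issue (e.g. a civil year beginning on March 25).

```python
from datetime import date


def julian_to_gregorian(year: int, month: int, day: int) -> date:
    """Convert a Julian-calendar date to the equivalent (proleptic) Gregorian date."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    # Julian Day Number of the Julian-calendar date.
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    # Python ordinals count proleptic Gregorian days; JDN = ordinal + 1721425.
    return date.fromordinal(jdn - 1721425)


def offset_days(year: int, month: int, day: int) -> int:
    """Days by which the Gregorian date is ahead of the same-numbered Julian date.

    Only usable when (year, month, day) is also a valid Gregorian date.
    """
    return (julian_to_gregorian(year, month, day) - date(year, month, day)).days


print(offset_days(1582, 10, 4))   # 10, last Julian day before the papal reform
print(offset_days(1752, 9, 2))    # 11, last Julian day in the British Empire
print(offset_days(1918, 1, 31))   # 13, last Julian day in Russia
```

So "around 10 days" holds at the 1582 reform, but the gap had grown to 13 days by the 1918 transition, which may matter to anyone setting bdq:earliestValidDate for day-level precision.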
Restructured Parameter(s) and Source authority. |
Thumbs up if you agree to this change. Change Notes to: "The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1582-11-15 by convention. That date is supportable on the basis of ISO 8601-1 asserting that 'the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data'. If setting a Parameter for this test, be aware that prior to 1918 there may be issues associated with the use of the Gregorian calendar versus the Julian calendar in some countries. Difference between the Gregorian and Julian calendar has typically been around 10 days (but can be as great as 1 year and 10 days); see the comparison on https://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar where 'there is no difference in most of years 100 to 200... Also year 0 may or may not exist...'. If your use requires knowledge of date to a precision finer than one year and ten days, and you are not certain of the use of the Gregorian calendar, use 1919-01-01 as the earliestValidDate." |
@ArthurChapman change: "That date is supportable on the basis of ISO 8601-1 asserting" to "That date was chosen because ISO 8601-1 asserts", and then add ", and Darwin Core does not specify such." to the end of the sentence. The second paragraph needs some work too. Suggest changing notes to: The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not specify such. If setting a Parameter for this test, be aware that prior to about 1918 different countries (and researchers from those countries) switched from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. If your use requires knowledge of date to a precision finer than one year and ten days, and you are not certain of the use of the Gregorian calendar, use 1923-03-01 (when Greece adopted the Gregorian Calendar) as the earliestValidDate. |
We probably also need to add text to the notes on the order of "If temporal resolution of one year or better is important, different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to identify such records." |
I reiterate that I would not enumerate some transition dates while leaving out others. I would definitely not portray 1923 as if it was the latest transition. Transitions are still ongoing, and some may never happen. It would be discriminatory if any transition comes after any we chose and we can't have that. Better to cite Wikipedia and not give a cut-off date. |
@tucotuco I agree, different uses are likely to have different needs. I would advocate listing a few dates, as examples, to remind people that this may be an important concern for dates present in historical biodiversity collections data, and that the absence of clear metadata about interpretations of those dates may make any quality assurance approach using this test as a threshold impractical. |
How about: The results of this test are time-dependent: An invalid date for tomorrow will be valid tomorrow. This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not specify such. If temporal resolution of one year or better is important, different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to identify such records. Different countries (and researchers from those countries) have changed from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by simply selecting a transition date and using it as a threshold. |
That looks useful, with a few minor edits and one query: This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this. Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Countries and researchers have changed from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold. But that is what we are currently doing, aren't we? |
We aren't doing anything except providing the test. We aren't using the test for quality assurance. The user has to decide for what purpose it is appropriate to use the test. The bolded text is just guidance about that. |
I am happy with the last version. After all, we are just checking if a date is in a range, and the calendar dates are only an issue if one is setting a different date to the defaults. The majority of the tests will just test for the default, but if someone had a different start date (e.g. 1900) then they just need to be aware of the issues, and that is now covered in the notes. They could probably get around any problems in their parameter by setting the date a year earlier for bdq:earliestValidDate, or a year later for bdq:latestValidDate. |
Following up on #36 (comment) by @tucotuco inherent in the framework is that any test may be used for either quality control (finding data (or process improvements) that could be changed to improve the quality of data for some, in our case CORE use), or for quality assurance, filtering data in a MultiRecord to just those data that conform with the needs for some (in our case CORE) use. By using the framework, the tests are by design agnostic to their use. It does still make some sense to provide some non-normative (notes) guidance for research users who might want to parameterize this test for quality assurance (that they will quickly get into the morass that we have in this discussion, and we advise looking at other approaches to meet their needs). |
OK, how about this for Notes: This test provides for a default earliest date, which is 1582-11-15 by convention. That date was chosen because ISO 8601-1 asserts that "the use of proleptic Gregorian calendar dates prior are not allowed in ISO 8601-1 without prior agreement of the parties exchanging data", and Darwin Core does not comment on this. Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold. We place this text into the Standard document: Different calendars have been used at different times in different places, and the transcription of an original date in one calendar into dwc:eventDate, where a Gregorian Calendar is assumed, may or may not have been done with the correct translation of the date, and metadata may or may not be present to even identify such records. Countries and researchers have changed from the Julian calendar to the Gregorian calendar at different times. For example, Russia adopted the Gregorian Calendar on 1918-02-14, the British Empire on 1752-09-14, different regions in France between 1582 and 1760, with France also adopting the French Republican Calendar 1793-1805. The difference between the Gregorian and Julian calendar has typically been around 10 days. But the day that is considered the first day of the year has also changed at different times in different countries, meaning that the difference can be as great as 1 year and 10 days. 
Given the complexity, and ongoing nature of transitions between calendars, we do not advocate using this test for quality assurance by selecting a transition date and using it as a threshold. |
We should note, and specify in the validation data, whether or not imprecise event dates that span the boundary should be considered compliant, that is, are eventDate = "1582" or eventDate = "1582-11" compliant or not (I suspect they are). |
Missing a word: (I suspect they are not). They are reduced precision dates, so they aren't explicit about range, but they don't sound like they match the clause: "if the range of dwc:eventDate is entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive". |
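The reasoning about reduced precision dates can be made concrete with a small sketch. It assumes that "1582" and "1582-11" are expanded to the full year and full month respectively, and then applies the quoted clause ("entirely within the range bdq:earliestValidDate to bdq:latestValidDate, inclusive"). The helper names, the response strings, and the restriction to plain `YYYY`, `YYYY-MM` and `YYYY-MM-DD` values are simplifications for illustration, not the TG2 reference implementation.

```python
import calendar
from datetime import date


def expand(event_date: str) -> tuple[date, date]:
    """Expand YYYY, YYYY-MM or YYYY-MM-DD into the (start, end) days it covers."""
    parts = [int(p) for p in event_date.split("-")]
    if len(parts) == 1:
        (y,) = parts
        return date(y, 1, 1), date(y, 12, 31)
    if len(parts) == 2:
        y, m = parts
        last_day = calendar.monthrange(y, m)[1]  # number of days in that month
        return date(y, m, 1), date(y, m, last_day)
    y, m, d = parts
    return date(y, m, d), date(y, m, d)


def in_range(event_date: str, earliest=date(1582, 11, 15), latest=None) -> str:
    """COMPLIANT only if the whole expanded range lies within [earliest, latest]."""
    latest = latest or date.today()
    start, end = expand(event_date)
    return "COMPLIANT" if earliest <= start and end <= latest else "NOT_COMPLIANT"


for value in ["1582", "1582-11", "1582-11-15", "1582-12", "1583"]:
    print(value, in_range(value))
```

Both "1582" and "1582-11" begin before 1582-11-15, so under this reading they are not entirely within the range and come out as not compliant, in line with the correction above.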
…q#36 VALIDATION_EVENTDATE_INRANGE to match change of default value for the earliest date parameter to the start date for the Gregorian calendar 1582-11-15. Updating unit tests.
Agreed. Not.
…On Fri, Jun 23, 2023, 17:07 Paul J. Morris ***@***.***> wrote:
Missing a word: (I suspect they are *not*). They are reduced precision
dates, so they aren't explicit about range, but they don't sound like they
match the clause: "if the range of dwc:eventDate is entirely within the
range bdq:earliestValidDate to bdq:latestValidDate, inclusive".
|
Splitting bdqffdq:Information Elements into "Information Elements ActedUpon" and "Information Elements Consulted" |
Changed reference in Expected Response from ISO 8601-1 to ISO 8601 |