
Finalizing the documents - email correspondence #7

Open
17 of 21 tasks
SimonGoring opened this issue Dec 27, 2016 · 1 comment

Comments


SimonGoring commented Dec 27, 2016

Notes from the email chain:

  • The structure is not the same for all of the 347 csv files. For example, 21_test_pollen does not contain county and stateProvince, while 19874 does.
  • As I understand it, we want to publish an aggregate of all of these data as one published data set. If I have this wrong, and each data set must remain separate, then much of what follows does not apply, and signifies a huge management burden going forward that we'll have to discuss.
  • Based on the previous point, it would be best to have a single csv file with a complete set of all fields that might be populated in any subset. If that is a burden to produce, then it would be fine to have the separate files, but with the same fields in each.
  • Based on the previous point, we will need metadata for the aggregated data set.
  • dcterms:type should be 'PhysicalObject' if these records are based on material samples. If they are from literature, then dcterms:type should be 'Text'.
  • gbif.year looks like a field that we will not propagate - please confirm
  • dcterms:references is supposed to be the URI to the most detailed information available about the record. The values in that field point to an entire data set in Neotoma Explorer. That is OK if that is the finest level of granularity that exists.
  • The values in collectionID look more like a collectionCode, whereas a collectionID one would expect to be a URI - similar to the datasetID, datasetName distinction you created already.
  • For dynamicProperties, the leading and trailing double quotes should be left off. The examples on the Darwin Core Quick Reference Guide have the double quotes around the whole thing to show the content of that field. I see that is confusing. It should be pure JSON, thus starting with { and ending with }.
  • You populated organismQuanityType, but not organismQuantity. If there are no quantities, then I would suggest that a better mapping might be dwc:preparations for the values given in this field. This is especially true seeing that you populated this field with bone/tooth for the data set 4564_test_vertebrate fauna.csv. That's exactly the kind of thing we would put in dwc:preparations.
  • eventID and parentEventID are given, but are not URIs. In the broader context these would benefit from being GUIDs, but there is no problem with what you have provided as long as no one else out there happens to have a CollectionUnit_21, for example.
  • Is the eventDate really 1974-01-01, or is it all of 1974 for the file 21_test_pollen.csv? If the latter, better to include only the level of temporal precision that is justified. I wondered because the startDayOfYear and endDayOfYear are not consistent with the information in eventDate and year, month, day. The file 10527_test_testate amoebae.csv definitely has an inconsistency, because eventDate is given as 1969-06-29 where startDayOfYear (1) and endDayOfYear (365) cover the whole year. In 14610_test_insect.csv it looks from the eventRemarks like the eventDate should be 1975/1977 and that year, month, day, startDayOfYear, and endDayOfYear should be left blank.
  • Some data sets, such as 19874_test_diatom, do not have an eventDate.
  • The sampleSizeUnit contains values such as "Number of Identified Samples". Will that be clear to the users of the data? Does it mean, for example, 2 pollen grains of the given taxon were present and all were analyzed? This is just my naïveté, so don't worry if the rest of the world will know exactly what you mean. I see that the file 14610_test_insect.csv contains "Minimum Number of Individuals", which does seem quite clear and apt.
  • I would recommend rounding decimalLatitude and decimalLongitude to 7 decimal places where the precision given is more than that.
  • Coordinate precision seems to be an issue. In 21_test_pollen.csv the value is given as 0, whereas it looks like the value should be 0.5 (nearest half degree). In 4564_test_vertebrate fauna.csv the value is 0, but the coordinates are given to 13 decimal places. In 19874_test_diatom the value given is 0.0006699999504, which does not correspond to the nearest anything in terms of geographic coordinates. I see your calculation method for it in https://github.com/NeotomaDB/DwC-Mapping/blob/master/DarwinCoreMapping.Rmd, but that is not the meaning of dwc:coordinatePrecision (see http://rs.tdwg.org/dwc/terms/index.htm#coordinatePrecision). It looks like you were trying to provide something akin to coordinateUncertaintyInMeters, but that is a non-trivial, latitude-dependent calculation to make.
  • The polygons for the bounding boxes in footprintWKT are not closed, they need a fifth coordinate pair that is the same as the first one to close it.
  • The lists, such as associatedTaxa and associatedOccurrences, would benefit from using ` | ` (space, pipe, space) as the delimiter rather than just `|`.
  • Are there units to accompany the samplingEffort? Your document says the total count, but that will not be clear without having documentation to read.
  • The scientificName field contains the identificationQualifier even though you were able to break the qualifier out into its own field. The scientificName field should contain only the most specific, unequivocally determined scientific name, not the identification information. If you can keep those separate, that is best; otherwise the migrator will need to do so.
  • The sampleSizeValue (a) and sampleSizeUnit ('present/absent') for 4564_test_vertebrate fauna.csv are curious. What is that meant to signify? If it is meant to show that a bone/tooth for the species was present, then preparations='bone/tooth' and occurrenceStatus='present' already would do that and the sampleSize fields would not be necessary.
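To illustrate the dynamicProperties point above: serializing the key/value pairs directly yields pure JSON that starts with `{` and ends with `}`, with no surrounding double quotes. A minimal sketch (the field names and values are hypothetical, not from the Neotoma data):

```python
import json

# Hypothetical measurement values for one occurrence record.
properties = {"sampleAnalyst": "J. Smith", "depthInCore": 1.25}

# json.dumps emits a bare JSON object -- it begins with '{' and ends
# with '}'.  Wrapping the whole thing in an extra pair of double quotes
# (as the Quick Reference Guide examples might suggest) would make it a
# quoted string rather than an object.
dynamic_properties = json.dumps(properties)
print(dynamic_properties)
```

Round-tripping with `json.loads` is a quick way to check that a stored value really is bare JSON and not a doubly-quoted string.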
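On the eventID/parentEventID point: one way to guarantee that identifiers like `CollectionUnit_21` cannot collide with another provider's is to mint GUIDs. A sketch, assuming a `urn:uuid:` URI form is acceptable (that choice is an assumption, not something the thread specifies):

```python
import uuid

# uuid4() gives a random GUID; prefixing with "urn:uuid:" turns it into
# a resolvable URN, so the eventID is both globally unique and a URI.
event_id = f"urn:uuid:{uuid.uuid4()}"
print(event_id)
```

The trade-off is that GUIDs are opaque, so the human-readable Neotoma identifier would need to be kept in another field (e.g. in eventRemarks or a local identifier column).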
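The two coordinate points above can be sketched together. dwc:coordinatePrecision is the decimal fraction of a degree to which the coordinates are reported (0.5 for the nearest half degree, 0.0000001 for seven decimal places), not an uncertainty radius in meters. The helper name and the sample latitude below are illustrative, not from the repository:

```python
def round_coordinate(value, places=7):
    """Round decimalLatitude/decimalLongitude to at most `places` decimals."""
    return round(value, places)

# dwc:coordinatePrecision expresses reporting precision as a decimal
# fraction of a degree: 0.5 means "nearest half degree"; coordinates
# kept to 7 decimal places correspond to a precision of 0.0000001.
precision_for_seven_decimals = 1e-7

# A source value given to 13 decimal places, trimmed to 7.
lat = round_coordinate(44.1234567890123)
print(lat)  # 44.1234568
```

Converting this to coordinateUncertaintyInMeters is a separate, latitude-dependent calculation (a degree of longitude shrinks toward the poles), which is why the two terms should not be conflated.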
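For the footprintWKT point, a closed bounding-box polygon needs five coordinate pairs, the last repeating the first. A minimal sketch (the function name and example box are hypothetical):

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT POLYGON for a bounding box: five coordinate
    pairs, the fifth repeating the first to close the ring."""
    ring = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first pair to close the polygon
    ]
    return "POLYGON ((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"

print(bbox_to_wkt(-90.5, 44.0, -90.0, 44.5))
```

Note the WKT convention of longitude (x) before latitude (y) inside the coordinate pairs.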
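The delimiter suggestion for associatedTaxa and associatedOccurrences is simple to apply when the lists are assembled: join on `" | "` rather than `"|"`, which also makes splitting the field back apart unambiguous. The taxa below are placeholder values:

```python
taxa = ["Picea glauca", "Pinus strobus", "Quercus"]

# Joining on " | " (space, pipe, space) keeps the list readable and
# lets consumers split it back into the original items.
associated_taxa = " | ".join(taxa)
print(associated_taxa)  # Picea glauca | Pinus strobus | Quercus
```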
SimonGoring added a commit that referenced this issue Dec 27, 2016
…zed_run`, in reference to an issue in Issue #7.
SimonGoring added a commit that referenced this issue Dec 28, 2016
…ds. Also changing `organismQuanityType` to `dwc:preparations`.
SimonGoring added a commit that referenced this issue Dec 28, 2016
… comment in Issue #7.  Added a precision function, but I'm not sure it will work as intended, since it won't account for rounding in whole numbers.

Also checked the issue regarding collection dates.  I've fixed the `startOfYear`/`endOfYear` issue, based on comments in Issue #7, but some dates in Neotoma do appear to show false precision, reporting multiple collection dates on the first of the year, or month.  While these dates are unlikely, they represent the information within Neotoma.  My feeling is that they should be retained.
@SimonGoring
Contributor Author

Relating to item:

dcterms:references is supposed to be the URI to the most detailed information available about the record. The values in that field point to an entire data set in Neotoma Explorer. That is OK if that is the finest level of granularity that exists.

At this point we believe that the dataset is the fundamental unit for most records, and that Explorer provides the most detailed description of the dataset.

SimonGoring added a commit that referenced this issue Feb 23, 2017
… Samples or Individuals to the total sum of the units.

Removed the leading and trailing quotes in the `dynamicProperties` field.

All relating to issue #7.
SimonGoring added a commit that referenced this issue Aug 29, 2017
#7.  There is no specific API link to the actual collection, although this can be added in the future.

The use of `map()` and `bind_rows()` at the bottom of the function resolves issue #7's note about having different columns in each file. We will generate a single large file for every data type (possibly?).
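The column-union behaviour described above comes from R (`purrr::map()` plus `dplyr::bind_rows()` take the union of all columns, filling the gaps). The same idea can be sketched in Python with the standard library; the file names and contents below are illustrative stand-ins, not the real Neotoma files:

```python
import csv
import io

# Hypothetical stand-ins for csv files with differing column sets.
files = {
    "21_test_pollen.csv": "eventID,county\ne1,Dane\n",
    "19874_test_diatom.csv": "eventID,stateProvince\ne2,Wisconsin\n",
}

# Collect rows and the union of all field names, preserving first-seen order.
rows, fields = [], []
for text in files.values():
    reader = csv.DictReader(io.StringIO(text))
    fields += [f for f in reader.fieldnames if f not in fields]
    rows.extend(reader)

# Write one combined table; restval="" blanks the fields a file lacked.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=fields, restval="")
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Each output row carries every column in the union, with empty strings where a source file had no such field, which is what makes a single aggregate file per data type feasible.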