Name occurrence verification needs #67
Comments
Thank you for your feedback, @Teinostoma. Currently there is a hybrid approach to BHL names: some names were found with an ngram approach, and others with BHLindex. BHLindex creates very few false positives but misses many names with OCR errors. The names found before BHLindex, using ngrams, did include many names with OCR errors, but they also contain a lot of false positives. We are thinking about ways to fix this problem. Do you know a good resource for fossil mollusk names? Currently, I think, there are 3 sources of mollusk names: WoRMS, PaleoBioDB, and, for old names, Sherborn's Index Animalium. Context, usage of dates, and figuring out how to reconcile abbreviated names are definitely ways to improve name-finding. I am actually writing a grant about exactly that. Implementing user feedback for names would be out of scope for BHLindex, and more of a decision for the BHL folks (@mlichtenberg, @cajunjoel). @gdower also might be interested. |
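The trade-off described in this comment (exact lookup misses OCR-damaged names, tolerant lookup admits false positives) can be illustrated with a small sketch. This is not how BHLindex or the earlier ngram finder work; Python's standard `difflib` is used here only as a stand-in for any similarity-based matcher. The `KNOWN_NAMES` list, the function names, and the `cutoff` value are assumptions made for illustration.

```python
from difflib import get_close_matches

# Hypothetical reference list; a real run would load names from a source
# such as WoRMS or Sherborn's Index Animalium.
KNOWN_NAMES = ["Nucula hammeri aalensis", "Architectonica abbottii"]


def exact_match(candidate, names=KNOWN_NAMES):
    """Dictionary-style lookup: precise, but blind to OCR errors."""
    return candidate if candidate in names else None


def fuzzy_match(candidate, names=KNOWN_NAMES, cutoff=0.85):
    """Similarity lookup: tolerates OCR errors.

    Lowering the cutoff recovers more damaged names but also lets
    more unrelated strings through (false positives).
    """
    hits = get_close_matches(candidate, names, n=1, cutoff=cutoff)
    return hits[0] if hits else None


ocr_text = "Nucula hammen aalensis"  # OCR misreading of "hammeri"
print(exact_match(ocr_text))  # None: the exact lookup misses the name
print(fuzzy_match(ocr_text))  # Nucula hammeri aalensis
```

The cutoff is the knob that moves a finder between the two failure modes discussed in the thread, so any real value would need to be tuned against checked data.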
I believe that WoRMS has all the names from MolluscaBase, so I don't
think MolluscaBase would need separate attention.
Paleobiology Database doesn't have very thorough coverage of many mollusc
faunas; most of the attention has gone to "what are things you can do with
this data" rather than to supporting data generation and quality control (a
common problem of large biodiversity databases).
Ruhoff (https://repository.si.edu/handle/10088/5331) adds a couple of
decades beyond Sherborn, though it is not quite as thorough. Fossils were
not included in the Zoological Register for a while, so it does not help
with them for the first few decades.
--
Dr. David Campbell
Associate Professor, Geology
Department of Natural Sciences
110 S Main St, #7270
Gardner-Webb University
Boiling Springs NC 28017
|
Does Ruhoff exist as curated digitized data? I tried to OCR it using Adobe Acrobat and found that the result contains a fair number of errors: F_A_Ruhoff_Mollusca_1850_1870.txt <https://github.com/gnames/bhlindex/files/11763335/F_A_Ruhoff_Mollusca_1850_1870.txt>. If the OCR errors are corrected in the file (from the species epithet to the year), it would be fairly easy to convert it into a data source. |
I don't know of a curated version of Ruhoff; I mostly use my print copy,
which doesn't help what you need much.
|
@Teinostoma, I did my best to clean up the OCR for names in Ruhoff. Would it be OK for you to look at the result and tell me what you think? https://github.com/gnames/ds-ruhoff-mollusca/blob/master/data/07-fmt-names.csv To avoid problems with UTF-8, it is better to use LibreOffice instead of Excel. I only paid attention to the names themselves (1st and 2nd columns); the metadata after the names is not as clean. I tried to reconcile them against other datasets, and it looks like about half of them are new to my data. |
It looks like a good start. I noticed two corrections for the first page:
in *Nucula hammen aalensis*, *hammen* is an error for *hammeri*,
and *Architectonica abbottii* Gabb, 1861 is missing, but that's far better
than the OCR.
|
Thank you, @Teinostoma! I added a fix (gnames/ds-ruhoff-mollusca@1299450). I currently have 3 levels of curation quality: "curated", when I am pretty sure that specialists have made a significant effort to scrutinize the data; "auto-curated", when cleaning is done mostly by scripts; and the rest, when curation is unknown. For example, IRMNG is considered curated, GBIF is auto-curated, and ION is not curated. Do you think it is good enough to apply "auto-curated" to the data? It would push its matching results above "non-curated" names. |
That seems the right level to me.
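As a rough illustration of how the three curation levels could push matching results up or down, here is a minimal sketch. It is not the verifier's actual ranking code; the `Curation` enum, the `SOURCE_CURATION` table, and `rank_matches` are hypothetical names, and only the level assigned to each source comes from the thread.

```python
from enum import IntEnum


class Curation(IntEnum):
    """Curation levels from the thread; a higher value ranks first."""
    NOT_CURATED = 0   # curation unknown
    AUTO_CURATED = 1  # cleaned mostly by scripts
    CURATED = 2       # scrutinized by specialists


# Levels for the sources named in the thread; "Ruhoff" gets the level
# agreed on above.
SOURCE_CURATION = {
    "IRMNG": Curation.CURATED,
    "GBIF": Curation.AUTO_CURATED,
    "Ruhoff": Curation.AUTO_CURATED,
    "ION": Curation.NOT_CURATED,
}


def rank_matches(matches):
    """Sort (source, name) pairs so better-curated sources come first.

    Sources missing from the table are treated as not curated.
    """
    return sorted(
        matches,
        key=lambda m: SOURCE_CURATION.get(m[0], Curation.NOT_CURATED),
        reverse=True,
    )


matches = [
    ("ION", "Nucula hammeri"),
    ("Ruhoff", "Nucula hammeri"),
    ("IRMNG", "Nucula hammeri"),
]
print([source for source, _ in rank_matches(matches)])
# ['IRMNG', 'Ruhoff', 'ION']
```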
|
I attempted to detect more elusive typos. It looks like about 25% of the names in the publication are new to https://verifier.globalnames.org/. The results are here: https://raw.githubusercontent.com/gnames/ds-ruhoff-mollusca/master/data/08-reconsile.csv |
OCR often does very poorly on documents in BHL, and the list of names being searched for is very incomplete, at least when it comes to fossil mollusks. Authors also did not make this easy, often using idiosyncratic ways of abbreviating. As a result, both the false positive and false negative rates are very high in the documents that I am reading on BHL. A few ideas:
Is there a way to take the date of the publication into consideration? Names published after a publication was written will not be found in that publication (for example, the word lens will not be a reference to the genus Lens Simpson, 1900 in publications from the 1800s). This would help decrease false positives.
Is there a way to allow users to quickly indicate "here is a name missed by the system", "this is correct", "this name finding is spurious", etc.? It would require verification to protect against trolling or errors, but could be a useful way to improve the name finding.
Is there a way to take context into account to identify higher taxonomic levels? This is especially of value for homonyms. For example, being able to search for references that contain both Auricularia and Mollusca would avoid the huge number of hits for the fungus Auricularia.
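Two of these ideas, the date check and the context check, are simple enough to sketch. The code below is an illustration only, not an existing BHL or BHLindex feature; the function names and the sample sentences are invented, and the only real data point is the year for Lens Simpson, 1900 from the example above.

```python
import re


def could_occur(name_year: int, publication_year: int) -> bool:
    """A name cannot be cited in a work published before the name itself."""
    return name_year <= publication_year


def mentions_all(text: str, terms: list[str]) -> bool:
    """True when every term occurs in the text as a whole word (case-insensitive)."""
    return all(
        re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        for term in terms
    )


# Date check: "lens" in an 1850 text cannot refer to Lens Simpson, 1900.
print(could_occur(1900, 1850))  # False
print(could_occur(1900, 1912))  # True

# Context check: keep only pages that mention both the genus and the
# higher taxon. Both sentences are invented sample text.
mollusk_page = "Auricularia, a genus of Mollusca, is figured on plate 3."
fungus_page = "Auricularia grows on elder wood."
print(mentions_all(mollusk_page, ["Auricularia", "Mollusca"]))  # True
print(mentions_all(fungus_page, ["Auricularia", "Mollusca"]))   # False
```

The date check needs a reliable publication year for both the name and the scanned work, and the context check only helps when the higher taxon is actually printed near the name, so both would act as filters on top of name finding rather than replacements for it.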