Getting started #1

Open
Daniel-Mietchen opened this issue Aug 2, 2014 · 8 comments

@Daniel-Mietchen

csl2wikidata sounds interesting, and I would like to try it out by uploading the references (all, or at least the openly licensed ones) cited in
https://en.wikipedia.org/wiki/Malaria
to Wikidata.

However, I could not figure out how to get started, so some guidance on that would be appreciated.

@mitar
Owner

mitar commented Aug 2, 2014

It is not yet finished. We stopped because we couldn't find a good input format. At first we thought we would use CSL, but CSL is really just a style-sheet language for citations. It does define an input format, but that format is not well defined, or at least I have not found a nice way (or code) to parse it and get out a document/object I could then push into Wikidata. If you have suggestions, please help.

We are planning to continue working on this at https://wikimania2014.wikimedia.org/wiki/Hackathon/Citathon, but we could discuss now what to use. So, I would need a library which takes a bibliographic entry in some format and produces a standardized object I can then use to make API calls to Wikidata.
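As a rough illustration of the kind of "standardized object" meant here, the following Python sketch flattens one CSL-JSON item into a plain dictionary. It is not code from csl2wikidata: the function name `csl_to_record` and the output keys are hypothetical, and it assumes the input follows the usual CSL-JSON layout (`author` as a list of name objects, `issued` as `date-parts`).

```python
from typing import Any, Dict, List, Optional


def csl_to_record(item: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one CSL-JSON item into a simple, predictable dictionary.

    Hypothetical helper; the output keys are an assumption, not a standard.
    """
    # CSL-JSON stores personal names as {"family": ..., "given": ...};
    # institutional names use {"literal": ...} instead.
    authors: List[str] = []
    for name in item.get("author", []):
        if "literal" in name:
            authors.append(name["literal"])
        else:
            parts = [name.get("given", ""), name.get("family", "")]
            authors.append(" ".join(p for p in parts if p))

    # Dates are nested as {"date-parts": [[year, month, day]]}; month and
    # day are optional, so only the parts that are present are kept.
    date: Optional[str] = None
    date_parts = item.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0]:
        date = "-".join(f"{int(p):02d}" for p in date_parts[0])

    return {
        "type": item.get("type"),
        "title": item.get("title"),
        "authors": authors,
        "journal": item.get("container-title"),
        "volume": item.get("volume"),
        "issue": item.get("issue"),
        "pages": item.get("page"),
        "doi": item.get("DOI"),
        "issn": item.get("ISSN"),
        "date": date,
    }
```

Missing fields come back as `None` rather than raising, so a later step can decide which fields are required before anything is sent to Wikidata.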

Here we started discussing what such an input format should be:

https://etherpad.mozilla.org/TCrCIcyEDL

cc @jure

@Daniel-Mietchen
Author

What I have in mind is the following steps:

  1. generate list of DOIs cited in https://en.wikipedia.org/wiki/Malaria
  2. look up the respective metadata via the CrossRef API
  3. (if necessary) convert the metadata into a format that csl2wikidata can ingest
  4. use the Wikidata API to create Wikidata items for each of these bibliographic items, adding the metadata using the appropriate properties.
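Steps 1 and 2 above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names `extract_dois` and `crossref_url` are hypothetical, the regular expression assumes DOIs appear as a `|doi=` parameter in citation templates, and the URL follows the CrossRef REST API pattern `https://api.crossref.org/works/<DOI>`. No network request is made here.

```python
import re
from typing import List
from urllib.parse import quote

# Matches the doi parameter of a citation template, e.g. "|doi=10.1000/xyz".
# Assumption: DOIs in the article are given this way, not as bare links.
DOI_PARAM = re.compile(r"\|\s*doi\s*=\s*(10\.\d{4,9}/[^\s|}]+)", re.IGNORECASE)


def extract_dois(wikitext: str) -> List[str]:
    """Step 1: return DOIs found in the wikitext, in order, without duplicates."""
    seen = set()
    dois: List[str] = []
    for match in DOI_PARAM.finditer(wikitext):
        doi = match.group(1)
        key = doi.lower()  # DOIs are case-insensitive, so compare lowercased
        if key not in seen:
            seen.add(key)
            dois.append(doi)
    return dois


def crossref_url(doi: str) -> str:
    """Step 2: build the CrossRef REST API URL for one DOI (no request is sent)."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")
```

For example, `extract_dois` could be run on the wikitext of the Malaria article, and each resulting URL fetched with any HTTP client to obtain the metadata needed for step 3.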

I set up a sample item about a research article at https://www.wikidata.org/wiki/Q15625490 . We (pinging @notconfusing and @wrought) want to use the Malaria ones in a demo of OA signalling on Friday morning (cf. https://wikimania2014.wikimedia.org/wiki/Submissions/Marking_open-access_references_cited_on_Wikipedia ). All three of us will be at the Citathon.

@mitar
Owner

mitar commented Aug 3, 2014

So yes, please help me find a JavaScript library for step 3. :-)

@HLHJ-zz

HLHJ-zz commented Aug 3, 2014

Steps 1-3 are already automated in Zotero; it has a function where you paste in a list of DOIs and it checks Crossref and returns all the metadata. Zotero can export (and import) Wikipedia Citation Templates, BibTeX, BibLaTeX, RefWorks, MODS, COinS, Citation Style Language/JSON, Refer/BibIX, RIS, TEI, Evernote, EndNote, Bibliontology RDF, Bookmarks, Unqualified Dublin Core RDF, and Zotero RDF. Mvolz borrows the Zotero JavaScript libraries to do this in Citoid, I believe.

I'd suggest making Zotero do step 4, too. There is interest in the Zotero community in making it interact with Wikidata, as it already interacts with Google Scholar and Crossref.
https://forums.zotero.org/discussion/36151/wikified-copyleft-bibliographic-database/
It would also make it easy for anyone already using Zotero to contribute. I have hundreds of papers in mine, many with metadata proofread by me; if I could upload them with a click, I would, regularly.

Since some of Google Scholar's and even the publishers' metadata has errors, it might be necessary to maintain errata (so it can automatically ignore repeated uploads of this data), and skip/manually merge fields that are already uploaded, so as not to overwrite good data with bad.
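One way to read the "errata plus skip existing fields" idea is the merge rule sketched below in Python. It is an assumption about how such a policy might look, not an existing Zotero or Wikidata feature: `merge_metadata` is a hypothetical name, records are plain dictionaries, and the errata list is a set of `(field, value)` pairs known to be wrong.

```python
from typing import Dict, Optional, Set, Tuple

Record = Dict[str, Optional[str]]
Erratum = Tuple[str, str]  # (field name, known-bad value)


def merge_metadata(existing: Record, incoming: Record, errata: Set[Erratum]) -> Record:
    """Fill gaps in `existing` from `incoming` without overwriting anything."""
    merged = dict(existing)
    for field_name, value in incoming.items():
        if value is None or value == "":
            continue  # nothing to contribute for this field
        if (field_name, value) in errata:
            continue  # known-bad value: ignored on every repeated upload
        if merged.get(field_name):
            continue  # field already has data: keep it, flag for manual merge
        merged[field_name] = value
    return merged
```

With this rule, re-uploading the same faulty publisher record is harmless: values on the errata list are dropped, and fields that someone has already filled in are left alone.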

Should the default dump the data in a bot's userspace, if you don't configure it with your own Wikimedia account details? Should via-a-standard-bot be the only option? The exact UI would hardly matter for a beta version.

Mitar, you asked for fields we might need: there are the ones already generally available, then there are those we might want to add. Apart from the list on the etherpad, which was just a list of CSL fields that corresponded to Wikidata fields, I don't know that we made one. I was writing one a few days ago, which I post below; it's not very well thought out yet, so please comment.

Standard fields

(these are the ones Zotero already has for journal articles, lightly modified for database form)

type = journal article
abstract = (fair use to use in catalogues, by long-standing custom)
doi =
issn = 
volume = 
issue = 
pages = 
author(s) -> link to separate entries, merge manually later
    In an author entry:
    -- last name (problem here with, e.g., Chinese names)
    -- first name
    -- other names (people often publish under different names, should accept all scripts)
    -- institution -> link to
    -- contact info (usually an e-mail in a publication, maybe skip this one on Wikidata)
    -- urls of personal/institutional website(s)
title = 
journal -> link to (should contain journal abbreviation; there are databases of these. The SHERPA ROMEO and/or DOAJ databases might give you their data.)
language =
date = 
series -> fields like title, editor...
url of copy of record (the thing the DOI resolves) = 
archive->
catalogue->
call number= (presumably for one of the above)
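To make the shape of this list concrete, here is a minimal Python sketch of the standard fields as two linked record types. The class and attribute names are hypothetical, only a subset of the fields is shown, and "link to" relations are simplified to nested objects or plain strings rather than real Wikidata item links.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    # Kept as a separate entry so duplicates can be merged manually later.
    last_name: str
    first_name: Optional[str] = None
    other_names: List[str] = field(default_factory=list)  # any script
    institution: Optional[str] = None  # would be a link to another entry
    urls: List[str] = field(default_factory=list)


@dataclass
class JournalArticle:
    title: str
    authors: List[Author] = field(default_factory=list)
    journal: Optional[str] = None  # would be a link to a journal entry
    doi: Optional[str] = None
    issn: Optional[str] = None
    volume: Optional[str] = None
    issue: Optional[str] = None
    pages: Optional[str] = None
    language: Optional[str] = None
    date: Optional[str] = None
    url_of_record: Optional[str] = None  # what the DOI resolves to
    type: str = "journal article"


# Usage: everything except the title is optional, so partial records are fine.
article = JournalArticle(
    title="An example article",
    authors=[Author(last_name="Doe", first_name="Jane")],
    doi="10.1234/abc",
)
```

Keeping authors as their own records, rather than as strings on the article, matches the "link to separate entries" note above and leaves room for name variants.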

Fields that might need modifying

  • Copyright (Zotero has a "Rights" field that mushes all this information together in a format they are standardizing)
    • Rights reserved (CC-BY, All rights reserved, Public domain, etc.)
    • Rightsholder (the authors, the Royal Society for the Advancement of Science, Nature Publishing Group, etc.)
    • Copyright year (1993, 1826, etc.)
      A field for the nature of the rightsholder (natural person? institution?) might be useful for automatically determining if the copyright has expired, as I think it matters in some jurisdictions.
  • URL
    • url(s) of versions of record =
    • url(s) of postprints =
    • url(s) of preprints =
    • url(s) of drafts =
  • DOI – DOIs of datasets, graphs, and other information incorporated into the article

Fields that might be desirable

  • Wikimedia pages that cite or use images from this article (link to)
  • Peer-reviewed (boolean)
  • Review article (boolean)
  • Meta-analysis (boolean)
  • Contains previously unpublished data (boolean)
  • Retracted (link to retraction, if any)
  • Corrections (link to any)
  • Responses (link to any)
  • Article republished from (link to original article)
  • Article is a translation/discussion of (link to original article)
  • Data from study/project (link to a study/studies/projects also used in other articles, e.g. 'CERN' or '1970 British Cohort Study')
  • Contains data also published in (link to other article(s))
  • Registered trial (link to pre-trial registration, if any)
  • Datasets (link to if on Wikidata, give DOI if they have one)
  • Acknowledgements (might link to people/"authors", institutions)
  • Conflicts of Interest (might link to institutions; might have different ones for each author)
  • Funding sources (link to institutions)
  • Citations (links to articles etc.)

There seem to be some publishers who claim that the citations in the bibliography of a scholarly article cannot be reproduced online under fair use. Others disagree. Presumably Wikimedia has professionals who could advise. It would be a really useful field, and can certainly be added for OA articles under CC-0, CC-BY, or CC-BY-NC.

@HLHJ-zz

HLHJ-zz commented Aug 3, 2014

A field for linking to an open lab notebook containing the raw data of the study, probably as URL(s), would also be useful.

@ghost

ghost commented Aug 5, 2014

Would it be better to be comprehensive with the choice of fields, as you have tried to do above, or be selective in order to make it easier to start?

If we avoid anything with potential copyright issues to start with (abstract, citations, etc.), this would mean less to worry about while getting the project started. It might mean revisiting data at a later point to add more fields, but that is probably not a big deal.

@mitar
Owner

mitar commented Aug 14, 2014

@mitar
Owner

mitar commented Oct 20, 2014

We worked a bit more on this at the PLOS citations hackathon event, and here are a few notes.
