Architecture Proposal for D-PLACE v3.0

The data collected/aggregated/curated for D-PLACE 3 should be organized in a more distributed way, optimized for distributed curation/maintenance, in particular:
- making additions easy (even more so if there are blog posts describing how to do that, e.g. how to compare your own data with D-PLACE's)
- integrating work-in-progress datasets in a better way
- providing easier access to individual datasets

Examples

Aggregated databases using a model that could serve as example for D-PLACE:

- lexibank: a big collection of lexical data for the world's languages
- UniversalDependencies (UD): on GitHub, website

One GitHub repository per dataset/phylogeny

Each citable unit of data aggregated in D-PLACE should be curated in its own GitHub repository (preferably, but not necessarily, under the D-PLACE GitHub organization).
The basic distribution format for datasets should be a CLDF derivative - but additional distribution formats could be provided, e.g.
- a "flat CSV file" format for datasets
- a Nexus file for phylogenies
cldfbench should be used to transparently separate editable data sources from distribution formats in these repositories.
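
To illustrate what a "flat CSV file" derived from the CLDF data could look like, here is a minimal sketch that pivots long-form value rows (one row per society/variable pair) into one row per society. It assumes the input uses the `Language_ID`, `Parameter_ID` and `Value` columns of CLDF's default ValueTable; the function name `flatten_values`, the output column `Society_ID` and the sample IDs are hypothetical, and the actual flat format for D-PLACE is still to be decided.

```python
import csv
import io


def flatten_values(values_csv: str) -> str:
    """Pivot long-form value rows into one CSV row per society.

    Assumption: columns are named as in CLDF's default ValueTable
    (Language_ID, Parameter_ID, Value); real datasets may differ.
    """
    rows = list(csv.DictReader(io.StringIO(values_csv)))
    # Variables become columns, in order of first appearance.
    variables = list(dict.fromkeys(r["Parameter_ID"] for r in rows))
    by_society = {}
    for r in rows:
        by_society.setdefault(r["Language_ID"], {})[r["Parameter_ID"]] = r["Value"]

    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=["Society_ID", *variables],
        restval="",  # leave cells empty where a society has no value
        lineterminator="\n",
    )
    writer.writeheader()
    for society, values in by_society.items():
        writer.writerow({"Society_ID": society, **values})
    return out.getvalue()
```

In a cldfbench-managed repository, a step like this could run after the CLDF data has been created, writing its result to `dist/`.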

Each repository would have the same top-level layout:

dplace-dataset-EA
|- raw/   possibly a spreadsheet? whatever is easiest to edit
|- cldf/  the primary distribution format
|- dist/  secondary distribution formats, e.g. "flat CSV file"
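
A shared layout is easy to check automatically. The following sketch shows the kind of check a curation package could run over each repository; the function name `missing_dirs` is hypothetical, and treating all three directories as mandatory is an assumption (`dist/` might well be optional).

```python
from pathlib import Path

# Top-level directories named in the layout above.
REQUIRED_DIRS = ("raw", "cldf", "dist")


def missing_dirs(repo: Path) -> list[str]:
    """Return the required top-level directories that `repo` lacks."""
    return [name for name in REQUIRED_DIRS if not (repo / name).is_dir()]
```

For a conforming checkout, `missing_dirs(Path("dplace-dataset-EA"))` would return an empty list.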

Each repository can be curated/released separately, and released versions will be separately usable/citable via DOI from Zenodo.

While this will add some overhead (there will be more repositories and each repository will have quite a few files), I still think this would make the data more transparent.

At present, this would mean about 12 repositories for D-PLACE datasets, and 26 for D-PLACE phylogenies. From our experience with lexibank (>100 datasets), that would seem manageable, also for the foreseeable future. Considering the different types of data in D-PLACE, a naming convention for the repositories could be adopted (something along the lines of dplace-dataset-<AUTHOR><YEAR> and dplace-phylogeny-<AUTHOR><YEAR>), to make browsing simpler.
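
A naming convention like this also lets tools tell datasets from phylogenies by repository name alone. As a rough sketch, assuming names of the form `dplace-<kind>-<identifier>` with an alphanumeric identifier (the exact pattern is not settled, and `parse_repo_name` is a hypothetical helper):

```python
import re

# Assumed pattern: "dplace-", then the kind of data, then an alphanumeric
# identifier such as <AUTHOR><YEAR> or a short name like "EA".
NAME_PATTERN = re.compile(r"dplace-(dataset|phylogeny)-([A-Za-z0-9]+)")


def parse_repo_name(name: str):
    """Return (kind, identifier) for a conforming name, else None."""
    match = NAME_PATTERN.fullmatch(name)
    return (match.group(1), match.group(2)) if match else None
```

With this pattern, `parse_repo_name("dplace-dataset-EA")` yields `("dataset", "EA")`, while a name outside the convention yields `None`.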

The D-PLACE umbrella/brand

Just like UD's "tools" and "docs" repositories, or lexibank's pylexibank package, there would be quite a bit of infrastructure making up the D-PLACE umbrella or brand:

- pydplace: a curation package, ensuring individual repositories meet common standards
- D-PLACE/societysets: a repository to catalog society sets (and mappings between them) which have been used to collect cross-cultural data (much like https://concepticon.clld.org catalogs concept lists used to collect lexical data)
- D-PLACE GitHub org
- D-PLACE Zenodo community: cataloging released versions of D-PLACE datasets
- https://d-place.org: the store front
- The blog/book/cookbook?

Open Questions

Attribution: Will easy access to individual datasets (e.g. EA) dilute the D-PLACE brand? Can this be prevented with suitable citation recommendations? E.g. lexibank's datasets typically have titles like "CLDF dataset derived from SOURCE", so D-PLACE datasets could have titles like "D-PLACE dataset derived from Murdock's Ethnographic Atlas".
