GSoC: Javascript API for data catalog #1290

larsyencken · 2022-03-18T13:50:39Z

Maintainer

DIFFICULTY: MEDIUM, SIZE: 175-350 HOURS (variable)

Overview

Our World in Data has been building a new publicly accessible data catalogue, where we bring together data from a huge variety of public sources into one space for remixing and reuse. Currently, this catalogue has a Pythonic API and publishes data in Feather (Apache Arrow) + JSON format.

We want to encourage public reuse of this data, and we see that many of the most interesting data visualisations today are done in Javascript using Observable notebooks. We believe that a simple but effective Javascript API could support FAIR data principles (findable, accessible, interoperable, reusable), and make it much easier for people to add their piece to the conversation on major global issues.

We believe the arquero project is a promising target for this API, as a way of quickly loading data for visualisation. Our goal is to allow users to find the data they want in a single line of code, and likewise load the data they need in a single line. We also want to make it very simple for users to cite the data source correctly, based on metadata in the catalogue. This will help ensure that people can trace back where data came from, a level of transparency that builds trust and ensures data providers get credit for their work.

Technical background

Until now, all the data that we process and then show using Grapher on our website has been stored in a MySQL database in a somewhat limited format, where everything has to fit a fixed structure of (year, country, indicator_id, data_value). However, we have come to believe that this structure is not flexible enough, and makes the data too hard for the general public to reuse. For this reason, we are building a new public data catalog.

The new catalog consists of three main projects:

walden: snapshots of major datasets in whatever format they arrived in (e.g. a zip file of Excel sheets)
etl: a compute graph that transforms, cleans and harmonises the data, creating a data catalog of it locally on disk. OWID staff then publish this catalog so that it is available over HTTP.
owid-catalog-py: the Pythonic API for the data catalog

Figure: The different sources of data that we snapshot into walden and then transform in the etl.

A core difference in the new data catalog is that the entire catalog is based on flat files that can live locally on your disk. A catalog is just a folder, with an internal hierarchy: <channel>/<namespace>/<dataset>/<version>/<table>

channel: a top-level folder we use to indicate the level of data-cleanliness (e.g. "meadow" means the data was imported as-is, but "garden" means the data has been cleaned)
namespace: a folder indicating the data provider, for example who for "World Health Organisation"
dataset: a folder indicating the collection of data, for example gbd for "Global Burden of Disease"
version: a folder indicating the data's year or date, e.g. "2017", "2017-03-01"
table: a csv or feather file representing a single table of data (a.k.a. data frame). The table has an index indicating its dimensions, e.g. (year, country, gender), and the remaining columns are called variables, e.g. deaths_by_cancer
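
To make the folder hierarchy concrete, here is a minimal sketch of splitting a catalog path into its five components. The TablePath class and parse_table_path function are hypothetical names used only for illustration (they are not part of owid-catalog-py), and the example path is made up rather than a real catalog entry.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class TablePath:
    """Hypothetical container for the parts of a catalog path."""
    channel: str    # level of data-cleanliness, e.g. "meadow" or "garden"
    namespace: str  # data provider
    dataset: str    # collection of data
    version: str    # year or date
    table: str      # table name without the file extension
    format: str     # file extension, e.g. "feather" or "csv"


def parse_table_path(path: str) -> TablePath:
    """Split '<channel>/<namespace>/<dataset>/<version>/<table>' into parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path!r}")
    channel, namespace, dataset, version, filename = parts
    file = PurePosixPath(filename)
    return TablePath(
        channel=channel,
        namespace=namespace,
        dataset=dataset,
        version=version,
        table=file.stem,
        format=file.suffix.lstrip("."),
    )


# Illustrative path only; not a real entry in the catalog.
example = parse_table_path("garden/who/example_dataset/2017/example_table.feather")
print(example)
```

Keeping the path components in a small immutable record like this would let a client filter by channel, namespace or version without re-parsing strings.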

The owid-catalog-py project then allows a member of the public to find this data from Python and load it as a data frame. For example, you can install the package like any other Python package:

pip install owid-catalog

You can then use catalog.find() or catalog.find_one() to search the available data tables for a topic that you're interested in, and easily download that data.

>>> from owid import catalog
>>> df = catalog.find_one('covid', namespace='owid')
>>> df.head()
                    continent  ... excess_mortality_cumulative_per_million
iso_code date                  ...
AFG      2020-02-24      Asia  ...                                     NaN
         2020-02-25      Asia  ...                                     NaN
         2020-02-26      Asia  ...                                     NaN
         2020-02-27      Asia  ...                                     NaN
         2020-02-28      Asia  ...                                     NaN

[5 rows x 65 columns]

It is this Pythonic API that we would like to be able to replicate in Javascript. The Python version works by first fetching a catalog index over HTTP in feather format, which lets the user search the available data. When loading a data table, the feather file for that specific table is downloaded.
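
The two-step flow described above (search an index first, then fetch one table's file) can be sketched roughly as follows. This is not the actual owid-catalog-py implementation: the index here is a small in-memory list standing in for the downloaded feather index, its field names and entries are assumptions, BASE_URL is a placeholder, and find_one returns an index row instead of a loaded data frame.

```python
from typing import Dict, List, Optional

# Placeholder base URL; the real catalog location is not stated in this post.
BASE_URL = "https://catalog.example.org"

# Stand-in for the catalog index. Field names and entries are assumptions.
INDEX: List[Dict[str, str]] = [
    {"table": "covid", "namespace": "owid", "path": "garden/owid/covid/latest/covid"},
    {"table": "population", "namespace": "owid", "path": "garden/owid/population/2021/population"},
    {"table": "population", "namespace": "un", "path": "meadow/un/population/2019/population"},
]


def find(table: str, namespace: Optional[str] = None,
         index: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, str]]:
    """Return index rows whose table name contains `table`, optionally by namespace."""
    rows = INDEX if index is None else index
    return [
        row for row in rows
        if table in row["table"]
        and (namespace is None or row["namespace"] == namespace)
    ]


def find_one(table: str, namespace: Optional[str] = None,
             index: Optional[List[Dict[str, str]]] = None) -> Dict[str, str]:
    """Like find(), but insist on exactly one match."""
    matches = find(table, namespace=namespace, index=index)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one match, found {len(matches)}")
    return matches[0]


def table_url(row: Dict[str, str], fmt: str = "feather") -> str:
    """Build the URL a client would download for a matched table."""
    return f"{BASE_URL}/{row['path']}.{fmt}"


row = find_one("covid", namespace="owid")
print(table_url(row))
```

A Javascript client could follow the same shape: one request for the index, local filtering for search, and one request per table that is actually loaded.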

Note: Our catalog currently only holds a small subset of all the datasets that we have in our MySQL database, but we are working on backporting existing datasets into the catalog at the moment.

Required skills

Javascript/Typescript
Python
Pandas

Expected outcomes

Create an open source JS/TS package for consuming the OWID data API
The package should contain an API for
- Discovering available data
- Loading available data
The API should be harmonised with the existing Python API (the Python API may also need updating based on what you implement)
The project should include getting-started documentation that helps others use your work
Provide example Observable notebooks which consume data from OWID using the API and plot it

Potential mentors

acrulopez · 2022-03-18T16:08:59Z

Hello @larsyencken! I ran into this project while looking in the ideas list of OWID and found it very interesting. I have a couple of questions:

Would it be necessary to adapt the functions from the Pythonic API to the new Javascript API? All of them, a few? How would this be decided?
Will new functionality need to be implemented? If so, are there already some ideas about this? I guess the student working on this should also provide ideas, is that right?

I also wanted to ask whether OWID allows applicants to start discussing and improving the proposal with mentors, or whether we should wait until the submission period and simply send it. If so, how does the process work?

Thanks in advance!

4 replies

danyx23 Mar 21, 2022
Maintainer

Hi @acrulopez!

To your first question: the Python API is at this point more of an inspiration - it is not used heavily yet, so while it is already useful it is not a production-hardened design yet. What we would definitely like to have is a very low effort API to pull in data from our catalog.

To your second question: this project can be either a one month or two month project. For the former I think nothing more than searching and loading our datasets would be a good scope. For the larger version it would be great to have some ideas from interested students about where this could go. I personally think that some promising avenues would be around fetching additional metadata, citation information and such. When doing data science work in interactive notebooks like ObservableHQ it is important to also surface data caveats and ideally a trail of where the data was retrieved from and how it was transformed, so that readers can trust the results. A very complete solution is probably out of scope for this work, but if you could outline something along this axis of design, I think that would be great.

Feel free to start preparing your own draft of what you see this project becoming. You can either post your iterations here or send them to me via email.

acrulopez Mar 21, 2022

Thanks for the information @danyx23 !

Indeed, I think it would be very convenient to know where the data comes from and be able to trace back the transformations. I have checked the catalog with the pythonic API and I could not find information regarding that. Does OWID store that information somewhere? In case it doesn't, do you think there is an easy way to achieve this? It looks complex due to the large number of different sources.

It may also be useful to talk with people that already use the pythonic API (or potential users) and ask them several questions to discover features that could ease their interaction with the API. It could be done in the early stage of the project concurrently while starting to develop the basic functionality. What do you think?

danyx23 Mar 25, 2022
Maintainer

Yes, I talked with @larsyencken and we'll try to add a bit more information to the main text of this proposal about where the data is stored and how the python api retrieves it (probably at the beginning of next week). The python API itself is still undergoing development and hasn't been widely publicized so there are only a few internal users so far that we are aware of. One trivial example where the catalog is used to fetch our preferred population dataset is in this notebook.

danyx23 Mar 29, 2022
Maintainer

@acrulopez I tried giving some additional background in the discussion post. We'll also improve the readme of the ETL repository in the next couple of days to make it a bit easier to understand how our catalog gets built. Let me know if this clears things up for you or if you have some remaining questions related to this!

Aaru143 · 2022-03-28T18:14:51Z

Hi @larsyencken

I saw this proposal in the list of ideas proposed by OWID for GSoC'22. I am very interested in working on this project; however, I have a few questions.

While I have a vague idea of the goal of the project and existing tools in place, I would still like to get more clarity on the current system. I am a little confused about how the catalogue is accessed currently. Is there an existing Python API in place and are we trying to develop something similar for Javascript?

Thanks a lot for your time and help!

1 reply

danyx23 Mar 29, 2022
Maintainer

Hi @Aaru143!

Nice to hear that you are interested in contributing to GSoC with us! Could I ask you to write an email to [email protected] if you haven't done so already? This will help us track all applicants and communicate occasional updates to everyone. Thanks!

Yes indeed, we have a Python library that allows access in the way that we outline here - just follow the links at the top in the post and check out the Python API and give it a spin to get a feel for it!

I also added a bit more technical background to the discussion above that should also be helpful.

If you are interested in working on this then the next step is for you to start drafting a proposal. We have posted a guide for this process here: #1318

Looking forward to your proposal!

larsyencken · 2022-03-29T09:53:45Z

Maintainer Author

Hi all! I just updated the technical description now in an attempt to better explain the overall project, please have a re-read and see if the existing API is a little more understandable.

2 replies

acrulopez Mar 29, 2022

Hello @larsyencken! It's more understandable now, thanks for the effort. Just one thing: there is a typo in the link to the walden repository.

larsyencken Mar 29, 2022
Maintainer Author

Thanks! Fixed the link now.

Abdelrahmanrezk · 2022-04-11T01:52:37Z

Hello @larsyencken @danyx23, since we need to build Javascript APIs that match the Pythonic ones, I cloned the project and saw that some of the imports did not work for me, as they rely on features from newer Python versions such as typing.Literal. I think that, besides the work on the Javascript APIs, these issues with Python also need to be handled.
Finally, I would like to work on this project and will do my best.

1 reply

danyx23 Apr 13, 2022
Maintainer

Hi @Abdelrahmanrezk! True, we require Python 3.8 - but we don't have much appetite for going below that version. 3.6 has reached end of life already and 3.7 reaches end of life a year from now. So just as a heads up, I think we would stay with the Python 3.8 requirement for the time being.
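
For anyone hitting the same import errors: typing.Literal was added to the standard library in Python 3.8, so an older interpreter fails at import time. A minimal sketch of checking the interpreter version up front is below; the helper name is hypothetical and not part of owid-catalog-py.

```python
import sys
from typing import Sequence, Tuple

# typing.Literal is available in the standard library from Python 3.8 onwards.
MIN_PYTHON: Tuple[int, int] = (3, 8)


def supports_required_python(version_info: Sequence[int] = sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


if not supports_required_python():
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, "
          f"found {sys.version_info[0]}.{sys.version_info[1]}")
```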

Abdelrahmanrezk · 2022-04-13T10:33:38Z

I got it now, thank you @danyx23. Is there still time to submit the proposal and get your feedback?

1 reply

danyx23 Apr 13, 2022
Maintainer

Hi @Abdelrahmanrezk! Send it to me at [email protected] (best before Apr 14 Noon CEST) and I can try to give you feedback - can't promise I'll still make it before the Apr 19 deadline though.

Abdelrahmanrezk · 2022-04-13T10:39:21Z

I will send it to you today, thank you very much @danyx23

0 replies

Abdelrahmanrezk · 2022-04-13T10:42:26Z

Hey @danyx23, there is no proposal form in the provided file, so I will do my best. If there is something I should follow for that, please let me know.

0 replies
