GSoC: Javascript API for data catalog #1290
-
Hello @larsyencken! I ran into this project while looking in the ideas list of OWID and found it very interesting. I have a couple of questions:
I also wanted to ask whether OWID allows us to start discussing and improving the proposal with mentors, or whether we should wait until the submission period and simply send it. If so, how does the process work? Thanks in advance!
-
Hi @larsyencken, I saw this proposal in the list of ideas proposed by OWID for GSoC'22. I am very interested in working on this project, but I have a few questions. While I have a vague idea of the goal of the project and the existing tools in place, I would still like more clarity on the current system. I am a little confused about how the catalogue is currently accessed. Is there an existing Python API in place, and are we trying to develop something similar for Javascript? Thanks a lot for your time and help!
-
Hi all! I have just updated the technical description to better explain the overall project. Please re-read it and see whether the existing API is a little more understandable.
-
Hello @larsyencken @danyx23, since we need to build Javascript APIs that match the Pythonic APIs, I cloned the project and found that some of the imports did not work for me, because they use features from newer versions of Python, such as typing.Literal. I think that, besides the work on the Javascript APIs, these Python issues also need to be handled.
-
I got it now, thank you @danyx23. Is there time to submit the proposal and get your feedback?
-
I will send it to you today, thank you very much @danyx23
-
Hey @danyx23, there is no proposal form in the provided file, so I will do my best. If there is a format I should follow, please let me know.
-
DIFFICULTY: MEDIUM, SIZE: 175-350 HOURS (variable)
Overview
Our World in Data has been building a new publicly accessible data catalogue, where we bring together data from a huge variety of public sources into one space for remixing and reuse. Currently, this catalogue has a Pythonic API and publishes data in Feather (Apache Arrow) + JSON format.
We want to encourage public reuse of this data, and we see that many of the most interesting data visualisations today are done in Javascript using Observable notebooks. We believe that a simple but effective Javascript API could support FAIR data principles (findable, accessible, interoperable, reusable), and make it much easier for people to add their piece to the conversation on major global issues.
We believe the arquero project is a promising target for quickly loading data for visualisation with this API. Our goal is to allow users to find the data they want in a single line of code, and likewise load the data they need in a single line. We also want to make it very simple for users to cite the data source correctly, based on metadata in the catalogue. This will help ensure that people can trace back where data came from, a level of transparency that builds trust and ensures data providers get credit for their work.
Technical background
Until now, all the data that we process and then show using Grapher on our website has been stored in a MySQL database in a somewhat limited format, where everything has to fit a fixed structure of (year, country, indicator_id, data_value). However, we have come to believe that this structure is not flexible enough, and makes the data too hard for the general public to reuse. For this reason, we are building a new public data catalog.
The new catalog consists of three main projects:
We snapshot the different sources of data into walden, and then transform them in the etl.
A core difference in the new data catalog is that the entire catalog is based on flat files that can live locally on your disk. A catalog is just a folder, with an internal hierarchy:
<channel>/<namespace>/<dataset>/<version>/<table>
channel: a top-level folder we use to indicate the level of data-cleanliness (e.g. "meadow" means the data was imported as-is, but "garden" means the data has been cleaned)
namespace: a folder indicating the data provider, for example who for "World Health Organisation"
version: a folder indicating the data's year or date, e.g. "2017", "2017-03-01"
dataset: a folder indicating the collection of data, for example gbd for "Global Burden of Disease"
table: a csv or feather file representing a single table of data (a.k.a. data frame). The table has an index indicating its dimensions, e.g. (year, country, gender), and the remaining columns are called variables, e.g. deaths_by_cancer
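The folder hierarchy above can be sketched in Python as follows. This is only a minimal illustration, not part of owid-catalog-py: the TablePath class and the parse_table_path function are hypothetical names, and the example path is assembled from the examples in the list rather than taken from the real catalog.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class TablePath:
    """Hypothetical container for the five levels of a catalog path."""
    channel: str
    namespace: str
    dataset: str
    version: str
    table: str


def parse_table_path(path: str) -> TablePath:
    """Split '<channel>/<namespace>/<dataset>/<version>/<table>' into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 5:
        raise ValueError(f"expected 5 path components, got {len(parts)}: {path!r}")
    channel, namespace, dataset, version, filename = parts
    # Drop the file extension (.csv or .feather) to get the table name.
    table = PurePosixPath(filename).stem
    return TablePath(channel, namespace, dataset, version, table)


# Illustrative path only; the table name "deaths" is made up.
example = parse_table_path("garden/who/gbd/2017/deaths.feather")
print(example)
```

Keeping each level as a named field means a search can filter on, say, namespace or version without any string handling at the call site.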
The owid-catalog-py project then allows a member of the public to find this data from Python and load it as a data frame. For example, you can install the package like any other Python package. You can then use catalog.find() or catalog.find_one() to search the available data tables for a topic that you're interested in, and easily download that data.
It is this Pythonic API that we would like to be able to replicate in Javascript. The Python version works by first fetching a catalog index over HTTP in feather format, which lets the user search the available data. When loading a data table, the feather file for that specific dataset is downloaded.
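The two-step flow described above (fetch an index first, download a table only when it is loaded) can be sketched as follows. This is an illustration under stated assumptions, not the owid-catalog-py implementation: the IndexEntry and CatalogIndex classes, their method signatures, and the fake_fetch callback are all hypothetical, and an in-memory dictionary with placeholder values stands in for the HTTP download of feather files.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Rows = List[Dict[str, object]]


@dataclass(frozen=True)
class IndexEntry:
    """One row of a hypothetical catalog index."""
    namespace: str
    dataset: str
    version: str
    table: str
    path: str  # location of the table file, relative to the catalog root


class CatalogIndex:
    """Searchable index; table data is only fetched when load() is called."""

    def __init__(self, entries: List[IndexEntry], fetch_table: Callable[[str], Rows]):
        self._entries = list(entries)
        self._fetch_table = fetch_table  # stand-in for an HTTP download

    def find(self, query: str, namespace: Optional[str] = None) -> List[IndexEntry]:
        """Return entries whose table or dataset name contains the query."""
        q = query.lower()
        return [
            e for e in self._entries
            if (q in e.table.lower() or q in e.dataset.lower())
            and (namespace is None or e.namespace == namespace)
        ]

    def find_one(self, query: str, namespace: Optional[str] = None) -> IndexEntry:
        """Like find(), but insist on exactly one match."""
        matches = self.find(query, namespace)
        if len(matches) != 1:
            raise ValueError(f"expected exactly 1 match for {query!r}, got {len(matches)}")
        return matches[0]

    def load(self, entry: IndexEntry) -> Rows:
        """Fetch the rows of a single table; nothing else is downloaded."""
        return self._fetch_table(entry.path)


# Placeholder "remote" storage with made-up values, not real catalog data.
FAKE_REMOTE: Dict[str, Rows] = {
    "garden/who/gbd/2017/deaths.feather": [
        {"year": 2017, "country": "A", "gender": "female", "deaths_by_cancer": 10},
        {"year": 2017, "country": "A", "gender": "male", "deaths_by_cancer": 12},
    ],
}
downloads: List[str] = []  # records which table files were fetched


def fake_fetch(path: str) -> Rows:
    downloads.append(path)
    return FAKE_REMOTE[path]


index = CatalogIndex(
    [IndexEntry("who", "gbd", "2017", "deaths", "garden/who/gbd/2017/deaths.feather")],
    fake_fetch,
)
entry = index.find_one("deaths")  # searching touches only the index
rows = index.load(entry)          # the table file is fetched here
print(len(rows), downloads)
```

Separating the index from the table files is what keeps searching cheap: a Javascript port could follow the same split, fetching one small index up front and one table file per load.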
Note: Our catalog currently only holds a small subset of all the datasets that we have in our MySQL database, but we are working on backporting existing datasets into the catalog at the moment.
Required skills
Expected outcomes
Potential mentors