Add docs for pandas API. (datacommonsorg#55)
tjann authored Aug 26, 2020
1 parent 87314ca commit 6e21c83
Showing 16 changed files with 357 additions and 28 deletions.
32 changes: 21 additions & 11 deletions Gemfile.lock
@@ -16,9 +16,9 @@ GEM
colorator (1.1.0)
commonmarker (0.17.13)
ruby-enum (~> 0.5)
concurrent-ruby (1.1.6)
dnsruby (1.61.3)
addressable (~> 2.5)
concurrent-ruby (1.1.7)
dnsruby (1.61.4)
simpleidn (~> 0.1)
em-websocket (0.5.1)
eventmachine (>= 0.12.9)
http_parser.rb (~> 0.6.0)
@@ -31,9 +31,9 @@ GEM
ffi (1.13.1)
forwardable-extended (2.6.0)
gemoji (3.0.1)
github-pages (206)
github-pages (207)
github-pages-health-check (= 1.16.1)
jekyll (= 3.8.7)
jekyll (= 3.9.0)
jekyll-avatar (= 0.7.0)
jekyll-coffeescript (= 1.1.1)
jekyll-commonmark-ghpages (= 0.1.6)
@@ -67,7 +67,8 @@ GEM
jekyll-theme-time-machine (= 0.1.1)
jekyll-titles-from-headings (= 0.5.3)
jemoji (= 0.11.1)
kramdown (= 1.17.0)
kramdown (= 2.3.0)
kramdown-parser-gfm (= 1.1.0)
liquid (= 4.0.3)
mercenary (~> 0.3)
minima (= 2.5.1)
@@ -80,20 +81,20 @@ GEM
octokit (~> 4.0)
public_suffix (~> 3.0)
typhoeus (~> 1.3)
html-pipeline (2.13.0)
html-pipeline (2.14.0)
activesupport (>= 2)
nokogiri (>= 1.4)
http_parser.rb (0.6.0)
i18n (0.9.5)
concurrent-ruby (~> 1.0)
jekyll (3.8.7)
jekyll (3.9.0)
addressable (~> 2.4)
colorator (~> 1.0)
em-websocket (~> 0.5)
i18n (~> 0.7)
jekyll-sass-converter (~> 1.0)
jekyll-watch (~> 2.0)
kramdown (~> 1.14)
kramdown (>= 1.17, < 3)
liquid (~> 4.0)
mercenary (~> 0.3.3)
pathutil (~> 0.9)
@@ -191,7 +192,10 @@ GEM
gemoji (~> 3.0)
html-pipeline (~> 2.2)
jekyll (>= 3.0, < 5.0)
kramdown (1.17.0)
kramdown (2.3.0)
rexml
kramdown-parser-gfm (1.1.0)
kramdown (~> 2.0)
liquid (4.0.3)
listen (3.2.1)
rb-fsevent (~> 0.10, >= 0.10.3)
@@ -215,6 +219,7 @@ GEM
rb-fsevent (0.10.4)
rb-inotify (0.10.1)
ffi (~> 1.0)
rexml (3.2.4)
rouge (3.19.0)
ruby-enum (0.8.0)
i18n
@@ -228,13 +233,18 @@ GEM
sawyer (0.8.2)
addressable (>= 2.3.5)
faraday (> 0.8, < 2.0)
simpleidn (0.1.1)
unf (~> 0.1.4)
terminal-table (1.8.0)
unicode-display_width (~> 1.1, >= 1.1.1)
thread_safe (0.3.6)
typhoeus (1.4.0)
ethon (>= 0.9.0)
tzinfo (1.2.7)
thread_safe (~> 0.1)
unf (0.1.4)
unf_ext
unf_ext (0.0.7.7)
unicode-display_width (1.7.0)
zeitwerk (2.4.0)

@@ -246,4 +256,4 @@ DEPENDENCIES
jekyll-feed (~> 0.6)

BUNDLED WITH
2.0.2
2.1.4
84 changes: 84 additions & 0 deletions api/pandas/index.md
@@ -0,0 +1,84 @@
---
layout: default
title: Pandas
nav_order: 3
parent: API
has_children: true
---
# Data Commons Pandas API

The **Data Commons Pandas API** is a superset of the Data Commons Python API:
all functions from the Python API are also accessible from
the Pandas API, and supplemental functions help with directly creating
[pandas](https://pandas.pydata.org/)
objects using data from the Data Commons knowledge graph for common pandas
use cases. Please see the [Data Commons API Overview](/api) for more details
on the design and structure of the API.

Before proceeding, make sure you have followed the setup instructions below.

## Getting Started

To get started using the Pandas API:

* Install the API using `pip`.
* (Optional) Create an API key and enable the **Data Commons API**.
* Begin developing with the Pandas API.

### Installing the Pandas API

First, install the `datacommons_pandas` package through `pip`.

```bash
$ pip install datacommons_pandas
```

For more information about installing `pip` and setting up other parts of
your Python development environment, please refer to the
[Python Development Environment Setup Guide](https://cloud.google.com/python/setup.html)
for Google Cloud Platform.

### Creating an API Key (Optional)

If you would like to provide an API key, follow the steps in [the API setup
guide](/api/setup.html). Data Commons *does not charge* users, but uses the
API key to understand API usage.

With the API key created and the Data Commons API activated, we can now get started
using the Pandas API. There are two ways to provide your key
to the Pandas API package.

1. You can set the API key by calling `datacommons_pandas.set_api_key`.
Start by importing `datacommons_pandas`, then set the API key like so.

```python
import datacommons_pandas as dcpd

dcpd.set_api_key('YOUR-API-KEY')
```

This will create an environment variable in your Python runtime called
`DC_API_KEY` holding your key. Your key will then be used whenever
the package sends a request to the Data Commons graph.

1. You can export an environment variable in your shell like so.

```bash
export DC_API_KEY='YOUR-API-KEY'
```

After you've exported the variable, you can start using the Data Commons
package.

```python
import datacommons_pandas as dcpd
```

This route is particularly useful if you are building applications that
depend on this API, and are deploying them to hosting services.
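
The two options above can be combined in application code. The following is a minimal sketch, assuming a hypothetical helper named `resolve_api_key` (it is not part of `datacommons_pandas`) that prefers an explicitly supplied key and otherwise reads `DC_API_KEY` from the environment:

```python
import os


def resolve_api_key(explicit_key=None, environ=os.environ):
    """Return the API key to use, or None if no key is configured.

    Hypothetical helper, not part of datacommons_pandas. An explicitly
    passed key wins; otherwise the DC_API_KEY environment variable is used.
    """
    if explicit_key:
        return explicit_key
    return environ.get("DC_API_KEY") or None


# Example with a stand-in mapping instead of the real os.environ.
key = resolve_api_key(environ={"DC_API_KEY": "YOUR-API-KEY"})
print(key)
```

If a key is found, it could then be passed to `dcpd.set_api_key(key)` as in the first option.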

### Using the Pandas API

You are ready to go! From here you can view our [tutorials](/tutorials.html) on how to use the
API to perform certain tasks, or see a full list of functions, classes and
methods available for use in the sidebar.
88 changes: 88 additions & 0 deletions api/pandas/multivariate_dataframe.md
@@ -0,0 +1,88 @@
---
layout: default
title: Multivariate Table as pd.DataFrame
nav_order: 3
parent: Pandas
grand_parent: API
---

# Get Multivariate DataFrame

## `datacommons_pandas.build_multivariate_dataframe(places, stat_vars)`

Returns a `pandas.DataFrame` with [`places`](https://datacommons.org/browser/Place)
as index and [`stat_vars`](https://datacommons.org/browser/StatisticalVariable)
as columns, where each cell is the latest observed statistic for
its `Place` and `StatisticalVariable`.

See the [full list of `StatisticalVariable`s](/statistical_variables.html).

**Arguments**

* `places (Iterable of str)`: A list of dcids of the
[`Place`](https://datacommons.org/browser/Place)s to query for.

* `stat_vars (Iterable of str)`: A list of dcids of the
[`StatisticalVariable`](https://datacommons.org/browser/StatisticalVariable)s
to query for.

**Returns**

A `pandas.DataFrame` with [`places`](https://datacommons.org/browser/Place)
(str)
as index and [`stat_vars`](https://datacommons.org/browser/StatisticalVariable)
(str) as columns, where each cell is the latest observed statistic (float) for
its `Place` and `StatisticalVariable`.

**Raises**

* `ValueError` - If no statistical values are found for the given parameters.

Be sure to initialize the library. See the
[datacommons_pandas library setup guide](/api/pandas/) for more details.

You can find a list of `StatisticalVariable`s with human-readable names [here](/statistical_variables.html).

## Examples

We would like to get a DataFrame of

- [Count_Person](https://datacommons.org/browser/Count_Person)
- [Median_Age_Person](https://datacommons.org/browser/Median_Age_Person)
- [UnemploymentRate_Person](https://datacommons.org/browser/UnemploymentRate_Person)

for
[the United States](https://datacommons.org/browser/country/USA),
[California](https://datacommons.org/browser/geoId/06), and
[Santa Clara County](https://datacommons.org/browser/geoId/06085).

```python
>>> import datacommons_pandas as dcpd
>>> dcpd.build_multivariate_dataframe(
...     ["country/USA", "geoId/06", "geoId/06085"],
...     ["Count_Person", "Median_Age_Person", "UnemploymentRate_Person"])
             Count_Person  Median_Age_Person  UnemploymentRate_Person
place
country/USA     328239523               37.9                      NaN
geoId/06         39512223               36.3                     15.1
geoId/06085       1927852               37.0                     10.7
```
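
Because the result is an ordinary `pandas.DataFrame`, missing cells such as the `NaN` above can be handled with standard pandas operations. The sketch below rebuilds the table by hand from the values printed above (so it runs without calling the API) and keeps only the places that have a value for every variable:

```python
import pandas as pd

# Values copied from the example output above; a real call to
# build_multivariate_dataframe would return a frame of this shape.
df = pd.DataFrame(
    {
        "Count_Person": [328239523, 39512223, 1927852],
        "Median_Age_Person": [37.9, 36.3, 37.0],
        "UnemploymentRate_Person": [float("nan"), 15.1, 10.7],
    },
    index=pd.Index(["country/USA", "geoId/06", "geoId/06085"], name="place"),
)

# Drop rows (places) that are missing any statistic.
complete = df.dropna()
print(list(complete.index))
```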

In the next example, there is no data about
`RetailDrugDistribution_DrugDistribution_14Hydroxycodeinone` or
`RetailDrugDistribution_DrugDistribution_Amphetamine` for non-USA
places, so the API raises a `ValueError`:

```python
>>> import datacommons_pandas as dcpd
>>> dcpd.build_multivariate_dataframe(
...     ["country/MEX", "nuts/AT32"],
...     ["RetailDrugDistribution_DrugDistribution_14Hydroxycodeinone",
...      "RetailDrugDistribution_DrugDistribution_Amphetamine"]
... )
ValueError Traceback (most recent call last)
...
--> raise ValueError('No data for any of specified Places and StatisticalVariables.')

ValueError: No data for any of specified places and stat_vars.
```
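
Code that should keep running when no data is available can catch the exception. A minimal sketch, using a hypothetical wrapper `build_or_none` and a stand-in function `fake_builder` (in place of `dcpd.build_multivariate_dataframe`) so that it runs without network access:

```python
def build_or_none(builder, places, stat_vars):
    """Call `builder` and return its result, or None if it raises ValueError.

    `builder` is expected to behave like build_multivariate_dataframe:
    it takes places and stat_vars and raises ValueError when there is no data.
    """
    try:
        return builder(places, stat_vars)
    except ValueError as err:
        print(f"No data returned: {err}")
        return None


def fake_builder(places, stat_vars):
    # Hypothetical stand-in that always reports missing data.
    raise ValueError("no data for the requested places and stat_vars")


result = build_or_none(fake_builder, ["country/MEX"], ["Count_Person"])
```

In real code, `dcpd.build_multivariate_dataframe` would be passed as `builder`.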
74 changes: 74 additions & 0 deletions api/pandas/time_series.md
@@ -0,0 +1,74 @@
---
layout: default
title: Time Series as pd.Series
nav_order: 1
parent: Pandas
grand_parent: API
---

# Get Time Series for a Place

## `datacommons_pandas.build_time_series(place, stat_var, measurement_method=None, observation_period=None, unit=None, scaling_factor=None)`

Returns a `pandas.Series` representing a time series for the [`place`](https://datacommons.org/browser/Place) and
[`stat_var`](https://datacommons.org/browser/StatisticalVariable) satisfying any optional parameters.

See the [full list of `StatisticalVariable`s](/statistical_variables.html).

**Arguments**

* `place (str)`: The `dcid` of the [`Place`](https://datacommons.org/browser/Place) to query for.

* `stat_var (str)`: The `dcid` of the
[`StatisticalVariable`](https://datacommons.org/browser/StatisticalVariable).

* `measurement_method (str)`: (Optional) The `dcid` of the preferred [`measurementMethod`](https://datacommons.org/browser/measurementMethod) for the `stat_var`.

* `observation_period (str)`: (Optional) The preferred [`observationPeriod`](https://datacommons.org/browser/observationPeriod) for the `stat_var`. This is an [ISO 8601 duration](https://en.wikipedia.org/wiki/ISO_8601#Durations) such as "P1M" (one month).

* `unit (str)`: (Optional) The `dcid` of the preferred [`unit`](https://datacommons.org/browser/unit) for the `stat_var`.

* `scaling_factor (int)`: (Optional) The preferred [`scalingFactor`](https://datacommons.org/browser/scalingFactor) for the `stat_var`.

**Returns**

A `pandas.Series` with dates (str) as index for observed values (float) for the `stat_var` and `place`.

**Raises**

* `ValueError` - If no statistical value is found for the place with the given parameters.

Be sure to initialize the library. Check the [datacommons_pandas library setup guide](/api/pandas/) for more details.

You can find a list of `StatisticalVariable`s with human-readable names [here](/statistical_variables.html).

## Examples

We would like to get the [male population](https://datacommons.org/browser/Count_Person_Male) in [Arkansas](https://datacommons.org/browser/geoId/05):

```python
>>> import datacommons_pandas as dcpd
>>> dcpd.build_time_series("geoId/05", "Count_Person_Male")
2015 1451913
2016 1456694
2017 1461651
2018 1468412
2011 1421287
2012 1431252
2013 1439862
2014 1447235
dtype: int64
```
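
The dates in the output above are not in chronological order. Since the result is a `pandas.Series` indexed by date strings, one way to order it is `Series.sort_index()`. A minimal sketch, using a hand-built series with a few of the values printed above rather than a live API call:

```python
import pandas as pd

# A few values copied from the example output above, in the same order.
series = pd.Series({"2015": 1451913, "2018": 1468412, "2011": 1421287})

# The index holds date strings; same-format ISO dates sort chronologically
# when sorted as text.
ordered = series.sort_index()
print(list(ordered.index))
```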

In the next example, the parameter `observation_period='P3Y'` over-constrains the request, so the API
raises a `ValueError`:

```python
>>> import datacommons_pandas as dcpd
>>> dcpd.build_time_series('geoId/06085', 'Count_Person', observation_period='P3Y')
ValueError Traceback (most recent call last)
...
--> raise ValueError('No data in response.')

ValueError: No data in response.
```
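
One way to cope with an over-constrained request is to retry without the optional parameters. A minimal sketch, assuming a hypothetical wrapper `build_with_fallback` and a stand-in `fake_build_time_series` in place of `dcpd.build_time_series` (the returned value is a made-up placeholder, not real data):

```python
def build_with_fallback(build, place, stat_var, **constraints):
    """Try `build` with the optional constraints, then without them.

    `build` is expected to behave like build_time_series: it raises
    ValueError when nothing matches. A ValueError from the unconstrained
    retry is allowed to propagate to the caller.
    """
    try:
        return build(place, stat_var, **constraints)
    except ValueError:
        return build(place, stat_var)


def fake_build_time_series(place, stat_var, observation_period=None):
    # Hypothetical stand-in: pretends only unconstrained requests have data.
    if observation_period is not None:
        raise ValueError("No data in response.")
    return {"2018": 1.0}  # made-up placeholder value


result = build_with_fallback(
    fake_build_time_series, "geoId/06085", "Count_Person", observation_period="P3Y"
)
```

Whether dropping a constraint is acceptable depends on the analysis: the unconstrained series may use a different observation period than the one originally requested.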