Commit

more edits
kmoscoe committed Dec 16, 2024
1 parent cb6e67c commit 52eca6b
Showing 1 changed file with 36 additions and 30 deletions.
66 changes: 36 additions & 30 deletions custom_dc/custom_data.md
@@ -234,15 +234,15 @@ The following fields are specific to the variable-per-column format:

![group_screenshot](/assets/images/custom_dc/customdc_screenshot9.png){: width="800"}

The other fields are explained in the [Data config file specification reference](#json-ref)
The other fields are explained in the [Data config file specification reference](#json-ref).

## Prepare your data using explicit schema

In this section, we will walk you through a concrete example of how to go about setting up your CSV, MCF and JSON files.

### Write the MCF file {#mcf}

Nodes in the Data Commons knowledge graph are defined in Metadata Content Format. For custom Data Commons using explicit schema, you must define your statistical variables using MCF. Here's an example of defining the same statistical variables in the WHO data in MCF:
Nodes in the Data Commons knowledge graph are defined in Metadata Content Format (MCF). For custom Data Commons using explicit schema, you must define your statistical variables using MCF. Here's an example of defining the same statistical variables in the WHO data in MCF:

```
Node: dcid:Adult_curr_cig_smokers
@@ -269,20 +269,24 @@ The following fields are always required:
- `Node`: This is the DCID of the entity you are defining.
- `typeOf`: In the case of statistical variable, this is always `dcid:StatisticalVariable`.
- `name`: This is the descriptive name of the variable, that is displayed in the Statistical Variable Explorer and various other places in the UI.
- `populationType`: This is the type of thing being measured, and its value must be an existing `Class` type. In this example it is is `dcid:Person`. For a full list of supported classes, you will have to send an API request, as described in [Get a list of all existing statistical variables](/api/rest/v2/node.html#liststatvars).
- `populationType`: This is the type of thing being measured, and its value must be an existing `Class` type. It is mainly used to classify variables into categories that appear in the Statistical Variable Explorer. In this example it is `dcid:Person`. For a full list of supported classes, you will have to send an API request, as described in [Get a list of all existing statistical variables](/api/rest/v2/node.html#liststatvars).
- `dcid:measuredProperty`: This is a property of the thing being measured. It must be a `domainIncludes` property of the `populationType` you have specified. In this example, it is the `percent` of persons being measured. You can see the set of `domainIncludes` properties for a given `populationType`, using either of the following methods:
- Go to <code>https://datacommons.org/browser/<var>POPULATION_TYPE</var>, e.g. `https://datacommons.org/browser/Person` and scroll to the `domainIncludes` section of the page. For example:
![domain incudes](/assets/images/customdc_screenshot9.png)
- Use the [`node` API](/api/rest/v2/node.html#wildcard) filtering on `domainIncludes` incoming arcs: <code>https://api.datacommons.org/v2/node?key=AIzaSyCTI4Xz-UW_G2Q2RfknhcfdAnTHq5X5XuI&nodes=<var>POPULATION_TYPE</var>&property=%3C-domainIncludes</code>, e.g. `https://api.datacommons.org/v2/node?key=AIzaSyCTI4Xz-UW_G2Q2RfknhcfdAnTHq5X5XuI&nodes=Person&property=%3C-domainIncludes`.
- Go to <code>https://datacommons.org/browser/<var>POPULATION_TYPE</var></code>, e.g. <https://datacommons.org/browser/Person>{: target="_blank"} and scroll to the `domainIncludes` section of the page. For example:

![domain includes](/assets/images/custom_dc/customdc_screenshot9.png){: width="800"}

- Use the [Node API](/api/rest/v2/node.html#wildcard), filtering on `domainIncludes` incoming arcs: <code>https://api.datacommons.org/v2/node?key=AIzaSyCTI4Xz-UW_G2Q2RfknhcfdAnTHq5X5XuI&nodes=<var>POPULATION_TYPE</var>&property=%3C-domainIncludes</code>, e.g. <https://api.datacommons.org/v2/node?key=AIzaSyCTI4Xz-UW_G2Q2RfknhcfdAnTHq5X5XuI&nodes=Person&property=%3C-domainIncludes>{: target="_blank"}.
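
The Node API request described above can also be built programmatically. The following is a minimal sketch that only constructs the URL and does not send a request. The helper name `domain_includes_url` and the `YOUR_API_KEY` placeholder are assumptions made for illustration, not part of the Data Commons API.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL in the text above.
NODE_API = "https://api.datacommons.org/v2/node"


def domain_includes_url(population_type: str, api_key: str) -> str:
    """Build the Node API URL that lists `domainIncludes` incoming arcs.

    Hypothetical helper: it only assembles the query string shown in the docs.
    """
    params = {
        "key": api_key,
        "nodes": population_type,
        # "<-" asks for incoming arcs; urlencode escapes "<" as %3C.
        "property": "<-domainIncludes",
    }
    return f"{NODE_API}?{urlencode(params)}"


if __name__ == "__main__":
    # Replace the placeholder with your own API key before using the URL.
    print(domain_includes_url("Person", "YOUR_API_KEY"))
```

Fetching the resulting URL with any HTTP client should return the properties that can be used as `measuredProperty` for the given population type.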

Note that all non-quoted field values must be prefixed with `dcid:` or `dcs:`, which are interchangeable. You may wish to add an optional namespace, separated by a slash (/); for example, `who/Adult_curr_cig_smokers`.
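
To illustrate the prefixing rule, here is a small sketch that renders the required fields of a statistical variable as MCF text. The helper names (`mcf_value`, `render_stat_var`) and the display name in the usage example are hypothetical, and choosing `dcid:` as the default prefix is an assumption; `dcs:` would work equally well according to the note above.

```python
def mcf_value(value: str, quoted: bool = False) -> str:
    """Format one MCF field value.

    Quoted values (such as `name`) are wrapped in double quotes; every other
    value gets a `dcid:` prefix unless it already has `dcid:` or `dcs:`.
    """
    if quoted:
        return f'"{value}"'
    if value.startswith(("dcid:", "dcs:")):
        return value
    return f"dcid:{value}"


def render_stat_var(dcid: str, name: str, population_type: str,
                    measured_property: str) -> str:
    """Render the always-required fields of a statistical variable node."""
    lines = [
        f"Node: {mcf_value(dcid)}",
        f"typeOf: {mcf_value('StatisticalVariable')}",
        f"name: {mcf_value(name, quoted=True)}",
        f"populationType: {mcf_value(population_type)}",
        f"measuredProperty: {mcf_value(measured_property)}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # The display name below is a made-up example, not the one in the WHO data.
    print(render_stat_var("who/Adult_curr_cig_smokers",
                          "Adult current cigarette smokers",
                          "Person", "percent"))
```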

The following fields are optional:
- `statType`: By default this is `dcid:measuredValue`.
- `measurementQualifier`: additional qualifiers of the variable; e.g., Nominal for GDP.
- `measurementDenominator: for percentages or ratios, this refers to another StatisticalVariable node. E.g. for per-capita, the measurementDenominator is Count_Person.
- `statType`: By default this is `dcid:measuredValue`, which is simply the raw value of an observation. If your variable is a calculated value, such as an average, minimum, or maximum, you can use one of `minValue`, `maxValue`, `meanValue`, `medianValue`, `sumvalue`, `varianceValue`, `marginOfError`, or `stdErr`. In this case, your data set should only include the observations that correspond to those calculated values.
- `measurementQualifier`: This is similar to the `observationPeriod` field for CSV observations (see below) but applies to all observations of the variable. It can be any string representing additional properties of the variable, e.g. `Weekly`, `Monthly`, `Annual`. For instance, if the `measuredProperty` is "income", `Annual` or `Monthly` distinguishes income over different periods. If the time interval affects the meaning of the variable and values change significantly with the time period, you can use this field to keep them separate.
- `measurementDenominator`: For percentages or ratios, this refers to another statistical variable. For example, for per-capita, the `measurementDenominator` is `Count_Person`.

Additionally, there can be a number of property-value pairs representing the constraints on the type identified by `populationType`. In these examples, there is one constraint property, `gender`, which is a property of `Person`. The constraint property values are typically enumerations; such as `genderType`, which is a `rangeIncludes` property of `gender`.
Additionally, you can specify any number of property-value pairs representing the constraints on the type identified by `populationType`. In our example, there is one constraint property, `gender`, which is a property of `Person`. The constraint property values are typically enumerations, such as `genderType`, which is a `rangeIncludes` property of `gender`. These become additional sub-categories of the population type and are displayed as such in the Statistical Variable Explorer. Using our example:

![Stat Var Explorer](/assets/images/custom_dc/customdc_screenshot10.png){: width="600"}

### Prepare the CSV data files

@@ -297,10 +301,11 @@ These columns are required:
- The `entity` is the DCID of an existing entity in the Data Commons knowledge graph, typically a place.
- The `variable` is the DCID of the node you have defined in the MCF. The variable values must be numeric. Zeros and null values are accepted: zeros will be recorded and null values ignored.
- The `date` is the date of the observation and should be in the format _YYYY_, _YYYY_-_MM_, or _YYYY_-_MM_-_DD_.
- The `value` is the value of the observation and must be numeric.

> **Note:** The type of the entities in a single file should be unique; do not mix multiple entity types in the same CSV file. For example, if you have observations for cities and counties, put all the city data in one CSV file and all the county data in another one.
The remaining columns are optional, and allow you to specify additional per-observation properties. See xxx for a description of these.
The remaining columns are optional, and allow you to specify additional per-observation properties; see the descriptions of these in the [JSON config file reference](#observation-properties).
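
The rules for the required columns can be checked before you load a file. The sketch below is one possible pre-flight check and is not part of the Data Commons importer. It assumes the default column names `entity`, `variable`, `date` and `value`, and the function name `check_row` is hypothetical. Empty values are allowed because null values are ignored rather than rejected.

```python
import re

# Accepts YYYY, YYYY-MM or YYYY-MM-DD (format only; calendar validity is not checked).
DATE_RE = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")


def check_row(row: dict) -> list:
    """Return a list of problems found in one CSV row (empty list if none)."""
    problems = []
    for column in ("entity", "variable", "date"):
        if not row.get(column):
            problems.append(f"missing {column}")
    date = row.get("date") or ""
    if date and not DATE_RE.fullmatch(date):
        problems.append(f"bad date format: {date}")
    value = row.get("value") or ""
    if value != "":  # Null (empty) values are ignored, so they are not errors.
        try:
            float(value)
        except ValueError:
            problems.append(f"non-numeric value: {value}")
    return problems


if __name__ == "__main__":
    row = {"entity": "dcid:country/ARE",
           "variable": "dcs:who/Adult_curr_cig_smokers",
           "date": "2018", "value": "6.3"}
    print(check_row(row))
```

Rows read with `csv.DictReader` from the standard library can be passed to `check_row` directly.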

Here is an example of some real-world data from the WHO on the prevalence of smoking in adult populations, broken down by sex, in the correct CSV format:

@@ -321,9 +326,9 @@ dcs:who/Adult_curr_cig_smokers,dcid:country/ARE,2018,6.3
### Write the JSON config file

You must define a `config.json` in the top-level directory where your CSV files are located. With the explicit schema method, you need to provide these specifications:
- the input files location and entity type
- the sources and provenances of the data
- column mappings, if you are using custom names for the column headings
- The input files location and entity type
- The sources and provenances of the data
- Column mappings, if you are using custom names for the column headings

Here is an example of how the config file would look for the WHO CSV file we defined earlier. More details are below.

@@ -448,35 +453,36 @@ If you are using subdirectories, specify the file names using paths relative to

#### Input file parameters

`entityType` (implicit schema only)
entityType (implicit schema only)

: Required: All entities in a given file must be of a specific type. This type should be specified as the value of the `entityType` field. The importer tries to resolve entities to DCIDs of that type. In most cases, the `entityType` will be a supported place type; see [Place types](../place_types.html) for a list.

`ignoreColumns`
ignoreColumns

: Optional: The list of column names to be ignored by the importer, if any.

`provenance`
provenance

: Required: The provenance (name) of this input file. Provenances typically map to a dataset from a source. For example, `WorldDevelopmentIndicators` provenance (or dataset) is from the `WorldBank` source.

You must specify the provenance details under `sources.provenances`; this field associates one of the provenances defined there to this file.

`observationProperties` (implicit schema only)
{: #observation-properties}
observationProperties (implicit schema only)

: Optional: Additional information about each contained in the CSV file. Currently, four properties are supported:
: Optional: Additional information about each observation contained in the CSV file. Currently, the following properties are supported:
- [`unit`](/glossary.html#unit): The unit of measurement used in the observations. This is a string representing a currency, area, weight, volume, etc. For example, `SquareFoot`, `USD`, `Barrel`, etc.
- [`measurementPeriod`](/glossary.html#observation-period): The period of time in which the observations were recorded. This must be in ISO duration format, namely `P[0-9][Y|M|D|h|m|s]`. For example, `P1Y` is 1 year, `P3M` is 3 months, `P3h` is 3 hours.
- [`measurementMethod`](/glossary.html#measurement-method): The method used to gather the observations. This can be an arbitrary string or an existing DCID of [`MeasurementMethodEnum`](https://datacommons.org/browser/MeasurementMethodEnum){: target="_blank"} type; for example, `EDA_Estimate` or `WorldBankEstimate`.
- [`scalingFactor`](/glossary.html#scaling-factor): An integer representing the denominator used in measurements involving ratios or percentages. For example, for percentages, the denominator would be `100`.

Note that you cannot mix different property values in a single CSV file. If you have observations using different properties, you must put them in separate CSV files.
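
As an illustration of the `measurementPeriod` format, the following sketch checks a string against the `P<number><unit>` shape described above. The function name is hypothetical, and accepting more than one digit (for example `P10Y`) is an assumption that goes slightly beyond the single-digit pattern in the text.

```python
import re

# "P", one or more digits, then one unit letter: Y, M, D, h, m or s.
# Assumption: multi-digit counts such as P10Y are allowed.
DURATION_RE = re.compile(r"P[0-9]+[YMDhms]")


def is_valid_measurement_period(period: str) -> bool:
    """Return True if `period` looks like P1Y, P3M, P3h, and so on."""
    return DURATION_RE.fullmatch(period) is not None


if __name__ == "__main__":
    for candidate in ("P1Y", "P3M", "P3h", "3M", "P1W"):
        print(candidate, is_valid_measurement_period(candidate))
```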

`format`
format

: Only needed to specify `variablePerRow` for explicit schemas. The assumed default is `variablePerColumn`.

`columnMappings` (explicit schema only)
columnMappings (explicit schema only)

: Optional: If the headings in the CSV file do not use the default names, specify the equivalent names for each column.
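
Because each input file's `provenance` must match a provenance defined under `sources.provenances`, it can be useful to check a config for mismatches before loading data. The sketch below is one way to do that. It assumes the top-level keys are `inputFiles` and `sources` and that each source holds a `provenances` object of name-to-URL pairs; the function name and all file, source, and provenance names in the example are hypothetical.

```python
def undefined_provenances(config: dict) -> dict:
    """Map each input file to its provenance if that provenance is not defined.

    Assumes config["inputFiles"][file]["provenance"] names a key found in
    some config["sources"][source]["provenances"] object.
    """
    defined = set()
    for source in config.get("sources", {}).values():
        defined.update(source.get("provenances", {}))

    missing = {}
    for file_name, params in config.get("inputFiles", {}).items():
        provenance = params.get("provenance")
        if provenance not in defined:
            missing[file_name] = provenance
    return missing


if __name__ == "__main__":
    # Hypothetical config: the second file refers to an undefined provenance.
    config = {
        "inputFiles": {
            "smoking.csv": {"format": "variablePerRow", "provenance": "ExampleDataset"},
            "other.csv": {"provenance": "Typo"},
        },
        "sources": {
            "Example Source": {
                "url": "https://example.org",
                "provenances": {"ExampleDataset": "https://example.org/data"},
            }
        },
    }
    print(undefined_provenances(config))
```

An empty result means every file refers to a defined provenance; a file with no `provenance` field at all is reported with the value `None`.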

@@ -486,16 +492,16 @@ The `variables` section is optional. You can use it to override names and associ

#### Variable parameters {#varparams}

`name`
name

: The display name of the variable, which will show up throughout the UI. If not specified, the column name is used as the display name.
The name should be concise and precise; that is, the shortest possible name that allows humans to uniquely identify a given variable. The name is used to generate NL embeddings.

`description`
description

: A long-form description of the variable.

`properties`
properties

: Additional Data Commons properties associated with this variable. This section is analogous to the fields specified in an [MCF Node definition](#mcf).

@@ -510,21 +516,21 @@ Each property is specified as a key:value pair. Here are some examples:
}
```

`group`
group

: By default, the Statistical Variable Explorer will display all custom variables as a group called "Custom Variables". You can use this option to create multi-level hierarchies, and assign different variables to groups. The value of the `group` option is used as the heading of the group. For example, in the sample data, the group name `OECD` is used to group together the two variables from the two CSV files:

![group_screenshot](/assets/images/custom_dc/customdc_screenshot5.png){: width="400"}

You can have a multi-level group hierarchy by using `/` as a separator between each group.

`searchDescriptions`
searchDescriptions

: An array of descriptions to be used for creating more NL embeddings for the variable. This is only needed if the variable `name` is not sufficient for generating embeddings.

### `groupStatVarsByProperty` (explicit schema only)
### groupStatVarsByProperty (explicit schema only)

: Optional: Causes the Statistical Variable Explorer to create a top-level category called "Custom Variables", and groups together variables with the same population types and measured properties. For example:
Optional: Causes the Statistical Variable Explorer to create a top-level category called "Custom Variables" and to group together variables with the same population types and measured properties. For example:

![group_screenshot](/assets/images/custom_dc/customdc_screenshot10.png){: width="400"}

@@ -536,10 +542,10 @@ The `sources` section encodes the sources and provenances associated with the in

#### Source parameters

`url`
url
: Required: The URL of the named source. For example, for named source `U.S. Social Security Administration`, it would be `https://www.ssa.gov`.

`provenances`
provenances
: Required: A set of _NAME_:_URL_ pairs. Here are some examples:

```json
Expand Down
