This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Commit

Update Readme (close #10)
adatzer committed Aug 25, 2022
1 parent 02e69fc commit 9bbeb58
Showing 1 changed file with 55 additions and 42 deletions.
97 changes: 55 additions & 42 deletions README.md
@@ -5,14 +5,14 @@

## Overview

Before you can send your own event and context types into Snowplow (using the track unstructured events or track self-describing events and custom contexts features of Snowplow), you need to:
Before you can send your own self-describing events and context entities into Snowplow, you need to:

1. Define a JSON schema for each of the events and context types
1. Define a JSON schema for each of the events and entities.
2. Upload those schemas to your Iglu schema registry
3. Define a corresponding jsonpath file, and make sure this is uploaded to your jsonpaths directory in Amazon S3
4. Create a corresponding Redshift table definition, and create this table in your Redshift cluster

Once you have completed the above, you can send in data that conforms to the schemas as custom unstructured events or custom contexts.
Once you have completed the above, you can send in data that conforms to the schemas as custom self-describing events or custom context entities.

## Prerequisites

@@ -23,7 +23,7 @@ We recommend setting up the following two tools before starting:

## 1. Creating the schemas

In order to start sending a new event or context type into Snowplow, you first need to define a new schema for that event.
In order to start sending a new event or context type into Snowplow, you first need to define a new schema for that event/entity.

1. Create a file in the repo for the new schema e.g. `/schemas/com.mycompany/new_event_or_context_name/jsonschema/1-0-0`
2. Create the schema in that file. Follow the `/schemas/com.example_company/example_event/jsonschema/1-0-0` example
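
A minimal sketch of the two steps above, assuming a hypothetical `com.mycompany` vendor and the illustrative `new_event_or_context_name` schema name. The `example_field` property is invented for illustration and is not part of this repo. The `$schema` value points at the Iglu self-describing meta-schema, and the `self` block is meant to mirror the file's path on disk:

```bash
# Sketch only: the vendor, schema name and example_field property are hypothetical.
SCHEMA_DIR="./schemas/com.mycompany/new_event_or_context_name/jsonschema"
mkdir -p "$SCHEMA_DIR"

# The heredoc delimiter is quoted so that "$schema" is written literally
# instead of being expanded by the shell.
cat > "$SCHEMA_DIR/1-0-0" <<'EOF'
{
  "$schema": "http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#",
  "description": "Hypothetical example schema",
  "self": {
    "vendor": "com.mycompany",
    "name": "new_event_or_context_name",
    "format": "jsonschema",
    "version": "1-0-0"
  },
  "type": "object",
  "properties": {
    "example_field": {
      "type": "string",
      "maxLength": 255
    }
  },
  "required": ["example_field"],
  "additionalProperties": false
}
EOF
```

Giving the string property an explicit `maxLength` should let a generated Redshift column length be set unambiguously.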
@@ -33,20 +33,24 @@ Note that if you have JSON data already and you want to create a corresponding s

Once you have your schema, make sure to validate it using [Igluctl][igluctl]:

```
$ /path/to/igluctl lint /path/to/schemas/com.mycompany/my_new_event_or_context
```bash
/path/to/igluctl lint /path/to/schemas/com.mycompany/my_new_event_or_context
```

For Windows:

```bash
java -jar /path/to/igluctl lint /path/to/schemas/com.mycompany/my_new_event_or_context
```
> java -jar /path/to/igluctl lint /path/to/schemas/com.mycompany/my_new_event_or_context
```

Igluctl will fail validating schemas that:
Igluctl will fail validating:

- JSON Schema whose self-describing information is inconsistent with its path on the filesystem
- JSON Schema that has an invalid `$schema` keyword. It should always be set to the Iglu-specific value, whereas users tend to set it to Draft v4 or even to a self-referencing Iglu URI
- JSON Schema that is invalid against its standard (empty `required`, string `maximum` and similar)
- JSON Schema that contains properties which contradict each other, like `{"type": "integer", "maxLength": 0}` or `{"maximum": 0, "minimum": 10}`. Such schemas are inherently useless because, for some validators, there is no JSON instance they can validate

1. Define a string field without a `maxLength` property. That ensures that when e.g. the corresponding Redshift table DDL is generated, the correct associated column length can be unambiguously set
2. Define a numeric field without `minimum` and `maximum` properties. That ensures that when e.g. the corresponding Redshift table DDL is generated, the right numeric field type is set.
You can find more information about the `igluctl lint` command in the [corresponding documentation][igluctl-lint].

## 2. Uploading the schemas to Iglu

@@ -69,83 +73,88 @@ IGLU_REGISTRY_MASTER_KEY=fd08697f-435c-4916-9c85-d0e50bbb8913

Or for Windows:

```
```bash
SET SNOWPLOW_MINI_IP=127.0.0.1
SET IGLU_REGISTRY_MASTER_KEY=fd08697f-435c-4916-9c85-d0e50bbb8913
```

Run the following command to publish all schemas to the Iglu server bundled with Snowplow Mini:

```bash
$ /path/to/igluctl static push ./schemas $SNOWPLOW_MINI_IP/iglu-server/ $IGLU_REGISTRY_MASTER_KEY --public
/path/to/igluctl static push ./schemas $SNOWPLOW_MINI_IP/iglu-server/ $IGLU_REGISTRY_MASTER_KEY --public
```

Note that you can specify individual schemas if you prefer e.g.

```bash
$ /path/to/igluctl static push ./schemas/com.mycompany/my_new_event_schema $SNOWPLOW_MINI_IP/iglu-server/ $IGLU_REGISTRY_MASTER_KEY --public
/path/to/igluctl static push ./schemas/com.mycompany/my_new_event_schema $SNOWPLOW_MINI_IP/iglu-server/ $IGLU_REGISTRY_MASTER_KEY --public
```

Also note that if you're editing existing schemas, the server applications will need to be rebooted to clear the schema cache. This can be done directly from the server using the Control Plane tab that can be found in the UI.

### 2.2 Upload the schemas to Iglu for the full pipeline

Once you've created your schemas, you need to upload them to Iglu. In practice, this means copying them into S3.
Once you've created your schemas, you need to upload them to Iglu.

This can also be done via Igluctl. In the project root, first commit the schema to Git:

```
```bash
git add .
git commit -m "Committed finalized schema"
git push
```

Then push it to S3 bucket:
Then, to publish your schemas stored locally to a remote Iglu Server:

```bash
/path/to/igluctl static push /path/to/static/registry $HOST $APIKEY
```
$ /path/to/igluctl static s3cp ./schemas snowplow-com-mycompany-iglu-schemas-bucket --accessKeyId ABCDEF --secretAccessKey GHIJKILM/12345XYZ --region us-east-1

Alternatively, to push your schemas to an S3 bucket that serves as a remote Iglu registry:

```bash
/path/to/igluctl static s3cp ./schemas snowplow-com-mycompany-iglu-schemas-bucket --accessKeyId ABCDEF --secretAccessKey GHIJKILM/12345XYZ --region us-east-1
```

Note that you can also pass credentials via a configuration file or environment variables, as with any [AWS tool][aws-credentials].
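
As a rough sketch of the environment-variable route, assuming `igluctl static s3cp` falls back to the standard AWS credential lookup when the `--accessKeyId` and `--secretAccessKey` flags are omitted, as the note above suggests. The `IGLUCTL` and `BUCKET` variables and the dry-run fallback are illustrative helpers, and the credential values are the same placeholders used in the command above:

```bash
# Placeholders only: substitute real credentials, or rely on ~/.aws/credentials instead.
export AWS_ACCESS_KEY_ID="${AWS_ACCESS_KEY_ID:-ABCDEF}"
export AWS_SECRET_ACCESS_KEY="${AWS_SECRET_ACCESS_KEY:-GHIJKILM/12345XYZ}"

# IGLUCTL and BUCKET are hypothetical helper variables for this sketch.
IGLUCTL="${IGLUCTL:-/path/to/igluctl}"
BUCKET="snowplow-com-mycompany-iglu-schemas-bucket"

if [ -x "$IGLUCTL" ]; then
  # Credentials are assumed to be picked up from the environment rather than from flags.
  "$IGLUCTL" static s3cp ./schemas "$BUCKET" --region us-east-1
else
  # Dry run: print the command when igluctl is not installed at that path.
  echo "Would run: $IGLUCTL static s3cp ./schemas $BUCKET --region us-east-1"
fi
```

Keeping the secret out of the command line avoids leaving it in shell history.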

Useful resources

* [Iglu schema repository 0.1.0 release blog post][schema-repo-blog]
* [Iglu central][iglu-central] - centralized registry for all the schemas hosted by the Snowplow team
* [Iglu][iglu] - umbrella repository for the Iglu ecosystem
Useful resources:

- [Iglu schema repository 0.1.0 release blog post][schema-repo-blog]
- [Iglu][iglu]: umbrella repository for the Iglu ecosystem
- [Igluctl `static push`][igluctl-push]: documentation of the `igluctl static push` command
- [Igluctl `static s3cp`][igluctl-s3cp]: documentation of the `igluctl static s3cp` command

## 3. Creating the JSON Path files and SQL table definitions

Once you've defined the jsonschema for your new event or context type, you need to create a corresponding jsonpath file and SQL table definition. This can be done programmatically using Igluctl. From the root of the repo:

```
```bash
/path/to/igluctl static generate --with-json-paths /path/to/schemas/com.mycompany/new_event_or_context_name
```

A corresponding jsonpath file and sql table definition file will be generated in the appropriate folder in the repo.

Note that you can create SQL table definition and jsonpath files for all the events / contexts schema'd as follows:

```
```bash
/path/to/igluctl static generate --with-json-paths /path/to/schemas/com.mycompany
```


## 4. Uploading the jsonpath files to Iglu

Once you've finalized the new jsonpath file, commit it to Git. From the project root:

```
```bash
git add .
git commit -m "Committed finalized jsonpath"
git push
```

Then push to Iglu:

```
$ /path/to/igluctl static s3cp ./jsonpaths snowplow-com-mycompany-iglu-jsonpaths-bucket --accessKeyId ABCDEF --secretAccessKey GHIJKILM/12345XYZ --region us-east-1
```bash
/path/to/igluctl static s3cp ./jsonpaths snowplow-com-mycompany-iglu-jsonpaths-bucket --accessKeyId ABCDEF --secretAccessKey GHIJKILM/12345XYZ --region us-east-1
```

## 5. Creating or updating the table definition in Redshift
@@ -208,24 +217,24 @@ If you want to change your schema over time, you will need to:

## Additional resources

Documentation on jsonschemas:
### Documentation on jsonschemas

* Other example jsonschemas can be found in [Iglu Central][iglu-central]. Note how schemas are namespaced in different folders
* [Schema Guru][schema-guru-github] is a command line tool for programmatically generating schemas from existing JSON data
* [Snowplow 0.9.5 release blog post][versioning-release-blog], which gives an overview of the way that Snowplow uses jsonschemas to process, validate and shred unstructured event and custom context JSONs
* It can be useful to test jsonschemas using online validators e.g. [this one][schema-validator]
* [json-schema.org][json-schema] contains links to the actual jsonschema specification, examples and guide for schema authors
* The original specification for self-describing JSONs, produced by the Snowplow team, can be found [here][self-desc-blog]
- Other example jsonschemas can be found in [Iglu Central][iglu-central]. Note how schemas are namespaced in different folders.
- [Schema Guru][schema-guru-github] is a command line tool for programmatically generating schemas from existing JSON data.
- [Snowplow 0.9.5 release blog post][versioning-release-blog], which gives an overview of the way that Snowplow uses jsonschemas to process, validate and shred unstructured event and custom context JSONs.
- It can be useful to test jsonschemas using online validators e.g. [this one][schema-validator].
- [json-schema.org][json-schema] contains links to the actual jsonschema specification, examples and guide for schema authors.
- The original specification for self-describing JSONs, produced by the Snowplow team, can be found [here][self-desc-blog].

Documentation on jsonpaths:
### Documentation on jsonpaths

* Example jsonpath files can be found in [Iglu central][iglu-central-jsonpaths]. Note that the corresponding jsonschema definitions are also stored in [Iglu central][iglu-central-schemas].
* Amazon documentation on jsonpath files can be found [here][aws-copy-json]
- Example jsonpath files can be found in [Iglu central][iglu-central-jsonpaths]. Note that the corresponding jsonschema definitions are also stored in [Iglu central][iglu-central-schemas].
- Amazon documentation on jsonpath files can be found [here][aws-copy-json].

Documentation on creating tables in Redshift:
### Documentation on creating tables in Redshift

* Example Redshift table definitions can be found on the [Snowplow repo][snowplow-redshift-sql].
* Amazon documentation on Redshift create table statements can be found [here][redshift-create-table]. A list of Redshift data types can be found [here][redshift-data-types].
- Example Redshift table definitions can be found on the [Snowplow repo][snowplow-redshift-sql].
- Amazon documentation on Redshift create table statements can be found [here][redshift-create-table]. A list of Redshift data types can be found [here][redshift-data-types].

## Copyright and license

@@ -247,6 +256,10 @@ limitations under the License.
[discourse]: https://discourse.snowplowanalytics.com/

[igluctl]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/igluctl-2/
[igluctl-lint]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/igluctl-2/#lint-1
[igluctl-push]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/igluctl-2/#static-push
[igluctl-s3cp]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/igluctl-2/#static-s3cp

[iglu-central]: https://docs.snowplowanalytics.com/docs/pipeline-components-and-applications/iglu/iglu-repositories/iglu-central/
[iglu-central-jsonpaths]: https://github.com/snowplow/iglu-central/tree/master/jsonpaths
[iglu-central-schemas]: https://github.com/snowplow/iglu-central/tree/master/schemas