
Commit

[KED-826] Use KedroContext class in project template (#111)
921kiyo authored Jul 16, 2019
1 parent 5cbaa77 commit 571bb80
Showing 37 changed files with 630 additions and 451 deletions.
47 changes: 45 additions & 2 deletions RELEASE.md
@@ -1,7 +1,7 @@
# Release 0.15.0

## Major features and improvements
* Added a new CLI command `kedro jupyter convert` to facilitate converting Jupyter notebook cells into Kedro nodes.
* Added `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config).
* Added a new I/O module `ParquetS3DataSet` in `contrib` for usage with Pandas. (by [@mmchougule](https://github.com/mmchougule))

@@ -11,18 +11,61 @@
* Added information on installed plugins to `kedro info`

## Breaking changes to the API
* Simplified the Kedro template in `run.py` with the introduction of the `KedroContext` class.
* Merged `FilepathVersionMixIn` and `S3VersionMixIn` into one abstract class `AbstractVersionedDataSet`, which extends `AbstractDataSet`.

#### Migration guide from Kedro 0.14.* to Kedro 0.15.0
##### Migration for Kedro project template
This guide assumes that:
* The framework-specific code has not been altered significantly
* Your project-specific code is stored in the dedicated Python package under `src/`.

The breaking changes were introduced in the following project template files:
- `<project-name>/.ipython/profile_default/startup/00-kedro-init.py`
- `<project-name>/kedro_cli.py`
- `<project-name>/src/tests/test_run.py`
- `<project-name>/src/<package-name>/run.py`

The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using `kedro new`) and move code and files bit by bit as suggested in the detailed guide below:

1. Create a new project with the same name by running `kedro new`

2. Copy the following folders to the new project:
- `results/`
- `references/`
- `notebooks/`
- `logs/`
- `data/`
- `conf/`

3. If you customised your `src/<package>/run.py`, make sure you apply the same customisations to `src/<package>/run.py` in the new project (a sketch follows this list):
    - If you customised `get_config()`, you can override the `_create_config()` method in your `ProjectContext` derived class
    - If you customised `create_catalog()`, you can override the `_create_catalog()` method in your `ProjectContext` derived class
    - If you customised `run()`, you can override the `run()` method in your `ProjectContext` derived class
    - If you customised the default `env`, you can override it in your `ProjectContext` derived class or pass it at construction. By default, `env` is `local`.
    - If you customised the default `root_conf`, you can override the `CONF_ROOT` attribute in your `ProjectContext` derived class. By default, the `KedroContext` base class has `CONF_ROOT` set to `conf`.

4. The following syntax changes are introduced in IPython or Jupyter Notebook/Lab (see the snippet after this list):
- `proj_dir` -> `context.project_path`
- `proj_name` -> `context.project_name`
    - `conf` -> `context.config_loader`
- `io` -> `context.catalog` (e.g., `io.load()` -> `context.catalog.load()`)

5. If you customised your `kedro_cli.py`, you need to apply the same customisations to your `kedro_cli.py` in the new project.
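Below is a minimal sketch of a customised `ProjectContext`, assembled from the hooks named in step 3. The `kedro.context` import path, the method signatures and the class body are assumptions about the 0.15.0 API rather than the template's actual code:

```python
from kedro.context import KedroContext  # import path assumed for 0.15.0


class ProjectContext(KedroContext):
    # Equivalent of a customised `root_conf` in 0.14.* (base class default: "conf")
    CONF_ROOT = "conf"

    def _create_config(self):
        # Port any get_config() customisations here
        return super()._create_config()

    def _create_catalog(self):
        # Port any create_catalog() customisations here
        return super()._create_catalog()

    # A customised default `env` can be overridden here too, or passed at
    # construction, e.g. ProjectContext(project_path, env="dev"); the default is "local".
```

And for step 4, loading a dataset from a notebook changes along these lines (the dataset name is illustrative):

```python
df = io.load("my_dataset")               # 0.14.*
df = context.catalog.load("my_dataset")  # 0.15.0
```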

##### Migration for versioning custom dataset classes

If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes (a combined sketch follows the list):

1. Make sure your dataset inherits from `AbstractVersionedDataSet` only.
2. Call `super().__init__()` with the appropriate arguments in the dataset's `__init__`. If storing on local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an `exists_function` and a `glob_function` that emulate `exists` and `glob` in a different filesystem (see `CSVS3DataSet` as an example).
3. Remove setting of the `_filepath` and `_version` attributes in the dataset's `__init__`, as this is taken care of in the base abstract class.
4. Any calls to the `_get_load_path` and `_get_save_path` methods should take no arguments.
5. Ensure you convert the output of `_get_load_path` and `_get_save_path` appropriately, as these now return [`PurePath`s](https://docs.python.org/3/library/pathlib.html#pure-paths) instead of strings.
6. Make sure `_check_paths_consistency` is called with [`PurePath`s](https://docs.python.org/3/library/pathlib.html#pure-paths) as input arguments, instead of strings.
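As a combined illustration of steps 1–6, here is a sketch of a versioned CSV dataset on the local filesystem. The `Version` import, the exact `super().__init__()` signature and the `_check_paths_consistency` call site are assumptions; treat this as a template to adapt rather than canonical code:

```python
from pathlib import Path, PurePath

import pandas as pd

from kedro.io import AbstractVersionedDataSet, Version  # import path assumed


class MyCSVDataSet(AbstractVersionedDataSet):  # step 1: inherit from it only
    def __init__(self, filepath: str, version: Version = None):
        # Step 2: on the local filesystem, filepath and version are enough;
        # step 3: _filepath and _version are set by the base class, not here
        super().__init__(PurePath(filepath), version)

    def _load(self) -> pd.DataFrame:
        # Step 4: no arguments; step 5: convert the returned PurePath to str
        return pd.read_csv(str(self._get_load_path()))

    def _save(self, data: pd.DataFrame) -> None:
        save_path = Path(self._get_save_path())
        save_path.parent.mkdir(parents=True, exist_ok=True)
        data.to_csv(str(save_path), index=False)
        # Step 6: pass PurePaths, not strings (call site assumed)
        self._check_paths_consistency(self._get_load_path(), self._get_save_path())

    def _describe(self):
        return dict(filepath=self._filepath, version=self._version)
```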

These steps should bring your project to Kedro 0.15.0. Some further minor tweaks may be needed, as every project is unique, but you now have a solid base to work with. If you run into any problems, please consult the [Kedro documentation](https://kedro.readthedocs.io).

## Thanks for supporting contributions
[Dmitry Vukolov](https://github.com/dvukolov), [Jo Stichbury](https://github.com/stichbury), [Angus Williams](https://github.com/awqb), [Deepyaman Datta](https://github.com/deepyaman), [Mayur Chougule](https://github.com/mmchougule)

4 changes: 2 additions & 2 deletions docs/source/02_getting_started/04_hello_world.md
@@ -63,7 +63,7 @@ A `kedro` project consists of the following main components:
+--------------+----------------------------------------------------------------------+
| Component    | Description                                                          |
+==============+======================================================================+
| Data Catalog | A collection of datasets that can be used to form the data pipeline. |
|              | Each dataset provides :code:`load` and :code:`save` capabilities for |
|              | a specific data type, e.g. :code:`CSVS3DataSet` loads and saves data |
|              | to a CSV file in S3.                                                 |
@@ -139,7 +139,7 @@ kedro run

This command calls the `main()` function from `src/getting_started/run.py`, which in turn does the following:

1. Instantiates the `ProjectContext` class defined in `src/getting_started/run.py`:
* Reads relevant configuration
* Configures Python `logging`
* Instantiates the `DataCatalog` and feeds a dictionary containing `parameters` config
8 changes: 4 additions & 4 deletions docs/source/03_tutorial/01_workflow.md
@@ -24,7 +24,7 @@ When building a Kedro project, you will typically follow a standard development
* Create the data transformation steps as [pure](https://en.wikipedia.org/wiki/Pure_function) Python functions
* Construct the pipeline by adding your functions as nodes
- Specify inputs and outputs for each node
* Choose how to run the pipeline: sequentially or in parallel

### 4. Package the project

Expand Down Expand Up @@ -54,7 +54,7 @@ As you work on a project, you will periodically want to save your changes. If yo
# create a new feature branch called 'feature/project-template'
git checkout -b feature/project-template
# stage all the files you have changed
git add .
# commit changes to git with an instructive message
git commit -m 'Create project template'
# push changes to remote branch
@@ -65,9 +65,9 @@ It isn't necessary to branch, but if everyone in a team works on the same branch

```bash
# stage all files
git add .
# commit changes to git with an instructive message
git commit -m 'Create project template'
# push changes to remote master
git push origin master
```
53 changes: 12 additions & 41 deletions docs/source/03_tutorial/04_create_pipelines.md
@@ -2,9 +2,8 @@

This section covers how to create a pipeline from a set of `node`s, which are Python functions, as described in more detail in the [nodes and pipelines user guide](../04_user_guide/05_nodes_and_pipelines.md) documentation.

1. As you draft experimental code, you can use a [Jupyter Notebook](../04_user_guide/10_ipython.md#working-with-kedro-projects-from-jupyter) or [IPython session](../04_user_guide/10_ipython.md). If you include [`docstrings`](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) to explain what your functions do, you can take advantage of [auto-generated Sphinx documentation](http://www.sphinx-doc.org/en/master/) later on. Once you are happy with how you have written your `node` functions, copy and paste the code into the `src/kedro_tutorial/nodes/` folder as a `.py` file.
2. When a node is ready, add it to the pipeline in `src/kedro_tutorial/pipeline.py`, specifying its inputs and outputs (a sketch of the pipeline module follows this list).
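A minimal sketch of `src/kedro_tutorial/pipeline.py` with one node wired in; the `create_pipeline` function name, the imported node function and the dataset names are illustrative assumptions:

```python
from kedro.pipeline import Pipeline, node

# assumed module and function, created in step 1
from kedro_tutorial.nodes.data_engineering import preprocess_companies


def create_pipeline(**kwargs):
    # each node wraps a pure Python function with named inputs and outputs
    return Pipeline(
        [
            node(preprocess_companies, "companies", "preprocessed_companies"),
        ]
    )
```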


## Node basics
@@ -154,34 +153,6 @@ kedro run

`CSVLocalDataSet` is chosen for its simplicity, but you can choose any other available dataset implementation class to save the data, for example, to a database table, cloud storage (like [AWS S3](https://aws.amazon.com/s3/), [Azure Blob Storage](https://azure.microsoft.com/en-gb/services/storage/blobs/), etc.) and others. If you cannot find the dataset implementation you need, you can easily implement your own as [you already did earlier](./03_set_up_data.md#creating-custom-datasets) and share it with the world by contributing back to Kedro!


## Working with Kedro projects from Jupyter

Currently, the pipeline consists of just two nodes and performs some simple data pre-processing. Let's expand it by adding more nodes.

When you are developing new nodes for your pipeline, you can write them as regular Python functions, but you may want to use Jupyter Notebooks for experimenting with your code before committing to a specific implementation. To take advantage of Kedro's Jupyter session, you can run this in your terminal:

```bash
kedro jupyter notebook
```

This will open a Jupyter Notebook in your browser. Navigate to the `notebooks` folder and create a notebook there with a Kedro kernel. After opening the newly created notebook, you can check what the data looks like by pasting this into the first cell of the notebook and selecting **Run**:

```python
df = io.load('preprocessed_shuttles')
df.head()
```

You should be able to see the first 5 rows of the loaded dataset as follows:

![](img/jupyter_notebook_ch3.png)

> *Note:*
<br/>If you see an error message stating that `io` is undefined, you can see what went wrong by using `print(startup_error)`, where `startup_error` is available as a variable in Python.
<br/>If you cannot access `startup_error`, make sure you have run `kedro jupyter notebook` from the project's root directory. Running plain `jupyter notebook` (without the `kedro` command) will not load the Kedro session for you.
<br/>When you add new datasets to your `catalog.yml` file, you need to reload Kedro's session by running `%reload_kedro` in a cell.


## Creating a master table

We need to add a function that joins the three dataframes into a single master table; write it in a cell in the notebook as follows:
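The notebook cell itself is folded out of this diff view, but a sketch of such a join might look as follows; the dataframe names and join keys are assumptions:

```python
import pandas as pd


def create_master_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    # join keys are assumed; adjust them to your datasets' actual columns
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    master_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    return master_table.dropna()
```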
@@ -292,7 +263,7 @@ You can start by updating the dependencies in `src/requirements.txt` with the fo
scikit-learn==0.20.2
```

You can find out more about requirements files [here](https://pip.pypa.io/en/stable/user_guide/#requirements-files).

Then, from within the project directory, run:

@@ -319,7 +290,7 @@ def split_data(data: pd.DataFrame, parameters: Dict) -> List:
    Args:
        data: Source data.
        parameters: Parameters defined in parameters.yml.

    Returns:
        A list containing split data.
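A possible body consistent with the signature above, wired to the `test_size` and `random_state` parameters shown later in `parameters.yml`; the feature and target columns are assumptions:

```python
from typing import Dict, List

import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: Dict) -> List:
    X = data[["engines", "passenger_capacity", "crew"]].values  # assumed features
    y = data["price"].values  # assumed target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return [X_train, X_test, y_train, y_test]
```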
@@ -347,7 +318,7 @@ def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
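A minimal body matching this signature might be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    # fit an ordinary least-squares regression on the training split
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor
```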
@@ -378,7 +349,7 @@ test_size: 0.2
random_state: 3
```

These are the parameters fed into the `DataCatalog` when the pipeline is executed.

Next, register the dataset, which will save the trained model, by adding the following definition to `conf/base/catalog.yml`:

@@ -555,9 +526,9 @@ kedro run --tag=ds --tag=de
```python
node(
    train_model,
    ["X_train", "y_train"],
    "regressor",
    tags=["my-regressor-node"],
)
```
@@ -586,7 +557,7 @@ def log_running_time(func: Callable) -> Callable:
    Args:
        func: Function to be executed.

    Returns:
        Decorator for logging the running time.
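Based on the docstring above, an implementation of this decorator might look like the following sketch (logger name and message format are assumptions):

```python
import logging
import time
from functools import wraps
from typing import Callable


def log_running_time(func: Callable) -> Callable:
    @wraps(func)
    def with_time(*args, **kwargs):
        log = logging.getLogger(__name__)
        t_start = time.time()
        result = func(*args, **kwargs)  # run the wrapped node function
        log.info("Running %r took %.2f seconds", func.__name__, time.time() - t_start)
        return result

    return with_time
```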
@@ -688,17 +659,17 @@ Another built-in decorator is `kedro.pipeline.decorators.mem_profile`, which wil

## Kedro runners

Having specified the data catalog and the pipeline, you are now ready to run the pipeline. There are two different runners you can specify:

* `SequentialRunner` - runs your nodes sequentially; once a node has completed its task then the next one starts.
* `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores.

By default, Kedro uses a `SequentialRunner`, which is instantiated when you execute `kedro run` from the command line. Switching to use `ParallelRunner` is as simple as providing an additional flag when running the pipeline from the command line as follows:

```bash
kedro run --parallel
```

`ParallelRunner` executes the pipeline nodes in parallel, and is more efficient when there are independent branches in your pipeline.

> *Note:* `ParallelRunner` performs task parallelisation, which is different from data parallelisation as seen in PySpark.
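Runners can also be invoked programmatically, which is handy in tests. A self-contained sketch, assuming the `kedro.io`, `kedro.pipeline` and `kedro.runner` import paths of 0.15.0:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def scale(x):
    return x * 10


# toy catalog and pipeline: two independent nodes
catalog = DataCatalog({"a": MemoryDataSet(2), "b": MemoryDataSet(3)})
pipeline = Pipeline(
    [
        node(scale, "a", "a_scaled", name="scale_a"),
        node(scale, "b", "b_scaled", name="scale_b"),
    ]
)

# run() returns a dict of any node outputs not registered in the catalog
outputs = SequentialRunner().run(pipeline, catalog)
print(outputs)  # {'a_scaled': 20, 'b_scaled': 30}
```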
Binary file not shown.
12 changes: 6 additions & 6 deletions docs/source/04_user_guide/01_setting_up_vscode.md
@@ -1,12 +1,12 @@
# Setting up Visual Studio Code

> *Note:* This documentation is based on `Kedro 0.15.0`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
Start by opening a new project directory in VS Code and installing the Python plugin under **Tools and languages**:

![](images/vscode_startup.png)

Python is an interpreted language; to run Python code you must tell VS Code which interpreter to use. From within VS Code, select a Python 3 interpreter by opening the **Command Palette** (`Cmd + Shift + P` for macOS), start typing the **Python: Select Interpreter** command to search, then select the command.

At this stage, you should be able to see the `conda` environment that you have created. Select the environment:

@@ -111,9 +111,9 @@ set PYTHONPATH=C:/path/to/project/src;%PYTHONPATH%

You can find more information about setting up environment variables [here](https://code.visualstudio.com/docs/python/environments#_environment-variable-definitions-file).

Go to **Debug > Add Configurations**.

> *Note:* If you encounter the following error: `Cannot read property 'openConfigFile' of undefined`, you can manually create a `launch.json` file in the `.vscode` directory and paste the configuration from below.
Edit the `launch.json` that opens in the editor with:

@@ -135,7 +135,7 @@
}
```

To add a breakpoint in your `pipeline.py` script, for example, click on the left-hand side of the line of code:

![](images/vscode_set_breakpoint.png)

@@ -153,7 +153,7 @@ Execution should stop at the breakpoint:

### Advanced: Remote Interpreter / Debugging

It is possible to debug remotely using VS Code. The following example assumes SSH access is available on the remote computer (running a Unix-like OS) running the code that will be debugged.

First, install the `ptvsd` Python library on both the local and remote computers, using the following command (execute it on both machines in the appropriate `conda` environment):

6 changes: 3 additions & 3 deletions docs/source/04_user_guide/02_setting_up_pycharm.md
@@ -1,6 +1,6 @@
# Setting up PyCharm

> *Note:* This documentation is based on `Kedro 0.15.0`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
This section will present a quick guide on how to configure [PyCharm](https://www.jetbrains.com/pycharm/) as a development environment for working on Kedro projects.

@@ -66,7 +66,7 @@ Edit the new Run configuration as follows:

![](images/pycharm_edit_py_run_config.png)

Replace **Script path** with the path obtained above and **Working directory** with the path of your project directory, then click **OK**.

To execute the Run configuration, select it from the **Run / Debug Configurations** dropdown in the toolbar (if that toolbar is not visible, you can enable it by going to **View > Toolbar**). Click the green triangle:

@@ -94,7 +94,7 @@ Then click the bug button in the toolbar (![](images/pycharm_debugger_button.pn

## Advanced: Remote SSH interpreter

> *Note:* This section uses the features supported in PyCharm Professional Edition only.
Firstly, add an SSH interpreter. Go to **Preferences | Project Interpreter** as above and proceed to add a new interpreter. Select **SSH Interpreter** and fill in details of the remote computer:
