
Commit

[KED-826] Use KedroContext class in project template (#111)
921kiyo authored Jul 16, 2019
1 parent 5cbaa77 commit 571bb80
Showing 37 changed files with 630 additions and 451 deletions.
47 changes: 45 additions & 2 deletions RELEASE.md
@@ -1,7 +1,7 @@
# Release 0.15.0

## Major features and improvements
* Added a new CLI command `kedro jupyter convert` to facilitate converting Jupyter notebook cells into Kedro nodes.
* Added `KedroContext` base class which holds the configuration and Kedro's main functionality (catalog, pipeline, config).
* Added a new I/O module `ParquetS3DataSet` in `contrib` for usage with Pandas. (by [@mmchougule](https://github.com/mmchougule))

@@ -11,18 +11,61 @@
* Added information on installed plugins to `kedro info`

## Breaking changes to the API
* Simplified the Kedro template in `run.py` with the introduction of the `KedroContext` class.
* Merged `FilepathVersionMixIn` and `S3VersionMixIn` into one abstract class `AbstractVersionedDataSet`, which extends `AbstractDataSet`.

#### Migration guide from Kedro 0.14.* to Kedro 0.15.0
##### Migration for Kedro project template
This guide assumes that:
* The framework-specific code has not been altered significantly
* Your project-specific code is stored in the dedicated Python package under `src/`.

The breaking changes were introduced in the following project template files:
- `<project-name>/.ipython/profile_default/startup/00-kedro-init.py`
- `<project-name>/kedro_cli.py`
- `<project-name>/src/tests/test_run.py`
- `<project-name>/src/<package-name>/run.py`

The easiest way to migrate your project from Kedro 0.14.* to Kedro 0.15.0 is to create a new project (by using `kedro new`) and move code and files bit by bit as suggested in the detailed guide below:

1. Create a new project with the same name by running `kedro new`

2. Copy the following folders to the new project:
- `results/`
- `references/`
- `notebooks/`
- `logs/`
- `data/`
- `conf/`

3. If you customised your `src/<package>/run.py`, make sure you apply the same customisations to `src/<package>/run.py` in the new project (a sketch follows this list):
    - If you customised `get_config()`, you can override the `_create_config()` method in your `ProjectContext` derived class
    - If you customised `create_catalog()`, you can override the `_create_catalog()` method in your `ProjectContext` derived class
    - If you customised `run()`, you can override the `run()` method in your `ProjectContext` derived class
    - If you customised the default `env`, you can override it in your `ProjectContext` derived class or pass it at construction. By default, `env` is `local`.
    - If you customised the default `root_conf`, you can override the `CONF_ROOT` attribute in your `ProjectContext` derived class. By default, the `KedroContext` base class has `CONF_ROOT` set to `conf`.

4. The following syntax changes are introduced in IPython or Jupyter Notebook/Lab (see the snippet after this list):
- `proj_dir` -> `context.project_path`
- `proj_name` -> `context.project_name`
    - `conf` -> `context.config_loader`
- `io` -> `context.catalog` (e.g., `io.load()` -> `context.catalog.load()`)

5. If you customised your `kedro_cli.py`, you need to apply the same customisations to your `kedro_cli.py` in the new project.
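Below is a minimal sketch of a customised `ProjectContext`, assembled from the hooks named in step 3. The `kedro.context` import path, the method signatures and the class body are assumptions about the 0.15.0 API rather than the template's actual code:

```python
from kedro.context import KedroContext  # import path assumed for 0.15.0


class ProjectContext(KedroContext):
    # Equivalent of a customised `root_conf` in 0.14.* (base class default: "conf")
    CONF_ROOT = "conf"

    def _create_config(self):
        # Port any get_config() customisations here
        return super()._create_config()

    def _create_catalog(self):
        # Port any create_catalog() customisations here
        return super()._create_catalog()

    # A customised default `env` can be overridden here too, or passed at
    # construction, e.g. ProjectContext(project_path, env="dev"); the default is "local".
```

And for step 4, loading a dataset from a notebook changes along these lines (the dataset name is illustrative):

```python
df = io.load("my_dataset")               # 0.14.*
df = context.catalog.load("my_dataset")  # 0.15.0
```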

##### Migration for versioning custom dataset classes

If you defined any custom dataset classes which support versioning in your project, you need to apply the following changes (a combined sketch follows the list):

1. Make sure your dataset inherits from `AbstractVersionedDataSet` only.
2. Call `super().__init__()` with the appropriate arguments in the dataset's `__init__`. If storing on local filesystem, providing the filepath and the version is enough. Otherwise, you should also pass in an `exists_function` and a `glob_function` that emulate `exists` and `glob` in a different filesystem (see `CSVS3DataSet` as an example).
3. Remove setting of the `_filepath` and `_version` attributes in the dataset's `__init__`, as this is taken care of in the base abstract class.
4. Any calls to the `_get_load_path` and `_get_save_path` methods should take no arguments.
5. Ensure you convert the output of `_get_load_path` and `_get_save_path` appropriately, as these now return [`PurePath`s](https://docs.python.org/3/library/pathlib.html#pure-paths) instead of strings.
6. Make sure `_check_paths_consistency` is called with [`PurePath`s](https://docs.python.org/3/library/pathlib.html#pure-paths) as input arguments, instead of strings.
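As a combined illustration of steps 1–6, here is a sketch of a versioned CSV dataset on the local filesystem. The `Version` import, the exact `super().__init__()` signature and the `_check_paths_consistency` call site are assumptions; treat this as a template to adapt rather than canonical code:

```python
from pathlib import Path, PurePath

import pandas as pd

from kedro.io import AbstractVersionedDataSet, Version  # import path assumed


class MyCSVDataSet(AbstractVersionedDataSet):  # step 1: inherit from it only
    def __init__(self, filepath: str, version: Version = None):
        # Step 2: on the local filesystem, filepath and version are enough;
        # step 3: _filepath and _version are set by the base class, not here
        super().__init__(PurePath(filepath), version)

    def _load(self) -> pd.DataFrame:
        # Step 4: no arguments; step 5: convert the returned PurePath to str
        return pd.read_csv(str(self._get_load_path()))

    def _save(self, data: pd.DataFrame) -> None:
        save_path = Path(self._get_save_path())
        save_path.parent.mkdir(parents=True, exist_ok=True)
        data.to_csv(str(save_path), index=False)
        # Step 6: pass PurePaths, not strings (call site assumed)
        self._check_paths_consistency(self._get_load_path(), self._get_save_path())

    def _describe(self):
        return dict(filepath=self._filepath, version=self._version)
```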

These steps should bring your project to Kedro 0.15.0. Some further minor tweaks may be needed, as every project is unique, but you now have a solid base to work with. If you run into any problems, please consult the [Kedro documentation](https://kedro.readthedocs.io).

## Thanks for supporting contributions
[Dmitry Vukolov](https://github.com/dvukolov), [Jo Stichbury](https://github.com/stichbury), [Angus Williams](https://github.com/awqb), [Deepyaman Datta](https://github.com/deepyaman), [Mayur Chougule](https://github.com/mmchougule)

4 changes: 2 additions & 2 deletions docs/source/02_getting_started/04_hello_world.md
@@ -63,7 +63,7 @@ A `kedro` project consists of the following main components:
+--------------+----------------------------------------------------------------------+
| Component    | Description                                                          |
+==============+======================================================================+
| Data Catalog | A collection of datasets that can be used to form the data pipeline. |
|              | Each dataset provides :code:`load` and :code:`save` capabilities for |
|              | a specific data type, e.g. :code:`CSVS3DataSet` loads and saves data |
|              | to a CSV file in S3.                                                 |
@@ -139,7 +139,7 @@ kedro run

This command calls the `main()` function from `src/getting_started/run.py`, which in turn does the following:

1. Instantiates the `ProjectContext` class defined in `src/getting_started/run.py`:
* Reads relevant configuration
* Configures Python `logging`
* Instantiates the `DataCatalog` and feeds a dictionary containing `parameters` config
8 changes: 4 additions & 4 deletions docs/source/03_tutorial/01_workflow.md
@@ -24,7 +24,7 @@ When building a Kedro project, you will typically follow a standard development
* Create the data transformation steps as [pure](https://en.wikipedia.org/wiki/Pure_function) Python functions
* Construct the pipeline by adding your functions as nodes
- Specify inputs and outputs for each node
* Choose how to run the pipeline: sequentially or in parallel

### 4. Package the project

Expand Down Expand Up @@ -54,7 +54,7 @@ As you work on a project, you will periodically want to save your changes. If yo
# create a new feature branch called 'feature/project-template'
git checkout -b feature/project-template
# stage all the files you have changed
git add .
# commit changes to git with an instructive message
git commit -m 'Create project template'
# push changes to remote branch
@@ -65,9 +65,9 @@ It isn't necessary to branch, but if everyone in a team works on the same branch

```bash
# stage all files
git add .
# commit changes to git with an instructive message
git commit -m 'Create project template'
# push changes to remote master
git push origin master
```
53 changes: 12 additions & 41 deletions docs/source/03_tutorial/04_create_pipelines.md
@@ -2,9 +2,8 @@

This section covers how to create a pipeline from a set of `node`s, which are Python functions, as described in more detail in the [nodes and pipelines user guide](../04_user_guide/05_nodes_and_pipelines.md) documentation.

1. As you draft experimental code, you can use a [Jupyter Notebook](../04_user_guide/10_ipython.md#working-with-kedro-projects-from-jupyter) or [IPython session](../04_user_guide/10_ipython.md). If you include [`docstrings`](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) to explain what your functions do, you can take advantage of [auto-generated Sphinx documentation](http://www.sphinx-doc.org/en/master/) later on. Once you are happy with how you have written your `node` functions, copy and paste the code into the `src/kedro_tutorial/nodes/` folder as a `.py` file.
2. When a node is ready, add it to the pipeline in `src/kedro_tutorial/pipeline.py`, specifying its inputs and outputs (a sketch of the pipeline module follows this list).
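A minimal sketch of `src/kedro_tutorial/pipeline.py` with one node wired in; the `create_pipeline` function name, the imported node function and the dataset names are illustrative assumptions:

```python
from kedro.pipeline import Pipeline, node

# assumed module and function, created in step 1
from kedro_tutorial.nodes.data_engineering import preprocess_companies


def create_pipeline(**kwargs):
    # each node wraps a pure Python function with named inputs and outputs
    return Pipeline(
        [
            node(preprocess_companies, "companies", "preprocessed_companies"),
        ]
    )
```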


## Node basics
@@ -154,34 +153,6 @@ kedro run

`CSVLocalDataSet` is chosen for its simplicity, but you can choose any other available dataset implementation class to save the data, for example, to a database table, cloud storage (like [AWS S3](https://aws.amazon.com/s3/), [Azure Blob Storage](https://azure.microsoft.com/en-gb/services/storage/blobs/), etc.) and others. If you cannot find the dataset implementation you need, you can easily implement your own as [you already did earlier](./03_set_up_data.md#creating-custom-datasets) and share it with the world by contributing back to Kedro!


## Working with Kedro projects from Jupyter

Currently, the pipeline consists of just two nodes and performs some simple data pre-processing. Let's expand it by adding more nodes.

When you are developing new nodes for your pipeline, you can write them as regular Python functions, but you may want to use Jupyter Notebooks for experimenting with your code before committing to a specific implementation. To take advantage of Kedro's Jupyter session, you can run this in your terminal:

```bash
kedro jupyter notebook
```

This will open a Jupyter Notebook in your browser. Navigate to the `notebooks` folder and create a notebook there with a Kedro kernel. After opening the newly created notebook, you can check what the data looks like by pasting this into the first cell of the notebook and selecting **Run**:

```python
df = io.load('preprocessed_shuttles')
df.head()
```

You should be able to see the first 5 rows of the loaded dataset as follows:

![](img/jupyter_notebook_ch3.png)

> *Note:*
<br/>If you see an error message stating that `io` is undefined, you can see what went wrong by using `print(startup_error)`, where `startup_error` is available as a variable in Python.
<br/>If you cannot access `startup_error`, make sure you have run `kedro jupyter notebook` from the project's root directory. Running plain `jupyter notebook` (without the `kedro` command) will not load the Kedro session for you.
<br/>When you add new datasets to your `catalog.yml` file, you need to reload Kedro's session by running `%reload_kedro` in a cell.


## Creating a master table

We need to add a function that joins the three dataframes into a single master table; write it in a cell in the notebook as follows:
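The notebook cell itself is folded out of this diff view, but a sketch of such a join might look as follows; the dataframe names and join keys are assumptions:

```python
import pandas as pd


def create_master_table(
    shuttles: pd.DataFrame, companies: pd.DataFrame, reviews: pd.DataFrame
) -> pd.DataFrame:
    # join keys are assumed; adjust them to your datasets' actual columns
    rated_shuttles = shuttles.merge(reviews, left_on="id", right_on="shuttle_id")
    master_table = rated_shuttles.merge(companies, left_on="company_id", right_on="id")
    return master_table.dropna()
```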
@@ -292,7 +263,7 @@ You can start by updating the dependencies in `src/requirements.txt` with the fo
scikit-learn==0.20.2
```

You can find out more about requirements files [here](https://pip.pypa.io/en/stable/user_guide/#requirements-files).

Then, from within the project directory, run:

@@ -319,7 +290,7 @@ def split_data(data: pd.DataFrame, parameters: Dict) -> List:
    Args:
        data: Source data.
        parameters: Parameters defined in parameters.yml.

    Returns:
        A list containing split data.
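A possible body consistent with the signature above, wired to the `test_size` and `random_state` parameters shown later in `parameters.yml`; the feature and target columns are assumptions:

```python
from typing import Dict, List

import pandas as pd
from sklearn.model_selection import train_test_split


def split_data(data: pd.DataFrame, parameters: Dict) -> List:
    X = data[["engines", "passenger_capacity", "crew"]].values  # assumed features
    y = data["price"].values  # assumed target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return [X_train, X_test, y_train, y_test]
```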
@@ -347,7 +318,7 @@ def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    Args:
        X_train: Training data of independent features.
        y_train: Training data for price.

    Returns:
        Trained model.
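A minimal body matching this signature might be:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def train_model(X_train: np.ndarray, y_train: np.ndarray) -> LinearRegression:
    # fit an ordinary least-squares regression on the training split
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)
    return regressor
```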
@@ -378,7 +349,7 @@ test_size: 0.2
random_state: 3
```

These are the parameters fed into the `DataCatalog` when the pipeline is executed.

Next, register the dataset, which will save the trained model, by adding the following definition to `conf/base/catalog.yml`:

@@ -555,9 +526,9 @@ kedro run --tag=ds --tag=de
```python
node(
    train_model,
    ["X_train", "y_train"],
    "regressor",
    tags=["my-regressor-node"],
)
```
@@ -586,7 +557,7 @@ def log_running_time(func: Callable) -> Callable:
    Args:
        func: Function to be executed.

    Returns:
        Decorator for logging the running time.
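Based on the docstring above, an implementation of this decorator might look like the following sketch (logger name and message format are assumptions):

```python
import logging
import time
from functools import wraps
from typing import Callable


def log_running_time(func: Callable) -> Callable:
    @wraps(func)
    def with_time(*args, **kwargs):
        log = logging.getLogger(__name__)
        t_start = time.time()
        result = func(*args, **kwargs)  # run the wrapped node function
        log.info("Running %r took %.2f seconds", func.__name__, time.time() - t_start)
        return result

    return with_time
```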
@@ -688,17 +659,17 @@ Another built-in decorator is `kedro.pipeline.decorators.mem_profile`, which wil

## Kedro runners

Having specified the data catalog and the pipeline, you are now ready to run the pipeline. There are two different runners you can specify:

* `SequentialRunner` - runs your nodes sequentially; once a node has completed its task then the next one starts.
* `ParallelRunner` - runs your nodes in parallel; independent nodes are able to run at the same time, allowing you to take advantage of multiple CPU cores.

By default, Kedro uses a `SequentialRunner`, which is instantiated when you execute `kedro run` from the command line. Switching to use `ParallelRunner` is as simple as providing an additional flag when running the pipeline from the command line as follows:

```bash
kedro run --parallel
```

`ParallelRunner` executes the pipeline nodes in parallel, and is more efficient when there are independent branches in your pipeline.

> *Note:* `ParallelRunner` performs task parallelisation, which is different from data parallelisation as seen in PySpark.
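Runners can also be invoked programmatically, which is handy in tests. A self-contained sketch, assuming the `kedro.io`, `kedro.pipeline` and `kedro.runner` import paths of 0.15.0:

```python
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner


def scale(x):
    return x * 10


# toy catalog and pipeline: two independent nodes
catalog = DataCatalog({"a": MemoryDataSet(2), "b": MemoryDataSet(3)})
pipeline = Pipeline(
    [
        node(scale, "a", "a_scaled", name="scale_a"),
        node(scale, "b", "b_scaled", name="scale_b"),
    ]
)

# run() returns a dict of any node outputs not registered in the catalog
outputs = SequentialRunner().run(pipeline, catalog)
print(outputs)  # {'a_scaled': 20, 'b_scaled': 30}
```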
Binary file not shown.
12 changes: 6 additions & 6 deletions docs/source/04_user_guide/01_setting_up_vscode.md
@@ -1,12 +1,12 @@
# Setting up Visual Studio Code

> *Note:* This documentation is based on `Kedro 0.15.0`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
Start by opening a new project directory in VS Code and installing the Python plugin under **Tools and languages**:

![](images/vscode_startup.png)

Python is an interpreted language; to run Python code you must tell VS Code which interpreter to use. From within VS Code, select a Python 3 interpreter by opening the **Command Palette** (`Cmd + Shift + P` for macOS), start typing the **Python: Select Interpreter** command to search, then select the command.

At this stage, you should be able to see the `conda` environment that you have created. Select the environment:

@@ -111,9 +111,9 @@ set PYTHONPATH=C:/path/to/project/src;%PYTHONPATH%

You can find more information about setting up environment variables [here](https://code.visualstudio.com/docs/python/environments#_environment-variable-definitions-file).

Go to **Debug > Add Configurations**.

> *Note:* If you encounter the following error: `Cannot read property 'openConfigFile' of undefined`, you can manually create a `launch.json` file in the `.vscode` directory and paste the configuration from below.
Edit the `launch.json` that opens in the editor with:

@@ -135,7 +135,7 @@
}
```

To add a breakpoint in your `pipeline.py` script, for example, click on the left-hand side of the line of code:

![](images/vscode_set_breakpoint.png)

@@ -153,7 +153,7 @@ Execution should stop at the breakpoint:

### Advanced: Remote Interpreter / Debugging

It is possible to debug remotely using VS Code. The following example assumes SSH access is available on the remote computer (running a Unix-like OS) running the code that will be debugged.

First, install the `ptvsd` Python library on both the local and remote computers, using the following command (execute it on both machines in the appropriate `conda` environment):

6 changes: 3 additions & 3 deletions docs/source/04_user_guide/02_setting_up_pycharm.md
@@ -1,6 +1,6 @@
# Setting up PyCharm

> *Note:* This documentation is based on `Kedro 0.15.0`, if you spot anything that is incorrect then please create an [issue](https://github.com/quantumblacklabs/kedro/issues) or pull request.
This section will present a quick guide on how to configure [PyCharm](https://www.jetbrains.com/pycharm/) as a development environment for working on Kedro projects.

@@ -66,7 +66,7 @@ Edit the new Run configuration as follows:

![](images/pycharm_edit_py_run_config.png)

Replace **Script path** with the path obtained above and **Working directory** with the path of your project directory, then click **OK**.

To execute the Run configuration, select it from the **Run / Debug Configurations** dropdown in the toolbar (if that toolbar is not visible, you can enable it by going to **View > Toolbar**). Click the green triangle:

@@ -94,7 +94,7 @@ Then click the bug button in the toolbar (![](images/pycharm_debugger_button.pn

## Advanced: Remote SSH interpreter

> *Note:* This section uses the features supported in PyCharm Professional Edition only.
Firstly, add an SSH interpreter. Go to **Preferences | Project Interpreter** as above and proceed to add a new interpreter. Select **SSH Interpreter** and fill in details of the remote computer:
