From 81a69a2c5ca3e75f7659b93b4c505c179edf0212 Mon Sep 17 00:00:00 2001
From: Jo Stichbury
Date: Thu, 24 Aug 2023 09:28:01 +0100
Subject: [PATCH 1/2] Minor changes to create a PR and test Vale styles

Signed-off-by: Jo Stichbury
---
 docs/source/data/advanced_data_catalog_usage.md | 2 +-
 docs/source/data/data_catalog.md | 2 +-
 docs/source/data/data_catalog_yaml_examples.md | 2 +-
 docs/source/data/how_to_create_a_custom_dataset.md | 2 +-
 docs/source/data/index.md | 2 +-
 docs/source/data/kedro_dataset_factories.md | 2 +-
 docs/source/data/partitioned_and_incremental_datasets.md | 2 +-
 7 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/source/data/advanced_data_catalog_usage.md b/docs/source/data/advanced_data_catalog_usage.md
index 1906500d35..78fb6e183a 100644
--- a/docs/source/data/advanced_data_catalog_usage.md
+++ b/docs/source/data/advanced_data_catalog_usage.md
@@ -6,7 +6,7 @@ You can define a Data Catalog in two ways. Most use cases can be through a YAML

 To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file like `catalog.py`.

-In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).
+In the following code, we use several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).

 ```python
 from kedro.io import DataCatalog
diff --git a/docs/source/data/data_catalog.md b/docs/source/data/data_catalog.md
index b4a6c4d7da..8c95ac8309 100644
--- a/docs/source/data/data_catalog.md
+++ b/docs/source/data/data_catalog.md
@@ -3,7 +3,7 @@

 In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. It is specified with a YAML catalog file that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

-This page introduces the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project.
+This page introduces the basic sections of `catalog.yml`, which is the file Kedro uses to register data sources for a project.

 ## The basics of `catalog.yml`
 A separate page of [Data Catalog YAML examples](./data_catalog_yaml_examples.md) gives further examples of how to work with `catalog.yml`, but here we revisit the [basic `catalog.yml` introduced by the spaceflights tutorial](../tutorial/set_up_data.md).
diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md
index f27981600d..57715648ff 100644
--- a/docs/source/data/data_catalog_yaml_examples.md
+++ b/docs/source/data/data_catalog_yaml_examples.md
@@ -8,7 +8,7 @@ This page contains a set of examples to help you structure your YAML configurati

 ## Load data from a local binary file using `utf-8` encoding

-The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively.
+The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation respectively.

 ```yaml
 test_dataset:
diff --git a/docs/source/data/how_to_create_a_custom_dataset.md b/docs/source/data/how_to_create_a_custom_dataset.md
index 46364031a0..8aedac6984 100644
--- a/docs/source/data/how_to_create_a_custom_dataset.md
+++ b/docs/source/data/how_to_create_a_custom_dataset.md
@@ -4,7 +4,7 @@

 ## AbstractDataset

-For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
+If you are a contributor and would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.

 ## Scenario

diff --git a/docs/source/data/index.md b/docs/source/data/index.md
index a6196bcc13..e95f48bf0b 100644
--- a/docs/source/data/index.md
+++ b/docs/source/data/index.md
@@ -3,7 +3,7 @@

 In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

-[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data.
+[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems so you don’t have to write any of the logic for reading/writing data.

 We first introduce the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project.

diff --git a/docs/source/data/kedro_dataset_factories.md b/docs/source/data/kedro_dataset_factories.md
index 2a65b4359e..bb6714e21d 100644
--- a/docs/source/data/kedro_dataset_factories.md
+++ b/docs/source/data/kedro_dataset_factories.md
@@ -1,7 +1,7 @@
 # Kedro dataset factories
 You can load multiple datasets with similar configuration using dataset factories, introduced in Kedro 0.18.12.

-The syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.
+The syntax allows you to generalise your configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.

 ## How to generalise datasets with similar names and types

diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md
index a57b56d2a4..84b803813f 100644
--- a/docs/source/data/partitioned_and_incremental_datasets.md
+++ b/docs/source/data/partitioned_and_incremental_datasets.md
@@ -2,7 +2,7 @@

 ## Partitioned datasets

-Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible.
+Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but using Spark is not always feasible.

 This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features:

From be1a60cf79b9ae952aa6d411e50b74d6d37b9d4b Mon Sep 17 00:00:00 2001
From: Jo Stichbury
Date: Sun, 27 Aug 2023 17:26:50 +0100
Subject: [PATCH 2/2] fix some vale warnings

Signed-off-by: Jo Stichbury
---
 .github/styles/Kedro/quotes.yml | 10 ----------
 docs/source/data/data_catalog_yaml_examples.md | 2 +-
 docs/source/data/index.md | 2 +-
 .../data/partitioned_and_incremental_datasets.md | 2 +-
 4 files changed, 3 insertions(+), 13 deletions(-)
 delete mode 100644 .github/styles/Kedro/quotes.yml

diff --git a/.github/styles/Kedro/quotes.yml b/.github/styles/Kedro/quotes.yml
deleted file mode 100644
index 7e4ed44be0..0000000000
--- a/.github/styles/Kedro/quotes.yml
+++ /dev/null
@@ -1,10 +0,0 @@
-extends: existence
-message: Use straight quotes instead of smart quotes.
-level: warning
-nonword: true
-action:
-tokens:
-  - “
-  - ”
-  - ‘
-  - ’
diff --git a/docs/source/data/data_catalog_yaml_examples.md b/docs/source/data/data_catalog_yaml_examples.md
index 57715648ff..4ee0a64a93 100644
--- a/docs/source/data/data_catalog_yaml_examples.md
+++ b/docs/source/data/data_catalog_yaml_examples.md
@@ -8,7 +8,7 @@ This page contains a set of examples to help you structure your YAML configurati

 ## Load data from a local binary file using `utf-8` encoding

-The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation respectively.
+The `open_args_load` and `open_args_save` parameters are passed to the filesystem `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation respectively.

 ```yaml
 test_dataset:
diff --git a/docs/source/data/index.md b/docs/source/data/index.md
index e95f48bf0b..6f95cf84f0 100644
--- a/docs/source/data/index.md
+++ b/docs/source/data/index.md
@@ -3,7 +3,7 @@

 In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

-[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems so you don’t have to write any of the logic for reading/writing data.
+[The `kedro-datasets` package offers built-in datasets](/kedro_datasets) for common file types and file systems.

 We first introduce the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project.

diff --git a/docs/source/data/partitioned_and_incremental_datasets.md b/docs/source/data/partitioned_and_incremental_datasets.md
index 84b803813f..f54d9b998b 100644
--- a/docs/source/data/partitioned_and_incremental_datasets.md
+++ b/docs/source/data/partitioned_and_incremental_datasets.md
@@ -2,7 +2,7 @@

 ## Partitioned datasets

-Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but using Spark is not always feasible.
+Distributed systems play an increasingly important role in ETL data pipelines. They increase the processing throughput, enabling us to work with much larger volumes of input data. A situation may arise where your Kedro node needs to read the data from a directory full of uniform files of the same type like JSON or CSV. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases but may not always be possible.

 This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features:
