Minor docs changes on data section to create a PR and test Vale styles #2966

Merged on Sep 1, 2023 (7 commits)
2 changes: 1 addition & 1 deletion docs/source/data/advanced_data_catalog_usage.md
@@ -6,7 +6,7 @@ You can define a Data Catalog in two ways. Most use cases can be through a YAML

To use the `DataCatalog` API, construct a `DataCatalog` object programmatically in a file like `catalog.py`.

-In the following, we are using several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).
+In the following code, we use several pre-built data loaders documented in the [API reference documentation](/kedro_datasets).

```python
from kedro.io import DataCatalog
2 changes: 1 addition & 1 deletion docs/source/data/data_catalog.md
@@ -3,7 +3,7 @@

In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. It is specified with a YAML catalog file that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

-This page introduces the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project.
+This page introduces the basic sections of `catalog.yml`, which is the file Kedro uses to register data sources for a project.

## The basics of `catalog.yml`
A separate page of [Data Catalog YAML examples](./data_catalog_yaml_examples.md) gives further examples of how to work with `catalog.yml`, but here we revisit the [basic `catalog.yml` introduced by the spaceflights tutorial](../tutorial/set_up_data.md).
2 changes: 1 addition & 1 deletion docs/source/data/data_catalog_yaml_examples.md
@@ -8,7 +8,7 @@ This page contains a set of examples to help you structure your YAML configurati

## Load data from a local binary file using `utf-8` encoding

-The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation, respectively.
+The `open_args_load` and `open_args_save` parameters are passed to the filesystem's `open` method to configure how a dataset file (on a specific filesystem) is opened during a load or save operation respectively.

```yaml
test_dataset:
2 changes: 1 addition & 1 deletion docs/source/data/how_to_create_a_custom_dataset.md
@@ -4,7 +4,7 @@

## AbstractDataset

-For contributors, if you would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
+If you are a contributor and would like to submit a new dataset, you must extend the [`AbstractDataset` interface](/kedro.io.AbstractDataset) or [`AbstractVersionedDataset` interface](/kedro.io.AbstractVersionedDataset) if you plan to support versioning. It requires subclasses to override the `_load` and `_save` and provides `load` and `save` methods that enrich the corresponding private methods with uniform error handling. It also requires subclasses to override `_describe`, which is used in logging the internal information about the instances of your custom `AbstractDataset` implementation.
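
For readers of this hunk, the contract described in the `AbstractDataset` paragraph can be sketched with the standard library alone. This is a stand-in written for illustration, not Kedro's implementation: `SketchAbstractDataset`, `SketchDatasetError`, and `InMemoryTextDataset` are hypothetical names, and the real base class may differ in its signatures and error types.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchDatasetError(Exception):
    """Hypothetical error type raised by the public load/save wrappers."""


class SketchAbstractDataset(ABC):
    """Illustrative stand-in for the interface described above (not Kedro code)."""

    @abstractmethod
    def _load(self) -> Any:
        """Subclasses return the loaded data."""

    @abstractmethod
    def _save(self, data: Any) -> None:
        """Subclasses persist the given data."""

    @abstractmethod
    def _describe(self) -> Dict[str, Any]:
        """Subclasses return the attributes used when logging the instance."""

    def load(self) -> Any:
        # Public wrapper: every subclass gets the same error handling.
        try:
            return self._load()
        except Exception as exc:
            raise SketchDatasetError(f"Failed while loading from {self}: {exc}") from exc

    def save(self, data: Any) -> None:
        try:
            self._save(data)
        except Exception as exc:
            raise SketchDatasetError(f"Failed while saving to {self}: {exc}") from exc

    def __str__(self) -> str:
        # _describe() feeds the human-readable representation.
        args = ", ".join(f"{key}={value!r}" for key, value in self._describe().items())
        return f"{type(self).__name__}({args})"


class InMemoryTextDataset(SketchAbstractDataset):
    """Hypothetical subclass that keeps a single string in memory."""

    def __init__(self, name: str) -> None:
        self._name = name
        self._data = None

    def _load(self) -> str:
        if self._data is None:
            raise ValueError("nothing has been saved yet")
        return self._data

    def _save(self, data: str) -> None:
        self._data = data

    def _describe(self) -> Dict[str, Any]:
        return {"name": self._name}
```

A subclass only implements the three private methods. Callers use `load` and `save`, so a failure in any subclass surfaces as the same wrapped error type with the dataset description in the message.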


## Scenario
2 changes: 1 addition & 1 deletion docs/source/data/index.md
@@ -3,7 +3,7 @@

In a Kedro project, the Data Catalog is a registry of all data sources available for use by the project. The catalog is stored in a YAML file (`catalog.yml`) that maps the names of node inputs and outputs as keys in the `DataCatalog` class.

-[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems, so you don’t have to write any of the logic for reading/writing data.
+[Kedro provides different built-in datasets in the `kedro-datasets` package](/kedro_datasets) for numerous file types and file systems so you don’t have to write any of the logic for reading/writing data.


We first introduce the basic sections of `catalog.yml`, which is the file used to register data sources for a Kedro project.
2 changes: 1 addition & 1 deletion docs/source/data/kedro_dataset_factories.md
@@ -1,7 +1,7 @@
# Kedro dataset factories
You can load multiple datasets with similar configuration using dataset factories, introduced in Kedro 0.18.12.

-The syntax allows you to generalise the configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.
+The syntax allows you to generalise your configuration and reduce the number of similar catalog entries by matching datasets used in your project's pipelines to dataset factory patterns.
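
The matching idea in that sentence can be illustrated with a small standard-library sketch. It assumes brace-style placeholders such as `{name}` in the pattern, which this hunk does not show, and it is not how Kedro resolves patterns internally. The helper names `pattern_to_regex` and `resolve`, and the catalog keys and values in `patterns`, are hypothetical.

```python
import re
from typing import Any, Dict, Optional


def pattern_to_regex(pattern: str) -> "re.Pattern":
    """Turn a pattern like "{name}_data" into a regex with named groups."""
    # re.split with a capturing group alternates: literal, placeholder, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)      # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+?)"   # placeholder captures part of the name
    return re.compile(regex)


def resolve(dataset_name: str, patterns: Dict[str, Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return a concrete entry for the first pattern that matches, else None."""
    for pattern, template in patterns.items():
        match = pattern_to_regex(pattern).fullmatch(dataset_name)
        if match:
            captured = match.groupdict()
            # Fill the captured values into every string field of the template.
            return {
                key: value.format(**captured) if isinstance(value, str) else value
                for key, value in template.items()
            }
    return None


# Hypothetical single pattern standing in for many near-identical entries.
patterns = {
    "{name}_data": {"type": "pandas.CSVDataSet", "filepath": "data/01_raw/{name}.csv"},
}

companies = resolve("companies_data", patterns)
unmatched = resolve("model_output", patterns)
```

Here `companies` is a concrete entry whose `filepath` is `data/01_raw/companies.csv`, while `unmatched` is `None` because no pattern fits that dataset name.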

## How to generalise datasets with similar names and types

2 changes: 1 addition & 1 deletion docs/source/data/partitioned_and_incremental_datasets.md
@@ -2,7 +2,7 @@

## Partitioned datasets

-Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but the use of Spark is not always feasible.
+Distributed systems play an increasingly important role in ETL data pipelines. They significantly increase the processing throughput, enabling us to work with much larger volumes of input data. However, these benefits sometimes come at a cost. When dealing with the input data generated by such distributed systems, you might encounter a situation where your Kedro node needs to read the data from a directory full of uniform files of the same type (e.g. JSON, CSV, Parquet, etc.) rather than from a single file. Tools like `PySpark` and the corresponding [SparkDataSet](/kedro_datasets.spark.SparkDataSet) cater for such use cases, but using Spark is not always feasible.

This is why Kedro provides a built-in [PartitionedDataset](/kedro.io.PartitionedDataset), with the following features:
