
[Data] Add read_clickhouse API to read ClickHouse Dataset #48817

Open · wants to merge 9 commits into master

Conversation

jecsand838

Why are these changes needed?

Greetings from Elastiflow!

This PR introduces a new ClickHouseDatasource connector for Ray, which provides a convenient way to read data from ClickHouse into Ray Datasets. The ClickHouseDatasource is particularly useful for users working with large datasets stored in ClickHouse who want to leverage Ray's distributed computing capabilities for AI and ML use cases. We found this functionality useful while evaluating ML technologies and wanted to contribute it back.

Key Features and Benefits:

  1. Seamless Integration: The ClickHouseDatasource integrates ClickHouse data into Ray workflows, so users can easily access their data and apply Ray's parallel computation.
  2. Custom Query Support: Users can specify columns, filters, and orderings, so the generated query reads only the necessary data, which improves performance.
  3. User-Friendly API: The connector abstracts the complexity of setting up and querying ClickHouse, letting users focus on data analysis rather than data extraction. (See the usage sketch below.)
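
A minimal usage sketch of the proposed connector follows. The module path, DSN format, table name, and the meaning of the order_by flag are illustrative assumptions, not documented API from this PR:

```python
import ray
from ray.data import read_datasource
# Import path is assumed for illustration; it may differ in the final PR.
from ray.data._internal.datasource.clickhouse_datasource import ClickHouseDatasource

datasource = ClickHouseDatasource(
    entity="default.flows",                 # table (or view) to read; hypothetical name
    dsn="clickhouse://user:password@localhost:8123/default",  # placeholder DSN
    columns=["src_ip", "dst_ip", "bytes"],  # read only the needed columns
    filters={"bytes": ("greater", 0)},      # WHERE bytes > 0
    order_by=(["bytes"], True),             # ORDER BY bytes (flag presumably toggles DESC)
)

ds = read_datasource(datasource)
print(ds.schema())
```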

Tested locally with a ClickHouse table containing ~12 million records.

[Screenshot: local test run, 2024-11-20]

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@jcotant1 added the data (Ray Data-related issues) label on Nov 20, 2024
Comment on lines +22 to +26
columns: Optional[List[str]] = None,
filters: Optional[Dict[str, Tuple[str, Any]]] = None,
order_by: Optional[Tuple[List[str], bool]] = None,
client_settings: Optional[Dict[str, Any]] = None,
client_kwargs: Optional[Dict[str, Any]] = None,
Contributor:
Let's make all optional args as kwargs
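
For reference, one way to do that is a bare * in the signature, which makes everything after it keyword-only (signature abridged from the diff above):

```python
from typing import Any, Dict, List, Optional, Tuple

def __init__(
    self,
    entity: str,
    dsn: str,
    *,  # all parameters below must be passed by keyword
    columns: Optional[List[str]] = None,
    filters: Optional[Dict[str, Tuple[str, Any]]] = None,
    order_by: Optional[Tuple[List[str], bool]] = None,
    client_settings: Optional[Dict[str, Any]] = None,
    client_kwargs: Optional[Dict[str, Any]] = None,
) -> None:
    ...
```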

Comment on lines +20 to +21
entity: str,
dsn: str,
Contributor:
Can you please give an example of the DSN?
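
For illustration, a URL-style DSN typically looks like the snippet below; the scheme, port, and credentials are placeholders, and the exact format accepted depends on the underlying ClickHouse client library:

```python
# Placeholder values; the HTTP interface is commonly on port 8123, the native protocol on 9000.
dsn = "clickhouse://default:password@localhost:8123/default"
```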


def __init__(
self,
entity: str,
Contributor:
nit: I'd suggest we use a more common term like table (and expand in the py-doc that this could also be a view of one)

Comment on lines +37 to +42
filters: Optional fields and values mapping to use to filter the data via
WHERE clause. The value should be a tuple where the first element is
one of ('is', 'not', 'less', 'greater') and the second
element is the value to filter by. The default operator
is 'is'. Only strings, ints, floats, booleans,
and None are allowed as values.
Contributor:
IIUC this requires the predicate in DNF format; let's call that out explicitly and add an example to help with understanding.

Also, let's add a link to the ClickHouse documentation page explaining these parameters in more detail.
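
For example, a filters mapping in the shape the docstring describes might look like this (column names and values are hypothetical; conditions are presumably combined with AND):

```python
filters = {
    "status": ("is", "active"),    # WHERE status = 'active'
    "error_code": ("not", None),   # AND error_code IS NOT NULL
    "latency_ms": ("less", 250),   # AND latency_ms < 250
}
```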

f"Unsupported operator '{op}' for filter on '{column}'. "
f"Defaulting to 'is'"
)
op = "is"
Contributor:
Same as below

Comment on lines +98 to +113
if value is None:
    operator = validate_non_numeric_ops(key, operator)
    if operator == "is":
        filter_conditions.append(f"{key} IS NULL")
    elif operator == "not":
        filter_conditions.append(f"{key} IS NOT NULL")
elif isinstance(value, str):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(f"{key} {ops[operator]} '{value}'")
elif isinstance(value, bool):
    operator = validate_non_numeric_ops(key, operator)
    filter_conditions.append(
        f"{key} {ops[operator]} {str(value).lower()}"
    )
elif isinstance(value, (int, float)):
    filter_conditions.append(f"{key} {ops[operator]} {value}")
Contributor:
Let's split up value conversion from filter composition to avoid duplication
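
One possible shape for that split (a sketch only, assuming the ops mapping from the excerpt below): render the operand in one helper and compose the condition in another.

```python
from typing import Any

def _render_value(value: Any) -> str:
    # Convert a Python literal into its SQL text form.
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return str(value).lower()
    if isinstance(value, str):
        return f"'{value}'"
    return str(value)  # int / float

def _render_condition(key: str, operator: str, value: Any) -> str:
    # NULL comparisons need IS / IS NOT rather than = / !=.
    if value is None:
        return f"{key} IS NULL" if operator == "is" else f"{key} IS NOT NULL"
    return f"{key} {ops[operator]} {_render_value(value)}"
```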

op = "is"
return op

ops = {"is": "=", "not": "!=", "less": "<", "greater": ">"}
Contributor:
Let's use Python operators so that we're not reinventing the wheel here
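
One reading of that suggestion (an interpretation, not the reviewer's exact proposal) is to accept standard comparison operators in the filter spec instead of the custom keywords:

```python
# Map user-facing comparison operators straight onto their SQL counterparts.
ops = {"==": "=", "!=": "!=", "<": "<", "<=": "<=", ">": ">", ">=": ">="}

# Hypothetical filter spec using those operators.
filters = {"bytes": (">", 0), "status": ("==", "active")}
```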

Comment on lines +171 to +173
# Fetch the fragments from the ClickHouse client
with self._client.query_arrow_stream(self._query) as stream:
    record_batches = list(stream)  # Collect all record batches
Contributor:
So actual reading of the data needs to be performed inside the read task (that's currently consolidated in _read_fn)

Contributor:
I'd also recommend you take a look at SQLDatasource to see how this could be structured in a way that's compatible with Ray's read APIs.
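
A simplified sketch of the pattern being described: get_read_tasks() only plans the work, and the ClickHouse query runs inside the function handed to each ReadTask, so it executes on the workers. The helper names (_split_query, _init_client, _estimate_metadata) are hypothetical and metadata construction is elided; SQLDatasource on current master shows the exact ReadTask/BlockMetadata usage.

```python
from ray.data.datasource import ReadTask

def get_read_tasks(self, parallelism: int):
    read_tasks = []
    for shard_query in self._split_query(parallelism):  # hypothetical helper

        def _read_fn(query=shard_query):
            # Runs inside the read task on a worker: open a client and
            # stream Arrow record batches for this shard of the query.
            client = self._init_client()  # hypothetical helper
            with client.query_arrow_stream(query) as stream:
                yield from stream

        read_tasks.append(ReadTask(_read_fn, self._estimate_metadata(shard_query)))
    return read_tasks
```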

Author (jecsand838):
Ah, I see what you mean. I'll make that change. Thank you for pointing that out!

Labels: data (Ray Data-related issues)
3 participants