Kusto Stream Ingestion #301

Closed
wants to merge 10 commits from the feature/kusto-stream-ingestions branch

Conversation

@jrob5756 jrob5756 commented Apr 12, 2023

Pull Request Description

This pull request contains code (and tests) to introduce Kusto streaming ingestion support in the Spark Kusto Connector.
Stream ingestion is enabled by setting the KUSTO_WRITE_MODE option to Stream.
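For illustration, a minimal batch-write sketch of what this looks like from Spark. The placeholder values and the literal option-key strings (in particular the string behind KUSTO_WRITE_MODE, which this PR introduces) are assumptions and should be checked against KustoSinkOptions:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: write a small DataFrame with stream ingestion enabled.
// The option keys below ("kustoCluster", "writeMode", ...) are assumed, not taken
// verbatim from this PR; replace the <placeholders> with real values.
val spark = SparkSession.builder().appName("KustoStreamIngestSketch").getOrCreate()
import spark.implicits._

val df = Seq(("device-1", 42), ("device-2", 7)).toDF("deviceId", "reading")

df.write
  .format("com.microsoft.kusto.spark.datasource")
  .option("kustoCluster", "<cluster>")
  .option("kustoDatabase", "<database>")
  .option("kustoTable", "<table>")
  .option("kustoAadAppId", "<aad-app-id>")
  .option("kustoAadAppSecret", "<aad-app-secret>")
  .option("kustoAadAuthorityID", "<aad-tenant-id>")
  .option("writeMode", "Stream") // assumed key for KUSTO_WRITE_MODE; default is queued ingestion
  .mode("append")
  .save()
```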

Notes

  • Stream ingestion must be enabled on the destination Table and/or Cluster (see here for details).
  • Stream ingestion has a data size limit of 4 MB. If a partition exceeds this size, the connector will send multiple appropriately sized requests.
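To make the second note concrete, here is a rough sketch of the chunking idea (illustrative only, not the PR's actual code): serialized rows from a partition are grouped into batches whose total size stays under the streaming cap, and each batch is sent as its own streaming request.

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative sketch of per-partition chunking; sendStreamingRequest below is a
// hypothetical placeholder for the actual streaming-ingestion call.
val StreamingLimitBytes: Int = 4 * 1024 * 1024 // 4 MB streaming-ingestion cap

def chunkBySize(rows: Iterator[Array[Byte]],
                limitBytes: Int = StreamingLimitBytes): Iterator[Seq[Array[Byte]]] =
  new Iterator[Seq[Array[Byte]]] {
    private val buffered = rows.buffered
    override def hasNext: Boolean = buffered.hasNext
    override def next(): Seq[Array[Byte]] = {
      val chunk = ArrayBuffer.empty[Array[Byte]]
      var bytes = 0
      // Always take at least one row so an oversized single row still produces a request.
      while (buffered.hasNext && (chunk.isEmpty || bytes + buffered.head.length <= limitBytes)) {
        val row = buffered.next()
        bytes += row.length
        chunk += row
      }
      chunk.toSeq
    }
  }

// Usage (hypothetical): chunkBySize(partitionRows).foreach(sendStreamingRequest)
```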

Breaking Changes:

  • None

Features:

Fixes:

  • None

@ag-ramachandran
Contributor

Hello @jrob5756, thanks for the PR.

We're going to have a look at this and provide a review in a week or two (we are in the process of a version bump to JDK 11 and up).

@ohadbitt ohadbitt requested review from ag-ramachandran and ohadbitt and removed request for ag-ramachandran April 13, 2023 17:27
@ohadbitt
Contributor

ohadbitt commented Apr 13, 2023

We need to discuss internally how we want to deal with streaming.
From the PR note: "Stream ingestion has a data size limit of 4 MB. If a micro-batch exceeds this size, the connector will fall back to the Queued write mode."
For clarification, the decision made here is per partition and not per micro-batch, which is OK; just calling it by the right name.
Anyway, I think we should see whether it's possible to split the data into 4 MB chunks instead and stream each chunk individually. There are cons to this as well: we might end up streaming GBs of data this way, which is not ideal, and a big partition may indicate the data is not streaming material in the first place.
Another issue is that we would probably want the ManagedStreaming client rather than plain streaming. The con there is that managed streaming doesn't provide transactionality (the same applies to the 4 MB queued fallback), so we need to discuss this further as well, because streaming is transactional by definition.
@ag-ramachandran, let's discuss this internally and then move on to reviewing this PR.
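For reference, a hedged sketch of the two client flavours being weighed above, as exposed by the azure-kusto-java SDK the connector builds on. The factory method names, packages, and argument order are assumptions and may differ between SDK versions:

```scala
import com.microsoft.azure.kusto.data.auth.ConnectionStringBuilder
import com.microsoft.azure.kusto.ingest.IngestClientFactory

// Assumed endpoints and credentials; replace the <placeholders> with real values.
val engineCsb = ConnectionStringBuilder.createWithAadApplicationCredentials(
  "https://<cluster>.kusto.windows.net", "<app-id>", "<app-secret>", "<tenant-id>")
val ingestCsb = ConnectionStringBuilder.createWithAadApplicationCredentials(
  "https://ingest-<cluster>.kusto.windows.net", "<app-id>", "<app-secret>", "<tenant-id>")

// Plain streaming: every request goes straight to the engine, subject to the 4 MB cap.
val streamingClient = IngestClientFactory.createStreamingIngestClient(engineCsb)

// Managed streaming: tries streaming first and transparently retries / falls back to
// queued ingestion, which is where the transactionality concern above comes from.
val managedClient = IngestClientFactory.createManagedStreamingIngestClient(ingestCsb, engineCsb)
```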

@jrob5756
Author

Thanks @ohadbitt. Good call on the micro-batch vs. partition comment; I updated the PR summary to reflect that. Regarding the queued-fallback approach, I had the same thoughts while implementing these changes. As you pointed out, there are pros and cons to both strategies. If the team is more comfortable splitting the data into 4 MB chunks, I can certainly do that, as it would not be difficult to implement. Let me know!

@jrob5756 jrob5756 force-pushed the feature/kusto-stream-ingestions branch from 7c77f2c to 98994ec on April 21, 2023 01:13
@jrob5756 jrob5756 force-pushed the feature/kusto-stream-ingestions branch from 98994ec to e86b62b on May 2, 2023 20:43
@jrob5756 jrob5756 force-pushed the feature/kusto-stream-ingestions branch from 1f9c022 to cc014df on May 18, 2023 12:56
@jrob5756
Author

@ag-ramachandran quick update: we discussed offline, but I'm adding comments here as well for transparency. Since the PR was created, the code has been updated to chunk the requests as discussed above. I also noticed some recent structural changes that were incompatible with the PR; I've updated the code, so we should be squared away there as well. I await your review.

@jrob5756 jrob5756 force-pushed the feature/kusto-stream-ingestions branch from 47b4c47 to fbaa32f on May 18, 2023 17:38
@Raghu0210

@jrob5756 @ag-ramachandran
As a user, I truly appreciate the amount of work you are putting into this feature. Any update on the timeline for this much-awaited feature (Spark-Kusto streaming ingest)?

We are working on a POC to stream a large volume of data with the Spark-Kusto connector (Databricks) into ADX streaming tables, with a requirement of very low (millisecond) latency.

Thanks for your time and dedication in developing this feature. Looking forward to your update.

@ag-ramachandran
Contributor

Hello @Raghu0210, we should get this picked up by the end of this year. The challenge has been the changes we have been making in the connector, with this work taking lower priority. We will keep this thread updated.
