
Support streaming ingest #222

Open
epa095 opened this issue Feb 16, 2022 · 8 comments

Comments

@epa095

epa095 commented Feb 16, 2022

I am looking into how low latency I can get between Spark and ADX. One way to get lower latency is to enable the streaming ingestion policy on both the ADX cluster and the database/table. But after enabling it, I still see that it takes 5 minutes for my batches from Spark (through this connector) to arrive.

The documentation only mentions the Ingestion Batching Policy, making me think that this connector may not support streaming ingestion into ADX. If so, it would be very nice (and natural) for it to start supporting it. Maybe this new feature in azure-kusto-python makes it easier?

@yogilad
Contributor

yogilad commented Feb 24, 2022

Same issue as #217

@yogilad
Contributor

yogilad commented Feb 24, 2022

@epa095,
Spark uses temporary tables, which don't adhere to the target table's ingestion batching policy configuration.
We have a bug open for this (see the issue linked above).

As a workaround, if this is OK for your specific case, you can set the ingestion batching policy at the database level.
You can also consider one of the following workarounds:
• Lowering KustoSink's KUSTO_CLIENT_BATCHING_LIMIT config from its default of 100MB. This is the threshold at which Spark ships its data to the Kusto service.
• Setting the KustoSinkOptions.KUSTO_SPARK_INGESTION_PROPERTIES_JSON [flushImmediately] config to true. The Kusto service will then ignore the ingestion batching policy described earlier and ingest immediately without batching.
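A minimal sketch of wiring up the two workarounds above. The option-string names ("clientBatchingLimit", "sparkIngestionPropertiesJson") are assumed to match the connector's KustoSinkOptions constants, and the cluster/database/table values are placeholders, not from this thread:

```python
import json

# Workaround 2: ask the service to ingest immediately, bypassing the
# ingestion batching policy.
ingestion_props = json.dumps({"flushImmediately": True})

sink_options = {
    "kustoCluster": "<cluster-url>",    # placeholder
    "kustoDatabase": "<database>",      # placeholder
    "kustoTable": "<table>",            # placeholder
    # Workaround 1: lower the client batching limit (in MB) below the
    # 100MB default so Spark ships its data to Kusto sooner.
    "clientBatchingLimit": "10",
    "sparkIngestionPropertiesJson": ingestion_props,
}
```

In a real job these would be passed to the writer, e.g. `df.write.format("com.microsoft.kusto.spark.datasource").options(**sink_options).save()` (assuming the connector's data source name).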

@epa095
Author

epa095 commented Mar 9, 2022

@yogilad : I have changed the policy at the database level: I both enabled the streamingingestion policy and set the ingestionbatching policy's MaximumBatchingTimeSpan to 10 seconds. By setting KustoSinkOptions.KUSTO_SPARK_INGESTION_PROPERTIES_JSON [flushImmediately] to true, I get the following timings from Spark using this sink:

  "durationMs" : {
    "addBatch" : 4826,
    "getBatch" : 32,
    "latestOffset" : 30,
    "queryPlanning" : 5,
    "triggerExecution" : 5093,
    "walCommit" : 111
  },

So writing the batch takes roughly 4.5 seconds. That is of course much better than 5 minutes, so great :-D

BUT I am still under the impression that streaming ingestion is a completely different ingestion mode for Kusto, and that we should be able to expect "Latency of less than a second[...]". And I see that the Python Kusto client has special handling for streaming. So, does this connector support writing to the streaming ingestion endpoint? If it does, what is the "magic toggle"? Or is it just kind of "automagically" enabled if I enable streaming ingestion on the database and set flushImmediately=True?
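For reference, the database-level policy changes described in this comment could be sketched roughly as the following KQL management commands (the database name is a placeholder, and the exact syntax should be checked against the ADX docs):

```kql
// Enable streaming ingestion on the database (placeholder name).
.alter database MyDatabase policy streamingingestion enable

// Cap the batching policy's maximum batching time span at 10 seconds.
.alter database MyDatabase policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:10"}'
```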

@yihezkel
Member

The Spark Connector does not currently support streaming ingestion, though we may add this in the future. Thanks for the suggestion!

@yihezkel yihezkel added and then removed the enhancement (New feature or request) label Mar 14, 2022
@ohadbitt
Contributor

Just for clarification, we have other solutions that could be used here: you can create the blobs yourself and set up an Event Grid connection to a cluster that has streaming ingestion enabled.
https://docs.microsoft.com/en-us/azure/data-explorer/ingest-data-event-grid?tabs=adx

@ohadbitt
Contributor

ohadbitt commented Aug 14, 2022

The current implementation is now in good shape for streaming support, but the argument for not implementing it is to avoid wrong usage with Spark streaming.
The new version 3.1.0 adds two options to lower latency:
1. The table ingestion batching policy now takes effect on the temporary table.
2. The user can supply the temporary table themselves, and it should already have an updated table batching policy.
If there is still a need to support Kusto streaming ingestion in this connector, please react here.

@timwilke

Hi @ohadbitt, can you explain what you mean by 'wrong usage with Spark streaming'?

And regarding the new version: if I understood the documentation correctly, with table ingestion batching policies you get a minimum latency of 10 seconds. This is not sufficient for time-critical streaming use cases (like ours). So it would be great to have a way to use Data Explorer's streaming ingestion feature directly through the Spark connector.
I like your suggestion to generate blobs and forward them via EventGrid. But I think this should only be an interim solution until the Spark Connector supports streaming ingestion.

@ohadbitt
Contributor

ohadbitt commented Mar 20, 2023

Hi @timwilke,
I was referring to the possibility of mistaking Kusto streaming for Spark streaming, which happens frequently.
We were thinking about implementing it anyway but @Ram-G will respond with these details.
