
Support streaming ingest #222

Open
epa095 opened this issue Feb 16, 2022 · 8 comments

Comments

@epa095

epa095 commented Feb 16, 2022

I am looking into how low latency I can get between Spark and ADX. One way to get lower latency is to enable the streaming ingestion policy on both the ADX cluster and the database/table. But after enabling it, I still see that it takes 5 minutes for my batches from Spark (through this connector) to arrive.

The documentation only mentions the Ingestion Batching Policy, making me think that this connector may not support streaming ingestion into ADX. If so, it would be very nice (and natural) for it to start supporting it. Maybe this new feature in azure-kusto-python makes it easier?

@yogilad
Contributor

yogilad commented Feb 24, 2022

Same issue as #217

@yogilad
Contributor

yogilad commented Feb 24, 2022

@epa095,
Spark uses temporary tables, which don't adhere to the target table's ingestion batching policy configuration.
We have a bug open for this (see the issue linked above).

As a workaround, if this is OK for your specific case, you can set the ingestion batching policy at the database level.
You can also consider one of the following workarounds:
• Lowering KustoSink's KUSTO_CLIENT_BATCHING_LIMIT config from its default of 100MB. This is the threshold at which Spark ships its data to the Kusto service.
• Setting the KustoSinkOptions.KUSTO_SPARK_INGESTION_PROPERTIES_JSON [flushImmediately] config to true. The Kusto service will then ignore the ingestion batching policy described earlier and ingest immediately without batching.
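A minimal sketch of wiring up the two workarounds above. The option-string names ("clientBatchingLimit", "sparkIngestionPropertiesJson") are assumed to match the connector's KustoSinkOptions constants, and the cluster/database/table values are placeholders, not from this thread:

```python
import json

# Workaround 2: ask the service to ingest immediately, bypassing the
# ingestion batching policy.
ingestion_props = json.dumps({"flushImmediately": True})

sink_options = {
    "kustoCluster": "<cluster-url>",    # placeholder
    "kustoDatabase": "<database>",      # placeholder
    "kustoTable": "<table>",            # placeholder
    # Workaround 1: lower the client batching limit (in MB) below the
    # 100MB default so Spark ships its data to Kusto sooner.
    "clientBatchingLimit": "10",
    "sparkIngestionPropertiesJson": ingestion_props,
}
```

In a real job these would be passed to the writer, e.g. `df.write.format("com.microsoft.kusto.spark.datasource").options(**sink_options).save()` (assuming the connector's data source name).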

@epa095
Author

epa095 commented Mar 9, 2022

@yogilad : I have changed the policy at the database level: I both enabled the streamingingestion policy and set the ingestionbatching policy's MaximumBatchingTimeSpan to 10 seconds. By setting KustoSinkOptions.KUSTO_SPARK_INGESTION_PROPERTIES_JSON [flushImmediately] to true, I get the following timings from Spark using this sink:

  "durationMs" : {
    "addBatch" : 4826,
    "getBatch" : 32,
    "latestOffset" : 30,
    "queryPlanning" : 5,
    "triggerExecution" : 5093,
    "walCommit" : 111
  },

So writing the batch takes roughly 4.5 seconds. That is of course much better than 5 minutes, so great :-D

BUT I am still under the impression that streaming ingestion is a completely different ingestion mode for Kusto, and that we should be able to expect "Latency of less than a second[...]". And I see that the Python Kusto client has special handling for streaming. So, does this connector support writing to the streaming ingestion endpoint? If it does, what is the "magic toggle"? Or is it just kind of "automagically" enabled if I enable streaming ingestion on the database and set flushImmediately=True?
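For reference, the database-level policy changes described in this comment could be sketched roughly as the following KQL management commands (the database name is a placeholder, and the exact syntax should be checked against the ADX docs):

```kql
// Enable streaming ingestion on the database (placeholder name).
.alter database MyDatabase policy streamingingestion enable

// Cap the batching policy's maximum batching time span at 10 seconds.
.alter database MyDatabase policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:00:10"}'
```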

@yihezkel
Member

The Spark Connector does not currently support streaming ingestion, though we may add this in the future. Thanks for the suggestion!

@yihezkel yihezkel added and then removed the enhancement (New feature or request) label Mar 14, 2022
@ohadbitt
Contributor

Just for clarification, we have other solutions that could be used here: you can create the blobs yourself and set up an Event Grid connection to a cluster that has streaming ingestion enabled.
https://docs.microsoft.com/en-us/azure/data-explorer/ingest-data-event-grid?tabs=adx

@ohadbitt
Contributor

ohadbitt commented Aug 14, 2022

The current implementation is now in good shape for streaming support, but the argument for not implementing it is to avoid wrong usage with Spark streaming.
The new version 3.1.0 adds two options to lower latency:
1. The table ingestion batching policy now takes effect on the temporary table.
2. The user can supply the temporary table themselves, and it should already have an updated table batching policy.
If there is still a need to support Kusto streaming ingestion in this connector, please react here.

@timwilke

Hi @ohadbitt, can you explain what you mean by 'wrong usage with Spark streaming'?

And regarding the new version: if I understood the documentation correctly, with table ingestion batching policies you get a minimum latency of 10 seconds. This is not sufficient for time-critical streaming use cases (like ours). So it would be great to have a way to use Data Explorer's streaming ingestion feature directly through the Spark connector.
I like your suggestion to generate blobs and forward them via EventGrid. But I think this should only be an interim solution until the Spark Connector supports streaming ingestion.

@ohadbitt
Contributor

ohadbitt commented Mar 20, 2023

Hi @timwilke,
I was referring to the possibility of mistaking Kusto streaming for Spark streaming, which happens frequently.
We were thinking about implementing it anyway but @Ram-G will respond with these details.
