
http scraping with an IntervalStream piles up requests if the server takes > interval_secs to respond #14087

Closed
neuronull opened this issue Aug 23, 2022 · 1 comment · Fixed by #18021
Labels
source: http_client Anything `http_client` source related

Comments


neuronull commented Aug 23, 2022

I'm not familiar with the current approach to scraping. Could this cause multiple requests to be outstanding simultaneously if a slow server causes a timeout longer than interval_secs?

This, incidentally, is one reason to avoid mutable state in per-request context data.

Originally posted by @bruceg in #13793 (comment)

See conversation thread in linked discussion for more context, specifically #13793 (comment).
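The failure mode being asked about can be sketched with a minimal asyncio model (this is a hypothetical stand-in, not Vector's actual Rust code): if each scrape is awaited inline before the next interval tick is honored, one slow response pushes every subsequent scrape past its scheduled time, and the missed ticks then fire back to back.

```python
import asyncio
import time

async def fake_request(delay: float) -> str:
    # Stand-in for an HTTP scrape; the target controls the delay.
    await asyncio.sleep(delay)
    return f"responded after {delay}s"

async def blocking_scraper(interval: float, delays: list[float]) -> list[float]:
    """Drive each request to completion before the next tick, like
    polling an interval stream and awaiting the response inline."""
    start = time.monotonic()
    finished = []
    for delay in delays:
        tick = time.monotonic()
        await fake_request(delay)  # a slow response stalls the whole loop
        finished.append(time.monotonic() - start)
        # Sleep only for whatever remains of the interval, if anything.
        elapsed = time.monotonic() - tick
        await asyncio.sleep(max(0.0, interval - elapsed))
    return finished

# First response takes 0.5s against a 0.1s interval: every later
# scrape is pushed past its scheduled tick and fires back to back.
times = asyncio.run(blocking_scraper(0.1, [0.5, 0.1, 0.1]))
print([round(t, 2) for t in times])
```

The observed output in the reproduction below matches this model: the first slow request delays everything behind it.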

@neuronull neuronull added the source: http_client Anything `http_client` source related label Aug 23, 2022

neuronull commented Aug 25, 2022

Here is a Python script, graciously provided by @hhromic, that can be used to test the http_scrape source for this behavior.

aiohttp is capable of serving concurrent requests. The output below demonstrates that the stream is blocking on the first request.

```python
import asyncio
import random
from aiohttp import web

async def handler(request):
    data = request.config_dict["data"]
    data["req_num"] += 1
    req_num = data["req_num"]  # store locally as it can change during awaits
    delay = 10 if req_num == 1 else 1
    await asyncio.sleep(delay)
    return web.Response(text=f"req {req_num} = {delay}s\n")

app = web.Application()
app.add_routes([web.get("/", handler)])  # can be web.post(...) etc
app["data"] = {"req_num": 0}
web.run_app(app)
```
The config.toml used:

```toml
data_dir = "/var/lib/vector/"

[sources.source0]
endpoint = "http://localhost:8080"
scrape_interval_secs = 1
type = "http_scrape"

[sources.source0.decoding]
codec = "bytes"

[sources.source0.framing]
method = "bytes"

[sources.source0.headers]

[sources.source0.query]

[sinks.sink0]
inputs = ["source0"]
target = "stdout"
type = "console"

[sinks.sink0.encoding]
codec = "json"

[sinks.sink0.healthcheck]
enabled = true

[sinks.sink0.buffer]
type = "memory"
max_events = 500
when_full = "block"
```
```console
$ ./vector --version
vector 0.24.0 (x86_64-unknown-linux-gnu debug=full)
$ ./vector -c ./config.toml
2022-08-25T19:42:15.645932Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-08-25T19:42:15.646972Z  INFO vector::app: Loading configs. paths=["config.toml"]
2022-08-25T19:42:15.721984Z  INFO vector::topology::running: Running healthchecks.
2022-08-25T19:42:15.722456Z  INFO vector::topology::builder: Healthcheck: Passed.
2022-08-25T19:42:15.723060Z  INFO vector: Vector has started. debug="true" version="0.24.0" arch="x86_64" build_id="none"
2022-08-25T19:42:15.723277Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
{"message":"req 1 = 10s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:25.749900702Z"}
{"message":"req 2 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:26.770875329Z"}
{"message":"req 3 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:27.813741409Z"}
{"message":"req 4 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:28.857090004Z"}
{"message":"req 5 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:29.875683532Z"}
^C2022-08-25T19:42:30.022671Z  INFO vector: Vector has stopped.
2022-08-25T19:42:30.025055Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="sink0, source0" time_remaining="59 seconds left"
2022-08-25T19:42:30.919854Z  INFO source{component_kind="source" component_id=source0 component_type=http_scrape component_name=source0}: vector::sources::util::http_scrape: Finished sending.
{"message":"req 6 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:30.918315434Z"}
```

@neuronull neuronull changed the title to http scraping with an IntervalStream piles up requests if the server takes > interval_secs to respond Aug 25, 2022
github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2023
…timeouts (#18021)


fixes #14087 
fixes #14132 
fixes #17659

- [x] make target timeout configurable

this builds on what @wjordan did in
#17660

### what's changed
- prometheus scrapes happen concurrently
- requests to targets can time out
- the timeout can be configured (user-facing change)
- a small change in how the HTTP client is instantiated
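The fixed behavior described above can be sketched in asyncio terms (a hypothetical model for illustration, not the actual Rust implementation): each tick spawns its scrape as an independent task with a per-request timeout, so a slow target can neither delay later ticks nor hold a request open forever.

```python
import asyncio

async def fake_request(delay: float) -> str:
    # Stand-in for an HTTP scrape; the target controls the delay.
    await asyncio.sleep(delay)
    return f"ok after {delay}s"

async def concurrent_scraper(interval: float, delays: list[float],
                             timeout: float) -> list[str]:
    """Spawn each scrape as its own task on every tick, bounded by a
    per-request timeout, so ticks stay on schedule."""
    async def one(delay: float) -> str:
        try:
            return await asyncio.wait_for(fake_request(delay), timeout)
        except asyncio.TimeoutError:
            return f"timed out after {timeout}s"

    tasks = []
    for delay in delays:
        tasks.append(asyncio.create_task(one(delay)))  # fire and move on
        await asyncio.sleep(interval)                  # next tick on schedule
    return list(await asyncio.gather(*tasks))

# The slow first target (0.5s) exceeds the 0.3s timeout and is cut off;
# the remaining scrapes complete unaffected.
results = asyncio.run(concurrent_scraper(0.1, [0.5, 0.1, 0.1], timeout=0.3))
print(results)
```

The interval loop here never waits on a response, which is the essential difference from the blocking reproduction earlier in the thread.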

---------

Co-authored-by: Doug Smith <[email protected]>
Co-authored-by: Stephen Wakely <[email protected]>