
http scraping with an IntervalStream piles up requests if the server takes > interval_secs to respond #14087

Closed
neuronull opened this issue Aug 23, 2022 · 1 comment · Fixed by #18021
Labels
source: http_client Anything `http_client` source related

Comments


neuronull commented Aug 23, 2022

I'm not familiar with the current approach to scraping. Could this cause multiple requests to be outstanding simultaneously if a slow server causes a timeout longer than interval_secs?

This, incidentally, is one reason to avoid mutable state in per-request context data.

Originally posted by @bruceg in #13793 (comment)

See conversation thread in linked discussion for more context, specifically #13793 (comment).
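The failure mode being asked about can be sketched with a minimal asyncio model (this is a hypothetical stand-in, not Vector's actual Rust code): if each scrape is awaited inline before the next interval tick is honored, one slow response pushes every subsequent scrape past its scheduled time, and the missed ticks then fire back to back.

```python
import asyncio
import time

async def fake_request(delay: float) -> str:
    # Stand-in for an HTTP scrape; the target controls the delay.
    await asyncio.sleep(delay)
    return f"responded after {delay}s"

async def blocking_scraper(interval: float, delays: list[float]) -> list[float]:
    """Drive each request to completion before the next tick, like
    polling an interval stream and awaiting the response inline."""
    start = time.monotonic()
    finished = []
    for delay in delays:
        tick = time.monotonic()
        await fake_request(delay)  # a slow response stalls the whole loop
        finished.append(time.monotonic() - start)
        # Sleep only for whatever remains of the interval, if anything.
        elapsed = time.monotonic() - tick
        await asyncio.sleep(max(0.0, interval - elapsed))
    return finished

# First response takes 0.5s against a 0.1s interval: every later
# scrape is pushed past its scheduled tick and fires back to back.
times = asyncio.run(blocking_scraper(0.1, [0.5, 0.1, 0.1]))
print([round(t, 2) for t in times])
```

The observed output in the reproduction below matches this model: the first slow request delays everything behind it.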

@neuronull neuronull added the source: http_client Anything `http_client` source related label Aug 23, 2022

neuronull commented Aug 25, 2022

Here is a Python script, graciously provided by @hhromic, that can be used to test the http_scrape source for this behavior.

aiohttp is capable of serving concurrent requests. The output below demonstrates that the stream is blocking on the first request.

```python
import asyncio
import random
from aiohttp import web

async def handler(request):
    data = request.config_dict["data"]
    data["req_num"] += 1
    req_num = data["req_num"]  # store locally as it can change during awaits
    delay = 10 if req_num == 1 else 1
    await asyncio.sleep(delay)
    return web.Response(text=f"req {req_num} = {delay}s\n")

app = web.Application()
app.add_routes([web.get("/", handler)])  # can be web.post(...) etc
app["data"] = {"req_num": 0}
web.run_app(app)
```
The config.toml used:

```toml
data_dir = "/var/lib/vector/"

[sources.source0]
endpoint = "http://localhost:8080"
scrape_interval_secs = 1
type = "http_scrape"

[sources.source0.decoding]
codec = "bytes"

[sources.source0.framing]
method = "bytes"

[sources.source0.headers]

[sources.source0.query]

[sinks.sink0]
inputs = ["source0"]
target = "stdout"
type = "console"

[sinks.sink0.encoding]
codec = "json"

[sinks.sink0.healthcheck]
enabled = true

[sinks.sink0.buffer]
type = "memory"
max_events = 500
when_full = "block"
```
```console
$ ./vector --version
vector 0.24.0 (x86_64-unknown-linux-gnu debug=full)
$ ./vector -c ./config.toml
2022-08-25T19:42:15.645932Z  INFO vector::app: Log level is enabled. level="vector=info,codec=info,vrl=info,file_source=info,tower_limit=trace,rdkafka=info,buffers=info,kube=info"
2022-08-25T19:42:15.646972Z  INFO vector::app: Loading configs. paths=["config.toml"]
2022-08-25T19:42:15.721984Z  INFO vector::topology::running: Running healthchecks.
2022-08-25T19:42:15.722456Z  INFO vector::topology::builder: Healthcheck: Passed.
2022-08-25T19:42:15.723060Z  INFO vector: Vector has started. debug="true" version="0.24.0" arch="x86_64" build_id="none"
2022-08-25T19:42:15.723277Z  INFO vector::app: API is disabled, enable by setting `api.enabled` to `true` and use commands like `vector top`.
{"message":"req 1 = 10s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:25.749900702Z"}
{"message":"req 2 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:26.770875329Z"}
{"message":"req 3 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:27.813741409Z"}
{"message":"req 4 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:28.857090004Z"}
{"message":"req 5 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:29.875683532Z"}
^C2022-08-25T19:42:30.022671Z  INFO vector: Vector has stopped.
2022-08-25T19:42:30.025055Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="sink0, source0" time_remaining="59 seconds left"
2022-08-25T19:42:30.919854Z  INFO source{component_kind="source" component_id=source0 component_type=http_scrape component_name=source0}: vector::sources::util::http_scrape: Finished sending.
{"message":"req 6 = 1s\n","source_type":"http_scrape","timestamp":"2022-08-25T19:42:30.918315434Z"}
```

@neuronull neuronull changed the title to http scraping with an IntervalStream piles up requests if the server takes > interval_secs to respond Aug 25, 2022
github-merge-queue bot pushed a commit that referenced this issue Jul 24, 2023
…timeouts (#18021)


fixes #14087 
fixes #14132 
fixes #17659

- [x] make target timeout configurable

this builds on what @wjordan did in
#17660

### what's changed
- prometheus scrapes happen concurrently
- requests to targets can time out
- the timeout can be configured (user-facing change)
- a small change in how the HTTP client is instantiated
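The fixed behavior described above can be sketched in asyncio terms (a hypothetical model for illustration, not the actual Rust implementation): each tick spawns its scrape as an independent task with a per-request timeout, so a slow target can neither delay later ticks nor hold a request open forever.

```python
import asyncio

async def fake_request(delay: float) -> str:
    # Stand-in for an HTTP scrape; the target controls the delay.
    await asyncio.sleep(delay)
    return f"ok after {delay}s"

async def concurrent_scraper(interval: float, delays: list[float],
                             timeout: float) -> list[str]:
    """Spawn each scrape as its own task on every tick, bounded by a
    per-request timeout, so ticks stay on schedule."""
    async def one(delay: float) -> str:
        try:
            return await asyncio.wait_for(fake_request(delay), timeout)
        except asyncio.TimeoutError:
            return f"timed out after {timeout}s"

    tasks = []
    for delay in delays:
        tasks.append(asyncio.create_task(one(delay)))  # fire and move on
        await asyncio.sleep(interval)                  # next tick on schedule
    return list(await asyncio.gather(*tasks))

# The slow first target (0.5s) exceeds the 0.3s timeout and is cut off;
# the remaining scrapes complete unaffected.
results = asyncio.run(concurrent_scraper(0.1, [0.5, 0.1, 0.1], timeout=0.3))
print(results)
```

The interval loop here never waits on a response, which is the essential difference from the blocking reproduction earlier in the thread.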

---------

Co-authored-by: Doug Smith <[email protected]>
Co-authored-by: Stephen Wakely <[email protected]>