Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cmd/influx/write): allow to limit write rate #19660

Merged
merged 2 commits into from
Oct 5, 2020
Merged

Conversation

sranka
Copy link
Contributor

@sranka sranka commented Sep 29, 2020

Closes #19335

This PR adds a new flag to influx write :

--rate-limit string      Throttles write, examples: "5 MB / 5 min" , "17kBs". "" (default) disables throttling.
  • the rate is specified as: COUNT(B|kB|MB)/TIME(s|sec|m|min) with / and TIME being optional; COUNT is a decimal number, TIME is a positive whole number; spaces in the value are ignored.
    • "5MB / 5min" matches the current free tier in the cloud, it can be also expressed as 17476.266666667Bs, 1MB/1min, 1MB/min, 1MBmin or 1MBm
  • when an invalid rate is supplied, the tool prints out the format and also an exact regular expression to recognize format
  • --rate-limit can be also used in influx write dryrun

@sranka sranka marked this pull request as ready for review September 29, 2020 11:20
Copy link
Contributor

@stuartcarnie stuartcarnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I've added some comments to fix a few nits.

CHANGELOG.md Outdated Show resolved Hide resolved
@@ -89,6 +91,7 @@ func cmdWrite(f *globalFlags, opt genericCLIOpts) *cobra.Command {
cmd.PersistentFlags().MarkHidden("xIgnoreDataTypeInColumnName") // should be used only upon explicit advice
cmd.PersistentFlags().StringVar(&writeFlags.Encoding, "encoding", "UTF-8", "Character encoding of input files or stdin")
cmd.PersistentFlags().StringVar(&writeFlags.ErrorsFile, "errors-file", "", "The path to the file to write rejected rows to")
cmd.PersistentFlags().Float64Var(&writeFlags.LimitRate, "limit-rate", 0.0, "How many megabytes per minute the write will allow. Defaults to zero, which disables throttling.")
Copy link
Contributor

@stuartcarnie stuartcarnie Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to call this rate-limit, which is more conventional

Copy link
Contributor

@stuartcarnie stuartcarnie Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @glinton's suggestion to use a KB/s as a better metric

Copy link
Contributor Author

@sranka sranka Sep 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was apparently defocused herein, sorry for that. It is now named --rate-limit. I am using MB/minute to be consistent with InfluxDB pricing options. MB/min better supports end-user experience. so that a calculator is not required to run the tool. I agree that kB/s would be better from the technical perspective, but I am not internally convinced to change it that way. I will follow any explicit advice herein. Can you please confirm that you still want kB/s ?

CHANGELOG.md Outdated Show resolved Hide resolved
@@ -240,6 +243,13 @@ func (writeFlags *writeFlagsType) createLineReader(ctx context.Context, cmd *cob
csvReader.RowSkipped = rowSkippedListener
r = csvReader
}
// throttle reader if requested
if writeFlags.LimitRate > 0.0 {
throttledReader := shapeio.NewReaderWithContext(r, ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simple library doesn't actually appear to perform proper rate limiting. It performs a full read before pausing to allow subsequent reads, which themselves could also be above the specified rate limit.

This example clearly shows the failure to rate limit.

	w := shapeio.NewWriter(os.Stdout)
	w.SetRateLimit(1)
	w.Write([]byte("this is a test that should take some time"))

Though, I suppose this could work if bursting is allowed in the api 🤔

Copy link
Contributor Author

@sranka sranka Sep 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reader implementation reads n bytes and then it waits until these bytes are allowed to read in compliance with the rate limit. The fact that the sources can be read faster does not matter much, the data that are sent to influxd are copied from a properly throttled reader.

The burst property of the rate limiter does not influence the result. It could only if the count of bytes that are read at once would be more than a billion, but the size of internal read buffers is much lower (4096).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you confirm that the api rate limits by average? if we have a hard restriction to 1mb/minute and want to send 5mb of data, does our api require clients to send 1mb every minute or does it allow sending 5mb every 5 minutes. If the latter is true, this library will work, otherwise it will not as mentioned previously.

When running the following, i'd expect one byte (character) to be printed each second. in reality i get nothing for 5 seconds followed by all 5 characters immediately. on average, the rate limits was respected, but if the limits is strict and doesn't allow "bursting", this library/method will not sufficiently rate limit.


r := shapeio.NewReader(bytes.NewBufferString("howdy"))
r.SetRateLimit(1)
io.Copy(os.Stdout, r)
time go run thing.go 
# howdy
# real	0m5.245s
# user	0m0.276s
# sys	0m0.161s

Copy link
Contributor Author

@sranka sranka Oct 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current implementation converts MB/min to bytes/sec and then ensures that the rate limit is never exceeded in every second. Assuming a limit of 5MB/5min, it does not allow to send 5MB quicker than in 5 minutes, it allows ~17.5kB every second. I don't consider that this is essentially wrong if the ultimate goal is to avoid "429 - limit exceeded". The tool can be now used repetitively in scripts to safely import more files. A refill period of 5 minutes (instead of 1 second) could be implemented, but that would certainly cause problems if the tool is executed repetitively. I could possibly implement a technique that would allow specifying the rate using both max size (5MB) and unit (5min), but that would require much more time to implement and test.

Even better alternative/additive (to specifying the rate) would be to react upon HTTP 429 errors during writing to InfluxDB and retry sending the message after the time specified in the server response (Retry-After HTTP header), a retry strategy would have to be implemented influx write.

The implementation provided herein is both quite simple for users and quite simple to implement and test, even though it does not provide the best throughput it possibly could. It seems better than '--pps' that is provided in influx v1. Maybe that I am lost in translation, can you please express your expectations for this PR to get approved?

Regarding the example that you provided, it is true, but the situation is more complex in this context. The basic unit that can be sent to the server is a protocol line. If you provide a CSV input (including a flux result CSV), the reader will not emit more than a single protocol line. More than a single line could be provided when reading protocol lines on input, so I rather improved the implementation with LineReader to avoid bigger pauses than a fluent throttling really requires. The writer part (write/Batcher) then optimizes the network traffic by sending chunks of lines as soon as they extend 500kB in total. It IMHO does not matter much if that 500kB are delivered in smaller or bigger pieces, the rate will be +- the same when measured on input as a whole.

@@ -89,6 +91,7 @@ func cmdWrite(f *globalFlags, opt genericCLIOpts) *cobra.Command {
cmd.PersistentFlags().MarkHidden("xIgnoreDataTypeInColumnName") // should be used only upon explicit advice
cmd.PersistentFlags().StringVar(&writeFlags.Encoding, "encoding", "UTF-8", "Character encoding of input files or stdin")
cmd.PersistentFlags().StringVar(&writeFlags.ErrorsFile, "errors-file", "", "The path to the file to write rejected rows to")
cmd.PersistentFlags().Float64Var(&writeFlags.LimitRate, "limit-rate", 0.0, "How many megabytes per minute the write will allow. Defaults to zero, which disables throttling.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for megabytes per minute, that seems like a rather uncommon measurement? Maybe use [kilo]bytes/sec?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The megabytes per minute limit was done to match the published rate limit:
https://docs.influxdata.com/influxdb/v2.0/account-management/pricing-plans/#usage-based-plan-rate-limits

Data In: 300MB every 5 minutes

https://docs.influxdata.com/influxdb/v2.0/account-management/pricing-plans/#free-plan-rate-limits

Data In: 5.1MB every 5 minutes

Copy link
Contributor

@glinton glinton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the goal to match the marketing specified rates (MB/min), but it still feels "wrong" as it will require a change to the flag/internals if the limits or the way they are conveyed change. If product is good with leaving it MB/min, so am i.

@sranka
Copy link
Contributor Author

sranka commented Oct 1, 2020

@stuartcarnie Can you please approve this PR to confirm that the requested changes are addressed?

@sranka
Copy link
Contributor Author

sranka commented Oct 1, 2020

Thank you @glinton for your review. I understand that MB/min is "weird", it is to me as well. I think that I will rather invest a bit more time herein. The user will have to specify the rate together with a unit, for example "5 MB / 5 min" or 17kBs. This is not hard to implement or test using a simple regexp to match /(floatNumber)(MB|kB|B)\/?(intNumber)?(m(in)?|s(ec)?)/g with spaces ignored.

Copy link
Contributor

@stuartcarnie stuartcarnie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

Based on prior discussion, it appears the decision to use a GB/min rate-limit is in line with how Cloud publishes rates.

In the future, we can consider adding a --rate-interval option, so that we can specify different time periods.

@sranka sranka merged commit a8c7254 into master Oct 5, 2020
@sranka sranka deleted the 19335/rate_limit branch October 5, 2020 15:01
sranka added a commit to sranka/docs-v2 that referenced this pull request Oct 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add -pps argument to influx write CLI
4 participants