Should we drop support for INTs? #3479

Closed
pauldix opened this issue Jul 27, 2015 · 18 comments

Comments

@pauldix
Member

pauldix commented Jul 27, 2015

We're considering dropping support for INT types. These have been the source of many bugs, which we can fix, but more importantly, they've been a bit of a usability headache. People are having trouble writing libraries that actually force float data types when that's what they want. Thus they're getting errors about the field type and reporting these as bugs in our code, which in this case isn't true.

We could remedy this by writing client libraries ourselves since we know the protocol for writing data, but it's problematic and we're unlikely to have the bandwidth to write all those libraries any time soon.

Given the range of values that can be represented with float64s, I'm at a bit of a loss for what we'd need the int types for. I think the Prometheus team have a pretty good argument: http://prometheus.io/docs/introduction/faq/#why-are-all-sample-values-64-bit-floats-i-want-integers

The reason I'm asking is because we promised no breaking changes in the 0.9 line. This would be a breaking change. We'd make sure that databases that had INT types would work, but those values would get cast to float64.

What do people think about this change? Are there use cases where you think an int64 is actually required? I'm interested in hearing people's thoughts.

@chadbrewbaker

Can't tell if joking or serious... I can write succinct encodings of abstract datatypes into int64s. Float? My god I wouldn't want to write those converters.

@beckettsean
Contributor

@chadbrewbaker we are serious, can you give a relevant use case where float64 doesn't meet your needs? Is encoding an abstract datatype something you find essential in a TSDB? Why wouldn't string solve your need just as well as int64?

@chadbrewbaker

  1. I don't trust a lot of client side float code. Especially on embedded devices with custom FPU libraries in firmware.

  2. Strings aren't uniform length. Have to do parallel prefix and whatnot to manipulate them in bulk.

  3. Machine clock error + relativity (light only travels around the earth about 7 times a second). Most clocks I don't trust beyond being a local stopwatch.

For data pipelines I usually like uint64 arrays which you just memcpy or RMA across the wire. Type them appropriately as needed. Learn from the MPI masters. Judiciously reuse their open source code, http://static.msi.umn.edu/tutorial/scicomp/general/MPI/content6.html

@pauldix
Member Author

pauldix commented Jul 27, 2015

@chadbrewbaker I'm not sure what to say about 1. Floats can't be trusted?

For 2, if you're writing strings as ints to get succinct abstract data types, I would suggest one of two things. First, don't bother: use the actual type names in the string. We'll take care of compression for you. Or use tags, which already avoid copying those strings around. Second, if you still want to do it, you know all those strings are ints, so you can just convert them on your end instead of using padded strings.

For 3, I'm not sure what you're talking about. The timestamps in InfluxDB are all int64 nanosecond-precision epochs. This isn't the proposed change. If you wish to store additional timestamps, that's a request for a date/time data type, not an int64.

But even then I have no idea what not trusting clocks has to do with this proposed change. Time is never exact in a distributed system; this is common knowledge and not particularly controversial.

If you want a more efficient wire transfer protocol, that's a separate request. At this point we're not getting to it anytime soon so you'd be best to write a proxy that can handle your efficient representation, which will then submit the correct type values to Influx.

Nothing I'm seeing in this argument is convincing since you're not saying that the loss of precision from int64 to float64 will have an impact.

@majst01
Contributor

majst01 commented Jul 28, 2015

Hi Paul,

I'm fine with that. In influxdb-java I already always write numbers with at least one fraction digit, even if an int was given. This is done because of the troubles you mentioned.

The only concern I have with floats is the loss of accuracy compared to BigDecimals or even int64. A big int64 number converted to a float64 will lose accuracy, and some people will complain about that, I'm sure.

On the other hand, storing BigDecimals (I don't know if a golang equivalent exists) is overkill performance-wise.

Just my $0.02
Stefan

@jcsp

jcsp commented Jul 28, 2015

Hmm, I can see cases where int64 is useful for firing in data from sensors where we don't know the encoding. e.g. if a sensor uses the high bits for something then we'll end up outside the 2^53 range, or if it's giving us different-endian data (and we want to acquire the data opaquely before sorting it out later). Aside from physical sensors, consider hardware counters from chips etc that very much will use all 64 bits.
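
To make the 2^53 limit concrete, here is a minimal Python sketch (an illustration, not InfluxDB code). It assumes IEEE 754 doubles, which is what Python's `float` uses on common platforms; `survives_float64` and `sensor_word` are hypothetical names.

```python
# float64 has a 53-bit significand, so every integer up to 2**53 is exact,
# but larger integers are only exact if their low bits happen to be zero.
LIMIT = 2 ** 53


def survives_float64(n: int) -> bool:
    """Return True if n is unchanged after conversion to float64 and back."""
    return int(float(n)) == n


# Hypothetical sensor word that sets a high flag bit (bit 62) on top of a payload.
sensor_word = (1 << 62) | 0x1234_5678_9ABC_DEF1

print(survives_float64(LIMIT))        # True
print(survives_float64(LIMIT + 1))    # False
print(survives_float64(sensor_word))  # False
```

Any value that uses both the high bits and the low bits of a 64-bit word falls into the last case, which is the opaque-sensor scenario described above.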

If the primary issue is that client libraries are fumbling the types of things and giving ints when not desired, then perhaps keep the int64 support internally and just make sure it only kicks in when a client does something very explicit to request it?

@majst01
Contributor

majst01 commented Jul 28, 2015

This is true especially for routers and such: if an application later wants to calculate rates from the absolute values, the error will grow if counter values are stored as floats.

@haf

haf commented Jul 28, 2015

+1 for keeping the integers; they make life a lot easier by being easy to bit-manipulate, as opposed to floats where implementations vary a lot more. @chadbrewbaker 's 1) and 2) and @jcsp 's cases are valid. Another example of mine: it's easy to program a strictly monotonically increasing integer, but it's less so to program a strictly monotonically increasing float without first converting the float to an int64 and then back.
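
As a rough illustration of the "convert to an int64 and back" step, the following Python sketch steps a float64 to its next larger neighbour by incrementing its bit pattern. `next_float_up` is a hypothetical helper; the trick as written only holds for finite, non-negative IEEE 754 doubles.

```python
import math
import struct


def next_float_up(x: float) -> float:
    """Return the smallest float64 greater than x, for finite x >= 0."""
    if not math.isfinite(x) or x < 0:
        raise ValueError("sketch only handles finite, non-negative values")
    x += 0.0  # turns -0.0 into +0.0 so the bit pattern is non-negative
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # For non-negative doubles, the next bit pattern is the next larger value.
    (result,) = struct.unpack("<d", struct.pack("<q", bits + 1))
    return result


print(next_float_up(1.0))  # 1.0000000000000002
print(next_float_up(0.0))  # 5e-324
```

With a plain integer the same operation is just `n + 1`, which is the asymmetry being pointed out.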

@pauldix
Member Author

pauldix commented Jul 28, 2015

@haf, even if we used a float under the hood, you could use an int in your client code. The only difference is that every number in the line protocol would be converted to a float and stored using that.

The conversion isn't on the client end, it's on our side. And the loss of precision for counters isn't really a concern given the range of values that can still be represented as a float. Like Prometheus' argument that you'd have to increment that counter millions of times a second for hundreds of years. You're going to get a reset first.
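
The arithmetic behind that argument can be sketched in a few lines of Python. This is a back-of-the-envelope estimate, assuming the counter starts at zero and that exactness is lost once it passes 2^53; `years_until_inexact` and the link speed are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
EXACT_INT_LIMIT = 2 ** 53  # largest range in which float64 holds every integer


def years_until_inexact(increments_per_second: float) -> float:
    """Years until a counter growing at the given rate exceeds 2**53."""
    return EXACT_INT_LIMIT / increments_per_second / SECONDS_PER_YEAR


# A counter incremented one million times per second.
print(years_until_inexact(1_000_000))        # roughly 285 years

# A byte counter on a hypothetical, fully saturated 10 Gbit/s link.
bytes_per_second = 1.25e9
print(years_until_inexact(bytes_per_second) * 365.25)  # roughly 83 days
```

The first figure matches the "hundreds of years" claim for counters that step by one. Counters that step by bytes move much faster, which is relevant to the router case raised earlier in the thread.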

The rates of change all still work with float64 on our end.

The one thing I've seen valid so far in this thread is that hardware sensors might use the higher order bits on ints to do something different. Is that from actual experience that it's something people do?

Here's another option. We update the line protocol so that if you want an int for a field value, you follow the number with an i. Something like:

some_counter,host=serverA value=23i 1438039663000000000

Then any field value without an i is always parsed as a float64.
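
A minimal Python sketch of how a parser might apply this rule to a single field value. This is not the actual InfluxDB parser; `parse_numeric_field` is a hypothetical helper, and the line splitting ignores escaping, multiple fields, and non-numeric values.

```python
from typing import Union


def parse_numeric_field(text: str) -> Union[int, float]:
    """Parse a numeric field value: a trailing 'i' means int, otherwise float."""
    if text.endswith("i"):
        return int(text[:-1])  # explicit integer, e.g. "23i"
    return float(text)         # everything else is a float64, e.g. "23" or "1.2"


line = "some_counter,host=serverA value=23i 1438039663000000000"
_series, fields, _timestamp = line.split(" ")
name, raw = fields.split("=")

print(name, repr(parse_numeric_field(raw)))  # value 23
print(repr(parse_numeric_field("23")))       # 23.0
```

Under this rule a client that sends a bare `0` followed by `1.2` gets floats both times, so the type conflict only arises when the client explicitly asks for an int.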

@jcsp

jcsp commented Jul 28, 2015

Requiring explicit trailing 'i' for ints and otherwise defaulting to floats sounds perfectly sensible to me.

@haf

haf commented Jul 28, 2015

Who says the ADT always is a counter in the range of natural numbers?

That only counts ones?

It could be any ADT that relies on the property of static monotonicity. That + the fact that you can represent the floats easily with ints but not the other way around makes keeping ints a compelling argument. You don't need a paper on how to work with ints, but you probably need one to work with floats – hence the previous comment about not trusting client libraries from @chadbrewbaker

Sample ADTs (not necessarily the best examples though): BigInteger values? 128 bit decimals? Interval Tree Clocks? A counter of network traffic for an internet router that counts in increments of terabytes with a stored value of #bytes? E.g. "my backup SaaS company" has ingested 1 PiB this month, and now 512 KiB more:

> float (1024I**5) + 512. * 1024.**1.;;
val it : float = 1.125899907e+15
> float (1024I**5);;
val it : float = 1.125899907e+15

Now my counters don't update anymore. ;)
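
A quick check in Python suggests the two values above are in fact different doubles that merely print the same. This is a sketch, assuming IEEE 754 doubles and that the F# prompt rounds its display to ten significant digits: 2^50 + 2^19 fits comfortably in a 53-bit significand.

```python
base = float(1024 ** 5)         # 1 PiB in bytes, i.e. 2**50
bumped = base + 512.0 * 1024.0  # plus 512 KiB, i.e. 2**19

# Rounded to ten significant digits, both print identically.
print(f"{base:.10g}", f"{bumped:.10g}")  # 1.125899907e+15 1.125899907e+15

# At full precision the increment is preserved exactly.
print(bumped - base)                     # 524288.0

# Increments are only lost once they are smaller than the float spacing,
# for example adding 1 at 2**53.
print(float(2 ** 53) + 1.0 == float(2 ** 53))  # True
```

So this particular counter would keep updating. The stalled-counter effect appears only once the counter grows past 2^53, or when the increment is small relative to the counter's magnitude.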

@myurko2

myurko2 commented Jul 28, 2015

My main use for Influx requires that I store many nanosecond timestamps as a value in the DB. Using a float64 would limit me to ~us precision which isn't sufficient. If I were forced to use float64, then I would be forced to drop Influx.
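
The precision loss can be estimated with a short Python sketch (requires Python 3.9+ for `math.ulp`). The timestamp is a hypothetical value from July 2015, and IEEE 754 doubles are assumed.

```python
import math

ts_ns = 1438039663000000001  # hypothetical nanosecond epoch timestamp
as_float = float(ts_ns)

# Spacing between adjacent float64 values at this magnitude, in nanoseconds.
print(math.ulp(as_float))    # 256.0

# Error introduced by the round trip; at most half the spacing.
print(int(as_float) - ts_ns)
```

At current epoch magnitudes a float64 can only resolve steps of 256 ns, about a quarter of a microsecond, which is in line with the estimate above.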

@samhatchett

We use the int field to store OPC quality codes, which are 8-bit masks. However I think client-side casting can handle this need just fine.

@pauldix
Member Author

pauldix commented Jul 29, 2015

After the reaction on this thread I'm thinking the best plan is the one I outlined in my last comment: update the line protocol to require a trailing i after any int, with all other numbers assumed to be floats.

Then we need to fix the outstanding bugs associated with using ints and we should be good to go.

This will give us the best of both worlds. It'll fix the usability problem with the API that is causing a bunch of people pain, but it will keep support for ints, which many people want.

@jprichardson

I think most are going to tell you to keep integer support. Perhaps this is harsh, but few understand what it takes to build, maintain, and ship quality software, especially if it isn't their software to build, maintain, and ship.

I think in the short-term you need to optimize for stability, simplicity, and production-level quality. Whichever decision satisfies that quicker, is what you should do.

@Adrien-P

Thanks pauldix for addressing this issue.
I'm just starting with InfluxDB. I first sent the value 0 to a non-existent series and then tried to send the value 1.2. It failed (a conflict, because 1.2 is a float whereas the series is expecting an int).
Personally I do not need the int type, and I can't judge whether people really need it or not.
Tell me if I'm wrong, but it looks like InfluxDB infers the type you want for your series from the type of the first point inserted. If this is how it works, then it's very convenient that you don't have to explicitly define the type, but it is also the reason for the issue with int and float types.
The only non-breaking change I can think of to solve this issue is to allow an optional field that defines the type of the series you want. Not sure how it would fit in the current protocol though...
Good luck!

@pauldix
Member Author

pauldix commented Jul 30, 2015

Ok, it's official, we're keeping integer support but modifying the line protocol slightly to improve usability: #3519. It'll be in the 0.9.3 release. Thanks for the feedback everyone.

@pauldix pauldix closed this as completed Jul 30, 2015
@pauldix
Member Author

pauldix commented Jul 30, 2015

@Adrien-P That's right, we set the type based on the first value written in. The issue with having mixed types for a given field in a measurement is that we end up having to cast them when doing aggregations, which could yield unexpected results.
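
The first-write-wins behaviour can be modelled with a small Python sketch. This is an illustration only, not InfluxDB's implementation; `FieldTypes` and `FieldTypeConflict` are hypothetical names.

```python
class FieldTypeConflict(Exception):
    """Raised when a write disagrees with the type already set for a field."""


class FieldTypes:
    """Remembers the type of the first value written to each field."""

    def __init__(self) -> None:
        self._types = {}

    def check(self, measurement: str, field: str, value) -> None:
        key = (measurement, field)
        # The first write fixes the type; later writes must match it.
        expected = self._types.setdefault(key, type(value))
        if type(value) is not expected:
            raise FieldTypeConflict(
                f"{measurement}.{field} is {expected.__name__}, "
                f"got {type(value).__name__}"
            )


registry = FieldTypes()
registry.check("cpu", "value", 0)        # first write: field becomes int
try:
    registry.check("cpu", "value", 1.2)  # second write is a float
except FieldTypeConflict as err:
    print("rejected:", err)
```

This reproduces the 0 then 1.2 scenario reported above: the integer-looking first point locks the field to int, and the later float is rejected.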


10 participants