Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stream Processing converts strings to numbers #2498

Closed
angristan opened this issue Aug 25, 2020 · 9 comments · Fixed by #7005
Closed

Stream Processing converts strings to numbers #2498

angristan opened this issue Aug 25, 2020 · 9 comments · Fixed by #7005
Labels

Comments

@angristan
Copy link

The issue

Version 1.4.6 - Ubuntu 18.04

We're using fluent-bit to process some Haproxy logs. Using [PARSER], we extract some fields using a regex. By looking at #310, it seems these fields are typeable, and strings by defaults.

The output of [PARSER] looks like this:

[1598358837.749001000, {"pri"=>"134", "datetime"=>"2020-08-25T12:33:57.749001+00:00", "host"=>"xxx", "ident"=>"xxx", "client_ip"=>"-", "remote_address"=>"-", "time"=>"25/Aug/2020/12/33/57", "request_type"=>"HEAD", "request_path"=>"xxx", "user"=>"xxx", "name"=>"xxx", "protocol"=>"HTTP/1.0", "return_code"=>"204", "referer"=>"-", "user_agent"=>"-", "request_time"=>"0.0085", "log_info"=>"-", "request_start_time"=>"1598358837.739902973", "request_end_time"=>"1598358837.748387098"}]

As we can, see, these are all strings.

We use a simple stream task:

[STREAM_TASK]
    Name http_count
    Exec CREATE STREAM http_count WITH (tag='http_count') AS SELECT name, user, request_type, return_code, COUNT(*) FROM STREAM:syslog.0 WINDOW TUMBLING (10 MINUTE) GROUP BY name, user, request_type, return_code;

Looking at the output, everything seems fine:

[1598278228.540267575, {"name"=>"foo", "user"=>"xxx", "request_type"=>"HEAD", "return_code"=>204, "COUNT(*)"=>6232}]

But we can already see than return_code has been transformed from a string to an int. We didn't notice because this was convenient for us: we output this in Elasticsearch and when querying it, we would get a number in the JSON response.

However, it became an issue for another field, name, which should be a string. Below, the name should be "12345.67890", a valid name in our system, but fluent-bit transformed it to a float, which we can't process anymore because of the approximation.

[1598278228.539670468, {"name"=>12345.678711, "user"=>"xxx", "request_type"=>"DELETE", "return_code"=>204, "COUNT(*)"=>4}]

Looking at the output right before stream processing, after the [PARSER], we get "name"=>"1234567890" or "name"=>"12345.67890", which means integers and floats are not converted at this point, which means stream processing is the culprit.

After looking around the code, here's the problem:

/*
* Convert a msgpack object value to a number 'if possible'. The conversion
* result is either stored on 'i' for 64 bits integers or in 'd' for
* float/doubles.
*
* This function aims to take care of strings representing a value too.
*/
static int object_to_number(msgpack_object obj, int64_t *i, double *d)
{
int ret;
int64_t i_out;
double d_out;
char str_num[20];
if (obj.type == MSGPACK_OBJECT_POSITIVE_INTEGER ||
obj.type == MSGPACK_OBJECT_NEGATIVE_INTEGER) {
*i = obj.via.i64;
return FLB_STR_INT;
}
else if (obj.type == MSGPACK_OBJECT_FLOAT32 ||
obj.type == MSGPACK_OBJECT_FLOAT) {
*d = obj.via.f64;
return FLB_STR_FLOAT;
}
else if (obj.type == MSGPACK_OBJECT_STR) {
/* A numeric representation of a string should not exceed 19 chars */
if (obj.via.str.size > 19) {
return -1;
}
memcpy(str_num, obj.via.str.ptr, obj.via.str.size);
str_num[obj.via.str.size] = '\0';
ret = string_to_number(str_num, obj.via.str.size,
&i_out, &d_out);
if (ret == FLB_STR_FLOAT) {
*d = d_out;
return FLB_STR_FLOAT;
}
else if (ret == FLB_STR_INT) {
*i = i_out;
return FLB_STR_INT;
}
}
return -1;
}

Strings are automatically converted to numbers when possible, which we don't want.

Our patch

We patched our fluent-bit system here:

ret = string_to_number(str_num, obj.via.str.size,
&i_out, &d_out);
if (ret == FLB_STR_FLOAT) {
*d = d_out;
return FLB_STR_FLOAT;
}
else if (ret == FLB_STR_INT) {
*i = i_out;
return FLB_STR_INT;
}

To bypass string conversion (quick & dirty):

diff --git a/src/stream_processor/flb_sp.c b/src/stream_processor/flb_sp.c
index 00c280df..3d3c0ed7 100644
--- a/src/stream_processor/flb_sp.c
+++ b/src/stream_processor/flb_sp.c
@@ -369,9 +369,9 @@ static int object_to_number(msgpack_object obj, int64_t *i, double *d)
         memcpy(str_num, obj.via.str.ptr, obj.via.str.size);
         str_num[obj.via.str.size] = '\0';

-        ret = string_to_number(str_num, obj.via.str.size,
-                               &i_out, &d_out);
+        // Bypass string to number conversion
+        ret = -1;
         if (ret == FLB_STR_FLOAT) {
             *d = d_out;
             return FLB_STR_FLOAT;
         }

And now we actually type the fields we need to not be strings by using the Types feature (#310) (<- btw documentation is gone)

Types return_code:integer request_time:float bytes_recvd:integer bytes_sent:integer

Now, types are correctly preserved:

[1598278228.539670469, {"name"=>"12345.67890", "user"=>"xxx", "request_type"=>"DELETE", "return_code"=>204, "COUNT(*)"=>4}]

(return_code is still an int, as it was in my first snippet, but that's because we type it during the parsing, so we expect it to be an int)

Long-term solution

The automatic string-to-number feature in stream processing by itself makes sense, however, since we can type fields earlier in the pipeline using [PARSER], I'm not sure it is useful, on the contrary.

My understanding is that it's probably been overlooked when the stream processing feature was introduced?

Based on our use-case, I would have sent a PR deleting all the code related to that feature, but I lack knowledge and historical context so I would like the maintainers' input on this. I think it should at least be optional/configurable (it isn't, right?). We maintain our patched build of fluent-bit so we're fine as it is with the one-line patch I added above, but it would be great to fix this behavior upstream.

Hopefully we provided enough details, let us know if we missed something!

Thanks for fluent-bit 🙏

@koleini
Copy link
Collaborator

koleini commented Aug 25, 2020

Hi @angristan, that was an intention in the initial design of Stream Processor. The main motivation was the assumption that clients may mistakenly use stringified numbers in records (for example in JSON objects), and such string-to-int conversion would be helpful in general and allows applying mathematical computations on such fields (similar to return_code in your record).

However, in your case, name would be an exception showing a problem that can occur with such conversion.

@edsiper What do you think?

@angristan
Copy link
Author

I see! It makes total sense, unfortunately I hit an edge case 🙂

@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Apr 13, 2021
@angristan
Copy link
Author

angristan commented Apr 13, 2021

.

@github-actions github-actions bot removed the Stale label Apr 14, 2021
@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label May 16, 2021
@angristan
Copy link
Author

.

@github-actions github-actions bot removed the Stale label May 17, 2021
@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the Stale label Jun 17, 2021
@github-actions
Copy link
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@GLStephen
Copy link

GLStephen commented Jun 21, 2022

@koleini This should be looked at again. Items like short guids, for example "39hy3a" are being converted to 39 by this code. Which is entirely incorrect. This magic conversion is pretty bad in the stream/SQL world. An explicit cast function, even one which supports some form of duck typing would be far more functional. I'm trying to group by short guids and they are getting converted all over the place, which is weird.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants