Implement a scalar function for creating ScalarValue::Map #11268
take
We can follow the syntax from DuckDB: https://duckdb.org/docs/sql/data_types/map.html
I like this idea very much. This is also consistent with how similar syntax is already handled. Here is the existing syntax support:

datafusion/datafusion/functions/src/core/planner.rs, lines 27 to 40 in b46d5b7
Thanks @alamb @jayzhan211
It's a good idea. I also considered how to implement SQL syntax to support this scalar function, but I had no idea how to do it. It seems I will try to implement that syntax as well. Besides the SQL syntax, I'm considering what the scalar function should look like. Should we consider the data layout of the resulting MapArray?
The function would be used like this:

If we don't need to consider it, we could allow only one element for each of the keys and values. The layout would then be simpler.
What do you think?
I think following the model for make_array / struct makes sense.

From my reading of https://docs.rs/arrow/latest/arrow/array/struct.MapArray.html, the idea is that each element of a MapArray is a list of key/value pairs. I think we could use https://docs.rs/arrow/latest/arrow/array/builder/struct.MapBuilder.html to create the actual array/value.

Syntax Proposal

Maybe we could use a function like:

-- create {'foo': 1, 'bar': 2}
select make_map('foo', 1, 'bar', 2);

Implementation suggestion

I recommend doing this in 2 PRs: first the scalar function, then the SQL syntax / map literal support.

Notes / caveats

I am not sure how complete the MapArray support in arrow-rs is, so it wouldn't surprise me if we hit various unimplemented errors when we try to do basic operations like comparing maps or sorting them. I don't think that is a reason not to proceed here, but I wanted to point it out.
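For reference, a minimal sketch (not from the issue) of what building {'foo': 1, 'bar': 2} with arrow-rs's MapBuilder could look like; Utf8 keys and Int32 values are chosen here purely for illustration:

use arrow::array::{Int32Builder, MapArray, MapBuilder, StringBuilder};

// Build a MapArray with a single element equivalent to {'foo': 1, 'bar': 2}.
fn build_example_map() -> MapArray {
    let mut builder = MapBuilder::new(None, StringBuilder::new(), Int32Builder::new());
    builder.keys().append_value("foo");
    builder.values().append_value(1);
    builder.keys().append_value("bar");
    builder.values().append_value(2);
    builder.append(true).unwrap(); // close the current map element (true = non-null)
    builder.finish()
}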
I finished a draft function yesterday that follows my original design. I followed how arrow-rs creates a map from strings. It can be used like the DuckDB syntax that accepts two lists to construct a map (in my design, it could accept many key-value array pairs):

SELECT MAP(['key1', 'key2', 'key3'], [10, 20, 30]);

It may not fit the syntax you propose, so I will follow the make_map syntax as well. The reason I didn't use MapBuilder is discussed below.
I agree with you. I will implement the function in this issue and have a follow-up one for the map literal.
Thanks for pointing out this problem. If I run into this kind of issue, I might implement the required functions on the DataFusion side first.
The datatypes of the key and value should be known beforehand.
Indeed. I'll try to use MapBuilder. Thanks :)
By the way, there is probably a downside to using MapBuilder: we need to create a different builder for each type (and therefore many macros), which easily causes code bloat. We had a similar issue when building the array functions (#7988). I'm not sure what the best approach is yet (the current draft function is probably the best), but it's worth exploring.
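To illustrate the concern, a hypothetical sketch (not from the draft) of the per-type dispatch MapBuilder requires: the concrete key/value builder types are part of MapBuilder's generic signature, so every supported (key type, value type) combination needs its own arm, which real code usually generates with macros.

use arrow::array::{ArrayRef, Int32Builder, Int64Builder, MapBuilder, StringBuilder};
use arrow::datatypes::DataType;
use std::sync::Arc;

// Hypothetical helper, for illustration only: choose concrete builders based
// on the runtime key/value DataTypes and return an empty MapArray.
fn build_empty_map(key_type: &DataType, value_type: &DataType) -> Option<ArrayRef> {
    match (key_type, value_type) {
        (DataType::Utf8, DataType::Int32) => {
            let mut b = MapBuilder::new(None, StringBuilder::new(), Int32Builder::new());
            Some(Arc::new(b.finish()) as ArrayRef)
        }
        (DataType::Utf8, DataType::Int64) => {
            let mut b = MapBuilder::new(None, StringBuilder::new(), Int64Builder::new());
            Some(Arc::new(b.finish()) as ArrayRef)
        }
        // ... every additional type combination adds another arm
        _ => None,
    }
}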
Besides the code bloat, I'm also concerned about performance, so I did some benchmarks to compare the two versions (the manually constructed one and the MapBuilder one). I haven't tried any optimizations for either of them. The results are as follows:

// The first run is for warm-up
Generating 1 random key-value pair
Time elapsed in make_map() is: 449.737µs
Time elapsed in make_map_batch() is: 129.096µs
Generating 10 random key-value pairs
Time elapsed in make_map() is: 24.932µs
Time elapsed in make_map_batch() is: 18.007µs
Generating 50 random key-value pairs
Time elapsed in make_map() is: 34.587µs
Time elapsed in make_map_batch() is: 17.037µs
Generating 100 random key-value pairs
Time elapsed in make_map() is: 66.02µs
Time elapsed in make_map_batch() is: 19.027µs
Generating 500 random key-value pairs
Time elapsed in make_map() is: 586.476µs
Time elapsed in make_map_batch() is: 63.832µs
Generating 1000 random key-value pairs
Time elapsed in make_map() is: 722.771µs
Time elapsed in make_map_batch() is: 47.455µs

The scenario for this function may not involve large maps, but I think the difference is still worth considering. What do you think? @alamb @jayzhan211
Update: I tried removing the clone in the MapBuilder version; it still seems slower than the manually constructed one.
Thanks for your effort. I think the benchmark results make sense.

However, I found it hard to provide a single function that covers both the key-value-pair style and the two-list style.

In conclusion, I will provide two functions: a user-friendly one that accepts key-value pairs (like make_map('foo', 1, 'bar', 2)), and one that accepts two lists (like MAP(['key1', 'key2', 'key3'], [10, 20, 30])).

For the user-friendly one, I'll choose the first solution to keep the codebase simpler. We could suggest that users who want better performance, or who need to create a large map, use the two-list version.

Some appendices for the benchmark results:
What is your code for benchmarking? Given the code, it seems you clone the ColumnarValues and push them into a VecDeque. We can avoid the clone and use Vec instead of VecDeque, since we don't need to pop from the front.

fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
let mut key_buffer = VecDeque::new();
let mut value_buffer = VecDeque::new();
args.chunks_exact(2)
.enumerate()
.for_each(|(_, chunk)| {
let key = chunk[0].clone();
let value = chunk[1].clone();
key_buffer.push_back(key);
value_buffer.push_back(value);
});
let key: Vec<_> = key_buffer.into();
let value: Vec<_> = value_buffer.into();
let key = ColumnarValue::values_to_arrays(&key)?;
let value = ColumnarValue::values_to_arrays(&value)?;
make_map_batch(&[key[0].clone(), value[0].clone()])
}

You only take the first element, but shouldn't we process all the key/value pairs?

make_map_batch(&[key[0].clone(), value[0].clone()])
Here's my testing code.
Sounds great. I will try it.
Oops, that's unfinished work that I forgot about. Yes, you're right.

I think I need to concat them into one key array and one value array.
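A minimal sketch of that direction, reusing the make_map_batch helper from the draft above (its exact signature is assumed from the snippet, not confirmed): convert the arguments once, split keys and values without a VecDeque or extra ColumnarValue clones, and concat them into one key array and one value array.

use arrow::array::Array;
use arrow::compute::concat;
use datafusion_common::Result;
use datafusion_expr::ColumnarValue;

fn make_map_from(args: &[ColumnarValue]) -> Result<ColumnarValue> {
    // Convert every argument to an array once.
    let arrays = ColumnarValue::values_to_arrays(args)?;
    // Keys sit at even positions, values at odd positions.
    let keys: Vec<&dyn Array> = arrays.iter().step_by(2).map(|a| a.as_ref()).collect();
    let values: Vec<&dyn Array> = arrays.iter().skip(1).step_by(2).map(|a| a.as_ref()).collect();
    // Concat into a single key array and a single value array so that every
    // key/value pair is processed, not just the first one.
    let key = concat(&keys)?;
    let value = concat(&values)?;
    // make_map_batch is the helper from the draft above.
    make_map_batch(&[key, value])
}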
You can also do the benchmarking with cargo bench. Add this to Cargo.toml, and run with cargo bench --bench map:

[[bench]]
harness = false
name = "map"
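For example, a minimal benches/map.rs skeleton could look like this; the benchmark name and the call into the map function are placeholders:

use criterion::{criterion_group, criterion_main, Criterion};

fn bench_map(c: &mut Criterion) {
    c.bench_function("make_map 1000 entries", |b| {
        b.iter(|| {
            // invoke the scalar function under test here with
            // pre-generated key/value arguments
        })
    });
}

criterion_group!(benches, bench_map);
criterion_main!(benches);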
Is your feature request related to a problem or challenge?

After #11224, we have ScalarValue::Map for MapArray. It's better to have a scalar function for it.

Describe the solution you'd like

It would be similar to the make_array and struct functions. I guess it will accept two arrays to construct a map value. Something like:

Describe alternatives you've considered

No response

Additional context

No response