Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow concat_batches to take non owned RecordBatch #3456

Closed
alamb opened this issue Jan 4, 2023 · 3 comments · Fixed by #3480
Closed

Allow concat_batches to take non owned RecordBatch #3456

alamb opened this issue Jan 4, 2023 · 3 comments · Fixed by #3480
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@alamb
Copy link
Contributor

alamb commented Jan 4, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We have code that has Vec<&RecordBatch> (not Vec<RecordBatch>) and thus we can't call concat_batches as that requires taking ownership. I looked at the code for concat_batches and it is not at all clear why it needs ownership of the RecordBatch

https://docs.rs/arrow/30.0.0/arrow/compute/fn.concat_batches.html

Describe the solution you'd like
I would like some way to call concat_batches that doesn't require an bunch of owned RecordBatches

Perhaps something like this would work:

pub fn concat_batches(
    schema: &Arc<Schema>,
    batches: impl IntoIter<Item = &RecordBatch>
) -> Result<RecordBatch, ArrowError>

And existing callers could call concat_batches(&batches)

Describe alternatives you've considered

Additional context
in a DataFusion PR: apache/datafusion#4777 (comment)

@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jan 4, 2023
@askoa
Copy link
Contributor

askoa commented Jan 6, 2023

I can pick this up. I briefly looked into this. concat_batches uses iter() in multiple places in the function. It also passes the iter() to another function. into_iter() is consuming and hence cannot be used multiple times within the function.

I am thinking of creating a Vec using the into_iter() as the first step in the function and then use that Vec's iter() in the rest of the function. Let me know if you have different suggestion.

Edit: One more question is concat_batches part of public api?

@alamb
Copy link
Contributor Author

alamb commented Jan 6, 2023

I think collecting references into a Vec<&RecordBatch> seems like a very reasonable step to me

@tustvold
Copy link
Contributor

label_issue.py automatically added labels {'arrow'} from #3480

@tustvold tustvold added the arrow Changes to the arrow crate label Jan 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants