-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-44246: [JS] Add a lightweight array view to Table #44247
Conversation
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great. Would you also add this to vectors or if not, why only tables?
/** | ||
* Returns a string representation of the Table rows. | ||
* | ||
* @returns A string representation of the Table rows. | ||
*/ | ||
public toString() { | ||
return `[\n ${this.toArray().join(',\n ')}\n]`; | ||
return `[\n ${this.toArrayView().join(',\n ')}\n]`; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are stringifying the whole table, is it worth using an array view here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need an array (or array-like) to use join()
. This was creating an array, and with toArrayView()
this now just creates a proxy. Or did I misunderstand your comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the difference but I wonder whether this change introduces a performance regression since "the proxy adds some overhead to direct array access."
I'm not against this change but want to make sure we have thought through the implications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. I don't think this is an issue here, as toString()
is generally used for debugging purposes. Actually I'm wondering if it's practically usable for Table
beyond tests as it can create huge strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some benchmarking, toString
is a perfect candidate for using toArrayView()
: the access overhead is actually the same as with toArray()
. It is just spread while iterating while with toArrayView()
while it's paid once upfront with toString
. Since we're not reusing the array, after join
, a view that doesn't allocate memory is a better choice.
Thank @domoritz!
That absolutely makes sense. My immediate need was for tables for an application that uses plain objects, and I will add it for vectors. Also, do you agree that |
Good question. I wonder actually whether we need |
checkArray(table.toArray()); | ||
}); | ||
describe('table.toArrayView()', () => { | ||
checkArray(table.toArrayView()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we make this work?
checkArray(table.toArrayView()); | |
checkArray(table); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See the discussion here. We cannot make Table
behave like an array.
js/src/table.ts
Outdated
@@ -238,17 +238,30 @@ export class Table<T extends TypeMap = any> { | |||
* | |||
* @returns An Array of Table rows. | |||
*/ | |||
public toArray() { | |||
public toArray(): Array<Struct<T>['TValue']> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this improve the type returned by this method or just make the type explicit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It just makes it explicit. I added it to know precisely what toArrayView()
was supposed to return. And also because I like explicit types 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unlike in other languages, explicitly annotating return types is somewhat of an anti-pattern in TypeScript. Unless there's a special case where we need to inform the compiler what the real return type is, I'm with @domoritz that we should default to letting the compiler infer the type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, no problem, I removed it in 4b706a9.
Mimicking an array is a bit tricky: it has a lot of methods that are implemented by iterating it (like This is why return new Proxy([] as Array<Struct<T>['TValue']>, new TableArrayProxyHandler(this)); That way the proxy handler only has to implement iteration and direct access, and benefits from all other array methods by delegating to the wrapped array. |
I see. I'll page @trxcllnt to weigh in on this one. |
I did some extensive benchmarks to understand memory usage and performance overhead of There are options for both the array and its elements:
All combinations can be valid depending on how the application uses the resulting array-like object. To account for this, I'd like to add an optional enum ToArrayBehavior (eager, cache, noCache);
class ToArrayOptions {
array: ToArrayBehavior,
rows: ToArrayBehavior
}
function toArray(options? ToArrayOptions) ... with a default of WDYT? |
Table itself is already a view... why not just use |
I'm not sure how I feel about this PR yet, and have a few questions. I'm not saying I disagree with it, just trying to provide some context and discuss general strategies.
Yeah, this is something we've struggled with since the project's inception. Every "convenience" API we've added is almost always also the least-performant way to interact with the data. While ergonomic for vanilla JS users, these APIs encourage abuse and lead to performance pitfalls in everyday use. Anybody profiling their code blames Arrow for "being slow," when in reality all we did was make it easy for users to do slow things. I personally would rather focus efforts on making it easy for users to do the fastest thing (so-called helping users "fall into the pit of success"). To date, this has manifested in three ways. First, we ensure all data movement (e.g. IPC) is zero copy and provide/encourage users to use zero-copy views of the data at rest. Second, we provide We have
Part of this performance deficit is definitely due to the A while back I had a branch that attempted this approach for Table/Vector. The idea was to use In reference to to my point above, even though our implementation could be the fastest way to implement a JS-Array-style API via Proxies, we'll still get blamed for encouraging users to use it. With that context out of the way, I have some questions. First, why/how are users (or you) using Is it to get the table into a form where it can be used with the list alebra (map/filter/reduce/etc.)? If so, have you considered a library like IxJS, or the stage 3 TC39 iterator-helpers proposal? One of the main reasons the Arrow types implement the [Async]Iterator spec is to make it easier for users to interoperate with libraries like these, while avoiding the Second, aside from the Array algebra, what benefit does an Array-like API bring over the existing Or is there a reason that operating directly on the binary data isn't easy/sufficient? For example, is the following insufficient/too inconvenient? const times2 = table.data.reduce((results, batch) => {
const [data] = batch.getChild("floats").data;
return times2.concat(data.values.map((f, i) => data.isNull(i) ? f * 2 : 0));
}, new Float32Array()); Finally, is there any reason this can't live in application code instead (e.g. you create a Proxy from a table that implements the Array-subscript syntax)? I have done this a number of times until I had time to refactor routines handle tables/vectors more efficiently. I am not intending to come off as obtuse, and I hope my questions make sense in context. I just worry about adding APIs that encourage users to do something convenient at the expense of performance. |
Thanks for the extensive explanation @trxcllnt. I totally agree that providing helpers like I work at Elastic where we're in the process of adding support for Arrow. The products currently use a JSON array-of-objects representation to exchange data, and have done so for more than a decade, even if deep inside Elasticsearch there is some column-oriented data processing happening. We recently added support for Arrow on the Elasticsearch server side and are now considering to consume it from Kibana, the UI front-end to manage and display data. There is a lot of existing code in Kibana based on array-of-objects and refactoring it to use Arrow vectors will take some time, and every iteration needs to bring some benefits in some area to foster adoption. So the idea is to start from the edge (data exchange) where we can have significant gains on payload size, memory usage and eliminate parsing time, and then progressively use it more deeply in the rest of the code base. Hence the use of Now I agree that So what about providing helpers as what they are, helpers and not primary features, and make this clear by moving them to a different location, like a I hope this makes sense. |
Thanks for the additional context. I really appreciate the effort you put into preparing this pull request and hope that you send more in the future as you find wants to improve the library. Maybe for this specific feature it makes sense to just leave arrow as is (with the iterators and |
Ok, I understand your concerns and I'm closing this PR. I will keep experimenting on this topic in my benchmarks project, and may ping you at some point to see if anything interesting could be contributed here. That being said, this discussion outlined the fact that I also opened PR #44289 with only the change for |
Good point on the examples in the introduction. We should first show how to iterate instead. |
I took a stab at encouraging more effective APIs. How's #44292? |
Rationale for this change
Table.toArray()
provides the convenience of using Arrow tables as common object arrays. However it does allocate an array of the size of the table that hold proxies to the table rows. This array can consume a significant amount of memory with large tables.Providing an array proxy that wraps the table and creates struct row proxies on demand would avoid this and provide a object array view on top of a table at almost zero memory cost.
What changes are included in this PR?
This PR adds a new
Table.toArrayView()
method that returns an array proxy to the table rows.It complements
Table.toArray()
and doesn't replace it for the following reasons:toArray()
returns a plain mutable array. Replacing it would be a breaking change.Also fixes #30863 by using a singleton proxy handler for
StructRowProxyHandler
, which further reduces memory usage.Are these changes tested?
Yes. New tests are added to
js/test/unit/table/table-test.ts
for bothtoArrayView()
andtoArray()
that was not tested.Are there any user-facing changes?
This adds a new
Table.toArrayView()
method.