Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer: add indexOf method #160

Closed
wants to merge 1 commit into from
Closed

Conversation

srijs
Copy link

@srijs srijs commented Dec 14, 2014

Adds a .indexOf method to Buffer, which borrows semantics from
both Array.prototype.indexOf and String.prototype.indexOf.

Buffer.prototype.indexOf can be invoked with a Buffer, a string
or a number as the needle. If the needle a Buffer or string, it will
find the first occurrence of this sequence of bytes. If the needle is
a number, it will find the first occurrence of this byte.

Fixes #95.

If you're unsure about or not happy with the semantics, let's discuss :)

Adds a `.indexOf` method to `Buffer`, which borrows semantics from
both `Array.prototype.indexOf` and `String.prototype.indexOf`.

`Buffer.prototype.indexOf` can be invoked with a Buffer, a string
or a number as the needle.  If the needle a Buffer or string, it will
find the first occurrence of this sequence of bytes.  If the needle is
a number, it will find the first occurrence of this byte.

Reviewed-by: Sam Rijs <[email protected]>
Fixes: nodejs#95
PR-URL: nodejs#160
@vkurchatkin
Copy link
Contributor

Though this function might be useful as it is, I think indexOf is a bad name, because it doesn't work the same way as %TypedArray%.prototype.indexOf

@@ -332,6 +332,16 @@ Buffer.prototype.fill = function fill(val, start, end) {
return this;
};

Buffer.prototype.indexOf = function indexOf(needle, pos) {
if (typeof needle === 'number') {
needle = new Buffer([needle]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allocating a buffer for a single byte seems too much. Also it's unclear that only one byte numbers are accepted. What if I want to search for 16 byte number? How to specify endianness?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This intentionally follows the semantics of Array.prototype.indexOf.

It can thus only search for an element of the array. If you wanted to search for a sequence of elements (bytes), you can supply a Buffer as the needle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vkurchatkin As I understand it, the backing slow buffer will have already been allocated, so this shouldn't be too bad of an allocation.

We could amortize the cost of repeated calls by having a statically available single byte buffer into which we swap the value, which might be a nice optimization.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about special functions on internal for each case

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with a hybrid interface, I also think different specialised functions would be a cleaner approach.

@srijs
Copy link
Author

srijs commented Dec 14, 2014

That depends on how you look at it.

The situation is that there is String.prototype.indexOf as well as Array.prototype.indexOf (which is semantically equivalent to %TypedArray%.prototype.indexOf). The method implemented by me is polymorphic wrt the needle you can supply, and can work perfectly fine for "find the index of byte 0xFF in this buffer" (Array.prototype.indexOf) and "find the index of the substring 'abc' (or Buffer<61,62,63>) in this buffer".

It might be that I am missing something, in that case don't hesitate to point it out :)

@vkurchatkin
Copy link
Contributor

I think that function that finds a buffer inside of another buffer is generally more useful than function that finds a byte inside a buffer. But the spec says that Typed Arrays should behave just like arrays of numbers. And IMO Buffer should behave just like Uint8Array. Two reasons for that:

  • there were some talks about making Buffer inherit from Uint8Array when it's possible; in that case overriding methods is unwanted;
  • portability; as an example:
obj.on('data', function (data) {
// what is data?
// in node it's a `Buffer`
// in browser or ES6-compatible environment it's a `Uint8Array`
// but they are compatible (to some extent), so we don't really care
});

It is maybe too late for full compatibility, but we shouldn't make things worse for no good reason.

@mscdex
Copy link
Contributor

mscdex commented Dec 15, 2014

What about using something more efficient for multi-byte needles? I'm thinking of boyer-moore or a similar algorithm. streamsearch (used in dicer and other modules) uses a pure-javascript implementation of the boyer-moore-horspool algorithm and it is significantly faster than the typical naive search algorithm. Perhaps such an algorithm could be automatically used on haystacks whose lengths exceed some constant threshold.

@aredridel
Copy link
Contributor

boyer-moore-horspool ++

@bnoordhuis
Copy link
Member

However, Boyer-Moore-Horspool's worst case is O(n*m) whereas Boyer-Moore runs in linear time.

@srijs
Copy link
Author

srijs commented Dec 16, 2014

While I would be in favour of using a more sophisticated approach than naive searching, for the case that we go with String.indexOf, I'd be interested to hear a few other opinions on the interface:

Go with the current one (hybrid), or choose only one of Array.indexOf or String.indexOf? And if yes, which one?

I think @vkurchatkin has some valid points, I'd be interested to hear what some other people think of it.

@mikeal
Copy link
Contributor

mikeal commented Dec 16, 2014

This will make writing parsers so much better ;)

@piscisaureus piscisaureus reopened this Dec 30, 2014
@sonewman sonewman mentioned this pull request Jan 6, 2015
@chrisdickinson
Copy link
Contributor

Checking in.

@vkurchatkin I think the ship has sailed re: divergence from Uint8Array. We have already greatly diverged from the TypedArray spec in terms of public methods, and buffer-browserify shows how Buffers can be implemented on top of typed arrays for in-browser use.

@srijs There seems to be a preference for implementing this function in terms of Boyer-Moore, would you be interested in continuing this PR in that direction?

@srijs
Copy link
Author

srijs commented Jan 14, 2015

@chrisdickinson Yes, I'd be interested in going that direction. I will start with the implementation in the next couple of days.

@feross
Copy link
Contributor

feross commented Jan 20, 2015

@srijs Ping. Any progress?

@@ -764,6 +764,13 @@ buffer.
var b = new Buffer(50);
b.fill("h");

### buf.indexOf(value[, fromIndex])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly something like byteOffset would be better than fromIndex. The latter makes it sound like the character placement, but that's incorrect for multi-byte strings.

@trevnorris
Copy link
Contributor

@srijs Thanks for doing this work. There are several optimizations that could be implemented, but it'd take me a while to explain how to do them. Mind if I implement this myself, and I could throw in regex support as well. :)

@mscdex
Copy link
Contributor

mscdex commented Jan 22, 2015

+100

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
semver-minor PRs that contain new features and should be released in the next minor version.
Projects
None yet
Development

Successfully merging this pull request may close these issues.