Performance problem #131
Comments
There are a bunch of performance notes here: https://github.com/Engelberg/instaparse/blob/master/docs/Performance.md

Usually an instaparse parser performs linearly with respect to the size of the input up to some point at which it exhausts memory and starts thrashing the garbage collector. This breaking point usually comes when the input gets somewhere between 20k-200k, depending on the complexity of the grammar. (If the parser performance isn't linear on small inputs, then the culprit is usually ambiguity in the grammar.) So a 700k file is really pushing the limits of what instaparse is good for.

The reason is that instaparse has really robust backtracking, so it has to consider the possibility that the very last character of the file may force it to reinterpret the way that the entire file is matched against the grammar, conceivably backtracking all the way back to the beginning if necessary. This means maintaining a history of every decision point made for the entire 700k of parsing, which is quite a bit to track.

If the 700k file consists of a bunch of individual records, and it is easy to identify the boundaries, the best thing to do is to chop it up into individual records and pass these smaller strings to your instaparse parser.

You can also try to use the If you get a chance to try the
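The record-chopping advice above can be sketched roughly as follows. This is an illustration in Python rather than Clojure, and it rests on assumptions not stated in the thread: records are separated by a blank line (the `delimiter` default), and `parse_record` is a hypothetical stand-in for whatever per-record parser you already have, such as a call into an instaparse parser.

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")


def split_records(text: str, delimiter: str = "\n\n") -> Iterator[str]:
    """Yield the non-blank records of `text`, one at a time.

    Assumption: `delimiter` marks every record boundary and never
    occurs inside a record.
    """
    start = 0
    while True:
        end = text.find(delimiter, start)
        if end == -1:
            # No more delimiters: whatever is left is the last record.
            tail = text[start:]
            if tail.strip():
                yield tail
            return
        record = text[start:end]
        if record.strip():
            yield record
        start = end + len(delimiter)


def parse_all(
    text: str,
    parse_record: Callable[[str], T],  # hypothetical per-record parser
    delimiter: str = "\n\n",
) -> List[T]:
    """Parse each record independently, so no single parse sees the whole file."""
    return [parse_record(record) for record in split_records(text, delimiter)]
```

Because each record is parsed on its own, the parser only has to keep backtracking history for one record at a time instead of for the whole input.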
BTW, here's a simple example illustrating my point about backtracking. Imagine the following grammar:
S = Option1 | Option2
If you imagine matching this grammar against a 700k string of all a's up until the final 'b', instaparse would first try matching using Option1, and it would hum along merrily until it hit the final 'b', at which point it has to backtrack all the way to the beginning and try Option2. This is why most parsing engines require you to write an LL(k) grammar, which can determine with certainty what rule to use based on looking ahead at most k characters. Beyond
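To make the cost of that backtracking concrete, here is a toy sketch in Python. It is not how instaparse works internally, and the two alternatives are hypothetical, since the thread does not show their definitions: `Option1` is assumed to match a string made only of a's, and `Option2` a run of a's ending in a single 'b'. The sketch just counts how many characters get examined when the first alternative fails at the very last character.

```python
from typing import Optional, Tuple


def match_all_a(text: str) -> Tuple[bool, int]:
    """Hypothetical Option1: one or more 'a' and nothing else.

    Returns (matched, characters_examined).
    """
    examined = 0
    for ch in text:
        examined += 1
        if ch != "a":
            return False, examined
    return len(text) > 0, examined


def match_a_then_b(text: str) -> Tuple[bool, int]:
    """Hypothetical Option2: zero or more 'a' followed by one final 'b'."""
    examined = 0
    for i, ch in enumerate(text):
        examined += 1
        if ch == "a":
            continue
        # First non-'a' character: it must be a 'b' in the last position.
        return (ch == "b" and i == len(text) - 1), examined
    return False, examined


def parse_s(text: str) -> Tuple[Optional[str], int]:
    """S = Option1 | Option2, retrying from position 0 after a failure."""
    total = 0
    for name, option in (("Option1", match_all_a), ("Option2", match_a_then_b)):
        ok, examined = option(text)
        total += examined
        if ok:
            return name, total
    return None, total
```

With an input of n a's followed by one 'b', the first alternative reads all n + 1 characters before failing, and the second reads them all again, so the work doubles. A real backtracking parser also has to remember every decision point along the way, which is where the memory pressure comes from.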
Ah! Okay, I'll switch to handwritten parsers then - I have some files up to
On Wed, Apr 13, 2016 at 4:01 AM, Mark Engelberg [email protected]
|
I'm pretty sure I've hit the limit. With one more rule, running the parser thrashes my machine, possibly forever, and never returns. Using I used
My instinct would be to use re-seq with an appropriate regex to break it apart. In any case, this strategy of making a high level parse and then handing chunks off to more specific parsers is something I'd like to make easier to do in instaparse. If you have any tips that you've learned from going through it, let me know.
I looked at all the regex functions and none of them tell you the index in the String. I need indexes because I simply want to find 'BEGIN_BLOCK' and 'END_BLOCK', then take everything in between. Perhaps the real problem is that I'm not good enough at regular expressions to match ['BEGIN_BLOCK' - anything but 'END_BLOCK' - 'END_BLOCK'].
The key is the Also, if you want indexes, you can call Clojure's
On Wed, Aug 31, 2016 at 7:46 PM, Chris Murphy [email protected]
|
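The block-matching question above (find 'BEGIN_BLOCK', then anything up to the nearest 'END_BLOCK', and report where it sits in the string) can be sketched with a reluctant, or non-greedy, quantifier. This is a Python illustration rather than Clojure, and it assumes that blocks do not nest and may span multiple lines. `find_blocks` is a hypothetical helper name.

```python
import re
from typing import List, Tuple

# `.*?` is non-greedy: it stops at the first END_BLOCK it can reach,
# which gives the "anything but END_BLOCK" behaviour asked about.
# re.DOTALL lets `.` match newlines so a block may span several lines.
BLOCK = re.compile(r"BEGIN_BLOCK(.*?)END_BLOCK", re.DOTALL)


def find_blocks(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, body) for each BEGIN_BLOCK ... END_BLOCK span.

    `start` and `end` are indexes into `text` covering the whole span,
    markers included; `body` is the text between the two markers.
    """
    return [(m.start(), m.end(), m.group(1)) for m in BLOCK.finditer(text)]
```

Each `body` can then be handed to a more specific parser, while `start` and `end` let you map results back to positions in the original file.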
One thing you might be interested in (in case you don't already know) is the concept of Cuts from http://www.lihaoyi.com/fastparse/. Seems to me to be like
The old link is broken; here is the new one: https://com-lihaoyi.github.io/fastparse/#Cuts
There seems to be a performance problem, though I'm not sure whether it's intrinsic or something I'm doing wrong:
http://stackoverflow.com/questions/36572997/any-way-to-speed-up-instaparse