[jsinterp] Actual JS interpreter #11272

Closed
wants to merge 127 commits into from

@sulyi sulyi commented Nov 23, 2016

Please follow the guide below

  • You will be asked some questions, please read them carefully and answer honestly
  • Put an x into all the boxes [ ] relevant to your pull request (like that [x])
  • Use Preview tab to see how your pull request will actually look like

Before submitting a pull request make sure you have:

In order to be accepted and merged into youtube-dl each piece of code must be in public domain or released under Unlicense. Check one of the following options:

  • I am the original author of this code and I am willing to release it under Unlicense
  • I am not the original author of this code but it is in public domain or released under Unlicense (provide reliable evidence)

What is the purpose of your pull request?

  • Bug fix
  • Improvement
  • New extractor
  • New feature

Description of your pull request and other information

I've started to implement an actual JavaScript syntax parser.
-- EDIT --
And moved on to making an interpreter.

@sulyi sulyi mentioned this pull request Nov 24, 2016
8 tasks
_STRING_RE = r'%s|%s' % (_SINGLE_QUOTED, _DOUBLE_QUOTED)

_INTEGER_RE = r'%(hex)s|%(dec)s|%(oct)s' % {'hex': __HEXADECIMAL_RE, 'dec': __DECIMAL_RE, 'oct': __OCTAL_RE}
_FLOAT_RE = r'%(dec)s\.%(dec)s' % {'dec': __DECIMAL_RE}
Collaborator

.3

Author

Oh, thx @yan12125!
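
The `.3` above points at literals that `_FLOAT_RE` appears unable to match, since it expects a decimal on both sides of the dot. A minimal sketch of a pattern covering the ECMAScript DecimalLiteral forms (digits optional on one side of the dot, optional exponent) could look like the following. `DECIMAL_INT_RE`, `EXPONENT_RE`, `FLOAT_RE` and `is_float_literal` are illustrative names, and the integer pattern is an assumption because `__DECIMAL_RE` is not shown in this diff.

```python
import re

# Assumed stand-in for the PR's __DECIMAL_RE, which this diff does not show.
DECIMAL_INT_RE = r'(?:0|[1-9][0-9]*)'
EXPONENT_RE = r'(?:[eE][+-]?[0-9]+)'

# Three forms: "3." or "3.14", ".3", and "3e5" (no dot, exponent required).
FLOAT_RE = (
    r'(?:%(int)s\.[0-9]*%(exp)s?|\.[0-9]+%(exp)s?|%(int)s%(exp)s)'
    % {'int': DECIMAL_INT_RE, 'exp': EXPONENT_RE}
)


def is_float_literal(text):
    """Return True if the whole text is a float literal under FLOAT_RE."""
    return re.match(FLOAT_RE + r'\Z', text) is not None
```

Plain integers such as `3` are deliberately left to the integer pattern, so they are not matched here.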

_NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'

_SINGLE_QUOTED = r"""'(?:[^'\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^'\\\\]*'"""
_DOUBLE_QUOTED = r'''"(?:[^"\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^"\\\\]*"'''
Collaborator

I think the backslashes are misused here. For example:

>>> repr(re.match(r"""'(?:[^'\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^'\\\\]*'""", r"""'\'"""))
'None'

Author

I'll check it, though I borrowed that from utils.

Author

@sulyi sulyi Nov 26, 2016

Sure, it wasn't right, but r"""'\'""" shouldn't be matched anyway (it's not closed).
This looks OK to me:

>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", r"""'\''"""))
'<_sre.SRE_Match object; span=(0, 4), match="\'\\\\\'\'">'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", r"""'\'"""))
'None'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", """'\''"""))
'<_sre.SRE_Match object; span=(0, 2), match="\'\'">'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", """'\'"""))
'<_sre.SRE_Match object; span=(0, 2), match="\'\'">'
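
For comparison, here is a minimal sketch of a quoted-string pattern in the same spirit as the one tried above. It is more permissive than the PR's version: any backslash followed by a non-newline character counts as an escape, which assumes the input is valid JavaScript. `quoted_re` and `STRING_RE` are hypothetical names, not identifiers from this PR.

```python
import re


def quoted_re(quote):
    """Build a pattern for a string literal delimited by the given quote."""
    # Either an ordinary character (not the quote, a backslash or a newline)
    # or a backslash followed by any single non-newline character.
    return r'%(q)s(?:[^%(q)s\\\n]|\\.)*%(q)s' % {'q': quote}


STRING_RE = re.compile('%s|%s' % (quoted_re("'"), quoted_re('"')))


def match_string(text):
    """Return the string literal at the start of text, or None."""
    m = STRING_RE.match(text)
    return m.group(0) if m else None
```

Because the two alternatives inside the group can never start with the same character, the pattern should not backtrack badly on long inputs.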

_BOOL_RE = r'true|false'
# XXX: it seems a group cannot be referenced this way
# r'/(?=[^*])[^/\n]*/(?![gimy]*(?P<reflag>[gimy])[gimy]*\g<reflag>)[gimy]{0,4}'
_REGEX_RE = r'/(?=[^*])[^/\n]*/[gimy]{0,4}'
Collaborator

>>> re.match(r'/(?=[^*])[^/\n]*/[gimy]{0,4}', r'''/\/\/\//''')
<_sre.SRE_Match object; span=(0, 3), match='/\\/'>

Author

@sulyi sulyi Nov 25, 2016

Hopefully I've managed to improve on it a little.
--- edit ---
They can't be multiline, can they? I'll need to check that.

Collaborator

They can't be multiline, can they?

Yep. According to ECMA-262 5.1, CR (U+000D), LF (U+000A), LS (U+2028) and PS (U+2029) are not allowed in RegExp literals.

Author

Thanks. I'll need to read that a couple more times.
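
Pulling the points of this thread together, the following is a rough sketch of a RegExp-literal pattern that accepts escaped slashes, treats a `/` inside a character class as part of the body, excludes the four line terminators, and rejects repeated flags. All names are hypothetical. Note that in Python's `re`, a named group is referenced inside a pattern with `(?P=name)`; the `\g<name>` form is only valid in replacement strings, which may be why the commented-out attempt did not work.

```python
import re

# Line terminators that ECMA-262 5.1 forbids inside a RegExp literal.
_LT = u'\n\r\u2028\u2029'
# A backslash followed by any character except a line terminator.
_ESCAPE = r'\\[^%s]' % _LT
# A character class such as [/]; an unescaped "/" inside it does not end the literal.
_CLASS = r'\[(?:[^\]\\%s]|%s)*\]' % (_LT, _ESCAPE)
# The body is non-empty, so "//" is never read as an empty regex.
_BODY = r'(?:[^/\\\[%s]|%s|%s)+' % (_LT, _ESCAPE, _CLASS)
# Reject a flag that occurs twice, using a named backreference.
_FLAGS = r'(?![gimy]*(?P<reflag>[gimy])[gimy]*(?P=reflag))[gimy]{0,4}'

# (?!\*) keeps "/*" from being read as the start of a regex.
REGEX_LITERAL_RE = re.compile(r'/(?!\*)%s/%s' % (_BODY, _FLAGS))


def match_regex_literal(text):
    """Return the RegExp literal at the start of text, or None."""
    m = REGEX_LITERAL_RE.match(text)
    return m.group(0) if m else None
```

This only recognises the literal itself. Deciding whether a `/` starts a regex or is a division operator still depends on the preceding token, which a single pattern cannot express.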


# _ARRAY_RE = r'\[(%(literal)s\s*,\s*)*(%(literal)s\s*)?\]' % {'literal': _LITERAL_RE}
# _VALUE_RE = r'(?:%(literal)s)|(%(array)s)' % {'literal': _LITERAL_RE, 'array': _ARRAY_RE}
_CALL_RE = r'\.?%(name)s\s*\(' % {'name': _NAME_RE} # function or method!
Collaborator

Function calls are complex. For example:

from youtube_dl.jsinterp import JSInterpreter

jsi = JSInterpreter('''
function a(x) { return x; }
function b(x) { return x; }
function c()  { return [a, b][0](0); }
''')
print(jsi.call_function('c'))

Author

I've added a test.
Tokenizing seems to be fine, but I haven't migrated the interpreter yet, and the old one does not support this.

]
_ASSIGN_OPERATORS = [(op + '=', opfunc) for op, opfunc in _OPERATORS]
_ASSIGN_OPERATORS.append(('=', lambda cur, right: right))

_RESERVED_RE = r'(?:function|var|(?P<ret>return))\s'
Collaborator

@yan12125 yan12125 Nov 25, 2016

Sorry, but JavaScript is not context-free. For example:

code3 = '''
a = {'var': 3};
function c() { return a.var; }
'''
jsi = JSInterpreter(code3)
print(jsi.call_function('c'))

Author

@sulyi sulyi Nov 25, 2016

I've added a test, but I don't see any problem.
And what do you mean by "not context-free", @yan12125?
Didn't you mean to say "not regular"?
--- edit ---
Sorry, you're right.
Although, according to http://stackoverflow.com/questions/30697267/is-javascript-a-context-free-language:

That object literals must not contain duplicate property names and that function parameter lists must not contain duplicate identifiers are two rules that cannot be expressed using (finite) context-free grammars.

Collaborator

Duplicated keys/parameter names are another issue, which can be ignored in parsing and checked in semantic checking. In youtube-dl it's safe to assume all inputs are valid Javascript so there's no need to handle it.
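
The `a.var` example earlier in this thread shows why a reserved word cannot be classified by its spelling alone: after a dot it is an ordinary property name. A minimal sketch of that idea follows, assuming a toy tokenizer where every non-identifier character is a one-character punctuation token. `RESERVED_WORDS` and `classify` are hypothetical names and do not come from this PR.

```python
import re

RESERVED_WORDS = frozenset(['function', 'var', 'return'])

# Toy lexer: an identifier, or any other single non-space character.
_TOKEN_RE = re.compile(r'\s*(?:(?P<id>[a-zA-Z_$][a-zA-Z_$0-9]*)|(?P<punct>\S))')


def classify(code):
    """Return (kind, value) pairs, treating reserved words after '.' as names."""
    tokens = []
    prev = None
    for m in _TOKEN_RE.finditer(code):
        if m.group('id') is not None:
            value = m.group('id')
            # A reserved word directly after a dot is a property name.
            if value in RESERVED_WORDS and prev != '.':
                kind = 'keyword'
            else:
                kind = 'name'
        else:
            value = m.group('punct')
            kind = 'punct'
        tokens.append((kind, value))
        prev = value
    return tokens
```

The lexer stays simple by accepting any identifier, and the decision about reserved words moves to the stage that knows the surrounding context.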

@yan12125 yan12125 self-assigned this Nov 26, 2016
@yan12125
Collaborator

Note: OrderedDict is not available in Python 2.6. There was a proposal to drop 2.6 support, but there is no consensus yet (#5697).

@sulyi
Author

sulyi commented Nov 29, 2016

For some reason this was missed by the code inspector of the IDE I'm working with. I'll try to come up with a workaround. Thanks.

- missing enumerate in op_ids and aop_ids
- order of relation and operator regex in input_element
Author

@sulyi sulyi left a comment

I've left an import line in.

Also, a bunch of changes got in that shouldn't have.

@sulyi
Author

sulyi commented Dec 1, 2016

I've just realised that my original idea, that the _next_statement method would do the lexical analysis and interpret_statement the parsing, is flawed. To yield a statement, parsing has to have happened already, since Statement is one of the symbols (along with FunctionDeclaration) that SourceElement expands to in the syntactic grammar.

sulyi added 3 commits December 3, 2016 06:32
- new class TokenStream with peek and pop methods
 - _assign_expression handling precedence
 - new logical, unary, equality and relation operators
 - yet another try replacing OrderedDict
 - minor change in lexical grammar
    allowing identifiers to match reserved words
    _chk_id staticmethod has been added to handle it
    in syntactic grammar
Supports:
  - arrays
  - expressions
  - calls
  - assignment
  - variable declaration
  - blocks
  - return statement
  - element and property access

  Semantics not yet implemented, tho.
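
To illustrate how a token stream with peek and pop can drive precedence handling, here is a minimal sketch. It is not the PR's `TokenStream`: it only knows integers, parentheses and four binary operators, and unlike the PR it evaluates directly instead of building a tree. The precedence table and function names are assumptions made for the example.

```python
import operator
import re

_TOKEN_RE = re.compile(r'\s*(?:(?P<num>\d+)|(?P<op>[-+*/()]))')


class TokenStream(object):
    """Hypothetical minimal token stream with peek and pop."""

    def __init__(self, code):
        self._tokens = []
        self._pos = 0
        code = code.rstrip()
        pos = 0
        while pos < len(code):
            m = _TOKEN_RE.match(code, pos)
            if m is None:
                raise ValueError('Unexpected character at %d' % pos)
            if m.group('num') is not None:
                self._tokens.append(('num', int(m.group('num'))))
            else:
                self._tokens.append(('op', m.group('op')))
            pos = m.end()

    def peek(self):
        # Look at the next token without consuming it.
        if self._pos < len(self._tokens):
            return self._tokens[self._pos]
        return ('end', None)

    def pop(self):
        # Consume and return the next token.
        token = self.peek()
        if token[0] != 'end':
            self._pos += 1
        return token


# operator -> (precedence, function); a higher number binds tighter.
_BINARY = {
    '+': (1, operator.add), '-': (1, operator.sub),
    '*': (2, operator.mul), '/': (2, operator.truediv),
}


def parse_primary(ts):
    kind, value = ts.pop()
    if kind == 'num':
        return value
    if (kind, value) == ('op', '('):
        result = parse_expression(ts)
        if ts.pop() != ('op', ')'):
            raise ValueError('Expected )')
        return result
    raise ValueError('Unexpected token %r' % (value,))


def parse_expression(ts, min_prec=1):
    # Precedence climbing: fold in operators that bind at least as tightly as min_prec.
    left = parse_primary(ts)
    while True:
        kind, value = ts.peek()
        if kind != 'op' or value not in _BINARY:
            return left
        prec, func = _BINARY[value]
        if prec < min_prec:
            return left
        ts.pop()
        # prec + 1 makes the operators left-associative.
        left = func(left, parse_expression(ts, prec + 1))
```

The parser only needs one token of lookahead here, which is what peek provides; pop is used once the decision to consume has been made.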
@@ -12,7 +12,7 @@
 import traceback

 from .common import InfoExtractor, SearchInfoExtractor
-from ..jsinterp import JSInterpreter
+from ..jsinterp2 import JSInterpreter
Author

An accidental change, but oddly it passed test_youtube_signature. For me it fails because JSArrayPrototype._slice doesn't handle arguments correctly, as that isn't implemented yet.

@sulyi sulyi changed the title [jsinterp] Actual parsing [jsinterp] Actual JS interpreter Jun 10, 2018
@Tatsh
Contributor

Tatsh commented Jun 10, 2018

This is, of course, very neat. But much of the standard library in Chrome (and maybe other browsers) is implemented in JavaScript. Instead of writing the built-ins (like String.prototype.match) in Python, why not write them in JavaScript where possible?

try:
    ref = (self.this[id] if id in self.this else
           self.global_vars[id])
except KeyError:
Author

I think JSInterpreter#extract_function is useful for filtering code before execution, so I'd like to keep supporting it. But here it needs to get an object from the outer context, and that behaviour is not in the spec. I'd like to suggest a flag that disables this kind of feature.

@sulyi
Author

sulyi commented Jun 10, 2018

@Tatsh That sounds great, and I remember seeing such an implementation somewhere, but I didn't understand it well enough to try to adopt it. Can you help?

@Tatsh
Contributor

Tatsh commented Jun 10, 2018

The main thing is to get the basics of the interpreter in, which includes the built-in types; the JavaScript portions can then be written much like polyfills are written today. So you would not need to implement Array.isArray() in Python if you have the === operator working for comparing function references and the .constructor property working on all objects. Then write Array.isArray = function (x) { return x.constructor === Array }; and make this load at runtime before anything else.

I can take a look later since this does interest me. Mainly this was needed for me to get past CloudFlare anti-DDoS without needing cookies.

@sulyi
Author

sulyi commented Jun 10, 2018

Well, I've already implemented the === operator and, I believe, most constructors, but neither has been tested properly.
Also, the spec for isArray states:

  1. If Type(arg) is not Object, return false.
  2. If the value of the [[Class]] internal property of arg is "Array", then return true.
  3. Return false.

I'm not sure when 1. would execute or whether your solution takes care of it.
My solution for this particular function would look like this:

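from .internals import jstype, object_type


def _is_array(arg):
    # Step 1: anything that is not a JS object cannot be an array.
    if jstype(arg) is not object_type:
        return False
    # Step 2: check the [[Class]] internal property.
    if arg.jsclass == 'Array':
        return True
    # Step 3
    return False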

I'm not attached to it, nor do I claim that it's elegant, but this is how I'm able to solve the task at hand.
I have far less experience programming in JS than in Python. So if you could tell me what needs to get done for this to work, I can probably take care of it, but actually implementing it that way on my own would be much harder.
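
To make step 1 of the quoted spec text concrete, here is a self-contained sketch using stand-in classes instead of the PR's `.internals` module. `JSObject`, `JSArray` and `is_array` are hypothetical names. Step 1 is the branch taken for primitives such as numbers and strings, and in JavaScript itself reading `.constructor` from `undefined` or `null` throws a TypeError, so a polyfill based on `x.constructor === Array` would need its own guard for those values.

```python
class JSObject(object):
    """Stand-in for an interpreter-side JS object (assumption, not the PR's class)."""
    jsclass = 'Object'


class JSArray(JSObject):
    """Stand-in for a JS array; jsclass mirrors the [[Class]] internal property."""
    jsclass = 'Array'

    def __init__(self, items=()):
        self.items = list(items)


def is_array(arg):
    # Step 1: primitives (here: anything that is not a JSObject) are never arrays.
    if not isinstance(arg, JSObject):
        return False
    # Steps 2 and 3: decide by the [[Class]] value.
    return arg.jsclass == 'Array'
```

Here Python values such as `3`, `'Array'` or `None` play the role of JS primitives, which is a simplification for the example.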

- adds `jsgrammar.LINETERMINATORSEQ_RE`
- lexer `tstream.TokenStream` checks for lineterminators in tokens
- adds `tstream.Token`
- refactors `tstream.TokenStream` and `jsparser.Parser` to use it
@sulyi
Author

sulyi commented Jun 10, 2018

I've added the ability to handle line terminators to the lexer (tstream.TokenStream). This is the first step in implementing correct line reporting.

This is also useful for my other plan: changing the test suite to generate test cases from JSON files instead of Python modules. My reasoning is that if jsparser.Parser supported converting the AST to ESTree or some similar format, it would be easy to compare it against the output of acorn or some other parser.
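
As an illustration of how line-terminator handling feeds line reporting, here is a small sketch that maps a character offset to a line and column. The pattern and the function are assumptions made for this example, not the PR's `jsgrammar.LINETERMINATORSEQ_RE`; the set of terminators (LF, CR, LS, PS, with CR LF counted once) follows ECMA-262.

```python
import re

# CR LF counts as one terminator, so it has to be tried before the single characters.
LINE_TERMINATOR_SEQ_RE = re.compile(u'\r\n|[\n\r\u2028\u2029]')


def line_and_column(code, offset):
    """Return the 1-based (line, column) of the character at offset."""
    line = 1
    line_start = 0
    for m in LINE_TERMINATOR_SEQ_RE.finditer(code):
        # Only terminators that end at or before the offset start a new line for it.
        if m.end() > offset:
            break
        line += 1
        line_start = m.end()
    return line, offset - line_start + 1
```

A lexer could store such a position on each token, so that error messages and an ESTree-like `loc` field can be filled in later.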

sulyi added 2 commits June 11, 2018 07:47
- Adds `jsbuilt_ins.nan` and `jsbuilt_ins.infinity`
- Adds arithmetic operator overload to
  `jsbuilt_ins.jsnumber.JSNumberPrototype`
- Adds equality operator overload to `jsinterp.Reference`
- Adds better strict equality and typeof operator in `tstream`
- Refactors `Context` and `Reference` classes into their own module
  named `environment` (saves local import in `tstream`)