[jsinterp] Actual JS interpreter #11272
youtube_dl/jsinterp.py
Outdated
_STRING_RE = r'%s|%s' % (_SINGLE_QUOTED, _DOUBLE_QUOTED)

_INTEGER_RE = r'%(hex)s|%(dec)s|%(oct)s' % {'hex': __HEXADECIMAL_RE, 'dec': __DECIMAL_RE, 'oct': __OCTAL_RE}
_FLOAT_RE = r'%(dec)s\.%(dec)s' % {'dec': __DECIMAL_RE}
This doesn't match .3
Oh, thx @yan12125!
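To make the `.3` remark concrete, here is a minimal sketch. It assumes `__DECIMAL_RE` looks roughly like `0|[1-9][0-9]*` (its real definition is not part of this hunk), and `FLOAT_ALT` is a hypothetical variant for illustration, not the pattern the PR ended up with.

```python
import re

# Assumption: stand-in for __DECIMAL_RE, whose definition is not in the hunk.
DEC = r'(?:0|[1-9][0-9]*)'

# Same shape as the _FLOAT_RE under review: digits are required before the dot.
FLOAT_ORIG = r'%(dec)s\.%(dec)s' % {'dec': DEC}

# Hypothetical variant: optional integer part, free-form fraction digits.
FLOAT_ALT = r'(?:%(dec)s\.[0-9]*|\.[0-9]+)' % {'dec': DEC}


def matches(pattern, text):
    """True if the whole text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None
```

With these definitions `matches(FLOAT_ORIG, '.3')` is false while `matches(FLOAT_ALT, '.3')` is true. Reusing the decimal pattern for the fraction part also rejects inputs such as `1.05`, which is why the variant uses plain `[0-9]` there.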
youtube_dl/jsinterp.py
Outdated
_NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'

_SINGLE_QUOTED = r"""'(?:[^'\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^'\\\\]*'"""
_DOUBLE_QUOTED = r'''"(?:[^"\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^"\\\\]*"'''
I think you're misusing \ here. For example:
>>> repr(re.match(r"""'(?:[^'\\\\]*(?:\\\\\\\\|\\\\['"nurtbfx/\\n]))*[^'\\\\]*'""", r"""'\'"""))
'None'
I'll check it, but I've borrowed that from utils though.
Sure, it wasn't right, but r"""'\'""" shouldn't be matched anyway (it's not closed).
This kinda looks ok to me:
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", r"""'\''"""))
'<_sre.SRE_Match object; span=(0, 4), match="\'\\\\\'\'">'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", r"""'\'"""))
'None'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", """'\''"""))
'<_sre.SRE_Match object; span=(0, 2), match="\'\'">'
>>> repr(re.match(r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'""", """'\'"""))
'<_sre.SRE_Match object; span=(0, 2), match="\'\'">'
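A small sketch of what goes wrong with the backslashes, for anyone following along. In a raw string every backslash is already literal, so doubling them again makes the regex ask for two backslashes in the input. `FOUR` and `span` are names made up for this illustration; `SINGLE_QUOTED` is the pattern from the reply above.

```python
import re

# Four backslash characters in the pattern text; the regex engine reads them
# as two escaped backslashes, i.e. it wants two backslashes in the input.
FOUR = r'\\\\'

# Pattern from the reply above: one escaped backslash per regex backslash.
SINGLE_QUOTED = r"""'(?:[^'\\]|\\['"nurtbfx/\\n])*'"""


def span(pattern, text):
    """Return the span of a match at the start of text, or None."""
    m = re.match(pattern, text)
    return m.span() if m else None
```

`span(SINGLE_QUOTED, r"""'\''""")` gives `(0, 4)` and the unterminated `r"""'\'"""` gives `None`, which agrees with the session output quoted above.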
youtube_dl/jsinterp.py
Outdated
_BOOL_RE = r'true|false'
# XXX: it seams group cannot be refed this way
# r'/(?=[^*])[^/\n]*/(?![gimy]*(?P<reflag>[gimy])[gimy]*\g<reflag>)[gimy]{0,4}'
_REGEX_RE = r'/(?=[^*])[^/\n]*/[gimy]{0,4}'
>>> re.match(r'/(?=[^*])[^/\n]*/[gimy]{0,4}', r'''/\/\/\//''')
<_sre.SRE_Match object; span=(0, 3), match='/\\/'>
Hopefully I've managed to improve on it a little.
--- edit ---
They can't be multiline, can they? I'll need to check that.
> They can't be multiline, can they?
Yep. According to ECMA-262 5.1, CR (U+000D), LF (U+000A), LS (U+2028) and PS (U+2029) are not allowed in RegExp literals.
Thx. I'll need to read that a couple more times.
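Putting the two points of this thread together (escaped slashes and the four forbidden line terminators), one possible shape for the pattern is sketched below. `REGEX_ALT` is a hypothetical variant, not what the PR uses, and it assumes that an unescaped `/` inside a character class (as in `/[/]/`) does not need to be handled.

```python
import re

# Pattern under review: stops at the first '/', even an escaped one.
REGEX_ORIG = r'/(?=[^*])[^/\n]*/[gimy]{0,4}'

# Hypothetical variant: a backslash escapes the next character, and none of
# CR, LF, LS (U+2028), PS (U+2029) may appear in the body.
# Assumption: an unescaped '/' inside a character class is not handled.
REGEX_ALT = (r'/(?![*/])(?:[^/\\\r\n\u2028\u2029]|\\[^\r\n\u2028\u2029])+'
             r'/[gimy]{0,4}')


def span(pattern, text):
    """Return the span of a match at the start of text, or None."""
    m = re.match(pattern, text)
    return m.span() if m else None
```

On the reviewer's input `r'/\/\/\//'` the original pattern stops after three characters, while the variant consumes all eight.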
youtube_dl/jsinterp.py
Outdated
# _ARRAY_RE = r'\[(%(literal)s\s*,\s*)*(%(literal)s\s*)?\]' % {'literal': _LITERAL_RE}
# _VALUE_RE = r'(?:%(literal)s)|(%(array)s)' % {'literal': _LITERAL_RE, 'array': _ARRAY_RE}
_CALL_RE = r'\.?%(name)s\s*\(' % {'name': _NAME_RE}  # function or method!
Function calls are complex. For example:
from youtube_dl.jsinterp import JSInterpreter
jsi = JSInterpreter('''
function a(x) { return x; }
function b(x) { return x; }
function c() { return [a, b][0](0); }
''')
print(jsi.call_function('c'))
I've added a test.
Tokenizing seems to be fine, but I haven't migrated the interpreter yet, and the old one does not support this.
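The underlying issue in this thread is that the callee of a call can be any expression, so a pattern anchored on a name cannot see it. A quick check against `_NAME_RE` and `_CALL_RE` as they appear in the hunks above (`finds_call` is a helper made up for this illustration):

```python
import re

NAME_RE = r'[a-zA-Z_$][a-zA-Z_$0-9]*'
CALL_RE = r'\.?%(name)s\s*\(' % {'name': NAME_RE}  # as in the hunk above


def finds_call(code):
    """True if CALL_RE finds something that looks like a call in code."""
    return re.search(CALL_RE, code) is not None
```

`finds_call('a(0)')` and `finds_call('obj.method (1)')` are true, but `finds_call('[a, b][0](0)')` is false, because the `(` follows `]` rather than a name.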
youtube_dl/jsinterp.py
Outdated
]
_ASSIGN_OPERATORS = [(op + '=', opfunc) for op, opfunc in _OPERATORS]
_ASSIGN_OPERATORS.append(('=', lambda cur, right: right))

_RESERVED_RE = r'(?:function|var|(?P<ret>return))\s'
Sorry, but JavaScript is not context-free. For example:
code3 = '''
a = {'var': 3};
function c() { return a.var; }
'''
jsi = JSInterpreter(code3)
print(jsi.call_function('c'))
I've added a test, but I don't see any problem.
And what do you mean by "not context-free", @yan12125?
Didn't you mean to say "not regular"?
--- edit---
Sry, you're right.
Although, according to http://stackoverflow.com/questions/30697267/is-javascript-a-context-free-language:
That object literals must not contain duplicate property names and that function parameter lists must not contain duplicate identifiers are two rules that cannot be expressed using (finite) context-free grammars.
Duplicated keys/parameter names are another issue, which can be ignored during parsing and checked during semantic analysis. In youtube-dl it's safe to assume all inputs are valid JavaScript, so there's no need to handle it.
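For the `a.var` example, the practical consequence is that whether `var` is a keyword depends on where it appears: after a dot any identifier name is allowed. A rough sketch of that idea follows. `RESERVED`, `TOKEN_RE` and `classify` are hypothetical names, the string handling is deliberately naive (no escapes), and only the three words from `_RESERVED_RE` are treated as reserved.

```python
import re

# Assumption: only the three words from _RESERVED_RE are treated as reserved.
RESERVED = {'function', 'var', 'return'}

TOKEN_RE = re.compile(r'''
    (?P<string>'[^']*'|"[^"]*")                         # naive literal, no escapes
  | (?P<dot>\.\s*)?(?P<name>[a-zA-Z_$][a-zA-Z_$0-9]*)   # name, maybe after a dot
''', re.VERBOSE)


def classify(code):
    """Return (kind, text) pairs for strings and name-like tokens in code."""
    result = []
    for m in TOKEN_RE.finditer(code):
        if m.group('string'):
            result.append(('string', m.group('string')))
        elif m.group('dot'):
            # After a dot any identifier name is allowed, reserved or not.
            result.append(('property', m.group('name')))
        elif m.group('name') in RESERVED:
            result.append(('keyword', m.group('name')))
        else:
            result.append(('identifier', m.group('name')))
    return result
```

Run on the reviewer's snippet, the `var` in `a.var` comes out as a property while `function` and `return` are still reported as keywords.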
A notice:
For some reason this had been missed by the code inspector of the IDE I'm working with. I'll try to come up with a workaround. Thanks.
- missing enumerate in op_ids and aop_ids
- order of relation and operator regex in input_element
I've left import line.
Also a bunch of changes got in that shouldn't have.
I've just realised that my original idea, that the _next_statement method would do the lexical analysis and interpret_statement the parsing, is flawed. To yield a statement, parsing has to have happened already, since Statement is one of the symbols (along with FunctionDeclaration) replacing SourceElement in the syntactic grammar.
- new class TokenStream with peek and pop methods
- _assign_expression handling precedence
- new logical, unary, equality and relation operators
- yet another try replacing OrderedDict
- minor change in lexical grammar allowing identifiers to match reserved words
  _chk_id staticmethod has been added to handle it in syntactic grammar
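For readers who have not seen the pattern, a token stream with `peek` and `pop` can be sketched as below. This is not the PR's `TokenStream`; the `END` sentinel, the `count` parameter and the tuple-shaped tokens are assumptions made for the illustration.

```python
from collections import deque

END = ('end', None)  # hypothetical sentinel returned once input is exhausted


class TokenStream:
    """Wraps any iterable of tokens with lookahead (peek) and consume (pop)."""

    def __init__(self, tokens):
        self._source = iter(tokens)
        self._lookahead = deque()

    def peek(self, count=1):
        """Return the count-th upcoming token without consuming anything."""
        while len(self._lookahead) < count:
            self._lookahead.append(next(self._source, END))
        return self._lookahead[count - 1]

    def pop(self):
        """Consume and return the next token."""
        self.peek()
        return self._lookahead.popleft()
```

A recursive-descent parser can then decide which rule to apply by peeking, and only pop once it has committed to a rule.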
Supports:
- arrays
- expressions
- calls
- assignment
- variable declaration
- blocks
- return statement
- element and property access
Semantics not yet implemented, tho.
youtube_dl/extractor/youtube.py
Outdated
@@ -12,7 +12,7 @@
 import traceback

 from .common import InfoExtractor, SearchInfoExtractor
-from ..jsinterp import JSInterpreter
+from ..jsinterp2 import JSInterpreter
Accidental change, but oddly it passed test_youtube_signature. For me it fails because JSArrayPrototype._slice doesn't handle arguments correctly, since that is not implemented yet.
- Fixes TestCase class names
This is, of course, very neat. But a lot of the standard library in Chrome (and maybe others) is implemented in JavaScript. Instead of making the built-ins (like …
try:
    ref = (self.this[id] if id in self.this else
           self.global_vars[id])
except KeyError:
I think JSInterpreter#extract_function is useful for filtering code before execution, so I'd like to continue supporting it. But here it needs to get an object from the outer context, and that behaviour is not in the spec. I'd like to suggest a flag that disables this kind of feature.
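One way such a flag could look, as a rough sketch only. The `Context` class, its constructor arguments and the `allow_outer_lookup` name are all hypothetical here; only the `this` / `global_vars` lookup order is taken from the hunk above.

```python
class Context:
    """Hypothetical scope holder with an opt-out for the outer-context fallback."""

    def __init__(self, local_vars, global_vars, allow_outer_lookup=True):
        self.this = dict(local_vars)
        self.global_vars = dict(global_vars)
        # When False, names missing from `this` are not looked up elsewhere.
        self.allow_outer_lookup = allow_outer_lookup

    def resolve(self, name):
        """Look name up locally first, then (optionally) in the outer context."""
        if name in self.this:
            return self.this[name]
        if self.allow_outer_lookup and name in self.global_vars:
            return self.global_vars[name]
        raise KeyError(name)
```

With the flag off, a function extracted in isolation fails loudly on a name it does not define instead of silently picking it up from the surrounding code.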
@Tatsh That sounds great, and I remember seeing such an implementation somewhere, but I didn't understand it well enough to try to adopt it. Can you help?
The main thing is to get the basics of the interpreter in, which includes the built-in types; then the JavaScript portions can be written very similarly to how polyfills are written today. So you would not need to implement … I can take a look later, since this does interest me. Mainly, I needed this to get past CloudFlare anti-DDoS without needing cookies.
Well, I've already implemented operator
I'm not sure when 1. would execute or whether your solution takes care of it.
from .internals import jstype, object_type


def _is_array(arg):
    if jstype(arg) is not object_type:
        return False
    if arg.jsclass == 'Array':
        return True
    return False
I'm not insisting on it, and I don't claim that it's elegant, but this is how I'm able to solve the task at hand.
- adds `jsgrammar.LINETERMINATORSEQ_RE`
- lexer `tstream.TokenStream` checks for lineterminators in tokens
- adds `tstream.Token`
- refactors `tstream.TokenStream` and `jsparser.Parser` to use it
I've added a feature to the lexer ( … This is also useful for my other plan: changing the test suite to use JSON files instead of py to generate test cases. My reasoning behind it is if …
- Adds `jsbuilt_ins.nan` and `jsbuilt_ins.infinity`
- Adds arithmetic operator overload to `jsbuilt_ins.jsnumber.JSNumberPrototype`
- Adds equality operator overload to `jsinterp.Reference`
- Adds better strict equality and typeof operator in `tstream`
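As a rough idea of what strict equality and `typeof` have to get right, here is a sketch over plain Python values. It is not the PR's implementation: the `undefined` sentinel, `js_typeof` and `js_strict_equals` are hypothetical names, and mapping `None` to JavaScript `null` is an assumption of this example.

```python
class _Undefined:
    """Hypothetical stand-in for the JavaScript undefined value."""

    def __repr__(self):
        return 'undefined'


undefined = _Undefined()


def js_typeof(value):
    """Approximate the typeof operator for plain Python values."""
    if value is undefined:
        return 'undefined'
    if value is None:  # assumption: None models null, and typeof null is 'object'
        return 'object'
    if isinstance(value, bool):  # checked before numbers: bool subclasses int
        return 'boolean'
    if isinstance(value, (int, float)):
        return 'number'
    if isinstance(value, str):
        return 'string'
    if callable(value):
        return 'function'
    return 'object'


def js_strict_equals(a, b):
    """Approximate ===: no type coercion, NaN never equals itself."""
    kind = js_typeof(a)
    if kind != js_typeof(b):
        return False
    if kind in ('number', 'boolean', 'string'):
        return a == b  # float('nan') == float('nan') is already False
    return a is b  # objects, functions, null and undefined compare by identity
```

The type comparison up front is what keeps `true === 1` false, even though `True == 1` holds in Python.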
- Refactors `Context` and `Reference` classes into their own module named `environment` (saves local import in `tstream`)
Description of your pull request and other information
I've started to implement an actual JavaScript syntax parser.
-- EDIT --
And moved on to making an interpreter.