-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Protobuf parser library and first part of working tests. #7
Conversation
Hey @libretto , thank you for doing this.
We had a little chat and starting with the library inside the karapace repo seems like the better option. The idea is that fixing issues is easier when everything is in the same repo. Again, thanks for doing this! |
@hackaugusto would be great if you could review and provide feedback if any. |
Hi @amrutha-shanbhag , my colleagues and I will have a look at the PR. It is a big one, so it will take a bit of time before we go over it. Awesome work, thank you :) |
Hey there. A few comments. I think it'd be worth mentioning the exact source from where the Python code is originating. That way if there are ever fixes / improvements in the original Kotlin code, maybe it'd be possible to port those to Karapace. I think for example Would it be a biggie to use What comes to the code itself. I think that since the tests have been ported together with the main code, it's very likely things work as expected. Eventually, we'd write an integration test that runs against Karapace and against Schema Registry and tests protobuf schemas. That'd will reveal if there are any functional differences or problems. Good work so far! Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks really great to me. Just a comment on the unit test. Would it be possible to add more validation errors on each type of element? Mainly to show what happens when certain rules for a particular element is violated, robustness against syntax error, and sensible error messages.
There are only 20% of tests ported yet. I continue porting them. |
@juha-aiven I converted class naming from Kotlin to Python 1:1 (it both are UpperCamelCase) so class ProtoFileElement in Kotlin has the same name in Python. Also, most methods arguments' names and method names themselves were converted from lowerCamelCase in Kotlin to lowercase_separated_by_underscores in Python. Also, the same way was converted modules naming. There is the only difference in paths as soon as Kotlin classes are located in different directories. I will add the path to .kt file in the .py file for reference. I selected unittest because it is similar to JUnit by asserts and it adds some comfort and performance to porting process. But now when the test porting process is fast enough and fewer errors appear I can convert it to pytest at once... |
OK, sounds good. So, what's remaining to finish the feature is:
The steps 3-5 are kind of simultaneous. Anything else? |
Everything correct. Change We are going to focus on Then Instaclustr team will work on 1, 4 while Karapace team can start verifying compatibility (5). |
Sounds like a plan. You can change this PR to a draft and mark it for review when it's time to merge it in. |
@juha-aiven I changed test framework to pytest and added references to the square/wire. I think this PR coding is finished. And next part of tests, library utilization e.t.c will be in other PRs. |
Good. There's now Could you open a PR in https://github.com/aiven/karapace? Now the #6 is already merged and this PR builds on top of that. Seeing the overall diff with https://github.com/aiven/karapace master is a little hard. I think we should in https://github.com/aiven/karapace have first a PR that pulls in the stuff in this PR, essentially the protobuf parsing and the associated tests. At this point we wouldn't expose Protobu schema support in the API. After that PR is merged, the next PR would add the Protobuf support to Karapace. That PR should also contain integration tests. We don't want to go live with Protobuf support without good integration test coverage. How does this sound? |
The other approach is that the feature is implemented fully in this fork (with integration tests and all) and merged then to https://github.com/aiven/karapace? So far it looks like this is the approach taken. This works for us too. |
We are going to implement the full feature in this fork. We keep the fork in sync with |
def min_of(a: int, b: int) -> int: | ||
return a if a < b else b |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This too can be replaced with a language built-in min
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
def hex_digit(c: str) -> int: | ||
if ord(c) in range(ord('0'), ord('9') + 1): | ||
return ord(c) - ord('0') | ||
if ord(c) in range(ord('a'), ord('f') + 1): | ||
return ord(c) - ord('a') + 10 | ||
if ord(c) in range(ord('A'), ord('F') + 1): | ||
return ord(c) - ord('A') + 10 | ||
return -1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this utility should be removed, the int constructor can be used for this:
>>> int('9', 16)
9
>>> int('a', 16)
10
>>> int('n', 16)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 16: 'n'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
def read_numeric_escape(self, radix: int, length: int) -> str: | ||
value = -1 | ||
end_pos = min_of(self.pos + length, len(self.data)) | ||
while self.pos < end_pos: | ||
digit = hex_digit(self.data[self.pos]) | ||
if digit == -1 or digit >= radix: | ||
break | ||
|
||
if value < 0: | ||
value = digit | ||
else: | ||
value = value * radix + digit | ||
self.pos += 1 | ||
|
||
self.expect(value >= 0, "expected a digit after \\x or \\X") | ||
return chr(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be simplified with the use of slices and the built-ins. It should be faster and less-error prone
def read_numeric_escape(self, radix: int, length: int) -> str:
codepoint = int(self.data[self.pos:self.pos + length], radix)
return chr(codepoint)
To clarify:
- Adjusting the range is not necessary, this is done automatically:
>>> s = "this is a long string\x01"
>>> len(s)
22
>>> s[21:40]
'\x01'
- Since this is called after the
\x
, theint
constructor can be used (and exception is raised if the value is invalid, you may want to caught it and re-throw):
>>> int('ff', 16)
255
>>> int('11', 8)
9
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: How is the encoding determined? chr
may need some tweaking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is part of parsing library. Number of digits is unknown so we cannot use int() call there as soon as we will receive exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About encoding - I cannot find info about affect of encoding on python call chr( int ).
This code was ported from Kotlin
chr(value) is port of Kotlin's value.toChar()
In both cases it is unicode representation of least 16 bit of int value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Number of digits is unknown so we cannot use int() call there as soon as we will receive exception.
Isn't that what the arg `length' does?
In both cases it is unicode representation of least 16 bit of int value.
cool, thanks for checking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Length limits number of chars which must be parsed But in case if non digit met then routine return number which found or exception if no number found. So 0x1\t must return 0x1 but int("0x1\t") return exception.
syntax_reader.py contain code fully ported from square/wire using their parsing method. It is reason why there no regexps. Because they absent in Kotlin code too. I think optimizing and improving of this part must be done only in case of performance issues and benchmarks.
except OSError as err: | ||
print("OS error: {0}".format(err)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what would throw this exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it...
try: | ||
radix = 10 | ||
if tag.startswith("0x") or tag.startswith("0X"): | ||
tag = tag[len("0x"):] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unnecessary, the int
constructor should handle this already:
>>> int('0x16',16)
22
>>> int('0X16',16)
22
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it maybe removed.
tag = tag[len("0x"):]
self.expect(self.read_char() == ',', "expected ','") | ||
value_type = self.read_data_type() | ||
|
||
self.expect(self.read_char() == '>', "expected '>'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.require('>')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
answered above
while self.pos < len(self.data): | ||
c = self.data[self.pos] | ||
if ord(c) in range(ord('a'), ord('z') + 1) \ | ||
or ord(c) in range(ord('A'), ord('Z') + 1) \ | ||
or ord(c) in range(ord('0'), ord('9') + 1) or c in ['_', '-', '.']: | ||
self.pos += 1 | ||
else: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably be re-written as a regex:
import re
WORD_RE = re.compile(rb"[a-zA-Z0-9_-.]+")
match = WORD_RE.match(self.data[self.pos:])
if match:
self.advance(len(match.string))
reutrn match.string
self.unexpected("expected a word")
Question: Is there a definition of a work somewhere? I'm surprised that numbers and punctuation are allowed at the beginning of a word To answer my own question, a word seems to be a lexical element, this is also used to parse numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is direct port of square/wire Kotlin code. We had no plan to optimize or improve it yet.
# return (MIN_TAG_VALUE <= self <= RESERVED_TAG_VALUE_START) or\ | ||
# (RESERVED_TAG_VALUE_END + 1 <= self <= MAX_TAG_VALUE + 1) | ||
|
||
# builtins.int = MyInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't monkey patch built-ins
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a workaround when i worked on porting and I forget to remove it...
RESERVED_TAG_VALUE_END = 19999 | ||
""" True if the supplied value is in the valid tag range and not reserved. """ | ||
|
||
# class MyInt(int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The dead code should be removed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed already
from karapace.protobuf.location import Location | ||
|
||
|
||
class TypeElement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like this is used as an immutable value, as a result of the parser. It would make sense to have them as frozen dataclasses instead. Like this:
from karapace.protobuf.location import Location
from dataclasses import dataclass
from typing import List
@dataclass(frozen=True)
class TypeElement:
location: Location
name: str
documentation: str
options: list
nested_types: List['TypeElement']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
To achieve 1:1 functionality as a SchemaRegistry, we need to use a ProtoBuf schema parser which functionality will be similar to SchemaRegistry used library.
SchemaRegistry uses a Square Wire library. It distributed under Apache 2.0 License. But it is written in Kotlin and no analogical library available in python. To use Kotlin code from Karapace we will need JVM as a dependency. To avoid that we decide to port the required ProtoParser code to Python and use it as part of Karapace. In this pull request, we add ProtoParser code and the first part of ported unittest code for it.
We want to discuss with Aiven representatives which way of using this library is better for the project. May we integrate it now as part of karapace or must we extract from the project and use it as an independent pip library?