Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protobuf parser library and first part of working tests. #7

Merged
merged 16 commits into from
Jun 3, 2021

Conversation

libretto
Copy link
Collaborator

To achieve 1:1 functionality as a SchemaRegistry, we need to use a ProtoBuf schema parser which functionality will be similar to SchemaRegistry used library.

SchemaRegistry uses a Square Wire library. It distributed under Apache 2.0 License. But it is written in Kotlin and no analogical library available in python. To use Kotlin code from Karapace we will need JVM as a dependency. To avoid that we decide to port the required ProtoParser code to Python and use it as part of Karapace. In this pull request, we add ProtoParser code and the first part of ported unittest code for it.

We want to discuss with Aiven representatives which way of using this library is better for the project. May we integrate it now as part of karapace or must we extract from the project and use it as an independent pip library?

@hackaugusto
Copy link

Hey @libretto , thank you for doing this.

May we integrate it now as part of karapace or must we extract from the project and use it as an independent pip library?

We had a little chat and starting with the library inside the karapace repo seems like the better option. The idea is that fixing issues is easier when everything is in the same repo.

Again, thanks for doing this!

@amrutha-shanbhag
Copy link

@hackaugusto would be great if you could review and provide feedback if any.

@hackaugusto
Copy link

Hi @amrutha-shanbhag , my colleagues and I will have a look at the PR. It is a big one, so it will take a bit of time before we go over it. Awesome work, thank you :)

@juha-aiven
Copy link

Hey there. A few comments.

I think it'd be worth mentioning the exact source from where the Python code is originating. That way if there are ever fixes / improvements in the original Kotlin code, maybe it'd be possible to port those to Karapace. I think for example ProtoParserTest in Wire has been converted to a class with same name in Python. Maybe just add a link back to Wire code in a docstring?

Would it be a biggie to use pytest in the tests instead of unittest? Karapace uses that elsewhere. I think it'd mostly involve changing self.assertEqual to assert. For things like CI pipeline it'd be better to have just one test framework.

What comes to the code itself. I think that since the tests have been ported together with the main code, it's very likely things work as expected.

Eventually, we'd write an integration test that runs against Karapace and against Schema Registry and tests protobuf schemas. That'd will reveal if there are any functional differences or problems.

Good work so far! Thanks.

Copy link

@h3nd24 h3nd24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks really great to me. Just a comment on the unit test. Would it be possible to add more validation errors on each type of element? Mainly to show what happens when certain rules for a particular element is violated, robustness against syntax error, and sensible error messages.

@libretto
Copy link
Collaborator Author

There are only 20% of tests ported yet. I continue porting them.

@libretto
Copy link
Collaborator Author

libretto commented May 28, 2021

@juha-aiven I converted class naming from Kotlin to Python 1:1 (it both are UpperCamelCase) so class ProtoFileElement in Kotlin has the same name in Python. Also, most methods arguments' names and method names themselves were converted from lowerCamelCase in Kotlin to lowercase_separated_by_underscores in Python. Also, the same way was converted modules naming. There is the only difference in paths as soon as Kotlin classes are located in different directories. I will add the path to .kt file in the .py file for reference.

I selected unittest because it is similar to JUnit by asserts and it adds some comfort and performance to porting process. But now when the test porting process is fast enough and fewer errors appear I can convert it to pytest at once...

@juha-aiven
Copy link

OK, sounds good. So, what's remaining to finish the feature is:

  1. Port the remaining tests from Kotlin to Python
  2. Change unittest to pytest
  3. Implement what's needed on the Karapace.
  4. Add integration tests
  5. Verify compatibility with Schema Registry

The steps 3-5 are kind of simultaneous.

Anything else?

@mrlika
Copy link
Collaborator

mrlika commented May 29, 2021

OK, sounds good. So, what's remaining to finish the feature is:

1. Port the remaining tests from Kotlin to Python

2. Change `unittest` to `pytest`

3. Implement what's needed on the Karapace.

4. Add integration tests

5. Verify compatibility with Schema Registry

The steps 3-5 are kind of simultaneous.

Anything else?

Everything correct. Change unittest to pytest and mentioning the exact source will be done in this PR.

We are going to focus on Implement what's needed on the Karapace. after that. It will be implemented in the next PRs.

Then Instaclustr team will work on 1, 4 while Karapace team can start verifying compatibility (5).

@juha-aiven
Copy link

Anything else?

Everything correct. Change unittest to pytest and mentioning the exact source will be done in this PR.

We are going to focus on Implement what's needed on the Karapace. after that. It will be implemented in the next PRs.

Then Instaclustr team will work on 1, 4 while Karapace team can start verifying compatibility (5).

Sounds like a plan. You can change this PR to a draft and mark it for review when it's time to merge it in.

@libretto
Copy link
Collaborator Author

@juha-aiven I changed test framework to pytest and added references to the square/wire. I think this PR coding is finished. And next part of tests, library utilization e.t.c will be in other PRs.

@juha-aiven
Copy link

Good. There's now pytest.

Could you open a PR in https://github.com/aiven/karapace? Now the #6 is already merged and this PR builds on top of that. Seeing the overall diff with https://github.com/aiven/karapace master is a little hard.

I think we should in https://github.com/aiven/karapace have first a PR that pulls in the stuff in this PR, essentially the protobuf parsing and the associated tests. At this point we wouldn't expose Protobu schema support in the API.

After that PR is merged, the next PR would add the Protobuf support to Karapace. That PR should also contain integration tests. We don't want to go live with Protobuf support without good integration test coverage.

How does this sound?

@juha-aiven
Copy link

The other approach is that the feature is implemented fully in this fork (with integration tests and all) and merged then to https://github.com/aiven/karapace? So far it looks like this is the approach taken. This works for us too.

@mrlika
Copy link
Collaborator

mrlika commented Jun 2, 2021

The other approach is that the feature is implemented fully in this fork (with integration tests and all) and merged then to https://github.com/aiven/karapace? So far it looks like this is the approach taken. This works for us too.

We are going to implement the full feature in this fork. We keep the fork in sync with aiven/karapace.

@h3nd24 h3nd24 merged commit a9abaee into master Jun 3, 2021
Comment on lines +18 to +19
def min_of(a: int, b: int) -> int:
return a if a < b else b

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This too can be replaced with a language built-in min

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment on lines +8 to +15
def hex_digit(c: str) -> int:
if ord(c) in range(ord('0'), ord('9') + 1):
return ord(c) - ord('0')
if ord(c) in range(ord('a'), ord('f') + 1):
return ord(c) - ord('a') + 10
if ord(c) in range(ord('A'), ord('F') + 1):
return ord(c) - ord('A') + 10
return -1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this utility should be removed, the int constructor can be used for this:

>>> int('9', 16)
9
>>> int('a', 16)
10
>>> int('n', 16)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: invalid literal for int() with base 16: 'n'

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment on lines +123 to +138
def read_numeric_escape(self, radix: int, length: int) -> str:
value = -1
end_pos = min_of(self.pos + length, len(self.data))
while self.pos < end_pos:
digit = hex_digit(self.data[self.pos])
if digit == -1 or digit >= radix:
break

if value < 0:
value = digit
else:
value = value * radix + digit
self.pos += 1

self.expect(value >= 0, "expected a digit after \\x or \\X")
return chr(value)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be simplified with the use of slices and the built-ins. It should be faster and less-error prone

def read_numeric_escape(self, radix: int, length: int) -> str:
    codepoint = int(self.data[self.pos:self.pos + length], radix)
    return chr(codepoint)

To clarify:

  • Adjusting the range is not necessary, this is done automatically:
>>> s = "this is a long string\x01"
>>> len(s)
22
>>> s[21:40]
'\x01'
  • Since this is called after the \x, the int constructor can be used (and exception is raised if the value is invalid, you may want to caught it and re-throw):
>>> int('ff', 16)
255
>>> int('11', 8)
9

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: How is the encoding determined? chr may need some tweaking.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is part of parsing library. Number of digits is unknown so we cannot use int() call there as soon as we will receive exception.

Copy link
Collaborator Author

@libretto libretto Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About encoding - I cannot find info about affect of encoding on python call chr( int ).

This code was ported from Kotlin
chr(value) is port of Kotlin's value.toChar()
In both cases it is unicode representation of least 16 bit of int value.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of digits is unknown so we cannot use int() call there as soon as we will receive exception.

Isn't that what the arg `length' does?

In both cases it is unicode representation of least 16 bit of int value.

cool, thanks for checking.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Length limits number of chars which must be parsed But in case if non digit met then routine return number which found or exception if no number found. So 0x1\t must return 0x1 but int("0x1\t") return exception.

syntax_reader.py contain code fully ported from square/wire using their parsing method. It is reason why there no regexps. Because they absent in Kotlin code too. I think optimizing and improving of this part must be done only in case of performance issues and benchmarks.

Comment on lines +199 to +200
except OSError as err:
print("OS error: {0}".format(err))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would throw this exception?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed it...

try:
radix = 10
if tag.startswith("0x") or tag.startswith("0X"):
tag = tag[len("0x"):]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary, the int constructor should handle this already:

>>> int('0x16',16)
22
>>> int('0X16',16)
22

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it maybe removed.
tag = tag[len("0x"):]

self.expect(self.read_char() == ',', "expected ','")
value_type = self.read_data_type()

self.expect(self.read_char() == '>', "expected '>'")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self.require('>')

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

answered above

Comment on lines +179 to +186
while self.pos < len(self.data):
c = self.data[self.pos]
if ord(c) in range(ord('a'), ord('z') + 1) \
or ord(c) in range(ord('A'), ord('Z') + 1) \
or ord(c) in range(ord('0'), ord('9') + 1) or c in ['_', '-', '.']:
self.pos += 1
else:
break

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably be re-written as a regex:

import re

WORD_RE = re.compile(rb"[a-zA-Z0-9_-.]+")

match = WORD_RE.match(self.data[self.pos:])
if match:
    self.advance(len(match.string))
    reutrn match.string
    
self.unexpected("expected a word")

Question: Is there a definition of a work somewhere? I'm surprised that numbers and punctuation are allowed at the beginning of a word To answer my own question, a word seems to be a lexical element, this is also used to parse numbers.

Copy link
Collaborator Author

@libretto libretto Nov 11, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is direct port of square/wire Kotlin code. We had no plan to optimize or improve it yet.

# return (MIN_TAG_VALUE <= self <= RESERVED_TAG_VALUE_START) or\
# (RESERVED_TAG_VALUE_END + 1 <= self <= MAX_TAG_VALUE + 1)

# builtins.int = MyInt

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't monkey patch built-ins

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a workaround when i worked on porting and I forget to remove it...

RESERVED_TAG_VALUE_END = 19999
""" True if the supplied value is in the valid tag range and not reserved. """

# class MyInt(int):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dead code should be removed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed already

from karapace.protobuf.location import Location


class TypeElement:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this is used as an immutable value, as a result of the parser. It would make sense to have them as frozen dataclasses instead. Like this:

from karapace.protobuf.location import Location
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TypeElement:
    location: Location
    name: str
    documentation: str
    options: list
    nested_types: List['TypeElement']

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants