Protobuf parser library and first part of working tests. #7

libretto · 2021-05-19T14:57:58Z

To achieve 1:1 functionality as a SchemaRegistry, we need to use a ProtoBuf schema parser which functionality will be similar to SchemaRegistry used library.

SchemaRegistry uses a Square Wire library. It distributed under Apache 2.0 License. But it is written in Kotlin and no analogical library available in python. To use Kotlin code from Karapace we will need JVM as a dependency. To avoid that we decide to port the required ProtoParser code to Python and use it as part of Karapace. In this pull request, we add ProtoParser code and the first part of ported unittest code for it.

We want to discuss with Aiven representatives which way of using this library is better for the project. May we integrate it now as part of karapace or must we extract from the project and use it as an independent pip library?

… PR #1

…ed by this tests

hackaugusto · 2021-05-21T11:33:42Z

Hey @libretto , thank you for doing this.

May we integrate it now as part of karapace or must we extract from the project and use it as an independent pip library?

We had a little chat and starting with the library inside the karapace repo seems like the better option. The idea is that fixing issues is easier when everything is in the same repo.

Again, thanks for doing this!

amrutha-shanbhag · 2021-05-26T22:28:10Z

@hackaugusto would be great if you could review and provide feedback if any.

hackaugusto · 2021-05-27T08:02:09Z

Hi @amrutha-shanbhag , my colleagues and I will have a look at the PR. It is a big one, so it will take a bit of time before we go over it. Awesome work, thank you :)

juha-aiven · 2021-05-27T11:30:23Z

Hey there. A few comments.

I think it'd be worth mentioning the exact source from where the Python code is originating. That way if there are ever fixes / improvements in the original Kotlin code, maybe it'd be possible to port those to Karapace. I think for example ProtoParserTest in Wire has been converted to a class with same name in Python. Maybe just add a link back to Wire code in a docstring?

Would it be a biggie to use pytest in the tests instead of unittest? Karapace uses that elsewhere. I think it'd mostly involve changing self.assertEqual to assert. For things like CI pipeline it'd be better to have just one test framework.

What comes to the code itself. I think that since the tests have been ported together with the main code, it's very likely things work as expected.

Eventually, we'd write an integration test that runs against Karapace and against Schema Registry and tests protobuf schemas. That'd will reveal if there are any functional differences or problems.

Good work so far! Thanks.

h3nd24

I think this looks really great to me. Just a comment on the unit test. Would it be possible to add more validation errors on each type of element? Mainly to show what happens when certain rules for a particular element is violated, robustness against syntax error, and sensible error messages.

libretto · 2021-05-28T14:16:07Z

There are only 20% of tests ported yet. I continue porting them.

libretto · 2021-05-28T14:44:16Z

@juha-aiven I converted class naming from Kotlin to Python 1:1 (it both are UpperCamelCase) so class ProtoFileElement in Kotlin has the same name in Python. Also, most methods arguments' names and method names themselves were converted from lowerCamelCase in Kotlin to lowercase_separated_by_underscores in Python. Also, the same way was converted modules naming. There is the only difference in paths as soon as Kotlin classes are located in different directories. I will add the path to .kt file in the .py file for reference.

I selected unittest because it is similar to JUnit by asserts and it adds some comfort and performance to porting process. But now when the test porting process is fast enough and fewer errors appear I can convert it to pytest at once...

juha-aiven · 2021-05-28T15:03:32Z

OK, sounds good. So, what's remaining to finish the feature is:

Port the remaining tests from Kotlin to Python
Change unittest to pytest
Implement what's needed on the Karapace.
Add integration tests
Verify compatibility with Schema Registry

The steps 3-5 are kind of simultaneous.

Anything else?

mrlika · 2021-05-29T07:20:24Z

OK, sounds good. So, what's remaining to finish the feature is:
1. Port the remaining tests from Kotlin to Python

2. Change `unittest` to `pytest`

3. Implement what's needed on the Karapace.

4. Add integration tests

5. Verify compatibility with Schema Registry
The steps 3-5 are kind of simultaneous.

Anything else?

Everything correct. Change unittest to pytest and mentioning the exact source will be done in this PR.

We are going to focus on Implement what's needed on the Karapace. after that. It will be implemented in the next PRs.

Then Instaclustr team will work on 1, 4 while Karapace team can start verifying compatibility (5).

juha-aiven · 2021-05-31T07:10:30Z

Anything else?

Everything correct. Change unittest to pytest and mentioning the exact source will be done in this PR.

We are going to focus on Implement what's needed on the Karapace. after that. It will be implemented in the next PRs.

Then Instaclustr team will work on 1, 4 while Karapace team can start verifying compatibility (5).

Sounds like a plan. You can change this PR to a draft and mark it for review when it's time to merge it in.

libretto · 2021-05-31T13:55:18Z

@juha-aiven I changed test framework to pytest and added references to the square/wire. I think this PR coding is finished. And next part of tests, library utilization e.t.c will be in other PRs.

juha-aiven · 2021-06-02T08:15:35Z

Good. There's now pytest.

Could you open a PR in https://github.com/aiven/karapace? Now the #6 is already merged and this PR builds on top of that. Seeing the overall diff with https://github.com/aiven/karapace master is a little hard.

I think we should in https://github.com/aiven/karapace have first a PR that pulls in the stuff in this PR, essentially the protobuf parsing and the associated tests. At this point we wouldn't expose Protobu schema support in the API.

After that PR is merged, the next PR would add the Protobuf support to Karapace. That PR should also contain integration tests. We don't want to go live with Protobuf support without good integration test coverage.

How does this sound?

juha-aiven · 2021-06-02T08:21:10Z

The other approach is that the feature is implemented fully in this fork (with integration tests and all) and merged then to https://github.com/aiven/karapace? So far it looks like this is the approach taken. This works for us too.

mrlika · 2021-06-02T12:08:08Z

The other approach is that the feature is implemented fully in this fork (with integration tests and all) and merged then to https://github.com/aiven/karapace? So far it looks like this is the approach taken. This works for us too.

We are going to implement the full feature in this fork. We keep the fork in sync with aiven/karapace.

hackaugusto · 2021-10-28T14:27:25Z

karapace/protobuf/syntax_reader.py

+def min_of(a: int, b: int) -> int:
+    return a if a < b else b


This too can be replaced with a language built-in min

hackaugusto · 2021-10-28T14:29:14Z

karapace/protobuf/syntax_reader.py

+def hex_digit(c: str) -> int:
+    if ord(c) in range(ord('0'), ord('9') + 1):
+        return ord(c) - ord('0')
+    if ord(c) in range(ord('a'), ord('f') + 1):
+        return ord(c) - ord('a') + 10
+    if ord(c) in range(ord('A'), ord('F') + 1):
+        return ord(c) - ord('A') + 10
+    return -1


I believe this utility should be removed, the int constructor can be used for this:

>>> int('9', 16) 9 >>> int('a', 16) 10 >>> int('n', 16) Traceback (most recent call last): File "<stdin>", line 1, in <module> ValueError: invalid literal for int() with base 16: 'n'

hackaugusto · 2021-10-28T14:36:48Z

karapace/protobuf/syntax_reader.py

+    def read_numeric_escape(self, radix: int, length: int) -> str:
+        value = -1
+        end_pos = min_of(self.pos + length, len(self.data))
+        while self.pos < end_pos:
+            digit = hex_digit(self.data[self.pos])
+            if digit == -1 or digit >= radix:
+                break
+
+            if value < 0:
+                value = digit
+            else:
+                value = value * radix + digit
+            self.pos += 1
+
+        self.expect(value >= 0, "expected a digit after \\x or \\X")
+        return chr(value)


This can be simplified with the use of slices and the built-ins. It should be faster and less-error prone

def read_numeric_escape(self, radix: int, length: int) -> str: codepoint = int(self.data[self.pos:self.pos + length], radix) return chr(codepoint)

To clarify:

Adjusting the range is not necessary, this is done automatically:

>>> s = "this is a long string\x01" >>> len(s) 22 >>> s[21:40] '\x01'

Since this is called after the \x, the int constructor can be used (and exception is raised if the value is invalid, you may want to caught it and re-throw):

>>> int('ff', 16) 255 >>> int('11', 8) 9

Question: How is the encoding determined? chr may need some tweaking.

This code is part of parsing library. Number of digits is unknown so we cannot use int() call there as soon as we will receive exception.

About encoding - I cannot find info about affect of encoding on python call chr( int ).

This code was ported from Kotlin
chr(value) is port of Kotlin's value.toChar()
In both cases it is unicode representation of least 16 bit of int value.

Number of digits is unknown so we cannot use int() call there as soon as we will receive exception.

Isn't that what the arg `length' does?

In both cases it is unicode representation of least 16 bit of int value.

cool, thanks for checking.

Length limits number of chars which must be parsed But in case if non digit met then routine return number which found or exception if no number found. So 0x1\t must return 0x1 but int("0x1\t") return exception.

syntax_reader.py contain code fully ported from square/wire using their parsing method. It is reason why there no regexps. Because they absent in Kotlin code too. I think optimizing and improving of this part must be done only in case of performance issues and benchmarks.

hackaugusto · 2021-10-28T14:39:36Z

karapace/protobuf/syntax_reader.py

+        except OSError as err:
+            print("OS error: {0}".format(err))


what would throw this exception?

I removed it...

hackaugusto · 2021-10-28T14:40:59Z

karapace/protobuf/syntax_reader.py

+        try:
+            radix = 10
+            if tag.startswith("0x") or tag.startswith("0X"):
+                tag = tag[len("0x"):]


This is unnecessary, the int constructor should handle this already:

>>> int('0x16',16) 22 >>> int('0X16',16) 22

yes it maybe removed.
tag = tag[len("0x"):]

hackaugusto · 2021-10-28T15:05:24Z

karapace/protobuf/syntax_reader.py

+            self.expect(self.read_char() == ',', "expected ','")
+            value_type = self.read_data_type()
+
+            self.expect(self.read_char() == '>', "expected '>'")


self.require('>')

answered above

hackaugusto · 2021-10-28T15:08:09Z

karapace/protobuf/syntax_reader.py

+        while self.pos < len(self.data):
+            c = self.data[self.pos]
+            if ord(c) in range(ord('a'), ord('z') + 1) \
+                    or ord(c) in range(ord('A'), ord('Z') + 1) \
+                    or ord(c) in range(ord('0'), ord('9') + 1) or c in ['_', '-', '.']:
+                self.pos += 1
+            else:
+                break


This should probably be re-written as a regex:

import re WORD_RE = re.compile(rb"[a-zA-Z0-9_-.]+") match = WORD_RE.match(self.data[self.pos:]) if match: self.advance(len(match.string)) reutrn match.string self.unexpected("expected a word")

~~Question: Is there a definition of a work somewhere? I'm surprised that numbers and punctuation are allowed at the beginning of a word~~ To answer my own question, a word seems to be a lexical element, this is also used to parse numbers.

It is direct port of square/wire Kotlin code. We had no plan to optimize or improve it yet.

hackaugusto · 2021-10-28T15:22:51Z

karapace/protobuf/utils.py

+#        return (MIN_TAG_VALUE <= self <= RESERVED_TAG_VALUE_START) or\
+#               (RESERVED_TAG_VALUE_END + 1 <= self <= MAX_TAG_VALUE + 1)
+
+# builtins.int = MyInt


Please don't monkey patch built-ins

It was a workaround when i worked on porting and I forget to remove it...

hackaugusto · 2021-10-28T15:23:01Z

karapace/protobuf/utils.py

+RESERVED_TAG_VALUE_END = 19999
+""" True if the supplied value is in the valid tag range and not reserved.  """
+
+# class MyInt(int):


The dead code should be removed

removed already

hackaugusto · 2021-10-28T15:26:21Z

karapace/protobuf/type_element.py

+from karapace.protobuf.location import Location
+
+
+class TypeElement:


It looks like this is used as an immutable value, as a result of the parser. It would make sense to have them as frozen dataclasses instead. Like this:

from karapace.protobuf.location import Location from dataclasses import dataclass from typing import List @dataclass(frozen=True) class TypeElement: location: Location name: str documentation: str options: list nested_types: List['TypeElement']

libretto added 12 commits April 21, 2021 13:55

Add protobuf skeleton

d3aff39

Add skeleton files

46e23c7

remove unfinished tests

f6be627

fixup lint errors

4b2fdb7

Changed project structure, and added one test and debugged issues for…

e28b735

… PR #1

fixup lint issues

3063f58

protobuf parser draft save

62e56eb

beta version of prot_parser class (no dependencies)

da4fa3b

lint issues fuxup

8ffba9c

Ported first part of tests from Wire project. ProtoParser code debugg…

adb145c

…ed by this tests

Merge remote-tracking branch 'origin/master' into protobuf-parser

e37d3e8

fixup lint problem

ad490fb

libretto requested review from amrutha-shanbhag and mrlika May 19, 2021 19:02

libretto mentioned this pull request May 19, 2021

Add support for ProtoBuf Aiven-Open/karapace#67

Closed

amrutha-shanbhag requested a review from h3nd24 May 19, 2021 23:17

libretto added 2 commits May 20, 2021 10:05

Merge branch 'master' into protobuf-parser

0677729

fixup lint issues after conflict remove

af054c7

another part of tests added

fcfabb3

mrlika approved these changes May 26, 2021

View reviewed changes

h3nd24 reviewed May 28, 2021

View reviewed changes

change project to pytest and add references to square/wire project files

e460350

h3nd24 approved these changes Jun 2, 2021

View reviewed changes

h3nd24 merged commit a9abaee into master Jun 3, 2021

hackaugusto reviewed Oct 28, 2021

View reviewed changes

hackaugusto mentioned this pull request Oct 28, 2021

Protobuf compatibility functionality implementation. #13

Merged

		from karapace.protobuf.location import Location


		class TypeElement:

Protobuf parser library and first part of working tests. #7

Protobuf parser library and first part of working tests. #7

Conversation

libretto commented May 19, 2021

hackaugusto commented May 21, 2021

amrutha-shanbhag commented May 26, 2021

hackaugusto commented May 27, 2021

juha-aiven commented May 27, 2021

h3nd24 left a comment • edited Loading

Choose a reason for hiding this comment

libretto commented May 28, 2021

libretto commented May 28, 2021 • edited Loading

juha-aiven commented May 28, 2021

mrlika commented May 29, 2021

juha-aiven commented May 31, 2021

libretto commented May 31, 2021

juha-aiven commented Jun 2, 2021

juha-aiven commented Jun 2, 2021

mrlika commented Jun 2, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

libretto Nov 10, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

libretto Nov 11, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

h3nd24 left a comment •

edited

Loading

libretto commented May 28, 2021 •

edited

Loading

libretto Nov 10, 2021 •

edited

Loading

libretto Nov 11, 2021 •

edited

Loading