feat: taxonomy parser library #18

BryanH01 · 2022-07-06T19:30:03Z

What

Created parser python script in parser folder
It's currently only harvesting data and creating the nodes

I'm not sure about the use of a class here.

Related issue(s)

Fixes #22

Only do the data harvesting part Also, i'm not sure about the use of a class here

small mistake

Correcting some errors

Removing print lines

- Changed harvesting function - Added some comments - Added creating relations function : parent

parser/Parser.py

Changed with Alex's comments, i think

Changed script for it to be more pythonic I kept the start in file_iter because i use a line 2 times For the tests, i don't really know what to test.

alexgarel · 2022-07-15T17:17:23Z

parser/Parser/Parser.py

+        self.filename=filename
+
+    def file_name(self):
+        """open the file filename and return the list of entries, each entry is a list a the sanitized lines"""


update documentation ;-)

alexgarel

You made good progress, kudos.

I commented a lot, hoping it helps you learn.

Also, please rename parser/Parser/Parser.py to parser/parser/parser.py (no uppercase in file or directory names).

alexgarel · 2022-07-15T17:18:46Z

parser/Parser/Parser.py

+
+    def file_name(self):
+        """open the file filename and return the list of entries, each entry is a list a the sanitized lines"""
+        self.filename = self.filename + ( '.txt' if (len(self.filename)<4 or self.filename[-4:]!=".txt") else '')  #to not get error if extension is missing


It's not good practice to modify the attribute in place like this (not re-entrant).

I propose:

rename this function to normalized_filename

in __init__, call: self.filename = self.normalized_filename(filename)
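A minimal sketch of that proposal, assuming a much simplified Parser class (the real class holds more state, and only the method name comes from the comment above). The method returns the normalized name instead of overwriting the attribute, so calling it several times gives the same result:

```python
class Parser:
    """Simplified stand-in for the reviewed class (illustrative only)."""

    def __init__(self, filename):
        # normalize once, at construction time
        self.filename = self.normalized_filename(filename)

    def normalized_filename(self, filename):
        """Return filename with a .txt extension, adding it if missing."""
        return filename if filename.endswith(".txt") else filename + ".txt"
```

With this shape, Parser("test").filename and Parser("test.txt").filename are both "test.txt".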

alexgarel · 2022-07-15T17:21:23Z

parser/Parser/Parser.py

+        counter=0
+        with open(self.filename,"r",encoding="utf8") as file:
+            for line in file:
+                if counter < start : counter+=1


For this kind of check, it's ok to use a "continue" statement and avoid the else:

    if counter < start:
        counter += 1
        continue

Even better, you can use itertools.islice (its arguments are positional only):

    for line in itertools.islice(file, start, None):
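As a small illustration of the islice approach, here is a sketch that skips the first lines of a file-like object. The helper name lines_from and the sample content are made up for the example; note that itertools.islice only accepts positional arguments, in the order (iterable, start, stop):

```python
import io
import itertools


def lines_from(file, start=0):
    """Yield the lines of an open file, skipping the first `start` lines."""
    # stop=None means "read until the end of the file"
    for line in itertools.islice(file, start, None):
        yield line.rstrip("\n")


# io.StringIO stands in for a real taxonomy file here
sample = io.StringIO("# header\n\nen:meat\nfr:viande\n")
print(list(lines_from(sample, start=2)))  # ['en:meat', 'fr:viande']
```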

parser/Parser/Parser.py

alexgarel · 2022-07-15T17:50:25Z

parser/Parser/test_Parser.py

+    language_code_prefix = re.compile('[a-zA-Z][a-zA-Z]:')
+    for line in file:
+        counter+=1
+        assert line == '' or line[0]=="#" or 'stopword' in line or 'synonym' in line or line[0]=="<" or language_code_prefix.match(line) or ":" in line


Please reformat long lines :-) (120 characters is an absolute maximum).
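One possible way to shorten the assert, sketched under the assumption that the checks can move into a helper (is_expected_line is a hypothetical name, not part of the PR). Each condition sits on its own line inside parentheses, which is also the layout black produces:

```python
import re

LANGUAGE_CODE_PREFIX = re.compile("[a-zA-Z][a-zA-Z]:")


def is_expected_line(line):
    """Same checks as the long assert, one condition per line."""
    return (
        line == ""
        or line[0] == "#"
        or "stopword" in line
        or "synonym" in line
        or line[0] == "<"
        or bool(LANGUAGE_CODE_PREFIX.match(line))
        or ":" in line
    )
```

The test loop then reduces to assert is_expected_line(line).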

parser/Parser/Parser.py

parser/Parser/test_Parser.py

alexgarel · 2022-07-15T17:58:07Z

parser/Parser/test_Parser.py

+    x=Parser("test")
+    x.file_name()
+    data=x.harvest()
+    test_data=["# test taxonomy",


Maybe you can put it in a test.json file that you parse to get test_data (json.load)? It would avoid cluttering the test, and we would immediately see that it goes along with test.txt.
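A sketch of that idea, assuming the expected lines are stored as a JSON list. The helper load_expected is hypothetical, and a temporary file stands in for the real test.json so that the example runs on its own:

```python
import json
import tempfile
from pathlib import Path


def load_expected(path):
    """Read the expected data for a test from a JSON file."""
    with open(path, encoding="utf8") as f:
        return json.load(f)


# demo: write a tiny stand-in for test.json, then read it back
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "test.json"
    json_path.write_text(json.dumps(["# test taxonomy", ""]), encoding="utf8")
    test_data = load_expected(json_path)
```

In the real test, the path would simply point at the test.json file sitting next to test.txt.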

aadarsh-ram

Hi! @alexgarel has already mentioned many comments regarding your code. I also have some comments and suggestions regarding your code.

parser/Parser/test_Parser.py

parser/Parser/Parser.py

Correcting a mistake

added a detail

alexgarel

Good progress, commenting as always ;-)

parser/parser/test_taxonomy_parser.py

parser/parser/taxonomy_parser.py

parser/parser/test_taxonomy_parser.py

alexgarel · 2022-07-21T14:26:13Z

parser/parser/test_taxonomy_parser.py

+    text=x.normalizing(text,"fr")
+    assert text == "random-language-with-accents"
+
+def test_create_nodes():


You should put integration tests in a different file (even better, create tests/unit and tests/integration and put each kind of test in its own directory).

It's always better to distinguish them.

You should add a session fixture to remove all entries at startup (using a Parser instance if you want, or creating a session).

see https://docs.pytest.org/en/7.1.x/how-to/fixtures.html

parser/parser/test_taxonomy_parser.py

For the import of parser in test file, i didn't succeed to import easily.

alexgarel · 2022-07-25T18:00:14Z

@BryanH01 didn't you forget to commit the integration test file? (Or is it still a work in progress?)

BryanH01 · 2022-07-26T07:37:57Z

> @BryanH01 didn't you forget to commit the integration test file? (Or is it still a work in progress?)

No, I didn't commit it because I haven't updated it yet (because of the changes in parser).

I haven't changed the import method in tests yet. Some changes in parser.py

and get test.txt in right directory

Oops, I also forgot to change a function

Is this how it's done ?

Maybe I should get the expected data from the json file.

aadarsh-ram · 2022-07-27T18:02:34Z

parser/openfoodfacts_taxonomy_parser/parser.py

+    def create_node(self,data):
+        """ run the query to create the node with data dictionary """
+        position_query = """
+            SET n.previous_block = $previous_block


@BryanH01 @alexgarel Is this property called "previous_block" or "is_before"?

Oops, sorry, my bad, I forgot to change it.

After some thinking, isn't "is_before" ambiguous? It can mean "this node is before that one" or "that one is before this node". I think having different names for the relation and for the property isn't too confusing (it is already done with parents/is_child_of). So I would change the property name back to "previous_block" and keep "is_before" as the relation name. Is that a bad idea?

as you want @BryanH01

aadarsh-ram

@BryanH01 I have added some style and property comments. Do go through them and make the necessary changes after discussion.

aadarsh-ram · 2022-07-28T05:34:19Z

parser/openfoodfacts_taxonomy_parser/parser.py

+        data = {
+                "id" : '',
+                "main_language" : '',
+                "comment" : [],


@BryanH01 @alexgarel Isn't this property called "preceding_lines"? If so, please change it everywhere.

Suggested change

-                "comment" : [],
+                "preceding_lines" : [],

yes that's preceding_lines.

aadarsh-ram · 2022-07-28T05:38:01Z

parser/openfoodfacts_taxonomy_parser/parser.py

+        else : return words
+
+    def add_line(self,line):
+        """to get a normalized string but keeping the language code "lc:" and the "," (commas) separators , used to add an id to entries and to add a parent tag"""


I think this docstring is too long. @BryanH01 would you be able to shorten it, or add some line breaks?

@BryanH01 in docstrings, use line breaks :-)
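For illustration, the same docstring split over several lines could look like the sketch below. The function body is only a placeholder (an assumption), since the real normalization is not shown in this thread:

```python
def add_line(line):
    """Return a normalized version of an entry line.

    The language code prefix ("lc:") and the "," separators are kept.
    Used to add an id to entries and to add a parent tag.
    """
    # placeholder body (assumption): the real normalization is more involved
    return line.strip().lower()
```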

Also, you should run your code through black (and we should automate that with a make command), because you are far from the PEP 8 specs :-)

https://www.python.org/dev/peps/pep-0008/

aadarsh-ram · 2022-07-28T05:41:39Z

parser/openfoodfacts_taxonomy_parser/parser.py

+                "id" : '',
+                "main_language" : '',
+                "comment" : [],
+                "parent_tag" : [],


@BryanH01 I think this property is redundant with the "is_before" relationship. @BryanH01 is this property being added inside a Neo4j node? @alexgarel should this property be added inside the node?

Which one ? src_position ? If so, yes it's added in the node

@aadarsh-ram, to build the relationship, @BryanH01 must first set attributes, build relationships and then eventually remove them.

@aadarsh-ram and do not confuse the "is_child_of" and "is_before" relations.
For example, in the editor you should mainly care about "is_child_of", which is the relation between entries.

The "is_before" is really just the order in the file.
Nodes have both is_before and is_child_of relations.
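To make the distinction concrete, here is a toy sketch using plain Python data instead of Neo4j. The entries and the "parents" key are made up for the example and are not the actual graph schema: "is_before" follows the order of the blocks in the file, while "is_child_of" follows the declared parents.

```python
# entries in the order they appear in a hypothetical taxonomy file
nodes = [
    {"id": "en:meat", "parents": []},
    {"id": "en:fish", "parents": []},
    {"id": "en:beef", "parents": ["en:meat"]},
]

# is_before: each block points to the block that follows it in the file
is_before = [(a["id"], b["id"]) for a, b in zip(nodes, nodes[1:])]

# is_child_of: each entry points to its parent entries, wherever they are
is_child_of = [(n["id"], p) for n in nodes for p in n["parents"]]

print(is_before)    # [('en:meat', 'en:fish'), ('en:fish', 'en:beef')]
print(is_child_of)  # [('en:beef', 'en:meat')]
```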

@alexgarel I'm really sorry, I meant the "is_child_of" relationship in my comment. If the "parent_tag" property is being removed, there are no issues.

parser/openfoodfacts_taxonomy_parser/parser.py

parser/openfoodfacts_taxonomy_parser/test.txt

aadarsh-ram · 2022-07-28T05:46:53Z

parser/openfoodfacts_taxonomy_parser/parser.py

+                yield line_number,line
+        yield line_number,"" #to end the last entry if not ended
+
+    def normalizing(self,line,lang="default"):


@BryanH01 the "normalizing()" function could use some spaces and blank lines to make it more readable.

Changed what you told me to change Updates add_line() as its use has changed Used black

parser/openfoodfacts_taxonomy_parser/parser.py

Co-authored-by: Aadarsh A <[email protected]>

The parser didn't create a node for each stopwords and synonyms line because there's no blank line between stopwords and not always between synonyms. Also changed a little to accept 3-letter language code

Used black Removed something it was already doing

alexgarel

We are almost there, just add some more asserts in integration test, and small changes to simplify the code, if possible.

alexgarel · 2022-07-28T16:42:29Z

parser/openfoodfacts_taxonomy_parser/exception.py

+    def __init__(self,line):
+        exception_message = f"missing new line at line {line}"
+        superinit = super().__init__(exception_message)


Just for next time (no need to change it now).
It might be better to keep only line in exception.args (it's a tuple) and override the __str__ method.
This is because, when logging, tools like Sentry can group exceptions if we keep the moving parts out of the message :-)
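A minimal sketch of what this could look like. The class name MissingNewLineError is hypothetical (the name of the actual exception is not shown in this thread); only the message text comes from the diff above:

```python
class MissingNewLineError(Exception):
    """Raised when a block is not followed by a blank line (hypothetical name)."""

    def __init__(self, line):
        # keep only the variable part (the line number) in args
        super().__init__(line)
        self.line = line

    def __str__(self):
        # the human-readable message is built on demand
        return f"missing new line at line {self.line}"


error = MissingNewLineError(12)
print(error.args)  # (12,)
print(error)       # missing new line at line 12
```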

alexgarel · 2022-07-28T16:45:08Z

parser/openfoodfacts_taxonomy_parser/parser.py

+        # we don't want to eat the comments of the next block and it remove the last separating line
+        for i in range(len(header)):
+            if header.pop():
+                h -= 1
+            else:
+                break


Wow, that's quite smart. Be careful, it's on the edge of obscure code (but you commented it, so it's ok).

alexgarel · 2022-07-28T16:47:04Z

parser/openfoodfacts_taxonomy_parser/parser.py

+
+    def entry_end(self, line, data):
+        """Return True if the block ended"""
+        if "stopwords" in line or "synonyms" in line or not line:


Suggested change

-        if "stopwords" in line or "synonyms" in line or not line:
+        # stopwords and synonyms are one-liners, entries are separated by a blank line
+        if "stopwords" in line or "synonyms" in line or not line:

alexgarel · 2022-07-28T16:47:57Z

parser/openfoodfacts_taxonomy_parser/parser.py

+
+    def remove_separating_line(self, data):
+        if data["preceding_lines"]:
+            if "synonyms" in data["id"]:


To be more accurate:

Suggested change

-            if "synonyms" in data["id"]:
+            if data["id"].startswith("synonyms"):
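The difference matters as soon as an id merely contains the word. A short sketch, with made-up ids (the exact id format used by the parser is an assumption here):

```python
def is_synonyms_block(node_id):
    """True only when the id begins with 'synonyms'."""
    return node_id.startswith("synonyms")


entry_id = "en:synonyms-of-meat"  # hypothetical entry whose name contains the word
block_id = "synonyms:0"           # hypothetical id of a real synonyms block

print("synonyms" in entry_id)       # True: the substring test is fooled
print(is_synonyms_block(entry_id))  # False
print(is_synonyms_block(block_id))  # True
```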

alexgarel · 2022-07-28T16:48:25Z

parser/openfoodfacts_taxonomy_parser/parser.py

+            if "synonyms" in data["id"]:
+                if "stopwords" in self.is_before:
+                    data["preceding_lines"].pop(0)
+            elif "stopwords" in data["id"]:


Suggested change

-            elif "stopwords" in data["id"]:
+            elif data["id"].startswith("stopwords"):

alexgarel · 2022-07-28T17:06:15Z

parser/openfoodfacts_taxonomy_parser/test.txt

@@ -0,0 +1,37 @@
+# test taxonomy


why do we have this file here ?

I also had the same doubt since I wasn't able to run this module while inside the package. This "test.txt" file should be outside openfoodfacts_taxonomy_parser I think.

It's to test the parser myself when I'm writing it

ok, but maybe do not commit that ;-)

Do I also remove the end (if __name__ == ...)?

alexgarel · 2022-07-28T17:09:42Z

parser/tests/integration/test_parser_integration.py

+@pytest.fixture
+def new_session():
+    x = parser.Parser()
+    # delete all the nodes and relations in the database
+    query="MATCH (n) DETACH DELETE n"
+    x.session.run(query)
+    return x


It's strange to name it new_session when it returns a Parser instance!

I would have written:

Suggested change

-@pytest.fixture
-def new_session():
-    x = parser.Parser()
-    # delete all the nodes and relations in the database
-    query="MATCH (n) DETACH DELETE n"
-    x.session.run(query)
-    return x
+@pytest.fixture(autouse=True)
+def test_setup():
+    # delete all the nodes and relations in the database
+    query="MATCH (n) DETACH DELETE n"
+    parser.Parser().session.run(query)

With autouse, you don't need to add it to tests.

It's ok to create a second Parser in your test, this is not that high a cost and it's more explicit.

alexgarel · 2022-07-28T17:10:27Z

parser/tests/integration/test_parser_integration.py

+def test_calling(new_session):
+    x=new_session


If you change your fixture to autouse=True, it would be

Suggested change

-def test_calling(new_session):
-    x=new_session
+def test_calling():
+    x = parser.Parser()

but I would also rename x to test_parser or taxonomy_parser

and you might add a session variable, session = test_parser.session, to avoid long lines thereafter

alexgarel · 2022-07-28T17:13:59Z

parser/tests/integration/test_parser_integration.py

+    assert nodes[0][0] == 'en:meat'
+    assert nodes[0][1] == ['# meat','']


You should test at least three entry nodes and assert more than their id.

Also check one synonym and one stopwords.

alexgarel · 2022-07-28T17:14:29Z

parser/tests/integration/test_parser_integration.py

+
+
+
+    #Child link test


👍 links part is well tested.

aadarsh-ram

@BryanH01 I think your requirements.txt was not created correctly, as it gave me errors during installation. In the future, create a virtualenv, install all the required packages and run pip freeze > requirements.txt to create the file. Hope this helps!

aadarsh-ram · 2022-07-28T17:32:10Z

parser/openfoodfacts_taxonomy_parser/requirements.txt

+neo4j=='4.4.5'
+re=='2.2.1'
+Unidecode=='1.3.4'


Suggested change

-neo4j=='4.4.5'
-re=='2.2.1'
-Unidecode=='1.3.4'
+neo4j==4.4.5
+pytz==2022.1
+Unidecode==1.3.4

aadarsh-ram · 2022-07-28T17:33:48Z

parser/tests/requirements-test.txt

+neo4j=='4.4.5'
+pytest=='7.1.2'
+re=='2.2.1'
+Unidecode=='1.3.4'


Suggested change

-neo4j=='4.4.5'
-pytest=='7.1.2'
-re=='2.2.1'
-Unidecode=='1.3.4'
+attrs==22.1.0
+iniconfig==1.1.1
+neo4j==4.4.5
+packaging==21.3
+pluggy==1.0.0
+py==1.11.0
+pyparsing==3.0.9
+pytest==7.1.2
+pytz==2022.1
+tomli==2.0.1
+Unidecode==1.3.4

Oh thank you ! I didn't know how to do it

alexgarel

@BryanH01 cool, thank you.

Just one thing I don't agree with. Let's discuss it in slack if needed.

alexgarel · 2022-07-29T13:05:27Z

parser/openfoodfacts_taxonomy_parser/parser.py

+            if data["id"].startswith("synonyms"):
+                # it's a synonyms block,
+                # if the previous block is a stopwords block,
+                # there is at least one separating line
                if "stopwords" in self.is_before:
                    data["preceding_lines"].pop(0)
-            elif "stopwords" in data["id"]:
+
+            elif data["id"].startswith("stopwords"):
+                # it's a stopwords block,
+                # if the previous block is a synonyms block,
+                # there is at least one separating line
                if "synonyms" in self.is_before:
                    data["preceding_lines"].pop(0)
+
            else:
+                # it's an entry block, there is always a separating line
                data["preceding_lines"].pop(0)


I think your assumptions here are based too much on what you have seen so far, without any guarantee. There is no guarantee that there is a blank line when the block type changes, and so on. Remember that taxonomies are formatted by humans.
You need at least to check that data["preceding_lines"][0] is empty.

So I really think my proposal is the right one.
see: #18 (comment)
If in rare cases it adds or removes a blank line, this is ok; I mean we do not lose any important information.
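As an illustration of the minimal check mentioned above (not necessarily the linked proposal, which is not reproduced in this thread), a standalone sketch could drop the first preceding line only when it really is blank:

```python
def remove_separating_line(data):
    """Drop the first preceding line only if it is an empty separator."""
    preceding = data["preceding_lines"]
    # guard against both an empty list and a non-blank first line
    if preceding and preceding[0] == "":
        preceding.pop(0)
    return data


print(remove_separating_line({"preceding_lines": ["", "# meat"]}))
# {'preceding_lines': ['# meat']}
print(remove_separating_line({"preceding_lines": ["# meat"]}))
# {'preceding_lines': ['# meat']}
```

This version makes no assumption about the type of the previous block, so a comment line is never removed by mistake.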

alexgarel · 2022-07-29T13:07:32Z

parser/openfoodfacts_taxonomy_parser/parser.py

+        self.stopwords = (
+            dict()
+        )  # it will contain a list of stopwords with their language code as key


Why not simply put the comment before the line? That's the way we normally do it!

Also better use the {} to init a dict (not a big deal though)

Suggested change

-        self.stopwords = (
-            dict()
-        )  # it will contain a list of stopwords with their language code as key
+        # stopwords will contain a list of stopwords with their language code as key
+        self.stopwords = {}

parser/tests/integration/test_parser_integration.py

alexgarel

@BryanH01 let's merge 🎉

I let you do the "squash and merge" when you are ready.

aadarsh-ram · 2022-07-29T15:26:46Z

Kudos @BryanH01! :)

alexgarel and others added 7 commits July 5, 2022 18:07

build: add basic neo4j capability

5405f36

build: add neo4j with docker

3f748b0

Creating python script

567f5ac

Only do the data harvesting part Also, i'm not sure about the use of a class here

Update Parser.py

8aea655

small mistake

Update Parser.py

a631182

Correcting some errors

Update Parser.py

c41afca

Removing print lines

New function : parent

f9e4212

- Changed harvesting function - Added some comments - Added creating relations function : parent

BryanH01 changed the title ~~Parser~~ edit: Parser Jul 7, 2022

BryanH01 changed the title ~~edit: Parser~~ feat: Parser Jul 7, 2022

alexgarel reviewed Jul 8, 2022

parser/Parser.py Outdated

alexgarel reviewed Jul 8, 2022

parser/Parser.py Outdated

alexgarel reviewed Jul 8, 2022

parser/Parser.py Outdated

alexgarel reviewed Jul 8, 2022

parser/Parser.py Outdated

alexgarel reviewed Jul 8, 2022

parser/Parser.py Outdated

BryanH01 added 2 commits July 8, 2022 21:57

Update Parser.py

4a7c7b2

Changed with Alex's comments, i think

More pythonic code ? + test

f71d563

Changed script for it to be more pythonic I kept the start in file_iter because i use a line 2 times For the tests, i don't really know what to test.

alexgarel reviewed Jul 15, 2022

aadarsh-ram reviewed Jul 16, 2022

parser/Parser/test_Parser.py Outdated

parser/Parser/test_Parser.py Outdated

parser/Parser/Parser.py Outdated

parser/Parser/Parser.py Outdated

BryanH01 added 6 commits July 19, 2022 18:52

Changed parser filename and updated it

b2b2877

Update test_Parser.py

d57c678

Changed file name

0f9460e

Update taxonomy_parser.py

002a89f

Correcting a mistake

Update taxonomy_parser.py

fb01ebe

added a detail

Update taxonomy_parser.py

b1118d5

alexgarel reviewed Jul 21, 2022

alexgarel changed the title ~~feat: Parser~~ feat: taxonomy parser library Jul 22, 2022

aadarsh-ram linked an issue Jul 22, 2022 that may be closed by this pull request

Create a taxonomy parser library in python #22

Closed

BryanH01 mentioned this pull request Jul 22, 2022

docs: document the way to load samples #23

Merged

Changed directory, Updated parser with the comment and new spec

f95fdfe

For the import of parser in test file, i didn't succeed to import easily.

BryanH01 and others added 5 commits July 26, 2022 18:28

Changed name and added integration test

784ded7

I haven't changed the import method in tests yet. Some changes in parser.py

tests: fix tests to have correct import

35bff8e

and get test.txt in right directory

Small fix for header reading

3c49563

Oops, I also forgot to change a function

Add requirements.txt

e2b4a2b

Is this how it's done ?

Added main_language, made some corrections

e9e84ab

Maybe I should get the expected data from the json file.

BryanH01 marked this pull request as ready for review July 27, 2022 13:12

BryanH01 requested a review from a team as a code owner July 27, 2022 13:12

Merge branch 'main' into parser

fdafad8

aadarsh-ram reviewed Jul 27, 2022

BryanH01 closed this Jul 27, 2022

BryanH01 reopened this Jul 27, 2022

Changer name previous_block to is_before

8fafca2

aadarsh-ram reviewed Jul 28, 2022

Updated following your comments

74c6dee

Changed what you told me to change Updates add_line() as its use has changed Used black

aadarsh-ram reviewed Jul 28, 2022

parser/openfoodfacts_taxonomy_parser/parser.py Outdated

BryanH01 and others added 3 commits July 28, 2022 14:49

Update parser/openfoodfacts_taxonomy_parser/parser.py

93ad192

Co-authored-by: Aadarsh A <[email protected]>

Changed harvesting method to correctly harvest stopwords and synonyms

b11886f

The parser didn't create a node for each stopwords and synonyms line because there's no blank line between stopwords and not always between synonyms. Also changed a little to accept 3-letter language code

Update parser.py

03130fc

Used black Removed something it was already doing

alexgarel requested changes Jul 28, 2022

aadarsh-ram reviewed Jul 28, 2022

Changed with your suggestions

3bc6f46

alexgarel requested changes Jul 29, 2022

Final changes ?

a7044ba

alexgarel approved these changes Jul 29, 2022

BryanH01 merged commit 6b3461a into main Jul 29, 2022

BryanH01 deleted the parser branch July 29, 2022 15:25

openfoodfacts-bot mentioned this pull request Dec 1, 2022

chore(main): release 1.0.0 #136

Merged
