Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: taxonomy parser library #18

Merged
merged 29 commits into from
Jul 29, 2022
Merged

feat: taxonomy parser library #18

merged 29 commits into from
Jul 29, 2022

Conversation

BryanH01
Copy link
Collaborator

@BryanH01 BryanH01 commented Jul 6, 2022

What

  • Created parser python script in parser folder
  • It's currently only harvesting data and creating the nodes

I'm not sure about the use of a class here.

Related issue(s)

Fixes #22

alexgarel and others added 7 commits July 5, 2022 18:07
Only do the data harvesting part
Also, i'm not sure about the use of a class here
small mistake
Correcting some errors
Removing print lines
- Changed harvesting function
- Added some comments
- Added creating relations function : parent
@BryanH01 BryanH01 changed the title Parser edit: Parser Jul 7, 2022
@BryanH01 BryanH01 changed the title edit: Parser feat: Parser Jul 7, 2022
parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser.py Outdated Show resolved Hide resolved
Changed with Alex's comments, i think
Changed script for it to be more pythonic
I kept the start in file_iter because i use a line 2 times
For the tests, i don't really know what to test.
self.filename=filename

def file_name(self):
"""open the file filename and return the list of entries, each entry is a list a the sanitized lines"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update documentation ;-)

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did good progress, kudos.

I commented a lot, hoping it makes you learn.

  • Also please rename parser/Parser/Parser.py to parser/parser/parser.py (no uppercase in file or directory names)


def file_name(self):
"""open the file filename and return the list of entries, each entry is a list a the sanitized lines"""
self.filename = self.filename + ( '.txt' if (len(self.filename)<4 or self.filename[-4:]!=".txt") else '') #to not get error if extension is missing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a good practice to modify it like this (not re-entrant).

I propose to you :

  • rename this function as normalized_filename
  • in init.py you call: self.filename = self.normalized_filename(filename)

counter=0
with open(self.filename,"r",encoding="utf8") as file:
for line in file:
if counter < start : counter+=1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this kind of check it's ok to use a "continue" statement and avoid the else:

if counter < start:
    counter += 1
    continue

even better you can use itertools.islice:

for line in itertools.islice(file, start=start):

parser/Parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser/Parser.py Outdated Show resolved Hide resolved
language_code_prefix = re.compile('[a-zA-Z][a-zA-Z]:')
for line in file:
counter+=1
assert line == '' or line[0]=="#" or 'stopword' in line or 'synonym' in line or line[0]=="<" or language_code_prefix.match(line) or ":" in line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please reformat long lines :-) (120 character is an absolute maximum).

parser/Parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser/test_Parser.py Outdated Show resolved Hide resolved
parser/Parser/test_Parser.py Outdated Show resolved Hide resolved
x=Parser("test")
x.file_name()
data=x.harvest()
test_data=["# test taxonomy",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you can put it in a test.json file that you parse to get test_data (json.load) ? It will avoid cluttering the test, and we immediatly see it goes along with test.txt.

Copy link
Collaborator

@aadarsh-ram aadarsh-ram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi! @alexgarel has already mentioned many comments regarding your code. I also have some comments and suggestions regarding your code.

parser/Parser/test_Parser.py Outdated Show resolved Hide resolved
parser/Parser/test_Parser.py Outdated Show resolved Hide resolved
parser/Parser/Parser.py Outdated Show resolved Hide resolved
parser/Parser/Parser.py Outdated Show resolved Hide resolved
Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good progress, commenting as always ;-)

parser/parser/test_taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/test_taxonomy_parser.py Outdated Show resolved Hide resolved
text=x.normalizing(text,"fr")
assert text == "random-language-with-accents"

def test_create_nodes():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should put integration test in a different file. (even better create tests/unit and tests/integration and put each kind of file in its directory).

It's always better to distinguish them.

You should add a session fixture to remove all entries at startup (using a Parser instance if you want, or creating a session)

see https://docs.pytest.org/en/7.1.x/how-to/fixtures.html

parser/parser/test_taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/test_taxonomy_parser.py Outdated Show resolved Hide resolved
parser/parser/test_taxonomy_parser.py Outdated Show resolved Hide resolved
@alexgarel alexgarel changed the title feat: Parser feat: taxonomy parser library Jul 22, 2022
@aadarsh-ram aadarsh-ram linked an issue Jul 22, 2022 that may be closed by this pull request
For the import of parser in test file, i didn't succeed to import easily.
@alexgarel
Copy link
Member

@BryanH01 didn't you forgot to commit the integration test file ? (Or is it still a work in progress ?)

@BryanH01
Copy link
Collaborator Author

@BryanH01 didn't you forgot to commit the integration test file ? (Or is it still a work in progress ?)

No i didn't commit it because i haven't updated it yet (because of the changes in parser)

BryanH01 and others added 5 commits July 26, 2022 18:28
I haven't changed the import method in tests yet.
Some changes in parser.py
and get test.txt in right directory
Oops, I also forgot to change a function
Is this how it's done ?
Maybe I should get the expected data from the json file.
@BryanH01 BryanH01 marked this pull request as ready for review July 27, 2022 13:12
@BryanH01 BryanH01 requested a review from a team as a code owner July 27, 2022 13:12
def create_node(self,data):
""" run the query to create the node with data dictionary """
position_query = """
SET n.previous_block = $previous_block
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 @alexgarel Is this property called "previous_block" or "is_before"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops sorry my bad, i forgot to change it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some thinking, isn't "is_before" ambiguous ? It can mean "the node is before that:" or "what is before the node is that:". I think having a different name for the relation and the property used isn't too confusing (as it is done with parents/is_child_of). So I would change back the property name to "previous_block" and keep the relation name to "is_before". Is it a bad idea ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as you want @BryanH01

@BryanH01 BryanH01 closed this Jul 27, 2022
@BryanH01 BryanH01 reopened this Jul 27, 2022
Copy link
Collaborator

@aadarsh-ram aadarsh-ram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 I have added some style and property comments. Do go through them and make the necessary changes after discussion.

data = {
"id" : '',
"main_language" : '',
"comment" : [],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 @alexgarel Isn't this property is called "preceding_lines"? If so, please do change it everywhere.

Suggested change
"comment" : [],
"preceding_lines" : [],

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes that's preceding_lines.

else : return words

def add_line(self,line):
"""to get a normalized string but keeping the language code "lc:" and the "," (commas) separators , used to add an id to entries and to add a parent tag"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this docstring's too big. @BryanH01 would you be able to shorten it, or leave some spaces?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 in docstring use line return :-)

Also you should pass your code through black (and we should automate that with a make command). Because you are far from the pep 8 specs :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"id" : '',
"main_language" : '',
"comment" : [],
"parent_tag" : [],
Copy link
Collaborator

@aadarsh-ram aadarsh-ram Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 I think this property is redundant with the "is_before" relationship. @BryanH01 is this property being added inside a Neo4j node? @alexgarel should this property be added inside the node?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which one ? src_position ? If so, yes it's added in the node

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aadarsh-ram, to build the relationship, @BryanH01 must first set attributes, build relationships and then eventually remove them.

@aadarsh-ram and do not confuse "is_child_of" and "is_before" relation.
For example in the editor you should mainly care about "is_child_of", which is relation between entries.

The "is_before" is really just the order in the file.
Nodes have both is_before and is_child_of relations.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexgarel I'm really sorry, I meant the "is_child_of" relationship in my comment. If the "parent_tag" property is being removed, there are no issues.

parser/openfoodfacts_taxonomy_parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/parser.py Outdated Show resolved Hide resolved
parser/openfoodfacts_taxonomy_parser/test.txt Outdated Show resolved Hide resolved
yield line_number,line
yield line_number,"" #to end the last entry if not ended

def normalizing(self,line,lang="default"):
Copy link
Collaborator

@aadarsh-ram aadarsh-ram Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 the "mornalizing()" function could be supplemented with some spaces and newlines to make it more readable.

Changed what you told me to change
Updates add_line() as its use has changed
Used black
BryanH01 and others added 3 commits July 28, 2022 14:49
The parser didn't create a node for each stopwords and synonyms line because there's no blank line between stopwords and not always between synonyms.
Also changed a little to accept 3-letter language code
Used black
Removed something it was already doing
Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are almost there, just add some more asserts in integration test, and small changes to simplify the code, if possible.

Comment on lines +4 to +6
def __init__(self,line):
exception_message = f"missing new line at line {line}"
superinit = super().__init__(exception_message)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for a next time (no need to change it now).
It might be better to keep line in exception.args (it's a list) and superseed the __str__ function.
This is because when loging there is a way for tools like sentries to group exception, if we keep moving parts outside :-)

Comment on lines 167 to 172
# we don't want to eat the comments of the next block and it remove the last separating line
for i in range(len(header)):
if header.pop():
h -= 1
else:
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

waow that's quite smart. Be prudent you it's on the edge to obscure code (but you commented it, so it's ok).


def entry_end(self, line, data):
"""Return True if the block ended"""
if "stopwords" in line or "synonyms" in line or not line:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if "stopwords" in line or "synonyms" in line or not line:
# stopwords and synonyms are one-liner, entries are separated by a blank line
if "stopwords" in line or "synonyms" in line or not line:


def remove_separating_line(self, data):
if data["preceding_lines"]:
if "synonyms" in data["id"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more accurate:

Suggested change
if "synonyms" in data["id"]:
if data["id"].startswith("synonyms"):

if "synonyms" in data["id"]:
if "stopwords" in self.is_before:
data["preceding_lines"].pop(0)
elif "stopwords" in data["id"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
elif "stopwords" in data["id"]:
elif data["id"].startswith("stopwords"):

@@ -0,0 +1,37 @@
# test taxonomy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this file here ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also had the same doubt since I wasn't able to run this module while inside the package. This "test.txt" file should be outside openfoodfacts_taxonomy_parser I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's to test the parser myself when I'm writing it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but maybe do not commit that ;-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I alse remove the end (if name== ...) ?

Comment on lines 9 to 15
@pytest.fixture
def new_session():
x = parser.Parser()
# delete all the nodes and relations in the database
query="MATCH (n) DETACH DELETE n"
x.session.run(query)
return x
Copy link
Member

@alexgarel alexgarel Jul 28, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's strange to name it new_session and that it returns a Parser instance !

I would have written:

Suggested change
@pytest.fixture
def new_session():
x = parser.Parser()
# delete all the nodes and relations in the database
query="MATCH (n) DETACH DELETE n"
x.session.run(query)
return x
@pytest.fixture(autouse=True)
def test_setup():
# delete all the nodes and relations in the database
query="MATCH (n) DETACH DELETE n"
parser.Parser().session.run(query)

With autouse, you don't need to add it to tests.

It's ok to create a second Parser in your test, this is not that high a cost and it's more explicit.

Comment on lines 18 to 19
def test_calling(new_session):
x=new_session
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you change your feature to auto_use=True, it would be

Suggested change
def test_calling(new_session):
x=new_session
def test_calling():
x = parser.Parser()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I would also change x to test_parser or taxonomy_parser

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and you might have a session variable session = test_parser.session to avoid long line thereafter

Comment on lines 42 to 43
assert nodes[0][0] == 'en:meat'
assert nodes[0][1] == ['# meat','']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should test at least three entry nodes and assert more than their id.

Also check one synonym and one stopwords.




#Child link test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 links part is well tested.

Copy link
Collaborator

@aadarsh-ram aadarsh-ram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 I think your requirements.txt has been created wrongly, as it gave me errors while installation. In the future, do create a virtualenv, install all related packages and use the command pip freeze > requirements.txt to create the txt file. Hope this was helpful!

Comment on lines 1 to 3
neo4j=='4.4.5'
re=='2.2.1'
Unidecode=='1.3.4'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
neo4j=='4.4.5'
re=='2.2.1'
Unidecode=='1.3.4'
neo4j==4.4.5
pytz==2022.1
Unidecode==1.3.4

Comment on lines 1 to 4
neo4j=='4.4.5'
pytest=='7.1.2'
re=='2.2.1'
Unidecode=='1.3.4'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
neo4j=='4.4.5'
pytest=='7.1.2'
re=='2.2.1'
Unidecode=='1.3.4'
attrs==22.1.0
iniconfig==1.1.1
neo4j==4.4.5
packaging==21.3
pluggy==1.0.0
py==1.11.0
pyparsing==3.0.9
pytest==7.1.2
pytz==2022.1
tomli==2.0.1
Unidecode==1.3.4

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thank you ! I didn't know how to do it

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 cool, thank you.

Just one thing I don't agree with. Let's discuss it in slack if needed.

Comment on lines 194 to 210
if data["id"].startswith("synonyms"):
# it's a synonyms block,
# if the previous block is a stopwords block,
# there is at least one separating line
if "stopwords" in self.is_before:
data["preceding_lines"].pop(0)
elif "stopwords" in data["id"]:

elif data["id"].startswith("stopwords"):
# it's a stopwords block,
# if the previous block is a synonyms block,
# there is at least one separating line
if "synonyms" in self.is_before:
data["preceding_lines"].pop(0)

else:
# it's an entry block, there is always a separating line
data["preceding_lines"].pop(0)
Copy link
Member

@alexgarel alexgarel Jul 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your assumption here are too much based upon what you have seen so far, without any guarantee. There is no guarantee that we have new lines if we change type and so on. Remember that taxonomies are formatted by humans.
You need at least to check that data["preceding_lines"][0] is empty.

So I really think my proposal is the good one.
see: #18 (comment)
If in rare cases it adds or remove a blank line, this is ok, I mean we do not lose any important information.

Comment on lines 218 to 220
self.stopwords = (
dict()
) # it will contain a list of stopwords with their language code as key
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not simply put the comment before the line ? That's the way we do normally !

Also better use the {} to init a dict (not a big deal though)

Suggested change
self.stopwords = (
dict()
) # it will contain a list of stopwords with their language code as key
# stopwords will contain a list of stopwords with their language code as key
self.stopwords = {}

Copy link
Member

@alexgarel alexgarel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@BryanH01 let's merge 🎉

I let you do the "squash and merge" when you are ready.

@BryanH01 BryanH01 merged commit 6b3461a into main Jul 29, 2022
@BryanH01 BryanH01 deleted the parser branch July 29, 2022 15:25
@aadarsh-ram
Copy link
Collaborator

Kudos @BryanH01! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

Create a taxonomy parser library in python
3 participants