-
-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: taxonomy parser library #18
Conversation
Only do the data harvesting part Also, i'm not sure about the use of a class here
small mistake
Correcting some errors
Removing print lines
- Changed harvesting function - Added some comments - Added creating relations function : parent
Changed with Alex's comments, i think
Changed script for it to be more pythonic I kept the start in file_iter because i use a line 2 times For the tests, i don't really know what to test.
parser/Parser/Parser.py
Outdated
self.filename=filename | ||
|
||
def file_name(self): | ||
"""open the file filename and return the list of entries, each entry is a list a the sanitized lines""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update documentation ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You did good progress, kudos.
I commented a lot, hoping it makes you learn.
- Also please rename parser/Parser/Parser.py to parser/parser/parser.py (no uppercase in file or directory names)
parser/Parser/Parser.py
Outdated
|
||
def file_name(self): | ||
"""open the file filename and return the list of entries, each entry is a list a the sanitized lines""" | ||
self.filename = self.filename + ( '.txt' if (len(self.filename)<4 or self.filename[-4:]!=".txt") else '') #to not get error if extension is missing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not a good practice to modify it like this (not re-entrant).
I propose to you :
- rename this function as
normalized_filename
- in init.py you call:
self.filename = self.normalized_filename(filename)
parser/Parser/Parser.py
Outdated
counter=0 | ||
with open(self.filename,"r",encoding="utf8") as file: | ||
for line in file: | ||
if counter < start : counter+=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for this kind of check it's ok to use a "continue" statement and avoid the else:
if counter < start:
counter += 1
continue
even better you can use itertools.islice:
for line in itertools.islice(file, start=start):
parser/Parser/test_Parser.py
Outdated
language_code_prefix = re.compile('[a-zA-Z][a-zA-Z]:') | ||
for line in file: | ||
counter+=1 | ||
assert line == '' or line[0]=="#" or 'stopword' in line or 'synonym' in line or line[0]=="<" or language_code_prefix.match(line) or ":" in line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please reformat long lines :-) (120 character is an absolute maximum).
parser/Parser/test_Parser.py
Outdated
x=Parser("test") | ||
x.file_name() | ||
data=x.harvest() | ||
test_data=["# test taxonomy", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you can put it in a test.json file that you parse to get test_data (json.load
) ? It will avoid cluttering the test, and we immediatly see it goes along with test.txt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi! @alexgarel has already mentioned many comments regarding your code. I also have some comments and suggestions regarding your code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good progress, commenting as always ;-)
text=x.normalizing(text,"fr") | ||
assert text == "random-language-with-accents" | ||
|
||
def test_create_nodes(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should put integration test in a different file. (even better create tests/unit and tests/integration and put each kind of file in its directory).
It's always better to distinguish them.
You should add a session fixture to remove all entries at startup (using a Parser instance if you want, or creating a session)
For the import of parser in test file, i didn't succeed to import easily.
@BryanH01 didn't you forgot to commit the integration test file ? (Or is it still a work in progress ?) |
No i didn't commit it because i haven't updated it yet (because of the changes in parser) |
I haven't changed the import method in tests yet. Some changes in parser.py
and get test.txt in right directory
Oops, I also forgot to change a function
Is this how it's done ?
Maybe I should get the expected data from the json file.
def create_node(self,data): | ||
""" run the query to create the node with data dictionary """ | ||
position_query = """ | ||
SET n.previous_block = $previous_block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 @alexgarel Is this property called "previous_block" or "is_before"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops sorry my bad, i forgot to change it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some thinking, isn't "is_before" ambiguous ? It can mean "the node is before that:" or "what is before the node is that:". I think having a different name for the relation and the property used isn't too confusing (as it is done with parents/is_child_of). So I would change back the property name to "previous_block" and keep the relation name to "is_before". Is it a bad idea ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as you want @BryanH01
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 I have added some style and property comments. Do go through them and make the necessary changes after discussion.
data = { | ||
"id" : '', | ||
"main_language" : '', | ||
"comment" : [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 @alexgarel Isn't this property is called "preceding_lines"? If so, please do change it everywhere.
"comment" : [], | |
"preceding_lines" : [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that's preceding_lines
.
else : return words | ||
|
||
def add_line(self,line): | ||
"""to get a normalized string but keeping the language code "lc:" and the "," (commas) separators , used to add an id to entries and to add a parent tag""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this docstring's too big. @BryanH01 would you be able to shorten it, or leave some spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"id" : '', | ||
"main_language" : '', | ||
"comment" : [], | ||
"parent_tag" : [], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 I think this property is redundant with the "is_before" relationship. @BryanH01 is this property being added inside a Neo4j node? @alexgarel should this property be added inside the node?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which one ? src_position ? If so, yes it's added in the node
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aadarsh-ram, to build the relationship, @BryanH01 must first set attributes, build relationships and then eventually remove them.
@aadarsh-ram and do not confuse "is_child_of" and "is_before" relation.
For example in the editor you should mainly care about "is_child_of", which is relation between entries.
The "is_before" is really just the order in the file.
Nodes have both is_before and is_child_of relations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexgarel I'm really sorry, I meant the "is_child_of" relationship in my comment. If the "parent_tag" property is being removed, there are no issues.
yield line_number,line | ||
yield line_number,"" #to end the last entry if not ended | ||
|
||
def normalizing(self,line,lang="default"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 the "mornalizing()" function could be supplemented with some spaces and newlines to make it more readable.
Changed what you told me to change Updates add_line() as its use has changed Used black
Co-authored-by: Aadarsh A <[email protected]>
The parser didn't create a node for each stopwords and synonyms line because there's no blank line between stopwords and not always between synonyms. Also changed a little to accept 3-letter language code
Used black Removed something it was already doing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are almost there, just add some more asserts in integration test, and small changes to simplify the code, if possible.
def __init__(self,line): | ||
exception_message = f"missing new line at line {line}" | ||
superinit = super().__init__(exception_message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for a next time (no need to change it now).
It might be better to keep line in exception.args (it's a list) and superseed the __str__
function.
This is because when loging there is a way for tools like sentries to group exception, if we keep moving parts outside :-)
# we don't want to eat the comments of the next block and it remove the last separating line | ||
for i in range(len(header)): | ||
if header.pop(): | ||
h -= 1 | ||
else: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
waow that's quite smart. Be prudent you it's on the edge to obscure code (but you commented it, so it's ok).
|
||
def entry_end(self, line, data): | ||
"""Return True if the block ended""" | ||
if "stopwords" in line or "synonyms" in line or not line: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if "stopwords" in line or "synonyms" in line or not line: | |
# stopwords and synonyms are one-liner, entries are separated by a blank line | |
if "stopwords" in line or "synonyms" in line or not line: |
|
||
def remove_separating_line(self, data): | ||
if data["preceding_lines"]: | ||
if "synonyms" in data["id"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more accurate:
if "synonyms" in data["id"]: | |
if data["id"].startswith("synonyms"): |
if "synonyms" in data["id"]: | ||
if "stopwords" in self.is_before: | ||
data["preceding_lines"].pop(0) | ||
elif "stopwords" in data["id"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elif "stopwords" in data["id"]: | |
elif data["id"].startswith("stopwords"): |
@@ -0,0 +1,37 @@ | |||
# test taxonomy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we have this file here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also had the same doubt since I wasn't able to run this module while inside the package. This "test.txt" file should be outside openfoodfacts_taxonomy_parser
I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's to test the parser myself when I'm writing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, but maybe do not commit that ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do I alse remove the end (if name== ...) ?
@pytest.fixture | ||
def new_session(): | ||
x = parser.Parser() | ||
# delete all the nodes and relations in the database | ||
query="MATCH (n) DETACH DELETE n" | ||
x.session.run(query) | ||
return x |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's strange to name it new_session and that it returns a Parser instance !
I would have written:
@pytest.fixture | |
def new_session(): | |
x = parser.Parser() | |
# delete all the nodes and relations in the database | |
query="MATCH (n) DETACH DELETE n" | |
x.session.run(query) | |
return x | |
@pytest.fixture(autouse=True) | |
def test_setup(): | |
# delete all the nodes and relations in the database | |
query="MATCH (n) DETACH DELETE n" | |
parser.Parser().session.run(query) |
With autouse, you don't need to add it to tests.
It's ok to create a second Parser in your test, this is not that high a cost and it's more explicit.
def test_calling(new_session): | ||
x=new_session |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you change your feature to auto_use=True
, it would be
def test_calling(new_session): | |
x=new_session | |
def test_calling(): | |
x = parser.Parser() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but I would also change x to test_parser or taxonomy_parser
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and you might have a session variable session = test_parser.session
to avoid long line thereafter
assert nodes[0][0] == 'en:meat' | ||
assert nodes[0][1] == ['# meat',''] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should test at least three entry nodes and assert more than their id.
Also check one synonym and one stopwords.
|
||
|
||
|
||
#Child link test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 links part is well tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 I think your requirements.txt
has been created wrongly, as it gave me errors while installation. In the future, do create a virtualenv, install all related packages and use the command pip freeze > requirements.txt
to create the txt file. Hope this was helpful!
neo4j=='4.4.5' | ||
re=='2.2.1' | ||
Unidecode=='1.3.4' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neo4j=='4.4.5' | |
re=='2.2.1' | |
Unidecode=='1.3.4' | |
neo4j==4.4.5 | |
pytz==2022.1 | |
Unidecode==1.3.4 |
parser/tests/requirements-test.txt
Outdated
neo4j=='4.4.5' | ||
pytest=='7.1.2' | ||
re=='2.2.1' | ||
Unidecode=='1.3.4' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neo4j=='4.4.5' | |
pytest=='7.1.2' | |
re=='2.2.1' | |
Unidecode=='1.3.4' | |
attrs==22.1.0 | |
iniconfig==1.1.1 | |
neo4j==4.4.5 | |
packaging==21.3 | |
pluggy==1.0.0 | |
py==1.11.0 | |
pyparsing==3.0.9 | |
pytest==7.1.2 | |
pytz==2022.1 | |
tomli==2.0.1 | |
Unidecode==1.3.4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh thank you ! I didn't know how to do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 cool, thank you.
Just one thing I don't agree with. Let's discuss it in slack if needed.
if data["id"].startswith("synonyms"): | ||
# it's a synonyms block, | ||
# if the previous block is a stopwords block, | ||
# there is at least one separating line | ||
if "stopwords" in self.is_before: | ||
data["preceding_lines"].pop(0) | ||
elif "stopwords" in data["id"]: | ||
|
||
elif data["id"].startswith("stopwords"): | ||
# it's a stopwords block, | ||
# if the previous block is a synonyms block, | ||
# there is at least one separating line | ||
if "synonyms" in self.is_before: | ||
data["preceding_lines"].pop(0) | ||
|
||
else: | ||
# it's an entry block, there is always a separating line | ||
data["preceding_lines"].pop(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your assumption here are too much based upon what you have seen so far, without any guarantee. There is no guarantee that we have new lines if we change type and so on. Remember that taxonomies are formatted by humans.
You need at least to check that data["preceding_lines"][0]
is empty.
So I really think my proposal is the good one.
see: #18 (comment)
If in rare cases it adds or remove a blank line, this is ok, I mean we do not lose any important information.
self.stopwords = ( | ||
dict() | ||
) # it will contain a list of stopwords with their language code as key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not simply put the comment before the line ? That's the way we do normally !
Also better use the {}
to init a dict (not a big deal though)
self.stopwords = ( | |
dict() | |
) # it will contain a list of stopwords with their language code as key | |
# stopwords will contain a list of stopwords with their language code as key | |
self.stopwords = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BryanH01 let's merge 🎉
I let you do the "squash and merge" when you are ready.
Kudos @BryanH01! :) |
What
I'm not sure about the use of a class here.
Related issue(s)
Fixes #22