A suggested improvement for NeuralParscit is to return a dictionary that groups the tokens sharing the same label. For example:
After parsing the reference:
"Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64."
the current output is:
'author author date title title title title title title title title title title editor editor editor editor booktitle booktitle booktitle editor institution location pages'
However, it could instead return a dictionary like this:
{'author': ['Calzolari,', 'N.'], 'date': ['(1982)'], 'title': ['Towards', 'the', 'organization', 'of', 'lexical', 'definitions', 'on', 'a', 'database', 'structure.'], 'editor': ['In', 'E.', 'Hajicova', '(Ed.),', 'Charles'], 'booktitle': ['COLING', "'82", 'Abstracts,'], 'institution': ['University,'], 'location': ['Prague,'], 'pages': ['pp.61-64.']}
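For illustration, here is a self-contained sketch of how the label string lines up one-to-one with the whitespace-delimited tokens, using the example reference and label output above (collections.defaultdict is used for brevity; the variable names are only illustrative, not a proposed API):

```python
from collections import defaultdict

reference = ("Calzolari, N. (1982) Towards the organization of lexical definitions on a "
             "database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles "
             "University, Prague, pp.61-64.")
# label string as produced for the reference above (one label per whitespace token)
labels = ("author author date title title title title title title title title title title "
          "editor editor editor editor booktitle booktitle booktitle editor institution "
          "location pages")

# group each token under its predicted label
grouped = defaultdict(list)
for token, label in zip(reference.split(" "), labels.split(" ")):
    grouped[label].append(token)

# dict(grouped) == {'author': ['Calzolari,', 'N.'], 'date': ['(1982)'], ...}
```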
The dictionary above could later be used to detokenize the lists and obtain something like:
{'author': 'Calzolari, N.',
'date': '(1982)',
'title': 'Towards the organization of lexical definitions on a database structure.',
'editor': 'In E. Hajicova (Ed.), Charles',
'booktitle': "COLING '82 Abstracts,",
'institution': 'University,',
'location': 'Prague,',
'pages': 'pp.61-64.'}
The code to get something like the above would be:
```python
from sacremoses import MosesDetokenizer

# `neural_parscit` is an instance of the NeuralParscit model and `reference` is the citation string
md = MosesDetokenizer(lang="en")
reference_tokenized = reference.split(" ")  # one label is predicted per whitespace token

result_parsing = neural_parscit.predict_for_text(text=reference, show=False)
result_parsing = result_parsing.split(" ")

# group the tokens by their predicted label
result_dict = {}
for token, token_label in zip(reference_tokenized, result_parsing):
    if token_label not in result_dict:
        result_dict[token_label] = []
    result_dict[token_label].append(token)

# detokenize every list of tokens into a single string
result_dict = {k: md.detokenize(v) for k, v in result_dict.items()}
```
The detokenizer used is MosesDetokenizer, from the sacremoses library.
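For reference, a minimal usage sketch of the detokenization step on its own (tokens taken from the example above):

```python
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")
# rejoins a list of tokens into a natural string, e.g. 'Calzolari, N.'
print(md.detokenize(["Calzolari,", "N."]))
```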
walter-hernandez changed the title from "Feature improvement: Return dictionary with parse references" to "Suggestion for feature improvement: Return dictionary with parsed references" on Aug 7, 2020.
@walter-hernandez Thanks for the request. Are you suggesting the dictionary output in addition to the returned string, or as the only way to obtain output from `predict_for_text`?