Suggestion for feature improvement: Return dictionary with parsed references #23

Open
walter-hernandez opened this issue Aug 7, 2020 · 2 comments

Comments

@walter-hernandez

Hello all:

One possible improvement to NeuralParscit would be to return a dictionary that groups together the tokens sharing the same label. For example:

After parsing the reference:
"Calzolari, N. (1982) Towards the organization of lexical definitions on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, Charles University, Prague, pp.61-64."

the result would be:
'author author date title title title title title title title title title title editor editor editor editor booktitle booktitle booktitle editor institution location pages'

Instead, it could return a dictionary like this:
{'author': ['Calzolari,', 'N.'], 'date': ['(1982)'], 'title': ['Towards', 'the', 'organization', 'of', 'lexical', 'definitions', 'on', 'a', 'database', 'structure.'], 'editor': ['In', 'E.', 'Hajicova', '(Ed.),', 'Charles'], 'booktitle': ['COLING', "'82", 'Abstracts,'], 'institution': ['University,'], 'location': ['Prague,'], 'pages': ['pp.61-64.']}

The token lists in that dictionary could then be detokenized to get something like:
{'author': 'Calzolari, N.',
'date': '(1982)',
'title': 'Towards the organization of lexical definitions on a database structure.',
'editor': 'In E. Hajicova (Ed.), Charles',
'booktitle': "COLING '82 Abstracts,",
'institution': 'University,',
'location': 'Prague,',
'pages': 'pp.61-64.'}

The code to get something like the above would be:
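```python
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")

# `neural_parscit` is an existing NeuralParscit instance; `reference` is the citation string.
# Assumption: whitespace tokens line up one-to-one with the predicted labels.
reference_tokenized = reference.split()

result_parsing = neural_parscit.predict_for_text(text=reference, show=False)
result_parsing = result_parsing.split(" ")

result_dict = {}

for token, token_label in zip(reference_tokenized, result_parsing):
    if token_label not in result_dict:
        result_dict[token_label] = []
    result_dict[token_label].append(token)

# detokenize everything
result_dict = {k: md.detokenize(v) for k, v in result_dict.items()}
```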

The detokenizer used is MosesDetokenizer, from the sacremoses library.
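
For anyone who wants to try the grouping step without loading the model, here is a minimal, self-contained sketch. It hard-codes the label string from the example above, uses plain whitespace splitting and `" ".join` in place of MosesDetokenizer, and the helper name `group_tokens_by_label` is made up for illustration. It also checks that tokens and labels have the same length, since `zip` would otherwise silently drop the extras.

```python
from collections import defaultdict


def group_tokens_by_label(tokens, labels):
    """Group tokens under their predicted label, keeping first-seen label order."""
    if len(tokens) != len(labels):
        raise ValueError(f"{len(tokens)} tokens but {len(labels)} labels")
    grouped = defaultdict(list)
    for token, label in zip(tokens, labels):
        grouped[label].append(token)
    return dict(grouped)


reference = (
    "Calzolari, N. (1982) Towards the organization of lexical definitions "
    "on a database structure. In E. Hajicova (Ed.), COLING '82 Abstracts, "
    "Charles University, Prague, pp.61-64."
)
# Label string copied from the example output above, not freshly predicted.
labels = (
    "author author date title title title title title title title title "
    "title title editor editor editor editor booktitle booktitle booktitle "
    "editor institution location pages"
)

grouped = group_tokens_by_label(reference.split(), labels.split())
# Simple stand-in for detokenization: join each token list with spaces.
joined = {label: " ".join(tokens) for label, tokens in grouped.items()}
```

Note that grouping purely by label merges tokens that are not adjacent in the reference, which is why 'Charles' ends up under 'editor' in this example.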

@walter-hernandez changed the title from "Feature improvement: Return dictionary with parse references" to "Suggestion for feature improvement: Return dictionary with parsed references" on Aug 7, 2020
@abhinavkashyap
Owner

@walter-hernandez Thanks for the request. Are you suggesting the dictionary output in addition to the string being returned, or as the only way to obtain output from predict_for_text?

@walter-hernandez
Author

The dictionary output could be an additional option for obtaining output from predict_for_text, alongside the string.
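
One way the additional output could look, sketched under assumptions: a small wrapper that calls the existing string-returning prediction and builds the dictionary from it, leaving predict_for_text itself unchanged. The wrapper name `predict_for_text_as_dict`, the `predict_fn` parameter, and the stand-in predictor below are all hypothetical and not part of the library; the sketch also assumes that the labels align with whitespace-split tokens.

```python
from typing import Callable, Dict, List


def predict_for_text_as_dict(
    predict_fn: Callable[[str], str], text: str
) -> Dict[str, List[str]]:
    """Hypothetical helper: turn a space-separated label string into {label: tokens}."""
    tokens = text.split()
    labels = predict_fn(text).split()
    if len(tokens) != len(labels):
        raise ValueError("token/label count mismatch; tokenization assumption does not hold")
    result: Dict[str, List[str]] = {}
    for token, label in zip(tokens, labels):
        result.setdefault(label, []).append(token)
    return result


# Stand-in for the real model so the sketch runs on its own (not real model output).
def fake_predict(text: str) -> str:
    return "author author date"


parsed = predict_for_text_as_dict(fake_predict, "Calzolari, N. (1982)")
```

With the real model, `predict_fn` could be something like `lambda t: neural_parscit.predict_for_text(text=t, show=False)`, reusing the call from the snippet in the original post.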
