How to save/export flair's datasets corpus in CoNLL format #988

Closed
maziyarpanahi opened this issue Aug 10, 2019 · 2 comments

@maziyarpanahi
Contributor

Hi,

I am trying to train NER models on WikiNER in multiple NLP libraries, starting with flair. I would like all of them to share an identical training corpus, and flair does a great job of creating train, dev, and test splits. In addition, flair uses the BIOES format, which I am interested in comparing with IOB or IOB2.

Is there any way to easily save/export loaded datasets on disk in CoNLL format?

For instance:

import flair.datasets
corpus = flair.datasets.WIKINER_ENGLISH()

I would like to save corpus.train, corpus.dev, and corpus.test in CoNLL format on disk and share the same dataset between multiple NLP libraries to compare the final performance.

Many thanks,
Maziyar

@maziyarpanahi maziyarpanahi added the question Further information is requested label Aug 10, 2019
@alanakbik
Collaborator

Hi @maziyarpanahi, we have no built-in method for this, but you can write a simple function that writes out the column format you need. That is, you can iterate through all sentences in the three splits, then iterate over all tokens of each sentence and write the attributes you want to a file.

Something like this:

# go through each sentence
for sentence in corpus.dev:

    # go through each token of sentence
    for token in sentence:
        # print what you need (text and NER value)
        print(f"{token.text}\t{token.get_tag('ner').value}")

    # print newline at end of each sentence
    print()
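
To get files on disk rather than printed output, the same loop can be wrapped in a small writer. The following is only a sketch under some assumptions: `flair_rows`, `write_conll`, and `bioes_to_iob2` are hypothetical helper names (not part of flair), the token access is the same `token.text` / `token.get_tag('ner').value` used in the loop above, and the columns are tab-separated as in that loop (some CoNLL readers expect a single space instead). The optional tag conversion covers the BIOES versus IOB2 comparison from the question: `S-X` becomes `B-X`, `E-X` becomes `I-X`, and everything else is left alone.

```python
from pathlib import Path


def bioes_to_iob2(tag):
    """Map one BIOES tag to IOB2: S-X becomes B-X, E-X becomes I-X."""
    if tag.startswith("S-"):
        return "B-" + tag[2:]
    if tag.startswith("E-"):
        return "I-" + tag[2:]
    # B-X, I-X and O are identical in both schemes
    return tag


def flair_rows(split):
    """Yield each sentence of a split as a list of (text, tag) pairs.

    Assumes tokens expose .text and .get_tag('ner').value,
    as in the loop shown above.
    """
    for sentence in split:
        yield [(token.text, token.get_tag("ner").value) for token in sentence]


def write_conll(sentences, path, convert=None):
    """Write sentences of (text, tag) pairs in column format.

    One token per line, a tab between the columns, and a blank line
    after each sentence. If convert is given, it is applied to every tag.
    """
    with Path(path).open("w", encoding="utf-8") as handle:
        for sentence in sentences:
            for text, tag in sentence:
                if convert is not None:
                    tag = convert(tag)
                handle.write(f"{text}\t{tag}\n")
            # blank line marks the end of the sentence
            handle.write("\n")
```

With the corpus from the question loaded, something like `write_conll(flair_rows(corpus.train), "train.txt")` would write the training split as-is, and passing `convert=bioes_to_iob2` would write IOB2 tags instead; the file names here are placeholders. Repeating the call for `corpus.dev` and `corpus.test` gives the three files to share between libraries.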

@maziyarpanahi
Contributor Author

This is just beautiful! Easy, clean, and it works great!

Thanks a lot mate 👍
