Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write::Treex renders the iset structure invalid #23

Open
dan-zeman opened this issue Oct 2, 2015 · 9 comments
Open

Write::Treex renders the iset structure invalid #23

dan-zeman opened this issue Oct 2, 2015 · 9 comments
Labels

Comments

@dan-zeman
Copy link
Member

This bug surfaces when Interset is used after the document is written. Normally it does not happen because Write::Treex tends to be the last block of the scenario. But try to write the document in two different formats, i.e. your last two blocks will be Write::Treex and Write::CoNLLU (in this order). And make sure that there is a node with a multi-valued interset feature, e.g. prontype=int|rel.

The Treex file is written correctly while CoNLL-U is not. Apparently when Write::CoNLLU takes its turn, the Lingua::Interset::FeatureStructure object is already corrupt. Somehow the "int|rel" string makes it down to the hash value, which would not happen if the set() method of the FeatureStructure was used.

Therefore I suspect that there is a bug in one of the methods of the Node::Interset role.

@dan-zeman dan-zeman added the bug label Oct 2, 2015
@dan-zeman
Copy link
Member Author

I tracked the problem down but I don't see an easy solution. The serialize_iset() method in TC::Node::Interset calls $self->set_attr("iset/prontype", ['int', 'rel']);. The set_attr() method, defined in TC::Node, first converts the list of values to Treex::PML::List. Later on, it accesses the iset feature structure as a plain hash (instead of as Lingua::Interset::FeatureStructure) and ruthlessly pushes the converted value there.

I understand that this hack helps the writer make sure that the list of values appears in the Treex file as <prontype>int|rel</prontype>.

But I don't like the idea that any code outside of the Lingua::Interset::FeatureStructure (or even outside Lingua::Interset) accesses the attributes of the FeatureStructure object directly.

Any thoughts?

@martinpopel
Copy link
Member

Do you have a minimal test case? Some treex oneliner?

@dan-zeman
Copy link
Member Author

Yes. First a correct example:

perl -e 'print "1\twho\twho\tPRON\t_\t_\t0\troot\t_\t_\n\n"' | treex -Len Read::CoNLLU Util::Eval anode='$.iset->set("prontype", "int|rel")' Write::CoNLLU

prints this:

# sent_id a_tree-en-s1-root
1       who     who     PRON    _       PronType=Int,Rel        0       root    _       _

Then add the bug (Write::Treex):

perl -e 'print "1\twho\twho\tPRON\t_\t_\t0\troot\t_\t_\n\n"' | treex -Len Read::CoNLLU Util::Eval anode='$.iset->set("prontype", "int|rel")' Write::Treex to=/dev/null Write::CoNLLU

prints this:

# sent_id a_tree-en-s1-root
1       who     who     PRON    _       PronType=Int|rel        0       root    _       _

@martinpopel
Copy link
Member

The serialize_iset() method in TC::Node::Interset calls $self->set_attr("iset/prontype", ['int', 'rel']);

No, it calls $self->set_attr("iset/prontype", 'int|rel');
because $node->get_iset($feature) returns always string ('int|rel').
This is needed because the PML schema says <member name="prontype"><cdata format="any"/></member>. If an array reference is saved into format=any, you would get <prontype>ARRAY(0xd5e998)</prontype> in the treex file.

One workaround would be to check if Write::Treex (there is an ugly way how to do it) is the last block and if not, then deserialize iset here.

I will think about a better solution.
This "blocks after Write::Treex" affects also wild attributes.

@dan-zeman
Copy link
Member Author

Can't it be deserialized always, after the file's been written?

@dan-zeman
Copy link
Member Author

NB: The problem is not urgent for me. Once I know that the bug is not directly in Interset, I can easily adjust the scenario and work around it.

But of course we want to solve it eventually.

@martinpopel
Copy link
Member

Can't it be deserialized always, after the file's been written?

That would slow down all Write::Treex. More writers in a scenario is a rare usecase (but fully legal, so I agree we should fix this issue).

@dan-zeman
Copy link
Member Author

I know it would slow it down (although I do not know how much) but does it really matter if it is the last block in the scenario?

@martinpopel
Copy link
Member

I know it would slow it down (although I do not know how much)

I did no benchmark, but you need to access all nodes in the document and try to reset all get_nonempty_features().

does it really matter if it is the last block

In most cases yes. If more documents are to be processed by the same job, you need to wait until the writer finishes. Even in single-document processing on one machine I always wait until treex finishes before I (or some script) try to access the output treex file.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants