πŸ’« Multi-task CNN for parser, tagger and NER #1057

Closed
honnibal opened this issue May 13, 2017 · 9 comments
Labels
enhancement (Feature requests and improvements) · 🌙 nightly (Discussion and contributions related to nightly builds) · ⚠️ wip (Work in progress)

Comments

@honnibal
Member

honnibal commented May 13, 2017

The implementation of the neural network model for spaCy's parser, tagger and NER is now complete. 🎉 There are still a lot of hyper-parameters to tune, efficiency improvements to make, and hacks to unhack, but the main work is done.

The code is on the v2 branch. Currently, it requires Chainer, which may cause installation to fail on machines without a GPU. This will obviously be fixed prior to release.

Preliminary results

Current parser performance on the AnCora Spanish corpus:

| spaCy v1 | spaCy v2 | ParseySaurus (SyntaxNet) |
| --- | --- | --- |
| 87.5 | 90.96 | 91.02 |

Parse speeds are down on CPU compared to the 1.x branch: the neural network model is currently about 4x slower on CPU. With a modest GPU, the v2 model is about as fast as v1 running with about 3 CPU threads. I think we can claw back most of this lost performance and get to around half the linear model's speed on CPU. The plan is to continue focusing on CPU runtime for now: I think this will continue to be the cheapest and most convenient way for people to run spaCy. Of course, GPU training is nice 😊

The parsing model is a blend of recent results. The two main recent inspirations have been the work of Eliyahu Kiperwasser and Yoav Goldberg at Bar-Ilan [1], and the SyntaxNet team at Google. The foundation of the parser is still based on the work of Joakim Nivre [5], who introduced the transition-based framework [7], the arc-eager transition system, and the imitation learning objective. There's a short bibliography at the end of the issue.

Outline of the model

The model is implemented using Thinc, our machine learning library. (The parsing model uses Thinc v6.6.0, which was just released.) We first predict context-sensitive vectors for each word in the input:

```
(embed_lower | embed_prefix | embed_suffix | embed_shape)
>> Maxout(token_width)
>> convolution ** 4
```

This convolutional layer is shared between the tagger, parser and NER, and will also be shared by the future neural lemmatizer. Because the parser shares these layers with the tagger, the parser does not require tag features. I got this trick from David Weiss's "Stack-propagation" paper [2].
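As a rough illustration of that shared layer (not the actual Thinc implementation), here's a minimal NumPy sketch: the concatenated feature embeddings go through a Maxout mixing layer into token_width dimensions, followed by four convolution blocks that mix each token with its immediate neighbours. The shapes, the number of maxout pieces, the window size and the residual connections are all assumptions made for the sketch.

```python
import numpy as np

token_width = 128   # T in the text
n_pieces = 3        # number of maxout pieces (assumed)

def maxout(X, W, b):
    # W: (n_out, n_pieces, n_in), b: (n_out, n_pieces).
    # Project the input into n_pieces candidate activations and keep the max.
    pieces = np.einsum("ni,opi->nop", X, W) + b   # (n_tokens, n_out, n_pieces)
    return pieces.max(axis=-1)

def conv_block(X, W, b):
    # One "convolution" step: concatenate each token with its left and right
    # neighbour, mix with a maxout layer, and add a residual connection.
    pad = np.zeros((1, X.shape[1]))
    padded = np.vstack([pad, X, pad])
    window = np.hstack([padded[:-2], padded[1:-1], padded[2:]])   # (n_tokens, 3*T)
    return X + maxout(window, W, b)

def tok2vec(embedded, W_mix, b_mix, conv_params):
    # embedded: the concatenated lower/prefix/suffix/shape embeddings per token.
    X = maxout(embedded, W_mix, b_mix)    # Maxout(token_width)
    for W, b in conv_params:              # convolution ** 4
        X = conv_block(X, W, b)
    return X                              # context-sensitive token vectors

# Toy usage with random weights, just to show the shapes line up.
rng = np.random.default_rng(0)
embed_width = 4 * 96                      # four embedding tables of width 96 (assumed)
embedded = rng.normal(size=(10, embed_width))
W_mix = rng.normal(size=(token_width, n_pieces, embed_width))
b_mix = np.zeros((token_width, n_pieces))
convs = [(rng.normal(size=(token_width, n_pieces, 3 * token_width)),
          np.zeros((token_width, n_pieces))) for _ in range(4)]
print(tok2vec(embedded, W_mix, b_mix, convs).shape)   # (10, 128)
```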

To boost the representation, the tagger actually predicts a "super tag" with POS, morphology and dependency label. This part is novel, and it helps quite a lot, especially for languages such as Spanish where the POS task is by itself too easy. (Edit: Actually, not so novel -- and I'd actually read this paper, and even discussed it with Yoav! So easy to lose track...)

The tagger predicts these supertags by adding a softmax layer onto the convolutional layer, so we're teaching the convolutional layer to give us a representation that's one affine transform away from this informative lexical information. This is obviously good for the parser (which backprops to the convolutions too).
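Purely for illustration, the supertag target can be thought of as the three annotations joined into a single label for the softmax layer to predict; the separator and attribute values below are made up:

```python
# Hypothetical "super tag": POS + morphology + dependency label as one class.
def supertag(pos, morph, dep):
    return f"{pos}|{morph}|{dep}"

print(supertag("NOUN", "Gender=Fem|Number=Sing", "nsubj"))
# NOUN|Gender=Fem|Number=Sing|nsubj
```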

The parser model makes a state vector by concatenating the vector representations for its context tokens. The current context tokens are:

  • S0, S1, S2: Top three words on the stack
  • B0, B1: First two words of the buffer
  • S0L1, S0L2: Leftmost and second leftmost children of S0
  • S0R1, S0R2: Rightmost and second rightmost children of S0
  • S1L1, S1L2, S1R1, S1R2, B0L1, B0L2: Likewise for S1 and B0

This makes the state vector quite long: 13*T, where T is the token vector width (128 is working well). Fortunately, there's a way to structure the computation to save some expense (and make it more GPU friendly).
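A minimal sketch of assembling that state vector, assuming a list of the 13 context-token indices (with -1 marking an empty slot, e.g. S2 when the stack has fewer than three entries):

```python
import numpy as np

def state_vector(context_token_ids, token_vectors, token_width=128):
    # context_token_ids: 13 indices into the sentence, -1 where a slot is empty.
    # Empty slots contribute a zero vector, so the output width is fixed.
    parts = [token_vectors[i] if i >= 0 else np.zeros(token_width)
             for i in context_token_ids]
    return np.concatenate(parts)   # shape: (13 * token_width,)
```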

The parser typically visits 2*N states for a sentence of length N (although it may visit more, if it back-tracks with a non-monotonic transition [4]). A naive implementation would require 2*N (B, 13*T) @ (13*T, H) matrix multiplications for a batch of size B. We can instead perform one (B*N, T) @ (T, 13*H) multiplication, to pre-compute the hidden weights for each positional feature with respect to the words in the batch. (Note that our token vectors come from the CNN, so we can't play this trick over the vocabulary. That's how Stanford's NN parser [3] works, and why its model is so big.)
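To make the trick concrete, here's a hedged NumPy sketch (all sizes invented) showing that gathering and summing rows from a pre-computed per-token table gives the same hidden activations as multiplying each state's concatenated 13*T feature vector by the full (13*T, H) weight matrix:

```python
import numpy as np

T, H, n_tokens, n_feats = 128, 200, 50, 13   # assumed sizes
rng = np.random.default_rng(0)

tokens = rng.normal(size=(n_tokens, T))      # CNN output for one sentence
W = rng.normal(size=(n_feats, T, H))         # one weight slice per positional feature

# Pre-compute every token's contribution to every feature slot's hidden units.
# Equivalent to the single (n_tokens, T) @ (T, 13*H) multiplication described above.
precomputed = np.einsum("nt,fth->nfh", tokens, W)   # (n_tokens, 13, H)

def hidden_naive(feature_ids):
    # Concatenate the 13 context-token vectors and multiply by the full matrix.
    state_vec = np.concatenate([tokens[i] for i in feature_ids])   # (13*T,)
    return state_vec @ W.reshape(n_feats * T, H)

def hidden_precomputed(feature_ids):
    # At parse time: just gather 13 pre-computed rows and sum them.
    return sum(precomputed[tok, slot] for slot, tok in enumerate(feature_ids))

feature_ids = rng.integers(0, n_tokens, size=n_feats)
assert np.allclose(hidden_naive(feature_ids), hidden_precomputed(feature_ids))
```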

This pre-computation strategy allows a nice compromise between GPU-friendliness and implementation simplicity. The CNN and the wide lower layer are computed on the GPU, and then the precomputed hidden weights are moved to the CPU, before we start the transition-based parsing process. This makes a lot of things much easier. We don't have to worry about variable-length batch sizes, and we don't have to implement the dynamic oracle in CUDA to train.

Currently the parser's loss function is multilabel log loss [6], as the dynamic oracle allows multiple states to be 0 cost. The gradient of this loss with respect to each class's score is:

(exp(score) / Z) - (exp(score) / gZ)

where Z is the sum of exp(score) over all classes, gZ is the sum of exp(score) over the gold (zero-cost) classes, and the second term applies only to gold classes. I'm very interested in regressing on the cost directly, but so far this isn't working well. I've read that L2 losses generally don't work great in neural networks. This is disappointing. Maybe I'm missing some tricks here?
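A small sketch of that gradient, assuming scores holds the transition scores for one state and is_gold marks the zero-cost transitions identified by the dynamic oracle (the variable names and the max-shift for numerical stability are my additions):

```python
import numpy as np

def multilabel_logloss_grad(scores, is_gold):
    exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
    Z = exp_scores.sum()                         # partition over all transitions
    gZ = exp_scores[is_gold].sum()               # partition over the gold (zero-cost) ones
    grad = exp_scores / Z                        # exp(score) / Z for every class
    grad[is_gold] -= exp_scores[is_gold] / gZ    # minus exp(score) / gZ for gold classes
    return grad

scores = np.array([1.0, 2.0, 0.5, 3.0])
is_gold = np.array([False, True, False, True])
print(multilabel_logloss_grad(scores, is_gold))
```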

Machinery is in place for beam search, which has been working well for the linear model. Beam search should benefit greatly from the pre-computation trick. However, I do like parsing the entire input without requiring sentence boundary detection as a pre-process, and this is tricky to do correctly with the beam. The current beam implementation introduces quadratic time complexity for long inputs, as it copies state data that's O(N) in the length of the sentence.

Bibliography

[1] Simple and Accurate Dependency Parsing Using Bidirectional LSTM Feature Representations. Eliyahu Kiperwasser, Yoav Goldberg (2016)

[2] Stack-propagation: Improved Representation Learning for Syntax. Yuan Zhang, David Weiss (2016)

[3] A Fast and Accurate Dependency Parser using Neural Networks. Danqi Chen, Christopher D. Manning (2014)

[4] An Improved Non-monotonic Transition System for Dependency Parsing. Matthew Honnibal, Mark Johnson (2015)

[5] A Dynamic Oracle for Arc-Eager Dependency Parsing. Yoav Goldberg, Joakim Nivre (2012)

[6] Parsing the Wall Street Journal using a Lexical-Functional Grammar and Discriminative Estimation Techniques. Stefan Riezler et al. (2002)

[7] Parsing English in 500 Lines of Python. Matthew Honnibal (2013)


@anna-hope

Thanks for writing this up.

Is this parsing model currently available on the develop branch?

@ines
Member

ines commented Jun 5, 2017

See the v2.0.0 alpha release notes and #1105 🎉

@mollerhoj
Contributor

@honnibal: Please correct me if I'm wrong, but the shared CNN in spaCy 2.0 seems to have a big drawback: the training data for POS, NER and dependencies must come from the same sentences.

For languages without many resources, it's often the case that the training data for POS tags, NER and dependency annotations comes from different sources. (Usually, corpora with dependency annotations are quite small, whereas corpora with POS tags are large.)

@honnibal
Member Author

@mollerhoj The newest release fixes this by adding an update_shared flag and giving each model a private copy of the CNN as well. See here for further discussion: https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting

@mollerhoj
Contributor

@honnibal Yay! A new blog post, can't wait to read it! It's much better to read your posts than try to digest scientific articles. Keep up the good work, I'm a big fan!

@StrawberryDream

@honnibal Hi, I installed version 2.0.7 and also synced the latest code on the master branch, but I did not find the "update_shared" flag in the update() function in pipeline.pyx. I wonder if this feature was implemented in another way. How can I tune the POS tagger and make the dependency parser use the tuned POS tagger? Thank you very much for your help!

@muzaluisa

May I ask which paper you used for the NER implementation? Or where can I find the NER implementation details? Here you mention the dependency parser, but not specifically NER. Thanks!

@l2edzl3oy

@muzaluisa see video here and blog post here, alongside all the above information in this issue thread. Those were useful references for me and I hope they are for you too :)

@honnibal @ines Token currently has lemma and norm attributes. Based on my understanding, norm is used as a feature for the model, and I was wondering how lemma is used in the model (if at all). I'm trying to wrap my head around the difference between lemma and norm, as both seem to be the "base" form of the original text of the token (i.e. the orth attribute) and hence should have the same value. I was wondering if this distinction was made due to the future neural lemmatizer - am I right to assume so? Thanks!

@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 9, 2018