Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design: Tagger parser merge step 2 #52

Open
Travelinglight opened this issue Jul 21, 2016 · 2 comments
Open

Design: Tagger parser merge step 2 #52

Travelinglight opened this issue Jul 21, 2016 · 2 comments

Comments

@Travelinglight
Copy link
Member

Travelinglight commented Jul 21, 2016

For 1st order features

  1. Eisner state C[s][t][d][c] is a dict, of which the keys and values are both tuples:

    {(POS_of_head, POS_of_dependent): (score, mid_index), ....}

  2. Keep every combination of POSTAG of head and dependent

for q in range(s, t):
    for pos_head in possible_POSTAG_of_head:
        for pos_dept in possible_POSTAG_of_dependent:
            for pos_mid in possible_POSTAG_of_q:
                # complete right triangle
                score = C[s][q][->][1](pos_head, pos_mid)][0] + C[q][t][->][0][(pos_mid, pos_dept)][0]
                if (C[s][t][->][0][(pos_head, pos_dept)][0] < score)
                    C[s][t][->][0][(pos_head, pos_dept)] = (score, q)
                # incomplete right trapezoidal
                score = C[s][q][->][0](pos_head, pos_mid)][0] + C[q][t][<-][0][(pos_mid, pos_dept)][0] + get_score(s, t, pos_head, pos_dept)
                if (C[s][t][->][1][(pos_head, pos_dept)][0] < score)
                    C[s][t][->][0][(pos_head, pos_dept)] = (score, q)
                # complete left triangle
                ....
                # incomplete left trapezoidal
                ....

For 2nd order features (with cube pruning)

  1. Eisner state C[s][t][d][c] is a list of tuples:

    [(feature_signature , POS_of_head, POS_of_dependent, score, mid_index), ....]

  2. Keep only top k tuples

for q in range(s, t):
    for pos_head in possible_POSTAG_of_head:
        for pos_mid in possible_POSTAG_of_q:
            for pos_dept in possible_POSTAG_of_dependent:

                # complete right triangle
                list_left = SELECT tuples FROM C[s][q][->][1] WHERE tuple[1] == pos_head AND tuple[2] == pos_mid ORDER BY score
                list_right = SELECT tuples FROM C[q][t][->][0] WHERE tuple[1] == pos_mid AND tuple[2] == pos_dept ORDER BY score
                C[s][t][->][0].push((q, pos_head, pos_mid, explore(list_left, list_right), q))

                # incomplete right trapezoidal
                list_left = SELECT tuples FROM C[s][q][->][0] WHERE tuple[1] == pos_head AND tuple[2] == pos_mid ORDER BY score
                list_right = SELECT tuples FROM C[q][t][<-][0] WHERE tuple[1] == pos_dept AND tuple[2] == pos_mid ORDER BY score
                C[s][t][->][1].push((t, pos_head, pos_dept, explore(list_left, list_right), q))

                # complete left triangle
                ....
                # incomplete left trapezoidal
                ....
            keep_highest_score(C[s][t][->][0], q, pos_head, pos_mid)
            keep_highest_score(C[s][t][<-][0], q, pos_head, pos_mid)
keep_k_scores(C, s, t)

This can also implemented by dict. In this case, the function keep_k_scores would extract items from the dict, sort them, and delete items other than k best from the dict.

@Travelinglight
Copy link
Member Author

Just find out the design may not be applicable. In the design, only the POS tag of s and t are determined, but when the feature generator generates in_between features, it requires that all words between s and t have only one POS tag.

english_1st_fgen.pyx Line #208
for between_index in range(start_index + 1, end_index):
    xb_pos = self.POSTAG[between_index]
    # Add all words between xi and xj into the feature
    local_fv.add( (2,0,xi_pos,xb_pos,xj_pos) )

Thus, in_between features depend on the POS tag of the words between s and t.
Also, surrounding features depend on the POS tag of the surrounding words.

Possible solutions:

  1. generate all possible features (with all possible in_between POS tags), but this may not be good for training
  2. dismiss in_between features and surrounding features, but also not be good for training

I'm still trying to invent a better design. Anyone has a good idea on this problem?

@anoopsarkar
Copy link
Member

Third solution is to take the single best POS tag for each word between head and modifier for this feature function.

We need to have the single best POS tag computed anyway for unknown words. In this case, we will use it for known words as well for this feature function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants