Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the comment of Translation Matrix #1594

Merged
merged 8 commits into from
Sep 25, 2017

Conversation

robotcator
Copy link
Contributor

No description provided.

@robotcator robotcator changed the title Fix the comment of Tr Fix the comment of Translation Matrix Sep 19, 2017

Args:
`word_pair` (list): a list pair of words
`word_pairs` (list): a list pair of words
`source_space` (Space object): source language space
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

train method use only word_pairs, what is source/target space here?

self.source_space = Space.build(self.source_lang_vec, set(self.source_word))
self.target_space = Space.build(self.target_lang_vec, set(self.target_word))
self.source_word, self.target_word = zip(*word_pairs)
if self.translation_matrix is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if I called train twice, in the second time I don't fit model.
Please remove this if

self.source_word_vec_file = datapath("EN.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt")
self.target_word_vec_file = datapath("IT.1-10.cbow1_wind5_hs0_neg10_size300_smpl1e-05.txt")

with utils.smart_open(self.train_file, "r") as f:
self.word_pair = [tuple(utils.to_unicode(line).strip().split()) for line in f]
self.word_pairs = [("one", "uno"), ("two", "due"), ("three", "tre"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use hanging indents

("grape", "acino"), ("banana", "banana"), ("mango", "mango")
]

self.test_word_pairs = [("ten", "dieci"), ("dog", "cane"), ("cat", "gatto")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove ("dog", "cane") from self.word_pairs

@@ -91,7 +91,7 @@ def build(cls, lang_vec, lexicon=None):
return Space(mat, words)

def normalize(self):
""" normalized the word vector's matrix """
""" Normalized the word vector's matrix """
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Normalize…' (imperative rather than past-tense)

model = super(TranslationMatrix, cls).load(*args, **kwargs)
return model

def apply_transmat(self, words_space):
"""
mapping the source word vector to the target word vector using translation matrix
Mapping the source word vector to the target word vector using translation matrix
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'Map…' (imperative rather than '-ing' form)

@gojomo
Copy link
Collaborator

gojomo commented Sep 21, 2017

The handling of word_pairs in __init__() and train() now makes sense, thanks. The comments have been improved but still may benefit from a deep review for clarity/wording.

Though I know I requested the Doc2Vec-related example, in its current form the motivations/benefits are muddled. Really it shouldn't require a separate helper class (BackMappingTranslationMatrix), and the notebook section ("Tranlation Matrix Revisit") is hard-to-follow, and includes a bunch of improper practices. (For example: using an imbalanced set of docs for the mapping 'overlap'; using the slow-and-iffy dm_concat mode; calling train() multiple times with a sawtooth alpha progression, etc.)

The word-translation example can presumably be evaluated based on real datasets in the original context that motivated the approach, while the doc-vec example will need more novel design/evaluation – so I'd recommend splitting them into separate notebooks.

@robotcator
Copy link
Contributor Author

robotcator commented Sep 21, 2017

Thanks. You do remind me the imbalanced set problem in the example. And the code for training document vector are borrowed from the doc2vec-imdb.ipynb and I will re-train the document vector.
As for the imbalanced data, how to sample documents to 'overlap' according to the sentiment or whether it is in the train or test set. (according to the sentiment to sample is more logical)

For the BackMappingTranslationMatrix class, I didn't find a good way to integrate this function into my TranslationMatrix class, so I separate this into two class. Because the word2vec and doc2vec has different method to access the vector.

for word2vec, use model[word] to get the word vector.
for doc2vec, use model.doc2vec['doc_tag'] to get the document vector.

If BackMappingTranslationMatrix was integrated into TranslationMatrix, I would handle them separately according to the type (ininstance method), is it appropriate?

I didn't catch that The word-translation example can be evaluated based on real datasets in the original context , can you please explain in more detail?

@gojomo
Copy link
Collaborator

gojomo commented Sep 21, 2017

I meant there are published papers about using word-vector transformations for language-translation - the original Google paper, the Dinu paper – so there are specific datasets & procedures to mimic – and similar results would indicate everything is working. The Doc2Vec use is novel so requires more experimentation/thought.

@robotcator
Copy link
Contributor Author

robotcator commented Sep 22, 2017

The word pairs used in this experiment are extracted from the OPUS(http://opus.lingfil.uu.se/). The same as the Ninu's paper. I plot the vis to show the linear relationship between two language vector space and use the word translation to show this transformation works. More re-produced experiments from the Mikolov's and Ninu's paper would be fine to support this transformation. But I still can not find any experiment for language translation(Do it mean sentences translation ) from the two paper I mentioned here? Can you remind me If I miss something?

I added "unstable/experimental" warning tag in notebook explicitly for doc2vec transformation part as Ivan suggested.

I also found a paper (OFFLINE BILINGUAL WORD VECTORS,ORTHOGONALTRANSFORMATIONS AND THE INVERTED SOFTMAX) is related to this experiment. But I‘m just having a look and need dig deeper.

@menshikh-iv
Copy link
Contributor

Thank you @robotcator for fast fixes.
Thank you @gojomo for a review:+1:

@menshikh-iv
Copy link
Contributor

Need to fix some typos/pep8 issues in a notebook, but I can't wait for more, it's release time.

@menshikh-iv menshikh-iv merged commit 33a3ef2 into piskvorky:develop Sep 25, 2017
@robotcator robotcator deleted the mydevelop branch September 25, 2017 12:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants