Complete seq2seq for fluid #56

Merged: pkuyym merged 6 commits into dzhwinter:master on Jan 17, 2018
Conversation

pkuyym (Collaborator) commented on Jan 15, 2018:

Resolves #55
Resolves #22


parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
    "--word_vector_dim",

Owner: embedding_dim better?

Author: Followed.

import distutils.util

import paddle.v2 as paddle
import paddle.v2.fluid as fluid

Owner: Since the benchmark serves as demo code, we'd like to have only

    import paddle.v2 as paddle
    import paddle.v2.fluid as fluid

just like TensorFlow demos have only "import tensorflow as tf". Nothing else.

help="The dictionary capacity. Dictionaries of source sequence and "
"target dictionary have same capacity. (default: %(default)d)")
parser.add_argument(
"--pass_number",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for unity, pass_num

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed.

    type=str,
    default='train',
    choices=['train', 'infer'],
    help="Do training or inference. (default: %(default)s)")

Owner: [comment text not captured]

Author: Thanks, followed.

    target_dict_dim,
    is_generating=False,
    beam_size=3,
    max_length=250):

Owner: Leave the default values of max_length and beam_size to argparse.

Author: Followed.
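For illustration, a minimal sketch of what moving these defaults into argparse could look like; the flag names mirror the snippets in this PR, while the help strings and the network-builder name are assumptions:

    # Hypothetical sketch: let argparse own the beam_size/max_length defaults
    # so the network function receives them explicitly instead of hard-coding
    # its own keyword defaults.
    parser.add_argument(
        "--beam_size",
        type=int,
        default=3,
        help="The width of beam search. (default: %(default)d)")
    parser.add_argument(
        "--max_length",
        type=int,
        default=250,
        help="The maximum length of a generated sequence. (default: %(default)d)")

    # args = parser.parse_args()
    # seq_to_seq_net(..., beam_size=args.beam_size, max_length=args.max_length)
    # (seq_to_seq_net is a placeholder name for the network builder.)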

"""Construct a seq2seq network."""
feeding_list = ["source_sequence", "target_sequence", "label_sequence"]

def bi_lstm_encoder(input_seq, size):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here maybe we need a notation.
the lstm unit has 4 parameters, hidden, memory_cell, ...
so need to multiply by 4.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add detailed comments.

        size=size * 4,
        act='tanh')
    forward, _ = fluid.layers.dynamic_lstm(
        input=input_forward_proj, size=size * 4)

Owner: Double-check that dynamic_lstm needs the factor of 4. I vaguely remember it is handled inside the LSTM layer.

Author:
> double check dynamic_lstm need 4. I roughly remember it has been done inside the lstm layer.

https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/fluid/layers/nn.py#L231

> name a gate_size is a good idea.

Agree.
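A sketch of the gate_size idea under discussion. It assumes the fluid.layers API as used in this PR (dynamic_lstm taking a size of 4 times the hidden size, is_reverse for the backward pass); the variable names are illustrative:

    # Illustrative sketch: name the 4x quantity once so the multiplier is
    # self-documenting. fluid's dynamic_lstm expects its input projection
    # and its size argument to be 4 * hidden size: one weight block each
    # for the input, forget, and output gates plus the cell candidate.
    def bi_lstm_encoder(input_seq, size):
        gate_size = size * 4  # 4 weight blocks per LSTM step

        forward_proj = fluid.layers.fc(
            input=input_seq, size=gate_size, act='tanh')
        forward, _ = fluid.layers.dynamic_lstm(
            input=forward_proj, size=gate_size)

        backward_proj = fluid.layers.fc(
            input=input_seq, size=gate_size, act='tanh')
        backward, _ = fluid.layers.dynamic_lstm(
            input=backward_proj, size=gate_size, is_reverse=True)
        return forward, backward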

    default=16,
    help="The sequence number of a batch data. (default: %(default)d)")
parser.add_argument(
    "--dict_size",

Owner: This value is determined by the dataset. Should it be an argument?

"--max_length",
type=int,
default=250,
help="The max length of sequence when doing generation. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max -> maximum

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed.

"--batch_size",
type=int,
default=16,
help="The sequence number of a batch data. (default: %(default)d)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of a mini-batch

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Followed.

"--encoder_size",
type=int,
default=512,
help="The size of encoder bi-rnn unit. (default: %(default)d)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size -> dimension

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinks both are ok, but size is shorter.

"--decoder_size",
type=int,
default=512,
help="The size of decoder rnn unit. (default: %(default)d)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

size -> dimension

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinks both are ok, but size is shorter.

"--use_gpu",
type=distutils.util.strtobool,
default=True,
help="Whether use gpu. (default: %(default)d)")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to use

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, followed.


def lstm_decoder_with_attention(target_embedding, encoder_vec, encoder_proj,
                                decoder_boot, decoder_size):
    def simple_attention(encoder_vec, encoder_proj, decoder_state):

Contributor: The attention mechanism is wrong. Where is the tanh operation that appears in the original formula?

Author: I didn't catch your point. Why is tanh necessary for attention? There are several kinds of attention mechanisms. Please refer to https://github.com/PaddlePaddle/Paddle/blob/9bfa3013891cf3da832307894acff919d6705cee/python/paddle/trainer_config_helpers/networks.py#L1400

Contributor (@ranqiu92, Jan 16, 2018): See https://github.com/PaddlePaddle/Paddle/blob/9bfa3013891cf3da832307894acff919d6705cee/python/paddle/trainer_config_helpers/networks.py#L1473. Here, the mixed_layer performs tanh. And the attention mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" uses tanh. Is this what you want to implement?

Author (@pkuyym, Jan 16, 2018): Why do you think it's wrong to apply a linear activation?

Author: To keep things consistent, I will apply tanh in the next PR. Thanks.
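For reference, a minimal sketch of the additive (Bahdanau-style) attention with the tanh the reviewer points to. The fluid layer calls follow the style of this PR, but the exact wiring here is an assumption, not the PR's final code; decoder_size comes from the enclosing lstm_decoder_with_attention scope:

    # Hypothetical sketch of e_ij = v' tanh(W_e h_j + W_d s_{i-1}),
    # i.e. additive attention. The PR scored with a linear activation;
    # the tanh below is what aligns it with Bahdanau et al. (2014).
    def simple_attention(encoder_vec, encoder_proj, decoder_state):
        # Project the previous decoder state and broadcast it over the
        # encoder time steps.
        decoder_state_proj = fluid.layers.fc(
            input=decoder_state, size=decoder_size, bias_attr=False)
        decoder_state_expand = fluid.layers.sequence_expand(
            x=decoder_state_proj, y=encoder_proj)
        # The tanh in question sits here, between the two projections
        # and the scoring layer.
        mixed = fluid.layers.elementwise_add(encoder_proj, decoder_state_expand)
        mixed = fluid.layers.tanh(mixed)
        # One scalar score per encoder step, normalized within each sequence.
        weights = fluid.layers.fc(input=mixed, size=1, bias_attr=False)
        weights = fluid.layers.sequence_softmax(weights)
        # Context vector: attention-weighted sum of the encoder states.
        scaled = fluid.layers.elementwise_mul(
            x=encoder_vec, y=weights, axis=0)
        return fluid.layers.sequence_pool(input=scaled, pool_type='sum')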

Owner (@dzhwinter) left a review: LGTM


fetch_outs = exe.run(
    inference_program,
    feed=dict(zip(*[feeding_list, (src_seq, trg_seq, lbl_seq)])),

Owner: Please fix this issue. Even a plain dict is clearer than the *zip() syntactic sugar.
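A sketch of the suggested fix, spelling out the feed mapping; the key names come from feeding_list earlier in this PR:

    # Build the feed explicitly so the name-to-value pairing is obvious.
    feed = {
        "source_sequence": src_seq,
        "target_sequence": trg_seq,
        "label_sequence": lbl_seq,
    }
    fetch_outs = exe.run(inference_program, feed=feed)

    # Or, if the names should stay driven by feeding_list:
    # feed = dict(zip(feeding_list, (src_seq, trg_seq, lbl_seq)))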

pkuyym merged commit 315b20f into dzhwinter:master on Jan 17, 2018.