
How is the contrastive data pipeline implemented? #12

Open
MarkYangjiayi opened this issue Aug 27, 2023 · 8 comments
Comments

@MarkYangjiayi

Hi, I saw the paper mention that C_curr and C_prev come from the same document in the batch, but I didn't really see how this is implemented.

It seems that in the data_processing part of the code, the processor just samples a new piece of data each time. How does it guarantee that the batches at different steps contain context from the same document? Thanks.

@hxs91

hxs91 commented Aug 28, 2023

I have the same question. Perhaps it uses the same data pipeline as Memorizing Transformers (Figure 3)?

@CStanKonrad
Owner

As mentioned in the README, the instruction fine-tuning does not use FoT.
In fact, it can be thought of as a "modified" FoT with cross_batch=1 because:

  • We take the document and randomly pad it (on the left and right) so that it has 2048 tokens
  • Then we feed the document to the model; since last_context_length is 1024, part of the document is loaded into memory and constitutes C_prev
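The padding-and-split scheme described in these two bullets could be sketched roughly as follows (a minimal illustration, not the repository's actual code; `pad_and_split` and `pad_id` are hypothetical names):

```python
import random

def pad_and_split(tokens, pad_id=0, total_len=2048, last_context_length=1024):
    """Sketch: randomly distribute padding on the left and right so the
    document has total_len tokens, then split it so that everything before
    the final last_context_length tokens goes to memory (C_prev) and the
    final last_context_length tokens form the local context (C_curr)."""
    pad = total_len - len(tokens)
    assert pad >= 0, "document longer than total_len; truncate first"
    left = random.randint(0, pad)            # random left/right padding split
    padded = [pad_id] * left + tokens + [pad_id] * (pad - left)
    c_prev = padded[:-last_context_length]   # loaded into memory
    c_curr = padded[-last_context_length:]   # processed as the last context
    return c_prev, c_curr
```

With total_len=2048 and last_context_length=1024, C_prev and C_curr are each 1024 tokens, so each document yields exactly one (C_prev, C_curr) pair, consistent with cross_batch=1.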

However, this is not the implementation that was used to create the base models.
We plan to release the official FoT large-scale continual pre-training (FoT fine-tuning) code within two weeks (this code will be in JAX).

@hxs91

hxs91 commented Sep 6, 2023

@MarkYangjiayi As described in Appendix A.2 of the FoT paper, FoT may not need the same data-processing pipeline as Memorizing Transformers. C_curr and C_prev are not represented across batches; instead, they are segments (vertical slices) within a batch. This would explain two statements in the FoT paper:

  1. "FOT does not use memory during training, while MT does."
  2. "FOT does not require long documents in the training set, while MT does in order to capture long dependencies in memory"

If this is correct, what does FoT's data processing look like? Does FoT split long documents into multiple subsequences, like Memorizing Transformers, so that training can use as much data from one long document as possible? Or does it just truncate and pad each individual document? @CStanKonrad
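The segments-within-a-batch interpretation above could be sketched like this (a hypothetical reading, not the paper's actual pipeline; `make_segment_pairs` and the field names are invented for illustration):

```python
def make_segment_pairs(docs_tokens, seg_len=2048):
    """Sketch: split each document into consecutive seg_len segments.
    Each (previous, current) pair from the same document supplies the
    positive keys (C_prev) for the current segment (C_curr); segments
    from other documents in the same batch would act as negatives."""
    pairs = []
    for doc_id, toks in enumerate(docs_tokens):
        segs = [toks[i:i + seg_len] for i in range(0, len(toks), seg_len)]
        for prev, curr in zip(segs, segs[1:]):
            pairs.append({"doc_id": doc_id, "c_prev": prev, "c_curr": curr})
    return pairs
```

Under this reading, no external memory is needed during training (segments sit side by side in the batch), and documents only need to be long enough to yield at least two segments, which would be consistent with the two quoted statements.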

@HuXinjing

Have there been any developments on the official FoT large-scale continual pre-training (FoT fine-tuning) code?

@NickGao96

As mentioned in the README, the instruction fine-tuning does not use FoT. In fact, it can be thought of as a "modified" FoT with cross_batch=1 because:

  • We take the document and randomly pad it (on the left and right) so that it has 2048 tokens
  • Then we feed the document to the model; since last_context_length is 1024, part of the document is loaded into memory and constitutes C_prev

However, this is not the implementation that was used to create the base models. We plan to release the official FoT large-scale continual pre-training (FoT fine-tuning) code within two weeks (this code will be in JAX).

It's been almost two weeks; how is the plan for releasing the FoT pipeline coming along? Still looking forward to seeing the actual implementation of the cross-batch contrastive learning in FoT.

@MarkYangjiayi
Author

@hxs91 My hypothesis is that FoT uses a training strategy similar to Recurrent Memory Transformer: if you want to train with a local context of 2k and 4 segments, you feed in 8k tokens and split them in the training loop.
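The Recurrent-Memory-Transformer-style hypothesis above could be sketched as follows (an illustration of the hypothesis only; `training_segments` is an invented name):

```python
def training_segments(long_tokens, local_ctx=2048, num_segments=4):
    """Sketch: take num_segments * local_ctx tokens (8k for 2k x 4)
    and split them inside the training loop, so that earlier segments
    can populate memory for later segments of the same document."""
    assert len(long_tokens) == local_ctx * num_segments
    return [long_tokens[i * local_ctx:(i + 1) * local_ctx]
            for i in range(num_segments)]
```

Keeping all segments inside one training step (rather than spreading them across batches) is what would let gradients flow between a segment and the memory built from its predecessors.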

@hxs91

hxs91 commented Sep 20, 2023

@hxs91 My hypothesis is that FoT uses a training strategy similar to Recurrent Memory Transformer: if you want to train with a local context of 2k and 4 segments, you feed in 8k tokens and split them in the training loop.

Yeah, I realize that if different segments are placed in different batches, the computation is not differentiable across them, which is inconsistent with the description in the FoT paper.

@CStanKonrad
Owner

I apologize for the late response and the delay in publishing the continued pre-training code. The FoT continued pre-training code is now available here. A brief explanation of this implementation can be found here.
