LongT5: Efficient Text-To-Text Transformer for Long Sequences

LongT5 is an extension of the T5 model that handles long sequence inputs more efficiently. We integrated attention ideas from the long-input transformer ETC and adopted pre-training strategies from the summarization pre-training method PEGASUS into the scalable T5 architecture. The result is a new attention mechanism we call Transient Global (TGlobal), which mimics ETC's local/global attention mechanism, but without requiring additional side-inputs. We achieve state-of-the-art results on several summarization and question answering tasks, and outperform the original T5 models on these tasks.
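To illustrate why a local/global pattern of this kind scales to long inputs, the sketch below counts query-key pairs under a TGlobal-style pattern and under full attention. It is a minimal sketch, not the LongT5 implementation: it assumes each token attends to tokens within a fixed local radius plus one global token per fixed-size block, and it leaves out learned projections, normalization, and position biases. The function names and the `local_radius` and `block_size` values are illustrative assumptions.

```python
import math


def tglobal_keys(i, seq_len, local_radius, block_size):
    """Return (local positions, global block ids) that token i attends to.

    Assumption: one global token per block of `block_size` input tokens,
    and every input token can attend to all global tokens.
    """
    num_blocks = math.ceil(seq_len / block_size)
    start = max(0, i - local_radius)
    end = min(seq_len - 1, i + local_radius)
    return list(range(start, end + 1)), list(range(num_blocks))


def count_attention_pairs(seq_len, local_radius, block_size):
    """Count query-key pairs for the sparse pattern and for full attention."""
    num_blocks = math.ceil(seq_len / block_size)
    sparse = 0
    for i in range(seq_len):
        start = max(0, i - local_radius)
        end = min(seq_len - 1, i + local_radius)
        # Local window size plus one key per global token.
        sparse += (end - start + 1) + num_blocks
    return sparse, seq_len * seq_len


if __name__ == "__main__":
    # Illustrative settings only; see the paper and configs for real values.
    for n in (512, 4096, 16384):
        sparse, full = count_attention_pairs(n, local_radius=127, block_size=16)
        print(f"seq_len={n}: sparse={sparse}, full={full}, ratio={sparse / full:.3f}")
```

With a fixed radius, the local term grows linearly in the sequence length; only the global term grows quadratically, and it is scaled down by the block size.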

Summarization Results

LongT5 achieves state-of-the-art performance on several summarization benchmarks that require longer context or multi-document understanding. The table shows ROUGE-1 scores. LongT5 base models are all reported with 4k input tokens; large and xl models are trained with 16k tokens for arXiv, PubMed, and BigPatent, 8k for MultiNews, and 4k for MediaSum and CNN/Daily Mail.

| Model | arXiv | PubMed | BigPatent | MultiNews | MediaSum | CNN/Daily Mail |
|---|---|---|---|---|---|---|
| DANCER PEGASUS | 45.01 | 46.34 | - | - | - | - |
| BigBird-PEGASUS (large) | 46.63 | 46.32 | 60.64 | - | - | - |
| HAT-BART | 46.68 | 48.36 | - | - | - | 44.48 |
| LED (large) | 46.64 | - | - | - | - | - |
| PRIMER | 47.6 | - | - | 49.9 | - | - |
| TG-MultiSum | - | - | - | 47.10 | - | - |
| BART (large) | - | - | - | - | 35.09 | - |
| LongT5 base | 44.87 | 47.77 | 60.95 | 46.01 | 35.09 | 42.15 |
| LongT5 large | 48.28 | 49.98 | 70.38 | 47.18 | 35.53 | 42.49 |
| LongT5 xl | 48.35 | 50.23 | 76.87 | 48.17 | 36.15 | 43.94 |
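For readers unfamiliar with the metric, ROUGE-1 measures unigram overlap between a generated summary and a reference. The following is a simplified sketch of a ROUGE-1 F-measure on a 0-100 scale, assuming lowercasing and whitespace tokenization. It is not the scorer used to produce the numbers above, which may tokenize and stem differently; the function name `rouge1_f` is hypothetical.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-measure (0-100) from clipped unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 100.0 * 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge1_f("the cat sat on the mat", "the cat sat"))
```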

QA Results

Natural Questions

For NQ, we compare T5.1.1 and LongT5 with TGlobal attention. As baselines, we ran T5.1.1 (1) with the default input sequence length of 512 and (2) with the largest input sequence length that fits into device memory. Since we are comparing against T5.1.1, for LongT5 we report results at 512 input length for base and large, and at the largest input length each model allows before running out of memory on the same hardware configuration used in our T5.1.1 experiments. For the base and large models, we used 4x8 TPUv3 and no model partitioning; for the xl model, we used 8x16 TPUv3 and 8 partitions.

| Model | EM | F1 |
|---|---|---|
| T5.1.1 base-512 | 50.93 | 52.54 |
| T5.1.1 base-6k | 56.73 | 56.73 |
| T5.1.1 large-512 | 57.29 | 60.68 |
| T5.1.1 large-3k | 60.09 | 64.17 |
| T5.1.1 xl-4k | 60.75 | 64.07 |
| LongT5 base-512 | 55.73 | 59.06 |
| LongT5 base-12k | 58.12 | 62.44 |
| LongT5 large-512 | 57.55 | 61.53 |
| LongT5 large-4k | 60.77 | 65.38 |
| LongT5 xl-8k | 62.66 | 66.61 |

Moreover, our Input Length vs. Speed and Input Length vs. Performance analyses on NQ show that (1) at shorter sequence lengths, T5.1.1 and LongT5 variants have similar speeds, but as the sequence length increases, LongT5 becomes significantly faster, (2) T5.1.1 models reach their out-of-memory point much earlier than LongT5 models, and (3) performance increases significantly as input length increases.

TriviaQA

For TriviaQA, we compare LongT5 with various top approaches on the leaderboard. All LongT5 models are reported with 16k input tokens.

| Model | EM | F1 |
|---|---|---|
| BigBird-ETC (random attn) | 80.86 | 84.5 |
| Fusion-in-Decoder | 80.09 | 84.35 |
| ReadTwice | 76.86 | 80.85 |
| LongT5 base | 74.67 | 78.9 |
| LongT5 large | 78.38 | 82.45 |
| LongT5 xl | 81.00 | 84.83 |

Usage

Data Preprocessing

Most of our tasks use TensorFlow Datasets, which works directly with the SeqIO library used by T5. For Natural Questions and MediaSum, we provide our own data preprocessing code. To run the tasks corresponding to these datasets, set NQ_DATA_DIR and MEDIASUM_DATA_DIR in tasks.py to the location of the output files produced by the preprocessing code.

Example command for running NQ data preprocessing:

# Data path where the NQ json files are downloaded to.
INPUT_PATH="..."
# Data path where the output files will be generated.
OUTPUT_PATH="..."
LONGT5_DIR="..."  # directory where the LongT5 repo is cloned.

python3 ${LONGT5_DIR}/data/nq_preprocess.py \
  --input_path=${INPUT_PATH} \
  --output_path=${OUTPUT_PATH}
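
If you prefer to drive the preprocessing from Python (for example, inside a larger pipeline), the same command can be assembled as below. This is a minimal sketch: the helper names `build_nq_preprocess_cmd` and `run_nq_preprocess` are hypothetical and not part of the LongT5 repo. Only the script path and the two flags come from the command above.

```python
import os
import subprocess
import sys


def build_nq_preprocess_cmd(longt5_dir, input_path, output_path):
    """Build the argument list for data/nq_preprocess.py (hypothetical helper)."""
    script = os.path.join(longt5_dir, "data", "nq_preprocess.py")
    return [
        sys.executable,
        script,
        f"--input_path={input_path}",
        f"--output_path={output_path}",
    ]


def run_nq_preprocess(longt5_dir, input_path, output_path):
    """Run the preprocessing script; raises CalledProcessError on failure."""
    cmd = build_nq_preprocess_cmd(longt5_dir, input_path, output_path)
    subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues when the data paths contain spaces.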

Training

The experiments are defined in the tasks.py file. Our architecture, model, and training configurations can be found in the Flaxformer GitHub repository.

Released Model Checkpoints

We have released the following checkpoints for LongT5 pre-trained models:

Additionally, we have released the following checkpoints for mLongT5 pre-trained models:

Citing LongT5

If you use LongT5 in your research, please cite LongT5: Efficient Text-To-Text Transformer for Long Sequences.

@inproceedings{guo2022longt5,
    title = "{L}ong{T}5: {E}fficient Text-To-Text Transformer for Long Sequences",
    author = "Mandy Guo and Joshua Ainslie and David Uthus and Santiago Onta{\~n}{\'o}n and Jianmo Ni and Yun-Hsuan Sung and Yinfei Yang",
    booktitle = "Findings of the Association for Computational Linguistics: NAACL 2022",
    year = "2022",
    url = "https://aclanthology.org/2022.findings-naacl.55",
    pages = "724--736",
}

For mLongT5, please cite mLongT5: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences.

@misc{uthus2023mlongt5,
    title = "{mLongT5}: A Multilingual and Efficient Text-To-Text Transformer for Longer Sequences",
    author = "David Uthus and Santiago Onta{\~n}{\'o}n and Joshua Ainslie and Mandy Guo",
    year = "2023",
    eprint = "2305.11129",
    archivePrefix = "arXiv",
    primaryClass = "cs.CL",
    url = "https://arxiv.org/abs/2305.11129"
}