Large language models (LLMs) exhibit remarkable in-context learning (ICL) capabilities. However, the underlying working mechanism of ICL remains poorly understood. Recent research presents two conflicting views on ICL: one attributes it to LLMs' inherent ability to recognize tasks, deeming label correctness and the number of demonstration shots not crucial; the other emphasizes the impact of similar examples in the demonstrations, stressing the need for correct labels and more shots.
Figure: An overview of the proposed two-dimensional coordinate system for ICL.
In this work, we provide a Two-Dimensional Coordinate System that unifies both views into a systematic framework. The framework explains the behavior of ICL through two orthogonal variables: whether LLMs can recognize the task and whether similar examples are presented in the demonstrations. We propose the peak inverse rank metric to detect the task recognition ability of LLMs and study LLMs' reactions to different definitions of similarity. Based on these, we conduct extensive experiments to elucidate how ICL functions across each quadrant on multiple representative classification tasks. Finally, we extend our analyses to generation tasks, showing that our coordinate system can also be used to interpret ICL for generation tasks effectively.
We find that although ICL tends to slightly favor semantically similar examples over lexically similar ones, the preference for both is significantly greater than for randomly selected examples. Thus, regardless of whether the similarity is lexical or semantic, we consider demonstrations to contain similar examples as long as they include examples of either type.
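To make the two notions of similarity concrete, here is a minimal sketch of how lexically and semantically similar demonstrations could be retrieved for a test input. The Jaccard word-overlap criterion, the all-MiniLM-L6-v2 sentence encoder, and the helper names are illustrative assumptions rather than the exact retrieval setup used in the paper.

# Illustrative sketch: retrieving lexically vs. semantically similar examples.
# The encoder choice and helper functions are assumptions for demonstration only.
import numpy as np
from sentence_transformers import SentenceTransformer

def lexical_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets (a simple lexical criterion)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def lexical_top_k(query: str, pool: list[str], k: int = 4) -> list[str]:
    """Rank candidate examples by lexical (word-overlap) similarity to the query."""
    scores = [lexical_similarity(query, c) for c in pool]
    return [pool[i] for i in np.argsort(scores)[::-1][:k]]

def semantic_top_k(query: str, pool: list[str], k: int = 4) -> list[str]:
    """Rank candidate examples by cosine similarity of sentence embeddings."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    embs = encoder.encode([query] + pool)
    q, cands = embs[0], embs[1:]
    scores = cands @ q / (np.linalg.norm(cands, axis=1) * np.linalg.norm(q) + 1e-9)
    return [pool[i] for i in np.argsort(-scores)[:k]]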
Inspired by the recent work Label Words are Anchors, which demonstrates that label words serve as semantic anchors, we propose a new metric, the peak inverse rank (PIR), to quantify a model's ability to recognize tasks.
For a given layer l, we take the hidden state h_l at the position of the label token and project it into the vocabulary space by multiplying it with the pre-trained language model head E. The rank of the task-representative token within this projected distribution is denoted as rank_task(h_l, E). The PIR at layer l is then the inverse of this rank:

PIR(h_l) = 1 / rank_task(h_l, E)

A PIR close to 1 at some layer means the task-representative token is ranked at or near the top of the projected distribution, indicating that the model has recognized the task.
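For reference, below is a minimal sketch of how this layer-wise inverse rank can be computed with the transformers library. The choice of GPT-2, the toy SST-2-style prompt, the task-representative token, and the direct projection through the output embedding matrix (omitting the final layer norm) are simplifying assumptions; see task_recognition_pir.py for the implementation used in our experiments.

# Minimal sketch of the per-layer PIR computation (assumptions noted in comments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed model; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Review: a gorgeous, witty film. Sentiment: positive"  # toy SST-2-style prompt
label_index = -1  # position of the label token in the prompt (assumption)
task_token_id = tokenizer.encode(" positive")[0]  # task-representative token (assumption)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

E = model.get_output_embeddings().weight  # pre-trained LM head, shape (vocab, dim)
for layer, h in enumerate(outputs.hidden_states):
    logits = h[0, label_index] @ E.T  # project the hidden state into the vocabulary space
    rank = int((logits > logits[task_token_id]).sum().item()) + 1
    print(f"layer {layer:2d}  rank {rank:6d}  inverse rank {1.0 / rank:.4f}")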
These datasets are used for the upper half of the task-recognition axis, i.e., tasks that the models can recognize. We employ the Stanford Sentiment Treebank Binary (SST-2) for sentiment analysis. In addition, we create two datasets for the World Capitals and Reasoning about Colored Objects tasks, which contain 50 hand-crafted pairs of country-capital and object-color, respectively.
These datasets are used for the lower half of the task-recognition axis, i.e., tasks that the models cannot recognize. We utilize the Text REtrieval Conference (TREC) Question Classification dataset for question type classification and the EmoContext (emo) dataset for emotion classification.
We adopt a comprehensive suite of models, including GPT2-XL (1.61B) and GPT-J (6B) from the GPT series; Llama-2-7B, Llama-2-13B, and their instruction-tuned counterparts from the Llama series; and Falcon-40B, along with its instruction-tuned variant from the Falcon series.
In the first quadrant, models can leverage their pre-trained knowledge to make predictions once they recognize the task and can also refer to the labels from similar examples if their pre-trained knowledge is insufficient. However, if the labels of similar examples are incorrect, smaller models tend to replicate these incorrect labels, while larger models tend to rely on their pre-trained knowledge for making predictions.
In the second quadrant, models primarily leverage their pre-trained knowledge to make predictions. Moreover, given that each input-label pair plays an identical role in helping models recognize tasks, increasing the number of in-context examples does not significantly enhance the effectiveness of ICL.
In the third quadrant, ICL fails to work. Specifically, models fail to leverage the ICL content for making predictions and tend to predict the label of the first example.
In the fourth quadrant, models directly replicate the labels of similar examples. Therefore, the performance of ICL depends heavily on whether the labels of similar examples match the ground truth labels of test samples. Additionally, larger models are better at recognizing similar examples, which increases their tendency to copy the labels from these examples.
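To make these setups concrete, the sketch below illustrates how demonstrations with or without a similar example, and with a correct or deliberately flipped label, could be assembled into an ICL prompt. The prompt template, the toy binary labels, and the function name are hypothetical and only meant to illustrate the conditions described above, not the exact construction used in our scripts.

# Hypothetical sketch of assembling demonstrations for the similar-example conditions.
import random

def build_prompt(random_demos, test_input, similar_example=None, flip_similar_label=False):
    """Concatenate input-label pairs into an ICL prompt.

    random_demos: list of (text, label) pairs sampled at random.
    similar_example: optional (text, label) pair that is similar to the test input.
    flip_similar_label: if True, replace the similar example's label with a wrong one,
        mimicking the incorrect-label setting described above.
    """
    examples = list(random_demos)
    if similar_example is not None:
        text, label = similar_example
        if flip_similar_label:
            label = "negative" if label == "positive" else "positive"  # toy binary labels
        examples.append((text, label))
    random.shuffle(examples)
    lines = [f"Review: {t} Sentiment: {l}" for t, l in examples]
    lines.append(f"Review: {test_input} Sentiment:")
    return "\n".join(lines)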
We used Python 3.9.18. The required dependencies can be installed with:
pip install -r requirements.txt
One can use PIR to quantify a model's ability to recognize tasks.
python task_recognition_pir.py \
--model_name={Name of the model to load} \
--label_index={Index of the label token} \
--plot_file={Name of the plot file}
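For example, a hypothetical invocation for GPT2-XL could look as follows (the flag values are placeholders, not prescribed settings):

python task_recognition_pir.py \
--model_name=gpt2-xl \
--label_index=-1 \
--plot_file=pir_gpt2_xl.png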
Through experiments with models of varying sizes, one can observe that in the setting where Similar(T) has an incorrect label, smaller models tend to replicate these incorrect labels, while larger models are more inclined to leverage their pre-trained knowledge when making predictions.
python first_quadrant.py \
--model_name={Name of the model to load} \
--dataset_path={Path to the dataset directory} \
--num_samples={Number of test samples for experimentation}
One can observe that increasing the number of in-context examples does not lead to significant changes in the performance of ICL.
python second_quadrant.py \
--model_name={Name of the model to load} \
--dataset_path={Path to the dataset directory} \
--shot_number={The number of in-context examples} \
--num_samples={Number of test samples for experimentation}
One can observe that ICL fails to work in this quadrant: models fail to leverage the demonstrations and tend to predict the label of the first example.
python third_quadrant.py \
--model_name={Name of the model to load} \
--dataset_path={Path to the dataset directory} \
--num_samples={Number of test samples for experimentation}
One can observe that, as model size increases, models demonstrate superior capabilities in recognizing similar examples.
python fourth_quadrant.py \
--model_name={Name of the model to load} \
--dataset_path={Path to the dataset directory} \
--num_samples={Number of test samples for experimentation}
@inproceedings{zhao2024coordinate,
title={Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism},
author={Anhao Zhao and Fanghua Ye and Jinlan Fu and Xiaoyu Shen},
booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2024},
}
If you have any questions, feel free to raise an issue or contact us at [email protected].