Enhancing LLM’s Cognition via Structurization

Training the StruXGPT-v1/v2 Model

You can use the following scripts to train the StruXGPT-v1/v2 model for structurization, or just use our pre-trained weights for applications.

  • First, download the preprocessed data source (curated from CAMEL-AI and Wiki, generated by Llama3-70B-Instruct) from this link. Unzip the data to data/tune/StruXGPT, so that the directory looks like:
|— data
   |— tune
      |— StruXGPT
          |— struxgpt_v1_train_raw.jsonl
          |— struxgpt_v1_val_raw.jsonl
          |— struxgpt_v2_train_raw.jsonl
          |— struxgpt_v2_val_raw.jsonl
  • Second, run the script to generate the train/val json file in LLaMA-Factory's format:
cd path/to/StruXGPT
python examples/StruXGPT/prepare_trainval.py --version v1  # or --version v2
  • Finally, go to the LLaMA-Factory third-party directory and run the training. You can adjust the hyper-parameters in train_struxgpt.sh according to your own settings (e.g., the STRUXGPT version, MODEL_NAME_OR_PATH, MODEL_TYPE, etc.). We assume that you are familiar with LLaMA-Factory's architecture.
cd third_party/LLaMA-Factory
bash ./bash/train_struxgpt.sh
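If you want to quickly try a trained checkpoint outside LLaMA-Factory, a minimal sketch with plain transformers could look like the following (the checkpoint path and prompt template are placeholders, not the repo's actual inference code):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: point this at your own training output directory.
model_dir = "path/to/struxgpt_checkpoint"
tok = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype=torch.float16, device_map="auto")

# The prompt template used during training may differ; adjust accordingly.
prompt = "Please structurize the following passage:\n<your passage here>"
inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))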

After training, you will get the StruXGPT model specialized for structurization. Here is an example showing the difference between StruXGPT-v1 and StruXGPT-v2.

Structurization Example (click to expand)

Here is the original lengthy and complicated context:

In this section, we describe the feature extraction process and network architecture in detail. We use spectral features of 256 dimensions computed using 512 point FFT for every frame, and we add an energy feature for every frame giving us total 257 features for every frame. We use a window size of 25ms and frame shift of 10ms during feature computation. We crop random 5sec audio data from each utterance during training which results in a spectrogram of size 257x500 (features x number of features). We use these spectrograms as input to our CNN model during training. During testing, we compute the prediction score irrespective of the audio length.
For the network architecture, we use ResNet-34 architecture, as described in [9]. The model uses convolution layers with Relu activations to map the spectrogram of size 257x500 input into 3D feature map of size 1x32x512. This feature cube is converted into 2D feature map of dimension 32x512 and fed into Ghost-VLAD/NetVLAD layer to generate a representation that has more language discrimination capacity. We use Adam optimizer with an initial learning rate of 0.01 and a final learning rate of 0.00001 for training. Each model is trained for 15 epochs with early stopping criteria.
For the baseline, we train an i-vector model using GMM-UBM. We fit a small classifier on top of the generated i-vectors to measure the accuracy. This model is referred as i-vector+svm . To compare our model with the previous state of the art system, we set up the x-vector language identification system [8]. The x-vector model used time-delay neural networks (TDNN) along with statistic-pooling. We use 7 layer TDNN architecture similar to [8] for training. We refer to this model as tdnn+stat-pool . Finally, we set up a Deep LSTM based language identification system similar to [4] but with little modification where we add statistics pooling for the last layers hidden activities before classification. We use 3 layer Bi-LSTM with 256 hidden units at each layer. We refer to this model as LSTM+stat-pool. We train our i-vector+svm and TDNN+stat-pool using Kaldi toolkit. We train our NetVLAD and GhostVLAD experiments using Keras by modifying the code given by [9] for language identification. We train the LSTM+stat-pool and the remaining experiments using Pytorch [14] toolkit, and we will opensource all the codes and data soon.
RESULTS
In this section, we compare the performance of our system with the recent state of the art language identification approaches. We also compare different pooling strategies and finally, compare the robustness of our system to the length of the input spectrogram during training. We visualize the embeddings generated by the GhostVLAD method and conclude that the GhostVLAD embeddings shows very good feature discrimination capabilities.
RESULTS ::: Comparison with different approaches
We compare our system performance with the previous state of the art language identification approaches, as shown in Table 2. The i-vector+svm system is trained using GMM-UBM models to generate i-vectors as proposed in [1]. Once the i-vectors are extracted, we fit SVM classifier to classify the audio. The TDNN+stat-pool system is trained with a statistics pooling layer and is called the x-vector system as proposed by David Snyder et al. [11] and is currently the state of the art language identification approach as far as our knowledge. Our methods outperform the state of the art x-vector system by absolute 1.88% improvement in F1-score, as shown in Table 2.
RESULTS ::: Comparison with different pooling techniques
We compare our approach with different pooling strategies in Table 3. We use ResNet as our base feature extraction network. We keep the base network the same and change only the pooling layers to see which pooling approach performs better for language identification task. Our experiments show that GhostVLAD pooling outperforms all the other pooling methods by achieving 98.43% F1-Score.
RESULTS ::: Duration analysis
To observe the performance of our method with different input durations, we conducted an experiment where we train our model on different input durations. Since our model uses ResNet as the base feature extractor, we need to feed fixed-length spectrogram. We conducted 4 different experiments where we trained the model using 2sec, 3sec, 4sec and 5sec spectrograms containing 200,300,400 and 500 frames respectively. We observed that the model trained with a 5sec spectrogram is the best model, as shown in Table 4.
RESULTS ::: Visualization of embeddings

After structurization by StruXGPT-v1, the semantic hierarchy is clearly highlighted:

## Statement's scope:
The feature extraction process and network architecture for a language identification system.

## Statement's main aspects and corresponding descriptions:
1. Feature extraction process
    1.1. Spectral features are computed using 512 point FFT for every frame, resulting in 256 dimensions.
    1.2. An energy feature is added to each frame, totalling 257 features per frame.
    1.3. A window size of 25ms and frame shift of 10ms are used during feature computation.
    1.4. Random 5sec audio data from each utterance is cropped to create spectrograms of size 257x500 (features x number of features).
    1.5. These spectrograms are used as input to the CNN model during training.
    1.6. During testing, prediction scores are computed regardless of audio length.
2. Network architecture
    2.1. The model uses ResNet-34 architecture, as described in [9].
    2.2. Convolution layers with Relu activations map the spectrogram into a 3D feature map of size 1x32x512.
    2.3. This feature cube is converted into a 2D feature map of dimension 32x512 and fed into Ghost-VLAD/NetVLAD layer.
    2.4. Adam optimizer with initial learning rate 0.01 and final learning rate 0.00001 is used for training.
    2.5. Each model is trained for 15 epochs with early stopping criteria.
3. Baseline and comparison systems
    3.1. An i-vector model trained with GMM-UBM is used as a baseline.
    3.2. An i-vector+svm model is trained on top of the generated i-vectors for comparison.
    3.3. A x-vector language identification system using TDNN+stat-pool is set up as a comparison.
    3.4. A Deep LSTM-based system (LSTM+stat-pool) is also compared, with modifications including statistics pooling.
4. Training and evaluation
    4.1. All experiments are conducted using Kaldi toolkit for i-vector+svm and LSTM+stat-pool.
    4.2. NetVLAD and GhostVLAD experiments are trained using Keras code from [9] for language identification.
    4.3. Pytorch is used for training the LSTM+stat-pool and remaining experiments.
5. Results and analysis
    5.1. Performance comparison with state-of-the-art systems: Our system outperforms the x-vector system by 1.88% in F1-score.
    5.2. Comparison with different pooling techniques: GhostVLAD pooling achieves the best F1-Score of 98.43%.
    5.3. Duration analysis: A 5sec spectrogram yields the best model performance.
    5.4. Visualization of embeddings: GhostVLAD embeddings show good feature discrimination capabilities.

However, the result from StruXGPT-v1 has three main issues:

  1. it is roughly the same length as the input context, which introduces considerable computational overhead;
  2. the LLM's output inevitably contains some paraphrases, which may harm performance when answers must be taken strictly from the original text;
  3. the semantic hierarchy is limited to a three-layer structure, which may be unable to describe more complicated structures.

Therefore, we develop StruXGPT-v2 to extract a more cost-friendly, faithful, and flexible semantic hierarchy:

Scope: Description of feature extraction process, network architecture, and evaluation
  in language identification
Aspects:
- AspectName: Feature extraction process
  SentenceRange:
    start: 1
    end: 4
  SubAspects:
  - AspectName: Spectral features computation
    SentenceRange:
      start: 2
      end: 3
- AspectName: Network architecture
  SentenceRange:
    start: 5
    end: 9
  SubAspects:
  - AspectName: ResNet-34 architecture
    SentenceRange:
      start: 7
      end: 8
- AspectName: Training details
  SentenceRange:
    start: 10
    end: 12
- AspectName: Baseline models
  SentenceRange:
    start: 13
    end: 16
  SubAspects:
  - AspectName: i-vector+svm
    SentenceRange:
      start: 13
      end: 14
  - AspectName: TDNN+stat-pool
    SentenceRange:
      start: 15
      end: 16
- AspectName: Comparison with state-of-the-art systems
  SentenceRange:
    start: 17
    end: 33
  SubAspects:
  - AspectName: x-vector system
    SentenceRange:
      start: 17
      end: 20
  - AspectName: LSTM+stat-pool
    SentenceRange:
      start: 21
      end: 23
  - AspectName: Performance comparison
    SentenceRange:
      start: 24
      end: 33
- AspectName: Pooling strategies comparison
  SentenceRange:
    start: 34
    end: 38
- AspectName: Duration analysis
  SentenceRange:
    start: 39
    end: 43
- AspectName: Visualization of embeddings
  SentenceRange:
    start: 44
    end: 44

By splitting the raw context into enumerated sentences and identifying its structure with an arbitrary aspect hierarchy, StruXGPT-v2 can re-organize the raw context into structurized results that highlight the abstractive conclusions or important information while preserving the original text exactly.
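As a concrete sketch of how such an output could be consumed (illustrative only, not the repository's actual post-processing; the sentence splitter and field names mirror the YAML above but are assumptions):

import re

def split_sentences(text):
    # Naive sentence enumeration; StruXGPT's own splitter may differ.
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s.strip()]

def render(aspects, sentences, depth=0):
    # Recursively emit each aspect name followed by its verbatim sentence span.
    lines = []
    for asp in aspects:
        start, end = asp["SentenceRange"]["start"], asp["SentenceRange"]["end"]
        lines.append("  " * depth + f"- {asp['AspectName']}:")
        lines.append("  " * (depth + 1) + " ".join(sentences[start - 1:end]))
        lines.extend(render(asp.get("SubAspects", []), sentences, depth + 1))
    return lines

if __name__ == "__main__":
    raw = "We extract features. We train a CNN. We evaluate on three benchmarks."
    hierarchy = [
        {"AspectName": "Feature extraction", "SentenceRange": {"start": 1, "end": 1}},
        {"AspectName": "Model and evaluation", "SentenceRange": {"start": 2, "end": 3}},
    ]
    print("\n".join(render(hierarchy, split_sentences(raw))))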

Evaluation on Downstream Applications

In this section, we provide the scripts to evaluate the LLM's cognition ability on several downstream tasks across various benchmarks. Make sure you are in the root path of this repo (cd /path/to/StruXGPT), and run the following commands.

During data preparation, you are free to switch the STRUXGPT version (v1 or v2) to obtain different structurization results for further investigation (e.g., export STRUXGPT=v1). The default is v1.
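For example, to prepare a benchmark with v2 structurization instead of the default v1 (the preparation commands themselves are shown in the sections below):

export STRUXGPT=v2
python examples/StruXGPT/preprocess.py --benchmark LongBench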

1. Reading Comprehension —— LongBench

We take several reading comprehension tasks from LongBench for evaluation:

  • SingleDoc QA: Qasper, MultiFieldQA-en
  • MultiDoc QA: HotpotQA, 2WikiMQA, Musique
  • Synthetic Tasks: PassageCount, PassageRetrieval-en

First, you can download the data folder from this link and unzip it to third_party/LongBench/data. Then, use the following commands to evaluate LLMs on LongBench.

Preparation

python examples/StruXGPT/preprocess.py --benchmark LongBench

Evaluation

python examples/StruXGPT/evaluate.py --benchmark LongBench --model llama2-7b-chat-4k

2. Hallucination Evaluation —— AttrScore

We take the attreval_gensearch subset from AttrScore for evaluation.

Download the dataset and put it into third_party/AttrScore/AttrScore. Then run the code to get the results.

Preparation

python examples/StruXGPT/preprocess.py --benchmark AttrScore

Evaluation

python examples/StruXGPT/evaluate.py --benchmark AttrScore --model llama2_7b

Optional: set the --use-cot flag to integrate CoT into hallucination evaluation.
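For example (assuming the flag is passed to the evaluation script above):

python examples/StruXGPT/evaluate.py --benchmark AttrScore --model llama2_7b --use-cot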

3. Hallucination Evaluation —— FActScore

We take all three subsets (data generated by InstructGPT/ChatGPT/PerplexityAI and annotated by humans) from FActScore for evaluation.

Download the dataset and put it into third_party/FActScore/data. Then run the code to get the results.

Preparation

python examples/StruXGPT/preprocess.py --benchmark FactScore

Evaluation

python examples/StruXGPT/evaluate.py --benchmark FactScore --model llama2_7b

4. Passage Retrieval —— BEIR

We take subsets from the BEIR benchmark to evaluate the performance on dense passage retrieval.

Download the corpus and queries from HuggingFace, and put them into third_party/tevatron/datasets/*
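A minimal sketch for fetching such data from the HuggingFace Hub (the repo ids and local layout are assumptions; check scripts/eval_beir.sh for the exact subsets and paths it expects):

from huggingface_hub import snapshot_download

# Illustrative only: adjust the BEIR subsets and target directories to
# whatever scripts/eval_beir.sh actually expects.
for repo_id in ["BeIR/scifact", "BeIR/nfcorpus"]:
    local_dir = f"third_party/tevatron/datasets/{repo_id.split('/')[-1]}"
    snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=local_dir)
    print(f"downloaded {repo_id} -> {local_dir}")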

We use tevatron as the evaluation toolkit.

Preparation

python examples/StruXGPT/preprocess.py --benchmark BEIR

Evaluation

cd third_party/tevatron 
bash scripts/eval_beir.sh MODEL_NAME_OR_PATH