found a bug when the input text's length exceeds max_seq_length #10

flyangovoyang opened this issue Jul 2, 2020 · 2 comments


flyangovoyang commented Jul 2, 2020

Hello, it's me again (nice work deserves a close read).

I have reorganized yesterday's report.

When the input text's length exceeds max_seq_length, the targets in the cut-off part are dropped, because all_doc_tokens is truncated according to max_seq_length; however, their start/end indexes and polarities are still kept in tok_start_positions/tok_end_positions and the polarity labels, respectively (a small sketch of the mismatch follows the excerpt below).

SpanABSA/absa/utils.py

Lines 101 to 115 in 66369af

tok_start_positions = []
tok_end_positions = []
for start_position, end_position in \
        zip(example.start_positions, example.end_positions):
    tok_start_position = orig_to_tok_index[start_position]
    if end_position < len(example.sent_tokens) - 1:
        tok_end_position = orig_to_tok_index[end_position + 1] - 1
    else:
        tok_end_position = len(all_doc_tokens) - 1
    tok_start_positions.append(tok_start_position)
    tok_end_positions.append(tok_end_position)
# Account for [CLS] and [SEP] with "- 2"
if len(all_doc_tokens) > max_seq_length - 2:
    all_doc_tokens = all_doc_tokens[0:(max_seq_length - 2)]
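To make the mismatch concrete, here is a minimal, self-contained sketch of that truncation step (the variable names mirror the ones above, but the sizes and positions are made up for illustration):

max_seq_length = 8  # deliberately tiny for the example

# Suppose the sentence yields 12 sub-word tokens and the last aspect term
# covers tokens 9..10, i.e. it lies beyond the truncation point.
all_doc_tokens = ["tok%d" % i for i in range(12)]
tok_start_positions = [1, 9]   # never truncated
tok_end_positions = [2, 10]    # never truncated

# The truncation in utils.py only shortens all_doc_tokens.
if len(all_doc_tokens) > max_seq_length - 2:
    all_doc_tokens = all_doc_tokens[0:(max_seq_length - 2)]

print(len(all_doc_tokens))    # 6 -- tokens 6..11 are gone
print(tok_end_positions[-1])  # 10 -- still points past the truncated text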

Later, tokens is initialized from all_doc_tokens, with [CLS] and [SEP] added.

SpanABSA/absa/utils.py

Lines 117 to 128 in 66369af

tokens = []
token_to_orig_map = {}
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for index, token in enumerate(all_doc_tokens):
    token_to_orig_map[len(tokens)] = tok_to_orig_index[index]
    tokens.append(token)
    segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)

Then tokens is converted to input_ids, which is padded to max_seq_length.

SpanABSA/absa/utils.py

Lines 130 to 136 in 66369af

input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

At this point, the code initializes start_positions and end_positions with the length of input_ids, which has been truncated, but the for-loop iterates over tok_start_positions/tok_end_positions and example.polarities, which have not been truncated. This may lead to an index-out-of-bounds exception (a possible guard is sketched after the excerpt below).

SpanABSA/absa/utils.py

Lines 142 to 163 in 66369af

# For distant supervision, we annotate the positions of all answer spans
start_positions = [0] * len(input_ids)
end_positions = [0] * len(input_ids)
bio_labels = [0] * len(input_ids)
polarity_positions = [0] * len(input_ids)
start_indexes, end_indexes = [], []
for tok_start_position, tok_end_position, polarity in zip(tok_start_positions, tok_end_positions, example.polarities):
    if (tok_start_position >= 0 and tok_end_position <= (max_seq_length - 1)):
        start_position = tok_start_position + 1  # [CLS]
        end_position = tok_end_position + 1  # [CLS]
        start_positions[start_position] = 1
        end_positions[end_position] = 1
        start_indexes.append(start_position)
        end_indexes.append(end_position)
        term_length = tok_end_position - tok_start_position + 1
        max_term_length = term_length if term_length > max_term_length else max_term_length
        bio_labels[start_position] = 1  # 'B'
        if start_position < end_position:
            for idx in range(start_position + 1, end_position + 1):
                bio_labels[idx] = 2  # 'I'
        for idx in range(start_position, end_position + 1):
            polarity_positions[idx] = label_to_id[polarity]
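One possible guard (just a sketch of the idea, not necessarily how the authors intend to fix it) would be to skip any span that ends at or beyond the truncated all_doc_tokens, since those positions no longer exist in tokens/input_ids. A simplified, hypothetical stand-in for the loop above, assuming all_doc_tokens still refers to the truncated list at this point in the function:

def annotate_spans(tok_start_positions, tok_end_positions,
                   all_doc_tokens, max_seq_length):
    # Simplified stand-in for the annotation loop: spans that were cut off
    # by the truncation of all_doc_tokens are skipped instead of indexed.
    start_positions = [0] * max_seq_length
    end_positions = [0] * max_seq_length
    for tok_start, tok_end in zip(tok_start_positions, tok_end_positions):
        # all_doc_tokens was already shortened to max_seq_length - 2 entries,
        # so any span ending at or past len(all_doc_tokens) was truncated away.
        if tok_start < 0 or tok_end >= len(all_doc_tokens):
            continue
        start_positions[tok_start + 1] = 1  # +1 for the leading [CLS]
        end_positions[tok_end + 1] = 1
    return start_positions, end_positions

The same guard would also have to cover the bio_labels and polarity_positions updates.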

huminghao16 (Owner) commented

Thank you for your interest in our work.
This is indeed a potential bug. Since we set max_seq_length to be larger than the maximum sentence length in the dataset, the bug does not occur in the ABSA task.
However, once a sentence's length exceeds max_seq_length, it may lead to an index-out-of-bounds exception.
We will fix this bug as soon as possible.


bilalghanem commented Oct 29, 2021

A solution for this:

In SpanABSA/absa/utils.py, in the function wrapped_get_final_text, add

if end_index not in feature.token_to_orig_map:
    feature.token_to_orig_map[end_index] = feature.token_to_orig_map[end_index - 1] + 1

before the line

orig_doc_end = feature.token_to_orig_map[end_index]
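If I read the patch correctly (the surrounding function body is not quoted here, so this is an assumption), it extrapolates the missing entry from the last token position that survived truncation. A toy illustration with a hypothetical mapping:

# Suppose token positions 1..3 map to original words 0..2 and the predicted
# end_index == 4 was cut off, so it is missing from the map.
token_to_orig_map = {1: 0, 2: 1, 3: 2}
end_index = 4

if end_index not in token_to_orig_map:
    # Works as long as end_index - 1 is still present in the map.
    token_to_orig_map[end_index] = token_to_orig_map[end_index - 1] + 1

print(token_to_orig_map[end_index])  # 3 -- an approximate original-word index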
