found a bug when the input text's length exceeds max_seq_length #10

flyangovoyang opened this issue Jul 2, 2020 · 2 comments


flyangovoyang commented Jul 2, 2020

Hello, it's me again (nice work deserves a close read).

I have reorganized yesterday's report.

When the input text's length exceeds max_seq_length, the targets in the cut-off part are dropped, because all_doc_tokens is truncated according to max_seq_length; however, their start/end indexes and polarities are still kept in tok_start_positions/tok_end_positions and the polarity labels, respectively (a small sketch of the mismatch follows the excerpt below).

SpanABSA/absa/utils.py

Lines 101 to 115 in 66369af

tok_start_positions = []
tok_end_positions = []
for start_position, end_position in \
        zip(example.start_positions, example.end_positions):
    tok_start_position = orig_to_tok_index[start_position]
    if end_position < len(example.sent_tokens) - 1:
        tok_end_position = orig_to_tok_index[end_position + 1] - 1
    else:
        tok_end_position = len(all_doc_tokens) - 1
    tok_start_positions.append(tok_start_position)
    tok_end_positions.append(tok_end_position)
# Account for [CLS] and [SEP] with "- 2"
if len(all_doc_tokens) > max_seq_length - 2:
    all_doc_tokens = all_doc_tokens[0:(max_seq_length - 2)]
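To make the mismatch concrete, here is a minimal, self-contained sketch of that truncation step (the variable names mirror the ones above, but the sizes and positions are made up for illustration):

max_seq_length = 8  # deliberately tiny for the example

# Suppose the sentence yields 12 sub-word tokens and the last aspect term
# covers tokens 9..10, i.e. it lies beyond the truncation point.
all_doc_tokens = ["tok%d" % i for i in range(12)]
tok_start_positions = [1, 9]   # never truncated
tok_end_positions = [2, 10]    # never truncated

# The truncation in utils.py only shortens all_doc_tokens.
if len(all_doc_tokens) > max_seq_length - 2:
    all_doc_tokens = all_doc_tokens[0:(max_seq_length - 2)]

print(len(all_doc_tokens))    # 6 -- tokens 6..11 are gone
print(tok_end_positions[-1])  # 10 -- still points past the truncated text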

Later, tokens is initialized from all_doc_tokens, with [CLS] and [SEP] added.

SpanABSA/absa/utils.py

Lines 117 to 128 in 66369af

tokens = []
token_to_orig_map = {}
segment_ids = []
tokens.append("[CLS]")
segment_ids.append(0)
for index, token in enumerate(all_doc_tokens):
    token_to_orig_map[len(tokens)] = tok_to_orig_index[index]
    tokens.append(token)
    segment_ids.append(0)
tokens.append("[SEP]")
segment_ids.append(0)

Then tokens is converted to input_ids, which is padded to max_seq_length.

SpanABSA/absa/utils.py

Lines 130 to 136 in 66369af

input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

At this point, the code initializes start_positions and end_positions with the length of input_ids, which has been truncated, but the for-loop iterates over tok_start_positions/tok_end_positions and example.polarities, which have not been truncated. This may lead to an index-out-of-bounds exception (a possible guard is sketched after the excerpt below).

SpanABSA/absa/utils.py

Lines 142 to 163 in 66369af

# For distant supervision, we annotate the positions of all answer spans
start_positions = [0] * len(input_ids)
end_positions = [0] * len(input_ids)
bio_labels = [0] * len(input_ids)
polarity_positions = [0] * len(input_ids)
start_indexes, end_indexes = [], []
for tok_start_position, tok_end_position, polarity in zip(tok_start_positions, tok_end_positions, example.polarities):
    if (tok_start_position >= 0 and tok_end_position <= (max_seq_length - 1)):
        start_position = tok_start_position + 1  # [CLS]
        end_position = tok_end_position + 1  # [CLS]
        start_positions[start_position] = 1
        end_positions[end_position] = 1
        start_indexes.append(start_position)
        end_indexes.append(end_position)
        term_length = tok_end_position - tok_start_position + 1
        max_term_length = term_length if term_length > max_term_length else max_term_length
        bio_labels[start_position] = 1  # 'B'
        if start_position < end_position:
            for idx in range(start_position + 1, end_position + 1):
                bio_labels[idx] = 2  # 'I'
        for idx in range(start_position, end_position + 1):
            polarity_positions[idx] = label_to_id[polarity]
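One possible guard (just a sketch of the idea, not necessarily how the authors intend to fix it) would be to skip any span that ends at or beyond the truncated all_doc_tokens, since those positions no longer exist in tokens/input_ids. A simplified, hypothetical stand-in for the loop above, assuming all_doc_tokens still refers to the truncated list at this point in the function:

def annotate_spans(tok_start_positions, tok_end_positions,
                   all_doc_tokens, max_seq_length):
    # Simplified stand-in for the annotation loop: spans that were cut off
    # by the truncation of all_doc_tokens are skipped instead of indexed.
    start_positions = [0] * max_seq_length
    end_positions = [0] * max_seq_length
    for tok_start, tok_end in zip(tok_start_positions, tok_end_positions):
        # all_doc_tokens was already shortened to max_seq_length - 2 entries,
        # so any span ending at or past len(all_doc_tokens) was truncated away.
        if tok_start < 0 or tok_end >= len(all_doc_tokens):
            continue
        start_positions[tok_start + 1] = 1  # +1 for the leading [CLS]
        end_positions[tok_end + 1] = 1
    return start_positions, end_positions

The same guard would also have to cover the bio_labels and polarity_positions updates.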

huminghao16 (Owner) commented

Thank you for your interest in our work.
This is indeed a potential bug. Since we set max_seq_length to be larger than the maximum sentence length in the dataset, the bug does not occur in the ABSA task.
However, once a sentence's length exceeds max_seq_length, it may lead to an index-out-of-bounds exception.
We will fix this bug as soon as possible.


bilalghanem commented Oct 29, 2021

A solution for this:

In SpanABSA/absa/utils.py, in the function wrapped_get_final_text, add

if end_index not in feature.token_to_orig_map:
    feature.token_to_orig_map[end_index] = feature.token_to_orig_map[end_index - 1] + 1

before the line

orig_doc_end = feature.token_to_orig_map[end_index]
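If I read the patch correctly (the surrounding function body is not quoted here, so this is an assumption), it extrapolates the missing entry from the last token position that survived truncation. A toy illustration with a hypothetical mapping:

# Suppose token positions 1..3 map to original words 0..2 and the predicted
# end_index == 4 was cut off, so it is missing from the map.
token_to_orig_map = {1: 0, 2: 1, 3: 2}
end_index = 4

if end_index not in token_to_orig_map:
    # Works as long as end_index - 1 is still present in the map.
    token_to_orig_map[end_index] = token_to_orig_map[end_index - 1] + 1

print(token_to_orig_map[end_index])  # 3 -- an approximate original-word index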
