This repository has been archived by the owner on Jan 15, 2024. It is now read-only.
Add assert for doc_stride, max_seq_length and max_query_length #1587
Description
This change adds an assert on the relation between `doc_stride`, `max_seq_length`, and `max_query_length`:
`args.doc_stride <= args.max_seq_length - args.max_query_length - 3`
Setting these values incautiously can cause data loss when chunking input features and, ultimately, significantly lower accuracy.
Example
Without the assert, setting `max_seq_length` to e.g. 128 while keeping the default `doc_stride` of 128 produces the following for the input feature with `qas_id == "572fe53104bcaa1900d76e6b"` when running
`bash ~/gluon-nlp/scripts/question_answering/commands/run_squad2_uncased_bert_base.sh`
As the screenshot shows, some of the `context_tokens_ids` (in the red rectangle) are lost: they are not included in any of the `ChunkFeatures` because `doc_stride` is too high relative to `max_seq_length`, and the user is not notified, not even with a simple warning. This can lead to a significant accuracy drop, since this kind of data loss happens for every input feature that does not fit entirely into a single chunk.
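To make the failure mode concrete, here is a minimal, hypothetical sketch of sliding-window chunking (not the actual gluon-nlp implementation; function names are mine). It shows that when `doc_stride` exceeds `max_seq_length - max_query_length - 3`, some context tokens fall between consecutive windows and never land in any chunk:

```python
# Hypothetical sketch of sliding-window chunking, illustrating how an
# overly large doc_stride drops context tokens. Not the gluon-nlp code.

def chunk_starts(n_context, max_seq_length, max_query_length, doc_stride):
    """Return the start offsets of each context window and the window size.

    Each chunk holds at most max_seq_length - max_query_length - 3 context
    tokens (3 slots are reserved for [CLS] and two [SEP] special tokens).
    """
    window = max_seq_length - max_query_length - 3
    starts, offset = [], 0
    while True:
        starts.append(offset)
        if offset + window >= n_context:
            break
        offset += doc_stride
    return starts, window

def covered_tokens(n_context, max_seq_length, max_query_length, doc_stride):
    """Set of context-token positions that appear in at least one chunk."""
    starts, window = chunk_starts(n_context, max_seq_length,
                                  max_query_length, doc_stride)
    covered = set()
    for s in starts:
        covered.update(range(s, min(s + window, n_context)))
    return covered

# Unsafe: doc_stride (128) > 128 - 64 - 3 = 61, so the tokens between
# consecutive windows are silently lost.
lost = set(range(200)) - covered_tokens(200, 128, 64, 128)
print(sorted(lost))  # tokens 61..127 and 189..199 are in no chunk

# Safe: doc_stride (32) <= 61, so every context token lands in some chunk.
print(sorted(set(range(200)) - covered_tokens(200, 128, 64, 32)))  # []
```

The simulated feature has 200 context tokens and assumes `max_query_length = 64` for illustration.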
This change introduces an assert that fires whenever data loss is possible, forcing the user to set proper/safe values for `doc_stride`, `max_seq_length`, and `max_query_length`.
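The guard can be sketched roughly as follows (a hedged illustration only; the function name and the exact error wording are mine, not necessarily those of the patch):

```python
# Illustrative sketch of the kind of guard this PR adds. The argument
# names mirror the script's CLI flags; the message text is hypothetical.
def validate_chunking_args(doc_stride, max_seq_length, max_query_length):
    bound = max_seq_length - max_query_length - 3
    assert doc_stride <= bound, (
        f"doc_stride ({doc_stride}) exceeds max_seq_length - "
        f"max_query_length - 3 ({bound}), so some context tokens would be "
        "dropped during chunking. Decrease doc_stride or increase "
        "max_seq_length."
    )

# Safe combination passes silently:
validate_chunking_args(doc_stride=32, max_seq_length=128, max_query_length=64)

# Unsafe combination raises AssertionError:
# validate_chunking_args(doc_stride=128, max_seq_length=128, max_query_length=64)
```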
Error message
The chunk from the example above with `doc_stride` reduced to 32:
As you can see, when the values of `doc_stride`, `max_seq_length`, and `max_query_length` satisfy the above inequality, no data is lost during chunking and the accuracy drop is avoided.
cc @dmlc/gluon-nlp-team