Skip to content
This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Add assert for doc_stride, max_seq_length and max_query_length #1587

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

bartekkuncer
Copy link

@bartekkuncer bartekkuncer commented Apr 7, 2022

Description

This change adds assert for doc_stride, max_seq_length and max_query_length relation (args.doc_stride <= args.max_seq_length - args.max_query_length - 3) as incautious setting of them can cause data loss when chunking input features and ultimately significantly lower accuracy.

Example

Without the assert when one sets max_seq_length to e.g. 128 and keeps default 128 value for doc_stride this happens for the input feature of qas_id == "572fe53104bcaa1900d76e6b" when running bash ~/gluon-nlp/scripts/question_answering/commands/run_squad2_uncased_bert_base.sh:
image

As you can see we are losing some of the context_tokens_ids (in red rectangle) as they are not included in any of the ChunkFeatures due to too high doc_stride in comparison to max_seq_length and user does not get notified even with a simple warning. This can lead to significant accuracy drop as this kind of data losses happen for all input features which do not fit entirely into single chunk.

This change introduces an assert popping when there is a possible data loss and forces the user to set proper/safe values for doc_stride, max_seq_length and max_query_length.

Error message

image

Chunk from example above with doc_stride reduced to 32

image

As you can see when values of doc_stride, max_seq_length and max_query_length satisfy abovementioned equation no data is lost during chunking and we avoid accuracy loss.

cc @dmlc/gluon-nlp-team

@bartekkuncer bartekkuncer requested a review from a team as a code owner April 7, 2022 17:07
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants