QuestionAnsweringInputBase is returning incorrect number of samples in batch #1166
Comments
Hey @mfojtak Thanks for reporting this! Great analysis of the issue; I think having an iterable input for question answering would be a great way forward. Is this something you'd like to help out with? There may be some hoops we need to jump through to get this to work with our API, but we could help there 😃
Hi @ethanwharris Thanks for the swift answer. I have already implemented the iterable version and tested it. There is also some refactoring of base classes required for this to work. It might be better to use Python generator syntax instead of the load_data and load_sample approach; it would be more intuitive and would make authoring new datamodules easier. Can I just create a pull request for this?
@mfojtak Yes, definitely happy to have a PR with it 😃 Note that we have recently changed our text tasks to apply the tokenization in the collate function rather than when loading samples.
@ethanwharris let me implement it in the latest and greatest API. For this, could you please give me some clarity on how the new API works? In my understanding, the Input is now responsible only for loading samples, with no feature transformations (e.g. tokenization). BUT - I noticed you created the TransformersCollate callable class. How does this class fit into the picture?
@mfojtak Sure 😃 The main thing we tried to resolve is that the model should be responsible for the tokenization. It used to be (a few versions ago now) that you had to provide the same backbone argument to the datamodule and the task. We later added a mechanism for the datamodule and task to share state, which allowed the model to set a tokenizer state that could then be used by Input.load_sample. The main issue with the state mechanism was that it connected too many things together and it was unclear how different objects modified it, so we've removed it.

We also have a way for models to set a collate function (this makes sense e.g. for object detection, where different models may need the data collated in different ways). So the QA task currently works by creating a collate (the TransformersCollate callable you asked about) that performs the tokenization.

The problem with our current approach is that you wouldn't be able to implement the iterable tokenization, which I think would be a great feature to have, but there is an approach I think we could take to enable it.
Hope that helps 😃 Sorry for the long message; let me know if you have any thoughts or comments.
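For illustration, a minimal sketch of a collate that owns the tokenization, roughly along the lines described above (the class name, backbone, and sample keys are assumptions, not the actual Flash API):

```python
from transformers import AutoTokenizer


class TokenizingCollate:
    """Hypothetical collate that owns the tokenizer, so the Input only yields raw samples."""

    def __init__(self, backbone: str = "distilbert-base-uncased"):
        self.tokenizer = AutoTokenizer.from_pretrained(backbone)

    def __call__(self, samples):
        # `samples` is assumed to be a list of dicts with "question" and "context" keys.
        return self.tokenizer(
            [sample["question"] for sample in samples],
            [sample["context"] for sample in samples],
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
```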
@ethanwharris Thanks for the clarification. I fully understand your points, and while implementing the iterable version I faced the exact issues you pointed out. I agree the "get_state" approach is complicated and self.trainer is better. The question is where the tokenization should happen. It looks like InputTransform is a good candidate for this component, but you did not mention it, and I can see in the code that it is not used too often. Is the Input/Output Transform API planned to be used in the future? Attaching the transform to the model has its benefits; the only question is about the Transform API - is it the right, future-proof approach?
I think it could make a lot of sense for the tokenization to live in the InputTransform. How would this work with the iterable QA input? Can the contexts, questions, etc. be split into fixed-size chunks without also tokenizing them, or would we need more of a special case there?
The situation is getting a little more complicated. However, it has uncovered more design problems, bugs, and confusion.

```python
from torch.utils.data import DataLoader

# No InputTransform or collate specified; not connected with a trainer or model.
datamodule = QuestionAnsweringData.from_squad_v2(
    train_file="/data/share/cuad/CUAD_v1/CUAD_v1.json",
    batch_size=2,
)

ds = datamodule.train_dataset  # (assumed) the raw dataset used below
dl = DataLoader(ds, batch_size=2)
for sample in dl:
    print(sample)  # crashes here because of default_collate
```

My plan now is to implement an InputTransform for QA and fix the SQuAD parsing bug and the feature extraction bug. However, it would be good to start decoupling concepts. Currently everything is stitched together in a not-so-transparent way, and it is not easy to understand the behavior of the API. Many things also happen implicitly, which hinders implementation and debugging.
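As a side note, a sketch of a workaround for the crash above (assuming the `ds` from the snippet): passing an identity `collate_fn` keeps torch's `default_collate` from trying to stack variable-length raw text samples:

```python
from torch.utils.data import DataLoader

# default_collate cannot stack variable-length raw text fields, so pass an
# identity collate_fn that simply returns the list of raw samples.
dl = DataLoader(ds, batch_size=2, collate_fn=lambda samples: samples)

for batch in dl:
    print(batch)  # a plain list of two raw samples per iteration
```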
The plan sounds good and I agree with the decoupling 😃 A couple of comments: the only difference currently when using the datamodule outside of a trainer etc. is that the model needs to override the collate in order to own the tokenization. One option there is that we could change the default transform for QA to just return the list of samples from the collate.
Hello all, the initial version before the data pipeline overhaul handled this differently. This has been changed, I guess, and I also overlooked it while going through the code. Going through the conversation above, I agree with the proposed direction for the tokenization.

In the previous version of the API, the entire dataset would be tokenized up front when the dataset was created.
Hey @karthikrangasai Yes, that's a good point. I had forgotten about it, but one option is that we could add a hook to the InputTransform that would allow for processing the whole dataset. This could be quite useful for static preprocessing like tokenization or e.g. creating spectrograms in audio tasks.
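A rough sketch of what such a hook could look like (the hook name `process_dataset` is an assumption, not an existing Flash API):

```python
class TokenizingInputTransform:
    """Sketch of an InputTransform with a dataset-level hook for static preprocessing."""

    def process_dataset(self, dataset):
        # Hypothetical hook: called once on the whole dataset before training,
        # e.g. to tokenize everything up front or precompute audio spectrograms.
        return [self.per_sample_transform(sample) for sample in dataset]

    def per_sample_transform(self, sample):
        # Placeholder per-sample processing; tokenization or feature extraction
        # would go here.
        return sample
```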
Should we also provide for the use case where users want to further transform the already processed dataset?
I think we could have the following API, with part of the processing applied in the per-sample transform and the rest in the collate. Let me know your thoughts 😃
Hi all, see my comments below.
Per batch, because in general a transform might do batch-level optimizations. In fact, the transformers tokenizer operates on batches.
The input transform cannot be turned off, even by setting it to None. Basically, in my opinion the API should follow the Lightning API in order to be simplified. The dataset should just yield samples and should have no knowledge of the model, loader, or trainer. Question answering is used as an example here, but all of the above applies in general to all tasks. This is more or less how it is designed in Lightning. Flash, in my opinion, should implement specific tasks only; there are parts where it feels like Flash implements Lightning inside Lightning. Please share your opinions.
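As a sketch of that principle (illustrative only, not Flash code): a dataset that just yields raw samples and knows nothing about the model, loader, or trainer:

```python
from torch.utils.data import Dataset


class RawQuestionAnsweringDataset(Dataset):
    """Yields raw question/context/answer samples; any tokenization happens elsewhere."""

    def __init__(self, examples):
        # `examples` is assumed to be a list of dicts parsed from the SQuAD JSON.
        self.examples = examples

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        example = self.examples[idx]
        return {
            "question": example["question"],
            "context": example["context"],
            "answer": example.get("answer"),
        }
```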
Yes, @mfojtak I generally agree with all of your points.
The above is all generally quite clean except for a couple of caveats; removing those two would simplify the transforms significantly.

Tokenization

As for where the tokenization should happen, I don't think any transforms should be attached to the model. The main reason for this is that intensive transforms (like tokenization) should be executed in parallel within the dataloader workers (that is, in `__getitem__` or injected into the collate function) for performance reasons.
Let's try to design the API here. One option is to configure the tokenization backbone on the datamodule; we used to have something like that but didn't like it, as you had to specify the backbone in two places. Alternatively, we could be more explicit and pass a fully constructed transform in. Personally, I think that allowing the transform to have a reference to the trainer (and thus the model) is fine, since the datamodule already gets this reference in PL; that would allow the transform to reuse the model's tokenizer, which is my personal preference.
Note that all of the above would allow the tokenization to be overridden manually by providing a custom input transform to the datamodule, and from there I propose we work out the concrete steps forward.
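As an aside, a rough sketch of the trainer-aware option mentioned above (hook and attribute names are assumptions, not the current Flash code): a transform that receives a trainer reference and reuses the model's tokenizer, so the backbone is only specified once:

```python
class TrainerAwareTokenizationTransform:
    """Sketch: the framework attaches a trainer reference, and the transform
    borrows the tokenizer that the model already owns."""

    def __init__(self):
        self.trainer = None  # assumed to be set by the framework during setup

    @property
    def tokenizer(self):
        # Assumes the task exposes its tokenizer as an attribute.
        return self.trainer.lightning_module.tokenizer

    def per_batch_transform(self, samples):
        questions = [sample["question"] for sample in samples]
        contexts = [sample["context"] for sample in samples]
        return self.tokenizer(
            questions,
            contexts,
            padding=True,
            truncation=True,
            return_tensors="pt",
        )
```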
How about this:

```python
tokenizer = ...

data_module = QuestionAnsweringData.from_*(
    transform=HuggingFaceQuestionAnsweringTokenizationTransform(tokenizer=tokenizer),  # no kwargs
    ...
)

model = QuestionAnsweringTask(data_module=..., backbone=...)
```

Be aware that in some (not uncommon) cases the tokenizer backbone might be different from the model's. E.g. you are adding extra tokens, which requires instantiating your own tokenizer, or you are using a different tokenization library.

Or:

```python
data_module = QuestionAnsweringData.from_*(
    ...
)

model = QuestionAnsweringTask(data_module=..., backbone=...)

# Inside QuestionAnsweringTask:
def __init__(self, data_module, backbone):
    if not data_module.transform and not self.input_transform:  # or other smart heuristics to adapt any input type
        tokenizer = Tokenizer(backbone)
        self.input_transform = HuggingFaceQuestionAnsweringTokenizationTransform(tokenizer=tokenizer)
```

Or:

```python
data_module = QuestionAnsweringData.from_*(
    ...
)

transform = HuggingFaceQuestionAnsweringTokenizationTransform(backbone=...)

model = QuestionAnsweringTask(data_module=..., backbone=..., input_transform=transform)
```

It could get even smarter: the model could propose transforms automatically based on the data module's output type information.
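A toy sketch of that last idea (all names are hypothetical, and the transform class is assumed from the snippets above): the task looks up a default transform based on the data module's declared output type and only builds one if nothing was supplied:

```python
# Hypothetical registry mapping a data module's output type to a default transform class.
DEFAULT_TRANSFORMS = {
    "raw_qa_text": HuggingFaceQuestionAnsweringTokenizationTransform,
}


def propose_input_transform(data_module, backbone):
    """Pick a transform for the task when the user did not provide one explicitly."""
    if getattr(data_module, "transform", None) is not None:
        return data_module.transform
    transform_cls = DEFAULT_TRANSFORMS.get(data_module.output_type)
    if transform_cls is None:
        raise ValueError(f"No default transform known for output type {data_module.output_type!r}")
    return transform_cls(backbone=backbone)
```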
I would prefer the API to look like:

```python
data_module = QuestionAnsweringData.from_*(
    transform_kwargs={"backbone": ...},
    train_transform=HuggingFaceQuestionAnsweringTokenizationTransform,
    ...
)

model = QuestionAnsweringTask(backbone=...)
```

or, for example with GloVe embeddings and an RNN backbone:

```python
data_module = QuestionAnsweringData.from_*(
    train_transform=GloveEmbeddingTransform,
    transform_kwargs={"name": "6B", "dim": 50},
    ...
)

rnn_model = RNN()
model = QuestionAnsweringTask(backbone=rnn_model)
```
Yes, @mfojtak definitely agree with your points there. Personally I would be very happy to see this API for transforms:

```python
data_module = DataModule.from_*(
    transform=MyCustomTransform(image_size=128, ...),
    ...
)
```

This would involve a slight change in how the transforms work underneath, and then removing some of the existing transform arguments. I would like to avoid passing a reference to the data module to the model, so with the above changes I think we could have something like this:

```python
data_module = QuestionAnsweringData.from_*(
    transform=HuggingFaceQuestionAnsweringTokenizationTransform(backbone=...),
    ...
)

model = QuestionAnsweringTask(backbone=...)
```

Then, as @karthikrangasai suggests, we would also be able to have e.g. GloVe or word2vec embeddings and things like that as optional alternatives to the HF tokenization. @mfojtak Are you on the PL slack? It might be easier for us to discuss there 😃
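For concreteness, a sketch of what such a `HuggingFaceQuestionAnsweringTokenizationTransform` could look like inside (a guess at one possible shape, not the actual Flash implementation; the `per_batch_transform` hook and argument defaults are assumptions):

```python
from transformers import AutoTokenizer


class HuggingFaceQuestionAnsweringTokenizationTransform:
    """Sketch: owns the tokenizer and turns a batch of raw QA samples into model inputs."""

    def __init__(self, backbone: str, max_length: int = 384, stride: int = 128):
        self.tokenizer = AutoTokenizer.from_pretrained(backbone)
        self.max_length = max_length
        self.stride = stride

    def per_batch_transform(self, samples):
        # Tokenize questions and contexts together; chunk handling for long
        # contexts would also live here.
        return self.tokenizer(
            [sample["question"] for sample in samples],
            [sample["context"] for sample in samples],
            truncation="only_second",
            max_length=self.max_length,
            stride=self.stride,
            padding="max_length",
            return_tensors="pt",
        )
```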
@ethanwharris perfect! I guess we agree :-)
Absolutely agree here too. My mistake - it is the trainer which links the data with the model.
🐛 Bug
In the case of a long QA context, the Hugging Face tokenizer divides the tokenized output into chunks, which is expected and correct.
But the load_sample function in QuestionAnsweringInputBase returns a collated sample, which results in arbitrarily sized batches that ignore the specified batch_size.
This may result in CUDA OOM and other problems.
One sample is created per chunk instead of one sample per SQuAD sample. It looks like the code tries to "utilize" all chunks even if they do not contain the answer, which might be useful, but in that case IterableInput should be used.
By default, only one sample per SQuAD sample should be returned, and impossible answers created by chunking should be ignored (as opposed to genuine SQuAD impossible answers).
To Reproduce
Steps to reproduce the behavior:
Code sample
Here 2 samples per batch are requested.
If the sample's context size is > 4096, then multiple chunks are returned.
E.g. if the first context size is 5000 and the second context size is 3000, then 3 samples will be yielded from QuestionAnsweringInputBase.
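For illustration, a minimal standalone snippet (model name and lengths are placeholders) showing how the Hugging Face tokenizer turns one long context into several chunks, which is where the extra samples come from:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

question = "What is mentioned at the very end of the document?"
context = "Some very long contract text. " * 2000  # much longer than max_length

encoding = tokenizer(
    question,
    context,
    truncation="only_second",          # only truncate the context, never the question
    max_length=384,
    stride=128,                        # overlap between consecutive chunks
    return_overflowing_tokens=True,    # yield every chunk instead of dropping the overflow
)

# One SQuAD sample produces several tokenized chunks here, which is why the
# batch ends up with more entries than the requested batch_size.
print(len(encoding["input_ids"]))
```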
Expected behavior
The correct number of samples per batch is returned, matching the requested batch_size.
Environment
Installed via (conda, pip, source): pip

Additional context
Possible solutions:
QuestionAnsweringInputBase should be based on IterableInput, as the number of samples is not known in advance, or a completely new iterable version should be implemented separately.
Or the "classic" Input remains, but one sample per SQuAD sample must be returned.