-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify language_modeling.py
and tokenization.py
#2703
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ZanSara
added
topic:DPR
topic:speed
topic:modeling
type:refactor
Not necessarily visible to the users
labels
Jun 22, 2022
…ystack into simplify-language-modeling
…ystack into simplify-language-modeling
…ystack into simplify-language-modeling
…ystack into simplify-language-modeling
…ystack into simplify-language-modeling
…2862) * Add language unit tests * Parametrize language unit tests * Simplify model_type handling * Remove model_type parameter from get_language_model * Use the actual dpr encoders rather than tiny-bert in test_retriever.py * Additional checks for pretrained_model_name_or_path parameter in get_language_model * Fix get_language_model docs * Small fix
…ystack into simplify-language-modeling
vblagoje
approved these changes
Jul 22, 2022
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved, CI green
Krak91
pushed a commit
to Krak91/haystack
that referenced
this pull request
Jul 26, 2022
* Simplification of language_model.py and tokenization.py to remove code duplication Co-authored-by: vblagoje <[email protected]>
Closed
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Related Issue(s): #2445, closes #2704
Proposed changes:
This PR is a near-full rewrite of
haystack/modeling/language_model.py
and a heavy rewrite ofhaystack/modeling/tokenization.py
. Although not strictly necessary for the multimodality support, this allows us to add new models with ease and without any code duplication. In addition it removes heavy and unnecessary use of**kwargs
across many function calls and clarifies which parameters are expected by which function.One core change is the removal of
LanguageModel.load()
. This factory method has been misused as a replacement for the__init__
function of model classes, making the whole initialization process quite confusing. To prevent similar issues, the factory method has been extracted from the class, namedget_language_model()
, and located at the bottom of the module.Another core change is the removal of copy-pasted model wrappers like
Bert
,Roberta
etc. These classes were mostly copies of each other and have been all replaced with only two classes calledHFLanguageModel
andHFLanguageModelWithPooler
. DPR model classes have been merged into a singleDPREncoder
class, but could be further simplified.In the tokenization part, all tokenizers have been replaced by
AutoTokenizer
(as suggested by @bogdankostic), which also simplified the codebase heavily. TheTokenizer.load()
factory method has also been removed and replace with theget_tokenizer()
factory method. Whether this method brings any value, now that the only tokenizer used isAutoTokenizer
, is open for discussion.Other classes in the
haystack/modeling
folder have been adapted to these changes. AlsoDPRetriever
andEmbeddingRetriever
have been slightly affected. Tests have also been corrected.TODOs:
I plan to try makeDPREncoder
a subclass ofHFLanguageModel
Pre-flight checklist
I have read the contributors guidelinesI have enabled actions on my fork