Terminology #448
Wouldn't it be much simpler to make the terminology map a member of the ResponseOptions class, and map it to a
I don't know if ResponseOptions is an immutable struct at this point. I don't think we have an async API on the Python side, so there's no risk of Python changing the terminology during a translation. But if you do want to implement that, it's something to take into account.
src/translator/service.cpp
Outdated
std::string srcword;
std::string replacementword;
getline(ss, srcword, '\t');
getline(ss, replacementword, '\n');
Make sure people don't use Windows line endings, hehe. Incidentally, if we could do this terminology map loading in Python you'd get their line-ending stripping code for free.
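For reference, a minimal sketch of how the C++ loader could tolerate CRLF files itself. The `TerminologyMap` alias and the `LoadTerminology` name are assumptions for illustration, not the PR's actual code: the idea is just to read whole lines and drop a trailing `'\r'` before splitting on the tab.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical alias; the PR's actual TerminologyMap type may differ.
using TerminologyMap = std::unordered_map<std::string, std::string>;

// Reads "source<TAB>replacement" pairs, one per line, tolerating CRLF endings.
// LoadTerminology is an illustrative name, not a function from the PR.
TerminologyMap LoadTerminology(std::istream &in) {
  TerminologyMap terminology;
  std::string line;
  while (std::getline(in, line)) {
    // std::getline only strips '\n'; drop the '\r' left behind by Windows files.
    if (!line.empty() && line.back() == '\r') line.pop_back();
    std::size_t tab = line.find('\t');
    // Skip blank lines and lines without a source term or tab separator.
    if (tab == std::string::npos || tab == 0) continue;
    terminology[line.substr(0, tab)] = line.substr(tab + 1);
  }
  return terminology;
}
```

Reading the full line first and splitting afterwards also means a missing tab cannot swallow the next line, which two back-to-back `getline` calls with different delimiters can do.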
src/translator/service.cpp
Outdated
getline(ss, replacementword, '\n');
// @TODO it seems like removing the tags forces the model to copy which is
// I guess just as good and more reliable. In that case we just don't tell the model
// what the original source is and it just has no choice BUT to generate the target.
Edit: ah you say it here explicitly. Copy is a copy with the assumption it won't try to translate because it doesn't know the translation. For Chinese <-> English I can imagine this working, but no way that English <-> French would accept something like that… right?
Typically the model just learns to copy when there's accidentally target text on the source side. I expect it would work just fine for English <-> French. It would potentially be a problem with multilingual models.
- Move it to its own file
- Prefer replacing longer sequences above shorter ones
- Don't replace terminology found in the just replaced bits
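The last two points can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `TerminologyMap` is assumed to be a string-to-string map, `ReplaceTerminologySketch` is a made-up name, and matching is plain substring matching with no word-boundary handling or terminology form applied.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical alias; the PR's actual TerminologyMap type may differ.
using TerminologyMap = std::unordered_map<std::string, std::string>;

std::string ReplaceTerminologySketch(std::string const &str, TerminologyMap const &terminology) {
  // Sort terms so longer source terms are tried first at every position.
  std::vector<std::pair<std::string, std::string>> terms(terminology.begin(), terminology.end());
  std::sort(terms.begin(), terms.end(),
            [](auto const &a, auto const &b) { return a.first.size() > b.first.size(); });

  std::string out;
  std::size_t pos = 0;
  while (pos < str.size()) {
    bool matched = false;
    for (auto const &[src, replacement] : terms) {
      if (!src.empty() && str.compare(pos, src.size(), src) == 0) {
        out += replacement;  // written to the output only, so it is never rescanned
        pos += src.size();
        matched = true;
        break;
      }
    }
    if (!matched) out += str[pos++];  // no term starts here; copy one character
  }
  return out;
}
```

Because replacements go into a separate output string and the scan position only moves forward over the input, a replacement that happens to contain another source term is left alone.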
Still on my wish list, but not a priority for a first release:
Before we push to main I'd like to fix some of the CI (building the wheel is broken at the moment, onnxgemm requires some updating, etc.).
Todo: instead of a |
Should we nest application-specific arguments, e.g.
bergamot-translator-app:
  terminology: terms.tsv
  terminology-form: "__source__ {src} __target__ {trg} __done__"
to avoid any future name conflicts with upstream Marian?
Don't know if Marian's yaml parser can handle that, but if we can do that, it might be a good future-proofing. Ideally there would be no collisions because this would be the marian implementation but maybe that's slightly unrealistic. |
I would guess the parser library supports it, but from a quick look it's not wired up on the Marian side. But ideally Marian wouldn't need to parse it, just ignore it since terminology (for now?) is just implemented in bergamot-translator. But on the ignoring unknown args:
(and similarly directly on the cli) |
Now GPU support is working. Reliably from the C++ test app, not so much from Python. The Python interface randomly deadlocks, and I have no idea why. I didn't have this problem before, so I blame the ensemble change, but I haven't bisected or anything. FML. The deadlocks happen even when compiling without CUDA. I guess I should be attaching GDB...
This is where we are at:
The code in question is here:
What's happening? For some reason a future just takes forever while no computation is happening... |
Reverting: |
How to compile the python interface with CUDA support:
Invoke as: bergamot-translator -c model.npz.best-bleu.npz.decoder.brg.yml -i dataset -n 0 -g 0 -l info |
std::string srcword;
std::string replacementword;
getline(ss, srcword, '\t');
getline(ss, replacementword, '\n');  // BEWARE of Windows line endings
This is why I would have liked it if we could have kept the "read the tsv file" part out of the bergamot C++ codebase, and just implement it in Python/Qt/JavaScript which are much better at integrating with the local environment.
… But treat this complaint so I'll stop complaining about this from now on!
Well, functionally, we can bypass the redundant terminology implementation, by applying terminology constraints directly. Maybe we can add this to the readme. I would like to keep a C++ implementation at the very least as documentation.
@@ -202,12 +283,14 @@ void AsyncService::pivot(std::shared_ptr<TranslationModel> first, std::shared_pt
void AsyncService::translate(std::shared_ptr<TranslationModel> translationModel, std::string &&source,
                             CallbackType callback, const ResponseOptions &responseOptions) {
  // Producer thread, a call to this function adds new work items. If batches are available, notifies workers waiting.
  // Tagging
  if (!terminologyMap_.empty()) source = ReplaceTerminology(std::move(source), terminologyMap_);
std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology);
I think I made the std::move pointless when I changed the ReplaceTerminology implementation.
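To spell out why the move is pointless: `std::move` only casts to an rvalue, and an rvalue bound to a `std::string const &` parameter is not moved from. A small sketch with two stand-in functions (both names are hypothetical, chosen only to contrast the parameter shapes):

```cpp
#include <string>
#include <utility>

// Stand-in with the same parameter shape as the signature quoted above.
// The argument is only read, so std::move at the call site changes nothing.
std::string TakesConstRef(std::string const &str) { return str + "!"; }

// Alternative shape: taking the string by value lets std::move hand over
// the caller's buffer instead of copying it.
std::string TakesByValue(std::string str) {
  str += "!";
  return str;
}
```

With the const-reference version, the caller's string is still intact after `TakesConstRef(std::move(source))`. With the by-value version the caller's string is left in a valid but unspecified state, so it should not be read again before being reassigned.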
app.add_option("--terminology-form", config.format,
               "Form for terminology. Default is \"%s __target__ %s __done__ \". Change depending on the model.");
Do we want to figure out how to store this in the model yaml instead before we merge?
Pedantically, we probably should.
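Wherever the form ends up being stored, applying it is a small substitution step. A minimal sketch, assuming the form contains two `%s` slots filled in order with the source word and its replacement; `ApplyTerminologyForm` is a hypothetical helper, and the actual code in the PR may format the string differently:

```cpp
#include <string>

// Hypothetical helper: fills the first two "%s" slots of a terminology form,
// in order, with the source word and the replacement word.
std::string ApplyTerminologyForm(std::string const &form, std::string const &src,
                                 std::string const &trg) {
  std::string const *args[] = {&src, &trg};
  std::size_t next = 0;
  std::string out;
  for (std::size_t i = 0; i < form.size(); ++i) {
    if (next < 2 && form.compare(i, 2, "%s") == 0) {
      out += *args[next++];
      ++i;  // skip the 's' of the placeholder
    } else {
      out += form[i];
    }
  }
  return out;
}
```

Doing the substitution by hand avoids passing a user-supplied string to a printf-style function, which would be unsafe if the form had more or different placeholders than expected.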
}  // namespace

std::string ReplaceTerminology(std::string const &str, TerminologyMap const &terminology) {
Once TerminologyMap becomes very large, this method is going to be stupidly slow. At that point we should change it into a state machine type of search thingy.
If the terminology map is very large, I feel something is going wrong.
State machine would always be better but also I feel this would be a time sink... ;D
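If it ever becomes worth the time, one cheap middle ground is a trie with longest-match lookup, sketched below. This is not a full Aho-Corasick automaton and not code from the PR; `TerminologyTrie` and its methods are hypothetical names. The cost per input position is bounded by the longest term rather than by the number of terms.

```cpp
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical byte-level trie; names are illustrative, not from the PR.
class TerminologyTrie {
 public:
  void Insert(std::string const &src, std::string const &replacement) {
    if (src.empty()) return;
    Node *node = &root_;
    for (char c : src) {
      auto &child = node->children[c];
      if (!child) child = std::make_unique<Node>();
      node = child.get();
    }
    node->terminal = true;
    node->replacement = replacement;
  }

  // Single left-to-right pass; at each position take the longest matching term.
  std::string Replace(std::string const &str) const {
    std::string out;
    std::size_t pos = 0;
    while (pos < str.size()) {
      Node const *node = &root_;
      Node const *best = nullptr;
      std::size_t bestLen = 0;
      for (std::size_t i = pos; i < str.size(); ++i) {
        auto it = node->children.find(str[i]);
        if (it == node->children.end()) break;
        node = it->second.get();
        if (node->terminal) {
          best = node;
          bestLen = i - pos + 1;
        }
      }
      if (best) {
        out += best->replacement;  // replaced text is not rescanned
        pos += bestLen;
      } else {
        out += str[pos++];
      }
    }
    return out;
  }

 private:
  struct Node {
    std::unordered_map<char, std::unique_ptr<Node>> children;
    bool terminal = false;
    std::string replacement;
  };
  Node root_;
};
```

The trie would be built once when the terminology is loaded, so each `translate` call only pays for the scan.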
Update marian-dev
Basic Python interface + installation instructions