I am trying to understand the intended workflow for OpusCleaner.
Suppose I want to build some MT systems. I fire up OpusCleaner, download some data, apply cleaning rules until I am happy, then I upload the data to the cluster for training. Maybe I come back the next day and want to create a new version of this data set, or maybe I want to train a completely different MT system.
For this, would it be useful if OC had the notion of a "project"? I open a project, add files to it, set some project-wide rules and parameters, then maybe some data set specific parameters. If I then want to work on a different MT system, then I open a different project. I can copy the project file onto a different server, and initialise it (by downloading the files). Maybe projects could have versions, so I can track which data/rule set I used.
My idea was to leave project management to things like git, and let OpusCleaner use the filesystem as a project structure. So you'd run opuscleaner server from your project dir, e.g. hplt/v1/eng-fry. You can then also use git to commit the json files generated by OpusCleaner to track filters per dataset.
A hybrid approach might be what Jupyter Lab does, where you can also change directories (to some degree) from the web interface to open up a different project. But by default it will treat the current working directory as the root of your lab session.
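To make the "to some degree" part concrete, here is a minimal sketch of how a server could let the web interface switch directories while keeping everything under the directory it was started from. This is an illustration only, not how OpusCleaner or Jupyter Lab actually implement it; the function name `resolve_project_dir` is hypothetical.

```python
from pathlib import Path


def resolve_project_dir(root, requested):
    """Resolve `requested` relative to `root`, refusing paths outside `root`.

    Hypothetical helper: `root` plays the role of the directory the server
    was started from, `requested` is a path chosen in the web interface.
    """
    root = Path(root).resolve()
    # Joining with an absolute `requested` discards `root`, and ".." segments
    # are collapsed by resolve(), so both cases are caught by the check below.
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise ValueError(f"{requested!r} is outside the project root {root}")
    return target
```

With the server started in some root directory, a request for `hplt/v1/eng-fry` would resolve to a subdirectory of that root, while a request for `..` would be rejected.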
Ah, that makes sense. Having opuscleaner manage projects would add more complexity. Running multiple instances would be simpler.
It would still be useful to have some config that applies to all the datasets in a directory, say to set languages or default rules. It could just be a json/yaml file that lives in the directory.
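As a rough sketch of what I mean (nothing here exists in OpusCleaner as far as I know; the file names `defaults.json` and `<dataset>.config.json` and the function `load_dataset_config` are made up for illustration), directory-wide defaults could be read first and then overridden per dataset:

```python
import json
from pathlib import Path


def load_dataset_config(dataset_dir, dataset_name, defaults_name="defaults.json"):
    """Merge directory-wide defaults with optional dataset-specific overrides.

    Hypothetical layout: `defaults.json` applies to every dataset in the
    directory, `<dataset_name>.config.json` overrides individual keys.
    """
    dataset_dir = Path(dataset_dir)
    config = {}

    # Directory-wide settings, e.g. languages or default rules.
    defaults_path = dataset_dir / defaults_name
    if defaults_path.exists():
        config.update(json.loads(defaults_path.read_text(encoding="utf-8")))

    # Dataset-specific settings win over the defaults (shallow merge only).
    override_path = dataset_dir / f"{dataset_name}.config.json"
    if override_path.exists():
        config.update(json.loads(override_path.read_text(encoding="utf-8")))

    return config
```

Since this would be a plain file in the project directory, it could be committed to git alongside the filter json files.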