Should OpusCleaner have the notion of a "project"? #146

bhaddow · 2024-01-12T16:59:56Z

I am trying to understand the intended workflow for OpusCleaner.

Suppose I want to build some MT systems. I fire up OpusCleaner, download some data, apply cleaning rules until I am happy, then I upload the data to the cluster for training. Maybe I come back the next day and want to create a new version of this data set, or maybe I want to train a completely different MT system.

For this, would it be useful if OC had the notion of a "project"? I open a project, add files to it, set some project-wide rules and parameters, then maybe some data set specific parameters. If I then want to work on a different MT system, then I open a different project. I can copy the project file onto a different server, and initialise it (by downloading the files). Maybe projects could have versions, so I can track which data/rule set I used.

jelmervdl · 2024-01-12T19:46:51Z

My idea was to leave project management to things like git, and let OpusCleaner use the filesystem as a project structure. So you'd run opuscleaner server from your project dir, e.g. hplt/v1/eng-fry. You can then also use git to commit the json files generated by OpusCleaner to track filters per dataset.

A hybrid approach might be what Jupyter Lab does, where you can also change directories (to some degree) from the web interface to open up a different project. But by default it will treat the current working directory as the root of your lab session.

bhaddow · 2024-01-12T21:34:07Z

Ah, that makes sense. Having opuscleaner manage projects would add more complexity. Running multiple instances would be simpler.

It would still be useful to have some config that applies to all the datasets in a directory, say to set languages or default rules. It could be just a json/yaml file that lives in the directory.
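
As a rough illustration of the directory-level config idea, here is a minimal sketch in Python. It is not part of OpusCleaner: the file name `opuscleaner-defaults.json`, the keys, and both function names are assumptions made up for this example. It reads defaults from a JSON file in the project directory and lets per-dataset settings override them.

```python
import json
from pathlib import Path


def load_directory_defaults(directory, filename="opuscleaner-defaults.json"):
    """Read directory-wide defaults; the file name is hypothetical."""
    path = Path(directory) / filename
    if not path.exists():
        # No shared config in this directory: fall back to no defaults.
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def effective_config(directory_defaults, dataset_config):
    """Shallow merge: dataset-specific values win over directory defaults."""
    merged = dict(directory_defaults)
    merged.update(dataset_config)
    return merged
```

With defaults such as `{"languages": ["en", "fy"]}` in the directory file, a dataset that sets only its own rules would still inherit the languages, while a dataset that sets `languages` itself would keep its own value.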

jelmervdl · 2024-01-12T21:40:31Z

There seems to be some agreement on that. You're not the first to bring it up! #101

bhaddow added the enhancement New feature or request label Jan 12, 2024
