SDK Train
Sergio Matos edited this page Apr 13, 2021 · 3 revisions
If users do not want to use the Train CLI tool, it is also straightforward to train a machine-learning model for NER programmatically. This process has two phases:
- Phase 1: read sentences and annotations and build the corpus with NLP data;
- Phase 2: train the model based on the built corpus.
In the end, the model can be serialized into a file or used in a processing pipeline.
The following source code snippet shows how to train a machine-learning model for NER, using the data provided in the "example" folder.
// Set files
String sentencesFile = "example/train/sentences";
String annotationsFile = "example/train/annotations";
String modelConfigurationFile = "example/train/model.config";
String modelFile = "example/train/model.gz";
// Create parser
Parser parser = new GDepParser(ParserLanguage.ENGLISH, ParserLevel.CHUNKING, new LingpipeSentenceSplitter(), false).launch();
// Set sentences and annotations streams
InputStream sentencesStream = new FileInputStream(sentencesFile);
InputStream annotationsStream = new FileInputStream(annotationsFile);
// Run pipeline to get corpus from sentences and annotations
Pipeline pipelinePhase1 = new TrainPipelinePhase1()
.add(new BC2Reader(parser, null, annotationsStream))
.add(new TrainNLP(parser));
pipelinePhase1.run(sentencesStream);
// Close sentences and annotations streams
sentencesStream.close();
annotationsStream.close();
// Get corpus
Corpus corpus = pipelinePhase1.getCorpus();
// Get model configuration
ModelConfig modelConfig = new ModelConfig(modelConfigurationFile);
// Create placeholder input stream for phase 2 (the corpus is set directly on the pipeline)
InputStream inputStream = new ByteArrayInputStream(" ".getBytes("UTF-8"));
// Run pipeline to train model on corpus
Pipeline pipelinePhase2 = new TrainPipelinePhase2()
.add(new DefaultTrainer(modelConfig));
pipelinePhase2.setCorpus(corpus);
pipelinePhase2.run(inputStream);
// Close input stream
inputStream.close();
// Get trained model and write to file
CRFModel model = (CRFModel) pipelinePhase2.getModuleData("TRAINED_MODEL").get(0);
model.write(new GZIPOutputStream(new FileOutputStream(modelFile)));
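The last line of the snippet above writes the model through a `GZIPOutputStream` that is never explicitly closed. This page does not say whether `model.write` closes the stream itself, so a defensive option is to manage the stream with try-with-resources. The sketch below shows that pattern using only standard Java I/O. The class `GzipRoundTrip` and its methods are hypothetical names for illustration and are not part of Neji; a plain byte array stands in for the model.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Neji: writes and reads a gzip-compressed file.
public class GzipRoundTrip {

    // Write the payload to a gzip file. try-with-resources closes the stream,
    // which also finishes the gzip trailer so the file is complete on disk.
    static void writeGzip(Path file, byte[] payload) throws IOException {
        try (OutputStream out = new GZIPOutputStream(Files.newOutputStream(file))) {
            out.write(payload);
        }
    }

    // Read the whole gzip file back into memory.
    static byte[] readGzip(Path file) throws IOException {
        try (InputStream in = new GZIPInputStream(Files.newInputStream(file));
             ByteArrayOutputStream buffer = new ByteArrayOutputStream()) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("model", ".gz");
        // Stand-in for the serialized model bytes (assumption for this sketch).
        byte[] payload = "stand-in for model bytes".getBytes(StandardCharsets.UTF_8);
        writeGzip(file, payload);
        byte[] restored = readGzip(file);
        System.out.println(new String(restored, StandardCharsets.UTF_8));
        Files.delete(file);
    }
}
```

Applied to the training snippet, the same idea would be to open the `GZIPOutputStream` in a try-with-resources statement and pass it to `model.write(...)` inside the block, so the file is closed even if writing fails.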