SPARKNLP-656 & SPARKNLP-657: Updated Documentation #13108

DevinTDHa · 2022-11-18T11:05:45Z

Description

This PR updates the documentation to make it clearer how to use setTestDataset in the *DLApproach annotators.

In addition, the installation instructions for M1 machines were updated.

Resolves #13070 and #13079.

How Has This Been Tested?

No code changes.

maziyarpanahi · 2022-11-18T11:12:50Z

@DevinTDHa

Once you split (or use CoNLL() to get another DataFrame for test/dev), you need to transform it with the very same pipeline.

example for NER: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/75f443b1792ecc99575bfe217f30640acff7b55d/jupyter/training/english/dl-ner/ner_graph_builder.ipynb
examples for all the classifiers (Train/Evaluation) are here: https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training/english/classification
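For NER, the same idea can be sketched roughly as below. This is only an illustration under assumptions, not the content of the linked notebook: the function names, the file paths passed in, and the choice of the `glove_100d` embeddings are hypothetical, and the Spark NLP imports are deferred into the functions so the definitions load even where Spark NLP is not installed.

```python
def prepare_ner_test_set(spark, conll_path, parquet_path):
    """Read a CoNLL test file, add embeddings, and save it as parquet.

    Hypothetical helper: the paths and the embeddings model are assumptions.
    """
    from sparknlp.annotator import WordEmbeddingsModel
    from sparknlp.training import CoNLL

    # CoNLL() yields a DataFrame with document, sentence, token, pos and label columns.
    test_data = CoNLL().readDataset(spark, conll_path)

    # Must be the same embeddings stage that is used for the training data.
    embeddings = (
        WordEmbeddingsModel.pretrained("glove_100d")
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )

    embeddings.transform(test_data).write.mode("overwrite").parquet(parquet_path)


def build_ner_approach(parquet_path):
    """Create a NerDLApproach that evaluates on the saved test set each epoch."""
    from sparknlp.annotator import NerDLApproach

    return (
        NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])
        .setLabelColumn("label")
        .setOutputCol("ner")
        .setMaxEpochs(1)
        .setTestDataset(parquet_path)
    )
```

The key point is that the parquet file already contains the embeddings column, because the annotator reads it as-is during training.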

So perhaps we can have one example/doc for normal training, and another one for the case where the user needs the testDataset param, which usually doesn't use a single Pipeline (or everything up to and including the embeddings is in the Pipeline, and the trainable annotator is outside it).

Like

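from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

document = DocumentAssembler() \
    .setInputCol("description") \
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# Preprocessing only: everything up to and including the embeddings
pipeline = Pipeline(stages=[document, use])

test_dataset = pipeline.fit(news_test_dataset).transform(news_test_dataset)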

We transform and save the test/dev set:

test_dataset.write.parquet("./test_news.parquet")

And then we train:

classifierdl = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("category") \
    .setMaxEpochs(5) \
    .setEnableOutputLogs(True) \
    .setEvaluationLogExtended(True) \
    .setValidationSplit(0.2) \
    .setTestDataset("./test_news.parquet")

pipeline = Pipeline(
    stages=[
        document,
        use,
        classifierdl
    ])

pipelineModel = pipeline.fit(trainDataset)
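The other variant mentioned above, with the trainable annotator kept outside the Pipeline, could look roughly like this. It is a sketch under assumptions: the function name `train_with_test_dataset` and its parameters are hypothetical, the column names are taken from the example above, and the imports are deferred into the function so the definition loads even where Spark NLP is not installed.

```python
def train_with_test_dataset(train_df, test_df, test_path="./test_news.parquet"):
    """Preprocess train and test with one pipeline, then fit the classifier alone.

    Hypothetical helper: train_df and test_df are assumed to have
    "description" and "category" columns, as in the example above.
    """
    from pyspark.ml import Pipeline
    from sparknlp.annotator import ClassifierDLApproach, UniversalSentenceEncoder
    from sparknlp.base import DocumentAssembler

    document = (
        DocumentAssembler()
        .setInputCol("description")
        .setOutputCol("document")
    )
    use = (
        UniversalSentenceEncoder.pretrained()
        .setInputCols(["document"])
        .setOutputCol("sentence_embeddings")
    )

    # Everything up to and including the embeddings lives in the Pipeline.
    preprocessing = Pipeline(stages=[document, use]).fit(train_df)

    # The test set goes through the very same stages before being saved.
    preprocessing.transform(test_df).write.mode("overwrite").parquet(test_path)
    train_features = preprocessing.transform(train_df)

    # The trainable annotator stays outside the Pipeline and is fit directly.
    classifier = (
        ClassifierDLApproach()
        .setInputCols(["sentence_embeddings"])
        .setOutputCol("class")
        .setLabelColumn("category")
        .setMaxEpochs(5)
        .setEnableOutputLogs(True)
        .setEvaluationLogExtended(True)
        .setTestDataset(test_path)
    )
    return preprocessing, classifier.fit(train_features)
```

Fitting the annotator on already-transformed data avoids recomputing the embeddings for the training set when trying several training configurations, at the cost of managing two objects instead of one PipelineModel.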

DevinTDHa · 2022-11-18T12:41:42Z

@maziyarpanahi The latest push updates the docs with better examples.

DevinTDHa added the enhancement, documentation, and DON'T MERGE (Do not merge this PR) labels Nov 18, 2022

DevinTDHa requested a review from maziyarpanahi November 18, 2022 11:05

DevinTDHa assigned maziyarpanahi and DevinTDHa Nov 18, 2022

DevinTDHa added 2 commits November 18, 2022 13:39

SPARKNLP-656: Update documentation regarding testDataset in trainable…

f78022d

… classifiers - updated docs for NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, SentimentDLApproach

SPARKNLP-657: Updated installation docs for M1

d41f408

DevinTDHa force-pushed the docs/setTestDataset branch from dd19c1c to d41f408 Compare November 18, 2022 12:39

maziyarpanahi approved these changes Nov 28, 2022

maziyarpanahi changed the base branch from master to feature/424-release-candidate November 28, 2022 09:59

maziyarpanahi linked an issue Nov 28, 2022 that may be closed by this pull request

Mac M1: jnitensorflow error with BertEmbeddings.pretrained #13079

Closed

maziyarpanahi merged commit 7bbb637 into JohnSnowLabs:feature/424-release-candidate Nov 28, 2022

This was referenced Nov 28, 2022

Spark NLP 424-release-candidate #13156

Closed

Release/424 release candidate #13162

Closed

Release/424 release candidate #13163

Merged
