========
2.0.8
========
---------------
Overview
---------------
This release fixes a few small but meaningful issues that caused newly trained models to have internal compatibility issues.
---------------
Bugfixes
---------------
* Fixed wrong logic when checking whether embeddingsRef is being overwritten in a WordEmbeddingsModel
* Deleted unnecessary chunk index from tokens
* Fixed compatibility issues in some newly trained models where the Python API had pretrained models mismatching the Scala ones
========
2.0.7
========
---------------
Overview
---------------
This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default filesystems
---------------
Bugfixes
---------------
* Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
* NerDLModel was not properly reading user provided config proto bytes during prediction
* Improved the cluster embeddings error message to hint users running cluster mode without shared filesystems
* Removed lazy model downloading in PretrainedPipeline; the model is now downloaded at instantiation (see the sketch below)
* Fixed URI construction for cluster embeddings on non-defaultFS configurations, improving cluster compatibility
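
For reference, a minimal Python sketch of eager PretrainedPipeline usage ("explain_document_dl" is one of the published pipeline names; substitute any pretrained pipeline):

    from sparknlp.pretrained import PretrainedPipeline

    # the model now downloads at instantiation rather than lazily on first use
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    result = pipeline.annotate("Spark NLP ships pretrained pipelines.")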
========
2.0.6
========
---------------
Overview
---------------
Following 2.0.5 (read the notes below), this release fixes a bug that occurred when disabling the contrib param in NerDLApproach on non-Windows operating systems
---------------
Bugfixes
---------------
* Fixed NerDLApproach failing when training with setUseContrib(false)
========
2.0.5
========
---------------
Overview
---------------
This release bumps Spark NLP to Apache Spark 2.4.3 by default. Spark had been testing Scala 2.12 and is now back on 2.11, so this should be a stable release.
In this version, we fixed a series of pretrained models and focused on improving the flexibility of the NerDL annotator, which, based on user feedback, is arguably the most popular one.
Users can now point to graphs they create without having to recompile the library, and graph options, as well as whether to use TensorFlow contrib, are now user defined.
Particular thanks to @CyborgDroid for reporting important, well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and feedback; we always welcome more. Join us on Slack!
---------------
Enhancements
---------------
* ViveknSentiment annotator now includes confidence score in metadata
* NerDL now has setGraphFolder to allow a path to a folder with custom graphs generated using Python/TensorFlow code
* NerDL now has setConfigProtoBytes to allow users to submit their own serialized ConfigProto to the graph settings
* NerDLApproach now has setUseContrib to let users decide whether to use contrib during training. Contrib LSTM cells are proven to return more accurate results, but do not work on Windows yet (see the sketch after this list).
* Updated default TensorFlow settings to include GPU allow_growth by default and disabled the log device placement spamming message
* Spark version bumped to 2.4.3
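
A minimal Python sketch of the new NerDL training knobs (the graph folder path is a placeholder; setter names match the params above):

    from sparknlp.annotator import NerDLApproach

    ner = (NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])
        .setOutputCol("ner")
        .setLabelColumn("label")
        # folder holding custom TensorFlow graphs (placeholder path)
        .setGraphFolder("path/to/custom/graphs")
        # contrib LSTM cells are more accurate but unavailable on Windows
        .setUseContrib(False))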
---------------
Bugfixes
---------------
* Fixed contrib NerDL models not working properly in clusters such as Databricks (Thanks @CyborgDroid)
* Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
* Fixed DependencyParser pretrained models not working properly in Python
---------------
Models and Pipelines
---------------
* NerDL will download the noncontrib model if Windows is detected, for better compatibility
* Noncontrib versions of pipelines with NerDL have been uploaded, as well as new models. Check the documentation for the complete list
* Improved the error message shown when a user on Windows tries to load a contrib NerDL model
* Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
---------------
Developer API
---------------
* Embeddings in Python moved to the annotator module for consistency
* The SourceStream ResourceHelper class now properly handles cluster files for the Dependency Parser
* The metadata model reader now ignores empty lines instead of failing
* Unified the attribute name lang (instead of language) in the pretrained API
========
2.0.4
========
---------------
Overview
---------------
We are excited that the Spark NLP workshop (spark-nlp-workshop repository) has been so useful for many users.
We have also taken a step forward by moving the website's documentation to an easy-to-maintain wiki! The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.
---------------
Bugfixes
---------------
* Fixed DependencyParser and TypedDependencyParser working inaccurately
* Fixed a bug preventing the WordEmbeddingsModel class from loading in Python
* Fixed wrong pretrained model names preventing some pretrained models from working properly
* Fixed BertEmbeddings not being capable of loading from file due to a reader exception
---------------
Documentation
---------------
* Website documentation migrated to GitHub wiki page (WIP)
---------------
Developer API
---------------
* OcrHelper now reports failed file name when throwing exceptions (Thanks @kgeis)
* Fixed the Annotation function explodeAnnotations to consider scenarios where the output column is replaced
* Fixed Travis CI unit tests
========
2.0.3
========
---------------
Overview
---------------
Shortly after 2.0.2, a hotfix release was made to address two bugs that prevented users from using pretrained TensorFlow models in clusters.
Please read release notes for 2.0.2 to catch up!
---------------
Bugfixes
---------------
* Fixed the logger not being serializable, which caused issues when executors serialized TensorflowWrapper
* Fixed contrib loading in clusters when retrieving a TensorFlow session
========
2.0.2
========
---------------
Overview
---------------
Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better performing library, both in speed and in accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word embeddings continue to be improved for better performance and a lower memory footprint.
The Context Spell Checker continues to receive enhancements in concurrency and usage of Spark. Finally, TensorFlow-based annotators
have been significantly improved by refactoring the serialization design. Help us with feedback and we'll welcome any issue reports!
---------------
New Features
---------------
* The NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in metadata (see the sketch below)
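
A minimal Python sketch of the new param (setter name assumed from the param name):

    from sparknlp.annotator import NerCrfApproach

    ner_crf = (NerCrfApproach()
        .setInputCols(["sentence", "token", "pos", "embeddings"])
        .setOutputCol("ner")
        .setLabelColumn("label")
        # confidence scores for each prediction land in the annotation metadata
        .setIncludeConfidence(True))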
---------------
Enhancements
---------------
* Improved cluster mode performance in TensorFlow annotators by serializing internal information to bytes
* The Doc2Chunk annotator added new params startCol, startColByTokenIndex, failOnMissing and lowerCase, allowing better chunking of documents (see the sketch after this list)
* All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
* ContextSpellChecker now creates a window around the token to improve computation performance
* Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
* WordEmbeddings won't load twice if already loaded
* WordEmbeddings can use embeddingsRef if a source was not provided, improving reuse of embeddings in a pipeline
* The new WordEmbeddings param includeEmbeddings allows annotators not to save the entire embeddings source along with them
* Contrib TensorFlow dependencies now load only when necessary
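
A rough Python sketch of the new Doc2Chunk params (setter and column names assumed from the param names above):

    from sparknlp.base import Doc2Chunk

    doc2chunk = (Doc2Chunk()
        .setInputCols(["document"])
        .setOutputCol("chunk")
        .setChunkCol("target")            # column holding the chunk text
        .setStartCol("start")             # column holding the chunk's start position
        .setStartColByTokenIndex(True)    # interpret start as a token index
        .setFailOnMissing(False)          # tolerate rows where the chunk is absent
        .setLowerCase(True))              # match case-insensitively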
---------------
Bugfixes
---------------
* Added missing Symmetric delete pretrained model
* Fixed a broken param name in Normalizer (thanks @RobertSassen)
* Fixed Cloudera cluster support
* Fixed concurrent access in ContextSpellChecker in high partition number use cases and LightPipelines
* Fixed POS dataset creator to better handle corrupted pairs
* Fixed a bug in Word Embeddings not matching exact case sensitive tokens in some scenarios
* Fixed OCR Tess4J initialization problems in concurrent scenarios
---------------
Models and Pipelines
---------------
* Renaming of models and pipelines (work in progress)
* Better output column naming in pipelines
---------------
Developer API
---------------
* Unified more WordEmbeddings interface with dimension params and individual setters
* Improved unit tests for better compatibility on Windows
* Python embeddings moved to sparknlp.embeddings
========
2.0.1
========
---------------
Overview
---------------
Thanks for following up after our 2.0.0 release! This release covers a few holes left by the immense 2.0.0 release,
addressing high-priority issues found after it. More importantly, the library should now behave correctly when using
Spark cluster modes, and memory and CPU utilization should be reduced to normal levels after some serious profiling of serialization
revealed a bunch of problems. Aside from performance and resource management improvements, we include an OCR dependency handler in the start() function, as well
as improved GPU support for NER Deep Learning models. Finally, check out our spark-nlp-workshop repo; it has cool features!
---------------
Enhancements
---------------
* Improved serialization of Deep Learning models, showing performance boosts of up to 2.5x over 1.8.3
* TensorFlow contrib libraries are now managed correctly across a cluster
* Reverted the useFeatureBroadcasting change after internal benchmarks proved it performed better
* SparkNLP.start() and sparknlp.start() now accept an includeOCR parameter that automatically includes the OCR library (see the sketch after this list)
* Recreated NerDL graphs to allow TensorFlow GPU allow_growth, improving memory management on GPUs
* Expanded GPU coverage in the NerDL graph
* Reduced the NerDL batch size for better compatibility with GPUs
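
For reference, a minimal Python sketch of the start() helper with the OCR flag:

    import sparknlp

    # plain local Spark session
    spark = sparknlp.start()

    # local session that also pulls in the OCR dependencies
    spark_ocr = sparknlp.start(include_ocr=True)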
---------------
Bugfixes
---------------
* Fixed deep learning models not working across clusters due to a bug in inputBuffers from graph reading
* Fixed a bug in the POS() training function which did not work correctly from Python
* Fixed a bug in OCR where page number and intersection were not correctly matched
* Correctly handle exceptions when training Norvig and Symmetric Spell Checkers from dataframes
---------------
Developer API
---------------
* ContextSpellChecker now follows Features API correctly
---------------
Documentation
---------------
* The spark-nlp-workshop repository has been expanded with better documentation and new notebooks
* We are still catching up with the 2.x release!
========
2.0.0
========
---------------
Overview
---------------
Thank you for following up with the biggest changelog ever for Spark NLP: Spark NLP 2.0.0! Where to begin?
We have no fewer than 50 pull requests merged this time. Most importantly, we have become the first library to have a production-ready
implementation of BERT embeddings. Along with this interesting deep learning, context-based embeddings algorithm, here is a quick overview of new things:
* Word Embeddings as well as BERT Embeddings are now annotators, just like any other component in the library. This means embeddings can be
cached in memory through DataFrames, saved on disk and shared as part of pipelines!
* We revamped and enhanced Named Entity Recognition (NER) Deep Learning models to a new state-of-the-art level, reaching up to 93% F1 micro-averaged accuracy on the industry standard.
* We upgraded the TensorFlow version and also started using contrib LSTM cells.
* Performance and memory usage improvements also tag along, by improving the serialization throughput of Deep Learning annotators based on feedback from Apache Spark contributor Davies Liu.
* We revamped and expanded our pretrained pipelines list, plus added new pretrained models for different languages, together with
tons of new example notebooks, which include changes that aim to make the library easier to use. The overall API was modified to help newcomers get started.
* The OCR module comes with a handful of improvements that increase accuracy.
All of this comes together with a full range of bug fixes and annotator improvements; follow the details below!
Bear with us, since documentation is still catching up a bit, as are new models yet to be made available. Stay tuned on Slack!
----------------
New Features
----------------
* BertEmbeddings annotator, with four Google-ready models usable through Spark NLP as part of your pipelines; includes WordPiece tokenization (see the sketch after this list)
* WordEmbeddings, our previous embeddings system, is now an annotator that can be serialized along Spark ML pipelines
* Created training helper functions that create Spark datasets from files, such as CoNLL and POS tagging
* NER DL has been revamped by using contrib LSTM cells. Added library handling for different operating systems.
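
A minimal Python sketch of BertEmbeddings as a pipeline stage (using the default pretrained model):

    from sparknlp.annotator import BertEmbeddings

    # downloads the default pretrained BERT model
    bert = (BertEmbeddings.pretrained()
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings"))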
----------------
Enhancements
----------------
* OCR improved handling of images by adding binarization of buffered segments
* OCR now allows automatic adaptive scaling
* SentenceDetector params merged between DL and Rule based annotators
* SentenceDetector max length has been disabled by default, and now truncates by whitespace
* Part of Speech, NER, Spell Checking and Vivekn Sentiment Analysis annotators now train from the dataset passed to fit(), using Spark in the process
* Tokens and Chunks now hold metadata information regarding which sentence they belong to, by sentence ID
* AnnotatorApproach annotators now allow a trainingCols param, letting them use different inputs in training and in prediction. Improves Pipeline versatility.
* LightPipelines now allow the transform() method to be called against a DataFrame (see the sketch after this list)
* Noticeable performance gains by improving serialization performance in annotators through removal of transient variables
* Spark NLP in 30 seconds: the new functions SparkNLP.start() (Scala) and sparknlp.start() (Python) automatically create a local Spark session.
* Improved DateMatcher accuracy
* Improved Normalizer annotator by supporting and tokenizing a slang dictionary, with case sensitivity matching option
* ContextSpellChecker is now capable of handling multiple sentences in a row
* The PretrainedPipeline feature now allows handling John Snow Labs remote pretrained pipelines, making it easy to update and access new models
* Symmetric Delete spell checking model improved training performance
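
A minimal Python sketch of the LightPipeline usage described above (pipeline_model and input_df are assumed to exist):

    from sparknlp.base import LightPipeline

    # pipeline_model is a fitted Spark ML PipelineModel
    light = LightPipeline(pipeline_model)

    # fast, single-machine annotation of plain strings
    annotations = light.annotate("Spark NLP annotates text quickly.")

    # transform() may now be called against a DataFrame as well
    result_df = light.transform(input_df)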
----------------
Models and Pipelines
----------------
* Added more than 15 pretrained pipelines that cover a huge range of use cases. To be documented
* Improved multi-language support by adding French and Italian pipelines and models. More to come!
* Dependency Parser annotators now include a pretrained English model based on CoNLL-U 2009
----------------
Bugfixes
----------------
* Fixed Python classname reference when deserializing pipelines
* Fixed serialization in ContextSpellChecker
* Fixed a bug in LightPipeline causing it not to include output from embedded pipelines in a PipelineModel
* Fixed a wrong param name in DateMatcher that prevented accessing it properly
* Fixed a bug where DateMatcher didn't know how to handle dashes in dates where the year had two digits instead of four
* Fixed a ContextSpellChecker bug that prevented it from being used repeatedly with collections in LightPipeline
* Fixed a bug in OCR that made it blow up with some image formats when using the text-preferred method
* Fixed a bug in OCR which made params not work in cluster mode
* Fixed OCR setSplitPages and setSplitRegions to work properly when Tesseract detects multiple regions
----------------
Developer API
----------------
* AnnotatorType params renamed to inputAnnotatorTypes and outputAnnotatorTypes
* Embeddings now serialize along with a FloatArray in the Annotation class
* Disabled useFeatureBroadcasting, which showed better performance numbers when training large models in annotators that use Features
* OCR must now be instantiated
* OCR works best with Tesseract 4.0.0-beta.1
----------------
Build and release
----------------
* Added GPU build with tensorflow-gpu to Maven coordinates
* Removed .jar file from pip package
========
1.8.3
========
---------------
Overview
---------------
We're glad to announce a new release for Spark NLP. This one recognizes the community, which has contributed
immensely by reporting bugs and giving feedback on the library. This release focuses on various bugfixes around the DeepSentenceDetector
and on Python deserialization of some specific pipelines. It also improves the DeepSentenceDetector, allowing further fine-tuning
and customization. Embeddings are now cached in the models folder, with further improvements towards accessing
them through S3 storage. Finally, we have made serious improvements to the notebooks and documentation around the library.
Special thanks to @Tshimanga and @haimco10 for very interesting contributions. See you on Slack!
---------------
Enhancements
---------------
* Improved OCR performance in skew detection
* SentenceDetector now better handles single quote protections (Thanks @haimco10)
* DeepSentenceDetector now can explodeSentences (Thanks @Tshimanga from Deep6.ai)
* EmbeddingsHelper is now capable of caching downloaded embeddings to avoid re-downloading
* The application.conf file may now be read from an S3 location
* DeepSentenceDetector now has access to all pragmatic SentenceDetector params in order to fine-tune it
---------------
Bugfixes
---------------
* Fixed ambiguous classpath resolution in pyspark, causing errors in deserializing some models
* Fixed DeepSentenceDetector not being deserializable in PySpark
* Fixed Chunk2Doc and Doc2Chunk annotators not being loadable in PySpark
* Fixed a bug where DeepSentenceDetector wouldn't correctly denote start and end offsets (Thanks @Tshimanga from Deep6.ai)
* Fixed a bug where DeepSentenceDetector would miss sentence parts when NER model missed header sentence (Thanks @Tshimanga from Deep6.ai)
* Cleaned and optimized DeepSentenceDetector code (Thanks @danilojsl)
* Fixed a missing dependency for OCR
---------------
Documentation and notebooks
---------------
* Added support and instructions for Anaconda deployment (Thanks @Maziyar)
* Updated various python notebooks to show utilization of spark packages instead of jars
* Added a new conference talk with Spark NLP in French at XebiCon'18
* Updated documentation towards less use of jars in favor of dependency solving
========
1.8.2
========
---------------
Overview
---------------
This release targets improved performance and resource usage in some pipelines that use word embeddings. It also comes
with a very interesting auto-rotation feature in OCR, and a couple of new annotators to solve particular needs, including the ChunkTokenizer
and a Param to limit sentence lengths. Finally, we are starting to organize our multilingual store of models and data for training models.
Check the examples for some Italian notebooks! Thanks again to the whole community for such quick feedback all the time.
---------------
New Features
---------------
* OCR now capable of automatic rotation, significantly improving accuracy in some scenarios
* ChunkTokenizer is a new annotator that tokenizes CHUNK type annotations. It extends the Tokenizer algorithm and stores the chunk ID for reference.
* SentenceDetector's new maxLength param now cuts off sentences longer than (by default) 240 characters. It avoids Deep Learning annotator issues and may improve performance in some scenarios.
* NerConverter's new whiteList param now allows a list of NER labels to be considered, while discarding the rest. May be useful for selective CHUNKing pipelines (see the sketch below).
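
A minimal Python sketch of the whiteList param (the labels are examples; setter name assumed from the param name):

    from sparknlp.annotator import NerConverter

    converter = (NerConverter()
        .setInputCols(["sentence", "token", "ner"])
        .setOutputCol("ner_chunk")
        # keep only these entity labels; everything else is discarded
        .setWhiteList(["PER", "LOC"]))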
---------------
Enhancements
---------------
* Pipelines using word embeddings should now perform faster thanks to a group of RocksDB optimizations allowing annotators to reuse currently open connections to the DB
---------------
Bugfixes
---------------
* Fixed a bug where DeepSentenceDetector was missing the load() interface (Thanks @Tshimanga from Deep6!)
* Fixed a bug where RocksDB opened too many files at once causing pipelines to fail or to work very slowly
* Fixed NerCrfModel prefetching RocksDB, which caused slower performance
---------------
Framework
---------------
* Added missing artifact resolution dependencies for OCR Module
* Started adding and organizing multilanguage models (Thanks @maziyarpanahi)
* Updated RocksDB to 5.17.2
========
1.8.1
========
---------------
Overview
---------------
This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating graphs for NerDL and for creating your own metadata files for a private model downloader.
As a new feature, we are including a new experimental machine learning based sentence detector, which uses NER for bounds detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy! And thanks again for the community contributions!
---------------
New Features
---------------
* New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection
---------------
Enhancements
---------------
* Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
* The OCR process now defaults to splitting content into rows when paragraphs or pages are identified, for improved parallelism. This may be turned off.
---------------
Examples and use cases
---------------
* Added Scala examples for sentiment analysis and the Lemmatizer in Italian (thanks Vincenzo Gaudenzi from DXC.technology for the dataset and model contribution!)
---------------
Bugfixes
---------------
* Fixed a bug in the Norvig and Symmetric spell checkers where the pattern parameter was not provided properly on the Scala side (Thanks @johnmccain for reporting!)
---------------
Framework
---------------
* Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)
---------------
Other
---------------
* Code for the metadata files used by the pretrained model downloader is now included. This may be useful if anyone wants to set up their own private local model downloader service
* NerDL graph generation code is now included in the library. This allows the usage of custom word embedding dimensions and feature counts.
---------------
Special mentions
---------------
* Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models, and @maziyar for creating examples with them.
* @correlator from Deep6.ai for contributing feedback in slack and features feedback in general
* @johnmccain for reporting bugs in spell checker
* @rohit-nlp for delivering maven coordinates for OCR
* @haimco10 for contributing a sentence detector improvement for an apostrophe use case. Not merged due to specific issues involved.
========
1.8.0
========
---------------
Overview
---------------
This release is huge! Spark-NLP made the leap into Spark 2.4.0, even with the challenge of not everyone being on board yet (i.e. Zeppelin doesn't support it yet).
In this version we release three new NLP annotators: two for dependency parsing processes and one for contextual deep learning based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on Tesseract.
Finally, there are plenty of bug fixes and improvements in the word embeddings field, along with performance boosts and reduced disk IO.
Feel free to send us any feedback you have! Particularly on your Spark 2.4.x experience.
---------------
New Features
---------------
* Built on top of Spark 2.4.0
* The Dependency Parser annotator allows for sentence relationship encoding (see the sketch after this list)
* The Typed Dependency Parser annotator allows for labeling relationships within dependency tags
* ContextSpellChecker is our first Deep Learning based spell checker, which evaluates context and not only tokens
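
A rough Python sketch of wiring the two new parsers together (training file paths are placeholders; setter names assumed from the later-stabilized API):

    from sparknlp.annotator import (DependencyParserApproach,
                                    TypedDependencyParserApproach)

    # unlabeled dependencies, trained from a treebank (placeholder path)
    dep = (DependencyParserApproach()
        .setInputCols(["sentence", "pos", "token"])
        .setOutputCol("dependency")
        .setDependencyTreeBank("path/to/treebank"))

    # labeled (typed) dependencies, trained from CoNLL 2009 data (placeholder path)
    typed = (TypedDependencyParserApproach()
        .setInputCols(["token", "pos", "dependency"])
        .setOutputCol("labdep")
        .setConll2009("path/to/conll2009"))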
---------------
Enhancements
---------------
* More OCR parameters exposed for further fine-tuning, including preferred method priority and page segmentation modes
* OCR now has a setSplitPages() setting which allows choosing whether to output one page per row or the entire document instead
* Improved word embeddings performance when working on local filesystems
* Reduced the amount of disk IO when working with word embeddings
* All python notebooks improved for better readability and better documentation
* Simplified PySpark interface API
* CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
---------------
Bugfixes
---------------
* Solved race condition issues regarding cluster usage of the RocksDB index for embeddings
* Fixed an application.conf reading bug which didn't properly refresh AWS credentials
* The RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
* Fixed various Python default parameter settings
* Fixed circular dependency with jbig pdfbox image OCR
---------------
Deprecations
---------------
* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP
========
1.7.3
========
---------------
Overview
---------------
This hotfix release focuses on fixing word-embeddings cluster problems on some frameworks such as Databricks, while keeping 1.7.x performance benefits. Various YARN based clusters have been tested, Databricks cloud among them, to verify this hotfix.
Aside from that, multiple improvements have been committed towards better support of PySpark-NLP, fixing diverse technical issues in the API that help consistency in the Annotator superclasses.
Finally, PIP installation has been made easier with a SparkNLP class that creates the SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to all the community for reporting issues.
---------------
Bugfixes
---------------
* Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using _.functions implicits
* Fixed dependency with apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
* Fixed Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
* Fixed error 'JavaPackage not callable' when using AnnotatorModel.load() API without instantiating the class first
* Fixed Spark addFiles missing local files, causing Word Embeddings not to work properly in some cluster-based frameworks
* Fixed broadcast NoSuchElementException `Failed to get broadcast_6_piece0 of broadcast_6` causing pretrained models not work in cluster frameworks (thanks @EnricoMi)
---------------
Developer API
---------------
* EmbeddingsHelper.setRef() has been removed. The reference is now set implicitly through EmbeddingsHelper.load(), and embeddings no longer need to be loaded before deserializing models.
* Fixed and properly renamed the chunk2doc and doc2chunk transformers; they should now work as expected
* Renamed setCompositeTokens to setCompositeTokensPatterns to remind users that regexes are used in this Param
* Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
* Simplified cluster path resolution for word embeddings
---------------
Other
---------------
* sparknlp.base now contains a SparkNLP() class which automatically creates a SparkSession using appropriate jar settings. Helps newcomers get started in PySpark NLP.
========
1.7.2
========
---------------
Overview
---------------
A quick release with another hotfix, due to a newly found bug when deserializing word embeddings on a distributed filesystem. It also introduces changes to the application.conf reader in order
to allow run-time changes, and renaming in the EmbeddingsHelper API.
---------------
Bugfixes
---------------
* Fixed embeddings deserialization from distributed filesystems (caused by the Windows path fix)
* Fixed application.conf not reading changes at runtime
* Added missing remote_locs argument in python pretrained() functions
* Fixed wrong build version introduced in 1.7.1 to detect proper pretrained models version
---------------
Developer API
---------------
* Renamed EmbeddingsHelper functions for more convenience
========
1.7.1
========
---------------
Overview
---------------
Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
It also fixes some path resolution on Windows. Thanks to Maziyar, .gitattributes has been added in order to identify proper languages on GitHub.
Finally, 1.7.1 adds an annotator missing from 1.7.0, Chunk2Doc, which converts CHUNK types into DOCUMENT types for further retokenization or other annotations.
---------------
Enhancements
---------------
* Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
---------------
Bugfixes
---------------
* Fixed embedding-based annotators' deserialization error when cache_pretrained is on a distributed fs (Thanks Bryan Wilkinson for pointing out the issue and testing the fix)
* Fixed Windows path reading when deserializing embeddings (Thanks @apiltamang)
---------------
Other
---------------
* .gitattributes added in order to properly discard Jupyter as the main language of the GitHub repo (thanks @maziyarpanahi)
========
1.7.0
========
---------------
Overview
---------------
Having multiple annotators that use the same word embeddings set may result in huge pipelines and high driver memory and storage consumption.
From now on, embeddings may be shared and reutilized across annotators, making the process much more efficient.
Also, thanks to @apiltamang, we now better support path resolution for Windows implementations.
---------------
Enhancements
---------------
Memory and storage savings: annotators with embeddings, through the params 'includeEmbeddings' and 'embeddingsRef', can now set whether embeddings should be included when saved, or referenced by ID from other annotators (see the sketch below)
The EmbeddingsHelper class allows embeddings management
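
A rough Python sketch of the idea (setter names assumed from the param names above; 'my_glove' is a hypothetical reference id; the params live on annotators that consume embeddings, e.g. NerDLApproach):

    from sparknlp.annotator import NerDLApproach

    ner = (NerDLApproach()
        .setIncludeEmbeddings(False)    # don't bundle the embeddings when saving
        .setEmbeddingsRef("my_glove"))  # reference embeddings loaded elsewhere by id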
---------------
Bug fixes
---------------
Thanks to @apiltamang for improving URI path support for Windows Servers
---------------
Developer API
---------------
Embeddings interfaces and method names completely refactored, hopefully simplified and easier to understand
========
1.6.3
========
---------------
Overview
---------------
This release includes a new annotator for de-identification of sensitive information. It uses CHUNK annotations, meaning its accuracy will depend on the previous annotators in the pipeline.
Also, OCR capabilities have been improved in the OCR module.
In terms of broken stuff, we've fixed a few annoying bugs in the SymmetricDelete and SentenceDetector explode features.
Finally, pip is now part of the official repositories, meaning you can install it just like any other module. It also includes the jars, and we've added a SparkNLP class which creates the SparkSession easily for you.
Thanks again for all community contribution in issues, feedback and comments in GitHub and in Slack.
---------------
New features
---------------
* DeIdentification annotator: takes DOCUMENT and TOKEN from the original sentence, plus a CHUNK annotation, to anonymize the target chunk in the sentence. The CHUNK annotation might come from NerConverter, TextMatcher or other chunk annotators.
---------------
Enhancements
---------------
* Kernel zoom and region erosion improve overall detection quality. Fixed some stability bugs. Improved parallelism
---------------
Bug fixes
---------------
* Sentence Detector exploding sentences into rows now works properly
* Fixed the dictionary-based sentiment detector not working on PySpark
* Added missing NerConverter to annotator._ imports
* Fixed SymmetricDelete spell checker deleting tokens in some scenarios
* Fixed SymmetricDelete spell checker's unwanted lower-casing
---------------
Other
---------------
* The PySpark package is now part of the official pip repositories
* Pip installation now includes the corresponding spark-nlp jar. The base module includes a SparkNLP SparkSession creator
========
1.6.2
========
---------------
Overview
---------------
In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
We increased Norvig Spell Checker throughput by more than 300% by disabling DoubleVariants and improving algorithm order. It is now reported capable of 42K sentences per second.
The Symmetric Delete spell checker is more performant, although it has been reported to process 2K sentences per second.
NerCRF has been reported to process 300 sentences per second, while NerDL can go twice as fast (about 700 sentences per second).
Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (before, it was below 500).
Finally, SentenceDetector performance was improved by 40%, from ~30K rows processed per second to ~40K. But we have now enabled abbreviation processing by default, which reduces the final speed to 22K rows per second: a net decrease, but with better accuracy.
Again, thanks to the community for helping with feedback. We welcome everyone asking questions or giving feedback in our Slack channel, or reporting issues on GitHub.
---------------
Enhancements
---------------
* OCR now features kernel segmentation, significantly improving image-based PDF processing
* Vivekn Sentiment Analysis prediction performance improved by better data structures
* Both Norvig and Symmetric Delete spell checkers now have improved performance
* SentenceDetector improved accuracy by better handling abbreviations. useAbbreviations is now also turned on by default
* SentenceDetector performance improved significantly through better preloading of rules
---------------
Bug fixes
---------------
* Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models not affected
* Fixed NerConverter not properly considering multiple sentences per row (after using SentenceDetector), causing an unhandled exception to occur in some scenarios.
* Tensorflow sessions now all support allow_soft_placement, supporting GPU based graphs to work with and without GPU
* Norvig Spell Checker: fixed a missing step in the algorithm that checks for additional variants. May improve accuracy
* Norvig Spell Checker: disabled DoubleVariants by default. It was not improving accuracy significantly and was hitting performance very hard
---------------
Developer API
---------------
* New FeatureSet allows HashSet params
---------------
Models
---------------
* The Vivekn Sentiment pipeline no longer includes a Spell Checker
* Fixed the Vivekn Sentiment pretrained model, improving accuracy
========
1.6.1
========
---------------
Overview
---------------
Hi! We're glad to announce hotfix 1.6.1. Although the changes seem modest or very specific, there is a lot going on under the hood. First of all, we've worked hard with the community to understand S3-based clusters,
which don't have a common fs.defaultFS configuration, which is what we use to tell where the cluster temp folder is located in order to distribute word embeddings. We fixed two things here:
on one side, we fixed a bug pointing to the wrong filesystem. Second, we added a custom override setting in application.conf that allows manually setting where to put temp folders in the cluster. This should help S3 users.
Please share your feedback in this regard.
On the other hand, we created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. The impact will be noticed implicitly and over time.
---------------
New features
---------------
* New Scala-only functions that make it easier to work with Annotations in DataFrames. They may be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function. Check out the documentation. Possibly coming to Python later.
---------------
Bug fixes
---------------
* Fixed incorrect filesystem readings in some S3 environments for word embeddings
* Fixed NerCRF not correctly training from CoNLL, labeling everything as -O- (Thanks @arnound from the Slack channel)
---------------
Enhancements
---------------
* Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file. May help S3-based clusters with no fs.defaultFS set to proper distributed storage.
* New annotator type: CHUNK. Represents a SUBSTRING of DOCUMENT and is used as output from NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
This will make for better modularity and integration within various annotators, such as between NER and AssertionStatus.
* New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content.
* SentenceDetector's new explodeSentences param allows exploding sentences within a single row into different rows, increasing parallelism and performance in some scenarios, particularly OCR-based ones (see the sketch after this list).
* AssertionDLApproach may now be used within LightPipelines
* AssertionDLApproach and AssertionLogRegApproach now work from the CHUNK type instead of start/end bounds. They may still be trained with start/end, though. This means the target for assertion may now be any CHUNK output annotator (e.g. RegexMatcher)
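
A minimal Python sketch of the explodeSentences param (setter name assumed from the param name):

    from sparknlp.annotator import SentenceDetector

    sentence = (SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")
        # one sentence per row, improving parallelism downstream
        .setExplodeSentences(True))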
---------------
Other
---------------
* PerceptronApproachLegacy moved back to default PerceptronApproach. Distributed PerceptronApproach moved to PerceptronApproachDistributed due to not meeting accuracy expectations yet.
* Some configuration parameters in application.conf have been appropriately moved to proper annotator Params (NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector affected)
* application.conf configuration values renamed for better consistency
---------------
Developer API
---------------
* Added beforeAnnotate() and afterAnnotate() to manipulate dataframes before or after calling the annotate() UDF
* Added extraValidate() and extraValidateMsg() in all annotators to let developers add additional schema checks in the transformSchema() stage
* Removed the validation() step from the fit() stage. Allows for more flexible training when some of the columns are not really required yet.
* WrapColumnMetadata() wraps an Annotation column with its appropriate Metadata. Makes it easier not to forget about Metadata in the Schema.
* The RawAnnotator trait now has all the basics needed to start a new Annotator without the annotate() function. It is the complete stage preceding AnnotatorModel, which inherits from RawAnnotator.
========
1.6.0
========
---------------
Overview
---------------
We're late! But it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements and many bugfixes. First of all, we are thankful to the community for participating in Slack and on GitHub by reporting feedback and issues.
In this one, we have a new annotator, the Chunker, which allows grabbing pieces of text following a particular part-of-speech pattern.
On the other hand, we have a brand new OCR-to-Spark-DataFrame utility, which is bundled as an optional component of Spark-NLP. This one requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
Aside from that, we improved many areas, from the DocumentAssembler working better with OCR output, down to our Deep Learning models gaining better consistency and accuracy. Word-embedding-based annotators also received improvements when working in cluster environments.
Finally, we are glad a user contributed a fix to the AWS dependency issue, particularly happening in Cloudera environments. We're still waiting for feedback, and gladly accept it.
We'll be working on the documentation as this release follows. Thank you.
---------------
New Features
---------------
* New annotator: Chunker. This annotator takes regexes for part-of-speech tags and returns appropriate chunks of text following such patterns (see the sketch after this list)
* OCR to Spark-NLP: as an optional jar module, users may use the OcrHelper class to convert PDF files into a Spark Dataset, ready to be used by Spark-NLP's DocumentAssembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system.
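
A minimal Python sketch of the Chunker (the POS-tag regex is an example):

    from sparknlp.annotator import Chunker

    chunker = (Chunker()
        .setInputCols(["sentence", "pos"])
        .setOutputCol("chunk")
        # grab noun phrases: optional determiner, any adjectives, one or more nouns
        .setRegexParsers(["<DT>?<JJ>*<NN>+"]))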
---------------
Enhancements
---------------
* TextMatcher now has a caseSensitive (setCaseSensitive) Param which configures whether matching is case sensitive (ignored if the Normalizer already handled it). The returned word is still the original.
* LightPipelines in Python should now be faster thanks to an optimization of prefetching results into Python memory instead of py4j bridge
* LightPipelines can now handle embedded Pipelines
* PerceptronApproach now trains utilizing the full Spark distributed algorithm. Still experimental. PerceptronApproachLegacy may still be used, which might be better for local non-cluster setups.
* Tokenizer now has an 'includeDefaults' param which may be set to False to disable all preset rules.
* WordEmbedding-based annotators may now decide to normalize tokens before matching embeddings vectors through the 'useNormalizedTokensForEmbeddings' Param. Generally improves consistency with less overfitting.
* DocumentAssembler may now better deal with large amounts of text by using 'trimAndClearNewLines', to work better with OCR outputs and be better prepared for further sentence detection
* Improved SentenceDetector handling of enumerations and lists
* Slightly improved SentenceDetector performance through non-tail-recursive optimizations
* Finisher no longer has default delimiters when outputting to String (not Array) (thanks @S_L)
---------------
Bug fixes
---------------
* AWS library dependency conflict now resolved (thanks to @apiltamang for proposing the solution, and to the community for the follow-up). The solution is experimental; waiting for feedback.
* Fixed wrong order of Tokenizer's additionally added infixPatterns in Python (Thanks @sethah)
* Training annotators that use Word Embeddings in a distributed cluster does no longer throw file not found exceptions sporadically
* Fixed NerDLModel returning non-deterministic results during prediction
* Deep-Learning-based models and graphs may now run on CPU if trained on GPU and no GPU is available on the client
* WordEmbeddings temporary location is no longer in the HOME dir; moved to tmp.dir
* Fixed SentenceDetector incorrectly bounding sentences with non-English characters (Thanks @lorenz-nlp)
* Python Spark-NLP annotator models should now have all appropriate setter and getter functions for Params
* Fixed wrong format of column when showing metadata through Finisher's output as Array
* Added missing python Finisher's include metadata function (thanks @PinusSilvestris for reporting the bug)
* Fixed Symmetric Delete Spell Checker throwing wrong error when training with an empty dataset (Thanks @ankush)
---------------
Developer API
---------------
* Deep Learning models may now be read through SavedModelBundle API into Tensorflow for Java in TensorflowWrapper
* WordEmbeddings now allow checking if word exists with contains()
* Included a tool that converts text into CoNLL format for further labeling for training NER models
========
1.5.4
========
---------------
Overview
---------------
This release improves various annotators: the Normalizer, SymmetricDelete, TextMatcher, DocumentAssembler and Finisher,
allowing them to cover more use cases that were mentioned in our Slack channel. We also fixed two important bugs.
Finally, this is our first release with PIP support for Python sparknlp, for those who are entirely Python based.
---------------
Enhancements
---------------
* Normalizer now allows multiple to-delete regex patterns.
* Normalizer's slangDictionary param allows converting tokens into something else (e.g. 'lol' into 'laughing out loud') from a dictionary file (see the sketch after this list)
* The SymmetricDelete spell checker may now be trained from the dataset passed to fit() if an external corpus is not provided
* SymmetricDelete spell checker improved training and prediction performance
* Finisher's includeMetadata param now outputs annotation metadata content in both Array and String formats
* DocumentAssembler may now read from an Array[String] column if provided. This improves compatibility with some SparkML transformers
* TextMatcher now includes the identifier name in metadata
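
A minimal Python sketch of the slangDictionary param (the file path and delimiter are placeholders):

    from sparknlp.annotator import Normalizer

    normalizer = (Normalizer()
        .setInputCols(["token"])
        .setOutputCol("normalized")
        # slang.txt is a placeholder; each line maps a token, e.g. "lol,laughing out loud"
        .setSlangDictionary("slang.txt", ","))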
---------------
Bug fixes
---------------
* Fixed a bug introduced in 1.5.3 that made spark-nlp not work in Python 2 (thanks @surendralalwani)
* Fixed SymmetricDeleteApproach's wrong annotator type
---------------
Other
---------------
* setup.py for PIP support (instructions will be added to the readme and website). Still needs the spark-nlp jar in the SparkSession classpath.
========
1.5.3
========
---------------
Overview
---------------
This quick release is a hotfix for issues found in 1.5.2 after its release. Thanks to the users who quickly tested this out.
It fixes the Symmetric spell checker not being capable of reading the pretrained model, a missing SentenceDetector default value, and adds retroactive version matching to the downloader.
---------------
Bug fixes
---------------
* Fixed a bug causing the library to fail when trying to save or read an annotator with an unset Feature without default
* Added missing default Param value to SentenceDetector. Thanks @superman24-7
* The Symmetric spell checker now utilizes List instead of ListBuffer in its prediction layer
* Fixed Vivekn Sentiment Analysis failing when training with a sentiment column
---------------
Models
---------------
* Symmetric Spell Checker pretrained model now works well and may be downloaded
* Vivekn Sentiment pretrained model now defaults to "token" input column instead of "spell"
---------------
Other
---------------
* The downloader now works retroactively when a newer version finds a model from a previous release
* Renamed the downloader's folder argument to remote_loc for the remote location, which had caused confusion. Thanks @AtulSehgal
* Added a new Scala example in the example folder, also available on the website
========
1.5.2
========
---------------
Overview
---------------
This release focuses on improving model downloader stability, fixing word embedding reading issues and joining
the Spark ecosystem's filesystem configuration appropriately, utilizing Spark's defined default filesystem, in order to work
properly with clusters and multi-node environments. This includes Databricks cloud clusters and Amazon EMR YARN HDFS nodes.
Aside from that, we come with exciting new features: a brand new Spell Checker with higher accuracy, inspired by the
Symmetric Delete algorithm.
Finally, Assertion Status can now be trained and predicted on top of NER output, whereas before
this only worked by providing assertion status Start and End boundaries for the target to assert.
---------------
New Features
---------------
* Assertion status annotators can now be trained and predict against NER output instead of start and end boundaries. Entities can now be directly asserted
* Brand new Symmetric Delete annotator (SymmetricDeleteApproach) with close to state-of-the-art accuracy of 80% (see the sketch below)
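
A minimal Python sketch of the new annotator (the dictionary path is a placeholder; setter name assumed from the later-stabilized API):

    from sparknlp.annotator import SymmetricDeleteApproach

    spell = (SymmetricDeleteApproach()
        .setInputCols(["token"])
        .setOutputCol("spell")
        # corpus of correctly spelled words (placeholder path)
        .setDictionary("path/to/words.txt"))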
---------------
Enhancements
---------------
* The model downloader now uses the proper Spark filesystem. Works seamlessly with distributed storage, Databricks cloud clusters and Amazon EMR
* Fixed several race conditions when loading word embeddings from disk or downloading resources; the library is more stable
* Improved several assertion status validations and error messages
---------------
Bug fixes
---------------
* Standalone annotator models are now properly read from disk in Python
---------------
Models
---------------
* New Symmetric Delete Spell checker pretrained model
* Vivekn Sentiment annotator may now be downloaded standalone with pretrained()
========
1.5.1
========
---------------
Overview
---------------
This release is an enhancement release to 1.5.0 which includes improved downloader properties and better annotator defaults.
Also, assertion status models have been included as pretrained; these are models trained on top of GloVe Stanford word embeddings.
---------------
Enhancements
---------------
* SentenceDetector now has a useCustomOnly param which enforces using only the custom bounds provided (thanks @atomobianco)
* Normalizer now defaults to not lowercasing words, which leads to better implicit accuracy in pipelines (thanks @marek.modry)
* SpellChecker now defaults to being case sensitive, which leads to better accuracy
* DateMatcher improved speed performance
* com.johnsnowlabs.annotator._ in Scala now also includes RecursivePipelines and LightPipelines for easier imports
* ModelDownloader has been improved with better directory management
---------------
Models
---------------
* New Assertion Status (LogisticRegression and DeepLearning) pretrained models now available
* Vivekn, Basic and Advanced pretrained Pipelines improved accuracy (thanks @marek.modry)
---------------
Other
---------------
* S3 library dependencies updated
========
1.5.0
========
---------------
Overview
---------------
We are proud to announce what may be the biggest release in terms of content in Spark-NLP!
This release makes the library miles easier to use for newcomers, allowing easier importing of
annotators and extended use of the model downloader throughout pretrained models and pipelines.
It also includes two new annotators that use deep learning algorithms with graphs from TensorFlow, which
is a first for us.
Apart from this, we include new Light Pipelines that are 10x faster when working with data smaller than about
50,000 rows in length.
Finally, we included several bugfixes across the library, from algorithms to the developer API.
We'll gladly welcome any feedback! The website has been extensively updated.
---------------
New features
---------------
* Light Pipelines are annotator pipelines created from SparkML pipelines that run more than 10x faster on small datasets
* Deep Learning NER based on Bi-LSTM and Convolutional Neural Networks from word embeddings datasets
* Deep Learning Assertion Status model based on LSTM to compute status identification from word embeddings
* Easier to use Spark-NLP:
1. Imports have been made easy in the Scala API (com.johnsnowlabs.annotator._) to bring in all annotators
2. BasicPipeline and AdvancedPipeline downloadable pipelines created for quick annotation of text
3. Light Pipelines are easy to use and accept simple strings to annotate through a Spark ML Pipeline without Spark datasets
* New Downloadable models: CRF NER, Lemmatizer, POS and Spell checker
* New Downloadable pipelines: Vivekn Sentiment analysis, BasicPipeline and AdvancedPipeline
---------------
Enhancements
---------------
* Model downloader significantly improved in terms of usability
---------------
Documentation
---------------
* Website widely improved
* Added invite to our first slack chat channel
---------------
Bugfixes
---------------
* Fixed wrong positional index value when creating Annotations from the constructor
* Fixed hamming distance calculation in spell checker
* Fixed Downloadable NER model failing sporadically due to missing temporary files
* Fixed the SearchTrie algorithm used in TextMatcher (formerly EntityExtractor); thanks @avenka11 for reporting and proposing a solution
* Fixed some model deserialization issues happening on Windows
---------------
Other
---------------
* Thanks to @showy we have Travis CI automatic integration testing
* Finisher now outputs to array by default
* Training example resources removed in favor of using the model downloader more
========
1.4.2
========
---------------
Bugfixes
---------------
* Filesystem protocols now properly read across the library, fixed use case for S3:// protocol (thanks @avenka11)
* Library now works properly in Windows environments
* PySpark annotator param getters now work properly when retrieving default values
* Fixed stemmer serialization due to misspelled param name
* Fixed Tokenizer's infixPattern param name to infixPatterns, which had led to broken PySpark serialization of the param
* Added the missing addInfixPattern() function to PySpark, to allow adding patterns to the current value
* Model Downloader clearCache now properly removes both .zip files and extracted content
* Model Downloader is now capable of reading all types of models properly
* Added the missing clearCache function to PySpark
---------------
Developer API
---------------
* Function names in the model downloader code have been refactored consistently
---------------
Other
---------------
* RocksDB rolled back to previous version to support Windows