========
2.0.8
========
---------------
Overview
---------------
This release fixes a few small but meaningful issues that caused newly trained models to have internal compatibility issues.
---------------
Bugfixes
---------------
* Fixed wrong logic when checking whether embeddingsRef is being overwritten in a WordEmbeddingsModel
* Deleted unnecessary chunk index from tokens
* Fixed compatibility issues in some newly trained models where the Python API had pretrained models mismatching the Scala ones
========
2.0.7
========
---------------
Overview
---------------
This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs that depend
on the cluster configuration, such as Kryo serialization or non-default filesystems
---------------
Bugfixes
---------------
* Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
* NerDLModel was not properly reading user provided config proto bytes during prediction
* Improved the cluster embeddings error message to hint users running cluster mode without shared filesystems
* Removed lazy model downloading in PretrainedPipeline; the model is now downloaded at instantiation (see the sketch below)
* Fixed URI construction for cluster embeddings on non-defaultFS configurations, improving cluster compatibility
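
For reference, a minimal Python sketch of eager PretrainedPipeline usage ("explain_document_dl" is one of the published pipeline names; substitute any pretrained pipeline):

    from sparknlp.pretrained import PretrainedPipeline

    # the model now downloads at instantiation rather than lazily on first use
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    result = pipeline.annotate("Spark NLP ships pretrained pipelines.")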
========
2.0.6
========
---------------
Overview
---------------
Following 2.0.5 (read the notes below), this release fixes a bug that occurred when disabling the contrib param in NerDLApproach on non-Windows operating systems
---------------
Bugfixes
---------------
* Fixed NerDLApproach failing when training with setUseContrib(false)
========
2.0.5
========
---------------
Overview
---------------
This release bumps Spark NLP to Apache Spark 2.4.3 by default. Spark had been testing Scala 2.12 and is now back on 2.11, so this should be a stable release.
In this version, we fixed a series of pretrained models and focused on improving the flexibility of the NerDL annotator, which, based on user feedback, is arguably the most popular one.
Users can now point to graphs they create without having to recompile the library, and graph options, as well as whether to use TensorFlow contrib, are now user defined.
Particular thanks to @CyborgDroid for reporting important, well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and feedback; we always welcome more. Join us on Slack!
---------------
Enhancements
---------------
* ViveknSentiment annotator now includes confidence score in metadata
* NerDL now has setGraphFolder to allow a path to a folder with custom graphs generated using Python/TensorFlow code
* NerDL now has setConfigProtoBytes to allow users to submit their own serialized ConfigProto to the graph settings
* NerDLApproach now has setUseContrib to let users decide whether to use contrib during training. Contrib LSTM cells are proven to return more accurate results, but do not work on Windows yet (see the sketch after this list).
* Updated default TensorFlow settings to include GPU allow_growth by default and disabled the log device placement spamming message
* Spark version bumped to 2.4.3
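
A minimal Python sketch of the new NerDL training knobs (the graph folder path is a placeholder; setter names match the params above):

    from sparknlp.annotator import NerDLApproach

    ner = (NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])
        .setOutputCol("ner")
        .setLabelColumn("label")
        # folder holding custom TensorFlow graphs (placeholder path)
        .setGraphFolder("path/to/custom/graphs")
        # contrib LSTM cells are more accurate but unavailable on Windows
        .setUseContrib(False))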
---------------
Bugfixes
---------------
* Fixed contrib NerDL models not working properly in clusters such as Databricks (Thanks @CyborgDroid)
* Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
* Fixed DependencyParser pretrained models not working properly in Python
---------------
Models and Pipelines
---------------
* NerDL will download the noncontrib model if Windows is detected, for better compatibility
* Noncontrib versions of pipelines with NerDL have been uploaded, as well as new models. Check the documentation for the complete list
* Improved the error message shown when a user on Windows tries to load a contrib NerDL model
* Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
---------------
Developer API
---------------
* Embeddings in Python moved to the annotator module for consistency
* The SourceStream ResourceHelper class now properly handles cluster files for the Dependency Parser
* The metadata model reader now ignores empty lines instead of failing
* Unified the attribute name lang (instead of language) in the pretrained API
========
2.0.4
========
---------------
Overview
---------------
We are excited that the Spark NLP workshop (spark-nlp-workshop repository) has been so useful for many users.
We have also taken a step forward by moving the website's documentation to an easy-to-maintain wiki! The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.
---------------
Bugfixes
---------------
* Fixed DependencyParser and TypedDependencyParser working inaccurately
* Fixed a bug preventing the WordEmbeddingsModel class from loading in Python
* Fixed wrong pretrained model names preventing some pretrained models from working properly
* Fixed BertEmbeddings not being capable of loading from file due to a reader exception
---------------
Documentation
---------------
* Website documentation migrated to GitHub wiki page (WIP)
---------------
Developer API
---------------
* OcrHelper now reports failed file name when throwing exceptions (Thanks @kgeis)
* Fixed the Annotation function explodeAnnotations to consider scenarios where the output column is replaced
* Fixed Travis CI unit tests
========
2.0.3
========
---------------
Overview
---------------
Shortly after 2.0.2, a hotfix release was made to address two bugs that prevented users from using pretrained TensorFlow models in clusters.
Please read release notes for 2.0.2 to catch up!
---------------
Bugfixes
---------------
* Fixed the logger not being serializable, which caused issues when executors serialized TensorflowWrapper
* Fixed contrib loading in clusters when retrieving a TensorFlow session
========
2.0.2
========
---------------
Overview
---------------
Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better performing library, both in speed and in accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word embeddings continue to be improved for better performance and a lower memory footprint.
The Context Spell Checker continues to receive enhancements in concurrency and usage of Spark. Finally, TensorFlow-based annotators
have been significantly improved by refactoring the serialization design. Help us with feedback and we'll welcome any issue reports!
---------------
New Features
---------------
* The NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in metadata (see the sketch below)
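
A minimal Python sketch of the new param (setter name assumed from the param name):

    from sparknlp.annotator import NerCrfApproach

    ner_crf = (NerCrfApproach()
        .setInputCols(["sentence", "token", "pos", "embeddings"])
        .setOutputCol("ner")
        .setLabelColumn("label")
        # confidence scores for each prediction land in the annotation metadata
        .setIncludeConfidence(True))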
---------------
Enhancements
---------------
* Improved cluster mode performance in TensorFlow annotators by serializing internal information to bytes
* The Doc2Chunk annotator added new params startCol, startColByTokenIndex, failOnMissing and lowerCase, allowing better chunking of documents (see the sketch after this list)
* All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
* ContextSpellChecker now creates a window around the token to improve computation performance
* Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
* WordEmbeddings won't load twice if already loaded
* WordEmbeddings can use embeddingsRef if a source was not provided, improving reuse of embeddings in a pipeline
* The new WordEmbeddings param includeEmbeddings allows annotators not to save the entire embeddings source along with them
* Contrib TensorFlow dependencies now load only when necessary
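
A rough Python sketch of the new Doc2Chunk params (setter and column names assumed from the param names above):

    from sparknlp.base import Doc2Chunk

    doc2chunk = (Doc2Chunk()
        .setInputCols(["document"])
        .setOutputCol("chunk")
        .setChunkCol("target")            # column holding the chunk text
        .setStartCol("start")             # column holding the chunk's start position
        .setStartColByTokenIndex(True)    # interpret start as a token index
        .setFailOnMissing(False)          # tolerate rows where the chunk is absent
        .setLowerCase(True))              # match case-insensitively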
---------------
Bugfixes
---------------
* Added missing Symmetric delete pretrained model
* Fixed a broken param name in Normalizer (thanks @RobertSassen)
* Fixed Cloudera cluster support
* Fixed concurrent access in ContextSpellChecker in high partition number use cases and LightPipelines
* Fixed POS dataset creator to better handle corrupted pairs
* Fixed a bug in Word Embeddings not matching exact case sensitive tokens in some scenarios
* Fixed OCR Tess4J initialization problems in concurrent scenarios
---------------
Models and Pipelines
---------------
* Renaming of models and pipelines (work in progress)
* Better output column naming in pipelines
---------------
Developer API
---------------
* Unified more WordEmbeddings interface with dimension params and individual setters
* Improved unit tests for better compatibility on Windows
* Python embeddings moved to sparknlp.embeddings
========
2.0.1
========
---------------
Overview
---------------
Thanks for following up after our 2.0.0 release! This release covers a few holes left by the immense 2.0.0 release,
addressing high-priority issues found after it. More importantly, the library should now behave correctly when using
Spark cluster modes, and memory and CPU utilization should be reduced to normal levels after some serious profiling of serialization
revealed a bunch of problems. Aside from performance and resource management improvements, we include an OCR dependency handler in the start() function, as well
as improved GPU support for NER Deep Learning models. Finally, check out our spark-nlp-workshop repo; it has cool features!
---------------
Enhancements
---------------
* Improved serialization of Deep Learning models, showing performance boosts of up to 2.5x over 1.8.3
* TensorFlow contrib libraries are now managed correctly across a cluster
* Reverted the useFeatureBroadcasting change after internal benchmarks proved it performed better
* SparkNLP.start() and sparknlp.start() now accept an includeOCR parameter that automatically includes the OCR library (see the sketch after this list)
* Recreated NerDL graphs to allow TensorFlow GPU allow_growth, improving memory management on GPUs
* Expanded GPU coverage in the NerDL graph
* Reduced the NerDL batch size for better compatibility with GPUs
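
For reference, a minimal Python sketch of the start() helper with the OCR flag:

    import sparknlp

    # plain local Spark session
    spark = sparknlp.start()

    # local session that also pulls in the OCR dependencies
    spark_ocr = sparknlp.start(include_ocr=True)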
---------------
Bugfixes
---------------
* Fixed deep learning models not working across clusters due to a bug in inputBuffers from graph reading
* Fixed a bug in the POS() training function which did not work correctly from Python
* Fixed a bug in OCR where page number and intersection were not correctly matched
* Correctly handle exceptions when training Norvig and Symmetric Spell Checkers from dataframes
---------------
Developer API
---------------
* ContextSpellChecker now follows Features API correctly
---------------
Documentation
---------------
* The spark-nlp-workshop repository has been expanded with better documentation and new notebooks
* We are still catching up with the 2.x release!
========
2.0.0
========
---------------
Overview
---------------
Thank you for following up with the biggest changelog ever for Spark NLP: Spark NLP 2.0.0! Where to begin?
We have no fewer than 50 pull requests merged this time. Most importantly, we have become the first library to have a production-ready
implementation of BERT embeddings. Along with this interesting deep learning, context-based embeddings algorithm, here is a quick overview of new things:
* Word Embeddings as well as BERT Embeddings are now annotators, just like any other component in the library. This means embeddings can be
cached in memory through DataFrames, saved on disk and shared as part of pipelines!
* We revamped and enhanced Named Entity Recognition (NER) Deep Learning models to a new state-of-the-art level, reaching up to 93% F1 micro-averaged accuracy on the industry standard.
* We upgraded the TensorFlow version and also started using contrib LSTM cells.
* Performance and memory usage improvements also tag along, by improving the serialization throughput of Deep Learning annotators based on feedback from Apache Spark contributor Davies Liu.
* We revamped and expanded our pretrained pipelines list, plus added new pretrained models for different languages, together with
tons of new example notebooks, which include changes that aim to make the library easier to use. The overall API was modified to help newcomers get started.
* The OCR module comes with a handful of improvements that increase accuracy.
All of this comes together with a full range of bug fixes and annotator improvements; follow the details below!
Bear with us, since documentation is still catching up a bit, as are new models yet to be made available. Stay tuned on Slack!
----------------
New Features
----------------
* BertEmbeddings annotator, with four Google-ready models usable through Spark NLP as part of your pipelines; includes WordPiece tokenization (see the sketch after this list)
* WordEmbeddings, our previous embeddings system, is now an annotator that can be serialized along Spark ML pipelines
* Created training helper functions that create Spark datasets from files, such as CoNLL and POS tagging
* NER DL has been revamped by using contrib LSTM cells. Added library handling for different operating systems.
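
A minimal Python sketch of BertEmbeddings as a pipeline stage (using the default pretrained model):

    from sparknlp.annotator import BertEmbeddings

    # downloads the default pretrained BERT model
    bert = (BertEmbeddings.pretrained()
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings"))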
----------------
Enhancements
----------------
* OCR improved handling of images by adding binarization of buffered segments
* OCR now allows automatic adaptive scaling
* SentenceDetector params merged between DL and Rule based annotators
* SentenceDetector max length has been disabled by default, and now truncates by whitespace
* Part of Speech, NER, Spell Checking and Vivekn Sentiment Analysis annotators now train from the dataset passed to fit(), using Spark in the process
* Tokens and Chunks now hold metadata information regarding which sentence they belong to, by sentence ID
* AnnotatorApproach annotators now allow a trainingCols param, letting them use different inputs in training and in prediction. Improves Pipeline versatility.
* LightPipelines now allow the transform() method to be called against a DataFrame (see the sketch after this list)
* Noticeable performance gains by improving serialization performance in annotators through removal of transient variables
* Spark NLP in 30 seconds: the new functions SparkNLP.start() (Scala) and sparknlp.start() (Python) automatically create a local Spark session.
* Improved DateMatcher accuracy
* Improved Normalizer annotator by supporting and tokenizing a slang dictionary, with case sensitivity matching option
* ContextSpellChecker is now capable of handling multiple sentences in a row
* The PretrainedPipeline feature now allows handling John Snow Labs remote pretrained pipelines, making it easy to update and access new models
* Symmetric Delete spell checking model improved training performance
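
A minimal Python sketch of the LightPipeline usage described above (pipeline_model and input_df are assumed to exist):

    from sparknlp.base import LightPipeline

    # pipeline_model is a fitted Spark ML PipelineModel
    light = LightPipeline(pipeline_model)

    # fast, single-machine annotation of plain strings
    annotations = light.annotate("Spark NLP annotates text quickly.")

    # transform() may now be called against a DataFrame as well
    result_df = light.transform(input_df)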
----------------
Models and Pipelines
----------------
* Added more than 15 pretrained pipelines that cover a huge range of use cases. To be documented
* Improved multi-language support by adding French and Italian pipelines and models. More to come!
* Dependency Parser annotators now include a pretrained English model based on CoNLL-U 2009
----------------
Bugfixes
----------------
* Fixed Python classname reference when deserializing pipelines
* Fixed serialization in ContextSpellChecker
* Fixed a bug in LightPipeline causing it not to include output from embedded pipelines in a PipelineModel
* Fixed a wrong param name in DateMatcher that prevented accessing it properly
* Fixed a bug where DateMatcher didn't know how to handle dashes in dates where the year had two digits instead of four
* Fixed a ContextSpellChecker bug that prevented it from being used repeatedly with collections in LightPipeline
* Fixed a bug in OCR that made it blow up with some image formats when using the text-preferred method
* Fixed a bug in OCR which made params not work in cluster mode
* Fixed OCR setSplitPages and setSplitRegions to work properly when Tesseract detects multiple regions
----------------
Developer API
----------------
* AnnotatorType params renamed to inputAnnotatorTypes and outputAnnotatorTypes
* Embeddings now serialize along with a FloatArray in the Annotation class
* Disabled useFeatureBroadcasting, which showed better performance numbers when training large models in annotators that use Features
* OCR must now be instantiated
* OCR works best with Tesseract 4.0.0-beta.1
----------------
Build and release
----------------
* Added GPU build with tensorflow-gpu to Maven coordinates
* Removed .jar file from pip package
========
1.8.3
========
---------------
Overview
---------------
We're glad to announce a new release for Spark NLP. This one recognizes the community, which has contributed
immensely by reporting bugs and giving feedback on the library. This release focuses on various bugfixes around the DeepSentenceDetector
and on Python deserialization of some specific pipelines. It also improves the DeepSentenceDetector, allowing further fine-tuning
and customization. Embeddings are now cached in the models folder, with further improvements towards accessing
them through S3 storage. Finally, we have made serious improvements to the notebooks and documentation around the library.
Special thanks to @Tshimanga and @haimco10 for very interesting contributions. See you on Slack!
---------------
Enhancements
---------------
* Improved OCR performance in skew detection
* SentenceDetector now better handles single quote protections (Thanks @haimco10)
* DeepSentenceDetector now can explodeSentences (Thanks @Tshimanga from Deep6.ai)
* EmbeddingsHelper is now capable of caching downloaded embeddings to avoid re-downloading
* The application.conf file may now be read from an S3 location
* DeepSentenceDetector now has access to all pragmatic SentenceDetector params in order to fine-tune it
---------------
Bugfixes
---------------
* Fixed ambiguous classpath resolution in pyspark, causing errors in deserializing some models
* Fixed DeepSentenceDetector not being deserializable in PySpark
* Fixed Chunk2Doc and Doc2Chunk annotators not being loadable in PySpark
* Fixed a bug where DeepSentenceDetector wouldn't correctly denote start and end offsets (Thanks @Tshimanga from Deep6.ai)
* Fixed a bug where DeepSentenceDetector would miss sentence parts when NER model missed header sentence (Thanks @Tshimanga from Deep6.ai)
* Cleaned and optimized DeepSentenceDetector code (Thanks @danilojsl)
* Fixed a missing dependency for OCR
---------------
Documentation and notebooks
---------------
* Added support and instructions for Anaconda deployment (Thanks @Maziyar)
* Updated various python notebooks to show utilization of spark packages instead of jars
* Added a new conference talk with Spark NLP in French at XebiCon'18
* Updated documentation towards less use of jars in favor of dependency solving
========
1.8.2
========
---------------
Overview
---------------
This release targets improved performance and resource usage in some pipelines that use word embeddings. It also comes
with a very interesting auto-rotation feature in OCR, and a couple of new annotators to solve particular needs, including the ChunkTokenizer
and a Param to limit sentence lengths. Finally, we are starting to organize our multilingual store of models and data for training models.
Check the examples for some Italian notebooks! Thanks again to the whole community for such quick feedback all the time.
---------------
New Features
---------------
* OCR now capable of automatic rotation, significantly improving accuracy in some scenarios
* ChunkTokenizer is a new annotator that tokenizes CHUNK type annotations. It extends the Tokenizer algorithm and stores the chunk ID for reference.
* SentenceDetector's new maxLength param now cuts off sentences longer than (by default) 240 characters. It avoids Deep Learning annotator issues and may improve performance in some scenarios.
* NerConverter's new whiteList param now allows a list of NER labels to be considered, while discarding the rest. May be useful for selective CHUNKing pipelines (see the sketch below).
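
A minimal Python sketch of the whiteList param (the labels are examples; setter name assumed from the param name):

    from sparknlp.annotator import NerConverter

    converter = (NerConverter()
        .setInputCols(["sentence", "token", "ner"])
        .setOutputCol("ner_chunk")
        # keep only these entity labels; everything else is discarded
        .setWhiteList(["PER", "LOC"]))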
---------------
Enhancements
---------------
* Pipelines using word embeddings should now perform faster thanks to a group of RocksDB optimizations allowing annotators to reuse currently open connections to the DB
---------------
Bugfixes
---------------
* Fixed a bug where DeepSentenceDetector was missing the load() interface (Thanks @Tshimanga from Deep6!)
* Fixed a bug where RocksDB opened too many files at once causing pipelines to fail or to work very slowly
* Fixed NerCrfModel prefetching RocksDB, which caused slower performance
---------------
Framework
---------------
* Added missing artifact resolution dependencies for OCR Module
* Started adding and organizing multilanguage models (Thanks @maziyarpanahi)
* Updated RocksDB to 5.17.2
========
1.8.1
========
---------------
Overview
---------------
This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating graphs for NerDL and for creating your own metadata files for a private model downloader.
As a new feature, we are including a new experimental machine learning based sentence detector, which uses NER for bounds detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy! And thanks again for the community contributions!
---------------
New Features
---------------
* New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection
---------------
Enhancements
---------------
* Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
* The OCR process now defaults to splitting content into rows when paragraphs or pages are identified, for improved parallelism. This may be turned off.
---------------
Examples and use cases
---------------
* Added Scala examples for sentiment analysis and the Lemmatizer in Italian (thanks Vincenzo Gaudenzi from DXC.technology for the dataset and model contribution!)
---------------
Bugfixes
---------------
* Fixed a bug in the Norvig and Symmetric spell checkers where the pattern parameter was not provided properly on the Scala side (Thanks @johnmccain for reporting!)
---------------
Framework
---------------
* Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)
---------------
Other
---------------
* Code for the metadata files used by the pretrained model downloader is now included. This may be useful if anyone wants to set up their own private local model downloader service
* NerDL graph generation code is now included in the library. This allows the usage of custom word embedding dimensions and feature counts.
---------------
Special mentions
---------------
* Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models, and @maziyar for creating examples with them.
* @correlator from Deep6.ai for contributing feedback in slack and features feedback in general
* @johnmccain for reporting bugs in spell checker
* @rohit-nlp for delivering maven coordinates for OCR
* @haimco10 for contributing a sentence detector improvement for an apostrophe use case. Not merged due to specific issues involved.
========
1.8.0
========
---------------
Overview
---------------
This release is huge! Spark-NLP made the leap into Spark 2.4.0, even with the challenge of not everyone being on board yet (i.e. Zeppelin doesn't support it yet).
In this version we release three new NLP annotators: two for dependency parsing processes and one for contextual deep learning based spell checking.
We also significantly improved OCR functionality, fine-tuning capabilities and general output performance, particularly on Tesseract.
Finally, there are plenty of bug fixes and improvements in the word embeddings field, along with performance boosts and reduced disk IO.
Feel free to send us any feedback you have! Particularly on your Spark 2.4.x experience.
---------------
New Features
---------------
* Built on top of Spark 2.4.0
* The Dependency Parser annotator allows for sentence relationship encoding (see the sketch after this list)
* The Typed Dependency Parser annotator allows for labeling relationships within dependency tags
* ContextSpellChecker is our first Deep Learning based spell checker, which evaluates context and not only tokens
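
A rough Python sketch of wiring the two new parsers together (training file paths are placeholders; setter names assumed from the later-stabilized API):

    from sparknlp.annotator import (DependencyParserApproach,
                                    TypedDependencyParserApproach)

    # unlabeled dependencies, trained from a treebank (placeholder path)
    dep = (DependencyParserApproach()
        .setInputCols(["sentence", "pos", "token"])
        .setOutputCol("dependency")
        .setDependencyTreeBank("path/to/treebank"))

    # labeled (typed) dependencies, trained from CoNLL 2009 data (placeholder path)
    typed = (TypedDependencyParserApproach()
        .setInputCols(["token", "pos", "dependency"])
        .setOutputCol("labdep")
        .setConll2009("path/to/conll2009"))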
---------------
Enhancements
---------------
* More OCR parameters exposed for further fine-tuning, including preferred method priority and page segmentation modes
* OCR now has a setSplitPages() setting which allows choosing whether to output one page per row or the entire document instead
* Improved word embeddings performance when working on local filesystems
* Reduced the amount of disk IO when working with word embeddings
* All python notebooks improved for better readability and better documentation
* Simplified PySpark interface API
* CoNLLGenerator utility class, which helps build CoNLL-2003 files for NER training
* EmbeddingsHelper now allows reading word embeddings files directly from s3a:// paths
---------------
Bugfixes
---------------
* Solved race condition issues regarding cluster usage of the RocksDB index for embeddings
* Fixed an application.conf reading bug which didn't properly refresh AWS credentials
* The RocksDB index no longer uses compression, in order to support Windows without native RocksDB compression libraries
* Fixed various Python default parameter settings
* Fixed circular dependency with jbig pdfbox image OCR
---------------
Deprecations
---------------
* DeIdentification annotator is no longer supported in the open source version of Spark-NLP
* AssertionStatus annotator is no longer supported in the open source version of Spark-NLP
========
1.7.3
========
---------------
Overview
---------------
This hotfix release focuses on fixing word-embeddings cluster problems on some frameworks such as Databricks, while keeping 1.7.x performance benefits. Various YARN based clusters have been tested, Databricks cloud among them, to verify this hotfix.
Aside from that, multiple improvements have been committed towards better support of PySpark-NLP, fixing diverse technical issues in the API that help consistency in the Annotator superclasses.
Finally, PIP installation has been made easier with a SparkNLP class that creates the SparkSession automatically, for those who are learning Python Spark on their local computers.
Thanks to all the community for reporting issues.
---------------
Bugfixes
---------------
* Fixed 'RocksDB not serializable' when running LightPipeline scenarios or using _.functions implicits
* Fixed dependency with apache.commons.codec causing Apache Zeppelin 0.8.0 not to work in %pyspark
* Fixed Python pretrained() downloader not correctly setting Params and incorrectly creating new Model UIDs
* Fixed error 'JavaPackage not callable' when using AnnotatorModel.load() API without instantiating the class first
* Fixed Spark addFiles missing local files, causing Word Embeddings not to work properly in some cluster-based frameworks
* Fixed broadcast NoSuchElementException `Failed to get broadcast_6_piece0 of broadcast_6` causing pretrained models not work in cluster frameworks (thanks @EnricoMi)
---------------
Developer API
---------------
* EmbeddingsHelper.setRef() has been removed. The reference is now set implicitly through EmbeddingsHelper.load(), and embeddings no longer need to be loaded before deserializing models.
* Fixed and properly renamed the chunk2doc and doc2chunk transformers; they should now work as expected
* Renamed setCompositeTokens to setCompositeTokensPatterns to remind users that regexes are used in this Param
* Fixed PySpark automatic getter and setter Param generation when using pretrained() or load() models
* Simplified cluster path resolution for word embeddings
---------------
Other
---------------
* sparknlp.base now contains a SparkNLP() class which automatically creates a SparkSession using appropriate jar settings. Helps newcomers get started in PySpark NLP.
========
1.7.2
========
---------------
Overview
---------------
A quick release with another hotfix, due to a newly found bug when deserializing word embeddings on a distributed filesystem. It also introduces changes to the application.conf reader in order
to allow run-time changes, and renaming in the EmbeddingsHelper API.
---------------
Bugfixes
---------------
* Fixed embeddings deserialization from distributed filesystems (caused by the Windows path fix)
* Fixed application.conf not reading changes at runtime
* Added missing remote_locs argument in python pretrained() functions
* Fixed wrong build version introduced in 1.7.1 to detect proper pretrained models version
---------------
Developer API
---------------
* Renamed EmbeddingsHelper functions for more convenience
========
1.7.1
========
---------------
Overview
---------------
Thanks to our Slack community (Bryan Wilkinson, @maziyarpanahi, @apiltamang), a few bugs were pointed out very quickly after the 1.7.0 release. This hotfix fixes an embeddings deserialization issue when cache_pretrained is located on a distributed filesystem.
It also fixes some path resolution on Windows. Thanks to Maziyar, .gitattributes has been added in order to identify proper languages on GitHub.
Finally, 1.7.1 adds an annotator missing from 1.7.0, Chunk2Doc, which converts CHUNK types into DOCUMENT types for further retokenization or other annotations.
---------------
Enhancements
---------------
* Chunk2Doc annotator converts annotatorType from CHUNK to DOCUMENT
---------------
Bugfixes
---------------
* Fixed embedding-based annotators' deserialization error when cache_pretrained is on a distributed fs (Thanks Bryan Wilkinson for pointing out the issue and testing the fix)
* Fixed Windows path reading when deserializing embeddings (Thanks @apiltamang)
---------------
Other
---------------
* .gitattributes added in order to properly discard Jupyter as the main language of the GitHub repo (thanks @maziyarpanahi)
========
1.7.0
========
---------------
Overview
---------------
Having multiple annotators that use the same word embeddings set may result in huge pipelines and high driver memory and storage consumption.
From now on, embeddings may be shared and reutilized across annotators, making the process much more efficient.
Also, thanks to @apiltamang, we now better support path resolution for Windows implementations.
---------------
Enhancements
---------------
Memory and storage savings: annotators with embeddings, through the params 'includeEmbeddings' and 'embeddingsRef', can now set whether embeddings should be included when saved, or referenced by ID from other annotators (see the sketch below)
The EmbeddingsHelper class allows embeddings management
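
A rough Python sketch of the idea (setter names assumed from the param names above; 'my_glove' is a hypothetical reference id; the params live on annotators that consume embeddings, e.g. NerDLApproach):

    from sparknlp.annotator import NerDLApproach

    ner = (NerDLApproach()
        .setIncludeEmbeddings(False)    # don't bundle the embeddings when saving
        .setEmbeddingsRef("my_glove"))  # reference embeddings loaded elsewhere by id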
---------------
Bug fixes
---------------
Thanks to @apiltamang for improving URI path support for Windows Servers
---------------
Developer API
---------------
Embeddings interfaces and method names completely refactored, hopefully simplified and easier to understand
========
1.6.3
========
---------------
Overview
---------------
This release includes a new annotator for de-identification of sensitive information. It uses CHUNK annotations, meaning its accuracy will depend on the previous annotators in the pipeline.
Also, OCR capabilities have been improved in the OCR module.
In terms of broken stuff, we've fixed a few annoying bugs in the SymmetricDelete and SentenceDetector explode features.
Finally, pip is now part of the official repositories, meaning you can install it just like any other module. It also includes the jars, and we've added a SparkNLP class which creates the SparkSession easily for you.
Thanks again for all community contribution in issues, feedback and comments in GitHub and in Slack.
---------------
New features
---------------
* DeIdentification annotator: takes DOCUMENT and TOKEN from the original sentence, plus a CHUNK annotation, to anonymize the target chunk in the sentence. The CHUNK annotation might come from NerConverter, TextMatcher or other chunk annotators.
---------------
Enhancements
---------------
* Kernel zoom and region erosion improve overall detection quality. Fixed some stability bugs. Improved parallelism
---------------
Bug fixes
---------------
* Sentence Detector exploding sentences into rows now works properly
* Fixed the dictionary-based sentiment detector not working on PySpark
* Added missing NerConverter to annotator._ imports
* Fixed SymmetricDelete spell checker deleting tokens in some scenarios
* Fixed SymmetricDelete spell checker's unwanted lower-casing
---------------
Other
---------------
* The PySpark package is now part of the official pip repositories
* Pip installation now includes the corresponding spark-nlp jar. The base module includes a SparkNLP SparkSession creator
========
1.6.2
========
---------------
Overview
---------------
In this release, we focused on reviewing our streaming performance by measuring the number of sentences processed per second through a LightPipeline.
We increased Norvig Spell Checker throughput by more than 300% by disabling DoubleVariants and improving algorithm order. It is now reported capable of 42K sentences per second.
The Symmetric Delete spell checker is more performant, although it has been reported to process 2K sentences per second.
NerCRF has been reported to process 300 sentences per second, while NerDL can go twice as fast (about 700 sentences per second).
Vivekn Sentiment Analysis was improved and is now capable of processing 100K sentences per second (before, it was below 500).
Finally, SentenceDetector performance was improved by 40%, from ~30K rows processed per second to ~40K. But we have now enabled abbreviation processing by default, which reduces the final speed to 22K rows per second: a net decrease, but with better accuracy.
Again, thanks to the community for helping with feedback. We welcome everyone asking questions or giving feedback in our Slack channel, or reporting issues on GitHub.
---------------
Enhancements
---------------
* OCR now features kernel segmentation, significantly improving image-based PDF processing
* Vivekn Sentiment Analysis prediction performance improved by better data structures
* Both Norvig and Symmetric Delete spell checkers now have improved performance
* SentenceDetector improved accuracy by better handling abbreviations. useAbbreviations is now also turned on by default
* SentenceDetector performance improved significantly through better preloading of rules
---------------
Bug fixes
---------------
* Fixed NerDL not training correctly (broken since 1.6.0). Pretrained models not affected
* Fixed NerConverter not properly considering multiple sentences per row (after using SentenceDetector), causing an unhandled exception to occur in some scenarios.
* Tensorflow sessions now all support allow_soft_placement, supporting GPU based graphs to work with and without GPU
* Norvig Spell Checker: fixed a missing step in the algorithm that checks for additional variants. May improve accuracy
* Norvig Spell Checker: disabled DoubleVariants by default. It was not improving accuracy significantly and was hitting performance very hard
---------------
Developer API
---------------
* New FeatureSet allows HashSet params
---------------
Models
---------------
* The Vivekn Sentiment pipeline no longer includes a Spell Checker
* Fixed the Vivekn Sentiment pretrained model, improving accuracy
========
1.6.1
========
---------------
Overview
---------------
Hi! We're glad to announce hotfix 1.6.1. Although the changes seem modest or very specific, there is a lot going on under the hood. First of all, we've worked hard with the community to understand S3-based clusters,
which don't have a common fs.defaultFS configuration, which is what we use to tell where the cluster temp folder is located in order to distribute word embeddings. We fixed two things here:
on one side, we fixed a bug pointing to the wrong filesystem. Second, we added a custom override setting in application.conf that allows manually setting where to put temp folders in the cluster. This should help S3 users.
Please share your feedback in this regard.
On the other hand, we created a new annotator type internally. The CHUNK type allows better modularity in the communication between different annotators. The impact will be noticed implicitly and over time.
---------------
New features
---------------
* New Scala-only functions that make it easier to work with Annotations in DataFrames. They may be imported through com.johnsnowlabs.nlp.functions._ and allow mapping and filtering within and outside Annotations.
filterByAnnotations, mapAnnotations and explodeAnnotations work by providing a column and a function. Check out the documentation. Possibly coming to Python later.
---------------
Bug fixes
---------------
* Fixed incorrect filesystem readings in some S3 environments for word embeddings
* Fixed NerCRF not correctly training from CoNLL, labeling everything as -O- (Thanks @arnound from the Slack channel)
---------------
Enhancements
---------------
* Added overridable config sparknlp.settings.cluster_tmp_dir, which allows setting the cluster location for the temporary embeddings file. May help S3-based clusters with no fs.defaultFS set to proper distributed storage.
* New annotator type: CHUNK. Represents a SUBSTRING of DOCUMENT and is used as output from NerConverter, TextMatcher, RegexMatcher and other annotators that retrieve a substring from the original document.
This will make for better modularity and integration within various annotators, such as between NER and AssertionStatus.
* New annotation transformer: ChunkAssembler. Takes a string or array(string) column from a dataset and creates a CHUNK type annotation. The content must also belong to the current DOCUMENT annotation's content.
* SentenceDetector's new explodeSentences param allows exploding sentences within a single row into different rows, increasing parallelism and performance in some scenarios, particularly OCR-based ones (see the sketch after this list).
* AssertionDLApproach may now be used within LightPipelines
* AssertionDLApproach and AssertionLogRegApproach now work from the CHUNK type instead of start/end bounds. They may still be trained with start/end, though. This means the target for assertion may now be any CHUNK output annotator (e.g. RegexMatcher)
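
A minimal Python sketch of the explodeSentences param (setter name assumed from the param name):

    from sparknlp.annotator import SentenceDetector

    sentence = (SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")
        # one sentence per row, improving parallelism downstream
        .setExplodeSentences(True))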
---------------
Other
---------------
* PerceptronApproachLegacy moved back to default PerceptronApproach. Distributed PerceptronApproach moved to PerceptronApproachDistributed due to not meeting accuracy expectations yet.
* Some configuration parameters in application.conf have been appropriately moved to proper annotator Params (NorvigSweeting Spell Checker, Vivekn Approach and Sentiment Detector affected)
* application.conf configuration values renamed for better consistency
---------------
Developer API
---------------
* Added beforeAnnotate() and afterAnnotate() to manipulate dataframes before or after calling the annotate() UDF
* Added extraValidate() and extraValidateMsg() in all annotators to let developers add additional schema checks in the transformSchema() stage
* Removed the validation() step from the fit() stage. Allows for more flexible training when some of the columns are not really required yet.
* WrapColumnMetadata() wraps an Annotation column with its appropriate Metadata. Makes it easier not to forget about Metadata in the Schema.
* The RawAnnotator trait now has all the basics needed to start a new Annotator without the annotate() function. It is the complete stage preceding AnnotatorModel, which inherits from RawAnnotator.
========
1.6.0
========
---------------
Overview
---------------
We're late! But it was worth it. We're glad to release 1.6.0, which brings new features, lots of enhancements and many bugfixes. First of all, we are thankful to the community for participating in Slack and on GitHub by reporting feedback and issues.
In this one, we have a new annotator, the Chunker, which allows grabbing pieces of text following a particular part-of-speech pattern.
On the other hand, we have a brand new OCR-to-Spark-DataFrame utility, which is bundled as an optional component of Spark-NLP. This one requires Tesseract 4.x+ to be installed on your system, and may be downloaded from our website or readme pages.
Aside from that, we improved many areas, from the DocumentAssembler working better with OCR output, down to our Deep Learning models gaining better consistency and accuracy. Word-embedding-based annotators also received improvements when working in cluster environments.
Finally, we are glad a user contributed a fix to the AWS dependency issue, particularly happening in Cloudera environments. We're still waiting for feedback, and gladly accept it.
We'll be working on the documentation as this release follows. Thank you.
---------------
New Features
---------------
* New annotator: Chunker. This annotator takes regexes for part-of-speech tags and returns appropriate chunks of text following such patterns (see the sketch after this list)
* OCR to Spark-NLP: as an optional jar module, users may use the OcrHelper class to convert PDF files into a Spark Dataset, ready to be used by Spark-NLP's DocumentAssembler. May be used without Spark-NLP. Requires Tesseract 4.x on your system.
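
A minimal Python sketch of the Chunker (the POS-tag regex is an example):

    from sparknlp.annotator import Chunker

    chunker = (Chunker()
        .setInputCols(["sentence", "pos"])
        .setOutputCol("chunk")
        # grab noun phrases: optional determiner, any adjectives, one or more nouns
        .setRegexParsers(["<DT>?<JJ>*<NN>+"]))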
---------------
Enhancements
---------------
* TextMatcher now has a caseSensitive (setCaseSensitive) Param which configures whether matching is case sensitive (ignored if the Normalizer already handled it). The returned word is still the original.
* LightPipelines in Python should now be faster thanks to an optimization of prefetching results into Python memory instead of py4j bridge
* LightPipelines can now handle embedded Pipelines
* PerceptronApproach now trains utilizing the full Spark distributed algorithm. Still experimental. PerceptronApproachLegacy may still be used, which might be better for local non-cluster setups.
* Tokenizer now has an 'includeDefaults' param which may be set to False to disable all preset rules.
* WordEmbedding-based annotators may now decide to normalize tokens before matching embeddings vectors through the 'useNormalizedTokensForEmbeddings' Param. Generally improves consistency with less overfitting.
* DocumentAssembler may now better deal with large amounts of text by using 'trimAndClearNewLines', to work better with OCR outputs and be better prepared for further sentence detection
* Improved SentenceDetector handling of enumerations and lists
* Slightly improved SentenceDetector performance through non-tail-recursive optimizations
* Finisher no longer has default delimiters when outputting to String (not Array) (thanks @S_L)
---------------
Bug fixes
---------------
* AWS library dependency conflict now resolved (thanks to @apiltamang for proposing the solution, and to the community for the follow-up). The solution is experimental; waiting for feedback.
* Fixed wrong order of Tokenizer's additionally added infixPatterns in Python (Thanks @sethah)
* Training annotators that use Word Embeddings in a distributed cluster does no longer throw file not found exceptions sporadically
* Fixed NerDLModel returning non-deterministic results during prediction
* Deep-Learning-based models and graphs may now run on CPU if trained on GPU and no GPU is available on the client
* WordEmbeddings temporary location is no longer in the HOME dir; moved to tmp.dir
* Fixed SentenceDetector incorrectly bounding sentences with non-English characters (Thanks @lorenz-nlp)
* Python Spark-NLP annotator models should now have all appropriate setter and getter functions for Params
* Fixed wrong format of column when showing metadata through Finisher's output as Array
* Added missing python Finisher's include metadata function (thanks @PinusSilvestris for reporting the bug)
* Fixed Symmetric Delete Spell Checker throwing wrong error when training with an empty dataset (Thanks @ankush)
---------------
Developer API
---------------
* Deep Learning models may now be read through SavedModelBundle API into Tensorflow for Java in TensorflowWrapper
* WordEmbeddings now allow checking if word exists with contains()
* Included a tool that converts text into CoNLL format for further labeling for training NER models
========
1.5.4
========
---------------
Overview
---------------
This release improves various annotators: the Normalizer, SymmetricDelete, TextMatcher, DocumentAssembler and Finisher,
allowing them to cover more use cases that were mentioned in our Slack channel. We also fixed two important bugs.
Finally, this is our first release with PIP support for Python sparknlp, for those who are entirely Python based.
---------------
Enhancements
---------------
* Normalizer now allows multiple to-delete regex patterns.
* Normalizer's slangDictionary param allows converting tokens into something else (e.g. 'lol' into 'laughing out loud') from a dictionary file (see the sketch after this list)
* The SymmetricDelete spell checker may now be trained from the dataset passed to fit() if an external corpus is not provided
* SymmetricDelete spell checker improved training and prediction performance
* Finisher's includeMetadata param now outputs annotation metadata content in both Array and String formats
* DocumentAssembler may now read from an Array[String] column if provided. This improves compatibility with some SparkML transformers
* TextMatcher now includes the identifier name in metadata
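
A minimal Python sketch of the slangDictionary param (the file path and delimiter are placeholders):

    from sparknlp.annotator import Normalizer

    normalizer = (Normalizer()
        .setInputCols(["token"])
        .setOutputCol("normalized")
        # slang.txt is a placeholder; each line maps a token, e.g. "lol,laughing out loud"
        .setSlangDictionary("slang.txt", ","))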
---------------
Bug fixes
---------------
* Fixed a bug introduced in 1.5.3 that made spark-nlp not work in Python 2 (thanks @surendralalwani)
* Fixed SymmetricDeleteApproach's wrong annotator type
---------------
Other
---------------
* setup.py for PIP support (instructions will be added to the readme and website). Still needs the spark-nlp jar in the SparkSession classpath.
========
1.5.3
========
---------------
Overview
---------------
This quick release is a hotfix for issues found in 1.5.2 after its release. Thanks to the users who quickly tested this out.
It fixes the Symmetric spell checker not being capable of reading the pretrained model, a missing SentenceDetector default value, and adds retroactive version matching to the downloader.
---------------
Bug fixes
---------------
* Fixed a bug causing the library to fail when trying to save or read an annotator with an unset Feature without default
* Added missing default Param value to SentenceDetector. Thanks @superman24-7
* The Symmetric spell checker now utilizes List instead of ListBuffer in its prediction layer
* Fixed Vivekn Sentiment Analysis failing when training with a sentiment column
---------------
Models
---------------
* Symmetric Spell Checker pretrained model now works well and may be downloaded
* Vivekn Sentiment pretrained model now defaults to "token" input column instead of "spell"
---------------
Other
---------------
* The downloader now works retroactively when a newer version finds a model from a previous release
* Renamed the downloader's folder argument to remote_loc for the remote location, which had caused confusion. Thanks @AtulSehgal
* Added a new Scala example in the example folder, also available on the website
========
1.5.2
========
---------------
Overview
---------------
This release focuses on improving model downloader stability, fixing word embedding reading issues and joining
the Spark ecosystem's filesystem configuration appropriately, utilizing Spark's defined default filesystem, in order to work
properly with clusters and multi-node environments. This includes Databricks cloud clusters and Amazon EMR YARN HDFS nodes.
Aside from that, we come with exciting new features: a brand new Spell Checker with higher accuracy, inspired by the
Symmetric Delete algorithm.
Finally, Assertion Status can now be trained and predicted on top of NER output, whereas before
this only worked by providing assertion status Start and End boundaries for the target to assert.
---------------
New Features
---------------
* Assertion status annotators can now be trained and predict against NER output instead of start and end boundaries. Entities can now be directly asserted
* Brand new Symmetric Delete annotator (SymmetricDeleteApproach) with close to state-of-the-art accuracy of 80% (see the sketch below)
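
A minimal Python sketch of the new annotator (the dictionary path is a placeholder; setter name assumed from the later-stabilized API):

    from sparknlp.annotator import SymmetricDeleteApproach

    spell = (SymmetricDeleteApproach()
        .setInputCols(["token"])
        .setOutputCol("spell")
        # corpus of correctly spelled words (placeholder path)
        .setDictionary("path/to/words.txt"))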
---------------
Enhancements
---------------
* The model downloader now uses the proper Spark filesystem. Works seamlessly with distributed storage, Databricks cloud clusters and Amazon EMR
* Fixed several race conditions when loading word embeddings from disk or downloading resources; the library is more stable
* Improved several assertion status validations and error messages
---------------
Bug fixes
---------------
* Standalone annotator models are now properly read from disk in Python
---------------
Models
---------------
* New Symmetric Delete Spell checker pretrained model
* Vivekn Sentiment annotator may now be downloaded standalone with pretrained()
========
1.5.1
========
---------------
Overview
---------------
This release is an enhancement release to 1.5.0 which includes improved downloader properties and better annotator defaults.
Also, assertion status models have been included as pretrained; these are models trained on top of GloVe Stanford word embeddings.
---------------
Enhancements
---------------
* SentenceDetector now has a useCustomOnly param which enforces using only the custom bounds provided (thanks @atomobianco)
* Normalizer now defaults to not lowercasing words, which leads to better implicit accuracy in pipelines (thanks @marek.modry)
* SpellChecker now defaults to being case sensitive, which leads to better accuracy
* DateMatcher improved speed performance
* com.johnsnowlabs.annotator._ in Scala now also includes RecursivePipelines and LightPipelines for easier imports
* ModelDownloader has been improved with better directory management
---------------
Models
---------------
* New Assertion Status (LogisticRegression and DeepLearning) pretrained models now available
* Vivekn, Basic and Advanced pretrained Pipelines improved accuracy (thanks @marek.modry)
---------------
Other
---------------
* S3 library dependencies updated
========
1.5.0
========
---------------
Overview
---------------
We are proud to announce what may be the biggest release in terms of content in Spark-NLP!
This release makes the library miles easier to use for newcomers, allowing easier importing of
annotators and extended use of the model downloader throughout pretrained models and pipelines.
It also includes two new annotators that use deep learning algorithms with graphs from TensorFlow, which
is a first for us.
Apart from this, we include new Light Pipelines that are 10x faster when working with data smaller than about
50,000 rows in length.
Finally, we included several bugfixes across the library, from algorithms to the developer API.
We'll gladly welcome any feedback! The website has been extensively updated.
---------------
New features
---------------
* Light Pipelines are annotator pipelines created from SparkML pipelines that run more than 10x faster on small datasets
* Deep Learning NER based on Bi-LSTM and Convolutional Neural Networks from word embeddings datasets
* Deep Learning Assertion Status model based on LSTM to compute status identification from word embeddings
* Easier to use Spark-NLP:
1. Imports have been made easy in the Scala API (com.johnsnowlabs.annotator._) to bring in all annotators
2. BasicPipeline and AdvancedPipeline downloadable pipelines created for quick annotation of text
3. Light Pipelines are easy to use and accept simple strings to annotate through a Spark ML Pipeline without Spark datasets
* New Downloadable models: CRF NER, Lemmatizer, POS and Spell checker
* New Downloadable pipelines: Vivekn Sentiment analysis, BasicPipeline and AdvancedPipeline
---------------
Enhancements
---------------
* Model downloader significantly improved in terms of usability
---------------
Documentation
---------------
* Website widely improved
* Added invite to our first slack chat channel
---------------
Bugfixes
---------------
* Fixed wrong positional index value when creating Annotations from the constructor
* Fixed hamming distance calculation in spell checker
* Fixed Downloadable NER model failing sporadically due to missing temporary files
* Fixed the SearchTrie algorithm used in TextMatcher (formerly EntityExtractor); thanks @avenka11 for reporting and proposing a solution
* Fixed some model deserialization issues happening on Windows
---------------
Other
---------------
* Thanks to @showy we have Travis CI automatic integration testing
* Finisher now outputs to array by default
* Training example resources removed in favor of using the model downloader more
========
1.4.2
========
---------------
Bugfixes
---------------
* Filesystem protocols now properly read across the library, fixed use case for S3:// protocol (thanks @avenka11)
* Library now works properly in Windows environments
* PySpark annotator param getters now work properly when retrieving default values
* Fixed stemmer serialization due to misspelled param name
* Fixed Tokenizer's infixPattern param name to infixPatterns, which had led to broken PySpark serialization of the param
* Added the missing addInfixPattern() function to PySpark, to allow adding patterns to the current value
* Model Downloader clearCache now properly removes both .zip files and extracted content
* Model Downloader is now capable of reading all types of models properly
* Added the missing clearCache function to PySpark
---------------
Developer API
---------------
* Function names in the model downloader code have been refactored consistently
---------------
Other
---------------
* RocksDB rolled back to previous version to support Windows