#Natural Language Processing with Spark's ML
##Requires
- Anaconda Python 3.4
- NLTK
- langid
- findspark (for a local Spark install only)
- Spark 1.6 (a local install is OK)
#Example Description
- How do you create a Data Science vs. Spam classifier for Twitter?
- How do you choose the right algorithm?
- What do you need to get started?
##Use PySpark to preprocess text data
- Language Classification
- Stop Word Removal
- Custom Twitter-Specific Cleanup
- Part of Speech Tagging
- Lemmatization/Stemming of Text
- General Cleanup
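
As a rough illustration of the Twitter-specific cleanup, stop word removal, and general cleanup steps, here is a minimal sketch in plain Python. The regular expressions, the tiny `STOP_WORDS` set, and the `clean_tweet` name are assumptions made for this example and are not taken from the talk; language classification, part-of-speech tagging, and lemmatization are left out.

```python
import re

# Hypothetical patterns for tweet-specific noise; tune them for real data.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
RETWEET_RE = re.compile(r"^RT\b:?\s*", re.IGNORECASE)
NON_ALPHA_RE = re.compile(r"[^a-z\s]")

# Tiny illustrative stop-word set; NLTK's stopwords corpus is a fuller choice.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "on", "with"}


def clean_tweet(text):
    """Return the meaningful tokens of a tweet as a lower-cased, space-joined string."""
    text = RETWEET_RE.sub("", text.strip())      # drop a leading "RT"
    text = URL_RE.sub(" ", text)                 # drop links
    text = MENTION_RE.sub(" ", text)             # drop @user mentions
    text = text.replace("#", " ")                # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text.lower())   # general cleanup: letters only
    tokens = [t for t in text.split() if t not in STOP_WORDS and len(t) > 1]
    return " ".join(tokens)


print(clean_tweet("RT @bob: Learning #DataScience with Spark! http://t.co/abc123"))
```

In PySpark, a function like this could be applied to a text column by wrapping it with `pyspark.sql.functions.udf` and a `StringType` return type.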
##Converting text to numerical data with ML Pipelines
- Tokenization
- Term Frequency Hashing
- Inverse Document Frequency
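
To make the three steps concrete, the sketch below reproduces the arithmetic in plain Python. It is not Spark's implementation: `zlib.crc32` stands in for Spark's hash function, `NUM_FEATURES` is kept deliberately tiny, and all function names are made up for this example. The IDF uses the smoothed form log((N + 1) / (df + 1)) given in the Spark MLlib documentation. In an actual pipeline, these steps correspond to the `Tokenizer`, `HashingTF`, and `IDF` stages in `pyspark.ml.feature`.

```python
import math
import zlib
from collections import Counter

NUM_FEATURES = 16  # tiny on purpose so collisions are visible; real pipelines use far more


def tokenize(text):
    """Lower-case and split on whitespace."""
    return text.lower().split()


def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Hash each token to a bucket index and count occurrences per bucket."""
    # crc32 is a deterministic stand-in, not the hash Spark uses.
    return Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in tokens)


def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(bucket for tf in tf_vectors for bucket in tf)
    return {b: math.log((n_docs + 1) / (df + 1)) for b, df in doc_freq.items()}


def tf_idf(tf, idf):
    """Scale each bucket's term frequency by its IDF weight."""
    return {b: count * idf[b] for b, count in tf.items()}


docs = ["spark spark ml", "spark spam"]
tfs = [hashing_tf(tokenize(d)) for d in docs]
idf = idf_weights(tfs)
vectors = [tf_idf(tf, idf) for tf in tfs]
```

A bucket that occurs in every document, such as the one for "spark" here, gets an IDF weight of zero, so it contributes nothing to the final vector.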
##Training & Testing a Model
- Cross-validation with the ML Pipeline CrossValidator
- Evaluation with ML Pipeline Evaluator
##Watch the Talk