GitHub - dreyco676/nlp_spark: Natural Language Processing with Spark's MLlib

Repository contents:

- .idea
- .gitignore
- README.md
- TCHUG_NLP_PYSPARK.pptx
- data.zip
- nlp_with_spark.ipynb
- preproc.py

# Natural Language Processing with Spark's ML

## Requires

- Anaconda Python 3.4
  - NLTK
  - langid
  - findspark (for a local Spark install only)
- Spark 1.6
  - Local install OK

# Example Description

- How do you create a Data Science vs. Spam classifier for Twitter?
- How do you choose the right algorithm?
- What do you need to get started?

## Use PySpark to preprocess text data

- Language Classification
- Stop Word Removal
- Custom Twitter-Specific Clean Up
- Part of Speech Tagging
- Lemmatization/Stemming of Text
- General Cleanup
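
Two of the steps above, the Twitter-specific clean up and stop word removal, can be sketched in plain Python. This is a minimal illustration and not the code in `preproc.py`: the regular expressions, the tiny `STOP_WORDS` set, and the function names `clean_tweet` and `remove_stop_words` are all assumptions made for the example. A real run would more likely use NLTK's stop word list.

```python
import re

# Hypothetical patterns for Twitter-specific noise; tune them for real data.
RETWEET_RE = re.compile(r"^rt\s+")        # leading retweet marker
URL_RE = re.compile(r"https?://\S+")      # links
MENTION_RE = re.compile(r"@\w+")          # @user mentions
NON_ALPHA_RE = re.compile(r"[^a-z\s]")    # digits, punctuation, symbols

# Tiny illustrative stop word list (an assumption); NLTK's list is far larger.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "to", "of", "in", "for", "on"})


def clean_tweet(text):
    """Lowercase a tweet and strip retweet markers, URLs, mentions, and punctuation."""
    text = text.lower()
    text = RETWEET_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")          # keep the hashtag word, drop the symbol
    text = NON_ALPHA_RE.sub(" ", text)
    return " ".join(text.split())         # collapse repeated whitespace


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


raw = "RT @spark_fan: Learning #DataScience is fun! http://t.co/abc123"
cleaned = remove_stop_words(clean_tweet(raw))
print(cleaned)
```

To apply functions like these to a DataFrame column in PySpark, one option is to wrap them with `pyspark.sql.functions.udf`, for example `udf(clean_tweet, StringType())`.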

## Converting text to numerical data with ML Pipelines

- Tokenization
- Term Frequency Hashing
- Inverse Document Frequency
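
To show what these three stages compute, here is a small pure-Python sketch of tokenization, the hashing trick, and smoothed inverse document frequency. It is an illustration and not Spark's implementation: `zlib.crc32` stands in for the hash function Spark uses, the 16-bucket vector size is arbitrarily small, and the function names and sample documents are invented for the example. The IDF formula, log((N + 1) / (df + 1)), is the smoothed form given in the Spark MLlib documentation.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map tokens to a sparse {bucket_index: count} vector via the hashing trick."""
    counts = Counter(zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens)
    return dict(counts)


def idf_weights(tf_vectors):
    """Smoothed IDF per bucket: log((N + 1) / (df + 1)), N = number of documents."""
    n_docs = len(tf_vectors)
    doc_freq = Counter(idx for vec in tf_vectors for idx in vec)
    return {idx: math.log((n_docs + 1) / (df + 1)) for idx, df in doc_freq.items()}


def tf_idf(tf_vector, idf):
    """Scale each term-frequency count by the IDF weight of its bucket."""
    return {idx: count * idf[idx] for idx, count in tf_vector.items()}


# Invented sample documents, tokenized by whitespace.
docs = ["spark makes data science fun", "win free money now", "data science with spark"]
tokenized = [doc.split() for doc in docs]
tf_vectors = [hashing_tf(tokens) for tokens in tokenized]
idf = idf_weights(tf_vectors)
features = [tf_idf(vec, idf) for vec in tf_vectors]
```

In PySpark the corresponding stages are `Tokenizer`, `HashingTF`, and `IDF` from `pyspark.ml.feature`, chained in a `Pipeline`. With so few buckets, distinct words can collide in the same index, which is why real pipelines use a much larger `numFeatures`.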

## Training & Testing a Model

- Cross-validation with ML Pipeline CrossValidator
- Evaluation with ML Pipeline Evaluator
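
The following sketch shows one way the feature stages and the cross-validation step might fit together with the `pyspark.ml` API. Several things here are assumptions, not taken from the notebook: the column names `text` and `label`, the choice of `LogisticRegression` as the classifier, the parameter grid values, and the `build_cross_validator` helper itself. The Spark objects must be created after a SparkContext is running.

```python
def build_cross_validator(num_folds=3):
    """Return (CrossValidator, evaluator) for a tokenizer -> TF-IDF -> classifier pipeline.

    Assumes the input DataFrame has a string column "text" and a 0/1 column "label".
    """
    # Imported lazily so this module can be loaded without a Spark install.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="tf")
    idf = IDF(inputCol="tf", outputCol="features")
    # LogisticRegression is an assumed classifier choice for this sketch.
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, idf, lr])

    # Illustrative grid: two vector sizes times two regularization strengths.
    grid = (ParamGridBuilder()
            .addGrid(hashing_tf.numFeatures, [1 << 10, 1 << 14])
            .addGrid(lr.regParam, [0.01, 0.1])
            .build())

    evaluator = BinaryClassificationEvaluator(labelCol="label",
                                              metricName="areaUnderROC")
    cross_validator = CrossValidator(estimator=pipeline,
                                     estimatorParamMaps=grid,
                                     evaluator=evaluator,
                                     numFolds=num_folds)
    return cross_validator, evaluator


# Usage, given a hypothetical labeled DataFrame `labeled_df`:
# train_df, test_df = labeled_df.randomSplit([0.8, 0.2], seed=42)
# cross_validator, evaluator = build_cross_validator()
# model = cross_validator.fit(train_df)             # picks the best grid point
# auc = evaluator.evaluate(model.transform(test_df))
```

Putting the feature stages inside the cross-validated pipeline means `numFeatures` is tuned together with the classifier, and the IDF weights are refit on each training fold instead of being computed from held-out data.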

## Watch the Talk

https://www.youtube.com/watch?v=AsW0QzbYVow
