
A list of Google Summer of Code and student thesis projects for Gensim, a scientific Python package for efficient, large-scale topic modelling.

We offer a financial reward as well as technical and academic assistance for completing these projects. Expectations are high, though; read this general summary before applying.

If you'd like to work on any of the topics below, or have your own ideas, get in touch at [email protected].


Online NNMF

Background:

Non-negative matrix factorization, NNMF [1], is a popular machine learning algorithm, widely used in collaborative filtering and natural language processing. It can be phrased as an online learning algorithm. [2]

While implementations of NNMF in Python exist [3, 4], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications. You will contribute a scalable implementation of NNMF to the Python data science world. A quality implementation will be widely used in the industry.
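
To make the streamed setting concrete, below is a minimal sketch (not gensim code; the function name and the particular update scheme are illustrative assumptions) of mini-batch NMF with multiplicative updates, accumulating sufficient statistics so that memory use depends only on the batch and factor sizes, not on the corpus size:

```python
import numpy as np

def online_nmf(batches, n_features, n_components, eps=1e-9, inner_iters=30):
    """Streamed NMF: V ~ W @ H, consuming one mini-batch of columns at a time."""
    rng = np.random.RandomState(0)
    W = rng.rand(n_features, n_components)
    A = np.zeros((n_components, n_components))  # running sum of H @ H.T
    B = np.zeros((n_features, n_components))    # running sum of V @ H.T
    for V in batches:
        # infer the batch coefficients H with W held fixed (multiplicative updates)
        H = rng.rand(n_components, V.shape[1])
        for _ in range(inner_iters):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
        # accumulate sufficient statistics, then refresh the global factor W
        A += H @ H.T
        B += V @ H.T
        W *= B / (W @ A + eps)
    return W

# 100 mini-batches of 10 documents each, over a 500-term vocabulary
stream = (np.random.RandomState(i).rand(500, 10) for i in range(100))
W = online_nmf(stream, n_features=500, n_components=20)
```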

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals:

  1. Demonstrate understanding of matrix factorization theory and practice, by describing, implementing and evaluating a scalable version of the NNMF algorithm.

  2. Implement streamed NNMF [5] that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally, also implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables:

  1. Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings and accuracy of your NNMF implementation on English Wikipedia and the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your NNMF implementation. You can also evaluate the NNMF factorization quality against other factorization methods, such as SVD and LDA [9] in collaborative filtering settings (optional).

Resources:

[1] NNMF on Wikipedia

[2] Online algorithm

[3] Christian Thurau et al. "Python Matrix Factorisation"

[4] Sklearn NMF code

[5] Online NMF on Wikipedia

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

[9] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010

[10] Wang, Tan, König, Li. "Efficient Document Clustering via Online Nonnegative Matrix Factorizations." 2011

[11] Topics extraction with Non-Negative Matrix Factorization in sklearn

[12] Gensim github issue #132.

Explicit Semantic Analysis

Background: Explicit Semantic Analysis [1, 2] is a method of unsupervised document analysis using Wikipedia as a resource. It has many applications, for example event classification on Twitter [3].

While implementations of ESA exist in Python [4] and other languages [5], they only work on small datasets that fit fully into RAM, which is too restrictive for many real-world applications.

You will contribute a scalable implementation of ESA to the Python data science world. A quality implementation will be widely used in the industry.
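
The core idea can be sketched with gensim's existing building blocks: index the Wikipedia articles (the "concepts") with TF-IDF, then represent any new text by its similarity to every concept. The three toy "articles" below stand in for real Wikipedia pages and are purely illustrative:

```python
from gensim import corpora, models, similarities

concepts = ['machine learning', 'classical music', 'football']
articles = [
    ['model', 'training', 'data', 'algorithm'],
    ['symphony', 'orchestra', 'composer', 'concert'],
    ['goal', 'league', 'match', 'player'],
]
dictionary = corpora.Dictionary(articles)
bow = [dictionary.doc2bow(a) for a in articles]
tfidf = models.TfidfModel(bow)
index = similarities.MatrixSimilarity(tfidf[bow], num_features=len(dictionary))

# the ESA representation of a new text: its similarity to every concept
doc = dictionary.doc2bow(['data', 'model', 'match'])
print(list(zip(concepts, index[tfidf[doc]])))
```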

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals:

  1. Demonstrate understanding of semantic interpretation theory and practice, by describing, implementing and evaluating a scalable version of the ESA algorithm.

  2. Implement streamed ESA that is capable of online (incremental) updates. Model training must proceed in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings and accuracy of your ESA implementation on the Lee corpus [8] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of your ESA implementation. You can also evaluate ESA against other methods of semantic analysis, such as Latent Semantic Analysis [9, 10], in an event classification task (optional).

Resources:

[1] Evgeniy Gabrilovich and Shaul Markovitch "Wikipedia-based Semantic Interpretation for Natural Language Processing." Journal of Artificial Intelligence Research, 34:443–498, 2009

[2] Explicit Semantic Analysis.

[3] Musaev, A., De Wang, Shridhar, S., Chien-An Lai and Pu, C. "Toward a Real-Time Service for Landslide Detection: Augmented Explicit Semantic Analysis and Clustering Composition Approaches." Proceedings of the 2015 IEEE International Conference on Web Services (ICWS), pp. 511–518, 2015

[4] Python implementation of ESA

[5] Gabrilovich's page on ESA

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

[9] "Latent Semantic Analysis" article on Wikipedia

[10] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188

Topic coherence

Background

Unsupervised learning methods like Latent Dirichlet Allocation [1] and Latent Semantic Analysis [2] are attractive methods for bringing structure to otherwise unstructured text data. However, they give no guarantee that their output is interpretable by humans. This interpretability can be measured by topic coherence: a good model must have high topic coherence, i.e. be understandable by humans.

The "Agile Knowledge Engineering and Semantic Web (AKSW)" research group of Michael Röder recently found a topic coherence measure that outperforms others [3].

While they released an implementation in Java [4, 5], no Python implementation exists.

You will contribute a scalable implementation of AKSW Topic Coherence to the Python data science world. A quality implementation will be widely used in the industry.
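
For a flavour of what such measures compute, here is a simplified sketch of the University of Massachusetts coherence [8], which scores a topic's top words by how often they co-occur in the reference documents (the full measure orders words by corpus frequency; this sketch ignores that detail):

```python
import numpy as np
from itertools import combinations

def umass_coherence(top_words, documents, eps=1.0):
    doc_sets = [set(doc) for doc in documents]
    def doc_freq(*words):  # number of documents containing all the given words
        return sum(all(w in d for w in words) for d in doc_sets)
    return sum(np.log((doc_freq(w1, w2) + eps) / max(doc_freq(w2), 1))
               for w1, w2 in combinations(top_words, 2))

docs = [['cat', 'dog', 'pet'], ['dog', 'leash', 'pet'], ['stock', 'market']]
print(umass_coherence(['dog', 'pet'], docs))    # co-occurring words: higher score
print(umass_coherence(['dog', 'stock'], docs))  # unrelated words: lower score
```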

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals:

  1. Demonstrate understanding of topic coherence theory and practice, by describing, implementing and evaluating a scalable version of the AKSW topic coherence algorithm.

  2. Implement AKSW topic coherence and other coherence measures. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings and accuracy of your topic coherence implementation on the corpora in [4]. A summary of insights into parameter selection and tuning of your topic coherence implementation. You can also evaluate AKSW topic coherence against other coherence measures, such as the University of Massachusetts coherence (optional) [8].

Resources:

[1] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003). Lafferty, John, ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5): pp. 993–1022

[2] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188

[3] Röder, Michael, Andreas Both, and Alexander Hinneburg. "Exploring the space of topic coherence measures." Proceedings of the Eighth ACM International Conference on Web Search and Data Mining. ACM, 2015

[4] Topic coherence in Java: code and corpus

[5] [Topic coherence web-app](http://palmetto.aksw.org/palmetto-webapp/)

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Gensim implementation of University of Massachusetts coherence

Performance evaluation of k-Nearest Neighbours algorithms in Text Processing

Background: k-Nearest Neighbours (kNN) is a very widely used machine learning technique. For example, it powers music recommendations at Spotify. [1]

kNN can retrieve the "top-100 most similar" documents among millions in hundreds of milliseconds on a common laptop. [2] The exact kNN algorithm scales linearly in the number of documents. However, in many applications exact kNN results are not needed, and approximate results obtained in sub-linear time are enough.

While fast approximate kNN libraries exist [1,3,4], none of them are seamlessly integrated with a Python Natural Language Processing library. That complicates their use in industrial text processing.

Writing a good, scalable kNN library is a non-trivial task, and most academic implementations don't fit the bill [5]. Therefore you will evaluate the candidates and integrate the best library for document similarity: a contribution to the Python data science world built on top of an existing library. A quality implementation will be widely used in the industry.
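
As an illustration of the kind of integration envisaged, here is a minimal sketch using Spotify's Annoy [1]; the random vectors stand in for gensim document vectors (e.g. LSI or word2vec representations), and all parameter values are illustrative:

```python
import numpy as np
from annoy import AnnoyIndex  # pip install annoy

dim = 100
doc_vectors = np.random.RandomState(0).rand(10000, dim)

index = AnnoyIndex(dim, 'angular')  # cosine-like metric
for i, vec in enumerate(doc_vectors):
    index.add_item(i, vec)
index.build(50)  # more trees: better recall, bigger index

top100 = index.get_nns_by_item(0, 100)  # approximate top-100 neighbours of doc 0
```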

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of approximate information retrieval theory and practice by describing, implementing and evaluating an application of an approximate kNN library to text processing. Developers of all the packages mentioned above are responsive, aware of gensim and open to improvements/feedback.

  2. Integrate the open-source Natural Language Processing library gensim [6] with an out-of-core AkNN library. Processing must be done in constant memory independent of the full index size. Optionally implement a version that can use multiple cores on the same machine. Optionally compare with a library capable of online (incremental) updates.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim on github [7]. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. The integration must follow the Similarity API [8].

  2. Report: timings, memory use and accuracy of your AkNN integration on the corpus of English Wikipedia or another publicly available large text corpus. A summary of insights into parameter selection and tuning of AkNN libraries. Comparison to gensim's existing exact kNN. [2]

Resources:

[1] Spotify's Approximate kNN library Annoy

[2] Gensim linear exact kNN

[3] NearPy Approximate kNN library

[4] Wei Dong, Moses Charikar, Kai Li. Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures. WWW 2011 kgraph Approximate kNN library

[5] Performance shootout of Nearest Neighbours Contestants

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] Gensim similarity API

[9] Github ticket #51 has some theoretical pointers; this benchmark of kNN libraries on Wikipedia data (~3.5 million documents) has practical code and an ecosystem summary

Dynamic Topic Models

Background: Dynamic topic models [1,2,3,4] are used to analyze the evolution of topics in a collection of documents over time. For example, by analyzing the famous academic journal "Science" over the last 120 years, one can see the evolution of Atomic Physics: this topic gradually drops the word "matter" and picks up "quantum" [5].

This family of models was proposed by David Blei and John Lafferty [4]. It is an extension to Latent Dirichlet Allocation (LDA) [6] that can handle sequential documents.

While there is an academic implementation in C++ [7], no practical implementation exists. You will contribute a scalable implementation of DTM to the Python data science world. A quality implementation will be widely used in the industry.
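
For orientation, gensim already wraps the academic binary (see [10] in the Resources below); a minimal usage sketch, where `dtm_path` is a placeholder for a compiled DTM executable and the toy documents are assumed to be ordered by time slice:

```python
from gensim.corpora import Dictionary
from gensim.models.wrappers.dtmmodel import DtmModel

docs = [['quantum', 'matter'], ['atom', 'matter'],   # time slice 1
        ['quantum', 'energy'], ['atom', 'quantum']]  # time slice 2
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

dtm_path = '/path/to/dtm-binary'  # compiled from the academic C++ code [7]
model = DtmModel(dtm_path, corpus=corpus, time_slices=[2, 2],
                 num_topics=2, id2word=dictionary)
print(model.show_topic(0, 1, topn=2))  # topic 0's top words in the second slice
```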

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating Dynamic Topic Modelling.

  2. Implement a streamed DTM that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim doesn't include any support for "timed streams", or time tags, at the moment, so part of this project will be engineering a clean API for this new functionality.

  2. Report on timings, memory use and accuracy of your DTM implementation on a text corpus evolving through time. Include a summary of insights into parameter selection and tuning of DTM libraries. Compare to gensim's wrapper around the academic C++ DTM code. [10]

Resources:

[1] Dynamic Topic Models

[2] Wang, Chong, David Blei, and David Heckerman. "Continuous time dynamic topic models." arXiv preprint arXiv:1206.3298 (2012).

[3] Wang, Xuerui, and Andrew McCallum. "Topics over time: a non-Markov continuous-time model of topical trends." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006.

[4] Blei, David M., and John D. Lafferty. "Dynamic topic models." Proceedings of the 23rd international conference on Machine learning. ACM, 2006

[5] David M. Blei and John D. Lafferty, "Modeling the Evolution of Science" interactive topic browser

[6] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003). Lafferty, John, ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5): pp. 993–1022

[7] Academic implementation of DTM on David Blei's page and github

[8] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[9] Gensim on github

[10] Python wrapper for DTM C++ library in gensim

Supervised Latent Dirichlet Allocation

Background: Supervised Latent Dirichlet Allocation (sLDA) [1] is a Natural Language Processing method based on Latent Dirichlet Allocation (LDA) [2]. It can be used to predict the number of "Likes" on a post or the number of stars in a movie review.

In vanilla LDA, we treat the topic proportions for a text document as a draw from a Dirichlet distribution. We obtain the words in the document by repeatedly choosing a topic assignment from those proportions, then drawing a word from the corresponding topic. In Supervised Latent Dirichlet Allocation (sLDA), we add a target variable to the LDA model: for example, the number of stars assigned in a movie review, or the number of "Likes" on a post.
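
The generative process just described fits in a few lines; here is a minimal simulation sketch of the model in [1] (array shapes and parameter values are illustrative):

```python
import numpy as np

def generate_slda_document(topics, alpha, eta, sigma2, doc_len, rng):
    # topics: (K, V) per-topic word distributions; eta: (K,) regression weights
    theta = rng.dirichlet(alpha)                       # topic proportions
    z = rng.choice(len(alpha), size=doc_len, p=theta)  # per-word topic assignments
    words = [rng.choice(topics.shape[1], p=topics[k]) for k in z]
    z_bar = np.bincount(z, minlength=len(alpha)) / doc_len
    y = rng.normal(eta.dot(z_bar), np.sqrt(sigma2))    # response, e.g. star rating
    return words, y

rng = np.random.RandomState(0)
topics = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])
print(generate_slda_document(topics, alpha=[0.5, 0.5], eta=np.array([-1.0, 1.0]),
                             sigma2=0.1, doc_len=20, rng=rng))
```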

While academic implementations of sLDA exist in C++ and R [3, 4], there is no Python implementation available. You will contribute a scalable implementation of sLDA to the Python data science world. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of topic modeling theory and practice by describing, implementing and evaluating sLDA.

  2. Implement a streamed sLDA that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [5, 6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings, memory use and accuracy of your sLDA implementation on the Cornell Movie Review Corpus [8] following the same methodology as in [1]. A summary of insights into parameter selection and tuning of sLDA.

Resources:

[1] Mcauliffe, Jon D., and David M. Blei. "Supervised topic models." Advances in neural information processing systems. 2008.

[2] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I (January 2003). Lafferty, John, ed. "Latent Dirichlet allocation". Journal of Machine Learning Research 3 (4–5): pp. 993–1022

[3] sLDA implementation in C++

[4] Implementation of sLDA in R

[5] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[6] Gensim github issue #121.

[7] Gensim on github

[8] Movie Review Dataset from Cornell NLP group

[9] Ramage, Daniel, et al. "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1. Association for Computational Linguistics, 2009.

[10] Labelled LDA in Python

[11] Jagarlamudi, Jagadeesh, Hal Daumé III, and Raghavendra Udupa. "Incorporating lexical priors into topic models." Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 2012

Online Word2Vec

Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king − man + woman = queen.

The original Word2Vec algorithm can't add new words to its vocabulary after the initial training. This is quite limiting for a news recommender engine encountering new words every day, for example. Many other real-world uses will benefit from being able to add new words to the vocabulary during training. This modification is called online training [3] of a Word2vec model.

There is no robust implementation of Online Word2vec available in any programming language. You will contribute a scalable implementation of Online Word2Vec to the data science world in Python. A quality implementation will be widely used in the industry.
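
From the user's side, the target behaviour could look like the sketch below. The `update=True` flag is a hypothetical API for this project (experimental support along these lines has been discussed for gensim), and exact parameter names are assumptions that vary across gensim versions:

```python
from gensim.models import Word2Vec

sentences = [['the', 'king', 'rules', 'the', 'land'],
             ['the', 'queen', 'rules', 'the', 'land']]
model = Word2Vec(sentences, size=50, min_count=1)  # size: embedding dimensionality

# with a large enough corpus, analogies like king - man + woman ~ queen work:
# model.most_similar(positive=['king', 'woman'], negative=['man'])

# online update: extend the vocabulary with unseen words, then continue training
new_sentences = [['the', 'empress', 'rules', 'the', 'empire']]
model.build_vocab(new_sentences, update=True)  # hypothetical incremental vocab
model.train(new_sentences, total_examples=len(new_sentences), epochs=5)
```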

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of the theory and practice of distributed representations of words by describing, implementing and evaluating online word2vec.

  2. Implement a streamed online word2vec that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [4] on github [5]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings, memory use and accuracy of your online word2vec using the Lee corpus [6] of human similarity judgements included in gensim. A summary of insights into parameter selection and tuning of online word2vec.

Resources:

[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013)

[2] Gensim word2vec tutorial at Kaggle

[3] Online algorithm

[4] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[5] Gensim on github

[6] Lee, M., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society

Word Movers Distance for word2vec

Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king − man + woman = queen.

Many methods have been proposed for measuring the distance between sentences in this new vector space. "Word Mover's Distance" (WMD) [3] is a novel measure of distance between text documents. It outperforms simple combinations of word vectors, such as their sum or mean. Intuitively, the distance between two documents is the minimum cumulative distance that all words in document A need to travel to exactly match document B.

For example, these two sentences are close with respect to WMD even though they only have one word in common: "The restaurant is loud, we couldn't speak across the table" and "The restaurant has a lot to offer but easy conversation is not there". [4]
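
Under the hood, WMD solves a transportation problem over the word vectors. Here is a minimal sketch using the `pyemd` earth mover's distance solver, with toy two-dimensional "word vectors" standing in for real word2vec vectors:

```python
import numpy as np
from pyemd import emd  # pip install pyemd

vocab = ['restaurant', 'loud', 'conversation', 'quiet']
vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.9, 0.2], [5.0, 5.0]])

# normalised bag-of-words histograms of two documents over the joint vocabulary
d1 = np.array([0.5, 0.5, 0.0, 0.0])  # "restaurant loud"
d2 = np.array([0.5, 0.0, 0.5, 0.0])  # "restaurant conversation"

# ground cost: Euclidean distances between the word vectors
dist = np.sqrt(((vectors[:, None, :] - vectors[None, :, :]) ** 2).sum(-1))

print(emd(d1, d2, dist))  # minimal "travel cost" to turn d1 into d2
```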

While there is an academic implementation in C [5], there is no implementation of WMD available in Python. You will contribute a scalable implementation of WMD to the data science world in Python. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of the theory and practice of document distances by describing, implementing and evaluating WMD.

  2. Implement the WMD. Processing must be done in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [6] on github [7]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings, memory use and accuracy of your WMD using the freely available datasets in [3], for example the "20 newsgroups" corpus [8]. A summary of insights into parameter selection and tuning of document distances.

Resources:

[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013)

[2] Gensim word2vec tutorial at Kaggle

[3] "From Word Embeddings to Document Distances" Kusner et al 2015

[4] [Sudeep Das, "Navigating themes in restaurant reviews with Word Mover’s Distance", 2015](http://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/)

[5] Matthew J Kusner's WMD in C on github

[6] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[7] Gensim on github

[8] The 20 newsgroups dataset

[9] Gensim github issue #482

Author-Topic Models

Background: The author-topic model [1] is a Natural Language Processing method that tells us about a person's writing. It can quantify how diverse the range of topics covered by one author is, and it can compare two authors and say how similar they are.

The author-topic model adds information about a document's authors to the very popular Latent Dirichlet Allocation (LDA) model.
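
Here is a minimal simulation sketch of the generative process of [1]: each word first picks one of the document's authors, then a topic from that author's topic distribution (all shapes and values below are illustrative):

```python
import numpy as np

def generate_at_document(doc_authors, author_theta, topics, doc_len, rng):
    # author_theta: (A, K) per-author topic proportions; topics: (K, V)
    words = []
    for _ in range(doc_len):
        a = rng.choice(doc_authors)                               # one co-author
        k = rng.choice(author_theta.shape[1], p=author_theta[a])  # their topic
        words.append(rng.choice(topics.shape[1], p=topics[k]))    # a word
    return words

rng = np.random.RandomState(0)
author_theta = np.array([[0.9, 0.1], [0.2, 0.8]])      # two authors, two topics
topics = np.array([[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]])  # two topics, three words
print(generate_at_document([0, 1], author_theta, topics, doc_len=10, rng=rng))
```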

While there are academic implementations in Python and other languages [3, 4], they are very slow for large datasets. You will contribute a scalable implementation of Author-topic modelling to the data science world in Python. A quality implementation will be widely used in the industry.

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of theory and practice of topic modelling by describing, implementing and evaluating author-topic modelling.

  2. Implement a streamed author-topic model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. Optionally implement a version that can use multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

  4. A very interesting point here is adapting the Gibbs sampling approach of the original paper [1] to use Gensim's variational inference.

Deliverables

  1. Code: a pull request against gensim [5] on github [6]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples.

  2. Report: timings, memory use and accuracy of your author-topic model using the NIPS papers dataset [7], following the methodology of [1]. A summary of insights into parameter selection and tuning of the model.

Resources:

[1] Rosen-Zvi, Michal, et al. "The author-topic model for authors and documents." Proceedings of the 20th conference on Uncertainty in artificial intelligence. AUAI Press, 2004. PDF.

[2] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010.

[3] Author-topic model in Python

[4] Author-topic model in C++

[5] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks.

[6] Gensim on github

[7] NIPS text corpus in MATLAB format

Distributed computing for Latent Dirichlet Allocation

Background: Latent Dirichlet Allocation (LDA) [1] is a very popular algorithm for modelling topics of text documents.

Modern data mining relies on high-level distributed [2] frameworks like Hadoop, Spark [3], Celery [4], Disco [5], Samza [6] and Ibis [7].

While there are implementations of distributed LDA in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LDA implementation in Python. You will contribute a scalable implementation of distributed LDA to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in the industry.
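
For orientation, these are the entry points gensim exposes today, which a new implementation could build on (toy corpus; the low-level distributed mode [10] requires Pyro4 dispatcher and worker processes started by hand):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel, LdaMulticore

texts = [['human', 'computer', 'interface'],
         ['graph', 'trees'], ['graph', 'minors']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# multiple cores on a single machine:
lda = LdaMulticore(corpus, id2word=dictionary, num_topics=2, workers=2)

# multiple machines, once the Pyro4 cluster is up:
# lda = LdaModel(corpus, id2word=dictionary, num_topics=2, distributed=True)
```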

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of theory and practice of distributed computing and topic modelling by describing, implementing and evaluating distributed LDA.

  2. Implement a streamed distributed LDA model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim contains a very manual, low-level distributed implementation of LDA [10] that you can build on.

  2. Report: timings, memory use and accuracy of your distributed LDA implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model. In particular, report how performance changes as cores and machines are added to the cluster.

Resources:

[1] Hoffman, Blei, Bach: Online Learning for Latent Dirichlet Allocation, NIPS 2010

[2] MapReduce: Simplified Data Processing on Large Clusters

[3] Spark distributed computing framework

[4] Celery

[5] Disco

[6] Storm, Samza.

[7] Ibis

[8] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[9] Gensim on github

[10] Low-level distributed LDA in gensim

Distributed computing for Latent Semantic Indexing

Background: Latent Semantic Indexing (LSI) [1] is a very popular algorithm for modelling topics of text documents.

Modern data mining relies on high-level distributed [2] frameworks like Hadoop, Spark [3], Celery [4], Disco [5], Samza [6] and Ibis [7].

While there are implementations of distributed LSI in Scala over Spark and in other languages, there is no established distributed computing framework that contains an LSI implementation in Python. You will contribute a scalable implementation of distributed LSI to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in the industry.
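
As with LDA, gensim already exposes a low-level distributed mode for LSI [10] that a new implementation could build on; a minimal sketch (toy corpus; `distributed=True` requires Pyro4 dispatcher and worker processes to be running):

```python
from gensim.corpora import Dictionary
from gensim.models import LsiModel

texts = [['human', 'computer', 'interface'],
         ['graph', 'trees'], ['graph', 'minors']]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lsi = LsiModel(corpus, id2word=dictionary, num_topics=2)  # single machine
# lsi = LsiModel(corpus, id2word=dictionary, num_topics=2, distributed=True)
```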

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of the theory and practice of distributed computing and topic modelling by describing, implementing and evaluating distributed LSI.

  2. Implement a streamed distributed LSI model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [8] on github [9]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim contains a very manual, low-level distributed implementation of LSI [10] that you can build on.

  2. Report: timings, memory use and accuracy of your distributed LSI implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model.

Resources:

[1] Susan T. Dumais (2005). "Latent Semantic Analysis". Annual Review of Information Science and Technology 38: 188

[2] MapReduce: Simplified Data Processing on Large Clusters

[3] Spark distributed computing framework

[4] Celery

[5] Disco

[6] Storm, Samza.

[7] Ibis

[8] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[9] Gensim on github

[10] Low-level distributed LSI in gensim

[11] LSI on Spark

Distributed computing for word2vec

Background: Word2Vec [1, 2] is a continuous word representation technique for creating word vectors that capture the syntax and semantics of words. The vectors used to represent the words have many interesting features, for example king − man + woman = queen.

Modern data mining relies on high-level distributed [3] frameworks like Hadoop, Spark [4], Celery [5], Disco [6], Samza [7] and Ibis [8].

While there are implementations of distributed word2vec in Scala over Spark [9] and in other languages [10], there is no established distributed computing framework that contains a word2vec implementation in Python. You will contribute a scalable implementation of distributed word2vec to the data science world in Python, building on top of one of the existing distributed frameworks. A quality implementation will be widely used in the industry.
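
For contrast with the multi-machine goal of this project, gensim's current parallelism is limited to cores on a single machine via the `workers` parameter (a toy sketch; exact parameter names vary across gensim versions):

```python
from gensim.models import Word2Vec

sentences = [['distributed', 'training', 'of', 'word', 'vectors'],
             ['word', 'vectors', 'capture', 'syntax', 'and', 'semantics']]
model = Word2Vec(sentences, size=50, min_count=1, workers=4)  # 4 threads, one machine
```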

RaRe Technologies offers a financial reward as well as technical and academic assistance for completing this project. Please get in touch at [email protected].

Goals

  1. Demonstrate understanding of the theory and practice of distributed computing and word representations by describing, implementing and evaluating distributed word2vec.

  2. Implement a streamed distributed word2vec model that is capable of online (incremental) updates. Processing must be done in mini-batches of training samples, in constant memory independent of the full training set size. The implementation must rely on Python's NumPy and SciPy libraries for high-performance computing. By integrating with one of the existing distributed frameworks, it must simultaneously use multiple machines and multiple cores on the same machine.

  3. Learn modern, practical distributed project collaboration and engineering tools (git, mailing lists, continuous build, automated testing).

Deliverables

  1. Code: a pull request against gensim [11] on github [12]. Gensim is an open-source Python library for Natural Language Processing. The pull request is expected to contain a robust, well-tested, well-documented, industry-strength implementation, not flimsy academic code. Check corner cases and distill your insights into documentation tips and examples. Gensim contains a very manual, low-level distributed implementation of word2vec that you can build on.

  2. Report: timings, memory use and accuracy of your distributed word2vec implementation on the English Wikipedia corpus. A summary of insights into parameter selection and tuning of the model.

Resources:

[1] Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

[2] Gensim word2vec tutorial at Kaggle

[3] MapReduce: Simplified Data Processing on Large Clusters

[4] Spark distributed computing framework

[5] Celery

[6] Disco

[7] Storm, Samza.

[8] Ibis

[9] word2vec in Spark

[10] word2vec in DeepLearning4J

[11] Radim Řehůřek and Petr Sojka (2010). Software framework for topic modelling with large corpora. Proc. LREC Workshop on New Challenges for NLP Frameworks

[12] Gensim on github

WordRank

WordRank is a new word embedding algorithm.

Investigate how it compares to word2vec by expanding on the approach in this blog.

BigARTM wrapper

Add Montemurro and Zanette algorithm

See https://github.com/RaRe-Technologies/gensim/issues/665
