
Commit

bab2min committed Dec 29, 2019
1 parent 03d8677 commit 4e73a9c
Showing 6 changed files with 98 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,4 +9,5 @@
/tomotopy.egg-info
build_windows.bat
*.bin
enwiki-stemmed-1000.txt
/venv/
13 changes: 13 additions & 0 deletions README.kr.rst
@@ -185,6 +185,19 @@ add_doc can only be used before `tomotopy.LDAModel.train` is started.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of cores.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png


Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
13 changes: 13 additions & 0 deletions README.rst
@@ -189,6 +189,19 @@ Inference for unseen documents should be performed using `tomotopy.LDAModel.infer`.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.
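
For illustration, here is a minimal sketch of inferring a single unseen document; the model setup, the `make_doc` call, and the sample words are assumptions for this example, not part of this commit::

    import tomotopy as tp

    mdl = tp.LDAModel(k=20)
    # ... add training documents with mdl.add_doc(words) and run mdl.train() first ...
    unseen = mdl.make_doc(['new', 'unseen', 'words'])  # hypothetical example words
    topic_dist, ll = mdl.infer(unseen)                 # single Document -> one distribution
    print(topic_dist, ll)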

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.
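
As a minimal sketch of choosing the scheme (the corpus file name is an assumption; the `parallel` argument of `train` is the same one the benchmark script in this commit uses)::

    import tomotopy as tp

    mdl = tp.LDAModel(k=50)
    for line in open('corpus.txt', encoding='utf-8'):  # hypothetical corpus file
        mdl.add_doc(line.strip().split())
    # PARTITION tends to be faster with many topics/workers, but only some models support it
    mdl.train(200, workers=8, parallel=tp.ParallelScheme.PARTITION)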

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png


Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
47 changes: 47 additions & 0 deletions benchmark.py
@@ -0,0 +1,47 @@
# Benchmark: tomotopy (COPY_MERGE vs PARTITION) against gensim's LdaModel
import time
import tomotopy as tp

filename = 'enwiki-stemmed-1000.txt'

def bench_gensim(k):
    from gensim import corpora, models
    # build the dictionary and bag-of-words corpus, dropping the '.' token
    dictionary = corpora.Dictionary(filter(lambda x: x != '.', text.strip().split()) for text in open(filename, encoding='utf-8'))
    corpus = [dictionary.doc2bow(filter(lambda x: x != '.', text.strip().split())) for text in open(filename, encoding='utf-8')]
    #print('Number of vocabs:', len(dictionary))

    start_time = time.time()
    model = models.ldamodel.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
    #model = models.ldamulticore.LdaMulticore(corpus, num_topics=k, id2word=dictionary, passes=10, workers=8)  # does not work on Windows
    #for i in range(k): print(model.show_topic(i))
    print('K=%d\tTime: %.5g' % (k, time.time() - start_time), end='\t')
    print('LL: %g' % model.log_perplexity(corpus), flush=True)

def bench_tomotopy(k, ps, w=0):
    model = tp.LDAModel(k=k)
    for text in open(filename, encoding='utf-8'):
        model.add_doc(filter(lambda x: x != '.', text.strip().split()))
    #print('Number of vocabs:', len(model.vocabs))

    start_time = time.time()
    model.train(200, workers=w, parallel=ps)
    #for i in range(k): print(model.get_topic_words(i))
    print('K=%d\tW=%d\tTime: %.5g' % (k, w, time.time() - start_time), end='\t')
    print('LL: %g' % model.ll_per_word, flush=True)


print('== tomotopy (K x ParallelScheme) ==')
for ps in [tp.ParallelScheme.COPY_MERGE, tp.ParallelScheme.PARTITION]:
    print('= {} ='.format(ps.name))
    for k in range(10, 101, 10):
        bench_tomotopy(k, ps)
        time.sleep(2)  # cool-down between runs

print('== tomotopy (Workers x ParallelScheme) ==')
for ps in [tp.ParallelScheme.COPY_MERGE, tp.ParallelScheme.PARTITION]:
    print('= {} ='.format(ps.name))
    for w in [1, 2, 3, 4, 5, 6, 7, 8]:
        bench_tomotopy(50, ps, w)
        time.sleep(2)

print('== gensim (K) ==')
for k in range(10, 101, 10):
    bench_gensim(k)
    time.sleep(2)
12 changes: 12 additions & 0 deletions tomotopy/documentation.kr.rst
@@ -227,6 +227,18 @@ add_doc can only be used before `tomotopy.LDAModel.train` is started.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of cores.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
12 changes: 12 additions & 0 deletions tomotopy/documentation.rst
@@ -229,6 +229,18 @@ Inference for unseen documents should be performed using `tomotopy.LDAModel.infer`.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
