
Commit

bab2min committed Dec 29, 2019
1 parent 03d8677 commit 4e73a9c
Showing 6 changed files with 98 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -9,4 +9,5 @@
/tomotopy.egg-info
build_windows.bat
*.bin
enwiki-stemmed-1000.txt
/venv/
13 changes: 13 additions & 0 deletions README.kr.rst
@@ -185,6 +185,19 @@ add_doc can only be used before `tomotopy.LDAModel.train` is started.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of cores.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png


Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
13 changes: 13 additions & 0 deletions README.rst
@@ -189,6 +189,19 @@ Inference for unseen documents should be performed using `tomotopy.LDAModel.infer`.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.
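
For illustration, here is a minimal sketch of inferring a single unseen document; the model setup, the `make_doc` call, and the sample words are assumptions for this example, not part of this commit::

    import tomotopy as tp

    mdl = tp.LDAModel(k=20)
    # ... add training documents with mdl.add_doc(words) and run mdl.train() first ...
    unseen = mdl.make_doc(['new', 'unseen', 'words'])  # hypothetical example words
    topic_dist, ll = mdl.infer(unseen)                 # single Document -> one distribution
    print(topic_dist, ll)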

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.
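
As a minimal sketch of choosing the scheme (the corpus file name is an assumption; the `parallel` argument of `train` is the same one the benchmark script in this commit uses)::

    import tomotopy as tp

    mdl = tp.LDAModel(k=50)
    for line in open('corpus.txt', encoding='utf-8'):  # hypothetical corpus file
        mdl.add_doc(line.strip().split())
    # PARTITION tends to be faster with many topics/workers, but only some models support it
    mdl.train(200, workers=8, parallel=tp.ParallelScheme.PARTITION)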

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png


Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
47 changes: 47 additions & 0 deletions benchmark.py
@@ -0,0 +1,47 @@
# Benchmark: tomotopy (COPY_MERGE vs PARTITION) against gensim's LdaModel
import time
import tomotopy as tp

filename = 'enwiki-stemmed-1000.txt'

def bench_gensim(k):
    from gensim import corpora, models
    # build the dictionary and bag-of-words corpus, dropping the '.' token
    dictionary = corpora.Dictionary(filter(lambda x: x != '.', text.strip().split()) for text in open(filename, encoding='utf-8'))
    corpus = [dictionary.doc2bow(filter(lambda x: x != '.', text.strip().split())) for text in open(filename, encoding='utf-8')]
    #print('Number of vocabs:', len(dictionary))

    start_time = time.time()
    model = models.ldamodel.LdaModel(corpus, num_topics=k, id2word=dictionary, passes=10)
    #model = models.ldamulticore.LdaMulticore(corpus, num_topics=k, id2word=dictionary, passes=10, workers=8)  # does not work on Windows
    #for i in range(k): print(model.show_topic(i))
    print('K=%d\tTime: %.5g' % (k, time.time() - start_time), end='\t')
    print('LL: %g' % model.log_perplexity(corpus), flush=True)

def bench_tomotopy(k, ps, w=0):
    model = tp.LDAModel(k=k)
    for text in open(filename, encoding='utf-8'):
        model.add_doc(filter(lambda x: x != '.', text.strip().split()))
    #print('Number of vocabs:', len(model.vocabs))

    start_time = time.time()
    model.train(200, workers=w, parallel=ps)
    #for i in range(k): print(model.get_topic_words(i))
    print('K=%d\tW=%d\tTime: %.5g' % (k, w, time.time() - start_time), end='\t')
    print('LL: %g' % model.ll_per_word, flush=True)


print('== tomotopy (K x ParallelScheme) ==')
for ps in [tp.ParallelScheme.COPY_MERGE, tp.ParallelScheme.PARTITION]:
    print('= {} ='.format(ps.name))
    for k in range(10, 101, 10):
        bench_tomotopy(k, ps)
        time.sleep(2)  # cool-down between runs

print('== tomotopy (Workers x ParallelScheme) ==')
for ps in [tp.ParallelScheme.COPY_MERGE, tp.ParallelScheme.PARTITION]:
    print('= {} ='.format(ps.name))
    for w in [1, 2, 3, 4, 5, 6, 7, 8]:
        bench_tomotopy(50, ps, w)
        time.sleep(2)

print('== gensim (K) ==')
for k in range(10, 101, 10):
    bench_gensim(k)
    time.sleep(2)
12 changes: 12 additions & 0 deletions tomotopy/documentation.kr.rst
@@ -227,6 +227,18 @@ add_doc can only be used before `tomotopy.LDAModel.train` is started.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of cores.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
12 changes: 12 additions & 0 deletions tomotopy/documentation.rst
@@ -229,6 +229,18 @@ Inference for unseen documents should be performed using `tomotopy.LDAModel.infer`.
The `infer` method can infer either a single instance of `tomotopy.Document` or a `list` of such instances.
See `tomotopy.LDAModel.infer` for details.

Parallel Sampling Algorithms
----------------------------
Since version 0.5.0, `tomotopy` allows you to choose a parallelism algorithm.
The algorithm provided up to version 0.4.2 is `COPY_MERGE`, which can be used with all topic models.
The new algorithm `PARTITION`, available since 0.5.0, makes training faster and more memory-efficient, but it can be used with only some topic models.

The following charts show the speed difference between the two algorithms depending on the number of topics and the number of workers.

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp.png

.. image:: https://bab2min.github.io/tomotopy/images/algo_comp2.png

Examples
--------
You can find example Python 3 code for tomotopy at https://github.com/bab2min/tomotopy/blob/master/example.py.
