Version 4 sqlite memory usage patch #224

Open
wants to merge 10 commits into
base: master

mraygalaxy

Hello,

This version addresses comments from #223

  1. Fallback to default in-memory cache if sqlite3 is not found
  2. Use more "replace into".
  3. Use a binary search to find the optimal transaction size and sqlite memory cache size, to speed up building a new DB (about 14 seconds)
  4. Use large file test with multiple lines instead of one single line
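
Items 1 and 2 could look roughly like the sketch below. This is not the code from the patch: the `FREQ(word, freq)` schema and the `open_cache` / `store_batch` helper names are assumptions made up for illustration.

```python
try:
    import sqlite3
except ImportError:  # some minimal Python builds ship without sqlite3
    sqlite3 = None


def open_cache(path=":memory:"):
    """Return a sqlite connection, or a plain dict if sqlite3 is unavailable."""
    if sqlite3 is None:
        return {}  # fall back to the default in-memory cache
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS FREQ (word TEXT PRIMARY KEY, freq REAL)"
    )
    return conn


def store_batch(cache, pairs):
    """Insert or overwrite (word, freq) pairs in a single transaction."""
    if isinstance(cache, dict):
        cache.update(pairs)
        return
    with cache:  # commits once when the block exits
        cache.executemany("REPLACE INTO FREQ (word, freq) VALUES (?, ?)", pairs)


cache = open_cache()
store_batch(cache, [("北京", 1.0), ("上海", 2.0)])
store_batch(cache, [("北京", 5.0)])  # overwrites the earlier row, no duplicate
```

Grouping many `REPLACE INTO` statements into one transaction is what makes the batch size in item 3 worth tuning: each commit has a fixed cost, so larger batches mean fewer commits.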

Here are the results for memory:

mrhines@mahler:~/mica-android/mica$ rm /tmp/jieba.*; util/jieba.py
Building prefix dict from /home/mrhines/mica-android/mica/jieba/jieba/dict.txt ...
Using model from cache /tmp/jieba.u30de3b79a89ddfd331486dee490ffa50.db
Making first-pass over sqlite cache...
Normalizing sqlite frequencies...
Loading model cost 14.6021559238 seconds.
Memory usage: 12.85546875 MB

Here are the large-file test results:

mrhines@mahler:~/mica-android/mica/jieba/test$ python ./test_file_sqlite.py /large_400K_file.txt
Building prefix dict from /home/mrhines/mica-android/mica/jieba/jieba/dict.txt ...
Using model from cache /tmp/jieba.u30de3b79a89ddfd331486dee490ffa50.db
cost 7.74777913094
speed 58807.5617929 bytes/second
mrhines@mahler:~/mica-android/mica/jieba/test$ python ./test_file.py ~/large_400K_file.txt
Building prefix dict from /usr/local/lib/python2.7/dist-packages/jieba/dict.txt ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.17606186867 seconds.
Prefix dict has been built succesfully.
cost 0.61910200119
speed 735949.809763 bytes/second
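
For comparison, the two runs above work out to the same input size (about 455,628 bytes) and roughly a 12.5x slowdown for the sqlite path. A quick check using only the figures printed in the logs (the variable names are just for illustration):

```python
# Figures copied from the two test runs above.
sqlite_cost, sqlite_speed = 7.74777913094, 58807.5617929
default_cost, default_speed = 0.61910200119, 735949.809763

# speed is reported in bytes/second, so cost * speed recovers the input size.
sqlite_bytes = sqlite_cost * sqlite_speed
default_bytes = default_cost * default_speed

# Ratio of wall-clock time for the same input.
slowdown = sqlite_cost / default_cost

print(round(sqlite_bytes), round(default_bytes))
print(round(slowdown, 1))
```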

As usual, comments welcome =)

Michael R. Hines added 2 commits January 24, 2015 12:27
… dynamically without bundling it into the application, so we do not want any large files like 'dict.txt' pre-packaged into the application, so we need to skip checking the age of this file because the mobile application itself will instead handle when new sqlite databases are sent to the application.
@mraygalaxy
Author

Ping?

@gumblex
Contributor

gumblex commented Feb 9, 2015

TTL expired...

In my opinion, this patch should be placed in a separate py file. When required, call a function that imports it and overrides the API, so that the SQLite dependency won't affect the package's portability.
In an Android app (e.g. your MICA), speed is not that important compared to memory and disk usage, because the text is short and memory and disk are very limited. But for other NLP tasks, the difference may be several hours. So I think it's better to keep it separate, like the pos and analysis modules.

Try select word from FREQ where word like 'FIRSTCHAR%' so that you don't need a separate prefix table. The SQL implementation should be faster than the current brute-force one. I will also work on this when I'm free.
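
A minimal sketch of that query, assuming a `FREQ(word, freq)` table (the schema, the sample rows, and the `words_with_prefix` helper are illustrative, not taken from the patch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FREQ (word TEXT PRIMARY KEY, freq REAL)")
conn.executemany(
    "INSERT INTO FREQ VALUES (?, ?)",
    [("北京", 3.0), ("北京大学", 2.0), ("北", 1.0), ("上海", 4.0)],
)


def words_with_prefix(conn, prefix):
    """Return all words starting with `prefix`, using LIKE instead of a prefix table."""
    # Escape LIKE wildcards so a literal % or _ in the prefix matches as-is.
    escaped = prefix.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    rows = conn.execute(
        "SELECT word FROM FREQ WHERE word LIKE ? ESCAPE '\\' ORDER BY word",
        (escaped + "%",),
    )
    return [r[0] for r in rows]


print(words_with_prefix(conn, "北"))
```

Whether SQLite can answer a LIKE prefix query from the index depends on the column's collation and the `case_sensitive_like` setting, so the speedup would need to be measured rather than assumed.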

Also, don't create a branch every time because it's hard to track and merge changes. I can't pull it in v3.
