Version 4 sqlite memory usage patch #224

mraygalaxy · 2015-01-24T04:40:08Z

Hello,

This version addresses comments from #223

Fallback to default in-memory cache if sqlite3 is not found
Use more "replace into".
Find the optimal transaction size and sqlite memory cache size with binary search, to speed up building a new DB (about 14 seconds)
Use large file test with multiple lines instead of one single line
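
For readers following along, here is a minimal sketch of the first two points above (falling back when sqlite3 is missing, and using "replace into"). It is not the code from this patch: the function name `build_freq_store`, the table name `freq`, and the column layout are assumptions made for illustration.

```python
try:
    import sqlite3
except ImportError:  # e.g. a Python build without the sqlite3 module
    sqlite3 = None


def build_freq_store(pairs, db_path=":memory:"):
    """Store (word, freq) pairs in SQLite if available, else in a plain dict.

    Hypothetical helper: the name and schema are illustrative only.
    """
    if sqlite3 is None:
        # Fallback: the default in-memory cache is just a dictionary.
        return dict(pairs)
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS freq (word TEXT PRIMARY KEY, freq REAL)"
    )
    with conn:  # one transaction for the whole batch, committed on exit
        # REPLACE INTO overwrites an existing row with the same primary key,
        # so no separate "does this word exist?" query is needed.
        conn.executemany("REPLACE INTO freq (word, freq) VALUES (?, ?)", pairs)
    return conn
```

Batching all rows into one transaction is also the knob the binary search above would be tuning; the batch size here is simply "everything at once".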

Here are the results for memory:

mrhines@mahler:~/mica-android/mica$ rm /tmp/jieba.*; util/jieba.py
Building prefix dict from /home/mrhines/mica-android/mica/jieba/jieba/dict.txt ...
Using model from cache /tmp/jieba.u30de3b79a89ddfd331486dee490ffa50.db
Making first-pass over sqlite cache...
Normalizing sqlite frequencies...
Loading model cost 14.6021559238 seconds.
Memory usage: 12.85546875 MB

Here are the large-file test results:

mrhines@mahler:~/mica-android/mica/jieba/test$ python ./test_file_sqlite.py /large_400K_file.txt
Building prefix dict from /home/mrhines/mica-android/mica/jieba/jieba/dict.txt ...
Using model from cache /tmp/jieba.u30de3b79a89ddfd331486dee490ffa50.db
cost 7.74777913094
speed 58807.5617929 bytes/second
mrhines@mahler:/mica-android/mica/jieba/test$ python ./test_file.py ~/large_400K_file.txt
Building prefix dict from /usr/local/lib/python2.7/dist-packages/jieba/dict.txt ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.17606186867 seconds.
Prefix dict has been built succesfully.
cost 0.61910200119
speed 735949.809763 bytes/second

As usual, comments welcome =)

… dynamically without bundling it into the application. We do not want any large files like 'dict.txt' pre-packaged into the application, so we need to skip checking the age of this file; the mobile application itself will instead handle when new sqlite databases are sent to it.

mraygalaxy · 2015-02-08T15:22:41Z

Ping?

gumblex · 2015-02-09T10:05:56Z

TTL expired...

In my opinion, this patch should be placed in a separate py file. When required, call a function that imports it and overrides the API, so that the SQLite dependency won't affect the package's portability.
In an Android app (e.g. your MICA), speed is not that important compared to memory and disk usage, because the text is short and memory and disk are very limited. But for other NLP tasks, the difference may be several hours. So I think it's better to keep it separate, like the pos and analysis modules.
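
A rough sketch of that opt-in pattern, assuming nothing about the real package layout: `core` is a stand-in namespace for the package, and `lookup` and `enable_sqlite_backend` are hypothetical names. The point is only that sqlite3 is imported inside the function, so users who never call it never need the dependency.

```python
import types

# Stand-in for the core package (hypothetical); by default it uses a
# plain in-memory dictionary and never touches sqlite3.
core = types.SimpleNamespace()
core.backend = "memory"
core.lookup = lambda word: {"hello": 3}.get(word, 0)


def enable_sqlite_backend(pkg, db_path=":memory:"):
    """Replace pkg.lookup with an SQLite-backed version (illustrative only)."""
    import sqlite3  # imported lazily, only when the caller opts in

    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS freq (word TEXT PRIMARY KEY, freq INTEGER)"
    )
    conn.execute("REPLACE INTO freq VALUES ('hello', 3)")
    conn.commit()

    def lookup(word):
        row = conn.execute(
            "SELECT freq FROM freq WHERE word = ?", (word,)
        ).fetchone()
        return row[0] if row else 0

    # Override the public API in place; callers keep using pkg.lookup.
    pkg.lookup = lookup
    pkg.backend = "sqlite"
```

Calling `enable_sqlite_backend(core)` swaps the implementation; without that call the package behaves exactly as before.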

Try select word from FREQ where word like 'FIRSTCHAR%' so that you don't need a separate prefix table. The SQL implementation should be faster than the current brute-force one. I will also work on this when I'm free.
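
The suggested query can be sketched as below. The `FREQ` table and `word` column come from the comment above; the `freq` column, the sample rows, and the helper name `words_starting_with` are assumptions for illustration. Note that `%` and `_` are wildcards in LIKE, so the sketch escapes them in the input.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FREQ (word TEXT PRIMARY KEY, freq INTEGER)")
conn.executemany(
    "INSERT INTO FREQ VALUES (?, ?)",
    [("北京", 10), ("北京大学", 5), ("北", 3), ("大学", 7)],
)


def words_starting_with(conn, first_char):
    """Return all words in FREQ that begin with first_char."""
    # Escape LIKE wildcards so '%' or '_' in the input match literally.
    escaped = (
        first_char.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    )
    rows = conn.execute(
        "SELECT word FROM FREQ WHERE word LIKE ? ESCAPE '\\'",
        (escaped + "%",),
    )
    return [row[0] for row in rows]
```

With the sample rows, `words_starting_with(conn, "北")` returns the three words beginning with that character, with no separate prefix table involved. Whether this beats the prefix-table approach in practice would need measuring.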

Also, don't create a branch every time because it's hard to track and merge changes. I can't pull it in v3.

The SQLite cache based on your version 4

Michael R. Hines added 2 commits January 24, 2015 12:27

addressing additional comments

3ae3434

gumblex and others added 8 commits February 10, 2015 17:51

Merge upstream with the sqlite version 4

07cdc8b

autopep8

27d29a0

use sqlitecache as external module

858a8fe

merge upstream

0183721

fix various errors

ecbda6e

override get_DAG

2b54b17

fix global vars

f842ad1

Merge pull request #1 from gumblex/sqlite

3e93e32
