-
Notifications
You must be signed in to change notification settings - Fork 4
Damerau-Levenshtein Distance UDF for MySQL - Supports upper bounding for fast searching and UTF-8 case insensitive throught iconv.
License
ifsnop/damlev
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Damerau-Levenshtein Distance UDF for MySQL Supports upper bounding for fast searching and UTF-8 case insensitive throught iconv. Copyright (C) 2013 Diego Torres <diego dot torres at gmail dot com> Implementing https://github.com/torvalds/linux/blob/8a72f3820c4d14b27ad5336aed00063a7a7f1bef/tools/perf/util/levenshtein.c Redistribute as you wish, but leave this information intact. Compiling Run make. It will try to autodetect architecture and generate 32 or x86_64 bit version. Enable testing with "make debug". An executable instead of a library will be built. Execute it to run some tests regarding memory allocation and levenshtein core calculations. Errors you may get when compiling: In file included from ./src/damlevlim.c:16:0: ./src/damlevlim.h:26:23: fatal error: my_global.h: No such file or directory #include <my_global.h> ^ compilation terminated. Makefile:30: recipe for target 'mysqldamlevlim' failed make: *** [mysqldamlevlim] Error 1 Development libraries from mysql are missing. Do "sudo apt-get install libmysqlclient-dev" /usr/include/gnu/stubs.h:7:27: fatal error: gnu/stubs-32.h: No such file or directory You're missing the 32 bit libc dev package. This is answered here: http://stackoverflow.com/questions/7412548/gnu-stubs-32-h-no-such-file-or-directory - On Ubuntu it's called libc6-dev-i386 - do "sudo apt-get install" libc6-dev-i386. See below for extra instructions for Ubuntu 12.04. - On Red Hat distros, the package name is glibc-devel.i686 (Thanks to David Gardner's comment) - On CentOS 5.8, the package name is glibc-devel.i386 (Thanks to JimKleck's comment) - On CentOS 6.3, the package name is glibc-devel.i686. - On SLES it's called glibc-devel-32bit - do zypper in glibc-devel-32bit Loading UDFs in MySQL STEP 1 Copy compiled udf to plugin dir. To know where exactly, ask MySQL: mysql> SHOW VARIABLES LIKE 'plugin_dir'; and use result directory. To modify plugin_dir to point to a custom location: my.cnf: plugin_dir = /usr/lib/mysql/plugin Don't forget to restart MySQL!! Loading UDFs in MySQL STEP 2 Bind the library to an UDF inside MySQL (only do it once, survives MySQL reboot). Depending your MySQL binaries (32 or 64 bits), insert the right library. mysql> DROP FUNCTION IF EXISTS damlevlim; mysql> CREATE FUNCTION damlevlim RETURNS INT SONAME 'mysqldamlevlim.so'; Check if UDF was loaded OK. mysql> SELECT * FROM mysql.func; +-----------+-----+-------------------+----------+ | name | ret | dl | type | +-----------+-----+-------------------+----------+ | damlevlim | 2 | mysqldamlevlim.so | function | +-----------+-----+-------------------+----------+ 1 row in set (0.00 sec) Using damlevlim mysql> select damlevlim(); ERROR: DAMLEVLIM() requires three arguments (string, string, int) First string is compared against second string. Compare only the first int chars (upper bound). mysql> select damlevlim('coche', 'çóche', 5); +--------------------------------+ | damlevlim('coche', 'çòche', 5) | +--------------------------------+ | 0 | +--------------------------------+ 1 row in set (0.00 sec) Further reading about MySql, UTF8 and charset encoding http://www.oreillynet.com/onlamp/blog/2006/01/turning_mysql_data_in_latin1_t.html How to search for multibyte characters in a column Use MySql CHAR_LENGTH function to find rows with multi-byte characters: mysql> SELECT name FROM users WHERE LENGTH(name) != CHAR_LENGTH(name); How to search for strange multibyte characters in a column (don't remember source of original tip) Use MySql HEX and PHP bin2hex functions mysql> SELECT name, HEX(name) FROM users; Get the result back into PHP, and run a bin2hex on the string, compare it to MySQL’s hex of that same string.
About
Damerau-Levenshtein Distance UDF for MySQL - Supports upper bounding for fast searching and UTF-8 case insensitive throught iconv.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published