Unicode in queries #28

djordjeglbvc · 2014-09-11T14:37:24Z

When querying a db field with a unicode string, a UnicodeEncodeError exception is raised.
The line which causes the exception:

db.get(where('name') == u'žir')

Inserting unicode data went without problems:

db.insert({'name': 'žir'})

I have made a quick hack which fixes the problem for my little hobby project, but I will examine it further when I find time.

In queries.py, I've changed the body of Query._update_repr to:

self._repr = u'\'{0}\' {1} {2}'.format(self._key, operator, value)

and Query.__hash__ to:

return hash(repr(unicode(self)))

Basically, this adds the string prefix "u" in _update_repr and a "unicode" call in __hash__...

Using tinydb from git on Python 2.7.6, Ubuntu 14.04
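
A minimal sketch of why the "u" prefix matters, assuming Python 2 semantics: when a byte-string template is formatted with a unicode argument, Python 2 in effect encodes that argument with the ASCII codec. The snippet below spells out that encoding step explicitly so it behaves the same on Python 2 and Python 3. The names key, operator and value are stand-ins for the arguments of _update_repr, not TinyDB's actual code.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only; the names below are stand-ins, not TinyDB internals.
key, operator, value = u'name', u'==', u'\u017eir'  # u'\u017eir' is u'žir'

# Python 2 performs this encoding implicitly when a byte-string template
# receives a unicode argument; here it is written out explicitly.
try:
    value.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc  # same error class as in the reported traceback

# With a unicode template no ASCII encoding takes place, so formatting succeeds.
query_repr = u'\'{0}\' {1} {2}'.format(key, operator, value)
```

After running this, error holds the UnicodeEncodeError and query_repr holds the formatted text with the non-ASCII character intact.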

eugene-eeo · 2014-09-12T12:47:54Z

Is it possible to normalize the data before inserting? There is a function called unicodedata.normalize that should help. Then you can query easily with:

db.get(where('name') == 'zir')

Can you provide the full traceback information? (Just copy + paste from your Python interpreter session)

msiemens · 2014-09-15T02:38:49Z

@zelenikotao Can you please post a full traceback?

djordjeglbvc · 2014-09-15T08:49:21Z

Sorry for not responding earlier, I didn't have any free time over the weekend.

@eugene-eeo I have tried unicodedata.normalize; the result is the same.

@eugene-eeo @msiemens, here is the traceback:

$ python example.py 

Traceback (most recent call last):
  File "example.py", line 13, in <module>
    db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
    self._update_repr('==', other)
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
    self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)

Here is the test script which causes the exception:
https://gist.github.com/zelenikotao/b23d79edc80bcea3b511.js

msiemens · 2014-09-16T00:15:06Z

@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:

db.insert({'name': 'žir'})
db.search(where('name') == 'žir')

@eugene-eeo I wouldn't recommend normalizing the data that way, as you will lose information. Say you insert both {'name': 'zir'} and {'name': 'žir'}: TinyDB will regard them as equal while they probably shouldn't be.
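
A small sketch of the information loss described above. The helper strip_accents is hypothetical and written only for this illustration; it is not part of TinyDB or of the unicodedata module. It also shows that unicodedata.normalize on its own leaves the string non-ASCII, which would fit the report that normalizing did not change the result.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose the text, then drop combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


name = u'\u017eir'  # u'žir'

# NFKC keeps the composed character, so the string is still non-ASCII.
still_unicode = unicodedata.normalize('NFKC', name) == name

# Stripping accents makes two different names indistinguishable.
collides = strip_accents(name) == strip_accents(u'zir')
```

Both still_unicode and collides end up True: normalization alone does not produce ASCII, and removing the accent merges two distinct values.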

djordjeglbvc · 2014-09-16T09:55:10Z

@msiemens when I use byte strings, as you've proposed, the db holds a unicode string as the value in the inserted document, and using search raises this warning:

/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other

Of course, search doesn't return the document I was searching for, only a None value.

msiemens · 2014-09-16T10:57:46Z

What's the exact code you've used? If I use byte strings for both inserting and searching, it works...

>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]

(Note: the question mark in the db.search result is caused by the Windows CMD terminal; it shouldn't be a bug in TinyDB.)

djordjeglbvc · 2014-09-16T11:05:18Z

This is the code I've used:

>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
[]

But I've tried with MemoryStorage as in your example, and it works. Could the problem be somewhere in the file storage handling?
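
The symptom would be consistent with the value changing type on its way through the JSON file. The sketch below uses the standard json module as a stand-in for the file storage, which is an assumption: the actual storage code is not shown in this thread. It compares a value that went through a JSON round trip with the raw UTF-8 bytes that a Python 2 literal 'žir' would hold in a UTF-8 source file.

```python
# -*- coding: utf-8 -*-
import json

# The bytes a Python 2 literal 'žir' holds when the source file is UTF-8.
query_bytes = u'\u017eir'.encode('utf-8')

# Round trip through JSON text, as a file-backed storage might do (assumption).
stored = json.loads(json.dumps({u'name': u'\u017eir'}))[u'name']

# The loaded value is a text (unicode) string, so it never equals the raw bytes.
# Python 2 emits a UnicodeWarning here and treats the two as unequal.
matches_bytes = (stored == query_bytes)

# Decoding the query to text first makes the comparison succeed.
matches_text = (stored == query_bytes.decode('utf-8'))
```

Here matches_bytes is False and matches_text is True. A purely in-memory storage would skip the round trip, which could explain why MemoryStorage behaves differently.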

msiemens · 2014-09-16T11:15:06Z

Could be, I'm investigating.

EDIT: This doesn't seem to have a trivial non-hacky solution; I'll work on it a bit.

eugene-eeo · 2014-09-16T13:46:42Z

@msiemens I think you should read this as well: http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156

UPDATE: It works:

>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>>
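
For comparison, a rough equivalent with the standard library json module rather than ujson. This is only a sketch of the ensure_ascii option and does not claim to show how TinyDB serializes documents.

```python
# -*- coding: utf-8 -*-
import json

doc = {u'name': u'\u00e5lpha'}  # u'ålpha'

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(doc)

# With ensure_ascii=False the characters are kept as they are.
raw = json.dumps(doc, ensure_ascii=False)

# Either form loads back to the same document.
same_after_loading = json.loads(escaped) == json.loads(raw) == doc
```

Both serialized forms decode to the same unicode value, so the option changes only how the text looks on disk, not what is read back.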

msiemens · 2014-09-17T13:50:35Z

I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included.

@zelenikotao Could you test if it works in the latest development version?

djordjeglbvc · 2014-09-17T14:29:48Z

Tested it, works great for me!
Thanks!

msiemens · 2014-09-17T14:31:41Z

Thanks for reporting!

eugene-eeo pushed a commit to eugene-eeo/tinydb that referenced this issue Sep 15, 2014

bugfix for msiemens#28

eb20f8d

eugene-eeo mentioned this issue Sep 15, 2014

bugfix for #28 #29

Closed

msiemens closed this as completed Sep 17, 2014
