Unicode in queries #28
Comments
Is it possible to normalize the data before inserting? I know there is a function called unicodedata.normalize that should help. Then you can query easily with:
db.get(where('name') == 'zir')
Can you provide the full traceback information? (Just copy + paste from your Python interpreter session) |
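For reference, a minimal sketch of what unicodedata.normalize does to the name from this issue. It only uses the standard library; the variable names are illustrative. Note that normalization alone does not turn 'žir' into 'zir': the combining mark also has to be dropped, and that step discards information.

```python
import unicodedata

name = u"\u017eir"  # "žir"

# NFKC keeps the composed letter, so the result still contains U+017E.
nfkc = unicodedata.normalize("NFKC", name)

# NFKD splits the letter into "z" plus a combining caron (U+030C).
nfkd = unicodedata.normalize("NFKD", name)

# Dropping the combining marks is the step that actually yields plain "zir".
stripped = u"".join(ch for ch in nfkd if not unicodedata.combining(ch))

print(nfkc == name)  # True
print(len(nfkd))     # 4
print(stripped)      # zir
```

After this kind of stripping, 'žir' and 'zir' can no longer be told apart in the stored data.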
@zelenikotao Can you please post a full traceback? |
Sorry for not responding earlier, I didn't have any free time over the weekend. @eugene-eeo I have tried unicodedata.normalize, and the result is the same. @eugene-eeo @msiemens, here is the traceback:
$ python example.py
Traceback (most recent call last):
File "example.py", line 13, in <module>
db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
self._update_repr('==', other)
File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)
Here is the test script which causes the exception, |
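The traceback ends in a byte-string template being formatted with a unicode value. On Python 2 that formatting implicitly encodes the value with the ASCII codec. The sketch below makes that encode step explicit so it also runs on Python 3; the variable names are illustrative and this is not the tinydb source.

```python
value = u"\u017eir"  # "žir"

# Python 2 performs this ASCII encode implicitly when a unicode value is
# formatted into a byte-string template; doing it explicitly fails the same way.
try:
    value.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# Formatting into a unicode template needs no encode step, so it succeeds.
text = u"'{0}' {1} {2}".format(u"name", u"==", value)

print(type(error).__name__)  # UnicodeEncodeError
```

The failing character is at position 0, matching the message in the traceback.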
@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:
db.insert({'name': 'žir'})
db.search(where('name') == 'žir')
@eugene-eeo I wouldn't recommend normalizing the data that way as you will lose information. Say you insert both |
@msiemens when I use byte strings, as you've proposed, the db holds a unicode string as the value in the inserted document, and using search raises this warning:
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
self._cmp = lambda value: value == other
Of course, search doesn't return the document I was searching for, only a None value. |
What's the exact code you've used? If I use byte strings for both inserting and searching, it works...
>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]
(Note: the question mark in the |
This is the code I've used:
>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
self._cmp = lambda value: value == other
[]
But I've tried with MemoryStorage as in your example, and it works. Could the problem be somewhere in the file storage handling? |
Could be, I'm investigating. EDIT: This doesn't seem to have a trivial non-hacky solution, I'll work a bit on this. |
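One possible explanation, not confirmed in this thread, is that a JSON round trip hands back text strings even when byte strings went in, so the stored value and the byte-string query no longer compare equal. The sketch below models that with the standard json module on Python 3. The decode step stands in for what Python 2's json does implicitly, and all names are illustrative.

```python
import json

# The byte string a UTF-8 Python 2 session would produce for 'žir'.
stored = u"\u017eir".encode("utf-8")

# Assumption: the document is serialized as text before it is written out.
document = {"name": stored.decode("utf-8")}

# A JSON-backed storage serializes the document and parses it again on read.
loaded = json.loads(json.dumps(document))

# The parsed value is text, so comparing it with the original bytes fails.
matches_bytes = loaded["name"] == stored
matches_text = loaded["name"] == stored.decode("utf-8")

print(matches_bytes, matches_text)  # False True
```

An in-memory storage that never serializes keeps the original byte string, which would be consistent with MemoryStorage working here.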
@msiemens I think you should read this as well: http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156
UPDATE: It works:
>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>> |
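The same round trip can be sketched with the standard library json module in place of ujson (a substitution made here so the example needs no third-party package). It shows that ensure_ascii only changes how the text is written, not what is read back.

```python
import json

data = {"name": u"\u00e5lpha"}  # "ålpha"

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# ensure_ascii=False: non-ASCII characters are written as-is.
raw = json.dumps(data, ensure_ascii=False)

print(escaped)  # {"name": "\u00e5lpha"}

# Both forms parse back to the same dictionary.
print(json.loads(escaped) == json.loads(raw) == data)  # True
```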
I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included. @zelenikotao Could you test if it works in the latest development version? |
Tested it, works great for me! |
Thanks for reporting! |
When testing a db field against a unicode string, a UnicodeEncodeError exception is raised.
Line which causes the exception:
Inserting unicode data went without problems:
I have made a quick hack which fixes the problem for my little hobby project, but I will examine this problem more when I find time.
In queries.py, I've changed the Query._update_repr function body to:
and Query.__hash__ to:
Basically adding the string prefix "u" in _update_repr, and a "unicode" call in __hash__...
Using tinydb from git on python 2.7.6, ubuntu 14.04