Unicode in queries #28

Closed
djordjeglbvc opened this issue Sep 11, 2014 · 12 comments

Comments

@djordjeglbvc

When querying a db field with a unicode string, a UnicodeEncodeError exception is raised.
The line which causes the exception:

db.get(where('name') == u'žir')

Inserting unicode data went without problems:

db.insert({'name': 'žir'})

I have made a quick hack which fixes the problem for my little hobby project, but I will examine this problem more when I find time.

In queries.py, I've changed Query._update_repr function body to:

self._repr = u'\'{0}\' {1} {2}'.format(self._key, operator, value)

and Query.__hash__ to:

return hash(repr(unicode(self)))

Basically adding the string prefix "u" in _update_repr, and a "unicode" call in __hash__...

Using tinydb from git on Python 2.7.6, Ubuntu 14.04

@eugene-eeo
Contributor

Is it possible to normalize the data before inserting? I know there is a function called unicodedata.normalize that should help. Then you can query easily with:

db.get(where('name') == 'zir')

Can you provide the full traceback information? (Just copy + paste from your Python interpreter session)

@msiemens
Owner

@zelenikotao Can you please post a full traceback?

@djordjeglbvc
Author

Sorry for not responding earlier, I didn't have any free time over the weekend.

@eugene-eeo I have tried with unicodedata.normalize, and the result is the same.

@eugene-eeo @msiemens, here is the traceback:

$ python example.py 

Traceback (most recent call last):
  File "example.py", line 13, in <module>
    db.get(where('name') == unicodedata.normalize('NFKC', u'žir'))
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 184, in __eq__
    self._update_repr('==', other)
  File "/usr/local/lib/python2.7/dist-packages/tinydb/queries.py", line 310, in _update_repr
    self._repr = '\'{0}\' {1} {2}'.format(self._key, operator, value)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u017e' in position 0: ordinal not in range(128)

Here is the test script which causes the exception:
https://gist.github.com/zelenikotao/b23d79edc80bcea3b511.js
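
The traceback points at a byte-string template being formatted with a unicode value, which on Python 2 triggers an implicit ASCII encode of that value. Below is a minimal sketch of the same step made explicit, written so that it also runs on Python 3. `describe_encoding` is a hypothetical helper for illustration and is not part of TinyDB.

```python
def describe_encoding(text, codec):
    """Try to encode text with codec; return the bytes or the error's class name."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


name = u'\u017eir'  # u'žir'

# The step Python 2 performs implicitly when a byte-string template gets unicode.
ascii_result = describe_encoding(name, 'ascii')

# An explicit UTF-8 encode succeeds; U+017E becomes the two bytes C5 BE.
utf8_result = describe_encoding(name, 'utf-8')

# Keeping the template itself unicode avoids the encode step altogether.
rendered = u"'{0}' {1} {2}".format(u'name', u'==', name)
```

This matches the effect of the "u" prefix in the quick hack above: with a unicode template, no conversion to ASCII is attempted.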

eugene-eeo pushed a commit to eugene-eeo/tinydb that referenced this issue Sep 15, 2014
@msiemens
Owner

@zelenikotao You've mixed up unicode strings and byte strings. It should work if you use byte strings only, e.g.:

db.insert({'name': 'žir'})
db.search(where('name') == 'žir')

@eugene-eeo I wouldn't recommend normalizing the data that way as you will lose information. Say you insert both {'name': 'zir'} and {'name': 'žir'}, TinyDB will regard them as equal while they probably shouldn't be.
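
To illustrate the information loss, here is a small sketch. It assumes the normalization in question also strips combining marks, since `unicodedata.normalize` on its own keeps 'ž' distinct from 'z'. `strip_diacritics` is a hypothetical helper, not something TinyDB provides.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose text (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return u''.join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = u'zir'
accented = u'\u017eir'  # u'žir'

# Plain normalization keeps the two names distinct.
still_distinct = unicodedata.normalize('NFKC', accented) != plain

# After stripping diacritics the two names collide.
collide = strip_diacritics(accented) == strip_diacritics(plain)
```

Once the marks are stripped, a query for one name would also match the other.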

@djordjeglbvc
Author

@msiemens when I use byte strings, as you've proposed, the db holds a unicode string as the value in the inserted document, and using search raises this warning:

/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other

Of course, search doesn't return the document I was searching for, only a None value.

@msiemens
Owner

What's the exact code you've used? If I use byte strings for both inserting and searching, it works...

>>> from tinydb import TinyDB, where
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
[{'name': '?ir'}]

(Note: the question mark in the db.search result is caused by the Windows CMD terminal; it shouldn't be a bug in TinyDB.)

@djordjeglbvc
Author

This is the code I've used:

>>> from tinydb import TinyDB, where
>>> db = TinyDB('db.json')
>>> db.insert({'name': 'žir'})
1
>>> db.search(where('name') == 'žir')
/usr/local/lib/python2.7/dist-packages/tinydb/queries.py:183: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
  self._cmp = lambda value: value == other
[]

But I've tried with MemoryStorage as in your example, and it is working. Could the problem be somewhere in the file storage handling?
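
One possible explanation, offered as a sketch rather than a diagnosis: a JSON round trip returns text strings, so a value that was inserted as UTF-8 bytes no longer compares equal to the byte string used in the query. The snippet below imitates that with the standard `json` module on Python 3. The variable names are illustrative and no TinyDB code is involved.

```python
import json

# UTF-8 bytes of 'žir', which is what a Python 2 byte-string literal would hold.
query_bytes = b'\xc5\xbeir'

# Simulate writing the value to a JSON file and reading it back.
stored = json.loads(json.dumps({'name': query_bytes.decode('utf-8')}))

# The JSON layer hands back text, not bytes.
value = stored['name']

# Text and bytes do not compare equal (Python 2 additionally emits a UnicodeWarning).
mismatch = value != query_bytes

# Decoding the query to text first makes the comparison succeed.
match = value == query_bytes.decode('utf-8')
```

An in-memory storage that keeps the original object untouched would not show this mismatch, which is consistent with the MemoryStorage result above.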

@msiemens
Owner

Could be, I'm investigating.

EDIT: This doesn't seem to have a trivial non-hacky solution, I'll work a bit on this.

@eugene-eeo
Contributor

@msiemens I think you should read this as well http://stackoverflow.com/questions/11759070/python-json-loads-dumps-break-unicode#11759156

UPDATE: It works:

>>> from ujson import dumps, loads
>>> d = dumps({"name": "ålpha"}, ensure_ascii=False)
>>> d
'{"name":"\xc3\xa5lpha"}'
>>> loads(d)
{u'name': u'\xe5lpha'}
>>> 
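
The same round trip can be checked with the standard library's `json` module, whose `dumps` also accepts `ensure_ascii`. This is a short sketch on Python 3, where all strings are text, so the reprs differ from the Python 2 session above.

```python
import json

record = {'name': '\u00e5lpha'}  # 'ålpha'

# Default: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(record)

# ensure_ascii=False: non-ASCII characters are kept as-is in the output.
raw = json.dumps(record, ensure_ascii=False)

# Both encodings decode back to the same record.
round_trip_ok = json.loads(escaped) == json.loads(raw) == record
```

Either form loads back to the same data; the flag only changes how the characters appear in the serialized text.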

@msiemens
Owner

I was wrong, there is a trivial solution, see 6b518b8. Test cases for unicode data included.

@zelenikotao Could you test if it works in the latest development version?

@djordjeglbvc
Author

Tested it, works great for me!
Thanks!

@msiemens
Owner

Thanks for reporting!


3 participants