Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👌 IMPROVE: Allow for storing Decimal #4964

Merged
merged 5 commits into from
Jul 28, 2021

Conversation

dev-zero
Copy link
Contributor

No description provided.

@dev-zero
Copy link
Contributor Author

@greschd do you remember: when generating the hash, do you do it after a store and load (and on the reloaded data) or before?

@greschd
Copy link
Member

greschd commented May 27, 2021

We don't explicitly go through the store / load loop before hashing -- the expectation is that the hash function is implemented such that it's insensitive to changes that may occur.

@dev-zero dev-zero force-pushed the feature/permit-decimal branch 2 times, most recently from 11eb8c8 to 1a98a9f Compare May 27, 2021 09:56
@chrisjsewell
Copy link
Member

thanks @dev-zero, I see changes to the django backend and migrations, but no accompanying change to the sqlalchemy backend. Is this also required?

and yeh obviously if you could fix the tests; I'll change this to a draft PR whilst you finalise it and let me know when its ready 👍

@chrisjsewell chrisjsewell marked this pull request as draft June 16, 2021 06:43
@dev-zero
Copy link
Contributor Author

@chrisjsewell no, changes to sqlalchemy are not required: for some reason we've been using simplejson for some time there. My guess is that Django was using simplejson at some point and when updating we either didn't notice or we migrated to the builtin json. So at the moment SQLA is using simplejson while Django ORM uses json. The first commit fixes that (and Django needs a migration for that since it includes the column class name in its schema).

My question is how we should do hashing for Decimals: in Python hashes for Decimal('N') with N an int are the same as hashes for int(N) (and similar mechanism is in place for float). Which would make sense in our case as well since atm we'll be storing a Decimal as a string repr (a json float or json int) but when loading we will definitely get an int or a float back. Maybe I should be doing the same thing here: determine whether it is an int and call _single_digest('int', ...) and _single_digest('float', ...) if not? @greschd ?

@chrisjsewell
Copy link
Member

but when loading we will definitely get an int or a float back

Hmm, so you are saying that currently when a Decimal is stored to the database, it is stored and thus returned as an int/float? I guessed the desired behaviour would be to properly serialize/deserialize as a Decimal, but perhaps this is not possible/easy to achieve

@codecov
Copy link

codecov bot commented Jun 16, 2021

Codecov Report

Merging #4964 (9992bf1) into develop (91c1c0b) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #4964      +/-   ##
===========================================
+ Coverage    80.23%   80.24%   +0.01%     
===========================================
  Files          515      515              
  Lines        36746    36758      +12     
===========================================
+ Hits         29478    29491      +13     
+ Misses        7268     7267       -1     
Flag Coverage Δ
django 74.71% <100.00%> (+0.01%) ⬆️
sqlalchemy 73.63% <36.37%> (-<0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...rations/0033_replace_text_field_with_json_field.py 100.00% <100.00%> (ø)
...migrations/0037_attributes_extras_settings_json.py 86.96% <100.00%> (+0.12%) ⬆️
...ckends/djsite/db/migrations/0045_dbgroup_extras.py 100.00% <100.00%> (ø)
...db/migrations/0046_add_node_repository_metadata.py 100.00% <100.00%> (ø)
aiida/backends/djsite/db/models.py 81.51% <100.00%> (+0.11%) ⬆️
aiida/common/hashing.py 94.07% <100.00%> (+0.27%) ⬆️
aiida/common/json.py 78.95% <100.00%> (+1.17%) ⬆️
aiida/orm/implementation/utils.py 100.00% <100.00%> (ø)
aiida/transports/plugins/local.py 81.66% <0.00%> (+0.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 91c1c0b...9992bf1. Read the comment docs.

@dev-zero
Copy link
Contributor Author

but when loading we will definitely get an int or a float back

Hmm, so you are saying that currently when a Decimal is stored to the database, it is stored and thus returned as an int/float?

Yes. But that also applies to the Numpy datatypes: they will as well be deserialized as Python floats/ints.

I guessed the desired behaviour would be to properly serialize/deserialize as a Decimal, but perhaps this is not possible/easy to achieve

Well, yes. But the only way for this would work is by introducing a schema. This can be either integrated in the json-data stored in the db column, or you keep it external. The latter might be easier: a Node subclass could define an attributes_schema (which could be a pydantic schema), which then gets wrapped in a JSONEncoder/JSONDecoder compatible interface, being called when serializing/deserializing. For Django that's straightforward (Django 3.x supports specifying a decoder) to define per column. For SQLA a new column type would be needed.

@dev-zero dev-zero marked this pull request as ready for review June 16, 2021 11:30
@chrisjsewell
Copy link
Member

chrisjsewell commented Jun 16, 2021

Yes. But that also applies to the Numpy datatypes: they will as well be deserialized as Python floats/ints.
Well, yes. But the only way for this would work is by introducing a schema. This can be either integrated in the json-data stored in the db column, or you keep it external.
which then gets wrapped in a JSONEncoder/JSONDecoder compatible interface

Yeh, it just feels a bit "off" to me (and unexpected for users), that we should be allowing data types to be serialized that we essentially do not know how to deserialize correctly. I would think it would be more explicit if you convert to a serializable data type before e.g. setting an attribute.

Proper serialization would be nice, but I would be wary of the performance impact this would have on (de)serialization, and thus interacting with the database.
Indeed, do we know the performance impact of using the simplejson.JSONEncoder instead of the built-in one? (maybe it is faster 🤷)

@greschd
Copy link
Member

greschd commented Jun 16, 2021

Maybe I should be doing the same thing here: determine whether it is an int and call _single_digest('int', ...) and _single_digest('float', ...) if not? @greschd ?

If the DB serialize / deserialize changes the type to int / float then yes, I think this is the right thing to do. The important thing to make sure here is that the hashes remain the same for an object before / after storing.

In general though, I'd be wary to allow decimals if they silently "degrade" (especially in the inexact case of float conversion).

@chrisjsewell chrisjsewell self-assigned this Jul 21, 2021
While SQLA has been using simplejson for some time (via `aiida.common.json.dumps` in `aiida.backends.utils`), the `JSONField` from Django was using the native `json` module from Python (they have been using simplejson at some point). This becomes clear as soon as the decimal.Decimal is allowed which simplejson can natively serialize to JSON while the builtin json module does not.
@dev-zero
Copy link
Contributor Author

@greschd wrt hashing: I am currently calculating hashes for either a float or an int, depending on its string representation (e.g. depending on what type it gets when deserialized). So, it will be consistent.

@dev-zero
Copy link
Contributor Author

Yes. But that also applies to the Numpy datatypes: they will as well be deserialized as Python floats/ints.
Well, yes. But the only way for this would work is by introducing a schema. This can be either integrated in the json-data stored in the db column, or you keep it external.
which then gets wrapped in a JSONEncoder/JSONDecoder compatible interface

Yeh, it just feels a bit "off" to me (and unexpected for users), that we should be allowing data types to be serialized that we essentially do not know how to deserialize correctly. I would think it would be more explicit if you convert to a serializable data type before e.g. setting an attribute.

Proper serialization would be nice, but I would be wary of the performance impact this would have on (de)serialization, and thus interacting with the database.
Indeed, do we know the performance impact of using the simplejson.JSONEncoder instead of the built-in one? (maybe it is faster shrug)

All of the Python JSON libraries claim the performance throne for themselves it seems, with https://github.com/ijl/orjson or https://pypi.org/project/ujson/ being specifically performance tuned.

I've looked a bit more into this, and I think the proper way to go here would be to give the DbNode descendants a way to specify a serialization/deserialization schema (Pydantic?) and while storing the JSON as such in the DB (to allow for performant SQL-level queries on it), still load the numerical data as a string first (or as Decimal) before converting it to the final data type.

Anyway, this is good to go and would simplify my live already quiet a bit.

@chrisjsewell chrisjsewell changed the title feature/permit decimal 👌 IMPROVE: Allow for storing Decimal Jul 28, 2021
@chrisjsewell chrisjsewell merged commit f99f1e8 into aiidateam:develop Jul 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants