Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some client JSON parsers don't support 64-bit integers, potentially causing truncation #50

Closed
peterdutey opened this issue May 31, 2023 · 14 comments

Comments

@peterdutey
Copy link

Hello

First I'd like to say this is a much-needed library that's both incredibly fast and implements recent operators of the ECL.
I'm enjoying learning to use it!

I just came across a behaviour I don't quite understand for concepts in the UK drug extension, which have longer identifiers.
This first happened on my laptop and it seems the behaviour is similar using your helpful test server.
I do apologise in advance if the explanation should be obvious - I seem to be observing a change in the last digit of the concept ID of UK medical products. The two example below seem to affect all the REST endpoints I've tested.

I am wondering whether I'm missing something. The data stored seems valid, it's just the REST response doesn't match.
Curious to understand what is happening with the retrieval.
Thanks!
Peter

http://3.9.221.177:8080/v1/snomed/search?constraint=9401401000001101

[
  {
    "id": 47770001000001110,
    "conceptId": 9401401000001100,
    "term": "Zoladex",
    "preferredTerm": "Zoladex"
  }
]

You can see the concept id 9401401000001101 reads 9401401000001100

http://3.9.221.177:8080/v1/snomed/concepts/38001711000001104/preferred

{
  "id": 653951000001114,
  "effectiveTime": "2020-01-22",
  "active": true,
  "moduleId": 999000011000001200,
  "conceptId": 38001711000001100,
  "languageCode": "en",
  "typeId": 900000000000013000,
  "term": "Leuprorelin 10.72mg implant pre-filled syringes 1 pre-filled disposable injection",
  "caseSignificanceId": 900000000000017000
}

You can see the concept id 38001711000001104 reads 38001711000001100

@wardle
Copy link
Owner

wardle commented May 31, 2023

Hi. Thanks for getting in touch and really lovely to get nice feedback.

On my phone I can't reproduce the issue.image

I'll do some digging tomorrow but May I ask which browser or tool you are using to make the HTTP call? Could it be tripping up with the very long numbers in JSON?

Mark

@wardle
Copy link
Owner

wardle commented May 31, 2023

And for the second example image

@wardle
Copy link
Owner

wardle commented May 31, 2023

I wonder if you've found a bug in your http / json client library but I will double check my end just in case! Would be helpful to know if you get a different result if you access from a different client/browser?

@wardle
Copy link
Owner

wardle commented Jun 1, 2023

It looks as if some client APIs (notably JavaScript) struggle with large numbers. While JSON specs support arbitrary sizes, there are limits based on safe interoperability.

Options

  1. Change default to return SNOMED identifiers as text instead and avoid any precision errors in client implementation.
  2. This is a client bug- get it fixed there.
  3. Add option to return ids as strings on a per-request basis.

I'm inclined to (3) because it would be easy, would support users on broken clients who don't have capability to fix the issue, and wouldn't introduce a breaking change for clients who have no difficulty in processing JSON with unsigned 64 bit integers.

@wardle
Copy link
Owner

wardle commented Jun 1, 2023

In favour of (2):

  1. The spec for JSON doesn't stipulate a max size at all, the format is just any number of digits... https://www.json.org/json-en.html ... so a failure of handling is one in the client library implementation or client platform.
  2. This issue doesn't apply to EDN... you can request data in that format using a Content-Type header in the request. EDN parsing libraries are available for a range of platforms.
  3. This issue does not apply to many client platforms (Java, Python) which handle unsigned 64 bit integers with no difficulty.

It will be informative to know how snowstorm and ontoserver handle this issue.

@wardle
Copy link
Owner

wardle commented Jun 1, 2023

It appears that snowstorm defaults to return ids as text.

@peterdutey peterdutey changed the title UK drug extension Long concept identifiers losing precision in JSON Jun 1, 2023
@peterdutey
Copy link
Author

Hi Mark
This is very interesting - I learnt something here. I should have realised this alone but really appreciate the help!
Until now, I was using snowstorm with text identifiers, and I never stumbled across this behaviour of JSON reverting to atomic types.
What you saw earlier is taken from the JSON viewer in firefox - the raw output is unaffected.
But I originally faced the problem in R. The jsonlite library which has its own mapping going on (still reading about it), but ultimately translates into the same behaviour as the Firefox viewer. There is a quick fix with this using the simplifyVector argument (should others read this!).

So while this has nothing to do with Hermes (and I agree there isn't a case for changing anything) I suspect other users will inadvertently run into this issue and will be happy to read this thread.
I took the liberty to update the name of the issue so others find it more easily in the future.
Obviously please do edit as you see fit before closing this issue.
Many thanks for the speedy response and helping track down the problem!
P

@wardle
Copy link
Owner

wardle commented Jun 1, 2023

The issue that concerns me is that there is silent data loss for some clients using JSON as the transport, even though admittedly this is a client problem.

If there is a configurable per-request option, I could generate strings for identifiers instead of numbers in the JSON response, and then the issue a) is well documented and b) has a workaround baked in and c) mimics behaviour of other SNOMED servers. The counter-argument is that it adds complexity to both the surface API and the implementation.

I will ponder for a bit before deciding. Does your quick fix work from R?

@wardle wardle changed the title Long concept identifiers losing precision in JSON Some client JSON parsers (e.g. JavaScript) support only 53-bit integers, potentially causing truncation Jun 1, 2023
@wardle wardle changed the title Some client JSON parsers (e.g. JavaScript) support only 53-bit integers, potentially causing truncation Some client JSON parsers don't support 64-bit integers, potentially causing truncation Jun 1, 2023
@peterdutey
Copy link
Author

Yes very much a client problem - type conversion being ubiquitous in R it seems the usual workarounds (using bit64 or character type casting) when loading data from database or from file directly are not operational here.
Certainly having an option to retrieve character identifiers would be nice, but I'm trying to find a workaround that might avoid this.
I may need a couple of days before I can dedicate the time to properly get to the bottom of it - I'll get back to you.

@wardle
Copy link
Owner

wardle commented Jun 1, 2023

The patch below adds per-request support for stringifying values that are a) of type 'long' and b) an 'id' property (ie property name ends in 'Id'. Another alternative is to add an option to the server (on startup) so that this is a per-server configuration rather than a per-request option. The per-request option would then override whatever the default setting would be in the running server. As you can see, it is quite trivial to add this functionality. However, there are situations in which this won't appropriately stringify results - such as when there are extended fields within a reference set item. Arguably that could be made to work, by looking at the refset descriptors carefully, but adds more complexity.

diff --git a/cmd/com/eldrix/hermes/cmd/server.clj b/cmd/com/eldrix/hermes/cmd/server.clj
index eae5f30..2d11505 100644
--- a/cmd/com/eldrix/hermes/cmd/server.clj
+++ b/cmd/com/eldrix/hermes/cmd/server.clj
@@ -40,6 +40,8 @@
 (def ok (partial response 200))
 (def not-found (partial response 404))
 
+(defn parse-flag [s] (case s ("1" "true") true false))
+
 (defn accepted-type
   [ctx]
   (get-in ctx [:request :accept :field] "application/json"))
@@ -51,27 +53,33 @@
 
 (extend LocalDate json/JSONWriter {:-write write-local-date})
 
+(defn stringify-ids [k v]
+  (if (and (.endsWith (name k) "Id") (instance? Long v)) (str v) v))
+
 (defn transform-content
-  [body content-type]
+  [body content-type stringify-ids?]
   (case content-type
     "text/html" body
     "text/plain" body
     "application/edn" (.getBytes (pr-str body) "UTF-8")
-    "application/json" (.getBytes (json/write-str body) "UTF-8")))
+    "application/json" (.getBytes (if stringify-ids?
+                                    (json/write-str body {:value-fn stringify-ids})
+                                    (json/write-str body)) "UTF-8")))
 
 (defn coerce-to
-  [resp content-type]
+  [resp content-type stringify-ids?]
   (-> resp
-      (update :body transform-content content-type)
+      (update :body transform-content content-type stringify-ids?)
       (assoc-in [:headers "Content-Type"] content-type)))
 
 (def coerce-body
   {:name ::coerce-body
    :leave
    (fn [ctx]
-     (if (get-in ctx [:response :headers "Content-Type"])
-       ctx
-       (update ctx :response coerce-to (accepted-type ctx))))})
+     (let [stringify-ids? (parse-flag (get-in ctx [:request :params :stringifyIds]))]
+       (if (get-in ctx [:response :headers "Content-Type"])
+         ctx
+         (update ctx :response coerce-to (accepted-type ctx) stringify-ids?))))})
 
 (defn inject-svc
   "A simple interceptor to inject terminology service 'svc' into the context."
@@ -109,9 +117,6 @@
     :else
     (assoc ctx :io.pedestal.interceptor.chain/error err)))
 
-
-(defn parse-flag [s] (case s ("1" "true") true false))
-
 (def get-concept
   {:name  ::get-concept
    :enter (fn [{::keys [svc] :as ctx}]

@peterdutey
Copy link
Author

I think I found the workaround - undocumented argument of jsonlite (bigint_as_char) I had forgotten means that all long integers found in the JSON get read as characters by R.

As far as I'm concerned this works for me.
I'm conscious this still means this constitutes a bit of a trap and that many of us will fall into it!

request <- httr::parse_url(
  paste0("http://localhost:8080/v1/snomed/search?constraint=9401401000001101") 
) |> 
  httr::GET() |> 
  httr::content(as = "text", encoding = "UTF-8")

results <- request |> 
  jsonlite::fromJSON(flatten = TRUE, bigint_as_char = TRUE)

results$conceptId == "9401401000001101"
> [1] TRUE

wardle added a commit that referenced this issue Jun 1, 2023
@wardle
Copy link
Owner

wardle commented Jun 1, 2023

Thanks @peterdutey - that's helpful to know. I have added a warning in the documentation for the moment, and keep this issue open for a while.

On a related issue, while I have not used R since my doctorate days, I wonder whether you had a view on a rjava-based library for Hermes so that you can use SNOMED CT in-process and avoid even local HTTP roundtrips? This is how I use in my own analytics from Clojure... although could switch to using an external server if I needed to do so. Hermes has a Java-friendly API already so that it should be easy to run from R? In that case, you'd have no issues with type coercions as you'd not serialise via JSON at all.

@peterdutey
Copy link
Author

Yes! I have been meaning to email you about this for several months about this... Email coming now!

@wardle
Copy link
Owner

wardle commented Jun 9, 2023

I am going to close this issue, but happy to consider re-opening. This is principally a client JSON decoding problem, and it sounds as if there are mitigations possible - e.g. in JavaScript or in R - as those libraries have recognised their problems with large numbers. The JSON specification does not limit numbers to a certain size, although I do recognise the warning re: practical interoperability based on implementation limits. The SNOMED specification defines identifiers to be unsigned 64-bit integers; it is reasonable to use strings or numbers representations. While it is straightforward to add a per-server or per-request option to turn identifiers into strings in the JSON output, it adds complexity when in fact most clients will have no trouble at all in dealing with 64-bit integers.

@wardle wardle closed this as completed Jun 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants