-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace Klaxon with kotlinx-serialization #603
Conversation
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/json.kt
Outdated
Show resolved
Hide resolved
What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats |
@devcrocod you wrote that this is a first step, what could be the next step? |
Check related issue |
By default, klaxon translates all floating-point numbers into Double. kotlinx-serialization can work with float. So, the question here is not what DataFrame can do, but what klaxon could do. It was possible to support Float in JSON parsing, but then it would have been necessary to additionally change the logic in JSON parsing. I decided not to do this because these subsequent parsers should ultimately change, and because it maximally preserves the previous behavior of the library. |
I think the previous behavior of the library was not intended, but rather a limit imposed on the library by Klaxon. It seems odd to keep this broken behavior if both Kotlinx-serialization ánd DataFrame support Floats to convert them to Doubles. So yes, this might require updating some tests and logic, but it's ultimately for the best I believe. |
I agree that such behavior is not entirely correct, as it also carries computational errors when converting float to double. But I will repeat that this behavior will be changed in the future when serializers for dataframe objects are written. Moreover, in the current version, there is a question:
But during deserialization, we look at each element, and the logic is as follows: if the number fits within 32, then it's a float; if it's more than 32, then it's a double. And it turns out that our list can easily contain both Float and Double, and instead of the output type being Double, we get Number. |
# Conflicts: # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/json.kt # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/JupyterHtmlRenderer.kt # core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/json.kt # core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/RenderingTests.kt
@devcrocod Looks like you introduced 4 failing tests in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall a trivial conversion, thanks :) just some small notes
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/io/readJson.kt
Outdated
Show resolved
Hide resolved
"NaN" -> { | ||
nanIndices.add(i) | ||
collector.add(null) | ||
is JsonPrimitive -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this xOrNull
conversion works, it's quite heavy as it works by throwing/catching exceptions. Ilya actually recently replaced similar logic from the notebook kernel because it was so heavy. It was replaced by Jackson. Now I know that's not an option here, but maybe we could try another approach if that's possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be missing something
Could you please explain in more detail or provide an example from the notebook?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let's say we want to parse an array of null
s. In this case, for each element it would check intOrNull
, longOrNull
, doubleOrNull
, and floatOrNull
before trying v.jsonPrimitive is JsonNull
. All of these work with mapExceptionsToNull {}
, meaning, for each element in the array an exception is thrown and caught 4 times. This is very heavy, especially when it happens very often.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, indeed this isn't the good place that should be fixed when supporting Serializable for DataFrame and DataRow. I also think that part of this might be optimized at runtime since our value is already wrapped in JsonPrimitive. There are doubts that jdk8 handles such optimizations well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the null check higher up. Given the current capabilities of kotlinx-serialization, there are two options: either leave it as it is now, or write a custom serializer. However, the logic in the custom serializer would be quite similar. We can always take a String and then manually attempt to convert it to the desired type. This might be slightly better in terms of performance, and we would need to account for all edge cases for different types
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/io/readJson.kt
Outdated
Show resolved
Hide resolved
nanIndices.add(i) | ||
collector.add(null) | ||
is JsonArray -> collector.add(null) | ||
is JsonPrimitive -> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same note as before with xOrNull
checks
// serialization versions and format. | ||
internal const val SERIALIZATION_VERSION = "2.1.0" | ||
private val valueTypes = | ||
setOf(Boolean::class, Double::class, Int::class, Float::class, Long::class, Short::class, Byte::class) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where's Char?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied what was there and moved it to the top. I believe that all constants and configurations should be at the top, so you don't have to search for them throughout the file. Therefore, all the types that were there are now at the top
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, but would Char make sense here? or would that just become a String from json?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think second options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, everything is read as string, kotlinx-serialization doesn't have the char option for JsonPrimitive
core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/io/writeJson.kt
Outdated
Show resolved
Hide resolved
if (arraysAreFrames) encodeFrame(it as AnyFrame) else null | ||
} | ||
?: encodeRow(frame, rowIndex) | ||
valueColumn?.get(rowIndex) ?: arrayColumn?.get(rowIndex)?.let { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ?: chain is very hard to understand. I know you just reformatted it, but maybe you know a better way to write it? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Initially, I also found it difficult to understand what was happening here, so I just rewrote it using objects. I'll see how this can be improved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rewrote it using the when
construct. it seems more readable now. Could you please take a look?
I made some minor corrections and a bit of refactoring. I also fixed issues with the plugin tests. The problem was that we set a |
# Conflicts: # core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/guess.kt
This is strange, I got errors with DataFrameBlackBoxCodegenTestGenerated after today's merge |
It's because #729 got merged. Looks like the test failed because your API change changed the FIR representation in the plugin. This will probably happen a lot in the future but is actually solvable and just a warning really :) Simply:
|
I did as you said, and locally everything worked for me. But as you can see, the tests on TC still failed:
Could these files be platform or environment dependent? In any case, this approach of regenerating files for tests doesn't look as perfect decision |
Try setting your project sdk + gradle's jvm to JDK 11, while we figure out how to make that the default for all contributors :) Edit: actually, I fixed it in #736, I'll update your branch from master and we should be good to go :) |
This is the first step in migrating from klaxon to kotlinx-serialization #312
This PR should be accepted after #574, after which I will resolve the conflicts.