Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Klaxon with kotlinx-serialization #603

Merged
merged 14 commits into from
Jun 14, 2024

Conversation

devcrocod
Copy link
Contributor

This is the first step in migrating from klaxon to kotlinx-serialization #312

  • Removed klaxon dependency
  • Added kotlinx-serialization
  • Klaxon objects replaced with JsonElement from serialization
  • All other behavior is preserved. Unlike klaxon, kotlinx-serialization works better with primitives. For example, klaxon does not work with float, only with double, so I had to specially handle the float case. Also, nulls are wrapped in serialization.

This PR should be accepted after #574, after which I will resolve the conflicts.

@Jolanrensen Jolanrensen added the enhancement New feature or request label Feb 26, 2024
@Jolanrensen Jolanrensen added this to the 0.14.0 milestone Feb 26, 2024
build.gradle.kts Outdated Show resolved Hide resolved
@Jolanrensen
Copy link
Collaborator

What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats

@zaleslaw zaleslaw self-requested a review February 26, 2024 12:38
@zaleslaw
Copy link
Collaborator

@devcrocod you wrote that this is a first step, what could be the next step?

@devcrocod
Copy link
Contributor Author

@devcrocod you wrote that this is a first step, what could be the next step?

Check related issue

@devcrocod
Copy link
Contributor Author

What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats

By default, klaxon translates all floating-point numbers into Double. kotlinx-serialization can work with float. So, the question here is not what DataFrame can do, but what klaxon could do. It was possible to support Float in JSON parsing, but then it would have been necessary to additionally change the logic in JSON parsing. I decided not to do this because these subsequent parsers should ultimately change, and because it maximally preserves the previous behavior of the library.

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Feb 27, 2024

What do you mean with "specially handle the float case". Looking at your code I see you skip it and convert it to a Double sometimes. I don't really understand why; DataFrame understands Floats just fine, so you can keep them as Floats

By default, klaxon translates all floating-point numbers into Double. kotlinx-serialization can work with float. So, the question here is not what DataFrame can do, but what klaxon could do. It was possible to support Float in JSON parsing, but then it would have been necessary to additionally change the logic in JSON parsing. I decided not to do this because these subsequent parsers should ultimately change, and because it maximally preserves the previous behavior of the library.

I think the previous behavior of the library was not intended, but rather a limit imposed on the library by Klaxon. It seems odd to keep this broken behavior if both Kotlinx-serialization ánd DataFrame support Floats to convert them to Doubles. So yes, this might require updating some tests and logic, but it's ultimately for the best I believe.

@devcrocod
Copy link
Contributor Author

I agree that such behavior is not entirely correct, as it also carries computational errors when converting float to double. But I will repeat that this behavior will be changed in the future when serializers for dataframe objects are written.

Moreover, in the current version, there is a question:
During serialization, it's obvious how it should look:

1.1f -> 1.1 (float)
1.1 -> 1.100000 (double)

But during deserialization, we look at each element, and the logic is as follows: if the number fits within 32, then it's a float; if it's more than 32, then it's a double. And it turns out that our list can easily contain both Float and Double, and instead of the output type being Double, we get Number.

# Conflicts:
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/json.kt
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/JupyterHtmlRenderer.kt
#	core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/io/json.kt
#	core/src/test/kotlin/org/jetbrains/kotlinx/dataframe/jupyter/RenderingTests.kt
@Jolanrensen
Copy link
Collaborator

@devcrocod Looks like you introduced 4 failing tests in org.jetbrains.dataframe.ksp.DataFrameSymbolProcessorTest

Copy link
Collaborator

@Jolanrensen Jolanrensen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a trivial conversion, thanks :) just some small notes

"NaN" -> {
nanIndices.add(i)
collector.add(null)
is JsonPrimitive -> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this xOrNull conversion works, it's quite heavy as it works by throwing/catching exceptions. Ilya actually recently replaced similar logic from the notebook kernel because it was so heavy. It was replaced by Jackson. Now I know that's not an option here, but maybe we could try another approach if that's possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be missing something
Could you please explain in more detail or provide an example from the notebook?

Copy link
Collaborator

@Jolanrensen Jolanrensen Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's say we want to parse an array of nulls. In this case, for each element it would check intOrNull, longOrNull, doubleOrNull, and floatOrNull before trying v.jsonPrimitive is JsonNull. All of these work with mapExceptionsToNull {}, meaning, for each element in the array an exception is thrown and caught 4 times. This is very heavy, especially when it happens very often.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, indeed this isn't the good place that should be fixed when supporting Serializable for DataFrame and DataRow. I also think that part of this might be optimized at runtime since our value is already wrapped in JsonPrimitive. There are doubts that jdk8 handles such optimizations well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the null check higher up. Given the current capabilities of kotlinx-serialization, there are two options: either leave it as it is now, or write a custom serializer. However, the logic in the custom serializer would be quite similar. We can always take a String and then manually attempt to convert it to the desired type. This might be slightly better in terms of performance, and we would need to account for all edge cases for different types

nanIndices.add(i)
collector.add(null)
is JsonArray -> collector.add(null)
is JsonPrimitive -> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same note as before with xOrNull checks

// serialization versions and format.
internal const val SERIALIZATION_VERSION = "2.1.0"
private val valueTypes =
setOf(Boolean::class, Double::class, Int::class, Float::class, Long::class, Short::class, Byte::class)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where's Char?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied what was there and moved it to the top. I believe that all constants and configurations should be at the top, so you don't have to search for them throughout the file. Therefore, all the types that were there are now at the top

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, but would Char make sense here? or would that just become a String from json?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think second options

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, everything is read as string, kotlinx-serialization doesn't have the char option for JsonPrimitive

if (arraysAreFrames) encodeFrame(it as AnyFrame) else null
}
?: encodeRow(frame, rowIndex)
valueColumn?.get(rowIndex) ?: arrayColumn?.get(rowIndex)?.let {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ?: chain is very hard to understand. I know you just reformatted it, but maybe you know a better way to write it? :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, I also found it difficult to understand what was happening here, so I just rewrote it using objects. I'll see how this can be improved

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rewrote it using the when construct. it seems more readable now. Could you please take a look?

@devcrocod
Copy link
Contributor Author

I made some minor corrections and a bit of refactoring. I also fixed issues with the plugin tests. The problem was that we set a rateLimit (10000) on the incoming input stream. When reading, we check if the selected format is appropriate; if not, we reset the input stream. The issue was that the JSON format was being predicted among the first, and kotlinx-serialization reads all bytes from the incoming stream, ignoring our set limit. Consequently, when we perform a reset, it always results in an error because more than the limit was read.

# Conflicts:
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/guess.kt
@devcrocod
Copy link
Contributor Author

This is strange, I got errors with DataFrameBlackBoxCodegenTestGenerated after today's merge

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Jun 13, 2024

This is strange, I got errors with DataFrameBlackBoxCodegenTestGenerated after today's merge

It's because #729 got merged. Looks like the test failed because your API change changed the FIR representation in the plugin. This will probably happen a lot in the future but is actually solvable and just a warning really :)

Simply:

@devcrocod
Copy link
Contributor Author

I did as you said, and locally everything worked for me. But as you can see, the tests on TC still failed:

org.opentest4j.AssertionFailedError: Actual data differs from file content: toDataFrame_dsl.fir.ir.txt

Could these files be platform or environment dependent? In any case, this approach of regenerating files for tests doesn't look as perfect decision

@Jolanrensen
Copy link
Collaborator

Jolanrensen commented Jun 14, 2024

I did as you said, and locally everything worked for me. But as you can see, the tests on TC still failed:

org.opentest4j.AssertionFailedError: Actual data differs from file content: toDataFrame_dsl.fir.ir.txt

Could these files be platform or environment dependent? In any case, this approach of regenerating files for tests doesn't look as perfect decision

Try setting your project sdk + gradle's jvm to JDK 11, while we figure out how to make that the default for all contributors :)

Edit: actually, I fixed it in #736, I'll update your branch from master and we should be good to go :)

@Jolanrensen Jolanrensen merged commit 76091c7 into master Jun 14, 2024
3 of 4 checks passed
@devcrocod devcrocod deleted the devcrocod/serialization-patch0 branch June 18, 2024 09:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants