KTNB-693 Send the full dataframe schema as metadata #706

cmelchior · 2024-05-27T11:07:57Z

This part adds the infrastructure needed for https://youtrack.jetbrains.com/issue/KTNB-693/Enable-AI-Actions-for-DataFrames-in-Kotlin-Notebooks. We currently have no reliable way to detect column types, which is needed when creating prompts for the AI Assistant.

It adds a new "types" property to the top-level "metadata" as well as recursively on each row so it is possible to easily identify column types.

A "columns" property has also been added to ColumnGroup and FrameColumn metadata. It contains the nested column names, similar to the top-level "columns" property.

Example:

val col1 by columnOf("a", "b", "c")
val col2 by columnOf(1, 2, 3)
val col3 by columnOf("Foo", "Bar", null)
val df2 = dataFrameOf(Pair("header", listOf("A", "B", "C")))
val col4 by columnOf(df2, df2, df2)
var df = dataFrameOf(col1, col2, col3, col4)
df.group(col1, col2).into("group")

{
   ...
             {
              "$version": "2.1.0",
              "metadata": {
                "columns": ["group", "col3", "col4"],
                "types": [{
                  "kind": "ColumnGroup"
                }, {
                  "kind": "ValueColumn",
                  "type": "kotlin.String?"
                }, {
                  "kind": "FrameColumn"
                }],
                "nrow": 3,
                "ncol": 3
              },
              "kotlin_dataframe": [{
                "group": {
                  "data": {
                    "col1": "a",
                    "col2": 1
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": "Foo",
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }, {
                "group": {
                  "data": {
                    "col1": "b",
                    "col2": 2
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": "Bar",
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }, {
                "group": {
                  "data": {
                    "col1": "c",
                    "col2": 3
                  },
                  "metadata": {
                    "kind": "ColumnGroup",
                    "columns": ["col1", "col2"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }, {
                      "kind": "ValueColumn",
                      "type": "kotlin.Int"
                    }]
                  }
                },
                "col3": null,
                "col4": {
                  "data": [{
                    "header": "A"
                  }, {
                    "header": "B"
                  }, {
                    "header": "C"
                  }],
                  "metadata": {
                    "kind": "FrameColumn",
                    "columns": ["header"],
                    "types": [{
                      "kind": "ValueColumn",
                      "type": "kotlin.String"
                    }],
                    "ncol": 1,
                    "nrow": 3
                  }
                }
              }]
            }
}
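
To illustrate how a consumer could use the new metadata, here is a minimal sketch that pairs each entry in "columns" with the entry at the same position in "types". The classes `ColumnType` and `FrameMetadata` and the function `describeColumns` are hypothetical stand-ins invented for this example, not part of the library, and the positional alignment of the two lists is an assumption read off the example above.

```kotlin
// Hypothetical model of the "metadata" object shown above; not the library's own classes.
data class ColumnType(val kind: String, val type: String? = null)

data class FrameMetadata(val columns: List<String>, val types: List<ColumnType>)

// Pair each column name with its type entry by position (assumed alignment).
fun describeColumns(metadata: FrameMetadata): List<String> {
    require(metadata.columns.size == metadata.types.size) {
        "columns and types must have the same length"
    }
    return metadata.columns.zip(metadata.types) { name, t ->
        // ValueColumn entries carry a fully qualified type; the other kinds only carry "kind".
        if (t.type != null) "$name: ${t.type}" else "$name: ${t.kind}"
    }
}

fun main() {
    // Values copied from the top-level "metadata" in the example above.
    val topLevel = FrameMetadata(
        columns = listOf("group", "col3", "col4"),
        types = listOf(
            ColumnType("ColumnGroup"),
            ColumnType("ValueColumn", "kotlin.String?"),
            ColumnType("FrameColumn"),
        ),
    )
    describeColumns(topLevel).forEach(::println)
}
```

The same function could be applied to the nested "metadata" of a ColumnGroup or FrameColumn value, since those carry their own "columns" and "types" lists.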

koperagen · 2024-05-27T12:53:20Z

core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/io/writeJson.kt

+        schemaData["name"] = name
+        schemaData["kind"] = columnSchema.kind.toString()
+        when (columnSchema) {
+            is ColumnSchema.Value -> schemaData["type"] = columnSchema.type.toString()


We also have a function that turns a KType into a String. It is used in HTML rendering and in DataFrameSchema.toString:

dataframe/core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/Rendering.kt

Line 51 in 6be5e1a

internal fun renderType(type: KType?): String {

What do you think? Would it be suitable for AI actions?

Looking into that method, it seems to hide some of the type information in some cases, so I do not think it is suitable for AI Actions, at least if we want the context information to be as specific as possible. Also, if others want to use the type information, we probably need the fully qualified type as well.

Looking into this method, it seems to remove type information in some cases, which might be a problem when we want to use the metadata to improve the context of AI actions. So for now I would prefer to keep the fully qualified names that also include nullability, e.g. kotlin.String?

ermolenkodev · 2024-05-27T15:01:56Z

There is a problem when a FrameColumn contains frames with different schemas. I recommend attaching types to the metadata of each nested frame. This may lead to duplication if the schema of each nested frame is the same, but it will make it easier to work with on the Kotlin Notebook plugin side. We already have a lot of duplication because we pass column names for each value in rows, so this additional overhead will be minimal.
Here is a short reproducer of the problem:
dataFrameOf("a", "b")(1, dataFrameOf("c", "d")(1, 2), 2, dataFrameOf("e", "f")(1, 2))
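
A minimal sketch of why one column-level schema is not enough for this reproducer. `NestedMeta` and `commonSchema` are hypothetical names invented for illustration, not library API. The two values mirror the nested frames held in column "b".

```kotlin
// Hypothetical stand-in for the per-frame metadata (column names plus type names).
data class NestedMeta(val columns: List<String>, val types: List<String>)

// Returns the one schema shared by all nested frames, or null when they differ
// (or when the list is empty).
fun commonSchema(frames: List<NestedMeta>): NestedMeta? =
    frames.distinct().singleOrNull()

fun main() {
    // Column "b" of the reproducer holds two frames with different column names.
    val row0 = NestedMeta(listOf("c", "d"), listOf("kotlin.Int", "kotlin.Int"))
    val row1 = NestedMeta(listOf("e", "f"), listOf("kotlin.Int", "kotlin.Int"))

    // No single schema describes both rows, so the metadata has to live on each nested frame.
    println(commonSchema(listOf(row0, row1)))

    // When every nested frame has the same schema, per-frame metadata is merely duplicated.
    println(commonSchema(listOf(row0, row0)))
}
```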

cmelchior · 2024-05-28T11:09:06Z

@ermolenkodev I see your point. I had not considered that each row of a FrameColumn could hold a data frame with a different schema. So you are right, it is probably better to have the schema as part of the metadata inside the data frame content.

I'll refactor it.

cmelchior · 2024-05-29T14:10:14Z

After some discussion with @ermolenkodev we decided to rework the metadata a little. I have updated the PR and its description, so it should be ready for a second round of review.

KTNB-693 Send the full dataframe schema as metadata, so it can be used by AI actions.

6a3b790

cmelchior requested review from Jolanrensen and ermolenkodev May 27, 2024 11:07

Jolanrensen reviewed May 27, 2024

Jolanrensen approved these changes May 27, 2024

koperagen reviewed May 27, 2024

ermolenkodev requested changes May 28, 2024

cmelchior added 2 commits May 28, 2024 17:34

Move schema metadata into kotlin_dataframe elements

76969bd

Type information is now an object kind and type. It is added on the top-level frame as well as on each row. Updated serialization_format.md

2679f48

cmelchior requested a review from ermolenkodev May 29, 2024 14:09

ermolenkodev approved these changes May 29, 2024

cmelchior merged commit 75d8e78 into master May 30, 2024
4 checks passed
