writeArrowFeather not working with nested type ? #271

Open
phodal opened this issue Feb 16, 2023 · 7 comments
@phodal

phodal commented Feb 16, 2023

Hi, in my case, I want to create an Arrow file on the client side and then pass it to the server side. But when I try to run writeArrowFeather, it throws an IndexOutOfBoundsException.

Exception in thread "main" java.lang.IndexOutOfBoundsException: index: 31393, length: 2320 (expected: range(0, 32768))
	at org.apache.arrow.memory.ArrowBuf.checkIndex(ArrowBuf.java:701)
	at org.apache.arrow.memory.ArrowBuf.setBytes(ArrowBuf.java:765)
	at org.apache.arrow.vector.BaseVariableWidthVector.setBytes(BaseVariableWidthVector.java:1244)
	at org.apache.arrow.vector.BaseVariableWidthVector.set(BaseVariableWidthVector.java:1059)
	at org.apache.arrow.vector.VarCharVector.set(VarCharVector.java:255)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl$infillVector$1.invoke(ArrowWriterImpl.kt:111)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl$infillVector$1.invoke(ArrowWriterImpl.kt:111)
	at org.jetbrains.kotlinx.dataframe.api.ForEachKt.forEachIndexed(forEach.kt:34)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.infillVector(ArrowWriterImpl.kt:111)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.allocateVectorAndInfill(ArrowWriterImpl.kt:197)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.allocateVectorSchemaRoot(ArrowWriterImpl.kt:223)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriter$DefaultImpls.writeArrowFeather(ArrowWriter.kt:114)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.writeArrowFeather(ArrowWriterImpl.kt:61)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriter$DefaultImpls.writeArrowFeather(ArrowWriter.kt:125)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.writeArrowFeather(ArrowWriterImpl.kt:61)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriter$DefaultImpls.writeArrowFeather(ArrowWriter.kt:133)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.writeArrowFeather(ArrowWriterImpl.kt:61)
	at org.jetbrains.kotlinx.dataframe.io.ArrowWritingKt.writeArrowFeather(arrowWriting.kt:89)
	at com.phodal.chapi.arrow.MainKt.main(Main.kt:26)
	Suppressed: java.lang.IllegalStateException: Memory was leaked by query. Memory leaked: (33024)
Allocator(ROOT) 0/33024/264192/9223372036854775807 (res/actual/peak/limit)

		at org.apache.arrow.memory.BaseAllocator.close(BaseAllocator.java:437)
		at org.apache.arrow.memory.RootAllocator.close(RootAllocator.java:29)
		at org.jetbrains.kotlinx.dataframe.io.ArrowWriterImpl.close(ArrowWriterImpl.kt:247)
		at kotlin.jdk7.AutoCloseableKt.closeFinally(AutoCloseable.kt:64)
		at org.jetbrains.kotlinx.dataframe.io.ArrowWritingKt.writeArrowFeather(arrowWriting.kt:88)
		... 1 more

FAILURE: Build failed with an exception.

Here is my demo code for the writer, with some debug output:

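import java.io.File
import org.jetbrains.kotlinx.dataframe.DataFrame
import org.jetbrains.kotlinx.dataframe.api.*
import org.jetbrains.kotlinx.dataframe.io.*

val dataFrame = DataFrame.read("https://raw.githubusercontent.com/phodal-archive/apache-arrow-chapi-demo/master/data/0_codes.json")
dataFrame.schema().print()

val toArrowSchema = dataFrame.columns().toArrowSchema()
println(toArrowSchema.toJson())

// This call throws the IndexOutOfBoundsException shown above.
dataFrame.writeArrowFeather(File("codes.arrow"))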

When I debug, dataFrame.schema().print() prints the correct schema:

NodeName: String
Module: String
Type: String
Package: String?
FilePath: String
Fields: *
    TypeType: String
    TypeKey: String
    Modifiers: List<String>
    TypeValue: String?
    Annotations: *
        Name: String
        KeyValues: *
            Key: String
            Value: String


Implements: List<String>
Functions: *
    Name: String
    Package: String?
    ReturnType: String
    Parameters: *
        TypeValue: String
        TypeType: String
    FunctionCalls: *
        Package: String?
        NodeName: String?
        FunctionName: String
        Position:
            StartLine: Int
            StartLinePosition: Int
            StopLine: Int
            StopLinePosition: Int
        Parameters: *
            TypeValue: String
            TypeType: String
        Type: String?
    Position:
        StartLine: Int
        StartLinePosition: Int?
        StopLine: Int
        StopLinePosition: Int?
    LocalVariables: *
        TypeValue: String
        TypeType: String
    IsConstructor: Boolean?
    Annotations: *
        Name: String
        KeyValues: *
            Key: String
            Value: String


Imports: *
    Source: String
    AsName: String
Position:
    StartLine: Int?
    StopLine: Int?
    StartLinePosition: Int?
    StopLinePosition: Int?
Annotations: *
    Name: String

But dataFrame.columns().toArrowSchema() reports the wrong types: nested columns such as Fields, Functions and Position come out as nullable utf8 with no children:

{
  "fields" : [ {
    "name" : "NodeName",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Module",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Type",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Package",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "FilePath",
    "nullable" : false,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Fields",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Implements",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Functions",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Imports",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Position",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  }, {
    "name" : "Annotations",
    "nullable" : true,
    "type" : {
      "name" : "utf8"
    },
    "children" : [ ]
  } ]
}

Am I missing something?

@koperagen
Collaborator

No, you're right - nested types are not yet supported. :( Interesting data you've got there.

@phodal
Author

phodal commented Feb 16, 2023

Thanks for sharing. Is there any plan to support it? Or should I just try modifying AnyCol.toArrowField to implement it?

@koperagen
Collaborator

Honestly, I overlooked that our Arrow support is missing nested types, so this improvement isn't planned. Right now the team is occupied with improvements to the documentation and the notebook experience. I don't think anybody is going to work on Arrow in the coming weeks.
You can submit a PR if you want, but apart from toArrowField, the actual writing code will also need changes, in infillVector: https://github.com/Kotlin/dataframe/blob/master/dataframe-arrow/src/main/kotlin/org/jetbrains/kotlinx/dataframe/io/ArrowWriterImpl.kt

@phodal
Author

phodal commented Feb 16, 2023

Thank you, I will try to find a solution.

@Kopilov
Contributor

Kopilov commented Apr 11, 2023

IndexOutOfBoundsException: index: 31393, length: 2320 (expected: range(0, 32768)) is an unexpected error; I am working on it (I just hit the same one in my project). It happens because the VariableWidthVector (where a String column is stored) does not know its actual size.
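
To illustrate the sizing problem described above, here is a minimal sketch in plain Kotlin. It does not use Arrow: java.nio.ByteBuffer stands in for the vector's data buffer, and fillDataBuffer and requiredBytes are hypothetical helpers, not part of Kotlin DataFrame or Arrow. It only shows that a fixed-capacity buffer overflows once the strings total more than its capacity, while a capacity computed from the data does not.

```kotlin
import java.nio.BufferOverflowException
import java.nio.ByteBuffer

// Stand-in for a variable-width column: one contiguous buffer holding the bytes of all values.
// Returns the number of bytes written; throws BufferOverflowException if the capacity is too small.
fun fillDataBuffer(values: List<String>, capacityBytes: Int): Int {
    val buffer = ByteBuffer.allocate(capacityBytes)
    for (value in values) {
        buffer.put(value.toByteArray(Charsets.UTF_8))
    }
    return buffer.position()
}

// Capacity derived from the data itself: the sum of the UTF-8 byte lengths.
fun requiredBytes(values: List<String>): Int =
    values.sumOf { it.toByteArray(Charsets.UTF_8).size }

fun main() {
    // 20 strings of 2,320 bytes each: more than the 32,768 bytes named in the exception.
    val values = List(20) { "x".repeat(2_320) }

    val overflowed = try {
        fillDataBuffer(values, 32_768)
        false
    } catch (e: BufferOverflowException) {
        true
    }
    println("fixed capacity overflowed: $overflowed")

    val written = fillDataBuffer(values, requiredBytes(values))
    println("bytes written with computed capacity: $written")
}
```

How the actual fix sizes the Arrow vector is not described in this thread; the sketch only mirrors the failure mode.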

About nested types, @phodal, do you know of any other Java-based projects with Arrow support that could serve as an example? And what is your target Arrow schema (does it contain StructVector, ListVector or anything else)?
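
As one possible answer to the target-schema question, the sketch below shows what a nested schema could look like in the same JSON shape that toArrowSchema().toJson() printed earlier, with children filled in. Everything here is an assumption for illustration: ColumnType, Leaf, Group, Repeated and fieldJson are hypothetical names, the "item" child name for lists is a choice made here, and the nullable flag is left out for brevity. It is not how Kotlin DataFrame or Arrow builds schemas.

```kotlin
// Hypothetical model of a column type; none of these names come from Kotlin DataFrame or Arrow.
sealed interface ColumnType
data class Leaf(val arrowName: String) : ColumnType                // a flat value, e.g. "utf8"
data class Group(val fields: Map<String, ColumnType>) : ColumnType // nested object -> struct
data class Repeated(val element: ColumnType) : ColumnType          // repeated value -> list

// Renders one field recursively, populating "children" for nested types.
fun fieldJson(name: String, type: ColumnType): String = when (type) {
    is Leaf ->
        """{"name":"$name","type":{"name":"${type.arrowName}"},"children":[]}"""
    is Group -> {
        val children = type.fields.entries.joinToString(",") { e -> fieldJson(e.key, e.value) }
        """{"name":"$name","type":{"name":"struct"},"children":[$children]}"""
    }
    is Repeated ->
        """{"name":"$name","type":{"name":"list"},"children":[${fieldJson("item", type.element)}]}"""
}

fun main() {
    // The Imports column from the schema above: a list of {Source, AsName} records.
    val imports = Repeated(Group(mapOf("Source" to Leaf("utf8"), "AsName" to Leaf("utf8"))))
    println(fieldJson("Imports", imports))
}
```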

@phodal
Author

phodal commented Apr 11, 2023

@Kopilov Sorry, I tried to do it, but it needs a lot of code. So I don't use dataframe with Arrow; I just keep using JSON.

@Kopilov
Contributor

Kopilov commented Apr 12, 2023

The exception is fixed in #350.
Nested types are still not supported natively; they should be saved correctly as strings.
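
For readers wondering what "saved as strings" can mean for nested data, here is a minimal sketch in plain Kotlin. The thread does not state which string format the library writes, so the JSON-like rendering is an assumption, and renderNested is a hypothetical helper, not a Kotlin DataFrame function.

```kotlin
// Hypothetical helper: renders nested Kotlin values (Map, List, scalars) as one JSON-like string,
// so that a nested cell can be stored in a plain utf8 column.
fun renderNested(value: Any?): String = when (value) {
    null -> "null"
    is String -> "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
    is Map<*, *> -> value.entries.joinToString(separator = ",", prefix = "{", postfix = "}") { e ->
        renderNested(e.key.toString()) + ":" + renderNested(e.value)
    }
    is Iterable<*> -> value.joinToString(separator = ",", prefix = "[", postfix = "]") { renderNested(it) }
    else -> value.toString()
}

fun main() {
    // Shaped like the Position and Imports columns from the schema above; the values are made up.
    val position = mapOf("StartLine" to 1, "StopLine" to 9)
    val imports = listOf(mapOf("Source" to "java.io.File", "AsName" to ""))
    println(renderNested(position))
    println(renderNested(imports))
}
```

A consumer on the server side would then have to parse these strings back into structures, which is the trade-off compared with native struct and list columns.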

@zaleslaw zaleslaw added the enhancement New feature or request label Apr 25, 2023
@zaleslaw zaleslaw added this to the Backlog milestone Apr 25, 2023