arrow_json: support binary deserialization #4945

Closed
Folyd opened this issue Oct 17, 2023 · 4 comments · Fixed by #4975
Labels
arrow (Changes to the arrow crate), enhancement (Any new improvement worthy of an entry in the changelog)

Comments

@Folyd
Contributor

Folyd commented Oct 17, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have a bunch of Parquet files that contain a binary column. Here is the schema:

CREATE EXTERNAL TABLE `buildings`(
  `id` string,
  `updateTime` string,
  `version` int,
  `names` string,
  `level` int,
  `height` double,
  `geometry` binary
)

The geometry column is binary. Here is my struct:

use serde::Deserialize;

#[derive(Deserialize)]
struct Building {
    // other fields omitted for brevity ...

    #[serde(default)]
    geometry: Vec<u8>,
}

However, arrow::json::writer::record_batches_to_json_rows() fails on the Parquet data with this error:

Error: Json error: data type Binary not supported in nested map for json writer

Describe the solution you'd like

arrow_json should support deserializing binary columns to Vec<u8>.

@Folyd added the enhancement label Oct 17, 2023
@tustvold
Contributor

So if I understand correctly, you are reading binary data from a Parquet file and trying to write it to JSON?

Unfortunately, JSON does not define a mechanism to transport binary data. You have two common options:

  • If the binary data is actually UTF-8 you could cast the column to a StringArray
  • You could use a crate like base64 to base64 encode the binary data prior to writing it
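
For illustration, the two options might look roughly like this in plain Rust. This is a std-only sketch, not arrow's API: `as_utf8` and `base64_encode` are hypothetical helper names, and in real code you would more likely reach for arrow's cast support and the base64 crate rather than hand-rolling the encoder.

```rust
/// Standard base64 alphabet (RFC 4648), used with `=` padding.
const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Option 1 (hypothetical helper): treat the bytes as text, but only if
/// they are valid UTF-8. Returns None for arbitrary binary such as WKB.
fn as_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Option 2 (hypothetical helper): base64-encode the bytes so they can be
/// stored in a JSON string field.
fn base64_encode(bytes: &[u8]) -> String {
    let mut out = String::with_capacity((bytes.len() + 2) / 3 * 4);
    for chunk in bytes.chunks(3) {
        // Pack up to three bytes into the low 24 bits of a u32,
        // using zero for any missing trailing byte.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;

        // Emit four 6-bit groups, replacing unused groups with padding.
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        out.push(if chunk.len() > 1 {
            ALPHABET[(n >> 6) as usize & 63] as char
        } else {
            '='
        });
        out.push(if chunk.len() > 2 {
            ALPHABET[n as usize & 63] as char
        } else {
            '='
        });
    }
    out
}
```

With these helpers, `base64_encode(b"Man")` yields `"TWFu"`, and `as_utf8(&[0xff, 0xfe])` yields `None`, which is the case where only the base64 route applies. The consumer of the JSON then has to know the field is base64 and decode it back to bytes.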

@Folyd
Contributor Author

Folyd commented Oct 18, 2023

In my example, the geometry is actually WKB (well-known binary). Does Arrow support WKB?

Here is an example of a WKB value; it can be decoded to GeoJSON:
[image: WKB value decoded to GeoJSON]

@tustvold
Contributor

Arrow supports WKB, but JSON would only support WKT
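
As a rough illustration of the WKB-to-WKT step, here is a std-only sketch that decodes a single 2D WKB Point into WKT text. `wkb_point_to_wkt` is a hypothetical helper, not part of arrow; it assumes the plain WKB layout (1-byte byte-order flag, 4-byte geometry type, then x and y as 8-byte floats) and ignores every other geometry type, so a real pipeline would want a proper geometry library.

```rust
/// Decode a 2D WKB Point into WKT (hypothetical helper).
/// Returns None for other geometry types or malformed input.
fn wkb_point_to_wkt(wkb: &[u8]) -> Option<String> {
    // 1 (byte order) + 4 (geometry type) + 8 (x) + 8 (y) = 21 bytes.
    if wkb.len() != 21 {
        return None;
    }

    // Byte-order flag: 0 = big endian, 1 = little endian.
    let little_endian = match wkb[0] {
        0 => false,
        1 => true,
        _ => return None,
    };

    // Geometry type code: 1 means Point.
    let mut ty = [0u8; 4];
    ty.copy_from_slice(&wkb[1..5]);
    let geometry_type = if little_endian {
        u32::from_le_bytes(ty)
    } else {
        u32::from_be_bytes(ty)
    };
    if geometry_type != 1 {
        return None;
    }

    // Read one f64 coordinate starting at the given byte offset.
    let read_f64 = |offset: usize| -> f64 {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(&wkb[offset..offset + 8]);
        if little_endian {
            f64::from_le_bytes(buf)
        } else {
            f64::from_be_bytes(buf)
        }
    };

    Some(format!("POINT ({} {})", read_f64(5), read_f64(13)))
}
```

The resulting string can be written to JSON like any other text column, whereas the raw WKB bytes cannot.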

@tustvold
Contributor

tustvold commented Nov 2, 2023

label_issue.py automatically added labels {'arrow'} from #4967
