diff --git a/README.md b/README.md
index 33dcffa..ef2114f 100644
--- a/README.md
+++ b/README.md
@@ -13,9 +13,9 @@ English | [中文](README_ZH.md)
 A fast Rust JSON library based on SIMD. It has some references to other open-source libraries like [sonic_cpp](https://github.com/bytedance/sonic-cpp), [serde_json](https://github.com/serde-rs/json), [sonic](https://github.com/bytedance/sonic), [simdjson](https://github.com/simdjson/simdjson), [rust-std](https://github.com/rust-lang/rust/tree/master/library/core/src/num) and more.
-***For Golang users to use `sonic_rs`, please see [for_Golang_user.md](docs/for_Golang_user.md)***
+***For Golang users to use `sonic_rs`, please see [for_Golang_user.md](https://github.com/cloudwego/sonic-rs/blob/main/docs/for_Golang_user.md)***
-***For users to migrate from `serde_json` to `sonic_rs`, can see [serdejson_compatibility](docs/serdejson_compatibility.md)***
+***For users to migrate from `serde_json` to `sonic_rs`, can see [serdejson_compatibility](https://github.com/cloudwego/sonic-rs/blob/main/docs/serdejson_compatibility.md)***
 ## Requirements/Notes
@@ -463,5 +463,10 @@ Thanks the following open-source libraries. sonic-rs has some references to othe
 We rewrote many SIMD algorithms from sonic-cpp/sonic/simdjson/yyjson for performance. We reused the de/ser codes and modified necessary parts from serde_json to make high compatibility with `serde`. We reused part codes about floating parsing from rust-std to make it more accurate.
+Referenced papers:
+1. [Parsing Gigabytes of JSON per Second](https://arxiv.org/abs/1902.08318)
+2. [JSONSki: streaming semi-structured data with bit-parallel fast-forwarding](https://dl.acm.org/doi/10.1145/3503222.3507719)
+
+
 ## Contributing
 Please read [CONTRIBUTING.md](CONTRIBUTING.md) for information on contributing to sonic-rs.
diff --git a/README_ZH.md b/README_ZH.md
index 3aa7669..e8de97a 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -14,9 +14,9 @@
 sonic-rs 是一个基于 SIMD 的高性能 JSON 库。它参考了其他开源库如 [sonic_cpp](https://github.com/bytedance/sonic-cpp),[serde_json](https://github.com/serde-rs/json),[sonic](https://github.com/bytedance/sonic),[simdjson](https://github.com/simdjson/simdjson),[rust-std](https://github.com/rust-lang/rust/tree/master/library/core/src/num) 等。
-***对于 Golang 用户迁移 Rust 使用 `sonic_rs`, 请参考 [for_Golang_user_zh.md](docs/for_Golang_user_zh.md)***
+***对于 Golang 用户迁移 Rust 使用 `sonic_rs`, 请参考 [for_Golang_user_zh.md](https://github.com/cloudwego/sonic-rs/blob/main/docs/for_Golang_user_zh.md)***
-***对于 用户从 `serde_json` 迁移 `sonic_rs`, 请参考 [serdejson_compatibility](docs/serdejson_compatibility.md)***
+***对于 用户从 `serde_json` 迁移 `sonic_rs`, 请参考 [serdejson_compatibility](https://github.com/cloudwego/sonic-rs/blob/main/docs/serdejson_compatibility.md)***
 ## ***要求/注意事项***
@@ -455,6 +455,11 @@ Thanks the following open-source libraries. sonic-rs has some references to othe
 我们为了性能重写了来自 sonic-cpp/sonic/simdjson/yyjson 的许多 SIMD 算法。我们重用了来自 serde_json 的反/序列化代码,并修改了必要的部分以与 serde 高度兼容。我们重用了来自 rust-std 的部分浮点解析代码,使其结构更准确。
+参考论文:
+1. [Parsing Gigabytes of JSON per Second](https://arxiv.org/abs/1902.08318)
+2. [JSONSki: streaming semi-structured data with bit-parallel fast-forwarding](https://dl.acm.org/doi/10.1145/3503222.3507719)
+
+
 ## 如何贡献
 请阅读 [CONTRIBUTING.md](CONTRIBUTING.md)。
diff --git a/ROADMAP.md b/ROADMAP.md
index 95f3dd4..ea3e568 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -17,7 +17,7 @@ This document shows key roadmap of `sonic-rs` development. It may help users kno
 0. ~~make sonic-rs support stable Rust~~
-1. optimize the performance in aarch64 (WIP: 50%)
+1. ~~optimize the performance in aarch64 (WIP: 50%)~~
 2. runtime CPU detection
diff --git a/docs/for_Golang_user.md b/docs/for_Golang_user.md
index 19ebaad..4ac1e31 100644
--- a/docs/for_Golang_user.md
+++ b/docs/for_Golang_user.md
@@ -14,11 +14,7 @@ Corresponding API references:
 - Parsing into Golang `interface{}/any` or sonic-go `ast.Node`:
- It is recommended to replace it with `sonic_rs::Value` for better performance.
-
- ***if the json has duplicated keys, pls use `serde_json::Value`, because `sonic_rs::Value` not maintain a hashmap inner***
- ***even though use `serde_json::Value`, still can be parsed use `sonic_rs::from_str/from_slice`***
+ It is recommended to replace it with `sonic_rs::Value` for better performance.
 - Using `gjson.Get` or `jsonparser.Get` APIs:
diff --git a/docs/for_Golang_user_zh.md b/docs/for_Golang_user_zh.md
index 385a530..de61fa0 100644
--- a/docs/for_Golang_user_zh.md
+++ b/docs/for_Golang_user_zh.md
@@ -16,14 +16,10 @@
 建议使用 `sonic_rs::Value` 替换,性能更优。
- ***如果 JSON 中有重复的key,建议使用 `serde_json::Value`, 因为 `sonic_rs::Value` 中没有建立哈希表***
-
- ***即使使用 `serde_json::Value`, 也可以使用 `sonic_rs::from_str/from_slice` 进行解析,性能相比原生会更好一些***
-
 - 使用 `gjson.Get` 或 `jsonparser.Get` 等API:
 gjson/jsonparser get API 本身未做严格的JSON 校验,因此可以使用 `sonic_rs::get_unchecked` 进行平替。 sonic_rs get API 会返回一个 `Result`. 如果没有找到该字段,会报错。
- `LazyValue` 可以用 `as_bool, as_str`等将 JSON 进一步***解析成对应的类型**。
+ `LazyValue` 可以用 `as_bool, as_str`等将 JSON 进一步**解析成对应的类型**。
 如果只需要拿到原始的raw JSON, ***不做解析***,请使用 `as_raw_str, as_raw_faststr` 等 API. 参考例子: [get_from.rs](../examples/get_from.rs)
diff --git a/docs/performance.md b/docs/performance.md
index 439a64a..ba4de4f 100644
--- a/docs/performance.md
+++ b/docs/performance.md
@@ -4,7 +4,7 @@ This document will introduce some performance optimization details of sonic-rs (
 ## Get fields from JSON/parsing JSON on-demand
-The on-demand parsing algorithm focuses on skipping unnecessary fields, and the challenge lies in skipping JSON containers, including JSON Objects and JSON Arrays. This is because we need to pay attention to the brackets in the JSON string, such as `{ "key": "value {}"}`. We utilize the SIMD instructions to calculate the bitmap of the string, and then by counting the number of brackets, we can skip the entire JSON container.
+The on-demand parsing algorithm focuses on skipping unnecessary fields, and the challenge lies in skipping JSON containers, including JSON Objects and JSON Arrays. This is because we need to pay attention to the brackets in the JSON string, such as `{ "key": "value {}"}`. We utilize the SIMD instructions to calculate the bitmap of the string, and then by counting the number of brackets, we can skip the entire JSON container. Reference the paper [JSONSki](https://dl.acm.org/doi/10.1145/3503222.3507719).
 The overall algorithm is as follows:
@@ -118,7 +118,7 @@ In addition, we also optimize for compact JSON and cases where there's only one
 ## Float number parsing using SIMD
-Parsing floating-point numbers is one of the most time-consuming operations in JSON parsing. For 16-length number strings, we can directly use SIMD instructions for parsing, as it can read ASCII number characters and accumulate them step by step. Refer to ‎[simd_str2int](https://github.com/cloudwego/sonic-rs/blob/main/src/util/arch/x86_64.rs#L115) for the specific algorithm. This algorithm comes from [sonic-cpp](https://github.com/bytedance/sonic-cpp/blob/master/include/sonic/internal/arch/sse/str2int.h).
+Parsing floating-point numbers is one of the most time-consuming operations in JSON parsing. For 16-length number strings, we can directly use SIMD instructions for parsing, as it can read ASCII number characters and accumulate them step by step. Refer to [simd_str2int](https://github.com/cloudwego/sonic-rs/blob/main/src/util/arch/x86_64.rs#L115) for the specific algorithm. This algorithm comes from [sonic-cpp](https://github.com/bytedance/sonic-cpp/blob/master/include/sonic/internal/arch/sse/str2int.h).
 When parsing floating-point numbers, we only need to consider 17 significant digit bits for 64-bit floating-point numbers according to the IEEE754 specification. Thus, in this function, we employ a switch table to decrease unnecessary SIMD instructions.
diff --git a/docs/performance_zh.md b/docs/performance_zh.md
index 0e99194..7df6f8b 100644
--- a/docs/performance_zh.md
+++ b/docs/performance_zh.md
@@ -4,7 +4,7 @@
 ## 按需解析
-如何实现一个性能更好的按需解析算法。按需解析的性能关键在于跳过不需要的字段,其中难点在于如何跳过 JSON container, 包括 JSON Object 和 JSON array,因为我们需要注意 JSON 字符串中的括号,例如 `"{ "key": "value {}"}`。 我们利用了 simd 指令计算字符串的bitmap,然后通过计算括号的数量来跳过整个JSON container。
+如何实现一个性能更好的按需解析算法。按需解析的性能关键在于跳过不需要的字段,其中难点在于如何跳过 JSON container, 包括 JSON Object 和 JSON array,因为我们需要注意 JSON 字符串中的括号,例如 `"{ "key": "value {}"}`。 我们利用了 simd 指令计算字符串的bitmap,然后通过计算括号的数量来跳过整个JSON container。参考论文 [JSONSki](https://dl.acm.org/doi/10.1145/3503222.3507719).
 整体算法如下:
@@ -122,7 +122,7 @@ JSON 规范中的空格字符有: ` `, `\n`, '\r', '\t`. 利用 SIMD 指令跳
 ```
-对于长度为16的数字字符串,是可以直接使用 SIMD 指令进行解析,读取 ascii 数字字符并且逐步累加的。 具体算法可以参考‎[simd_str2int](https://github.com/cloudwego/sonic-rs/blob/main/src/util/arch/x86_64.rs#L115)。这个算法来源于 [sonic-cpp](https://github.com/bytedance/sonic-cpp/blob/master/include/sonic/internal/arch/sse/str2int.h). 在解析浮点数时,按照 IEEE754 规范,对于64 位浮点数,我们只需要关注17位有效数字。因此,在这个函数里面使用了一个 switch table 来减少不必要的 SIMD 指令。
+对于长度为16的数字字符串,是可以直接使用 SIMD 指令进行解析,读取 ascii 数字字符并且逐步累加的。 具体算法可以参考[simd_str2int](https://github.com/cloudwego/sonic-rs/blob/main/src/util/arch/x86_64.rs#L115)。这个算法来源于 [sonic-cpp](https://github.com/bytedance/sonic-cpp/blob/master/include/sonic/internal/arch/sse/str2int.h). 在解析浮点数时,按照 IEEE754 规范,对于64 位浮点数,我们只需要关注17位有效数字。因此,在这个函数里面使用了一个 switch table 来减少不必要的 SIMD 指令。
 ## 使用 SIMD 序列化 JSON string
diff --git a/docs/serdejson_compatibility.md b/docs/serdejson_compatibility.md
index 4034558..d1730fa 100644
--- a/docs/serdejson_compatibility.md
+++ b/docs/serdejson_compatibility.md
@@ -1,6 +1,6 @@
 # A quick guide to migrate from serde_json
-The goal of sonic-rs is performance and easiness (more APIs and ALLINONE) to use. Otherwise, reconmend to use `serde_json`.
+The goal of sonic-rs is performance and easiness (more APIs and ALLINONE) to use. Otherwise, recommended to use `serde_json`.
 Just replace as follows:
diff --git a/docs/value_design.md b/docs/value_design.md
index 9d6871b..e8308f8 100644
--- a/docs/value_design.md
+++ b/docs/value_design.md
@@ -2,4 +2,3 @@
 # A new and user-friendly areana-based document design
-TODO: ^_^ ...
\ No newline at end of file
diff --git a/src/reader.rs b/src/reader.rs
index dc008aa..c2ffb1a 100644
--- a/src/reader.rs
+++ b/src/reader.rs
@@ -245,10 +245,10 @@ impl<'a> Reader<'a> for Read<'a> {
         }
     }
-    fn validate_utf8(&mut self, allowd_space: (usize, usize)) -> Result<()> {
-        if self.next_invalid_utf8 < allowd_space.0 {
+    fn validate_utf8(&mut self, allowed_space: (usize, usize)) -> Result<()> {
+        if self.next_invalid_utf8 < allowed_space.0 {
             Err(invalid_utf8(self.slice, self.next_invalid_utf8))
-        } else if self.next_invalid_utf8 < allowd_space.1 {
+        } else if self.next_invalid_utf8 < allowed_space.1 {
             // this space is allowed, should update the next invalid utf8 position
             self.next_invalid_utf8 = match from_utf8(&self.slice[self.index..]) {
                 Ok(_) => usize::MAX,