
Commit

README draft + associated changes
tempname11 committed Sep 15, 2023
1 parent c7a1fb7 commit d29ecf7
Showing 28 changed files with 677 additions and 170 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -1,4 +1,4 @@
-name: Main
+name: Build
 
 on:
   workflow_dispatch:
96 changes: 96 additions & 0 deletions README.md
@@ -0,0 +1,96 @@
# Simple, Versatile Format

[![Build Status](https://github.com/tempname11/svf/workflows/main/badge.svg)](https://github.com/tempname11/svf/actions?workflow=main)
![Alpha](https://img.shields.io/badge/alpha-blue)

**SVF** is a format for structured data that is focused on being both machine-friendly and human-friendly — in that order! Currently, it's in **alpha**, meaning that while most of the core functionality is working, there is still a lot of work to be done (see [Roadmap](#roadmap)).

More precisely, this project currently includes:
- A small text language to describe data schemas.
- A CLI tool to work with the schemas.
- C and C++ libraries to actually work with data at runtime.

### Machine-friendly

One of the design goals for the format is simplicity. "Simplicity" includes "how many things the machine has to do before it can access the data". The vast majority of modern CPUs are 64-bit, little-endian (x86-64 and ARM), and it makes sense to optimize for them. The format is binary, and barring [versioning hurdles](#data-evolution), reading it essentially involves repeatedly viewing regions of memory as known structures (i.e. C/C++ structs, with caveats). Writing it is similarly straightforward, in typical scenarios involving just copying structures to buffers, plus some lightweight book-keeping.

If the data you are reading is "binary-compatible" with the schema you are expecting, then the overhead should be small. If not, a pre-processing step is involved. However, these cases should be either rare or entirely avoidable with a little bit of planning.
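
To make "viewing memory as a known structure" concrete, here is a minimal, self-contained C sketch. The `Point` struct and the buffer are made up for this illustration; the actual runtime libraries add the bookkeeping and safety checks described elsewhere in this README.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fixed-layout structure, similar to what a schema compiler
 * might generate. This is an illustration, not actual svfc output. */
typedef struct Point {
  uint32_t x;
  uint32_t y;
} Point;

int main(void) {
  /* Pretend these bytes arrived from a file or the network. */
  uint8_t buffer[sizeof(Point)];
  Point original = { 3, 4 };
  memcpy(buffer, &original, sizeof(original));

  /* "Reading" is little more than interpreting the bytes as the struct.
   * Copying into a local sidesteps alignment concerns; on the common
   * 64-bit, little-endian targets the bytes already have the right layout. */
  Point view;
  memcpy(&view, buffer, sizeof(view));
  printf("x=%u y=%u\n", view.x, view.y);
  return 0;
}
```

The point is that no parsing or decoding pass stands between the raw bytes and the structured view.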

### Human-friendly

The recent proliferation of text-based data formats like JSON seemingly tells us that they are superior, at least in terms of being more suited for viewing and manipulation by people. That may be true (it's not easy to read binary blobs, let alone modify them by hand!), but a large part of the problem is simply solved by good tooling, and the remaining drawbacks are outweighed by the benefits. This is an assumption that needs proving, of course.

A key part of this approach is a good GUI viewer/editor for the format. See [Roadmap](#roadmap).

### Data Schemas

The format is structured, meaning that all data has an associated schema. Here is a silly example of a schema:
```rs
#name Hello

World: struct {
  population: U64; // 64-bit unsigned integer.
  gravitational_constant: F32; // 32-bit floating-point.
  current_year: I16; // 16-bit integer (signed).
  mechanics: Mechanics;
  name: String;
};

Mechanics: choice {
  classical;
  quantum;
  custom: String;
};

String: struct {
  utf8: U8[];
};
```

Note that the concept of a "string" is not built-in, but can easily be expressed in a schema. In this instance, UTF-8 is the chosen encoding, expressed how you would expect — as an array of 8-bit characters.
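
To give a rough idea of the schema-to-code mapping, here is a sketch of the kind of C declarations a code generator could emit for the schema above. All names and representation details (the sequence reference, the tagged union for the choice) are hypothetical assumptions, not the actual `svfc` output — see the real generated-code examples linked under [Usage](#usage).

```c
#include <stdint.h>

/* Assumption: an array field like U8[] is stored as a reference into the
 * data (offset plus element count), not as a raw pointer. */
typedef struct Hello_Sequence {
  uint32_t data_offset;
  uint32_t count;
} Hello_Sequence;

typedef struct Hello_String {
  Hello_Sequence utf8; /* U8[] */
} Hello_String;

/* A choice is conceptually a tagged union: a tag plus a payload for the
 * active option ("custom" is the only option carrying data here). */
typedef enum Hello_Mechanics_tag {
  Hello_Mechanics_tag_classical,
  Hello_Mechanics_tag_quantum,
  Hello_Mechanics_tag_custom
} Hello_Mechanics_tag;

typedef struct Hello_Mechanics {
  Hello_Mechanics_tag tag;
  union {
    Hello_String custom;
  } payload;
} Hello_Mechanics;

typedef struct Hello_World {
  uint64_t population;          /* U64 */
  float gravitational_constant; /* F32 */
  int16_t current_year;         /* I16 */
  Hello_Mechanics mechanics;
  Hello_String name;
} Hello_World;
```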

### Data Evolution

The above schema may resemble `struct` declarations in C/C++. In fact, the CLI code generator will output structures very much like what you would expect. So what's the benefit over writing these C/C++ declarations by hand? Well, programs change over time, and their data requirements change with them. There are two typical scenarios:

- Data is written into a file by a program. Later, a newer version of that program reads it back. This requires **backward compatibility** if the schema was modified.
- Data is sent over the network, say, from server to client. The server is updated. The client is now older than the server, but still needs to process the message. This requires **forward compatibility** if the server schema was modified.

Both of these scenarios are covered by:
- Checking schema compatibility before reading the message.
- Potentially converting the data to the new schema. This is only necessary if the schemas are not "binary-compatible" (see the sketch below).
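
Here is a rough sketch of that read-side decision in C. All of the names below (`Bytes`, `Compatibility`, `check_compatibility`, `convert_to_schema`) are invented for illustration; the actual runtime entry point (`SVFRT_check_compatibility` in `svf_compatibility.c`, which appears in the diff below) takes more parameters and differs in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types and helpers, standing in for the real runtime API. */
typedef struct Bytes { uint8_t *pointer; uint64_t count; } Bytes;

typedef enum Compatibility {
  COMPAT_NONE,    /* Schemas don't match: refuse to read.           */
  COMPAT_LOGICAL, /* Schemas match, but layouts differ: convert.    */
  COMPAT_BINARY   /* Layouts are identical: view the data in place. */
} Compatibility;

/* Stubs for illustration; the real check walks both schemas. */
static Compatibility check_compatibility(Bytes schema_written, Bytes schema_expected) {
  (void) schema_written; (void) schema_expected;
  return COMPAT_BINARY;
}

static Bytes convert_to_schema(Bytes data, Bytes schema_written, Bytes schema_expected) {
  (void) schema_written; (void) schema_expected;
  return data;
}

/* Decide how to read a message, given the schema it was written with and
 * the schema the current program expects. */
static Bytes prepare_message(Bytes data, Bytes schema_written, Bytes schema_expected) {
  Compatibility level = check_compatibility(schema_written, schema_expected);
  if (level == COMPAT_BINARY) {
    return data; /* Zero-copy path. */
  }
  if (level == COMPAT_LOGICAL) {
    return convert_to_schema(data, schema_written, schema_expected); /* Pre-processing path. */
  }
  return (Bytes) { NULL, 0 }; /* Incompatible. */
}

int main(void) {
  uint8_t raw[16] = { 0 };
  Bytes data = { raw, sizeof(raw) };
  Bytes schema = { NULL, 0 }; /* Placeholder schemas for this toy setup. */
  Bytes prepared = prepare_message(data, schema, schema);
  return prepared.pointer == NULL; /* 0 means "readable" here. */
}
```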

### Usage

See:
- ["Hello" example schema](svf_tools/schema/Hello.txt), the same as above.
- ["Hello" written to file in C](svf_tools/src/test/hello_write.c)
- ["Hello" read from bytes in C++](svf_tools/src/test/hello_read.cpp)
- ["JSON" example schema](svf_tools/schema/JSON.txt), as a small exercise to encode JSON.
- ["JSON" written to file in C++](svf_tools/src/test/json_write.cpp)
- ["JSON" read from bytes in C](svf_tools/src/test/json_read.c)

### Roadmap

There are a number of things left to do before **beta** status can be assigned, which would signify that the project is tentatively ready for "serious" usage. This list is not comprehensive.

- Adversarial inputs need to be handled well in the runtime libraries. Since the format is binary, and the code is written in C, there's a lot of potential for exploits. There are some guardrails in place already, and nothing prevents safety in theory, but there has been no big, focused effort in this direction yet.
- Better error handling, with readable error messages where applicable (e.g. `svfc` output). Currently, a lot of code simply sets an error flag, which gets checked later, and that's about it.
- More extensive testing. Some basic tests are in place, but something more serious needs to be done to catch corner cases. Perhaps a fuzzing test case generator.
- Clear platform support. This includes OS, CPU architectures, compilers, language standards.
- Schema-less messages, and generally more focus on optimizing how schemas are handled at runtime. Right now, the full schema is always included in the message, which is simple and easy to debug. But in network scenarios, for example, where the data might be much smaller than the schema and is transferred often, excluding it would be critical. This means the user's code would need some other way to get the schema. Also, checking schema compatibility is not free, and could be skipped in some cases.
- Once the data format is stabilized, some kind of unambiguous specification, whether formal or informal, needs to be written down. This is also true for the schema text language, although that is less important.
- A GUI tool to work with data, which is essential for debugging, and for general use as well. It needs to be easy to use, and runnable on most developer machines.

### Alternatives

Here are some established formats you might want to use instead. They have their own drawbacks, of course, which is why this project was started.

- Protocol Buffers: https://protobuf.dev/
- Cap'n Proto: https://capnproto.org/
- FlatBuffers: https://flatbuffers.dev/
- Text-based formats, like JSON, YAML, or XML.

### License

[MIT](./LICENSE.txt).
9 changes: 6 additions & 3 deletions library/misc/natvis.xml
@@ -1,9 +1,12 @@
 <?xml version="1.0" encoding="utf-8"?>
 <AutoVisualizer xmlns="http://schemas.microsoft.com/vstudio/debugger/natvis/2010">
   <Type Name="Range&lt;*&gt;">
-    <DisplayString>{{ count={count} }}</DisplayString>
-    <Expand>
-      <Item Name="[count]" ExcludeView="simple">count</Item>
+    <AlternativeType Name="svf::runtime::Range&lt;*&gt;" />
+    <DisplayString>{pointer,[count]}</DisplayString>
+    <StringView>pointer,[count]</StringView>
+    <Expand>
+      <Item Name="[count]">count</Item>
+      <Item Name="[pointer]">pointer,[count]</Item>
       <ArrayItems>
         <Size>count</Size>
         <ValuePointer>pointer</ValuePointer>
18 changes: 9 additions & 9 deletions svf_runtime/src/svf_compatibility.c
@@ -8,8 +8,8 @@ typedef struct SVFRT_CheckContext {
   SVF_META_Schema* s1;
 
   // Byte ranges of the two schemas.
-  SVFRT_RangeU8 r0;
-  SVFRT_RangeU8 r1;
+  SVFRT_Bytes r0;
+  SVFRT_Bytes r1;
 
   // What do Schema 1 structs/choices match to in Schema 0?
   SVFRT_RangeU32 s1_struct_matches;
@@ -171,6 +171,7 @@ bool SVFRT_check_struct(
     // Internal error.
     return false;
   }
+  ctx->s1_struct_matches.pointer[s1_index] = s0_index;
 
   SVFRT_RangeStructDefinition structs0 = SVFRT_RANGE_FROM_ARRAY(ctx->r0, ctx->s0->structs, SVF_META_StructDefinition);
   SVFRT_RangeStructDefinition structs1 = SVFRT_RANGE_FROM_ARRAY(ctx->r1, ctx->s1->structs, SVF_META_StructDefinition);
@@ -254,7 +255,6 @@ bool SVFRT_check_struct(
     }
   }
 
-  ctx->s1_struct_matches.pointer[s1_index] = s0_index;
   ctx->s1_struct_strides.pointer[s1_index] = s0->size;
 
   return true;
@@ -274,6 +274,7 @@ bool SVFRT_check_choice(
     // Internal error.
     return false;
   }
+  ctx->s1_choice_matches.pointer[s1_index] = s0_index;
 
   SVFRT_RangeChoiceDefinition choices0 = SVFRT_RANGE_FROM_ARRAY(ctx->r0, ctx->s0->choices, SVF_META_ChoiceDefinition);
   SVFRT_RangeChoiceDefinition choices1 = SVFRT_RANGE_FROM_ARRAY(ctx->r1, ctx->s1->choices, SVF_META_ChoiceDefinition);
@@ -346,15 +347,14 @@
     }
   }
 
-  ctx->s1_choice_matches.pointer[s1_index] = s0_index;
   return true;
 }
 
 void SVFRT_check_compatibility(
   SVFRT_CompatibilityResult *result,
-  SVFRT_RangeU8 scratch_memory,
-  SVFRT_RangeU8 schema_write,
-  SVFRT_RangeU8 schema_read,
+  SVFRT_Bytes scratch_memory,
+  SVFRT_Bytes schema_write,
+  SVFRT_Bytes schema_read,
   uint64_t entry_name_hash,
   SVFRT_CompatibilityLevel required_level,
   SVFRT_CompatibilityLevel sufficient_level
@@ -367,8 +367,8 @@ void SVFRT_check_compatibility(
     return;
   }
 
-  SVFRT_RangeU8 r0 = schema_write;
-  SVFRT_RangeU8 r1 = schema_read;
+  SVFRT_Bytes r0 = schema_write;
+  SVFRT_Bytes r1 = schema_read;
 
   // This will break when proper alignment is done. @proper-alignment
   SVF_META_Schema *s0 = (SVF_META_Schema *) (r0.pointer + r0.count - sizeof(SVF_META_Schema));
