Chunk store #848

Merged
merged 13 commits into from
Nov 5, 2024

Conversation

cody-littley
Contributor

Why are these changes needed?

This PR adds a framework that will be used by encoders to upload data.

As requested by @dmanc, I split apart the logic for uploading proofs and uploading coefficients into separate methods.

Since this functionality is needed to unblock @dmanc, I'm pushing it in a partially completed form. Namely, the framework does not currently break large files into smaller ones when pushing to S3. I plan on adding that as a follow up task.

Checks

  • I've made sure the lint is passing in this PR.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

@cody-littley cody-littley self-assigned this Oct 30, 2024
@@ -0,0 +1,103 @@
package chunkstore
Contributor

I feel like chunkstore belongs outside disperser. Should it be under relay?

Contributor Author

Moved it to the relay.

I was actually a little confused about where this should live. The relay reads from it, but the encoder will be writing to it. But I guess relay is as good as anywhere (unless we decide to make it a top level directory, which I don't think it deserves to be).

@@ -33,3 +37,61 @@ func Decode(b []byte) (Frame, error) {
}
return f, nil
}

// EncodeFrames serializes a slice of frames into a byte slice.
func EncodeFrames(frames []*Frame) ([]byte, error) {
Contributor

Why don't we use existing serialization methods like this?

Contributor

From the split encoder's perspective, we want to serialize the proofs and the coefficients separately. From the node's perspective, we could make sure it receives the chunks in the expected serialized format.

Contributor Author

@dmanc requested that the chunk store have the capability of uploading the encoding.Proof objects and rs.Frame objects separately. The code you point to is capable of serializing both at the same time, but does not provide a way to serialize/deserialize them separately.

Contributor

I see. How about this one?

func (r *chunkReader) GetChunkCoefficients(
ctx context.Context,
blobKey disperser.BlobKey,
metadata *ChunkCoefficientMetadata) ([]*rs.Frame, error) {
Contributor

Does chunk reader need chunkMetadataStore if this method takes metadata as input?

Contributor Author

Very good point, it won't need this to be passed in. I didn't notice because it's not actually used in this PR's iteration of the feature. Removed.

)

// ChunkMetadataStore is an interface for storing and retrieving metadata about chunks.
type ChunkMetadataStore interface {
Contributor

Not sure if we need another store abstraction for writing/reading chunk metadata.
Since chunk metadata lives inside blob metadata, I think write/read should happen via blob metadata store.

Contributor Author

This was something separate since we had originally discussed not putting the extra metadata into the regular blob metadata store. Now that this data has merged into the other blob metadata, I agree that it doesn't make sense to have a separate chunk metadata store. Removed.

// The total size of file containing all chunk coefficients for the blob.
DataSize int
// The maximum fragment size used to store the chunk coefficients.
FragmentSize int
Contributor

Could we be more specific here and use uint64, or whatever is appropriate, rather than the generic int type?

Contributor Author

@cody-littley cody-littley Oct 31, 2024

Will do. Let's go with uint64 for the sake of future compatibility (I can't imagine having >4gb files, but let's not limit ourselves here... it's not that much overhead).

As a side note, one of the quirks of golang that drives me up a wall is how they strongly encourage everybody to use the int type everywhere. For example, why does len(x) return a signed value? If I were in charge of the language design, I'd never have supported the types int and uint in the first place. /rant

bytes = append(bytes, proofBytes[:]...)
}

err := c.s3Client.UploadObject(ctx, c.bucketName, s3Key, bytes)
Contributor

What is our S3 object versioning policy? Would we need to care about objects that already exist before uploading?

Contributor Author

Discussed offline, repeating conclusion of that talk here for others.

Since each key is unique and has a deterministic value, writing a value to a key more than once is harmless (i.e. the data is overwritten with the exact same data).
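A minimal sketch of why the repeated write is harmless, assuming the store replaces whatever object already sits under a key. The `memoryStore` type and its `put` method are hypothetical stand-ins for the S3 client, not code from this PR.

```go
package main

import (
	"bytes"
	"fmt"
)

// memoryStore is a hypothetical stand-in for an S3 bucket:
// a put replaces any existing object stored under the same key.
type memoryStore struct {
	objects map[string][]byte
}

func newMemoryStore() *memoryStore {
	return &memoryStore{objects: make(map[string][]byte)}
}

// put stores a private copy of data under key, overwriting any previous value.
func (m *memoryStore) put(key string, data []byte) {
	m.objects[key] = append([]byte(nil), data...)
}

func main() {
	store := newMemoryStore()

	// The key is assumed to be derived deterministically from the blob,
	// so the same key always maps to the same payload.
	key := "example-blob-key"
	payload := []byte("bytes derived deterministically from the blob")

	store.put(key, payload)
	first := append([]byte(nil), store.objects[key]...)

	// A retry (or a second encoder) uploads the same bytes again.
	store.put(key, payload)

	// One object, and its content is unchanged by the second write.
	fmt.Println(len(store.objects), bytes.Equal(first, store.objects[key]))
}
```

This only holds while the payload really is a pure function of the key; if two writers could produce different bytes for one key, the last write would win silently.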

Contributor

@ian-shim ian-shim left a comment

lgtm! one comment re: serialization. Can you also take a look at the lint failure?

@cody-littley cody-littley marked this pull request as draft November 1, 2024 13:56
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
@cody-littley cody-littley marked this pull request as ready for review November 4, 2024 17:01
Contributor

@dmanc dmanc left a comment

Looks good, left a couple small comments

// GnarkEncodeFrames serializes a slice of frames into a byte slice.
func GnarkEncodeFrames(frames []*Frame) ([]byte, error) {

// Serialization format:
Contributor

Should we move this to above the function so it shows up in the go docs?

Contributor Author

Done.

// GnarkEncodeFrames serializes a slice of frames into a byte slice.
//
// Serialization format:
// [number of frames: 4 byte uint32]
// [size of frame 1: 4 byte uint32][frame 1]
// [size of frame 2: 4 byte uint32][frame 2]
// ...
// [size of frame n: 4 byte uint32][frame n]
//
// Where relevant, big endian encoding is used.
func GnarkEncodeFrames(frames []*Frame) ([]byte, error) {
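
The layout in the doc comment above can be sketched roughly as follows. This is an illustration only: it treats each frame as an already-serialized byte slice, and `encodeFrameList` / `decodeFrameList` are hypothetical helper names, not the functions added in this PR.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeFrameList packs already-serialized frames as:
// [frame count: 4 byte big-endian uint32], then for each frame
// [frame size: 4 byte big-endian uint32][frame bytes].
// Hypothetical helper; frames are treated as opaque bytes.
func encodeFrameList(frames [][]byte) []byte {
	var prefix [4]byte
	binary.BigEndian.PutUint32(prefix[:], uint32(len(frames)))
	out := append([]byte{}, prefix[:]...)
	for _, frame := range frames {
		binary.BigEndian.PutUint32(prefix[:], uint32(len(frame)))
		out = append(out, prefix[:]...)
		out = append(out, frame...)
	}
	return out
}

// decodeFrameList reverses encodeFrameList and rejects truncated
// or over-long input.
func decodeFrameList(data []byte) ([][]byte, error) {
	if len(data) < 4 {
		return nil, errors.New("missing frame count")
	}
	count := binary.BigEndian.Uint32(data)
	data = data[4:]

	var frames [][]byte
	for i := uint32(0); i < count; i++ {
		if len(data) < 4 {
			return nil, fmt.Errorf("frame %d: missing size prefix", i)
		}
		size := binary.BigEndian.Uint32(data)
		data = data[4:]
		if uint64(size) > uint64(len(data)) {
			return nil, fmt.Errorf("frame %d: truncated payload", i)
		}
		frames = append(frames, data[:size])
		data = data[size:]
	}
	if len(data) != 0 {
		return nil, errors.New("unexpected trailing bytes")
	}
	return frames, nil
}

func main() {
	encoded := encodeFrameList([][]byte{[]byte("ab"), []byte("cde")})
	decoded, err := decodeFrameList(encoded)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes encoded, %d frames decoded\n", len(encoded), len(decoded))
}
```

The per-frame size prefix is what lets a reader skip or validate frames without knowing their internal encoding.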

return nil, 0, fmt.Errorf("invalid frame size: %d", len(serializedFrame))
}

coeffs := make([]encoding.Symbol, frameCount)
Contributor

nit: I feel like encoding.Symbol is not really used anywhere. Maybe it's worth deprecating it and just using fr.Element.

// Symbol is a symbol in the field used for polynomial commitments
type Symbol = fr.Element

Contributor Author

done

Signed-off-by: Cody Littley <[email protected]>

func (r *chunkReader) GetChunkProofs(
ctx context.Context,
blobKey disperser.BlobKey) ([]*encoding.Proof, error) {
Contributor

This blobKey seems to reference the v1 blob key. In StoreBlob for V2 we pass a string, i.e. blobKey.Hex():

func (b *BlobStore) StoreBlob(ctx context.Context, blobKey string, data []byte) error {

V2 blob key:

type BlobKey [32]byte

func (b BlobKey) Hex() string {
	return hex.EncodeToString(b[:])
}

Contributor

Also when we fetch for proofs vs coefficients don't we need a different S3 key to differentiate it?

Contributor Author

@cody-littley cody-littley Nov 5, 2024

I've been assuming we'd use different buckets. Started a slack conversation to discuss. Will circle back on this prior to merging once we decide how we want to handle buckets and namespacing.

Contributor Author

I've switched over to using v2.BlobKey as recommended by @ian-shim.
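
For readers following the proofs-versus-coefficients question in this thread: one alternative to separate buckets is to namespace the object keys inside a single bucket. The sketch below reuses the BlobKey shape quoted earlier in the thread; the two prefixes and the helper names are purely hypothetical and do not reflect what was decided in the Slack discussion.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// BlobKey mirrors the V2 blob key quoted in the thread.
type BlobKey [32]byte

// Hex returns the hex encoding of the key.
func (b BlobKey) Hex() string {
	return hex.EncodeToString(b[:])
}

// Hypothetical prefixes that keep proofs and coefficients for the
// same blob from colliding when they share a bucket.
const (
	proofPrefix       = "proofs/"
	coefficientPrefix = "coefficients/"
)

// proofObjectKey returns the object key for a blob's proofs (assumed layout).
func proofObjectKey(key BlobKey) string {
	return proofPrefix + key.Hex()
}

// coefficientObjectKey returns the object key for a blob's coefficients (assumed layout).
func coefficientObjectKey(key BlobKey) string {
	return coefficientPrefix + key.Hex()
}

func main() {
	var key BlobKey
	key[0] = 0xab

	fmt.Println(proofObjectKey(key))
	fmt.Println(coefficientObjectKey(key))
}
```

With separate buckets, as assumed in the reply above, the bare blobKey.Hex() can serve as the object key in both and no prefix is needed.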


// ChunkCoefficientMetadata contains metadata about how chunk coefficients are stored.
// Required for reading chunk coefficients using ChunkReader.GetChunkCoefficients().
type ChunkCoefficientMetadata struct {
Contributor

Are these the same? encoding.FragmentInfo

type FragmentInfo struct {
	TotalChunkSizeBytes uint32
	NumFragments        uint32
}

Contributor Author

@cody-littley cody-littley Nov 5, 2024

They aren't currently the same, but should be. Now fixed.

The primary reason why I didn't originally enable fragmented read/write operations was that I wasn't initially sure how the metadata store would handle this data. Now that Ian merged his PR, I've unified ChunkCoefficientMetadata with FragmentInfo and have enabled chunk file fragmentation.
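
As a rough illustration of what fragmentation involves, here is a sketch that splits a byte slice into contiguous fragments of bounded size and records the two FragmentInfo fields quoted above. The contiguous split and the `splitIntoFragments` / `joinFragments` helpers are assumptions made for the example; the fragment layout this PR actually uses is not shown in this conversation and may differ.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// FragmentInfo mirrors the struct quoted in the review thread.
type FragmentInfo struct {
	TotalChunkSizeBytes uint32
	NumFragments        uint32
}

// splitIntoFragments cuts data into contiguous pieces of at most
// fragmentSize bytes. Hypothetical helper, assumed layout.
func splitIntoFragments(data []byte, fragmentSize int) ([][]byte, FragmentInfo, error) {
	if fragmentSize <= 0 {
		return nil, FragmentInfo{}, errors.New("fragment size must be positive")
	}
	var fragments [][]byte
	for start := 0; start < len(data); start += fragmentSize {
		end := start + fragmentSize
		if end > len(data) {
			end = len(data)
		}
		fragments = append(fragments, data[start:end])
	}
	info := FragmentInfo{
		TotalChunkSizeBytes: uint32(len(data)),
		NumFragments:        uint32(len(fragments)),
	}
	return fragments, info, nil
}

// joinFragments concatenates fragments and checks the result against info.
func joinFragments(fragments [][]byte, info FragmentInfo) ([]byte, error) {
	if uint32(len(fragments)) != info.NumFragments {
		return nil, fmt.Errorf("expected %d fragments, got %d", info.NumFragments, len(fragments))
	}
	out := make([]byte, 0, info.TotalChunkSizeBytes)
	for _, fragment := range fragments {
		out = append(out, fragment...)
	}
	if uint32(len(out)) != info.TotalChunkSizeBytes {
		return nil, fmt.Errorf("expected %d bytes, got %d", info.TotalChunkSizeBytes, len(out))
	}
	return out, nil
}

func main() {
	data := []byte("0123456789")
	fragments, info, err := splitIntoFragments(data, 4)
	if err != nil {
		panic(err)
	}
	restored, err := joinFragments(fragments, info)
	if err != nil {
		panic(err)
	}
	fmt.Println(info.NumFragments, info.TotalChunkSizeBytes, bytes.Equal(data, restored))
}
```

Storing both the total size and the fragment count lets a reader know how many objects to fetch and verify that the reassembled data is complete.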

@cody-littley cody-littley merged commit 5fd9a08 into Layr-Labs:master Nov 5, 2024
6 checks passed
4 participants