Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add basic write support #45

Open
wants to merge 20 commits into
base: main
Choose a base branch
from
Open

Add basic write support #45

wants to merge 20 commits into from

Conversation

pka
Copy link

@pka pka commented Oct 19, 2024

This is a minimal writer implementation. Before implementing the missing parts, I would like to discuss the basic design.

Main questions:

  1. Should we prepare for other backends (e.g. S3)?
    • (Trait could be done later)
  2. Do we want an async API?
    • Probably depends on 1)

Missing functionality:

  • Leaf directories
  • API for setting header data (center_zoom, etc.)
  • Configurable compression
  • Compression for tiles
  • Proper errors
  • Some TODOs and unwraps

@nyurik
Copy link
Collaborator

nyurik commented Oct 21, 2024

wrt 1 - you mean writing directly to s3? Does s3 allow random access of the already open file (we would need to go back to the directory to update file positions i would think)

@pka
Copy link
Author

pka commented Oct 21, 2024

AFAIK, S3 doesn't support seek operations or partial updates, but with a multipart upload, you can defer uploading header and root directory. So a direct upload to S3 would be possble. But this has no high priority for me.

@nyurik
Copy link
Collaborator

nyurik commented Oct 21, 2024

TIL, thx. And lets follow YAGNI - if someone wants it, they will add it :)

@pka
Copy link
Author

pka commented Oct 23, 2024

This is now ready for a review.

I realized an important property, which I want to add as a followup: deduplicating non-subsequent tiles. This has performance drawbacks when reading, but probably reduces the file size substantially.

Copy link
Collaborator

@nyurik nyurik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks pretty good! I left a few thoughts

justfile Outdated Show resolved Hide resolved
src/async_reader.rs Outdated Show resolved Hide resolved
src/directory.rs Outdated Show resolved Hide resolved
src/lib.rs Outdated Show resolved Hide resolved
src/writer.rs Show resolved Hide resolved
src/writer.rs Outdated Show resolved Hide resolved
src/writer.rs Outdated Show resolved Hide resolved
src/writer.rs Outdated Show resolved Hide resolved
src/writer.rs Outdated Show resolved Hide resolved
src/writer.rs Show resolved Hide resolved
@pka
Copy link
Author

pka commented Oct 24, 2024

Regarding the failed semver check:

--- failure enum_variant_added: enum variant added on exhaustive enum ---

Description:
A publicly-visible enum without #[non_exhaustive] has a new variant.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#enum-variant-new
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.36.0/src/lints/enum_variant_added.ron

Failed in:

     Summary semver requires new major version: 1 major and 0 minor checks failed
  variant PmtError:IndexEntryOverflow in /home/runner/work/pmtiles-rs/pmtiles-rs/src/error.rs:28

Shall I add #[non_exhaustive] or increase the version?

@nyurik
Copy link
Collaborator

nyurik commented Oct 25, 2024

Might as well bump the version

@nyurik
Copy link
Collaborator

nyurik commented Oct 25, 2024

Why do you want to make it non-exhaustive? I am ok to break compat - not a biggie tbh

@nyurik
Copy link
Collaborator

nyurik commented Oct 25, 2024

but come to think of it, errors are a pretty good candidate for non-exhaustive, so yeah, lets keep it going forward... but enabling it would still be a breaking change, right?

@nyurik
Copy link
Collaborator

nyurik commented Oct 25, 2024

one more thing - i think write support is a less used feature, and since it requires extra dependencies, lets make this into an optional feature

@nyurik nyurik requested a review from lseelenbinder October 25, 2024 18:30
@pka
Copy link
Author

pka commented Oct 25, 2024

one more thing - i think write support is a less used feature, and since it requires extra dependencies, lets make this into an optional feature

flate2 is already an indirect dependency and countio is quite small. So I thougt it's not worth the hassle.

@lseelenbinder
Copy link
Member

Ohhh! This is great. Just seeing the PR now for some reason. I'll give it a review shortly. Thank you @pka!

About 3 times faster and very similar compression rate for MVTs
README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
mod tile;
mod writer;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you clarify why cfg(__async) was removed here? We really ought to make this PR behind a feature flag

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was glad to reduce the cfg(__) complexity a bit with 72f5762 - it reminded me of the bad old C/++ #ifndef days. IMO, a pmtile library should come with its core functionality (like calculating a tile id) and dependencies (like flate2) by default. I understand the feature flags for the different backend implementations with heavy or specialized dependencies, but not for everything. That said, if the maintainers prefer a feature flag for writing pmtile files, I can give it a try.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that we shouldn't go overboard on features. My only concern is that if we have additional dependencies due to write support, it is better to make write - optional, simply because write is a relatively rare usage compare to read. My understanding is that you need compression, which might be a significant extra compilation burden - and that IMO should not be levied on users unless needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a feature would be nice.

Copy link
Member

@lseelenbinder lseelenbinder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, thanks for setting this all up! I'm glad we're getting this functionality added.

I'd like to understand how much impact having the leaf directories at the end of the archive would have on performance. Hopefully it's not too much, because this elegantly solves the need for a temporary file.

Another option is to reserve N bytes at the beginning of every archive for the header + directories, but that's suboptimal as well.

README.md Show resolved Hide resolved
README.md Show resolved Hide resolved
@@ -206,6 +209,9 @@ impl<B: AsyncBackend + Sync + Send, C: DirectoryCache + Sync + Send> AsyncPmTile
.read_to_end(&mut decompressed_bytes)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The decompressed_bytes declaration should probably move inside of this now, if we're supporting compression modes where it's not needed.

mod tile;
mod writer;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a feature would be nice.

src/writer.rs Show resolved Hide resolved

/// Set the compression for metadata and directories.
#[must_use]
pub fn internal_compression(mut self, compression: Compression) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this (and a few other of the functions) use a Builder pattern without using a Builder struct. Can we split the building from the Writer struct? Then the writer struct will be quite simple (really only needs an add_tile and finalize method).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PmTilesWriter is a builder struct, generating a PmTilesStreamWriter struct with create. Builder is not in the name, because the resulting code looks better without it, IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Hmm…I see. I think the nomenclature threw me off: I wouldn't expect a PmTilesWriter to produce a PmTilesStreamWriter—one of which is a builder for the other.

Any ideas for a better name? I think the namespacing is probably contributing to it. We probably should have pmtiles::Writer and pmtiles::Reader instead of PmTilesWriter and …PmTilesReader.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we switched to pmtiles::writer::StreamWriter then we could have pmtiles::writer::Builder without confusion.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I preferred the somewhat confusing names for prettier resulting user code:

let mut writer = Builder::new(TileType::Mvt).create(file)?
vs.
let mut writer = PmTilesWriter::new(TileType::Mvt).create(file)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can accomplish the same thing with name spacing: pmtiles::writer::StreamBuilder and pmtiles::writer::StreamWriter. It's a pretty common pattern in Rust, and I think keeps things concise and manageable.

Any thoughts on this @nyurik?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a few seconds to understand "name spacing" meant "namespacing" :D Naming is hard...

I am not a big fan of mega-generic names like Builder by itself, nor adding namespace as part of the required pattern... I do like PmTilesWriter::new(...).create(...) as it is cleaner to the user. We could name the resulting struct as PmTilesWriterStream or PmTilesStream or whatever.

Copy link
Member

@lseelenbinder lseelenbinder Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this is a pretty fundamental decision—so I'd say this blocks the MR until we decide.

I did some research, and here's my suggestion.

The "thing" we want to build is called PmTilesWriter (I think "stream" is an implementation detail—99% of use-cases prefer streaming anyways.).

Then we can have a Builder in the same module, but hide the complexity behind a PmTilesWriter::builder() call.

This results in the following pseudocode:

let writer = PmTilesWriter::builder(tile_type).create("file.pmtiles");

The user never needs to know about the underlying builder structure, and it keeps the naming manageable. I'm happy to do the refactor, but just want to make sure there's no strong objections.

src/writer.rs Show resolved Hide resolved
src/writer.rs Show resolved Hide resolved
src/writer.rs Show resolved Hide resolved
src/writer.rs Outdated Show resolved Hide resolved
@pka
Copy link
Author

pka commented Oct 29, 2024

I'd like to understand how much impact having the leaf directories at the end of the archive would have on performance. Hopefully it's not too much, because this elegantly solves the need for a temporary file.

Another option is to reserve N bytes at the beginning of every archive for the header + directories, but that's suboptimal as well.

I would not expect a performance impact, even via http. An optimization for initial loading would be adding the entries for z0-z2 to the root directory, which is a TODO copied from the pmtiles-go implementation. Shouldn't be difficult to add to the current implementation.

@lseelenbinder
Copy link
Member

wrt 1 - you mean writing directly to s3? Does s3 allow random access of the already open file (we would need to go back to the directory to update file positions i would think)

I think this would be theoretically possible if we use multi-part upload, and refrain from uploading the first part until the end, but I'd have to check the docs to confirm this is possible.

@lseelenbinder
Copy link
Member

When I have the chance, I was planning to grab a copy of this PR and do some additional cleanup that makes sense now. I'll make it dependent against your code, @pka, so we don't have merge issues.

I'm struggling with how much to ask for in this PR vs what we should clean up after. Let's not immediately release after merging to give us some time to clean up the APIs in a more general way.

@lseelenbinder
Copy link
Member

Okay! We're close. Just have to resolve the naming detail, and I'll be happy to do a cleanup pass after merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants