
StreamChain: An API for streams-processing data (e.g. HTTP → ZIP → XML → HTML) #1

Open · wants to merge 72 commits into base: trunk

Conversation

@adamziel (Owner) commented Jul 15, 2024

This PR explores a generic Stream interface that allows piping data through different format processors, e.g. HTTP request → ZIP decoder → XML reader → HTML Processor → WordPress Database.

Jump to the last status update and feedback request.


Historically, this PR started as an exploration of rewriting URLs in a remote WXR file.

Either way, it brings together all the stream processing explorations in WordPress to enable stream-rewriting site URLs in a WXR file coming from a remote server. All of that with no curl, DOMDocument, or other PHP dependencies. It's just a few small libraries built with WordPress core in mind.

The rewriter is easy to extend. It could, for example, stream-rewrite data from a zipped XML file, re-zip it on the fly, and return it as an HTTP response.

FYI @dmsnell @akirk @brandonpayton @bgrgicak @jordesign @mtias @griffbrad – this is exploratory for now, but will likely become relevant for production use sooner than later.

Related to:


* [AsyncHttp\Client](WordPress/blueprints#52)
* [WP_XML_Processor](WordPress/wordpress-develop#6713)
* [WP_Block_Markup_Url_Processor](https://github.com/adamziel/site-transfer-protocol)
* [WP_HTML_Tag_Processor](https://developer.wordpress.org/reference/classes/wp_html_tag_processor/)

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";
$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $string_new_site_url           = 'https://mynew.site/';
        $parsed_new_site_url           = WP_URL::parse( $string_new_site_url );

        $current_site_url              = 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/';
        $parsed_current_site_url       = WP_URL::parse( $current_site_url );

        $base_url = 'https://playground.internal';
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();
```
@adamziel (Owner, Author) commented Jul 16, 2024

Show me the code

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";

$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();
```

Architecture

The rewriter explored here pipes and stream-processes data as follows:

AsyncHttp\Client -> WP_XML_Processor -> WP_Block_Markup_Url_Processor -> WP_Migration_URL_In_Text_Processor -> WP_URL

The layers of data at play are:

  • AsyncHttp\Client: HTTPS encrypted data -> Chunked encoding -> Gzip compression
  • WP_XML_Processor: XML (entities, attributes, text, comments, CDATA nodes)
  • WP_Block_Markup_Url_Processor: HTML (entities, attributes, text, comments, block comments), JSON (in block comments)
  • WP_Migration_URL_In_Text_Processor: URLs in text nodes
  • WP_URL: URL parsing and serialization

Remaining work

This PR explores a Streaming / Pipes API to make the streams easy to compose and visualize. While the implementation may change, the goal is to pipe chunks of data as far as possible from upstream to downstream while supporting both blocking and non-blocking streams.

  • Build new ZipReaderStream() and new ZipWriterStream() – what would be the API to manage multiple files?
  • Explore new BlockMarkupToMarkdownStream() and new MarkdownToBlockMarkupStream()
  • Explore a new SQLDumpProcessorStream( $value_visitor ) to rewrite URLs in database dump files before importing them

Open Questions

Passing bytes around is great for a consistent interface and byte-oriented operations.

However, an HTTP request yields response headers before the body. Reading from a ZIP file produces a series of metadata and data streams – one for every decoded file. How can we use pipes with these more complex data structures? Should we even try? If yes, what would be the API? Would there be multiplexing? Or returning other data types? Or would it be a different interface?

@adamziel (Owner, Author) commented Jul 16, 2024

I've been exploring a Pipe-based API for easily composing all those data transformations. Here's what I came up with:

```php
Pipe::run( [
	new RequestStream( new Request( 'https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr' ) ),
	new XMLProcessorStream(function (WP_XML_Processor $processor) {
		if(is_wxr_content_node($processor)) {
			$text         = $processor->get_modifiable_text();
			$updated_text = Pipe::run([
				new BlockMarkupURLRewriteStream( 
					$text,
					[
						'from_url' => 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/',
						'to_url'   => 'https://mynew.site/',
					]
				),
			]);
			if ( $updated_text !== $text ) {
				$processor->set_modifiable_text( $updated_text );
			}
		}
	}),
	new EchoStream(),
] );
```

It's based on the following two interfaces (that are likely to keep changing for now):

```php
interface ReadableStream {
	public function read(): bool;
	public function is_finished(): bool;
	public function consume_output(): ?string;
	public function get_error(): ?string;
}

interface WritableStream {
	public function write( string $data ): bool;
	public function get_error(): ?string;
}
```
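
For illustration, here's a minimal sketch of a transform stream that satisfies both interfaces by uppercasing whatever is written to it. The class name and the buffering strategy are assumptions, not code from this branch:

```php
class UppercaseStream implements ReadableStream, WritableStream {
	private $buffer   = '';
	private $finished = false;
	private $error    = null;

	public function write( string $data ): bool {
		// Transform the incoming chunk and hold it until a reader consumes it.
		$this->buffer .= strtoupper( $data );
		return true;
	}

	public function read(): bool {
		// Report whether there is any output ready to consume.
		return '' !== $this->buffer;
	}

	public function consume_output(): ?string {
		if ( '' === $this->buffer ) {
			return null;
		}
		$chunk        = $this->buffer;
		$this->buffer = '';
		return $chunk;
	}

	public function is_finished(): bool {
		// Nothing in these interfaces signals end-of-input yet, so $finished
		// would have to be flipped by a future close()/end() mechanism.
		return $this->finished && '' === $this->buffer;
	}

	public function get_error(): ?string {
		return $this->error;
	}
}
```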

Here are a few more streams I would like to have:

  • new BlockMarkupToMarkdownStream() and new MarkdownToBlockMarkupStream()
  • new SQLDumpProcessorStream( $value_visitor ) to rewrite URLs in database dump files before importing them
  • new ZipReaderStream() and new ZipWriterStream() – what would be the API to manage multiple files?
  • new GitSparseCheckoutStream()

That way we'll be able to put together pipes like this:

```php
Pipe::run( [
	new RequestStream( new Request( 'https://site.com/export.wxr.zip' ) ),
	new ZipReaderStream( '/export.wxr' ),
	new XMLProcessorStream(function (WP_XML_Processor $processor) use ($assets_downloader) {
		if(is_wxr_content_node($processor)) {
			$text         = $processor->get_modifiable_text();

			// Download the missing assets files
			$assets_downloader->process( $text );
			if(!$assets_downloader->everything_already_downloaded()) {
			    // Don't import content that has pending downloads
			    return;
			}

			// Update the URLs in the text
			$updated_text = Pipe::run([
				new BlockMarkupURLRewriteStream( 
					$text,
					[ 'from_url' => $from_site, 'to_url'  => $to_site ]
				),
			]);
			if ( $updated_text !== $text ) {
				$processor->set_modifiable_text( $updated_text );
			}
		}
	})
] );
```

or this:

```php
Pipe::run( [
	new GitSparseCheckoutStream( 'https://github.com/WordPress/gutenberg.git', [
		'docs/**/*.md'
	] ),
	new MarkdownToBlockMarkupStream(),
	new BlockMarkupURLRewriteStream( 
		$text,
		[ 'from_url' => $from_site, 'to_url'  => $to_site ]
	),
	new CreatePageStream()
] );
```

@adamziel (Owner, Author) commented Jul 16, 2024

I've played with ideas like flatMap() and filter() to express more complex data flows: objects and byte streams, concurrent and serial streams, and splitting and combining the data flow:

```mermaid
graph TD
    A[HttpClient] -->|runs 10 concurrent requests| B[Pipeline]
    
    B -->|filter ZIP files| C[ZipPipeline]
    B -->|filter XML files| D[XmlPipeline]

    C -->|decode ZIP files| E[ZipDecoder]
    E -->|output XML entries| F[ZipXmlFilter]
    F -->|filter XML files| G[XmlProcessor]

    D -->|passthrough| G

    G -->|find WXR content nodes| H[XmlProcessor]
    H -->|parse as HTML| I[BlockMarkupURLProcessor]
    I -->|rewrite URLs| J[HTML string]
    J -->|write to local files| K[LocalFileWriter]

    classDef blue fill:#bbf,stroke:#f66,stroke-width:2px;
    class B,C,D,E,F,G,H,I,J,K blue;
```

Sadly, the best result I got was a complex DSL you couldn't use without spending time with the documentation:

```php
<?php
// Create the main pipeline
$pipeline = HttpClient::pipeline([
    "http://example.com/file1.zip",
    "http://example.com/file2.zip",
    "http://example.com/file3.zip",
    "http://example.com/file4.zip",
    "http://example.com/file5.zip",
    "http://example.com/file6.xml",
    "http://example.com/file7.xml",
    "http://example.com/file8.xml",
    "http://example.com/file9.xml",
    "http://example.com/file10.xml"
]);

[$zipPipeline, $xmlPipeline] = $pipeline->split(HttpClient::filterContentType('application/zip'));

$zipPipeline
    ->flatMap(ZipDecoder::create())
    ->filter(Pipeline::filterFileName('.xml$'))
    ->combineWith($xmlPipeline)
    ->map(new WXRRewriter())
    ->map(Pipeline::defaultFilename('output.xml'))
    ->map(new LocalFileWriter('./'));
```

The alternative is the following imperative code:

```php
$zips = [
    "http://example.com/file1.zip",
    "http://example.com/file2.zip",
    "http://example.com/file3.zip",
    "http://example.com/file4.zip",
    "http://example.com/file5.zip",
];
$zip_decoders = [];
$xmls = [
    "http://example.com/file6.xml",
    "http://example.com/file7.xml",
    "http://example.com/file8.xml",
    "http://example.com/file9.xml",
    "http://example.com/file10.xml"
];
$local_paths = [];
$xml_rewriters = [];
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );

while ( $client->await_next_event() ) {
    $request = $client->get_request();
    $original_url = $request->original_request()->url;

    switch ( $client->get_event() ) {
        case Client::EVENT_HEADERS_RECEIVED:
            if ( in_array( $original_url, $zips ) ) {
                $zip_decoders[$original_url] = new ZipStreamReader();
            } else {
                $xml_rewriters[$original_url] = new XmlRewriter();
            }

            break;
        case Client::EVENT_BODY_CHUNK_AVAILABLE:
            if ( in_array( $original_url, $zips ) ) {
                $zip_decoders[$original_url]->write( $request->get_response_body_chunk() );
            } else {
                $xml_rewriters[$original_url]->write( $request->get_response_body_chunk() );
            }
            break;
        case Client::EVENT_FAILED:
        case Client::EVENT_FINISHED:
            // The decoders are keyed by the original URL above.
            unset( $zip_decoders[$original_url] );
            continue 2;
    }

    foreach( $zip_decoders as $url => $zip ) {
        if ( $zip->is_file_finished() ) {
            $zip->next_file();
        }
        while ( $zip->read() ) {
            if( $zip->get_last_error() ) {
                // TODO: Handle error
                continue 2;
            }

            $file = $zip->get_file_name();
            if(!isset($xml_rewriters[$file])) {
                $xml_rewriters[$file] = new XmlRewriter();
            }
            $xml_rewriters[$file]->write( $zip->get_content_chunk() );
        }
    }

    foreach ( $xml_rewriters as $url => $rewriter ) {
        while ( $rewriter->read() ) {
            file_put_contents(
                $local_paths[$url],
                $rewriter->get_response_body_chunk(),
                FILE_APPEND
            );
        }
    }
}
```

It is longer, sure, but there are far fewer ideas in it, you have more control, and it can also be encapsulated similarly to AsyncHttp\Client:

```php
public function next_chunk() {
    $this->await_response_bytes();
    $this->process_zip_chunks();
    $this->process_xml_chunks();
    $this->write_output_bytes();
}
```

It's not declarative but it's simple.

@akirk commented Jul 16, 2024

One option might be to have something like a Brancher extends TransformStream class that itself accepts single TransformStreams and/or a Pipe of multiple streams, which the Brancher would select either based on the content (maybe through a callback) or by taking the first stream that doesn't throw an exception.

I was wondering if something modeled after JavaScript Promises might be more flexible in providing branching abilities.

@adamziel (Owner, Author) commented Jul 16, 2024

> One option might be to have something like a Brancher extends TransformStream class that itself accepts single TransformStreams and/or a Pipe of multiple streams, which the Brancher would select either based on the content (maybe through a callback) or by taking the first stream that doesn't throw an exception.

Noodling on that idea, we'd need a new type category for multiple data flows:

  • MultiTransformer – (stream_id, in_chunk) => out_chunk – transforms many streams of the same type of data, e.g. rewrites many XML files at once.
  • Demultiplexer – single input, multiple outputs, e.g. an HTTP client could pipe a single byte stream into multiple HTTP sockets, each having its own response stream.
  • Multiplexer – multiple inputs, single output, e.g. a ZipEncoder could turn multiple File[] streams into a single byte stream.

Here's one way they could combine:

```php
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );

MultiPipeline::run([
    // This produces multiple Request[] streams
    $client->demultiplex(),

    MultiPipeline::branch(
        fn ( $request ) => is_zip( $request ),

        // ZipStreamDemultiplexer is a bytes -> File[] array transformer. It's not
        // a demultiplexer because each file is fully produced before the next
        // one, so there is no concurrent processing here. We could, perhaps, implement
        // it as a demultiplexer anyway to reduce the number of ideas in the codebase.
        [ fn () => new ZipStreamReader( '*.xml' ) ]
    ),

    // XmlRewriter is a regular bytes -> bytes stream. In here,
    // we support multiple concurrent XML streams.
    // We can skip the new MultiTransformer() call and have MultiPipeline backfill it for us.
    fn () => new XmlRewriter(),

    // And now we're gathering all the File objects into a single File stream.
    new Multiplexer(),

    fn () => new ZipStreamEncoder(),

    // Let's write to a local file.
    // At this point we only have a single stream id, but we're still
    // in a multi-stream world so we have to wrap with a MultiTransformer.
    fn () => new LocalFileWriter( 'out.zip' )
]);
```

This looks much better than the bloat I outlined in my previous comment. Perhaps it can be simplified even further.

Although, I guess it's not that different from:

```php
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );

$client
    ->demultiplex()
    ->branch(
        fn ( $request ) => is_zip( $request ),
        fn ( $branch ) => $branch->pipeTo( fn () => new ZipStreamReader( '*.xml' ) )
    )
    ->pipeTo( fn () => new XmlRewriter() )
    ->multiplex()
    ->pipeTo( new ZipStreamEncoder() )
    ->pipeTo( new LocalFileWriter( 'out.zip' ) );
```

One thing I'm not sure about is passing bytes vs File($metadata, $body_stream) objects. We don't need that as much in a byte processing world, but it's super useful in the demultiplexing world. We can either make the Byte streams pass around File/DataUnit objects, or we can convert between them and streams in the multi-stream world.

> I was wondering if something modeled after JavaScript Promises might be more flexible in providing branching abilities.

I don't have anything against callbacks, but I'd rather keep the data flow here as linear as possible and err on the side of simplicity over allowing multiple forks, splitting the data in success streams and error streams etc.

@adamziel (Owner, Author) commented Jul 16, 2024

I just realized piping objects is the same as piping bytes + metadata.

Therefore, we can pipe HTTP responses, ZIP files, etc. with almost no additional complexity. We would pipe bytes as we do now, and we'd also support moving an optional $metadata object along the pipe together with the bytes.

To support multiplexing, I introduced a StreamMetadata interface that requires a get_resource_id() method. That's how we can distinguish between chunks associated with different requests, files, etc.

A Demultiplexer is just a regular TransformStream that:

  • On write, it creates a new sub-pipe whenever it sees a new $resource_id. It then routes the incoming data chunks to the relevant sub-pipe.
  • On read, it goes through the pipes round-robin and outputs the next available set of bytes + metadata.
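
To make the routing concrete, here's a minimal standalone sketch of the write side. In the PR's terms this would be a regular TransformStream; the class name, constructor, and the write() signature with metadata are assumptions, and the StreamMetadata return type is assumed to be a string:

```php
interface StreamMetadata {
	public function get_resource_id(): string;
}

class DemultiplexerSketch {
	private $pipe_factory;
	private $sub_pipes = [];

	public function __construct( callable $pipe_factory ) {
		// Invoked once per newly seen resource ID to create its dedicated sub-pipe.
		$this->pipe_factory = $pipe_factory;
	}

	public function write( string $bytes, StreamMetadata $metadata ): bool {
		$resource_id = $metadata->get_resource_id();

		// First chunk for this resource? Spin up a new sub-pipe for it.
		if ( ! isset( $this->sub_pipes[ $resource_id ] ) ) {
			$this->sub_pipes[ $resource_id ] = call_user_func( $this->pipe_factory );
		}

		// Route the incoming chunk to the sub-pipe that owns this resource.
		return $this->sub_pipes[ $resource_id ]->write( $bytes, $metadata );
	}

	// Reading would then cycle through $this->sub_pipes round-robin and emit the
	// next available bytes + metadata pair, as described above.
}
```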

A Multiplexer isn't even needed as every pipe is a linear stream of bytes + metadata and, while demultiplexers augment that temporarily, they clean up after themselves.

Here's a snippet of code that actually works with the latest version of this branch:

```php
Pipe::run( [
	new RequestStream( [
		new Request( 'https://raw.githubusercontent.com/WordPress/blueprints-library/trunk/php.ini' ),
		new Request( 'https://raw.githubusercontent.com/WordPress/blueprints-library/trunk/phpcs.xml' ),
		new Request( 'https://raw.githubusercontent.com/WordPress/blueprints/trunk/blueprints/stylish-press/site-content.wxr' ),
	] ),

    // Filter response chunks as a flat list
	new FilterStream( fn ($metadata) => (
		str_ends_with( $metadata->get_filename(), '.xml' ) ||
		str_ends_with( $metadata->get_filename(), '.wxr' )
	) ),

    // This demultiplexer pipes each response through a separate
    // XMLProcessor so that each parser only deals with a single
    // XML document.
	new DemultiplexerStream(fn () => $wxr_rewriter()),

    // We're back to a flat list, let's strtoupper() each data chunk
	new UppercaseTransformer(),

    // A Pipe is also a TransformStream and allows us to compose multiple streams for demultiplexing
	new DemultiplexerStream(fn () => Pipe::from([
		new EchoTransformer(),
		new LocalFileStream(fn ($metadata) => __DIR__ . '/output/' . $metadata->get_resource_id() . '.chunk'),
	])),
] );
```

With this design, we could easily add a fluent API if needed and also add support for ZIP files and other data types.

Some open questions are:

  • How should stream errors be handled with multiplexing? How to allow one request to fail without stopping everything? How to catch that? Do we need a new CatchStream() after all?
  • What names would be useful here? There are streams, pipes, transformers – let's choose a cohesive set of terms.
  • Do we need to distinguish between Writable and Readable streams? Or would it be more useful for "non-writable" streams to ignore any data they receive, and for non-readable streams to pass through any data they receive?

@dmsnell commented Jul 17, 2024

I have found the loop-orientation of the HTML API useful and more concrete than abstract types and interfaces. To that end, I also like the way bookmarks get a user-defined name.

In these pipelines it seems like they could be added with a name, and a context object could provide stage-specific metadata and control through the entire stack.

For example, I could write something like this.

```php
Pipe::run( [
	'http' => new RequestStream( new Request( 'https://site.com/export.wxr.zip' ) ),
	'zip'  => new ZipReaderStream( '/export.wxr' ),
	'xml'  => new XMLProcessorStream(function (WP_XML_Processor $processor, $context) use ($assets_downloader) {
		if(!str_ends_with($context['zip']->filename, '.wxr')) {
			return $context['zip']->skip_file();
		}
		
		if(is_wxr_content_node($processor)) {
			$text         = $processor->get_modifiable_text();

			// Download the missing assets files
			$assets_downloader->process( $text );
			if(!$assets_downloader->everything_already_downloaded()) {
			    // Don't import content that has pending downloads
			    return;
			}

			// Update the URLs in the text
			$updated_text = Pipe::run([
				new BlockMarkupURLRewriteStream( 
					$text,
					[ 'from_url' => $from_site, 'to_url'  => $to_site ]
				),
			]);
			if ( $updated_text !== $text ) {
				$processor->set_modifiable_text( $updated_text );
			}
		}
	})
] );
```

In fact this whole stack could build a generator which can then be called in a loop.

```php
$pipe = Pipe::run( [ … ] );

while ( $context = $pipe->next() ) {
	list( 'xml' => $xml, 'zip' => $zip ) = $context;

	if ( ! str_ends_with( $zip->get_filename(), '.wxr' ) ) {
		$zip->skip_file();
		continue;
	}

	// start processing.
}
```

@adamziel (Owner, Author) commented Jul 17, 2024

@dmsnell I love the idea, but I'm confused about the details. Would the loop run for every stage of the pipeline? Or just for the final outcome? In the latter scenario, the filtering would happen after the chunks have already been processed. Also, what would this look like for the "demultiplexing" (streaming 5 concurrent requests) and "branching" (only unzip ZIP files) use cases?

@dmsnell commented Jul 17, 2024

no idea @adamziel 😄

but I think it relates to the need for requesting more. for example, the loop could execute as soon as any and every stage has something ready to process.

in the case of XML, it could sit there in the loop and as long as it doesn't have enough data to process could say $context->continue(). this is, in effect, a flattened version of the pipeline - perhaps the tradeoff is being explicit about what runs. but you can filter things before they unpack and this was my attempt to highlight in the code snippet. the following lines would do something like $zip->read_file(), possibly.

for demultiplexing I would assume that the multiplexed stream would provide a way to access the contents of each sub-stream.

@adamziel (Owner, Author) commented Jul 17, 2024

I like reducing nesting @dmsnell. While demuxing is powerful, it's also complex and feels like solving an overly general problem instead of tailoring something simple to WordPress use-cases. Here's a take on processing multiple XML files using a flat stream structure:

```php
Pipe::run( [
	'http' => new RequestStream( [ /* ... */ ] ),
	'zip'  => new ZipReaderStream( function ( $context ) {
		if(!str_ends_with($context['http']->url, '.zip')) {
			return $context->skip();
		}
		$context['zip']->set_processed_resource( $context['http']->url );
	} ),
	'xml'  => new XMLProcessorStream( function ( $context ) {
		if( 
		    ! str_ends_with($context['zip']->filename, '.wxr') &&
		    ! str_ends_with($context['http']->url, '.wxr')
		) {
			return $context->skip();
		}

		$context['xml']->set_processed_resource( $context['zip']->filename );
		$xml_processor = $context['xml']->get_processor( );
		while(WXR_Processor::next_content_node($xml_processor)) {
			// Migrate URLs and download assets
		}
	}),
] );
```

@dmsnell commented Jul 17, 2024

if we want this, it would seem like each callback should potentially have access to the context of all stages above and below it, plus space for shared state.

in the case of fn () => …-style callbacks this isn't essential, but using function means that variables won't be enclosed. perhaps this is okay, but it's a wart to usage.
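
For illustration only (plain PHP semantics, not code from this PR): arrow functions capture enclosing variables automatically, while function closures only see what is explicitly imported with use:

```php
$from_site = 'https://old.example';

// Arrow functions capture $from_site from the enclosing scope by value automatically.
$arrow = fn ( $url ) => str_replace( $from_site, 'https://new.example', $url );

// Anonymous functions must enclose it explicitly, or $from_site is undefined inside.
$closure = function ( $url ) use ( $from_site ) {
	return str_replace( $from_site, 'https://new.example', $url );
};
```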

@adamziel (Owner, Author)

> but I think it relates to the need for requesting more. for example, the loop could execute as soon as any and every stage has something ready to process.

I think that's a must, otherwise we'd need buffer size / backpressure semantics. By processing each incoming chunk right away we may sometimes go too granular or do too many checks, but perhaps it wouldn't be too bad – especially when networking and not CPU is the bottleneck.

> if we want this, it would seem like each callback should potentially have access to the context of all stages above and below it, plus space for shared state.

Shared data and context lookaheads sound like trouble, though. I was hoping that read-only access to context from all the stages above would suffice.

@dmsnell commented Jul 17, 2024

> Shared data and context lookaheads sound like trouble, though. I was hoping that read-only access to context from all the stages above would suffice.

these are valid concerns. I share them. still, I think that undoubtedly, someone will want to do something like conditionally skip a file in the ZIP based on something in the WXR processor, and being able to interact with that from below seems much more useful.

this is maybe the challenge that separate callback functions create, because the flat model doesn't separate the layers.

@adamziel (Owner, Author) commented Jul 18, 2024

> these are valid concerns. I share them. still, I think that undoubtedly, someone will want to do something like conditionally skip a file in the ZIP based on something in the WXR processor, and being able to interact with that from below seems much more useful.

Agreed! The challenge is we may only get the information necessary to reject a file after processing 10 or 1,000 chunks from that file. I can only see three solutions here:

  • Stream to a local file first, do the filtering, then start another pipe to process the buffered list.
    • Ups: No risk of going out of memory. Low complexity.
    • Downs: Double processing. Slower. Need storage, could be 100GB for a large file.
  • Buffer the information in-memory or on disk until we can make a decision. Push all the chunks to the end of the pipe and then "cancel" some files and roll back any side effects the piping may have triggered. This could work with database inserts but not with piping to REST API requests.
    • Ups: Fast, single-pass processing.
    • Downs: Risk of going out of memory. Adds complexity. Rollbacks may require tracking changes and won't always be possible.
  • Buffer the information in-memory or on disk until we can make a decision. Stop processing before the decision point, then filter out some files and pipe the rest to the next stage.
    • Ups: Still fast "1.5-pass" processing. Reprocessed data would likely be low in volume.
    • Downs: Risk of going out of memory (can be mitigated with disk buffering). Adds complexity.

> this is maybe the challenge that separate callback functions create, because the flat model doesn't separate the layers.

I realized one more gotcha:

Imagine requesting 5 WXR exports, rewriting URLs, and saving them all to a local ZIP file. The ZIP writer needs to write data sequentially: write all the chunks of the first file, write all the chunks of the second file after that, and so on. However, sourcing data from HTTP would interleave chunks from different files. Simply piping those chunks to ZIPStreamWriter would produce a broken ZIP file.

We could turn it into a constraint-solving problem. Stream classes would declare whether they:

  • Produce sequential chunks or interleaved chunks
  • Consume sequential chunks or interleaved chunks

On mismatch, the entire pipe would error out without even starting.
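
A minimal sketch of what such a declaration and pre-flight check could look like (the interface and function names are hypothetical):

```php
interface ChunkOrderAware {
	// Does this stream emit chunks grouped per file, or interleaved across files?
	public function produces_interleaved_chunks(): bool;

	// Can this stream accept interleaved chunks, or does it require sequential ones?
	public function accepts_interleaved_chunks(): bool;
}

function assert_compatible_chunk_ordering( array $streams ) {
	$streams = array_values( $streams );
	foreach ( $streams as $i => $stream ) {
		if ( 0 === $i ) {
			continue;
		}
		$upstream = $streams[ $i - 1 ];
		if ( $upstream->produces_interleaved_chunks() && ! $stream->accepts_interleaved_chunks() ) {
			// Fail before any data flows, as described above.
			throw new InvalidArgumentException(
				"Stream #{$i} requires sequential chunks but its upstream interleaves them."
			);
		}
	}
}
```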

@dmsnell commented Jul 19, 2024

> Downs: Risk of going out of memory.
> we may only get the information necessary to reject a file after processing 10 or 1,000 chunks

to me this reads as a statement of the problem, not an impediment to solving it. if we have to wait for 1,000 chunks before knowing whether to process a file, that's a sunk cost.

> not with piping to REST API requests.
> requesting 5 WXR exports, rewriting URLs, and saving them all to a local ZIP file

maybe it's just me but I'm lost in all of this. these examples are complicated, but are they likely? are they practical? where is the scope of what we're doing?

@adamziel (Owner, Author) commented Aug 2, 2024

I explored multiple approaches here:

  • JS-like ReadableStream and WritableStream. I didn't like it. It's verbose and we need additional logic to keep track of what's writable and what's readable.
  • Unix-like "Process" classes (called "Streams") for computations + and "Pipe" classes for moving data.. It neatly separates the concerns, but it requires more code and more computations. In the end, differentiating between Processes/Streams and Pipes seems artificial – both are just simple data consumers/providers. Why? We're not dealing with the same primitives as operating systems do – there are no pids, process trees, etc.
  • A single IByteStream interface that can be chained using a StreamChain class. My idea was to do, say class WP_HTML_Tag_Processor implements IByteStream. This approach is cleaner than the previous two, but has two deadly downsides. First, it requires reimplementing the same boilerplate code in all the streaming processors. That could be fine on its own, but there's also the second one: Processors produce more than bytes. Forcing byte streaming primitives into the interface feels counter productive. The HTML processor may move forward through tags and roughly process bytes, but the HTTP client outputs response events and the ZIP reader outputs ZIP archive entries which include metadata. There's more than one way to consume each of them as a byte stream. Therefore, I ended up with a decoupled design.
  • A decoupled ByteStream class. It separates byte stream transformation semantics separately from the data format parsing semantics. I like this approach a lot and I didn't find any downsides yet.

Decoupled ByteStream class approach

The ByteStream class is the parent for all stream implementations and can be chained using the StreamChain class. See the pipes-controller.php file for implementation details.

Here's how to use it:

```php
$chain = new StreamChain(
    [
        'http' => WP_HTTP_Client::stream([
            new Request('http://127.0.0.1:9864/export.wxr.zip'),
            // Bad request, will fail:
            new Request('http://127.0.0.1:9865')
        ]),
        'zip' => WP_Zip_Reader::stream(),
        Byte_Stream::map(function($bytes, $context) {
            if($context['zip']->get_file_id() === 'export.wxr') {
                $context['zip']->skip_file();
                return null;
            }
            // Uppercase just to showcase we can do it.
            return strtoupper( $bytes );
        }),
        'xml' => WP_XML_Processor::stream( $rewrite_links_in_posts )
    ]
);

$chain->stop_on_errors(true);
foreach($chain as $chunk) {
    switch($chunk->get_chunk_type()) {
        case '#error':
            echo "Error: " . $chunk->get_last_error() . "\n";
            break;
        case '#bytes':
            var_dump([
                $chunk->get_bytes(),
                'zip file_id' => isset($chain['zip']) ? $chain['zip']->get_file_id() : null
            ]);
            break;
    }
}
```

This approach combines a lot of the ideas discussed here and has very useful properties:

  • Multiple stream implementations may exist for the same Processor, e.g. a ZIP file could be read linearly top-to-bottom, or starting with the central directory index stored at the end of the byte stream.
  • Processing errors are acknowledged, but don't break the pipe as long as there's more incoming data.
  • The API consumer doesn't have to nest the stream declarations. All the complexity of thinking about data graphs, DAGs, demultiplexing, etc. is lifted from the API consumer.
  • Other streams in the chain may be referenced and acted upon, e.g. to skip processing of some files.
  • The StreamChain class is a ByteStream itself and can be combined with any other ByteStream.
  • There's never a need to switch between streams, branches, loops, etc. You can, if you want to, but the API allows you to stay in the Streams world if you want that.
  • Every available data chunk is moved as far as possible down the StreamChain before the next one is processed. If the HTTP client gets 500kb of ZIP data, the WP_Zip_Reader could pass 10 unzipped files downstream before the next HTTP chunk is reached for.
  • An array-based constructor means new streams cannot be added to the pipe after it's declared. The API is simpler this way, and we don't have to keep track of which streams are locked to which readers.
  • Stream state is decoupled from the stream class, so the outside consumer cannot interfere with internals like the last read byte chunk.

Code example

The ByteStream class is the parent for all stream implementations, e.g. for ProcessorByteStream. The latter creates a new Processor instance for each streamed file, even if we're streaming from two HTTP sockets and receiving intertwined chunks. Here's what a stream implementation looks like:

```php
$zip_stream = ProcessorByteStream::demuxed(
    function () { return new ZipStreamReader(); },
    function (ZipStreamReader $zip_reader, ByteStreamState $state) {
        while ($zip_reader->next()) {
            switch ($zip_reader->get_state()) {
                case ZipStreamReader::STATE_FILE_ENTRY:
                    $state->file_id = $zip_reader->get_file_path();
                    $state->output_bytes = $zip_reader->get_file_body_chunk();
                    return true;
            }
        }
        return false;
    }
);
```

Note how it's just two callbacks! This stream implementation only pays attention to the ZipStreamReader states relevant to reading files, and outputs the next bytes chunk and a file ID (unique per processor instance). It can be easily co-located with the WP_Zip_Reader class like this:

```php
class WP_Zip_Reader {
	static public function stream() {
        return ProcessorByteStream::demuxed(
			// ...
		);
	}
}
```

Once there's a need for a custom ZIP streaming logic, it can be implemented ad-hoc by creating a new stream instance anywhere in your code. There's no need to change or inherit from the root WP_Zip_Reader {} class.

Next steps

  1. Discuss the big picture and look for any flaws in this design.
    • Are there any show-stoppers? If yes, we're back to the drawing board.
    • Does this design represent worthy trade-offs? If yes, let's move forward with it!
  2. Add test coverage, fix any implementation bugs.
  3. Bring it into the Blueprints PHP library and start using it in Data Liberation, Site transfers, importing WXR files, cloning WordPress Playgrounds, restoring backups, etc.
    • I wouldn't propose it for WordPress core yet. Let's use it for some time, let it mature, discover the shortcomings, and only then make an informed proposal.

cc @brandonpayton @dmsnell

@adamziel adamziel self-assigned this Aug 2, 2024
@adamziel adamziel changed the title Stream rewrite URLs in a remote WXR file API for chaining HTTP, ZIP, HTML, XML, etc. streams (to rewrite URLs in a remote WXR file) Aug 2, 2024
@adamziel adamziel changed the title API for chaining HTTP, ZIP, HTML, XML, etc. streams (to rewrite URLs in a remote WXR file) API for chaining HTTP, ZIP, HTML, XML, etc. streams Aug 2, 2024
@adamziel adamziel changed the title API for chaining HTTP, ZIP, HTML, XML, etc. streams StreamChain: An API for chaining HTTP, ZIP, HTML, XML, etc. streams Aug 2, 2024
@adamziel adamziel changed the title StreamChain: An API for chaining HTTP, ZIP, HTML, XML, etc. streams StreamChain: An API for streams-processing data (e.g. HTTP → ZIP → XML → HTML) Aug 2, 2024
@dmsnell commented Aug 6, 2024

There's much more I'll do to review and think through this, but off the top of my head one question arises: how does it look to be re-entrant here?

Perhaps in the Playground this isn't a big problem, with unlimited execution time, but on any real PHP server we're dealing with max_execution_time and I would imagine any multi-GB import will need to be able to pause and resume.

Without asking you to instantly solve this, do you see a way to persist the in-transit state of the pipeline so that it can be resumed later? Could we put a pause button in here that someone clicks on and then can resume later?

@dmsnell left a comment

@adamziel monumental work here. of the three pipes I like the controller version the best because of how it seems like the processing steps are a little more global in those cases.

but I noticed something in all formulations: the pipeline doesn't seem to be where the complexity lies. it seems like the examples focus on pipelining the download of files, which I think involves files that get queued while processing.

what would this look like if instead of this processing pipeline we had a main loop where each stage was exposed directly, without the pipeline abstraction, but the files could be downloaded still in parallel?

what could that look like? would it be worse? I think I'm puzzled on how to abstract a universal interface for streaming things, apart from calling everything a token, but your example of the WXR rewriter demonstrates how in many cases the individual token is not the right step function. in many cases, we will process many bytes all at once, and one production from an earlier stage might create many tokens for the next stage.

I'm also thinking more about re-entrancy and how to wrap the indices throughout the pipeline. in this system I suppose we could add new methods exposing the current bookmark, the start and end of the current token for a given stage. this might be critical for being able to pause and resume progress.

at this point I think I have some feel for the design, so I'd like to ask you for some leading questions if you have any. I know this is inherently a very complicated task; the code itself also seems very complicated.

```php
$this->set_modifiable_html_text(
    $html,
    substr($text, 0, $at) . json_encode($new_attributes, JSON_HEX_TAG | JSON_HEX_AMP)
);
```

My block comment delimiter finder might help here.

WordPress/wordpress-develop#6760

```php
foreach($attributes as $key => $value) {
    $new_attributes[$key] = $this->process_block_attributes($value);
}
return $new_attributes;
```

array_walk_recursive might be of help here. your code is working fine, but presumably this could perform better, if it does.

I suppose there's no practical concern here about stack overflow, since this is only processing block attributes, but I'm on the lookout for any non-tail-recursive recursion (and I think that no user-space PHP code is, even if it's in tail-recursive form, which this isn't).

Alternatively there's also the approach of adding values to a stack to process, where the initial search runs over the stack, adding new items for each one that it finds that's an array.

This is not important; I just noticed it.
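
A minimal sketch of that stack-based alternative, here just visiting every scalar attribute value without recursion (the function name and visitor callback are illustrative):

```php
function visit_block_attribute_values( array $attributes, callable $visit ) {
	// Seed the stack with the top-level values instead of recursing into them.
	$stack = array_values( $attributes );

	while ( count( $stack ) > 0 ) {
		$value = array_pop( $stack );

		if ( is_array( $value ) ) {
			// Defer nested arrays: push their members to be processed later.
			foreach ( $value as $nested_value ) {
				$stack[] = $nested_value;
			}
			continue;
		}

		$visit( $value );
	}
}
```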

```php
 * @TODO: Investigate how bad this is – would it stand the test of time, or do we need
 *        a proper URL-matching state machine?
 */
const URL_REGEXP = '\b((?:(https?):\/\/|www\.)[-a-zA-Z0-9@:%._\+\~#=]+(?:\.[a-zA-Z0-9]{2,})+[-a-zA-Z0-9@:%_\+.\~#?&//=]*)\b';
```

check out the extended flag x

> If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

this can help make long and confusing regexes clearer, with comments to annotate
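
For example, the URL_REGEXP pattern above could be written with the x modifier roughly like this. This is only a sketch: the delimiters are added here for the sake of the example, since the original constant stores the bare pattern, and the constant name is illustrative.

```php
const URL_REGEXP_EXTENDED = '~
	\b
	(
		(?: (https?) :// | www\. )        # scheme, or a bare www. prefix
		[-a-zA-Z0-9@:%._\+\~\#=]+         # host (and userinfo) characters
		(?: \. [a-zA-Z0-9]{2,} )+         # dot-separated labels, TLD-like ending
		[-a-zA-Z0-9@:%_\+.\~\#?&/=]*      # path, query string, and fragment characters
	)
	\b
~x';
```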

I would imagine that this review is more about the pipeline, but I think for URLs, if we're using a WHAT-WG compliant URL parser, we can probably jump to \b(?:[a-z-]+://|www\.|/) and start checking if those base points can produce a valid parse. it looks like this code isn't using what you've done in other explorations, so this comment may not be valid

```php
}
if (
    $p->get_token_type() === '#cdata-section' &&
    strpos($new_value, '>') !== false
```

if it's #cdata-section then it's a real CDATA section and we should check for ]]>. if it's #comment and WP_HTML_Tag_Processor::COMMENT_AS_CDATA_LOOKALIKE === $p->get_comment_type() then it's a lookalike and > is the closer.
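
A minimal sketch of that distinction as a helper (the function name is illustrative; get_token_type(), get_comment_type(), and COMMENT_AS_CDATA_LOOKALIKE are existing WP_HTML_Tag_Processor APIs):

```php
function value_would_break_token( WP_HTML_Tag_Processor $p, string $new_value ): bool {
	if ( '#cdata-section' === $p->get_token_type() ) {
		// A real CDATA section only terminates at the ]]> sequence.
		return false !== strpos( $new_value, ']]>' );
	}

	if (
		'#comment' === $p->get_token_type() &&
		WP_HTML_Tag_Processor::COMMENT_AS_CDATA_LOOKALIKE === $p->get_comment_type()
	) {
		// A CDATA-lookalike is really a comment, so a bare > closes it early.
		return false !== strpos( $new_value, '>' );
	}

	return false;
}
```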

```php
$this->xml = $new_xml;
$this->stack_of_open_elements = $breadcrumbs;
$this->parser_context = $parser_context;
$this->had_previous_chunks = true;
```

with the HTML API's extend() I've planned on ensuring that we only cut off the front of the document up to the first bookmark.

I've considered two modes: one simply extends (which is what #5050 does), and the other extends and forgets.

The major difference is what comes out of get_updated_html()

Here for XML this may be easier, but for HTML it's not as easy as resetting the stack of open elements. There's a lot more state to track and modify, so right now in trunk it will reset to the start and crawl forward until it reaches the bookmark again if the bookmark is before the cursor.

@adamziel (Owner, Author) commented Aug 10, 2024

@dmsnell My thinking is the processor has no idea whether the input stream is finished or not. It can make an assumption that an unclosed tag means we're paused at incomplete input, but the input stream may be in fact exhausted. The reverse is also problematic – we may have enough input to infer parsing is finished when in fact more input is coming. Perhaps these processors need to be explicitly told "we're still waiting for more data" or "no more input data will come".

> with the HTML API's extend() I've planned on ensuring that we only cut off the front of the document up to the first bookmark.

Are there any system-level bookmarks that are implicitly created? As in, is there a chance we'd never forget any bytes because we've seen the <body> tag and we'll keep track of it indefinitely?

A memory limit also crossed my mind, as in "never buffer more than 1MB of data", although that seems more complex and maybe not worth it.


> Perhaps these processors need to be explicitly told "we're still waiting for more data" or "no more input data will come".

Yes I believe this is going to be the demand. At some point I think we will probably add some method like $processor->get_incomplete_end_of_document() but it's not there because I have no idea what that should be right now, or if it's truly necessary.

Only the caller will be able to know if the document was truncated or if more chunks are inbound. This is also true for cases where we have everything in memory, e.g. we got truncated HTML as input and don't know where it came from - "that's it, that's all!"

> Are there any system-level bookmarks that are implicitly created? As in, is there a chance we'd never forget any bytes because we've seen the `<body>` tag and we'll keep track of it indefinitely?

In the HTML Processor there are for sure, though in the case of the fragment parser, since the context element never exists on the stack of open elements this shouldn't be a problem. We should be able to eject portions of the string that are closed.

```php
 * to the second ZIP file, and so on.
 *
 * This way we can maintain a predictable $context variable that carries upstream
 * metadata and exposes methods like skip_file().
```

good comment

@adamziel (Owner, Author) commented Aug 27, 2024

what would this look like if instead of this processing pipeline we had a main loop where each stage was exposed directly, without the pipeline abstraction, but the files could be downloaded still in parallel?

I explored that in bd19ad7.

I like that it's less code overall.

Here's what I don't like:

  • I find it less readable.
  • There's no $state with "call stack" references to the data sources that got us to a given chunk (e.g. the $http client that yielded a specific chunk, the ZIP processor that uncompressed it, etc.), so we cannot easily abort an earlier stage from a later stage (e.g. stop downloading the entire ZIP file once we find the file we need inside the archive).
  • It forces us into a cascade of nested loops and makes nested error handling difficult. The code pattern resembles the general problem monads solve, and the Byte_Stream abstraction is quasi-monadic with a few intentional limitations, e.g. you can't compose more arrows after it's created.
  • There's no single code structure to handle re-entrancy; we'd have to think about it at every stage of the pipeline.
  • We'll keep repeating the same switch/error handling patterns and multiplexing caches ($xml_processors = [];)
  • Internal streaming details leak out into the userland code. That's annoying for the current implementation of the XML processor, but I think this one could be solved with a better XML streaming interface.

I explored inlining the loop cascade into a single loop with switch-based stage management in daaba8a. It's more readable, but the other pain points still stand.

> what could that look like? would it be worse? I think I'm puzzled on how to abstract a universal interface for streaming things, apart from calling everything a token, but your example of the WXR rewriter demonstrates how in many cases the individual token is not the right step function. in many cases, we will process many bytes all at once, and one production from an earlier stage might create many tokens for the next stage.

You may be pointing at this already with your choice of words – I'm noticing a lot of similarities between this work and the MySQL parser explorations. We're ingesting "tokens" in form of bytes, XML tags etc, identifying the next non-terminal processing rule, and moving them there. If we squint and forget about sourcing data from the network, disk, etc., we're just composing parsers here. At an abstract level, the entire process could be driven by a grammar declaration – I now think the pipeline definition is just that.

> I'm also thinking more about re-entrancy and how to wrap the indices throughout the pipeline. in this system I suppose we could add new methods exposing the current bookmark, the start and end of the current token for a given stage. this might be critical for being able to pause and resume progress.

My initial thinking is we could store the cursor as follows:

  • HTTP transmission state (paused while processing bytes 81920–90112)
  • The nested ZIP processing state (we're processing context.xml and we've paused while processing bytes 2048-4096)
  • The nested XML processing state (the current trace is root > post > wp:content, we've seen 3015 bytes so far, and we're about to consume the next tag), etc.

Upon resuming, each processor would restore the frozen state and skip over to the relevant byte in the stream.
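
For illustration only, the persisted cursor could then be a nested structure along these lines (every key, and the example URL, is hypothetical; the offsets echo the examples above):

```php
$paused_cursor = [
	'http' => [
		'url'        => 'https://example.com/export.wxr.zip', // hypothetical source URL
		'byte_range' => [ 81920, 90112 ],                      // paused inside this transfer window
	],
	'zip'  => [
		'file'       => 'context.xml',
		'byte_range' => [ 2048, 4096 ],                        // offset within the decompressed entry
	],
	'xml'  => [
		'breadcrumbs' => [ 'root', 'post', 'wp:content' ],
		'bytes_seen'  => 3015,                                 // resume right before the next tag
	],
];
```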

On the upside, it seems simple.

On the downside:

  • It wouldn't work if we pause mid "sub chunk", e.g. a GZIP block. We may need to include the "sub block size" in the design and only consider specific byte offsets as resumable checkpoints. I'm not sure how to map that across different data models, e.g. HTTP 1.1 chunked/gzipped transfer offset -> ZIP byte stream -> last gzip block -> XML byte offset.
  • We'll download more than we need upon resuming. We can't easily map the exact byte number where we stopped processing XML and so we'll need to request the entire byte range 81920–90112.

> at this point I think I have some feel for the design, so I'd like to ask you for some leading questions if you have any. I know this is inherently a very complicated task; the code itself also seems very complicated.

I didn't phrase much of this comment as questions, but it's all me asking for your thoughts.

@adamziel (Owner, Author)

Highly relevant PR from @dmsnell: WordPress/wordpress-develop#6883

brandonpayton added a commit to WordPress/wordpress-playground that referenced this pull request Sep 27, 2024
… WebApp Redesign (#1731)

## Description

Implements a large part of the [website
redesign](#1561):

![CleanShot 2024-09-14 at 10 24 57@2x](https://github.com/user-attachments/assets/f245c7ac-cb8c-4e5a-b90a-b4aeff802e7b)


High-level changes shipped in this PR:

* Multiple Playgrounds. Every temporary Playground can be saved either
in the browser storage (OPFS) or in a local directory (Chrome desktop
only for now).
* New Playground settings options: name, language, multisite
* URL as the source of truth for the application state
* State management via Redux

This work is a convergence of 18+ months of effort and discussions. The
new UI relieves users from juggling ephemeral Playgrounds and
losing their work. It opens up space for long-lived site configurations
and additional integrations. We could bring over all the [PR previewers
and demos](https://playground.wordpress.net/demos/) right into the
Playground app.

Here are just a few features unblocked by this PR:

* #1438 – no
more losing your work by accident 🎉
* #797 – with
multiple sites we can progressively build features we'll eventually
propose for WordPress core:
* A Playground export and import feature, pioneering the standard export
format for WordPress sites.
* A "Clone this Playground" feature, pioneering the [Site Transfer
Protocol](https://core.trac.wordpress.org/ticket/60375).
   * A "Sync two Playgrounds" feature, pioneering the Site Sync Protocol
* #1445 – better
git support is in the top 5 most highly requested features. With multiple
Playgrounds, we can save your work and get rid of the "save your work
before connecting GitHub or you'll lose it" and cumbersome "repo setup"
forms on every interaction. Instead, we can make git operations like
Pull, Commit, etc. very easy and even enable auto-syncing with a git
repository.
* #1025 – as we
bring in more PHP plumbing into this repository, we'll replace the
TypeScript parts with PHP parts to create a WordPress core-first
Blueprints engine
* #1056 – Site
transfer protocol will unlock seamlessly passing Playgrounds between
the browser and a local development environment
* #1558 – we'll
integrate [the Blueprints directory] and offer single-click Playground
setups, e.g. an Ecommerce store or a Slide deck editor.
#718.
* #539 – the
recorded Blueprints would be directly editable in Playground and perhaps
saved as a new Playground template
* #696 – the new
interaction model creates space for additional integrations.
* #707 – you
could create a "GitHub–synchronized" Playground
* #760 – we can
bootstrap one inside Playground using a Blueprint and benefit the users
immediately, and then gradually work towards enabling it on
WordPress.org
* #768 – the new
UI has space for a "new in Playground" section, similar to what Chrome
Devtools do
* #629  
* #32
* #104
* #497
* #562
* #580 

### Remaining work

- [ ] Write a release note for https://make.wordpress.org/playground/
- [x] Make sure GitHub integration is working. Looks like OAuth
connection leads to 404.
- [x] Fix temp site "Edit Settings" functionality to actually edit
settings (forking a temp site can come in a follow-up PR)
- [x] Fix style issue with overlapping site name label with narrow site
info views
- [x] Fix style issue with bottom "Open Site" and "WP Admin" buttons
missing for mobile viewports
- [x] Make sure there is a path for existing OPFS sites to continue to
load
- [x] Adjust E2E tests.
- [x] Reflect OPFS write error in UI when saving temp site fails
- [x] Find a path forward for
[try-wordpress](https://github.com/WordPress/try-wordpress) to continue
working after this PR
- [x] Figure out why the browser gets so choppy during OPFS save. It
looks as if there was a lot of synchronous work going on. Shouldn't all
the effort be done by a worker in a non-blocking way?
- [x] Test with Safari and Firefox. Might require a local production
setup as FF won't work with the Playground dev server.
- [x] Fix Safari error: `Unhandled Promise Rejection: UnknownError:
Invalid platform file handle` when saving a temporary Playground to
OPFS.
- [x] Fix to allow deleting site that fails to boot. This is possible
when saving a temp site fails partway through.
- [x] Fix this crash:

```ts
		/**
		 * @todo: Fix OPFS site storage write timeout that happens alongside 2000
		 *        "Cannot read properties of undefined (reading 'apply')" errors here:
		 * I suspect the postMessage call we do to the safari worker causes it to
		 * respond with another message and these unexpected exchange throws off
		 * Comlink. We should make Comlink ignore those.
		 */
		// redirectTo(PlaygroundRoute.site(selectSiteBySlug(state, siteSlug)));
```
- [x] Test different scenarios manually, in particular those involving
Blueprints passed via hash
- [x] Ensure we have all the aria, `name=""` etc. accessibility
attributes we need, see AXE tools for Chrome.
- [x] Update developer documentation on the `storage` query arg (it's
removed in this PR)
- [x] Go through all the `TODOs` added in this PR and decide whether to
solve or punt them
- [x] Handle errors like "site not found in OPFS", "files missing from a
local directory"
- [x] Disable any `Local Filesystem` UI in browsers that don't support
them. Don't just hide them, though. Provide a help text to explain why
are they disabled.
- [x] Reduce the naming confusion, e.g. `updateSite` in redux-store.ts
vs `updateSite` in `site-storage.ts`. What would an unambiguous code
pattern look like?
- [x] Find a reliable and intuitive way of updating these deeply nested
redux state properties. Right now we do an ad-hoc recursive merge that's
slightly different for sites and clients. Which patterns used in other
apps would make it intuitive?
- [x] Have a single entrypoint for each logical action such as "Create a
new site", "Update site", "Select site" etc. that will take care of
updating the redux store, updating OPFS, and updating the URL. My ideal
scenario is calling something like `updateSite(slug, newConfig)` in a
React Component and being done without thinking "ughh I still need to
update OPFS" or "I also have to adjust that .json file over there"
- [x] Fix all the tiny design imperfections, e.g. cut-off labels in the
site settings form.

### Follow up work

- [ ] Mark all the related blocked issues as unblocked on the project
board, e.g.
#1703,
#1731, and more –
[see the All Tasks
view](https://github.com/orgs/WordPress/projects/180/views/2?query=sort%3Aupdated-desc+is%3Aopen&filterQuery=status%3A%22Up+next%22%2C%22In+progress%22%2C%22Needs+review%22%2C%22Reviewed%22%2C%22Done%22%2CBlocked)
- [ ] Update WordPress/Learn#1583 with info
that the redesign is now in and we're good to record a video tutorial.
- [ ] #1746
- [ ] Write a note in [What's new for developers? (October
2024)](WordPress/developer-blog-content#309)
- [ ] Document the new site saving flow in
`packages/docs/site/docs/main/about/build.md` cc @juanmaguitar
- [ ] Update all the screenshots in the documentation cc @juanmaguitar 
- [ ] When the site fails to load via `.list()`, still return that
site's info but make note of the error. Not showing that site on a list
could greatly confuse the user ("Hey, where did my site go?"). Let's be
explicit about problems.
- [ ] Introduce notifications system to provide feedback about outcomes
of various user actions.
- [ ] Add non-minified WordPress versions to the "New site" modal.
- [ ] Fix `console.js:288 TypeError: Cannot read properties of undefined
(reading 'apply') at comlink.ts:314:51 at Array.reduce (<anonymous>) at
callback (comlink.ts:314:29)` – it seems to happen at trunk, too.
- [ ] Attribute log messages to the site that triggered them.
- [ ] Take note of any interactions that we find frustrating or
confusing. We can perhaps adjust them in a follow-up PR, but let's make
sure we notice and document them here.
- [ ] Solidify the functional tooling for transforming between `URL`,
`runtimeConfiguration`, `Blueprint`, and `site settings form state` for
both OPFS sites and in-memory sites. Let's see if we can make it
reusable in Playground CLI.
- [ ] Speed up OPFS interactions, saving a site can take quite a while.
- [ ] A mobile-friendly modal architecture that doesn't stack modals,
allows dismissing, and understands that some modals (e.g. a fatal error
report) might have priority over other modals (e.g. connect to GitHub).
Discuss whether modals should be declared at the top level, like here,
or contextually, next to where the "Show modal" button is rendered.
- [ ] Discuss the need to support strong, masked passwords over a simple
password that's just `"password"`.
- [ ] Duplicate site feature implemented as "Export site + import site"
with the new core-first PHP tools from
adamziel/wxr-normalize#1 and
https://github.com/adamziel/site-transfer-protocol
- [x] Retain temporary sites between site changes. Don't just trash
their iframe and state when the user switches to another site.

Closes #1719

cc @brandonpayton

---------

Co-authored-by: Brandon Payton <[email protected]>
Co-authored-by: Bero <[email protected]>
Co-authored-by: Bart Kalisz <[email protected]>
@dmsnell

dmsnell commented Sep 28, 2024

Doodling - this is probably all a disaster.

```php
$pipeline->add( 'http', $client );
$pipeline->add( 'zip', $zip_decoder );
$pipeline->add( 'xml', $xml_processor );

$xml_processor->auto_feeder = array( $zip_decoder, 'read_chunk' );
$zip_decoder->auto_feeder   = array( $client, 'next_file' );

$client->new_item = fn ( $filename, $chunk ) => $zip_decoder->new_stream( $chunk );
$zip_decoder->new_item = fn ( $filename, $chunk ) => $xml_processor->new_stream( $chunk );

while ( $pipeline->keep_going() ) {
	if ( $zip_decoder->get_file_path() !== 'export.xml' ) {
		$zip_decoder->next_file();
		continue;
	}

	if ( ! $xml_processor->next_token() ) {
		wp_insert_post( $post );
		continue;
	}

	$post  = new WP_Post();
	$token = $xml_processor->get_token_name();
	…
}
```

so maybe this more or less mirrors work you did in the IByteStream or pipes work. it reminds me of something Joe Armstrong wrote about.

```
 system X is:
      start component a
      start component b
      ...
      connect out1 of a to in2 of b
      connect out2 of b to in2 of c
      ..
      send {logging,on} to control2 of c
      ..
     send run to all
```

Can we find a simple expression of pipe events without requiring the creation of new classes and without exposing all of the nitty-gritty internals? Maybe not. Maybe the verbose approach is best and, largely, code using these streams will be highly specialized and complicated, and the verbosity is fine because these complicated flows require paying attention to them. 🤔

@adamziel
Owner Author

adamziel commented Sep 29, 2024

I have some thoughts about reentrancy, unrelated to @dmsnell's last comment:

Pausing a pipe may require saving the current state and the data buffer of every parser in the pipe.

Imagine the following pipe:

Local file > zip reader > xml parser > WXR importer

Now imagine we failed to import post number 10472. Here's what we need to consider:

- The WXR importer may have already created some dependent database records. It must either roll these changes back or support very granular pausing and resuming. My gut says that the former would be much simpler.
- The XML parser has already moved past the opening <wp:post> tag — so we can't just export the current parser state.
- The XML markup for the post may span multiple ZIP chunks — so we can't just export the last parser state.
- The ZIP file includes gzipped data — so we'd better export the byte offset of the last gzip block.
- We can't just remember a single byte offset at which we've finished processing the local file. We don't know it. We're not correlating the byte offset of each XML tag opener with a specific byte in the ZIP file, and I'm not even sure we could, given the gzip compression.

Every parser must maintain its internal state in such a way that we could destroy and recreate all of its internal resources at any time. For example, the ZIP parser's buffer should never start in the middle of a gzip block, because that would prevent it from recreating the deflate handle.

We'll need to set checkpoints after each meaningful downstream task, e.g. when a post is imported. A checkpoint would be a serialized pipe state at that point in time. The downstream WXR parser may import 100 posts from a single zip chunk, and then it may need 100 zip chunks to import 1 post. We need to export all the upstream states and buffers to correctly resume the downstream parser and allow it to pull the next upstream chunk.

We can only set checkpoints after the last task OR at the first chunk of the next task, but not right before the next task. Why? Because we can't know we're about to enter the next WP post without peeking, and peek() isn't supported in the current streaming API.
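For concreteness, here is a rough sketch of what one serialized checkpoint could contain. Every key and value below is made up for illustration; the point is only that each stage exports its own state plus its unconsumed buffer:

```php
// A hypothetical checkpoint for a "Local file > zip > xml > WXR importer" pipe.
// All keys and values are invented; the real shape is still undecided.
$checkpoint = array(
	'file' => array(
		'offset_in_file'            => 1048576, // how much of the local file we consumed
	),
	'zip'  => array(
		'current_entry_index'       => 3,   // which file entry we are decompressing
		'buffered_compressed_bytes' => '…', // buffer aligned so it never starts mid gzip block
	),
	'xml'  => array(
		'breadcrumbs'               => array( 'rss', 'channel', 'item' ),
		'buffered_markup'           => '…', // markup of the partially parsed <item>
	),
	'wxr_importer' => array(
		'last_imported_post_number' => 10471, // resume right after the last committed post
	),
);
file_put_contents( 'checkpoint.json', json_encode( $checkpoint ) );
```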

Later on we may try to optimize the state serialization and:

- Explore truncating all the upstream bytes that were already processed downstream.
- Explore not storing the buffers, but re-populating the pipe with the upstream bytes.

Both should be possible upstream from the ZIP parser but I'm not sure about downstream. It would require synchronizing parser byte offsets, compressed/uncompressed offsets, and gzip block offsets between the piped parsers.

Streaming ZIP files has one more complexity. We may need two cursors — one to parse the central directory index, and one to go through the actual files. This could be a higher-order stream with two inputs, but that smells like complexity and adds a lot of new ideas to the streaming architecture. Maybe a custom pipe class that knows how to request new input streams and has a single output?

Cc @sirreal

@adamziel
Owner Author

adamziel commented Sep 30, 2024

We've got the first prototype of re-entrant streams!

In 3c07f99 I've prototyped the pause() and resume() methods:

```php
$file_stream = new File_Byte_Stream('./test.txt', 100);
// Read bytes 0-99
$file_stream->next_bytes();
// Pause the processing
file_put_contents('paused_state.txt', json_encode($file_stream->pause()));

// Resume the processing in another request
$file_stream = new File_Byte_Stream('./test.txt', 100);
$paused_state = json_decode(file_get_contents('paused_state.txt'));
$file_stream->resume($paused_state);
// Read the bytes 100 - 199
$file_stream->next_bytes();
```

It seems to be working quite well!

### What did not work

At first, I tried the following approach:

```php
$file_stream = new File_Byte_Stream('./test.txt', 100);
$file_stream->next_bytes();
$file_stream_2 = File_Byte_Stream::resume( $file_stream->pause() );
```

It worked well for simple streams, but there's no way to generalize it to callback-based streams like ProcessorByteStream – we can't serialize the callbacks as JSON:

```php
class ZIP_Reader
{
    static public function stream()
    {
        return ProcessorByteStream::demuxed(
            function () { return new ZipStreamReader(); },
            function (ZipStreamReader $zip_reader, ByteStreamState $state) {
                while ($zip_reader->next()) {
                    switch ($zip_reader->get_state()) {
                        case ZipStreamReader::STATE_FILE_ENTRY:
                            $state->file_id = $zip_reader->get_file_path();
                            $state->output_bytes = $zip_reader->get_file_body_chunk();
                            return true;
                    }
                }
                return false;
            }
        );
    }
}
```

Therefore, I stuck with the approach of creating a stable stream (or stream chain) instance from "schema", and then exporting/importing its internal state:

```php
function create_stream_chain($paused_state=null) {
    $chain = new StreamChain(
        [
            'file' => new File_Byte_Stream('./export.wxr', 100),
            'xml' => XML_Processor::stream(function () { }),
        ]
    );
    if($paused_state) {
        $chain->resume($paused_state);
    }
    return $chain;
}
```

We could, in theory, provide an interface such as $stream2 = $stream->pause()->resume() by making runtime artifacts serializable. For that, we'd need two transforms: callback -> JSON and JSON -> callback. One way to do it is to replace arbitrary dynamic callbacks with statically declared classes:

```php
class StrtoupperStream extends TransformStream {
	protected function transform($chunk) {
		return strtoupper( $chunk );
	}
}
StreamApi::register(StrtoupperStream::class);

class RewriteLinksInWXRStream extends ProcessorTransformStream {
	protected function transform(WP_XML_Processor $processor) {
		// ...
	}
}
StreamApi::register(RewriteLinksInWXRStream::class);
```

However, you can see how requiring a class registration for every simple transform would unnecessarily increase the complexity and balloon the number of classes, files, dependencies, inheritance hierarchies, etc. Having spent a few years with Java, I have to say: hard pass.

@adamziel
Owner Author

The API needs more thought and polish here, but we're in a good place to start wrapping up v1 for content imports and exports in the WordPress Playground repo. We'll keep iterating and rebuilding it there to serve the real use cases well.

@adamziel
Owner Author

adamziel commented Sep 30, 2024

### Zip re-entrancy challenge

Pausing ZIP parsing in the middle of a gzip-compressed file might require a custom GZip deflater and so, at least at first, we may not support resuming ZIP parsing.

GZip has a variable block size and PHP doesn't expose the current block size or boundaries, meaning there's no obvious place where we could split the data. We could work around that by exporting the entire deflater's internal state. This would also solve for the sliding window problem. The nth block may refer to any previous block within a 32kb sliding window. However, that previous block might also refer to something in the previous 32kb. We're effectively maintaining a dictionary that's initialized at byte 0 and keeps evolving throughout the entire stream, and for re-entrancy we'd need to export that dictionary.

Some deflaters cut ties to previous 32kb every now and then by performing an occasional "full flush". This would reduce the paused context size.
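For reference, this is roughly the incremental decompression API PHP gives us today; the input file name below is made up, and the key limitation is that the inflate context itself cannot be exported:

```php
// inflate_init() returns an opaque context that PHP cannot serialize, which is why
// pausing in the middle of a compressed ZIP entry needs either a custom deflater or
// a checkpoint aligned to a "full flush" boundary.
$context = inflate_init( ZLIB_ENCODING_RAW );
$fp      = fopen( './compressed-entry-body.bin', 'rb' ); // hypothetical input
while ( ! feof( $fp ) ) {
	$decompressed = inflate_add( $context, fread( $fp, 8192 ) );
	// ...hand $decompressed to the downstream parser...
}
fclose( $fp );
```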

### Local ZIP file re-entrancy

PHP has a set of functions called gzopen and gzseek that could potentially be shoehorned into scanning to a specific offset in a ZIP archive. This would require direct access to $fp, which means we'd need a specialized LocalZipFileStream that sources data from a local path. This would unlock:

- Importing these enterprise-grade 1TB WXR files.
- Importing remote WXR files without streaming – we'd have to download them first unless we have a Ranges-based re-entrant HTTPClient, which might actually be easy.
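For the record, here's roughly what that shoehorning could look like. It's only a sketch and it assumes the compressed payload is reachable as a plain gzip stream, which a real ZIP archive doesn't directly provide; the file name and the $paused_state shape are invented:

```php
$gz = gzopen( './export.wxr.gz', 'rb' );
gzseek( $gz, $paused_state['uncompressed_offset'] ); // jump back to where we stopped
while ( ! gzeof( $gz ) ) {
	$chunk = gzread( $gz, 8192 ); // decompressed bytes from that offset onwards
	// ...feed $chunk into the XML processor...
}
gzclose( $gz );
```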

### WXR + re-entrancy next steps

It seems like the pause()/resume() interface I've explored in this PR would nicely support the basic reentrancy scenarios, such as splitting large imports into multiple batches or recovering from import errors. Let's keep that entire architecture open to changes and even complete rewrites as we learn more while using it to solve specific problems. Meanwhile, for WXR imports, let's proceed as follows:

  1. Import WXR files using a re-entrant Local WXR file > XML > WordPress pipe
  2. Import WXR files using a specialized re-entrant Local ZIP file > XML > WordPress pipe
  3. Import WXR files using generalized re-entrant HTTP > ZIP > XML > WordPress pipe
  4. Import WXR files using generalized re-entrant HTTP > partial ZIP > XML > WordPress pipe that can request, say, two specific files from a 1,000,000 files large archive

@adamziel
Owner Author

adamziel commented Sep 30, 2024

> Doodling - this is probably all a disaster.

```php
$pipeline->add( 'http', $client );
$pipeline->add( 'zip', $zip_decoder );
$pipeline->add( 'xml', $xml_processor );

$xml_processor->auto_feeder = array( $zip_decoder, 'read_chunk' );
$zip_decoder->auto_feeder   = array( $client, 'next_file' );

$client->new_item = fn ( $filename, $chunk ) => $zip_decoder->new_stream( $chunk );
$zip_decoder->new_item = fn ( $filename, $chunk ) => $xml_processor->new_stream( $chunk );

while ( $pipeline->keep_going() ) {
	if ( $zip_decoder->get_file_path() !== 'export.xml' ) {
		$zip_decoder->next_file();
		continue;
	}

	if ( ! $xml_processor->next_token() ) {
		wp_insert_post( $post );
		continue;
	}

	$post  = new WP_Post();
	$token = $xml_processor->get_token_name();
	…
}
```

> so maybe this more or less mirrors work you did in the IByteStream or pipes work.

@dmsnell it's not too different from the current proposal in this PR:

```php
$pipeline = new StreamChain(
    [
        'http' => HTTP_Client::stream([
            new Request('http://127.0.0.1:9864/export.wxr.zip'),
            // new Request('http://127.0.0.1:9864/export.wxr.zip'),
            // Bad request, will fail:
            new Request('http://127.0.0.1:9865')
        ]),
        'zip' => ZIP_Reader::stream(),
        Byte_Stream::map(function($bytes, $context) {
            if($context['zip']->get_file_id() === 'export.wxr') {
                $context['zip']->skip_file();
                return null;
            }
            return $bytes;
        }),
        'xml' => XML_Processor::stream(function () { }),
        Byte_Stream::map(function($bytes) { return strtoupper($bytes); }),
    ]
);

foreach($pipeline as $chunk) {
	$post = new WP_Post();
	// ...
}
```

With a bit of augmentation, we could move $context['zip']->skip_file(); into the foreach() loop, but overall we're in a very similar place.
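For example, assuming StreamChain exposes its stages via array access (as the pull-parser doodles later in this thread do), that augmented loop could look something like this sketch:

```php
foreach ( $pipeline as $chunk ) {
	if ( 'export.wxr' === $pipeline['zip']->get_file_id() ) {
		// Mirrors the map() callback above, just moved into the consuming loop.
		$pipeline['zip']->skip_file();
		continue;
	}
	// ...process $chunk, e.g. build up a WP_Post...
}
```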

> Can we find a simple expression of pipe events without requiring the creation of new classes and without exposing all of the nitty-gritty internals?

Note that your example above involves the same number of classes as this PR: there's a class to represent the Pipeline, one class per decoder, and it seems like there's a class to represent the stream.

@adamziel
Owner Author

In b7102b7 I've prototyped a reentrant ZipStreamReaderLocal. I initially tried implementing it via PHP stream filters, but every time I called stream_filter_remove() the underlying $fp wouldn't return any bytes on subsequent fread() calls, so I fell back to "manual" inflate_init(), inflate_add(), etc.

There are a few rough edges to polish, e.g. the DemultiplexerStream doesn't understand that the streaming has ended. Overall it works pretty well, though, and it seems like we can start with Local ZIP > XML on day 1!

Thinking about the HTTP > ZIP stream...

- HTTPClient > local buffer file > ZipStreamReaderLocal should be sufficient for very simple top-to-bottom scanning scenarios,
- A dedicated HttpZipReader might be needed for anything more complex than that. Parsing a ZIP file might require multiple random access streams, e.g. Central Directory End > Central Directory > Read 10 files at different offsets > go back to Central Directory > Read 4 more files. Something needs to keep track of the current parsing stage, any file index built up, and what can be parallelized and what can't, and that's way beyond what I meant for the generic StreamChain class. A dedicated HttpZipReader could go arbitrarily fancy with its pause() and resume() methods, too (see the sketch below).
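To illustrate the kind of random access involved, here's a minimal sketch; fetch_range() is a hypothetical helper, not part of any API proposed in this PR:

```php
function fetch_range( string $url, int $start, int $length ): string {
	$end     = $start + $length - 1;
	$context = stream_context_create( array(
		'http' => array( 'header' => "Range: bytes={$start}-{$end}" ),
	) );
	// Assumes the server honors Range requests.
	return file_get_contents( $url, false, $context );
}

// 1. Read the End of Central Directory record from the tail of the archive.
// 2. Read the Central Directory to learn each entry's offset and size.
// 3. Issue one ranged request per file entry we actually need, possibly in parallel.
```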

@adamziel
Owner Author

adamziel commented Sep 30, 2024

### The last blocking problem with the API design

Doodling on processing zipped WXR files, I found myself writing this code:

```php
$chain = new StreamChain(
    [
        'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
        'xml' => XML_Processor::stream(function ($processor) {
            $breadcrumbs = $processor->get_breadcrumbs();
			if (
                 '#cdata-section' === $processor->get_token_type() &&
                 end($breadcrumbs) === 'content:encoded'
            ) {
                echo '<content:encoded>'.substr(str_replace("\n", "", $processor->get_modifiable_text()), 0, 100)."...</content:encoded>\n\n";
            }
         }),
    ]
);

foreach($chain as $chunk) {
	echo $chunk->get_bytes();
}
```

This feels weird! The StreamChain only knows how to move bytes around and will not output XML tags by design. This is great for multiple decoding stages, but it's quite inconvenient for working with that final $xml_processor instance meant to extract the import data.

Encoding pull parser semantics into the system would make this feel a lot more natural:

```php
$pipeline = new StreamChain([ 
   'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
   'xml' => WP_XML_Processor::consume(),
]);
while($pipeline->keep_going()) {
	if($pipeline['zip']->get_file_extension() !== '.wxr') {
		$pipeline['zip']->next_file();
		continue;
	}

	$processor = $pipeline['xml']->get_processor();
	// next_tag() automatically pulls more data from the "zip" stage
	// when the current buffer is exhausted
	while($processor->next_tag()) {

	}
}
```

The problem is, the inner while() loop would block the entire processing pipeline until export.wxr.zip is exhausted. This isn't a big deal for processing a single file, but it would be problematic if we requested 3 zip files over HTTP in parallel.

The only solution I can think of for the parallelization case is making the import process re-entrant. Not only that, but we'd need to be ready for a context switch at any point in time – we might run out of data 30 times before processing a single post. The code would look something like this:

```php
$pipeline = new StreamChain([ 
   'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
   'xml' => WP_XML_Processor::consume(),
]);
while($pipeline->keep_going()) {
	if($pipeline['zip']->get_file_extension() !== '.wxr') {
		$pipeline['zip']->next_file();
		continue;
	}

	$processor = $pipeline['xml']->get_processor();

	if(!$pipeline['wxr_import']->state) {
		$pipeline['wxr_import']->state = '#scanning-for-post';
	}

	// next_token() doesn't pull anything automatically. It only works with the 
	// information it has available at a moment.
	while($processor->next_token()) {
		if($pipeline['wxr_import']->state === '#scanning-for-post') {
			if(
				$processor->get_tag() === 'item' &&
				$processor->breadcrumbs_match('item')
			) {
				$pipeline['wxr_import']->state = '#post';	
				$pipeline['wxr_import']->post = array();
			}
		} else if($pipeline['wxr_import']->state === '#post') {
			if ( 
				$processor->breadcrumbs_match('content:encoded') &&
				$processor->get_type() === '#cdata-section'
			) {
				$pipeline['wxr_import']->post['post_content'] = $processor->get_modifiable_text();
			} else if // ...
		}
	}
}
```

Doesn't it look like another stateful streaming processor? This makes me think the pipe could perhaps look as follows:

```php
$pipeline = new StreamChain([ 
   'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
   'wxr' => new WP_WXR_Stream_Importer()
]);

while($pipeline->keep_going()) {
	$paused_state = $pipeline->pause();
	// ...
}

// or:

$importer = new StreamChain([
	HTTP_Client::stream(
		'https://mysite.com/export-1.wxr',
		'https://mysite.com/export-2.wxr',
	),
	new WP_WXR_Stream_Importer()
]);
while($importer->import_next_entity()) {
	$paused_state = $importer->pause();
	// ...
}
```

I'm now having second thoughts about the StreamChain class. Do we actually need one? A two-element StreamChain seems like overkill.

On the upside, it centralizes the stream state management logic, cannot be extended with new streams after being declared, and frees each stream from implementing a method like pipeTo(). Furthermore, it doesn't really contain just two elements: the ZIP stream is also a Demultiplexer that automatically connects each found file to a fresh WXR stream.

On the downside, the developer in me would rather use this API:

```php
$pipeline = Zip_Reader::from_local_file('./export.wxr.zip')->connect_to(new WXR_Importer());
while($pipeline->keep_going()) {
	// ... twiddle our thumbs ...
}

$pipeline_state = $pipeline->pause();

// ... later ...

$pipeline = Zip_Reader::from_local_file('./export.wxr.zip')->connect_to(new WXR_Importer());
$pipeline->resume($pipeline_state);
```

What I don't like about it is that each stream class would have to implement a method such as connect_to. And what would connect_to return? Most likely, a Pipeline/StreamChain instance. Perhaps the differences between the two APIs are superficial, then, and amount to a helper method?
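If the differences really are superficial, the helper could be as thin as this sketch (the trait name is made up and none of this is final API):

```php
trait Connectable_Stream {
	public function connect_to( $next_stream ) {
		// connect_to() is just sugar over building the same StreamChain as before.
		return new StreamChain( array( $this, $next_stream ) );
	}
}
```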

@adamziel
Owner Author

adamziel commented Sep 30, 2024

### A potential pivot away from pipelines?

Uh-oh:

- I'm no longer convinced encoding HTTP > ZIP > XML as a three-element pipe is practical. HTTP and ZIP are tightly coupled and need to be in a two-way feedback loop.
- An HttpClient manages a bunch of streams and stream-like state transitions internally, and again, relying on a pipe wouldn't be that practical.

This wasn't clear when I focused on rewriting the URLs in the WXR file, but became apparent when I started exploring an importer.

This makes me question other use-cases discussed in this PR. Do we actually need to build arbitrary pipes? Perhaps we'll only ever work with two streams, like a data source and a data target, each of them potentially being a composition of two streams in itself? In that scenario, we'd have specialized classes such as ZipFromFile, ZipFromHttp etc. and we wouldn't need any pipes.
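A rough sketch of that direction, using the hypothetical ZipFromFile class mentioned above as the data source and the WXR importer as the data target; none of these method names exist yet:

```php
$source   = ZipFromFile::open( './export.wxr.zip' );
$importer = new WP_WXR_Stream_Importer();

while ( $source->next_file() ) {
	if ( '.wxr' !== $source->get_file_extension() ) {
		continue;
	}
	while ( null !== ( $chunk = $source->read_file_chunk() ) ) {
		$importer->append_bytes( $chunk );
		$importer->import_next_entity();
	}
}
```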

### This work is now unblocked, let's start putting the code explored in this PR to use in Playground

Let's stop hypothesizing and start bringing the basic building blocks (URL parser, XML parser, etc.) into Playground to use them for feature development. This should reveal much better answers about the API design than going through more thinking exercises here.

adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 14, 2024
…ools (#1888)

Let's officially kick off [the Data
Liberation](https://wordpress.org/data-liberation/) efforts under the
Playground umbrella and unlock powerful new use cases for WordPress.

## Rationale

### Why work on Data Liberation?

WordPress core _really_ needs reliable data migration tools. There's
just no reliable, free, open source solution for:

-   Content import and export
-   Site import and export
- Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or
Tumblr -> WordPress
-   Site-to-site synchronization

Yes, there's the WXR content export. However, it won't help you back up a
photography blog full of media files, plugins, API integrations, and
custom tables. There are paid products out there, but nothing in core.

At the same time, so many Playground use-cases are **all about moving
your data**. Exporting your site as a zip archive, migrating between
hosts with the [Data Liberation browser
extension](https://github.com/WordPress/try-wordpress/), creating
interactive tutorials and showcasing beautiful sites using [the
Playground
block](https://wordpress.org/plugins/interactive-code-block/),
previewing Pull Requests, building new themes, and [editing
documentation](#1524)
are just the tip of the iceberg.

### Why do the existing data migration tools fall short?

Moving data around seems easy, but it's a complex problem – consider
migrating links.

Imagine you're moving a site from
[https://my-old-site.com](https://playground-site-1.com) to
[https://my-new-site.com/blog/](https://my-site-2.com). If you just
moved the posts, all the links would still point to the old domain so
you'll need an importer that can adjust all the URLs in your entire
database. However, the typical tools like `preg_replace` or `wp
search_replace` can only replace some URLs correctly. They won't
reliably adjust deeply encoded data, such as a URL buried inside JSON
inside an HTML comment inside a WXR export.
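For illustration, here's a made-up fragment of that shape (the domain, file name, and block markup are invented):

```php
$wxr_fragment = <<<'XML'
<content:encoded><![CDATA[
<!-- wp:image {"url":"https:\/\/my-old-site.com\/wp-content\/uploads\/photo.png"} -->
<figure class="wp-block-image"><img src="https://my-old-site.com/wp-content/uploads/photo.png" /></figure>
<!-- /wp:image -->
]]></content:encoded>
XML;
```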

The only way to perform a reliable replacement here is to carefully
parse each and every data format and replace the relevant parts of the
URL at the bottom of it. That requires four parsers: an XML parser, an
HTML parser, a JSON parser, and a WHATWG URL parser. Most of those tools
don't exist in PHP. PHP provides `json_decode()`, which isn't free of
issues, and that's it. You can't even rely on DOMDocument to parse XML
because of its limited availability and non-streaming nature.
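To make the failure mode concrete, a naive replacement over the made-up fragment above only rewrites the plain `src` attribute and misses the JSON-escaped copy of the same URL:

```php
$naive = str_replace(
	'https://my-old-site.com',
	'https://my-new-site.com/blog',
	$wxr_fragment
);
// The <img src="..."> URL is rewritten, but the "url":"https:\/\/my-old-site.com\/..."
// attribute inside the block comment still points at the old domain.
```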

### Why build this in Playground?

Playground gives us a lot for free:

- **Customer-centric environment.** The need to move data around is so
natural in Playground. So many people asked for reliable WXR imports,
site exports, synchronization with git, and the ability to share their
Playground. Playground allows us to get active users and customer
feedback every step of the way.
- **Free QA**. Anyone can share a testing link and easily report any
problems they found. Playground is the perfect environment to get ample,
fast moving feedback.
- **Space to mature the API**. Playground doesn’t provide the same
backward compatibility guarantees as WordPress core. It's easy to
prototype a parser, find a use case where the design breaks down, and
start over.
- **Control over the runtime.** Playground can lean on PHP extensions to
validate our ideas, test them on simulated slow hardware, and ship
them to a tablet to see how they do when the app goes into the background
and the internet is flaky.

Playground enables methodically building spec-compliant software to
create the solid foundation WordPress needs.

## The way there

### What needs to be built?

There's been a lot of [gathering information, ideas, and
tools](https://core.trac.wordpress.org/ticket/60375). This writeup is
based on 10 years' worth of site transfer problems, WordPress
synchronization plugins, chats with developers, analyzing existing
codebases, past attempts at data importing, non-WordPress tools,
discussions, and more.

WordPress needs parsers. Not just any parsers: they must be streaming,
re-entrant, fast, standards-compliant, and tested using a large body of
possible inputs. The data synchronization tools must account for data
conflicts, WordPress plugins, invalid inputs, and unexpected power
outages. The errors must be non-fatal, retryable, and allow manual
resolution by the user. No data loss, ever. The transfer target site
should be usable as early as possible and show no broken links or images
during the transfer. That's the gist of it.

A number of parsers have already been prototyped. There's even [a draft
of a reliable URL rewriting
library](https://github.com/adamziel/site-transfer-protocol). Here's a
bunch of early drafts of specific streaming use-cases:

- [A URL
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php)
- [A block markup
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php)
- [An XML
parser](WordPress/wordpress-develop#6713), also
explored by @dmsnell and @jonsurrell
- [A Zip archive
parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php)
- [A multihandle HTTP
client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php)
without curl dependency
- [A MySQL query
parser](WordPress/sqlite-database-integration#157)
started by @zieladam and now explored by @JanJakes
- [A stream chaining
API](adamziel/wxr-normalize#1) to connect all
these pieces

On top of that, WordPress core now has an HTML parser, and @dmsnell has
been exploring a
[UTF-8](WordPress/wordpress-develop#6883)
decoder that would enable fast and regex-less URL detection in long
data streams.

There are still technical challenges to figure out, such as how to pause
and resume the data streaming. As this work progresses, you'll start
seeing incremental improvements in Playground. One possible roadmap is
shipping a reliable content importer, then a reliable site zip importer
and exporter, then cloning a site, and then extending towards
full-featured site transfers and synchronization.

### How soon can it be shipped?

Three points:

* No dates.
* Let's keep building on top of prior work and ship meaningful user
flows often.
* Let's not ship any stable public APIs until the design is mature.

For example, the [Try WordPress
extension](https://github.com/WordPress/try-wordpress/) can already give
you a Playground site, even if you cannot migrate it to another
WordPress site just yet.

**Shipping matters. At the same time, taking the time required to build
rigorous, reliable software is also important**. An occasional early
version of this or that parser may be shipped once its architecture
seems alright, but the architecture and the stable API won't be rushed.
That would jeopardize the entire project. This project aims for a solid
design that will serve WordPress for years.

The progress will be communicated in the open, while maintaining
feedback loops and using the work to ship new Playground features.

## Plans, goals, details

### Next steps

Let's start with building a tool to export and import _a single
WordPress post_. Yes! Just one post. The tricky part is that all the
URLs will have to be preserved.

From there, let's explore the breadth and depth of the problem, e.g.:

* Rewriting links
* Frontloading media files
* Preserving dependent data (post meta, custom tables, etc.)
* Exporting/importing a WXR file using the above
* Pausing and resuming a WXR export/import
* Exporting/importing a full WordPress site as a zip file

Ideally, each milestone will result in a small, readily reusable tool.
For example "paste WordPress post, paste a new site URL, get your post
migrated".

There's an ample body of existing work. Let's keep the existing
codebases (e.g. WXR, site migration plugins) and discussions open in a
browser window during this work. Let's involve the authors of these
tools, ask them questions, ask them for reviews. Let's publish the
progress and the challenges encountered on the way.

### Design goals

- **Fault tolerance** – all the data tools should be able to start,
stop, resume, tolerate errors, accept alternative data from the user,
e.g. media files, posts etc.
- **WordPress-first** – let's build everything in PHP using WordPress
naming conventions.
- **Compatibility** – Every WordPress version, PHP version (7.2+, CLI),
and Playground runtime (web, CLI, browser extension, desktop app, CI
etc.) should be supported.
- **Dependency-free** – No PHP extensions required. If this means we
can't rely on cURL, then let's build an HTTP client from scratch. Only
minimal Composer dependencies allowed, and only when absolutely
necessary.
- **Simplicity** – no advanced OOP patterns. Our role model is
[WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/)
– a **single class** that can parse nearly all HTML. There's no "Node",
"Element", "Attribute" classes etc. Let's aim for the same here.
- **Extensibility** – Playground should be able to benefit from, say,
WASM markdown parser even if core WordPress cannot.
- **Reusability** – Each library should be framework-agnostic and usable
outside of WordPress. We should be able to use them in WordPress core,
WP-CLI, Blueprint steps, Drupal, Symfony bundles, non-WordPress tools
like https://github.com/adamziel/playground-content-converters, and even
in Next.js via PHP.wasm.


### Prior art

Here are a few codebases that need to be reviewed at minimum, and brought
into this project at maximum:

- URL rewriter: https://github.com/adamziel/site-transfer-protocol
- URL detector :
WordPress/wordpress-develop#7450
- WXR rewriter: https://github.com/adamziel/wxr-normalize/
- Stream Chain: adamziel/wxr-normalize#1
- WordPress/wordpress-develop#5466
- WordPress/wordpress-develop#6666
- XML parser: WordPress/wordpress-develop#6713
- Streaming PHP parsers:
https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
- Zip64 support (in JS ZIP parser):
#1799
- Local Zip file reader in PHP (seeks to central directory, seeks back
as needed):
https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
- WordPress/wordpress-develop#6883
- Blocky formats – Markdown <-> Block markup WordPress plugin:
https://github.com/dmsnell/blocky-formats
- Sandbox Site plugin that exports and imports WordPress to/from a zip
file:
https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
- WordPress + Playground CLI setup to import, convert, and exporting
data: https://github.com/adamziel/playground-content-converters
- Markdown -> Playground workflow _and WordPress plugins_:
https://github.com/adamziel/playground-docs-workflow
- _Edit Visually_ browser extension for bringing data in and out of
Playground: WordPress/playground-tools#298
- _Try WordPress_ browser extension that imports existing WordPress and
non-WordPress sites to Playground:
https://github.com/WordPress/try-wordpress/
- Humanmade WXR importer designed by @rmccue:
https://github.com/humanmade/WordPress-Importer

### Related resources

- [Site transfer protocol](https://core.trac.wordpress.org/ticket/60375)
- [Existing data migration
plugins](https://core.trac.wordpress.org/ticket/60375#comment:32)
- WordPress/data-liberation#74
- #1524
- WordPress/gutenberg#65012

### The project structure

The structure of the `data-liberation` package is an open exploration
and will change multiple times. Here's what it aims to achieve.

**Structural goals:**

- Publish each library as a separate Composer package
- Publish each WordPress plugin separately (perhaps a single plugin
would be the most useful?)
- No duplication of libraries between WordPress plugins
- Easy installation in Playground via Blueprints, e.g. no `composer
install` required
- Compatibility with different Playground runtimes (web, CLI) and
versions of WordPress and PHP

**Logical parts**

- First-party libraries, e.g. streaming parsers
- WordPress plugins where those libraries are used, e.g. content
importers
- Third party libraries installed via Composer, e.g. a URL parser

**Ideas:**

- Use Composer dependency graph to automatically resolve dependencies
between libraries and WordPress plugins
- or use WordPress "required plugins" feature to manage dependencies
- or use Blueprints to manage dependencies


cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame
@ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera
@swissspidy @eliot-akira @sirreal @obenland @rralian @ockham
@youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski
@palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap
@michalczaplinski @danluu