StreamChain: An API for stream-processing data (e.g. HTTP → ZIP → XML → HTML) #1

Open
wants to merge 72 commits into
base: trunk

adamziel
Owner

@adamziel adamziel commented Jul 15, 2024

This PR explores a generic Stream interface that allows piping data through different format processors, e.g. HTTP request → ZIP decoder → XML reader → HTML Processor → WordPress Database.


Jump to the last status update and feedback request

It brings together all the stream processing explorations in WordPress to enable stream-rewriting site URLs in a WXR file coming from a remote server. All of that with no curl, DOMDocument, or other PHP dependencies. It's just a few small libraries built with WordPress core in mind:

The rewriter is easy to extend. It could, e.g., stream-rewrite data from a zipped XML file, re-zip it on the fly, and return it as an HTTP response.

FYI @dmsnell @akirk @brandonpayton @bgrgicak @jordesign @mtias @griffbrad – this is exploratory for now, but will likely become relevant for production use sooner rather than later.

Related to:

Historically, this PR started as an exploration of rewriting URLs in a remote WXR file.

adamziel added 4 commits July 15, 2024 20:24
Brings together a few explorations to stream-rewrite site URLs in a WXR file coming
from a remote server. All of that with no curl, DOMDocument, or other
PHP dependencies. It's just a few small libraries built with WordPress
core in mind:

* [AsyncHttp\Client](WordPress/blueprints#52)
* [WP_XML_Processor](WordPress/wordpress-develop#6713)
* [WP_Block_Markup_Url_Processor](https://github.com/adamziel/site-transfer-protocol)
* [WP_HTML_Tag_Processor](https://developer.wordpress.org/reference/classes/wp_html_tag_processor/)

Here's what the rewriter looks like:

```php
$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";
$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $string_new_site_url           = 'https://mynew.site/';
        $parsed_new_site_url           = WP_URL::parse( $string_new_site_url );

        $current_site_url              = 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/';
        $parsed_current_site_url       = WP_URL::parse( $current_site_url );

        $base_url = 'https://playground.internal';
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();
```
@adamziel
Owner Author

adamziel commented Jul 16, 2024

Show me the code

Here's what the rewriter looks like:

$wxr_url = "https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr";

$xml_processor = new WP_XML_Processor('', [], WP_XML_Processor::IN_PROLOG_CONTEXT);
foreach( stream_remote_file( $wxr_url ) as $chunk ) {
    $xml_processor->stream_append_xml($chunk);
    foreach ( xml_next_content_node_for_rewriting( $xml_processor ) as $text ) {
        $url_processor = new WP_Block_Markup_Url_Processor( $text, $base_url );

        foreach ( html_next_url( $url_processor, $current_site_url ) as $parsed_matched_url ) {
            $updated_raw_url = rewrite_url(
                $url_processor->get_raw_url(),
                $parsed_matched_url,
                $parsed_current_site_url,
                $parsed_new_site_url
            );
            $url_processor->set_raw_url( $updated_raw_url );
        }

        $updated_text = $url_processor->get_updated_html();
        if ($updated_text !== $text) {
            $xml_processor->set_modifiable_text($updated_text);
        }
    }
    echo $xml_processor->get_processed_xml();
}
echo $xml_processor->get_unprocessed_xml();

Architecture

The rewriter explored here pipes and stream-processes data as follows:

AsyncHttp\Client -> WP_XML_Processor -> WP_Block_Markup_Url_Processor -> WP_Migration_URL_In_Text_Processor -> WP_URL

The layers of data at play are:

  • AsyncHttp\Client: HTTPS encrypted data -> Chunked encoding -> Gzip compression
  • WP_XML_Processor: XML (entities, attributes, text, comments, CDATA nodes)
  • WP_Block_Markup_Url_Processor: HTML (entities, attributes, text, comments, block comments), JSON (in block comments)
  • WP_Migration_URL_In_Text_Processor: URLs in text nodes
  • WP_URL: URL parsing and serialization

Remaining work

This PR explores a Streaming / Pipes API to make the streams easy to compose and visualize. While the implementation may change, the goal is to pipe chunks of data as far as possible from upstream to downstream while supporting both blocking and non-blocking streams.

  • Build new ZipReaderStream() and new ZipWriterStream() – what would be the API to manage multiple files?
  • Explore new BlockMarkupToMarkdownStream() and new MarkdownToBlockMarkupStream()
  • Explore a new SQLDumpProcessorStream( $value_visitor ) to rewrite URLs in database dump files before importing them

Open Questions

Passing bytes around is great for a consistent interface and byte-oriented operations.

However, an HTTP request yields response headers before the body. Reading from a ZIP file produces a series of metadata and data streams – one for every decoded file. How can we use pipes with these more complex data structures? Should we even try? If yes, what would be the API? Would there be multiplexing? Or returning other data types? Or would it be a different interface?

@adamziel
Owner Author

adamziel commented Jul 16, 2024

I've been exploring a Pipe-based API for easily composing all those data transformations. Here's what I came up with:

Pipe::run( [
	new RequestStream( new Request( 'https://raw.githubusercontent.com/WordPress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/woo-products.wxr' ) ),
	new XMLProcessorStream(function (WP_XML_Processor $processor) {
		if(is_wxr_content_node($processor)) {
			$text         = $processor->get_modifiable_text();
			$updated_text = Pipe::run([
				new BlockMarkupURLRewriteStream( 
					$text,
					[
						'from_url' => 'https://raw.githubusercontent.com/wordpress/blueprints/normalize-wxr-assets/blueprints/stylish-press-clone/wxr-assets/',
						'to_url'   => 'https://mynew.site/',
					]
				),
			]);
			if ( $updated_text !== $text ) {
				$processor->set_modifiable_text( $updated_text );
			}
		}
	}),
	new EchoStream(),
] );

It's based on the following two interfaces (that are likely to keep changing for now):

interface ReadableStream {
	public function read(): bool;
	public function is_finished(): bool;
	public function consume_output(): ?string;
	public function get_error(): ?string;
}

interface WritableStream {
	public function write( string $data ): bool;
	public function get_error(): ?string;
}
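
To make the two interfaces above more concrete, here is a minimal sketch of a stream that implements both of them, plus a naive loop that drives it. `UppercaseStream`, its `close()` method, and `run_through()` are hypothetical names made up for this illustration; the interface definitions are repeated only so the snippet runs on its own, and the actual classes in this branch may look different.

```php
<?php
interface ReadableStream {
	public function read(): bool;
	public function is_finished(): bool;
	public function consume_output(): ?string;
	public function get_error(): ?string;
}

interface WritableStream {
	public function write( string $data ): bool;
	public function get_error(): ?string;
}

// Hypothetical transform: buffers written bytes and emits them uppercased.
class UppercaseStream implements ReadableStream, WritableStream {
	private string $input   = '';
	private ?string $output = null;
	private bool $closed    = false;

	public function write( string $data ): bool {
		$this->input .= $data;
		return true;
	}

	// Assumption: an explicit close() is how upstream signals "no more data".
	public function close(): void {
		$this->closed = true;
	}

	// Returns true only when a new output chunk was produced.
	public function read(): bool {
		if ( '' === $this->input ) {
			return false;
		}
		$this->output = strtoupper( $this->input );
		$this->input  = '';
		return true;
	}

	public function is_finished(): bool {
		return $this->closed && '' === $this->input && null === $this->output;
	}

	public function consume_output(): ?string {
		$output       = $this->output;
		$this->output = null;
		return $output;
	}

	public function get_error(): ?string {
		return null;
	}
}

// Hypothetical driver: write each chunk, then drain whatever became readable.
function run_through( UppercaseStream $stream, iterable $chunks ): string {
	$result = '';
	foreach ( $chunks as $chunk ) {
		$stream->write( $chunk );
		while ( $stream->read() ) {
			$result .= $stream->consume_output();
		}
	}
	$stream->close();
	return $result;
}

$stream = new UppercaseStream();
$result = run_through( $stream, array( 'hello ', 'streams' ) );
```

The driver processes every chunk as soon as it arrives, which is the behavior the pipes in this PR aim for; a real `Pipe` would chain several such stages instead of a single one.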

Here are a few more streams I would like to have:

  • new BlockMarkupToMarkdownStream() and new MarkdownToBlockMarkupStream()
  • new SQLDumpProcessorStream( $value_visitor ) to rewrite URLs in database dump files before importing them
  • new ZipReaderStream() and new ZipWriterStream() – what would be the API to manage multiple files?
  • new GitSparseCheckoutStream()

That way we'll be able to put together pipes like this:

Pipe::run( [
	new RequestStream( new Request( 'https://site.com/export.wxr.zip' ) ),
	new ZipReaderStream( '/export.wxr' ),
	new XMLProcessorStream(function (WP_XML_Processor $processor) use ($assets_downloader) {
		if(is_wxr_content_node($processor)) {
			$text         = $processor->get_modifiable_text();

			// Download the missing assets files
			$assets_downloader->process( $text );
			if(!$assets_downloader->everything_already_downloaded()) {
			    // Don't import content that has pending downloads
			    return;
			}

			// Update the URLs in the text
			$updated_text = Pipe::run([
				new BlockMarkupURLRewriteStream( 
					$text,
					[ 'from_url' => $from_site, 'to_url'  => $to_site ]
				),
			]);
			if ( $updated_text !== $text ) {
				$processor->set_modifiable_text( $updated_text );
			}
		}
	})
] );

or this:

Pipe::run( [
	new GitSparseCheckoutStream( 'https://github.com/WordPress/gutenberg.git', [
		'docs/**/*.md'
	] ),
	new MarkdownToBlockMarkupStream(),
	new BlockMarkupURLRewriteStream( 
		$text,
		[ 'from_url' => $from_site, 'to_url'  => $to_site ]
	),
	new CreatePageStream()
] );

@adamziel
Owner Author

adamziel commented Jul 16, 2024

I’ve played with ideas like flatMap() and filter() to express more complex data flows that mix objects and byte streams, concurrent and serial streams, and splitting and combining the dataflow:

graph TD
    A[HttpClient] -->|runs 10 concurrent requests| B[Pipeline]
    
    B -->|filter ZIP files| C[ZipPipeline]
    B -->|filter XML files| D[XmlPipeline]

    C -->|decode ZIP files| E[ZipDecoder]
    E -->|output XML entries| F[ZipXmlFilter]
    F -->|filter XML files| G[XmlProcessor]

    D -->|passthrough| G

    G -->|find WXR content nodes| H[XmlProcessor]
    H -->|parse as HTML| I[BlockMarkupURLProcessor]
    I -->|rewrite URLs| J[HTML string]
    J -->|write to local files| K[LocalFileWriter]

    classDef blue fill:#bbf,stroke:#f66,stroke-width:2px;
    class B,C,D,E,F,G,H,I,J,K blue;

Sadly, the best result I got was a complex DSL you couldn't use without spending time with the documentation:

<?php
// Create the main pipeline
$pipeline = HttpClient::pipeline([
    "http://example.com/file1.zip",
    "http://example.com/file2.zip",
    "http://example.com/file3.zip",
    "http://example.com/file4.zip",
    "http://example.com/file5.zip",
    "http://example.com/file6.xml",
    "http://example.com/file7.xml",
    "http://example.com/file8.xml",
    "http://example.com/file9.xml",
    "http://example.com/file10.xml"
]);

[$zipPipeline, $xmlPipeline] = $pipeline->split(HttpClient::filterContentType('application/zip'));

$zipPipeline
    ->flatMap(ZipDecoder::create())
    ->filter(Pipeline::filterFileName('.xml$'))
    ->combineWith($xmlPipeline)
    ->map(new WXRRewriter())
    ->map(Pipeline::defaultFilename('output.xml'))
    ->map(new LocalFileWriter('./'));

The alternative is the following imperative code:

$zips = [
    "http://example.com/file1.zip",
    "http://example.com/file2.zip",
    "http://example.com/file3.zip",
    "http://example.com/file4.zip",
    "http://example.com/file5.zip",
];
$zip_decoders = [];
$xmls = [
    "http://example.com/file6.xml",
    "http://example.com/file7.xml",
    "http://example.com/file8.xml",
    "http://example.com/file9.xml",
    "http://example.com/file10.xml"
];
$local_paths = [];
$xml_rewriters = [];
$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );

while ( $client->await_next_event() ) {
    $request = $client->get_request();
    $original_url = $request->original_request()->url;

    switch ( $client->get_event() ) {
        case Client::EVENT_HEADERS_RECEIVED:
            if ( in_array( $original_url, $zips ) ) {
                $zip_decoders[$original_url] = new ZipStreamReader();
            } else {
                $xml_rewriters[$original_url] = new XmlRewriter();
            }

            break;
        case Client::EVENT_BODY_CHUNK_AVAILABLE:
            if ( in_array( $original_url, $zips ) ) {
                $zip_decoders[$original_url]->write( $request->get_response_body_chunk() );
            } else {
                $xml_rewriters[$original_url]->write( $request->get_response_body_chunk() );
            }
            break;
        case Client::EVENT_FAILED:
        case Client::EVENT_FINISHED:
            unset( $zip_decoders[$original_url] );
            continue 2;
    }

    foreach( $zip_decoders as $url => $zip ) {
        if ( $zip->is_file_finished() ) {
            $zip->next_file();
        }
        while ( $zip->read() ) {
            if( $zip->get_last_error() ) {
                // TODO: Handle error
                continue 2;
            }

            $file = $zip->get_file_name();
            if(!isset($xml_rewriters[$file])) {
                $xml_rewriters[$file] = new XmlRewriter();
            }
            $xml_rewriters[$file]->write( $zip->get_content_chunk() );
        }
    }

    foreach ( $xml_rewriters as $url => $rewriter ) {
        while ( $rewriter->read() ) {
            file_put_contents(
                $local_paths[$url],
                $rewriter->get_response_body_chunk(),
                FILE_APPEND
            );
        }
    }
}

It is longer, sure, but there are far fewer ideas in it, you have more control, and it can also be encapsulated similarly to AsyncHttp\Client:

public function next_chunk() {
    $this->await_response_bytes();
    $this->process_zip_chunks();
    $this->process_xml_chunks();
    $this->write_output_bytes();
}

It's not declarative but it's simple.

@akirk

akirk commented Jul 16, 2024

One option might be something like a Brancher extends TransformStream class that itself accepts single TransformStreams and/or a Pipe of multiple streams. The Brancher would select among them either based on the content (maybe through a callback) or by picking the first stream that doesn't throw an exception.

I was wondering if something modeled after JavaScript Promises might be more flexible in providing branching abilities.

@adamziel
Owner Author

adamziel commented Jul 16, 2024

> One option might be something like a Brancher extends TransformStream class that itself accepts single TransformStreams and/or a Pipe of multiple streams. The Brancher would select among them either based on the content (maybe through a callback) or by picking the first stream that doesn't throw an exception.

Noodling on that idea, we'd need a new type category for multiple data flows:

  • MultiTransformer – (stream_id, in_chunk) => out_chunk – transforms many streams of the same type of data, e.g. rewrites many XML files at once.
  • Demultiplexer – single input, multiple outputs, e.g. an HTTP client could pipe a single byte stream into multiple HTTP sockets, each having its own response stream.
  • Multiplexer – multiple inputs, single output, e.g. a ZipEncoder could turn multiple File[] streams into a single byte stream.

Here's one way they could combine:

$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );

MultiPipeline::run([
    // This produces multiple Request[] streams
    $client->demultiplex(),

    MultiPipeline::branch(
        fn ( $request ) => is_zip($request),

        // ZipStreamDemultiplexer is a bytes -> File[] array transformer. It's not
        // a demultiplexer because each file is always fully produced before the next
        // one starts, so there is no concurrent processing here. We could, perhaps, implement
        // it as a demultiplexer anyway to reduce the number of ideas in the codebase.
        [ fn () => new ZipStreamReader( '*.xml' ) ]
    ),

    // XmlRewriter is a regular bytes -> bytes stream. In here,
    // we support multiple concurrent XML streams.
    // We can skip the new MultiTransformer() call and have MultiPipeline backfill it for us.
    fn () => new XmlRewriter(),

    // And now we're gathering all the File objects into a single File stream.
    new Multiplexer(),

    fn () => new ZipStreamEncoder(),

    // Let's write to a local file.
    // At this point we only have a single stream id, but we're still
    // in a multi-stream world so we have to wrap with a MultiTransformer.
    fn () => new LocalFileWriter( 'out.zip' ),
]);

This looks much better than the bloat I outlined in my previous comment. Perhaps it can be simplified even further.

Although, I guess it's not that different from:

$client = new Client();
$client->enqueue( [ ...$zips, ...$xmls ] );

$client
    ->demultiplex()
    ->branch(
        fn ( $request ) => is_zip($request),
        fn ( $branch ) => $branch->pipeTo( fn () => new ZipStreamReader( '*.xml' ) )
    )
    ->pipeTo( fn () => new XmlRewriter() )
    ->multiplex()
    ->pipeTo( new ZipStreamEncoder() )
    ->pipeTo( new LocalFileWriter( 'out.zip' ) );

One thing I'm not sure about is passing bytes vs File($metadata, $body_stream) objects. We don't need that as much in a byte processing world, but it's super useful in the demultiplexing world. We can either make the Byte streams pass around File/DataUnit objects, or we can convert between them and streams in the multi-stream world.

> I was wondering if something modeled after JavaScript Promises might be more flexible in providing branching abilities.

I don't have anything against callbacks, but I'd rather keep the data flow here as linear as possible and err on the side of simplicity over allowing multiple forks, splitting the data in success streams and error streams etc.

@adamziel
Owner Author

adamziel commented Jul 16, 2024

I just realized piping objects is the same as piping bytes + metadata.

Therefore, we can pipe HTTP responses, ZIP files etc. with almost no additional complexity. We would pipe bytes as we do now, and then we'd also support moving an optional $metadata object along the pipe together with the bytes.

To support multiplexing, I introduced a StreamMetadata interface that requires a get_resource_id() method. That's how we can distinguish between chunks associated with different requests, files, etc.

A Demultiplexer is just a regular TransformStream that:

  • On write, it creates a new sub-pipe whenever it sees a new $resource_id. It then routes the incoming data chunks to the relevant sub-pipe.
  • On read, it goes through the pipes round-robin and outputs the next available set of bytes + metadata.
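
The write/read behavior described in the two bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: `DemultiplexerSketch` and `ReverseSubPipe` are hypothetical classes, the resource id is passed as a plain string instead of a `StreamMetadata` object, and the sub-pipe simply reverses each chunk so that the routing is visible in the output.

```php
<?php
// Hypothetical per-resource sub-pipe: reverses every chunk it receives.
class ReverseSubPipe {
	private array $pending = array();

	public function write( string $bytes ): void {
		$this->pending[] = strrev( $bytes );
	}

	// Returns the next output chunk, or null when nothing is ready.
	public function read(): ?string {
		return array_shift( $this->pending );
	}
}

class DemultiplexerSketch {
	private $factory;
	private array $sub_pipes = array();
	private int $cursor      = 0;

	public function __construct( callable $factory ) {
		$this->factory = $factory;
	}

	// On write: create a sub-pipe for every new resource id, then route the chunk.
	public function write( string $bytes, string $resource_id ): void {
		if ( ! isset( $this->sub_pipes[ $resource_id ] ) ) {
			$this->sub_pipes[ $resource_id ] = ( $this->factory )();
		}
		$this->sub_pipes[ $resource_id ]->write( $bytes );
	}

	// On read: round-robin, resuming after the sub-pipe that produced output last.
	public function read(): ?array {
		$ids   = array_keys( $this->sub_pipes );
		$count = count( $ids );
		for ( $i = 0; $i < $count; $i++ ) {
			$index = ( $this->cursor + $i ) % $count;
			$bytes = $this->sub_pipes[ $ids[ $index ] ]->read();
			if ( null !== $bytes ) {
				$this->cursor = ( $index + 1 ) % $count;
				return array( $ids[ $index ], $bytes );
			}
		}
		return null;
	}
}

$demux = new DemultiplexerSketch( fn () => new ReverseSubPipe() );
$demux->write( 'abc', 'a.xml' );
$demux->write( 'def', 'a.xml' );
$demux->write( 'xyz', 'b.xml' );

$output = array();
while ( null !== ( $item = $demux->read() ) ) {
	$output[] = $item;
}
```

Every returned item carries its resource id alongside the bytes, so downstream stages still see one linear stream of bytes plus metadata.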

A Multiplexer isn't even needed as every pipe is a linear stream of bytes + metadata and, while demultiplexers augment that temporarily, they clean up after themselves.

Here's a snippet of code that actually works with the latest version of this branch:

Pipe::run( [
	new RequestStream( [
		new Request( 'https://raw.githubusercontent.com/WordPress/blueprints-library/trunk/php.ini' ),
		new Request( 'https://raw.githubusercontent.com/WordPress/blueprints-library/trunk/phpcs.xml' ),
		new Request( 'https://raw.githubusercontent.com/WordPress/blueprints/trunk/blueprints/stylish-press/site-content.wxr' ),
	] ),

    // Filter response chunks as a flat list
	new FilterStream( fn ($metadata) => (
		str_ends_with( $metadata->get_filename(), '.xml' ) ||
		str_ends_with( $metadata->get_filename(), '.wxr' )
	) ),

    // This demultiplexer pipes each response through a separate
    // XMLProcessor so that each parser only deals with a single
    // XML document.
	new DemultiplexerStream(fn () => $wxr_rewriter()),

    // We're back to a flat list, let's strtoupper() each data chunk
	new UppercaseTransformer(),

    // A Pipe is also a TransformStream and allows us to compose multiple streams for demultiplexing
	new DemultiplexerStream(fn () => Pipe::from([
		new EchoTransformer(),
		new LocalFileStream(fn ($metadata) => __DIR__ . '/output/' . $metadata->get_resource_id() . '.chunk'),
	])),
] );

With this design, we could easily add a fluent API if needed and also add support for ZIP files and other data types.

Some open questions are:

  • How should stream errors be handled with multiplexing? How to allow one request to fail without stopping everything? How to catch that? Do we need a new CatchStream() after all?
  • What names would be useful here? There are streams, pipes, transformers – let's choose a cohesive set of terms.
  • Do we need to distinguish between Writable and Readable streams? Or would it be more useful for "non-writable" streams to ignore any data they receive, and for non-readable streams to pass through any data they receive?

@dmsnell

dmsnell commented Jul 17, 2024

I have found the loop-orientation of the HTML API useful and more concrete than abstract types and interfaces. To that end, I also like the way bookmarks get a user-defined name.

In these pipelines it seems like they could be added with a name, and a context object could provide stage-specific metadata and control through the entire stack.

For example, I could write something like this.

Pipe::run( [
	'http' => new RequestStream( new Request( 'https://site.com/export.wxr.zip' ) ),
	'zip'  => new ZipReaderStream( '/export.wxr' ),
	'xml'  => new XMLProcessorStream(function (WP_XML_Processor $processor, $context) use ($assets_downloader) {
		if(!str_ends_with($context['zip']->filename, '.wxr')) {
			return $context['zip']->skip_file();
		}
		
		if(is_wxr_content_node($processor)) {
			$text         = $processor->get_modifiable_text();

			// Download the missing assets files
			$assets_downloader->process( $text );
			if(!$assets_downloader->everything_already_downloaded()) {
			    // Don't import content that has pending downloads
			    return;
			}

			// Update the URLs in the text
			$updated_text = Pipe::run([
				new BlockMarkupURLRewriteStream( 
					$text,
					[ 'from_url' => $from_site, 'to_url'  => $to_site ]
				),
			]);
			if ( $updated_text !== $text ) {
				$processor->set_modifiable_text( $updated_text );
			}
		}
	})
] );

In fact this whole stack could build a generator which can then be called in a loop.

$pipe = Pipe::run( [ … ] );

while ( $context = $pipe->next() ) {
	list( 'xml' => $xml, 'zip' => $zip ) = $context;

	if ( ! str_ends_with( $zip->get_filename(), '.wxr' ) ) {
		$zip->skip_file();
		continue;
	}

	// start processing.
}

@adamziel
Owner Author

adamziel commented Jul 17, 2024

@dmsnell I love the idea, I'm confused about the details. Would the loop run for every stage of the pipeline? Or just for the final outcome? In the latter scenario, the filtering would happen after the chunks have been already processed. Also, what would this look like for a "demultiplexing" (streaming 5 concurrent requests) and a "branching" (only unzip zip files) use-cases?

@dmsnell

dmsnell commented Jul 17, 2024

no idea @adamziel 😄

but I think it relates to the need for requesting more. for example, the loop could execute as soon as any and every stage has something ready to process.

in the case of XML, it could sit there in the loop and as long as it doesn't have enough data to process could say $context->continue(). this is, in effect, a flattened version of the pipeline - perhaps the tradeoff is being explicit about what runs. but you can filter things before they unpack and this was my attempt to highlight in the code snippet. the following lines would do something like $zip->read_file(), possibly.

for demultiplexing I would assume that the multiplexed stream would provide a way to access the contents of each sub-stream.

@adamziel
Owner Author

adamziel commented Jul 17, 2024

I like reducing nesting @dmsnell. While demuxing is powerful, it's also complex and feels like solving an overly general problem instead of tailoring something simple to WordPress use-cases. Here's a take on processing multiple XML files using a flat stream structure:

Pipe::run( [
	'http' => new RequestStream( [ /* ... */ ] ),
	'zip'  => new ZipReaderStream( function ($context) {
		if(!str_ends_with($context['http']->url, '.zip')) {
			return $context->skip();
		}
		$context['zip']->set_processed_resource( $context['http']->url );
	} ),
	'xml'  => new XMLProcessorStream( function ($context) {
		if( 
		    ! str_ends_with($context['zip']->filename, '.wxr') &&
		    ! str_ends_with($context['http']->url, '.wxr')
		) {
			return $context->skip();
		}

		$context['xml']->set_processed_resource( $context['zip']->filename );
		$xml_processor = $context['xml']->get_processor( );
		while(WXR_Processor::next_content_node($xml_processor)) {
			// Migrate URLs and download assets
		}
	}),
] );

@dmsnell

dmsnell commented Jul 17, 2024

if we want this, it would seem like each callback should potentially have access to the context of all stages above and below it, plus space for shared state.

in the case of fn () => …-style callbacks this isn't essential, but using function means that variables won't be enclosed. perhaps this is okay, but it's a wart to usage.

@adamziel
Owner Author

> but I think it relates to the need for requesting more. for example, the loop could execute as soon as any and every stage has something ready to process.

I think that's a must, otherwise we'd need buffer size / backpressure semantics. By processing each incoming chunk right away we may sometimes go too granular or do too many checks, but perhaps it wouldn't be too bad – especially when networking and not CPU is the bottleneck.

> if we want this, it would seem like each callback should potentially have access to the context of all stages above and below it, plus space for shared state.

Shared data and context lookaheads sound like trouble, though. I was hoping that read-only access to context from all the stages above would suffice.

@dmsnell

dmsnell commented Jul 17, 2024

> Shared data and context lookaheads sound like trouble, though. I was hoping that read-only access to context from all the stages above would suffice.

these are valid concerns. I share them. still, I think that undoubtedly, someone will want to do something like conditionally skip a file in the ZIP based on something in the WXR processor, and being able to interact with that from below seems much more useful.

this is maybe the challenge that separate callback functions create, because the flat model doesn't separate the layers.

@adamziel
Owner Author

adamziel commented Jul 18, 2024

> these are valid concerns. I share them. still, I think that undoubtedly, someone will want to do something like conditionally skip a file in the ZIP based on something in the WXR processor, and being able to interact with that from below seems much more useful.

Agreed! The challenge is that we may only get the information necessary to reject a file after processing 10 or 1,000 chunks from that file. I can only see three solutions here:

  • Stream to a local file first, do the filtering, then start another pipe to process the buffered list.
    • Ups: No risk of going out of memory. Low complexity.
    • Downs: Double processing. Slower. Need storage, could be 100GB for a large file.
  • Buffer the information in-memory or on disk until we can make a decision. Push all the chunks to the end of the pipe and then "cancel" some files and rollback any side-effects the piping may have triggered. This could work with database inserts but not with piping to REST API requests.
    • Ups: Fast, single-pass processing.
    • Downs: Risk of going out of memory. Adds complexity. Rollbacks may require tracking changes and won't be always possible.
  • Buffer the information in-memory or on disk until we can make a decision. Stop processing before the decision point, then filter out some files and pipe the rest to the next stage.
    • Ups: Still fast "1.5-pass" processing. Reprocessed data would likely be low in volume.
    • Downs: Risk of going out of memory (can be mitigated with disk buffering). Adds complexity.
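
The "buffer until we can make a decision" step shared by the last two options could look roughly like the sketch below. It is an in-memory illustration only: `DecisionBuffer` and its decision callback are hypothetical, and a real implementation would need disk buffering and a size limit to address the memory risk listed above.

```php
<?php
// Hypothetical helper: holds back chunks of each resource until a callback
// can accept (true) or reject (false) it. A null verdict means "undecided".
class DecisionBuffer {
	private $decide;
	private array $buffers  = array(); // resource id => chunks awaiting a verdict
	private array $verdicts = array(); // resource id => bool

	public function __construct( callable $decide ) {
		$this->decide = $decide;
	}

	// Returns the chunks that may flow downstream right now.
	public function write( string $resource_id, string $chunk ): array {
		if ( isset( $this->verdicts[ $resource_id ] ) ) {
			return $this->verdicts[ $resource_id ] ? array( $chunk ) : array();
		}

		$this->buffers[ $resource_id ][] = $chunk;
		$seen_so_far                     = implode( '', $this->buffers[ $resource_id ] );
		$verdict                         = ( $this->decide )( $seen_so_far );
		if ( null === $verdict ) {
			return array();
		}

		// Decision made: release or drop the buffered chunks and stop buffering.
		$this->verdicts[ $resource_id ] = $verdict;
		$released                       = $verdict ? $this->buffers[ $resource_id ] : array();
		unset( $this->buffers[ $resource_id ] );
		return $released;
	}
}

// Assumed rule for the example: keep documents with an <rss root, drop HTML pages.
$buffer = new DecisionBuffer(
	function ( string $seen ): ?bool {
		if ( false !== strpos( $seen, '<rss' ) ) {
			return true;
		}
		if ( false !== strpos( $seen, '<html' ) ) {
			return false;
		}
		return null;
	}
);

$wxr_1  = $buffer->write( 'export.wxr', '<r' );
$wxr_2  = $buffer->write( 'export.wxr', 'ss><channel>' );
$wxr_3  = $buffer->write( 'export.wxr', '</channel>' );
$html_1 = $buffer->write( 'index.html', '<html>' );
$html_2 = $buffer->write( 'index.html', '<body>' );
```

Once a verdict exists, later chunks pass through (or are dropped) without buffering, so only the data seen before the decision point is ever held in memory.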

> this is maybe the challenge that separate callback functions create, because the flat model doesn't separate the layers.

I realized one more gotcha:

Imagine requesting 5 WXR exports, rewriting URLs, and saving them all to a local ZIP file. The ZIP writer needs to write data sequentially, so write all the chunks of the first file, write all the chunks of second file after that, and so on. However, sourcing data from HTTP would interleave chunks from different files. Simply piping those chunks to ZIPStreamWriter would produce a broken zip file.

We could turn it into a constraint solving problem. Stream classes would declare whether they:

  • Produce sequential chunks or interleaved chunks
  • Consume sequential chunks or interleaved chunks

On mismatch, the entire pipe would error out without even starting.
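
A minimal sketch of that up-front check, assuming each stage declares its chunk ordering through a small descriptor. `StageDescriptor`, the two constants, and `find_chunk_order_mismatch()` are hypothetical names. The sketch also assumes that a stage able to consume interleaved chunks can consume sequential ones too, so only the interleaved-to-sequential hand-off counts as a mismatch.

```php
<?php
const CHUNKS_SEQUENTIAL  = 'sequential';
const CHUNKS_INTERLEAVED = 'interleaved';

// Hypothetical declaration of what a stage consumes and produces.
class StageDescriptor {
	public string $name;
	public string $consumes;
	public string $produces;

	public function __construct( string $name, string $consumes, string $produces ) {
		$this->name     = $name;
		$this->consumes = $consumes;
		$this->produces = $produces;
	}
}

// Returns an error message for the first incompatible pair of neighbors,
// or null when the whole pipe is consistent.
function find_chunk_order_mismatch( array $stages ): ?string {
	$count = count( $stages );
	for ( $i = 1; $i < $count; $i++ ) {
		$upstream   = $stages[ $i - 1 ];
		$downstream = $stages[ $i ];
		if (
			CHUNKS_INTERLEAVED === $upstream->produces &&
			CHUNKS_SEQUENTIAL === $downstream->consumes
		) {
			return sprintf(
				'%s produces interleaved chunks but %s requires sequential chunks',
				$upstream->name,
				$downstream->name
			);
		}
	}
	return null;
}

$http = new StageDescriptor( 'RequestStream', CHUNKS_INTERLEAVED, CHUNKS_INTERLEAVED );
$file = new StageDescriptor( 'LocalFileReader', CHUNKS_SEQUENTIAL, CHUNKS_SEQUENTIAL );
$zip  = new StageDescriptor( 'ZIPStreamWriter', CHUNKS_SEQUENTIAL, CHUNKS_SEQUENTIAL );

$broken_pipe_error = find_chunk_order_mismatch( array( $http, $zip ) );
$valid_pipe_error  = find_chunk_order_mismatch( array( $file, $zip ) );
```

Running the check before any stage starts means a misconfigured pipe fails without downloading or writing a single byte.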

@dmsnell

dmsnell commented Jul 19, 2024

> Downs: Risk of going out of memory.

> we may only get the information necessary to reject a file after processing 10 or 1,000 chunks

to me this reads as a statement of the problem, not an impediment to the problem. if we have to wait for 1000 chunks before knowing whether to process a file, that's a sunk cost.

> not with piping to REST API requests.

> requesting 5 WXR exports, rewriting URLs, and saving them all to a local ZIP file

maybe it's just me but I'm lost in all of this. these examples are complicated, but are they likely? are they practical? where is the scope of what we're doing?

@dmsnell

dmsnell commented Aug 6, 2024

There's much more I'll do to review and think through this, but off the top of my head one question arises: how does it look to be re-entrant here?

Perhaps in the Playground this isn't a big problem, with unlimited execution time, but on any real PHP server we're dealing with max_execution_time and I would imagine any multi-GB import will need to be able to pause and resume.

Without asking you to instantly solve this, do you see a way to persist the in-transit state of the pipeline so that it can be resumed later? Could we put a pause button in here that someone clicks on and then can resume later?
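
As a very reduced illustration of what pausing and resuming could mean, the sketch below persists only a byte offset into a local file. `ResumableFileCursor` is a hypothetical class and not part of this branch; a real pipeline would also have to capture parser state (for example bookmarks) for every stage and handle remote sources, which this sketch deliberately ignores.

```php
<?php
// Hypothetical cursor whose entire in-transit state is a byte offset.
class ResumableFileCursor {
	private string $path;
	private int $chunk_size;
	private int $offset;

	public function __construct( string $path, int $chunk_size, int $offset = 0 ) {
		$this->path       = $path;
		$this->chunk_size = $chunk_size;
		$this->offset     = $offset;
	}

	// Reads the next chunk starting at the saved offset, or null at end of file.
	public function next_chunk(): ?string {
		$handle = fopen( $this->path, 'rb' );
		if ( false === $handle ) {
			return null;
		}
		fseek( $handle, $this->offset );
		$chunk = fread( $handle, $this->chunk_size );
		fclose( $handle );
		if ( false === $chunk || '' === $chunk ) {
			return null;
		}
		$this->offset += strlen( $chunk );
		return $chunk;
	}

	// Serializes everything needed to continue in a later PHP request.
	public function export_state(): string {
		return json_encode(
			array(
				'path'       => $this->path,
				'chunk_size' => $this->chunk_size,
				'offset'     => $this->offset,
			)
		);
	}

	public static function from_state( string $state ): self {
		$data = json_decode( $state, true );
		return new self( $data['path'], $data['chunk_size'], $data['offset'] );
	}
}

$path = tempnam( sys_get_temp_dir(), 'pipe' );
file_put_contents( $path, '0123456789' );

// The first "request" processes one chunk and then pauses.
$cursor      = new ResumableFileCursor( $path, 4 );
$first_chunk = $cursor->next_chunk();
$saved_state = $cursor->export_state();

// A later "request" restores the state and finishes the work.
$resumed   = ResumableFileCursor::from_state( $saved_state );
$remaining = '';
while ( null !== ( $chunk = $resumed->next_chunk() ) ) {
	$remaining .= $chunk;
}
unlink( $path );
```

The saved state is a plain JSON string, so it could be stored in an option or a transient between requests; whether that is enough depends on how much state each pipeline stage carries.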


@dmsnell dmsnell left a comment


@adamziel monumental work here. of the three pipes I like the controller version the best because of how it seems like the processing steps are a little more global in those cases.

but I noticed something in all formulations: the pipeline doesn't seem to be where the complexity lies. it seems like the examples focus on pipelining the download of files, which I think involves files that get queued while processing.

what would this look like if instead of this processing pipeline we had a main loop where each stage was exposed directly, without the pipeline abstraction, but the files could be downloaded still in parallel?

what could that look like? would it be worse? I think I'm puzzled on how to abstract a universal interface for streaming things, apart from calling everything a token, but your example of the WXR rewriter demonstrates how in many cases the individual token is not the right step function. in many cases, we will process many bytes all at once, and one production from an earlier stage might create many tokens for the next stage.

I'm also thinking more about re-entrancy and how to wrap the indices throughout the pipeline. in this system I suppose we could add new methods exposing the current bookmark, the start and end of the current token for a given stage. this might be critical for being able to pause and resume progress.

at this point I think I have some feel for the design, so I'd like to ask you for some leading questions if you have any. I know this is inherently a very complicated task; the code itself also seems very complicated.

$this->set_modifiable_html_text(
$html,
substr($text, 0, $at) . json_encode($new_attributes, JSON_HEX_TAG | JSON_HEX_AMP)
);

My block comment delimiter finder might help here.

WordPress/wordpress-develop#6760

foreach($attributes as $key => $value) {
$new_attributes[$key] = $this->process_block_attributes($value);
}
return $new_attributes;

array_walk_recursive might be of help here. your code is working fine, but presumably this could perform better, if it does.

I suppose there's no practical concern here about stack overflow, since this is only processing block attributes, but I'm on the lookout for any non-tail-recursive recursion (and I think that no user-space PHP code is, even if it's in tail-recursive form, which this isn't).

Alternatively there's also the approach of adding values to a stack to process, where the initial search runs over the stack, adding new items for each one that it finds that's an array.

This is not important; I just noticed it.
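
For reference, a minimal sketch of the two non-recursive options mentioned above, assuming the attributes are plain nested arrays with scalar leaves. `map_nested_values()` is a hypothetical helper, not code from this PR, and `strtoupper` only stands in for the real per-value rewrite:

```php
<?php
/**
 * Applies $transform to every non-array leaf of a nested array using an
 * explicit stack instead of recursion. Hypothetical helper for illustration.
 */
function map_nested_values( array $attributes, callable $transform ): array {
	$stack   = array();
	$stack[] = &$attributes;

	while ( count( $stack ) > 0 ) {
		// Take a reference to the array on top of the stack, then pop it.
		$current = &$stack[ count( $stack ) - 1 ];
		array_pop( $stack );

		foreach ( $current as &$value ) {
			if ( is_array( $value ) ) {
				// Defer nested arrays instead of recursing into them.
				$stack[] = &$value;
			} else {
				$value = $transform( $value );
			}
		}
		unset( $value, $current );
	}

	return $attributes;
}

$attributes = array(
	'url'    => 'https://example.com',
	'nested' => array( 'href' => 'https://example.com/about' ),
);

// Option 1: the explicit stack.
$with_stack = map_nested_values( $attributes, 'strtoupper' );

// Option 2: array_walk_recursive() visits the same leaves in place.
$with_walk = $attributes;
array_walk_recursive(
	$with_walk,
	function ( &$value ) {
		$value = strtoupper( $value );
	}
);
```

Both variants leave the original `$attributes` untouched. Whether either is measurably faster than the recursive version for typical block attributes is untested here.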

* @TODO: Investigate how bad this is – would it stand the test of time, or do we need
* a proper URL-matching state machine?
*/
const URL_REGEXP = '\b((?:(https?):\/\/|www\.)[-a-zA-Z0-9@:%._\+\~#=]+(?:\.[a-zA-Z0-9]{2,})+[-a-zA-Z0-9@:%_\+.\~#?&//=]*)\b';

check out the extended flag x

If this modifier is set, whitespace data characters in the pattern are totally ignored except when escaped or inside a character class, and characters between an unescaped # outside a character class and the next newline character, inclusive, are also ignored. This is equivalent to Perl's /x modifier, and makes it possible to include commentary inside complicated patterns. Note, however, that this applies only to data characters. Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( which introduces a conditional subpattern.

https://www.php.net/manual/en/reference.pcre.pattern.modifiers.php

this can help make long and confusing regexes clearer, with comments to annotate
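
A small sketch of what that could look like. The pattern below is a deliberately simplified, hypothetical URL matcher, not a rewrite of URL_REGEXP, and it only exists to show the `x` flag with inline comments:

```php
<?php
// With the x flag, unescaped whitespace and "#..." comments outside of
// character classes are ignored by PCRE.
$url_pattern = '~
	\b
	(?: https?:// | www\. )    # scheme, or a bare "www." prefix
	[a-z0-9.-]+                # host name characters
	\. [a-z]{2,}               # a dot followed by a top-level domain
	(?: / [^\s<>"]* )?         # optional path, stopping at whitespace or markup
~xi';

$text = 'Visit https://example.com/path?x=1 and www.wordpress.org today';
preg_match_all( $url_pattern, $text, $matches );

// $matches[0] holds the full matches in document order.
print_r( $matches[0] );
```

Note that a literal `#` or space in the pattern has to be escaped or placed inside a character class once `x` is on.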


I would imagine that this review is more about the pipeline, but I think for URLs, if we're using a WHATWG-compliant URL parser, we can probably jump to \b(?:[a-z-]+://|www\.|/) and start checking if those base points can produce a valid parse. it looks like this code isn't using what you've done in other explorations, so this comment may not be valid

}
if(
$p->get_token_type() === '#cdata-section' &&
strpos($new_value, '>') !== false

if it's #cdata-section then it's a real CDATA section and we should check for ]]>. if it's #comment and WP_HTML_Tag_Processor::COMMENT_AS_CDATA_LOOKALIKE === $p->get_comment_type() then it's a lookalike and > is the closer.

$this->xml = $new_xml;
$this->stack_of_open_elements = $breadcrumbs;
$this->parser_context = $parser_context;
$this->had_previous_chunks = true;

with the HTML API's extend() I've planned on ensuring that we only cut off as much from the front of the document until the first bookmark.

I've considered two modes: one simply extends (which is what #5050 does), and the other extends and forgets.

The major difference is what comes out of get_updated_html()

Here for XML this may be easier, but for HTML it's not as easy as resetting the stack of open elements. There's a lot more state to track and modify, so right now in trunk it will reset to the start and crawl forward until it reaches the bookmark again if the bookmark is before the cursor.

Owner Author

@adamziel adamziel Aug 10, 2024

@dmsnell My thinking is the processor has no idea whether the input stream is finished or not. It can make an assumption that an unclosed tag means we're paused at incomplete input, but the input stream may be in fact exhausted. The reverse is also problematic – we may have enough input to infer parsing is finished when in fact more input is coming. Perhaps these processors need to be explicitly told "we're still waiting for more data" or "no more input data will come".

with the HTML API's extend() I've planned on ensuring that we only cut off as much from the front of the document until the first bookmark.

Are there any system-level bookmarks that are implicitly created? As in, is there a chance we'd never forget any bytes because we've seen the <body> tag and we'll keep track of it indefinitely?

A memory limit also crossed my mind, as in "never buffer more than 1MB of data", although that seems more complex and maybe not worth it.


Perhaps these processors need to be explicitly told "we're still waiting for more data" or "no more input data will come".

Yes I believe this is going to be the demand. At some point I think we will probably add some method like $processor->get_incomplete_end_of_document() but it's not there because I have no idea what that should be right now, or if it's truly necessary.

Only the caller will be able to know if the document was truncated or if more chunks are inbound. This is also true for cases where we have everything in memory, e.g. we got truncated HTML as input and don't know where it came from - "that's it, that's all!"

Are there any system-level bookmarks that are implicitly created? As in, is there a chance we'd never forget any bytes because we've seen the <body> tag and we'll keep track of it indefinitely?

In the HTML Processor there are for sure, though in the case of the fragment parser, since the context element never exists on the stack of open elements this shouldn't be a problem. We should be able to eject portions of the string that are closed.
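
To make the "only the caller knows" point concrete, here is a toy sketch in which the caller explicitly signals the end of input. `Chunked_Line_Reader`, `append_bytes()` and `input_finished()` are hypothetical names, and newline-delimited lines stand in for XML/HTML tokens:

```php
<?php
/**
 * Toy chunked reader: a trailing, unterminated line is ambiguous until the
 * caller says whether more bytes are coming. Hypothetical, for illustration.
 */
final class Chunked_Line_Reader {
	private string $buffer         = '';
	private bool   $input_finished = false;

	public function append_bytes( string $bytes ): void {
		$this->buffer .= $bytes;
	}

	/** The caller promises that no more input will arrive. */
	public function input_finished(): void {
		$this->input_finished = true;
	}

	/** Returns the next line, or null when paused or exhausted. */
	public function next_line(): ?string {
		$at = strpos( $this->buffer, "\n" );
		if ( false !== $at ) {
			$line         = substr( $this->buffer, 0, $at );
			$this->buffer = substr( $this->buffer, $at + 1 );
			return $line;
		}
		// An unterminated tail is only emitted once the input is known to be complete.
		if ( $this->input_finished && '' !== $this->buffer ) {
			$line         = $this->buffer;
			$this->buffer = '';
			return $line;
		}
		return null;
	}
}

$reader = new Chunked_Line_Reader();
$reader->append_bytes( "<a>\n<b" );
$first  = $reader->next_line(); // A complete line is available.
$paused = $reader->next_line(); // "<b" may be incomplete, so the reader waits.
$reader->append_bytes( '>' );
$reader->input_finished();
$last = $reader->next_line();   // Now the tail can be treated as final.
```

The same `null` return means "paused" before `input_finished()` and "exhausted" after it, which is the distinction the processors cannot infer on their own.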

* to the second ZIP file, and so on.
*
* This way we can maintain a predictable $context variable that carries upstream
* metadata and exposes methods like skip_file().

good comment

@adamziel
Owner Author

adamziel commented Aug 27, 2024

what would this look like if instead of this processing pipeline we had a main loop where each stage was exposed directly, without the pipeline abstraction, but the files could be downloaded still in parallel?

I explored that in bd19ad7.

I like that it's less code overall.

Here's what I don't like:

  • I find it less readable.
  • There's no $state with a "call stack" of references to the data sources that got us to a given chunk (e.g. the $http client that yielded a specific chunk, the zip processor that uncompressed it, etc.), so we cannot easily abort an earlier stage from a later stage (e.g. stop downloading the entire zip file once we find the file we need inside the archive).
  • It forces us into a cascade of nested loops and makes nested error handling difficult. The code pattern resembles the general problem monads solve, and the Byte_Stream abstraction is quasi-monadic with a few intentional limitations, e.g. you can't compose more arrows after it's created.
  • There's no single code structure to handle re-entrancy; we'd have to think about it at every stage of the pipeline.
  • We'll keep repeating the same switch/error handling patterns and multiplexing caches ($xml_processors = [];)
  • Internal streaming details leak out into the userland code. That's annoying for the current implementation of the XML processor, but I think this one could be solved with a better XML streaming interface.

I explored inlining the loop cascade into a single loop with a switch-based stage management in daaba8a. It's more readable, but the other pain points still stand.

what could that look like? would it be worse? I think I'm puzzled on how to abstract a universal interface for streaming things, apart from calling everything a token, but your example of the WXR rewriter demonstrates how in many cases the individual token is not the right step function. in many cases, we will process many bytes all at once, and one production from an earlier stage might create many tokens for the next stage.

You may be pointing at this already with your choice of words – I'm noticing a lot of similarities between this work and the MySQL parser explorations. We're ingesting "tokens" in the form of bytes, XML tags, etc., identifying the next non-terminal processing rule, and moving them there. If we squint and forget about sourcing data from the network, disk, etc., we're just composing parsers here. At an abstract level, the entire process could be driven by a grammar declaration – I now think the pipeline definition is just that.

I'm also thinking more about re-entrancy and how to wrap the indices throughout the pipeline. in this system I suppose we could add new methods exposing the current bookmark, the start and end of the current token for a given stage. this might be critical for being able to pause and resume progress.

My initial thinking is we could store the cursor as follows:

  • HTTP transmission state (paused while processing bytes 81920–90112)
  • The nested ZIP processing state (we're processing context.xml and we've paused while processing bytes 2048-4096)
  • The nested XML processing state (the current trace is root > post > wp:content, we've seen 3015 bytes so far, and we're about to consume the next tag), etc.

Upon resuming, each processor would restore the frozen state and skip over to the relevant byte in the stream.

On the upside, it seems simple.

On the downside:

  • It wouldn't work if we pause mid "sub chunk", e.g. a GZIP block. We may need to include the "sub block size" in the design and only consider specific byte offsets as resumable checkpoints. I'm not sure how to map that across different data models, e.g. HTTP 1.1 chunked/gzipped transfer offset -> ZIP byte stream -> last gzip block -> XML byte offset.
  • We'll download more than we need upon resuming. We can't easily map the exact byte number where we stopped processing XML and so we'll need to request the entire byte range 81920–90112.
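
As a rough sketch, the stored cursor could be a plain nested array that round-trips through JSON. The key names below are hypothetical, the numbers are the ones from the example above, and the range end is assumed to be exclusive:

```php
<?php
// One entry per pipeline stage, outermost stage first.
$cursor = array(
	'http' => array( 'range_start' => 81920, 'range_end' => 90112 ),
	'zip'  => array( 'file' => 'context.xml', 'range_start' => 2048, 'range_end' => 4096 ),
	'xml'  => array(
		'breadcrumbs' => array( 'root', 'post', 'wp:content' ),
		'bytes_seen'  => 3015,
	),
);

// Freeze the cursor at the end of one request...
$frozen = json_encode( $cursor );

// ...and restore it at the start of the next one.
$restored = json_decode( $frozen, true );

// Second downside from the list: the HTTP stage has to re-request the whole
// paused range. HTTP Range headers use an inclusive end offset, hence the -1.
$range_header = sprintf(
	'Range: bytes=%d-%d',
	$restored['http']['range_start'],
	$restored['http']['range_end'] - 1
);
```

This only covers the bookkeeping. It does nothing about the first downside, where a pause lands in the middle of a compressed block.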

at this point I think I have some feel for the design, so I'd like to ask you for some leading questions if you have any. I know this is inherently a very complicated task; the code itself also seems very complicated.

I didn't phrase much of this comment as questions, but it's all me asking for your thoughts.

@adamziel
Owner Author

Highly relevant PR from @dmsnell: WordPress/wordpress-develop#6883

brandonpayton added a commit to WordPress/wordpress-playground that referenced this pull request Sep 27, 2024
… WebApp Redesign (#1731)

## Description

Implements a large part of the [website
redesign](#1561):

![CleanShot 2024-09-14 at 10 24
57@2x](https://github.com/user-attachments/assets/f245c7ac-cb8c-4e5a-b90a-b4aeff802e7b)


High-level changes shipped in this PR:

* Multiple Playgrounds. Every temporary Playground can be saved either
in the browser storage (OPFS) or in a local directory (Chrome desktop
only for now).
* New Playground settings options: Name, language, multisite
* URL as the source of truth for the application state
* State management via Redux

This work is a convergence of 18+ months of effort and discussions. The
new UI relieves users from juggling ephemeral Playgrounds and
losing their work. It opens up space for long-lived site configurations
and additional integrations. We could bring over all the [PR previewers
and demos](https://playground.wordpress.net/demos/) right into the
Playground app.

Here's just a few features unblocked by this PR:

* #1438 – no
more losing your work by accident 🎉
* #797 – with
multiple sites we can progressively build features we'll eventually
propose for WordPress core:
* A Playground export and import feature, pioneering the standard export
format for WordPress sites.
* A "Clone this Playground" feature, pioneering the [Site Transfer
Protocol](https://core.trac.wordpress.org/ticket/60375).
   * A "Sync two Playgrounds" feature, pioneering the Site Sync Protocol
* #1445 – better
git support is in top 5 most highly requested features. With multiple
Playgrounds, we can save your work and get rid of the "save your work
before connecting GitHub or you'll lose it" and cumbersome "repo setup"
forms on every interaction. Instead, we can make git operations like
Pull, Commit, etc. very easy and even enable auto-syncing with a git
repository.
* #1025 – as we
bring in more PHP plumbing into this repository, we'll replace the
TypeScript parts with PHP parts to create a WordPress core-first
Blueprints engine
* #1056 – Site
transfer protocol will unlock seamlessly passing Playgrounds between
the browser and a local development environment
* #1558 – we'll
integrate [the Blueprints directory] and offer single-click Playground
setups, e.g. an Ecommerce store or a Slide deck editor.
#718.
* #539 – the
recorded Blueprints would be directly editable in Playground and perhaps
saved as a new Playground template
* #696 – the new
interaction model creates space for additional integrations.
* #707 – you
could create a "GitHub–synchronized" Playground
* #760 – we can
bootstrap one inside Playground using a Blueprint and benefit the users
immediately, and then gradually work towards enabling it on
WordPress.org
* #768 – the new
UI has space for a "new in Playground" section, similar to what Chrome
Devtools do
* #629  
* #32
* #104
* #497
* #562
* #580 

### Remaining work

- [ ] Write a release note for https://make.wordpress.org/playground/
- [x] Make sure GitHub integration is working. Looks like OAuth
connection leads to 404.
- [x] Fix temp site "Edit Settings" functionality to actually edit
settings (forking a temp site can come in a follow-up PR)
- [x] Fix style issue with overlapping site name label with narrow site
info views
- [x] Fix style issue with bottom "Open Site" and "WP Admin" buttons
missing for mobile viewports
- [x] Make sure there is a path for existing OPFS sites to continue to
load
- [x] Adjust E2E tests.
- [x] Reflect OPFS write error in UI when saving temp site fails
- [x] Find a path forward for
[try-wordpress](https://github.com/WordPress/try-wordpress) to continue
working after this PR
- [x] Figure out why the browser gets so choppy during OPFS save. It
looks as if there was a lot of synchronous work going on. Shouldn't all
the effort be done by a worker in a non-blocking way?
- [x] Test with Safari and Firefox. Might require a local production
setup as FF won't work with the Playground dev server.
- [x] Fix Safari error: `Unhandled Promise Rejection: UnknownError:
Invalid platform file handle` when saving a temporary Playground to
OPFS.
- [x] Fix to allow deleting site that fails to boot. This is possible
when saving a temp site fails partway through.
- [x] Fix this crash:

```ts
		/**
		 * @todo: Fix OPFS site storage write timeout that happens alongside 2000
		 *        "Cannot read properties of undefined (reading 'apply')" errors here:
		 * I suspect the postMessage call we do to the safari worker causes it to
		 * respond with another message and these unexpected exchange throws off
		 * Comlink. We should make Comlink ignore those.
		 */
		// redirectTo(PlaygroundRoute.site(selectSiteBySlug(state, siteSlug)));
```
- [x] Test different scenarios manually, in particular those involving
Blueprints passed via hash
- [x] Ensure we have all the aria, `name=""` etc. accessibility
attributes we need, see AXE tools for Chrome.
- [x] Update developer documentation on the `storage` query arg (it's
removed in this PR)
- [x] Go through all the `TODOs` added in this PR and decide whether to
solve or punt them
- [x] Handle errors like "site not found in OPFS", "files missing from a
local directory"
- [x] Disable any `Local Filesystem` UI in browsers that don't support
them. Don't just hide them, though. Provide a help text to explain why
they are disabled.
- [x] Reduce the naming confusion, e.g. `updateSite` in redux-store.ts
vs `updateSite` in `site-storage.ts`. What would an unambiguous code
pattern look like?
- [x] Find a reliable and intuitive way of updating these deeply nested
redux state properties. Right now we do an ad-hoc recursive merge that's
slightly different for sites and clients. Which patterns used in other
apps would make it intuitive?
- [x] Have a single entrypoint for each logical action such as "Create a
new site", "Update site", "Select site" etc. that will take care of
updating the redux store, updating OPFS, and updating the URL. My ideal
scenario is calling something like `updateSite(slug, newConfig)` in a
React Component and being done without thinking "ughh I still need to
update OPFS" or "I also have to adjust that .json file over there"
- [x] Fix all the tiny design imperfections, e.g. cut-off labels in the
site settings form.

### Follow up work

- [ ] Mark all the related blocked issues as unblocked on the project
board, e.g.
#1703,
#1731, and more –
[see the All Tasks
view](https://github.com/orgs/WordPress/projects/180/views/2?query=sort%3Aupdated-desc+is%3Aopen&filterQuery=status%3A%22Up+next%22%2C%22In+progress%22%2C%22Needs+review%22%2C%22Reviewed%22%2C%22Done%22%2CBlocked)
- [ ] Update WordPress/Learn#1583 with info
that the redesign is now in and we're good to record a video tutorial.
- [ ] #1746
- [ ] Write a note in [What's new for developers? (October
2024)](WordPress/developer-blog-content#309)
- [ ] Document the new site saving flow in
`packages/docs/site/docs/main/about/build.md` cc @juanmaguitar
- [ ] Update all the screenshots in the documentation cc @juanmaguitar 
- [ ] When the site fails to load via `.list()`, still return that
site's info but make note of the error. Not showing that site on a list
could greatly confuse the user ("Hey, where did my site go?"). Let's be
explicit about problems.
- [ ] Introduce notifications system to provide feedback about outcomes
of various user actions.
- [ ] Add non-minified WordPress versions to the "New site" modal.
- [ ] Fix `console.js:288 TypeError: Cannot read properties of undefined
(reading 'apply') at comlink.ts:314:51 at Array.reduce (<anonymous>) at
callback (comlink.ts:314:29)` – it seems to happen at trunk, too.
- [ ] Attribute log messages to the site that triggered them.
- [ ] Take note of any interactions that we find frustrating or
confusing. We can perhaps adjust them in a follow-up PR, but let's make
sure we notice and document them here.
- [ ] Solidify the functional tooling for transforming between `URL`,
`runtimeConfiguration`, `Blueprint`, and `site settings form state` for
both OPFS sites and in-memory sites. Let's see if we can make it
reusable in Playground CLI.
- [ ] Speed up OPFS interactions, saving a site can take quite a while.
- [ ] A mobile-friendly modal architecture that doesn't stack modals,
allows dismissing, and understands some modals (e.g. fatal error report)
might have priority over other modals (e.g. connect to GitHub). Discuss
whether modals should be declared at the top level, like here, or
contextual to where the "Show modal" button is rendered.
- [ ] Discuss the need to support strong, masked passwords over a simple
password that's just `"password"`.
- [ ] Duplicate site feature implemented as "Export site + import site"
with the new core-first PHP tools from
adamziel/wxr-normalize#1 and
https://github.com/adamziel/site-transfer-protocol
- [x] Retain temporary sites between site changes. Don't just trash
their iframe and state when the user switches to another site.

Closes #1719

cc @brandonpayton

---------

Co-authored-by: Brandon Payton
Co-authored-by: Bero
Co-authored-by: Bart Kalisz
@dmsnell

dmsnell commented Sep 28, 2024

Doodling - this is probably all a disaster.

$pipeline->add( 'http', $client );
$pipeline->add( 'zip', $zip_decoder );
$pipeline->add( 'xml', $xml_processor );

$xml_processor->auto_feeder = array( $zip_decoder, 'read_chunk' );
$zip_decoder->auto_feeder   = array( $client, 'next_file' );

$client->new_item = fn ( $filename, $chunk ) => $zip_decoder->new_stream( $chunk );
$zip_decoder->new_item = fn ( $filename, $chunk ) => $xml_processor->new_stream( $chunk );

while ( $pipeline->keep_going() ) {
	if ( $zip_decoder->get_file_path() !== 'export.xml' ) {
		$zip_decoder->next_file();
		continue;
	}

	if ( ! $xml_processor->next_token() ) {
		wp_insert_post( $post );
		continue;
	}

	$post  = new WP_Post();
	$token = $xml_processor->get_token_name();
	…
}

so maybe this more or less mirrors work you did in the IByteStream or pipes work. it reminds me of something Joe Armstrong wrote about.

 system X is:
      start component a
      start component b
      ...
      connect out1 of a to in2 of b
      connect out2 of b to in2 of c
      ..
      send {logging,on} to control2 of c
      ..
     send run to all

Can we find a simple expression of pipe events without requiring the creation of new classes and without exposing all of the nitty-gritty internals? Maybe not. Maybe the verbose approach is best and largely, code using these streams will be highly-specialized and complicated, and the verbosity is fine because these complicated flows require paying attention to them. 🤔

@adamziel
Owner Author

adamziel commented Sep 29, 2024

I have some thoughts about reentrancy unrelated to @dmsnell's last comment:

Pausing a pipe may require saving the current state and the data buffer of every parser in the pipe.

Imagine the following pipe:

Local file > zip reader > xml parser > WXR importer

Now imagine we failed to import the post number 10472. Here's what we need to consider:

  • The WXR importer may have already created some dependent database records. It must either roll these changes back, or support very granular pausing and resuming. My gut says that the former would be much simpler.
  • The XML parser already moved past the opening <wp:post> tag — so we can't just export the current parser state.
  • The XML markup for the post may span multiple ZIP chunks, so we can't just export the last parser state.
  • The ZIP file includes gzipped data — so we better export the byte offset of the last gzip block.
  • We can't just remember a single byte offset at which we've finished processing the local file. We don't know it. We're not trying to correlate the byte offset of each XML tag opener with a specific byte in the ZIP file, and I'm not even sure we could given the gzip compression.

Every parser must maintain its internal state in such a way that we could destroy and recreate all its internal resources at any time. For example, the ZIP parser's buffer should never start mid gzip block because that would prevent it from recreating the deflate handle.

We'll need to set checkpoints after each meaningful downstream task, e.g. when a post is imported. A checkpoint would be a serialized pipe state at that point in time. The downstream WXR parser may import 100 posts from a single zip chunk, and then it may need 100 zip chunks to import 1 post. We need to export all the upstream states and buffers to correctly resume the downstream parser and allow it to pull the next upstream chunk.

We can only set checkpoints after the last task OR at the first chunk of the next task, but not right before the next task. Why? Because we can't know we're about to enter the next WP post without peeking, and peek() isn't supported in the current streaming API.
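
A toy sketch of that checkpointing rule, assuming newline-delimited "posts" read from a string in fixed-size chunks. `import_batch()` and the checkpoint shape are hypothetical. The point is that the checkpoint stores both the upstream offset and the downstream buffer, and is only taken right after a finished task:

```php
<?php
/**
 * Imports up to $max_posts newline-delimited records and returns
 * array( $imported, $checkpoint ). Hypothetical, for illustration only.
 */
function import_batch( string $source, int $chunk_size, int $max_posts, ?array $checkpoint = null ): array {
	// Restore the upstream offset and the downstream buffer, or start fresh.
	$offset   = $checkpoint['offset'] ?? 0;
	$buffer   = $checkpoint['buffer'] ?? '';
	$imported = array();

	while ( count( $imported ) < $max_posts ) {
		$newline_at = strpos( $buffer, "\n" );
		if ( false === $newline_at ) {
			if ( $offset >= strlen( $source ) ) {
				break; // Upstream is exhausted.
			}
			// The downstream parser needs more bytes: pull the next upstream chunk.
			$buffer .= substr( $source, $offset, $chunk_size );
			$offset += $chunk_size;
			continue;
		}

		$imported[] = substr( $buffer, 0, $newline_at );
		$buffer     = substr( $buffer, $newline_at + 1 );

		// Checkpoint right after a finished task, never in the middle of one.
		$checkpoint = array( 'offset' => $offset, 'buffer' => $buffer );
	}

	return array( $imported, $checkpoint );
}

$source = "post-1\npost-2\npost-3\n";

// First request: import two posts, then persist the checkpoint as JSON.
list( $first_batch, $checkpoint ) = import_batch( $source, 5, 2 );
$frozen = json_encode( $checkpoint );

// Second request: resume from the persisted checkpoint.
list( $second_batch ) = import_batch( $source, 5, 2, json_decode( $frozen, true ) );
```

Note how the first checkpoint carries leftover bytes that belong to the next post. Dropping the buffer and keeping only the offset would lose them.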

Later on we may try to optimize the state serialization and:

  • Explore truncating all the upstream bytes that were already processed downstream.
  • Explore not storing the buffers, but re-populating the pipe with the upstream bytes.

Both should be possible upstream from the ZIP parser but I'm not sure about downstream. It would require synchronizing parser byte offsets, compressed/uncompressed offsets, and gzip block offsets between the piped parsers.

Streaming ZIP files has one more complexity. We may need two cursors — one to parse the central directory index, and one to go through the actual files. This could be a higher order stream with two inputs, but that smells like complexity and adding a lot of ideas to the streaming architecture. Maybe a custom pipe class that knows how to request new input streams and has a single output?
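
For context on the "second cursor": the central directory is located through the End of Central Directory (EOCD) record at the tail of the archive. A minimal sketch of reading it follows. `find_central_directory()` is a hypothetical helper, and it assumes no ZIP64 and no archive comment that happens to contain the signature bytes:

```php
<?php
/**
 * Locates the EOCD record in the last bytes of a ZIP file and returns where
 * the central directory lives. Hypothetical helper for illustration.
 */
function find_central_directory( string $tail_bytes ): ?array {
	// The EOCD record starts with the signature 0x06054b50, i.e. "PK\x05\x06".
	$at = strrpos( $tail_bytes, "PK\x05\x06" );
	if ( false === $at || strlen( $tail_bytes ) - $at < 22 ) {
		return null;
	}

	// The fixed part of the record is 22 bytes of little-endian fields.
	$record = unpack(
		'Vsignature/vdisk/vcd_disk/vdisk_entries/vtotal_entries/Vcd_size/Vcd_offset/vcomment_length',
		substr( $tail_bytes, $at, 22 )
	);

	return array(
		'entries'   => $record['total_entries'],
		'cd_offset' => $record['cd_offset'],
		'cd_size'   => $record['cd_size'],
	);
}

// A hand-built tail: 10 filler bytes followed by an EOCD record that describes
// 2 entries in a 100-byte central directory starting at byte 1234.
$tail = str_repeat( "\0", 10 ) . pack( 'VvvvvVVv', 0x06054b50, 0, 0, 2, 2, 100, 1234, 0 );
$info = find_central_directory( $tail );
```

With the offset and size known, one input could fetch just the central directory bytes while the other walks the file entries.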

Cc @sirreal

@adamziel
Owner Author

adamziel commented Sep 30, 2024

We've got the first prototype of re-entrant streams!

In 3c07f99 I've prototyped the pause() and resume() methods:

$file_stream = new File_Byte_Stream('./test.txt', 100);
// Read bytes 0-99
$file_stream->next_bytes();
// Pause the processing
file_put_contents('paused_state.txt', json_encode($file_stream->pause()));

// Resume the processing in another request
$file_stream = new File_Byte_Stream('./test.txt', 100);
$paused_state = json_decode(file_get_contents('paused_state.txt'));
$file_stream->resume($paused_state);
// Read the bytes 100 - 199
$file_stream->next_bytes();

It seems to be working quite well!
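
For readers without the branch checked out, here is a self-contained sketch of the idea. `Sketch_File_Byte_Stream` is a hypothetical stand-in, not the File_Byte_Stream class from 3c07f99, and it assumes the paused state is nothing more than a byte offset:

```php
<?php
/**
 * Minimal re-entrant file stream: the only state worth persisting is the
 * byte offset of the next unread chunk. Hypothetical, for illustration.
 */
final class Sketch_File_Byte_Stream {
	/** @var resource */
	private $handle;
	private int $chunk_size;
	private int $offset = 0;
	private string $bytes = '';

	public function __construct( string $path, int $chunk_size ) {
		$handle = fopen( $path, 'rb' );
		if ( false === $handle ) {
			throw new RuntimeException( "Cannot open $path" );
		}
		$this->handle     = $handle;
		$this->chunk_size = $chunk_size;
	}

	/** Reads the next chunk. Returns false once the file is exhausted. */
	public function next_bytes(): bool {
		fseek( $this->handle, $this->offset );
		$bytes = fread( $this->handle, $this->chunk_size );
		if ( false === $bytes || '' === $bytes ) {
			return false;
		}
		$this->bytes   = $bytes;
		$this->offset += strlen( $bytes );
		return true;
	}

	public function get_bytes(): string {
		return $this->bytes;
	}

	/** Exports a JSON-serializable snapshot of the stream state. */
	public function pause(): array {
		return array( 'offset' => $this->offset );
	}

	/** Restores a snapshot produced by pause(). */
	public function resume( array $paused_state ): void {
		$this->offset = (int) $paused_state['offset'];
	}
}

// Demo file: 100 "a" bytes followed by 100 "b" bytes.
$path = tempnam( sys_get_temp_dir(), 'stream' );
file_put_contents( $path, str_repeat( 'a', 100 ) . str_repeat( 'b', 100 ) );

$stream = new Sketch_File_Byte_Stream( $path, 100 );
$stream->next_bytes();                      // Reads bytes 0-99.
$frozen = json_encode( $stream->pause() );  // Persist between requests.

$resumed = new Sketch_File_Byte_Stream( $path, 100 );
$resumed->resume( json_decode( $frozen, true ) );
$resumed->next_bytes();                     // Reads bytes 100-199.
$second_chunk = $resumed->get_bytes();
```

The snapshot is decoded with `json_decode( ..., true )` so that `resume()` receives an array. Whether the real implementation expects an array or an object is not stated in this thread.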

What did not work

At first, I tried the following approach:

$file_stream = new File_Byte_Stream('./test.txt', 100);
$file_stream->next_bytes();
$file_stream_2 = File_Byte_Stream::resume( $file_stream->pause() );

It worked well for simple streams, but there's no way to generalize it to callback-based streams like ProcessorByteStream – we can't serialize the callbacks as JSON:

class ZIP_Reader
{
    static public function stream()
    {
        return ProcessorByteStream::demuxed(
            function () { return new ZipStreamReader(); },
            function (ZipStreamReader $zip_reader, ByteStreamState $state) {
                while ($zip_reader->next()) {
                    switch ($zip_reader->get_state()) {
                        case ZipStreamReader::STATE_FILE_ENTRY:
                            $state->file_id = $zip_reader->get_file_path();
                            $state->output_bytes = $zip_reader->get_file_body_chunk();
                            return true;
                    }
                }
                return false;
            }
        );
    }
}

Therefore, I stuck with the approach of creating a stable stream (or stream chain) instance from "schema", and then exporting/importing its internal state:

function create_stream_chain($paused_state=null) {
    $chain = new StreamChain(
        [
            'file' => new File_Byte_Stream('./export.wxr', 100),
            'xml' => XML_Processor::stream(function () { }),
        ]
    );
    if($paused_state) {
        $chain->resume($paused_state);
    }
    return $chain;
}

We could, in theory, provide an interface such as $stream2 = $stream->pause()->resume() by making runtime artifacts serializable. For that, we'd need two transforms: callback -> JSON and JSON -> callback. One way to do it is by replacing arbitrary dynamic callbacks with statically declared classes:

class StrtoupperStream extends TransformStream {
	protected function transform($chunk) {
		return strtoupper( $chunk );
	}
}
StreamApi::register(StrtoupperStream::class);

class RewriteLinksInWXRStream extends ProcessorTransformStream {
	protected function transform(WP_XML_Processor $processor) {
		// ...
	}
}
StreamApi::register(RewriteLinksInWXRStream::class);

However, you can see how requiring a class registration for every simple transform would unnecessarily increase the complexity and balloon the number of classes, files, dependencies, inheritance hierarchies, etc. Having spent a few years with Java, I have to say hard pass.

@adamziel
Owner Author

The API needs more thought and polish here, but we're in a good place to start wrapping up v1 for content import and exports in the WordPress Playground repo. We'll keep iterating and rebuilding it there to serve the real use-cases well.

@adamziel
Owner Author

adamziel commented Sep 30, 2024

Zip re-entrancy challenge

Pausing ZIP parsing in the middle of a gzip-compressed file might require a custom GZip deflater and so, at least at first, we may not support resuming ZIP parsing.

GZip has a variable block size and PHP doesn't expose the current block size or boundaries, meaning there's no obvious place where we could split the data. We could work around that by exporting the entire deflater's internal state. This would also solve for the sliding window problem. The nth block may refer to any previous block within a 32kb sliding window. However, that previous block might also refer to something in the previous 32kb. We're effectively maintaining a dictionary that's initialized at byte 0 and keeps evolving throughout the entire stream, and for re-entrancy we'd need to export that dictionary.

Some deflaters cut ties to previous 32kb every now and then by performing an occasional "full flush". This would reduce the paused context size.
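
A small sketch of why a full flush helps, using PHP's incremental zlib functions and raw DEFLATE (the encoding ZIP entries use). It assumes the zlib extension is loaded, and the payloads are made up for illustration:

```php
<?php
$first  = str_repeat( 'first post content. ', 50 );
$second = str_repeat( 'second post content. ', 50 );

// Compress two pieces with a full flush between them. The flush byte-aligns
// the output and resets the compressor's dictionary.
$deflater     = deflate_init( ZLIB_ENCODING_RAW );
$before_flush = deflate_add( $deflater, $first, ZLIB_FULL_FLUSH );
$after_flush  = deflate_add( $deflater, $second, ZLIB_FINISH );

// A brand new inflater can start at the flush point without ever seeing the
// bytes (or the sliding window) that came before it.
$inflater = inflate_init( ZLIB_ENCODING_RAW );
$resumed  = inflate_add( $inflater, $after_flush );
```

So for archives written this way, the paused context could shrink to "the byte offset of the last full flush". The catch is that nothing in the compressed stream is guaranteed to mark those points, and most ZIP writers may never emit them.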

Local ZIP file re-entrancy

PHP has a set of functions called gzopen and gzseek that could potentially be shoehorned into scanning to a specific offset in a ZIP archive. This would require a direct access to $fp which means we'd need a specialized LocalZipFileStream that sources data from a local path. This would unlock:

  • Importing these enterprise-grade 1TB WXR files.
  • Importing remote WXR files without streaming – we'd have to download them first unless we have a Ranges-based re-entrant HTTPClient, which might actually be easy.

WXR + re-entrancy next steps

It seems like the pause()/resume() interface I've explored in this PR would nicely support the basic reentrancy scenarios such as splitting large imports into multiple batches, or recovering from importing errors. Let's keep that entire architecture open to changes and even complete rewrites as we find out more as we use it to solve specific problems. Meanwhile, for WXR imports, let's proceed as follows:

  1. Import WXR files using a re-entrant Local WXR file > XML > WordPress pipe
  2. Import WXR files using a specialized re-entrant Local ZIP file > XML > WordPress pipe
  3. Import WXR files using generalized re-entrant HTTP > ZIP > XML > WordPress pipe
  4. Import WXR files using generalized re-entrant HTTP > partial ZIP > XML > WordPress pipe that can request, say, two specific files from a 1,000,000-file archive

@adamziel
Owner Author

adamziel commented Sep 30, 2024

Doodling - this is probably all a disaster.

$pipeline->add( 'http', $client );
$pipeline->add( 'zip', $zip_decoder );
$pipeline->add( 'xml', $xml_processor );

$xml_processor->auto_feeder = array( $zip_decoder, 'read_chunk' );
$zip_decoder->auto_feeder   = array( $client, 'next_file' );

$client->new_item = fn ( $filename, $chunk ) => $zip_decoder->new_stream( $chunk );
$zip_decoder->new_item = fn ( $filename, $chunk ) => $xml_processor->new_stream( $chunk );

while ( $pipeline->keep_going() ) {
	if ( $zip_decoder->get_file_path() !== 'export.xml' ) {
		$zip_decoder->next_file();
		continue;
	}

	if ( ! $xml_processor->next_token() ) {
		wp_insert_post( $post );
		continue;
	}

	$post  = new WP_Post();
	$token = $xml_processor->get_token_name();
	…
}

so maybe this more or less mirrors work you did in the IByteStream or pipes work.

@dmsnell it's not too different from the current proposal in this PR:

$pipeline = new StreamChain(
    [
        'http' => HTTP_Client::stream([
            new Request('http://127.0.0.1:9864/export.wxr.zip'),
            // new Request('http://127.0.0.1:9864/export.wxr.zip'),
            // Bad request, will fail:
            new Request('http://127.0.0.1:9865')
        ]),
        'zip' => ZIP_Reader::stream(),
        Byte_Stream::map(function($bytes, $context) {
            if($context['zip']->get_file_id() === 'export.wxr') {
                $context['zip']->skip_file();
                return null;
            }
            return $bytes;
        }),
        'xml' => XML_Processor::stream(function () { }),
        Byte_Stream::map(function($bytes) { return strtoupper($bytes); }),
    ]
);

foreach($pipeline as $chunk) {
	$post = new WP_Post();
	// ...
}

With a bit of augmentation, we could move $context['zip']->skip_file(); into the foreach() loop, but overall we're in a very similar place.

Can we find a simple expression of pipe events without requiring the creation of new classes and without exposing all of the nitty-gritty internals?

Note your example above involves the same number of classes as this PR. There's a class to represent the Pipeline, there's one class per decoder, it seems like there's a class to represent the stream.

@adamziel

In b7102b7 I've prototyped a re-entrant ZipStreamReaderLocal. I initially tried implementing it via PHP stream filters, but every time I called stream_filter_remove(), the underlying $fp wouldn't return any bytes on subsequent fread() calls, so I fell back to "manual" inflate_init(), inflate_add() etc.
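
As a rough illustration of the "manual" approach (not the code from b7102b7), this sketch inflates a raw DEFLATE stream chunk by chunk with inflate_init() and inflate_add(). The sample data and the 8-byte chunk size are made up, and the zlib extension is assumed to be available:

```php
<?php
// ZIP entries stored with the "deflate" method contain a raw DEFLATE stream,
// which is the same format gzdeflate() produces.
$original   = str_repeat( "<item><title>Hello</title></item>\n", 200 );
$compressed = gzdeflate( $original );

$inflater = inflate_init( ZLIB_ENCODING_RAW );
$output   = '';

// Feed the compressed bytes in small chunks, as if they arrived from a stream.
foreach ( str_split( $compressed, 8 ) as $chunk ) {
    $inflated = inflate_add( $inflater, $chunk, ZLIB_SYNC_FLUSH );
    if ( false === $inflated ) {
        throw new RuntimeException( 'Could not inflate the chunk' );
    }
    // Each call returns whatever could be decoded so far, possibly nothing.
    $output .= $inflated;
}
```

The inflate context keeps its own state between calls, so the reader only has to remember which compressed bytes it has already fed in.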

There are a few rough edges to polish, e.g. the DemultiplexerStream doesn't understand that the stream has ended. Overall it works pretty well, though, and it seems like we can start with Local ZIP > XML on day 1!

Thinking about the HTTP > ZIP stream...

  • HTTPClient > local buffer file > ZipStreamReaderLocal should be sufficient for very simple top-to-bottom scanning scenarios,
  • A dedicated HttpZipReader might be needed for anything more complex than that. Parsing a ZIP file might require multiple random access streams, e.g. End of Central Directory > Central Directory > Read 10 files at different offsets > go back to Central Directory > Read 4 more files. Something needs to keep track of the current parsing stage, any file index built up, what can be parallelized and what can't, and I think that's way beyond what I meant for the generic StreamChain class. A dedicated HttpZipReader could get arbitrarily fancy with its pause() and resume() methods, too.
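
To make the random access point concrete, here's a sketch of the first hop: decoding the End of Central Directory record from the tail of an archive and deriving the byte range of the Central Directory. The function name and the sample numbers are hypothetical, and the backwards signature search is a simplification that ignores Zip64 and signatures occurring inside an archive comment:

```php
<?php
// Finds and decodes the 22-byte End of Central Directory (EOCD) record
// in the last bytes of a ZIP archive. Returns false if it isn't there.
function find_end_of_central_directory( $tail_bytes ) {
    // The record starts with the signature "PK\x05\x06" and sits at the
    // very end of the archive, so search backwards.
    $position = strrpos( $tail_bytes, "PK\x05\x06" );
    if ( false === $position || strlen( $tail_bytes ) - $position < 22 ) {
        return false;
    }
    return unpack(
        'Vsignature/vdisk_number/vcentral_directory_disk/ventries_on_disk/' .
        'ventries_total/Vcentral_directory_size/Vcentral_directory_offset/vcomment_length',
        substr( $tail_bytes, $position, 22 )
    );
}

// Fake archive tail: 40 filler bytes followed by an EOCD record describing
// 3 entries and a 150-byte Central Directory starting at offset 1000.
$eocd   = pack( 'VvvvvVVv', 0x06054b50, 0, 0, 3, 3, 150, 1000, 0 );
$record = find_end_of_central_directory( str_repeat( "\0", 40 ) . $eocd );

// A remote reader could now request only the Central Directory bytes.
$range_header = sprintf(
    'Range: bytes=%d-%d',
    $record['central_directory_offset'],
    $record['central_directory_offset'] + $record['central_directory_size'] - 1
);
```

Every further hop (Central Directory entries, then individual file offsets) would produce more ranges like this one, which is the bookkeeping a dedicated HttpZipReader would own.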

@adamziel

adamziel commented Sep 30, 2024

The last blocking problem with the API design

Doodling on processing zipped WXR files, I found myself writing this code:

$chain = new StreamChain(
    [
        'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
        'xml' => XML_Processor::stream(function ($processor) {
            $breadcrumbs = $processor->get_breadcrumbs();
            if (
                '#cdata-section' === $processor->get_token_type() &&
                end($breadcrumbs) === 'content:encoded'
            ) {
                echo '<content:encoded>'.substr(str_replace("\n", "", $processor->get_modifiable_text()), 0, 100)."...</content:encoded>\n\n";
            }
        }),
    ]
);

foreach($chain as $chunk) {
	echo $chunk->get_bytes();
}

This feels weird! The StreamChain only knows how to move bytes around and will not output XML tags by design. This is great for multiple decoding stages, but it's quite inconvenient for working with that final $xml_processor instance meant to extract the import data.

Encoding pull parser semantics into the system would make this feel a lot more natural:

$pipeline = new StreamChain([ 
   'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
   'xml' => WP_XML_Processor::consume(),
]);
while($pipeline->keep_going()) {
	if($pipeline['zip']->get_file_extension() !== '.wxr') {
		$pipeline['zip']->next_file();
		continue;
	}

	$processor = $pipeline['xml']->get_processor();
	// next_tag() automatically pulls more data from the "zip" stage
	// when the current buffer is exhausted
	while($processor->next_tag()) {

	}
}

The problem is, the inner while() loop would block the entire processing pipeline until export.wxr.zip is exhausted. This isn't a big deal for processing a single file, but it would be problematic if we requested 3 zip files over HTTP in parallel.
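
To make the blocking concern concrete, here's a toy sketch of the alternative scheduling: each stream handles one chunk and then hands control to the next stream, so a slow or long download never starves the others. The fake_download() generator is a hypothetical stand-in for a non-blocking HTTP download:

```php
<?php
// Hypothetical stand-in for a non-blocking download that yields
// whatever chunks have arrived so far.
function fake_download( array $chunks ) {
    foreach ( $chunks as $chunk ) {
        yield $chunk;
    }
}

$downloads = array(
    fake_download( array( 'a1', 'a2', 'a3' ) ),
    fake_download( array( 'b1' ) ),
    fake_download( array( 'c1', 'c2' ) ),
);

$processing_order = array();
while ( count( $downloads ) > 0 ) {
    foreach ( $downloads as $index => $download ) {
        if ( ! $download->valid() ) {
            // This stream is finished, stop scheduling it.
            unset( $downloads[ $index ] );
            continue;
        }
        // Handle exactly one chunk, then move on to the next stream.
        $processing_order[] = $download->current();
        $download->next();
    }
}
// $processing_order is now: a1, b1, c1, a2, c2, a3
```

This only works if the per-chunk handler can stop at any byte boundary and pick up later.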

The only solution I can think of for the parallelization case is making the import process re-entrant. Not only that, but we'd need to be ready for a context switch at any point in time – we might run out of data 30 times before processing a single post. The code would look something like this:

$pipeline = new StreamChain([ 
   'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
   'xml' => WP_XML_Processor::consume(),
]);
while($pipeline->keep_going()) {
	if($pipeline['zip']->get_file_extension() !== '.wxr') {
		$pipeline['zip']->next_file();
		continue;
	}

	$processor = $pipeline['xml']->get_processor();

	if(!$pipeline['wxr_import']->state) {
		$pipeline['wxr_import']->state = '#scanning-for-post';
	}

	// next_token() doesn't pull anything automatically. It only works with the 
	// information it has available at a moment.
	while($processor->next_token()) {
		if($pipeline['wxr_import']->state === '#scanning-for-post') {
			if(
				$processor->get_tag() === 'item' &&
				$processor->breadcrumbs_match('item')
			) {
				$pipeline['wxr_import']->state = '#post';	
				$pipeline['wxr_import']->post = array();
			}
		} else if($pipeline['wxr_import']->state === '#post') {
			if ( 
				$processor->breadcrumbs_match('content:encoded') &&
				$processor->get_token_type() === '#cdata-section'
			) {
				$pipeline['wxr_import']->post['post_content'] = $processor->get_modifiable_text();
			} // else if ( ... ) { ... }
		}
	}
}

Doesn't it look like another stateful streaming processor? This makes me think the pipe could perhaps look as follows:

$pipeline = new StreamChain([ 
   'zip' => ZIP_Reader_Local::stream('./export.wxr.zip'),
   'wxr' => new WP_WXR_Stream_Importer()
]);

while($pipeline->keep_going()) {
	$paused_state = $pipeline->pause();
	// ...
}

// or:

$importer = new StreamChain([
	HTTP_Client::stream(
		'https://mysite.com/export-1.wxr',
		'https://mysite.com/export-2.wxr',
	),
	new WP_WXR_Stream_Importer()
]);
while($importer->import_next_entity()) {
	$paused_state = $importer->pause();
	// ...
}
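
For illustration, a toy version of such a stateful importer could look like the sketch below. Toy_Title_Importer is hypothetical and only extracts `<title>` elements with strpos(); a real WP_WXR_Stream_Importer would sit on top of the XML processor. The point is that running out of data mid-element is a normal return value, and that the whole state can be paused and resumed:

```php
<?php
// Hypothetical toy importer: collects <title> values from a byte stream.
class Toy_Title_Importer {
    private $buffer = '';
    private $titles = array();

    public function append_bytes( $bytes ) {
        $this->buffer .= $bytes;
    }

    // Consumes one complete <title> element. Returns false when the buffered
    // data ends mid-element, meaning "call again after more bytes arrive".
    public function import_next_entity() {
        $start = strpos( $this->buffer, '<title>' );
        if ( false === $start ) {
            return false;
        }
        $end = strpos( $this->buffer, '</title>', $start );
        if ( false === $end ) {
            return false;
        }
        // 7 and 8 are the lengths of "<title>" and "</title>".
        $this->titles[] = substr( $this->buffer, $start + 7, $end - $start - 7 );
        $this->buffer   = substr( $this->buffer, $end + 8 );
        return true;
    }

    public function pause() {
        return array( 'buffer' => $this->buffer, 'titles' => $this->titles );
    }

    public static function resume( array $state ) {
        $importer         = new self();
        $importer->buffer = $state['buffer'];
        $importer->titles = $state['titles'];
        return $importer;
    }

    public function get_titles() {
        return $this->titles;
    }
}

// The first chunk ends in the middle of the second element.
$importer = new Toy_Title_Importer();
$importer->append_bytes( '<title>Hello</title><title>Wor' );
while ( $importer->import_next_entity() ) {
    // One entity imported per iteration.
}
$paused_state = $importer->pause();

// Later, possibly in another request: resume and feed the remaining bytes.
$importer = Toy_Title_Importer::resume( $paused_state );
$importer->append_bytes( 'ld</title>' );
while ( $importer->import_next_entity() ) {
    // One entity imported per iteration.
}
```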

I'm now having second thoughts about the StreamChain class. Do we actually need one? A two-element StreamChain seems like overkill.

On the upside, it centralizes the stream state management logic, cannot be extended with new streams after being declared, and it frees each stream from implementing a method like pipeTo(). Furthermore, it doesn't really contain two elements. The ZIP stream is also a Demultiplexer that automatically connects each found file to a fresh WXR stream.

On the downside, the developer in me would rather use this API:

$pipeline = Zip_Reader::from_local_file('./export.wxr.zip')->connect_to(new WXR_Importer());
while($pipeline->keep_going()) {
	// ... twiddle our thumbs ...
}

$pipeline_state = $pipeline->pause();

// ... later ...

$pipeline = Zip_Reader::from_local_file('./export.wxr.zip')->connect_to(new WXR_Importer());
$pipeline->resume($pipeline_state);

What I don't like about it is that each stream class would have to implement a method such as connect_to. And what would connect_to return? Most likely, a Pipeline/StreamChain instance. Perhaps differences between the two APIs are superficial then and amount to a helper method?
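
One way to picture the "helper method" reading is the sketch below, where connect_to() lives in a shared trait and simply builds the same chain object the explicit constructor would. All class names here are hypothetical toys, not the classes from this PR:

```php
<?php
// Hypothetical toy chain: just remembers its stages in order.
class Toy_Stream_Chain {
    public $stages;

    public function __construct( array $stages ) {
        $this->stages = $stages;
    }
}

// The fluent API as a thin helper: stream classes opt in with one line
// instead of each implementing its own connect_to().
trait Toy_Connectable {
    public function connect_to( $next_stage ) {
        return new Toy_Stream_Chain( array( $this, $next_stage ) );
    }
}

class Toy_Zip_Reader {
    use Toy_Connectable;
}

class Toy_Wxr_Importer {
}

// Both spellings produce an equivalent chain.
$explicit = new Toy_Stream_Chain( array( new Toy_Zip_Reader(), new Toy_Wxr_Importer() ) );
$fluent   = ( new Toy_Zip_Reader() )->connect_to( new Toy_Wxr_Importer() );
```

Under that assumption the choice between the two APIs is mostly about ergonomics, since the state management still lives in the chain object.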

@adamziel

adamziel commented Sep 30, 2024

A potential pivot away from pipelines?

Uh-oh:

  • I'm no longer convinced encoding HTTP > ZIP > XML as a three-element pipe is practical. HTTP and ZIP are tightly coupled and need to be in a two-way feedback loop.
  • An HttpClient manages a bunch of streams and stream-like state transitions internally and, again, relying on a pipe wouldn't be that practical.

This wasn't clear when I focused on rewriting the URLs in the WXR file, but became apparent when I started exploring an importer.

This makes me question other use-cases discussed in this PR. Do we actually need to build arbitrary pipes? Perhaps we'll only ever work with two streams, like a data source and a data target, each of them potentially being a composition of two streams in itself? In that scenario, we'd have specialized classes such as ZipFromFile, ZipFromHttp etc. and we wouldn't need any pipes.

This work is now unblocked, let's start putting the code explored in this PR to use in Playground

Let's stop hypothesizing and start bringing the basic building blocks (URL parser, XML parser etc) into Playground to use them for feature development. This should reveal much better answers about the API design than going through more thinking exercises here.

adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 14, 2024
…ools (#1888)

Let's officially kick off [the Data
Liberation](https://wordpress.org/data-liberation/) efforts under the
Playground umbrella and unlock powerful new use cases for WordPress.

## Rationale

### Why work on Data Liberation?

WordPress core _really_ needs reliable data migration tools. There's
just no reliable, free, open source solution for:

-   Content import and export
-   Site import and export
- Site transfer and bulk transfers, e.g. mass WordPress -> WordPress, or
Tumblr -> WordPress
-   Site-to-site synchronization

Yes, there's the WXR content export. However, it won't help you backup a
photography blog full of media files, plugins, API integrations, and
custom tables. There are paid products out there, but nothing in core.

At the same time, so many Playground use-cases are **all about moving
your data**. Exporting your site as a zip archive, migrating between
hosts with the [Data Liberation browser
extension](https://github.com/WordPress/try-wordpress/), creating
interactive tutorials and showcasing beautiful sites using [the
Playground
block](https://wordpress.org/plugins/interactive-code-block/),
previewing Pull Requests, building new themes, and [editing
documentation](#1524)
are just the tip of the iceberg.

### Why do the existing data migration tools fall short?

Moving data around seems easy, but it's a complex problem – consider
migrating links.

Imagine you're moving a site from `https://my-old-site.com` to
`https://my-new-site.com/blog/`. If you just moved the posts, all the
links would still point to the old domain, so you'll need an importer
that can adjust all the URLs in your entire database. However, typical
tools like `preg_replace` or `wp search-replace` can only replace some
URLs correctly. They won't reliably adjust deeply encoded data, such as
a URL inside JSON inside an HTML comment inside a WXR export.

The only way to perform a reliable replacement here is to carefully
parse each and every data format and replace the relevant parts of the
URL at the bottom of it. That requires four parsers: an XML parser, an
HTML parser, a JSON parser, and a WHATWG URL parser. Most of those tools
don't exist in PHP. PHP provides `json_decode()`, which isn't free of
issues, and that's it. You can't even rely on DOMDocument to parse XML
because of its limited availability and non-streaming nature.
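
Here's a small sketch of the JSON layer alone, assuming the block attributes were encoded with PHP's default `json_encode()` settings. The block markup and URLs are made up, and the regular expression and inner `str_replace()` are simplifications standing in for real block markup and URL parsers:

```php
<?php
// json_encode() escapes slashes by default, so the URL does not appear
// verbatim in the stored markup.
$attributes = json_encode( array( 'url' => 'https://my-old-site.com/photo.jpg' ) );
$markup     = '<!-- wp:image ' . $attributes . ' -->';
// $markup: <!-- wp:image {"url":"https:\/\/my-old-site.com\/photo.jpg"} -->

// A plain string replacement silently finds nothing to replace.
$naive = str_replace( 'https://my-old-site.com', 'https://my-new-site.com/blog', $markup );

// Decoding the JSON layer first exposes the actual URL.
preg_match( '/^<!-- wp:image (\{.*\}) -->$/', $markup, $matches );
$decoded        = json_decode( $matches[1], true );
$decoded['url'] = str_replace(
    'https://my-old-site.com',
    'https://my-new-site.com/blog',
    $decoded['url']
);
$rewritten = '<!-- wp:image ' . json_encode( $decoded ) . ' -->';
```

The same decode, rewrite, re-encode cycle has to happen at every layer, which is why each format needs its own parser.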

### Why build this in Playground?

Playground gives us a lot for free:

- **Customer-centric environment.** The need to move data around is so
natural in Playground. So many people asked for reliable WXR imports,
site exports, synchronization with git, and the ability to share their
Playground. Playground allows us to get active users and customer
feedback every step of the way.
- **Free QA**. Anyone can share a testing link and easily report any
problems they found. Playground is the perfect environment to get ample,
fast-moving feedback.
- **Space to mature the API**. Playground doesn’t provide the same
backward compatibility guarantees as WordPress core. It's easy to
prototype a parser, find a use case where the design breaks down, and
start over.
- **Control over the runtime.** Playground can lean on PHP extensions to
validate our ideas, test them on simulated slow hardware, and ship
them to a tablet to see how they do when the app goes into the background
and the internet is flaky.

Playground enables methodically building spec-compliant software to
create the solid foundation WordPress needs.

## The way there

### What needs to be built?

There's been a lot of [gathering information, ideas, and
tools](https://core.trac.wordpress.org/ticket/60375). This writeup is
based on 10 years' worth of site transfer problems, WordPress
synchronization plugins, chats with developers, analyzing existing
codebases, past attempts at data importing, non-WordPress tools,
discussions, and more.

WordPress needs parsers. Not just any parsers: they must be streaming,
re-entrant, fast, standards-compliant, and tested using a large body of
possible inputs. The data synchronization tools must account for data
conflicts, WordPress plugins, invalid inputs, and unexpected power
outages. The errors must be non-fatal, retryable, and allow manual
resolution by the user. No data loss, ever. The transfer target site
should be usable as early as possible and show no broken links or images
during the transfer. That's the gist of it.

A number of parsers have already been prototyped. There's even [a draft
of a reliable URL rewriting
library](https://github.com/adamziel/site-transfer-protocol). Here's a
bunch of early drafts of specific streaming use-cases:

- [A URL
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_URL.php)
- [A block markup
parser](https://github.com/adamziel/site-transfer-protocol/blob/trunk/src/WP_Block_Markup_Processor.php)
- [An XML
parser](WordPress/wordpress-develop#6713), also
explored by @dmsnell and @jonsurrell
- [A Zip archive
parser](https://github.com/WordPress/blueprints-library/blob/87afea1f9a244062a14aeff3949aae054bf74b70/src/WordPress/Zip/ZipStreamReader.php)
- [A multihandle HTTP
client](https://github.com/WordPress/blueprints-library/blob/trunk/src/WordPress/AsyncHttp/Client.php)
without curl dependency
- [A MySQL query
parser](WordPress/sqlite-database-integration#157)
started by @zieladam and now explored by @JanJakes
- [A stream chaining
API](adamziel/wxr-normalize#1) to connect all
these pieces

On top of that, WordPress core now has an HTML parser, and @dmsnell has
been exploring a
[UTF-8](WordPress/wordpress-develop#6883)
decoder that would enable fast and regex-less URL detection in long
data streams.

There are still technical challenges to figure out, such as how to pause
and resume the data streaming. As this work progresses, you'll start
seeing incremental improvements in Playground. One possible roadmap is
shipping a reliable content importer, then a reliable site zip importer
and exporter, then cloning a site, and then extending towards
full-featured site transfers and synchronization.

### How soon can it be shipped?

Three points:

* No dates.
* Let's keep building on top of prior work and ship meaningful user
flows often.
* Let's not ship any stable public APIs until the design is mature.

For example, the [Try WordPress
extension](https://github.com/WordPress/try-wordpress/) can already give
you a Playground site, even if you cannot migrate it to another
WordPress site just yet.

**Shipping matters. At the same time, taking the time required to build
rigorous, reliable software is also important**. An occasional early
version of this or that parser may be shipped once its architecture
seems alright, but the architecture and the stable API won't be rushed.
That would jeopardize the entire project. This project aims for a solid
design that will serve WordPress for years.

The progress will be communicated in the open, while maintaining
feedback loops and using the work to ship new Playground features.

## Plans, goals, details

### Next steps

Let's start with building a tool to export and import _a single
WordPress post_. Yes! Just one post. The tricky part is that all the
URLs will have to be preserved.

From there, let's explore the breadth and depth of the problem, e.g.:

* Rewriting links
* Frontloading media files
* Preserving dependent data (post meta, custom tables, etc.)
* Exporting/importing a WXR file using the above
* Pausing and resuming a WXR export/import
* Exporting/importing a full WordPress site as a zip file

Ideally, each milestone will result in a small, readily reusable tool.
For example "paste WordPress post, paste a new site URL, get your post
migrated".

There's an ample body of existing work. Let's keep the existing
codebases (e.g. WXR, site migration plugins) and discussions open in a
browser window during this work. Let's involve the authors of these
tools, ask them questions, ask them for reviews. Let's publish the
progress and the challenges encountered on the way.

### Design goals

- **Fault tolerance** – all the data tools should be able to start,
stop, resume, tolerate errors, accept alternative data from the user,
e.g. media files, posts etc.
- **WordPress-first** – let's build everything in PHP using WordPress
naming conventions.
- **Compatibility** – Every WordPress version, PHP version (7.2+, CLI),
and Playground runtime (web, CLI, browser extension, desktop app, CI
etc.) should be supported.
- **Dependency-free** – No PHP extensions required. If this means we
can't rely on curl, then let's build an HTTP client from scratch. Only
minimal Composer dependencies allowed, and only when absolutely
necessary.
- **Simplicity** – no advanced OOP patterns. Our role model is
[WP_HTML_Processor](https://developer.wordpress.org/reference/classes/wp_html_processor/)
– a **single class** that can parse nearly all HTML. There's no "Node",
"Element", "Attribute" classes etc. Let's aim for the same here.
- **Extensibility** – Playground should be able to benefit from, say,
WASM markdown parser even if core WordPress cannot.
- **Reusability** – Each library should be framework-agnostic and usable
outside of WordPress. We should be able to use them in WordPress core,
WP-CLI, Blueprint steps, Drupal, Symfony bundles, non-WordPress tools
like https://github.com/adamziel/playground-content-converters, and even
in Next.js via PHP.wasm.


### Prior art

Here are a few codebases that need to be reviewed at minimum, and brought
into this project at maximum:

- URL rewriter: https://github.com/adamziel/site-transfer-protocol
- URL detector:
WordPress/wordpress-develop#7450
- WXR rewriter: https://github.com/adamziel/wxr-normalize/
- Stream Chain: adamziel/wxr-normalize#1
- WordPress/wordpress-develop#5466
- WordPress/wordpress-develop#6666
- XML parser: WordPress/wordpress-develop#6713
- Streaming PHP parsers:
https://github.com/WordPress/blueprints-library/tree/trunk/src/WordPress
- Zip64 support (in JS ZIP parser):
#1799
- Local Zip file reader in PHP (seeks to central directory, seeks back
as needed):
https://github.com/adamziel/wxr-normalize/blob/rewrite-remote-xml/zip-stream-reader-local.php
- WordPress/wordpress-develop#6883
- Blocky formats – Markdown <-> Block markup WordPress plugin:
https://github.com/dmsnell/blocky-formats
- Sandbox Site plugin that exports and imports WordPress to/from a zip
file:
https://github.com/WordPress/playground-tools/tree/trunk/packages/playground
- WordPress + Playground CLI setup to import, convert, and export
data: https://github.com/adamziel/playground-content-converters
- Markdown -> Playground workflow _and WordPress plugins_:
https://github.com/adamziel/playground-docs-workflow
- _Edit Visually_ browser extension for bringing data in and out of
Playground: WordPress/playground-tools#298
- _Try WordPress_ browser extension that imports existing WordPress and
non-WordPress sites to Playground:
https://github.com/WordPress/try-wordpress/
- Humanmade WXR importer designed by @rmccue:
https://github.com/humanmade/WordPress-Importer

### Related resources

- [Site transfer protocol](https://core.trac.wordpress.org/ticket/60375)
- [Existing data migration
plugins](https://core.trac.wordpress.org/ticket/60375#comment:32)
- WordPress/data-liberation#74
- #1524
- WordPress/gutenberg#65012

### The project structure

The structure of the `data-liberation` package is an open exploration
and will change multiple times. Here's what it aims to achieve.

**Structural goals:**

- Publish each library as a separate Composer package
- Publish each WordPress plugin separately (perhaps a single plugin
would be the most useful?)
- No duplication of libraries between WordPress plugins
- Easy installation in Playground via Blueprints, e.g. no `composer
install` required
- Compatibility with different Playground runtimes (web, CLI) and
versions of WordPress and PHP

**Logical parts**

- First-party libraries, e.g. streaming parsers
- WordPress plugins where those libraries are used, e.g. content
importers
- Third party libraries installed via Composer, e.g. a URL parser

**Ideas:**

- Use Composer dependency graph to automatically resolve dependencies
between libraries and WordPress plugins
- or use WordPress "required plugins" feature to manage dependencies
- or use Blueprints to manage dependencies


cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame
@ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera
@swissspidy @eliot-akira @sirreal @obenland @rralian @ockham
@youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski
@palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap
@michalczaplinski @danluu
adamziel added a commit to WordPress/wordpress-playground that referenced this pull request Oct 28, 2024
A part of #1894.
Follows up on
#1893.

This PR brings in a few more PHP APIs that were initially explored
outside of Playground so that they can be incubated in Playground. See
the linked descriptions for more details about each API:

* XML Processor from
WordPress/wordpress-develop#6713
* Stream chain from adamziel/wxr-normalize#1
* A draft of a WXR URL Rewriter class capable of rewriting URLs in WXR
files

## Testing instructions

* Confirm the PHPUnit tests pass in CI
* Confirm the test suite looks reasonable
* That's it for now! It's all new code that's not actually used anywhere
in Playground yet. I just want to merge it to keep iterating and
improving.
adamziel added a commit to WordPress/blueprints-library that referenced this pull request Oct 30, 2024
This new ZipStreamReader opens its own file handles which means
it can be paused, resumed, and is more reliable. The original
implementation was built as a part of adamziel/wxr-normalize#1

This is all new code so there are no testing instructions. Eventually
this implementation will replace the existing ZipStreamReader.
3 participants