Kickoff Data Liberation: Let's Build WordPress-first Data Migration Tools #1888
For project updates, see the tracking issue.
Let's officially kick off the Data Liberation efforts under the Playground umbrella and unlock powerful new use cases for WordPress.
Rationale
Why work on Data Liberation?
WordPress core really needs reliable data migration tools. There's just no reliable, free, open source solution for:
Yes, there's the WXR content export. However, it won't help you back up a photography blog full of media files, plugins, API integrations, and custom tables. There are paid products out there, but nothing in core.
At the same time, so many Playground use-cases are all about moving your data. Exporting your site as a zip archive, migrating between hosts with the Data Liberation browser extension, creating interactive tutorials and showcasing beautiful sites using the Playground block, previewing Pull Requests, building new themes, and editing documentation are just the tip of the iceberg.
Why do the existing data migration tools fall short?
Moving data around seems easy, but it's a complex problem – consider migrating links.
Imagine you're moving a site from https://my-old-site.com to https://my-new-site.com/blog/. If you just moved the posts, all the links would still point to the old domain, so you'll need an importer that can adjust all the URLs in your entire database. However, typical tools like preg_replace or wp search-replace can only replace some URLs correctly. They won't reliably adjust deeply encoded data, such as a URL inside JSON inside an HTML comment inside a WXR export.
The only way to perform a reliable replacement here is to carefully parse each and every data format and replace the relevant parts of the URL at the bottom of it. That requires four parsers: an XML parser, an HTML parser, a JSON parser, and a WHATWG URL parser. Most of those tools don't exist in PHP. PHP provides json_encode(), which isn't free of issues, and that's it. You can't even rely on DOMDocument to parse XML because of its limited availability and non-streaming nature.
Why build this in Playground?
Playground gives us a lot for free:
Playground enables methodically building spec-compliant software to create the solid foundation WordPress needs.
The way there
What needs to be built?
A lot of information, ideas, and tools have already been gathered. This writeup is based on 10 years' worth of site transfer problems, WordPress synchronization plugins, chats with developers, analyzing existing codebases, past attempts at data importing, non-WordPress tools, discussions, and more.
WordPress needs parsers. Not just any parsers: they must be streaming, re-entrant, fast, standards-compliant, and tested against a large body of possible inputs. The data synchronization tools must account for data conflicts, WordPress plugins, invalid inputs, and unexpected power outages. The errors must be non-fatal, retryable, and allow manual resolution by the user. No data loss, ever. The transfer target site should be usable as early as possible and show no broken links or images during the transfer. That's the gist of it.
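As a rough illustration of "non-fatal, retryable" errors, here is a minimal Python sketch of an import loop that records failures instead of aborting and can be run again to retry only what failed. Everything in it (ImportReport, import_records, the import_one callback, the record ids) is a hypothetical name made up for this example, not an API from this project or from WordPress.

```python
from dataclasses import dataclass, field


@dataclass
class ImportReport:
    """Hypothetical bookkeeping for one transfer; safe to persist between runs."""
    imported: list = field(default_factory=list)  # ids that succeeded
    failed: dict = field(default_factory=dict)    # id -> last error message


def import_records(records, import_one, report=None):
    """Try every record once. A failure is recorded, never raised."""
    if report is None:
        report = ImportReport()
    for record_id, payload in records.items():
        if record_id in report.imported:
            continue  # finished in an earlier run; do not import twice
        try:
            import_one(payload)
        except Exception as error:
            # Non-fatal: remember the error so the user can retry or resolve it.
            report.failed[record_id] = str(error)
        else:
            report.imported.append(record_id)
            report.failed.pop(record_id, None)
    return report


def demo():
    """Run the loop twice against an importer that fails once for one record."""
    attempts = {}

    def flaky_import(payload):
        attempts[payload] = attempts.get(payload, 0) + 1
        if payload == "b" and attempts[payload] == 1:
            raise ValueError("media download failed")

    records = {"post-1": "a", "post-2": "b"}
    first = import_records(records, flaky_import)
    after_first = (list(first.imported), dict(first.failed))
    second = import_records(records, flaky_import, first)
    return after_first, (second.imported, second.failed), attempts
```

Calling import_records again with the previous report skips what already succeeded, which is one simple way to make a retry safe. Manual resolution by the user would need more than this sketch shows.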
A number of parsers have already been prototyped. There's even a draft of a reliable URL rewriting library. Here's a bunch of early drafts of specific streaming use-cases:
On top of that, WordPress core now has an HTML parser, and @dmsnell has been exploring a UTF-8 decoder that would enable fast and regex-less URL detection in long data streams.
There are still technical challenges to figure out, such as how to pause and resume the data streaming. As this work progresses, you'll start seeing incremental improvements in Playground. One possible roadmap is shipping a reliable content importer, then a reliable site zip importer and exporter, then site cloning, and then extending towards full-featured site transfers and synchronization.
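One way to think about pausing and resuming is to keep all parser state in a small, serializable value. The Python sketch below is a toy and not a real XML parser: it only looks for literal <item>...</item> spans, and the names ItemStream, save, resume, and parse_with_pause are assumptions invented for illustration. It shows the two properties discussed above: input arrives in arbitrary chunks, and the scanner can be stopped and recreated from saved state without losing data.

```python
import json


class ItemStream:
    """Toy re-entrant scanner that emits complete <item>...</item> spans."""
    OPEN, CLOSE = "<item>", "</item>"

    def __init__(self, buffer="", consumed=0):
        self.buffer = buffer      # unprocessed tail of the input seen so far
        self.consumed = consumed  # characters already emitted or skipped

    def feed(self, chunk):
        """Accept the next chunk and return every item completed by it."""
        self.buffer += chunk
        items = []
        while True:
            start = self.buffer.find(self.OPEN)
            end = self.buffer.find(self.CLOSE, start) if start != -1 else -1
            if end == -1:
                break  # no complete item yet; wait for more input
            end += len(self.CLOSE)
            items.append(self.buffer[start:end])
            self.consumed += end
            self.buffer = self.buffer[end:]
        return items

    def save(self):
        """Serialize everything needed to continue later."""
        return json.dumps({"buffer": self.buffer, "consumed": self.consumed})

    @classmethod
    def resume(cls, saved):
        return cls(**json.loads(saved))


def parse_with_pause(document, chunk_size, pause_after):
    """Feed a document in chunks, simulating a restart after pause_after chunks."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    stream, items = ItemStream(), []
    for chunk in chunks[:pause_after]:
        items += stream.feed(chunk)
    stream = ItemStream.resume(stream.save())  # stop, persist, start again
    for chunk in chunks[pause_after:]:
        items += stream.feed(chunk)
    return items
```

A production parser would also need to bound the buffer and track byte offsets into the source file so that the saved state stays small. This sketch keeps the whole unprocessed tail for simplicity.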
How soon can it be shipped?
Three points:
For example, the Try WordPress extension can already give you a Playground site, even if you cannot migrate it to another WordPress site just yet.
Shipping matters. At the same time, taking the time required to build rigorous, reliable software is also important. An occasional early version of this or that parser may be shipped once its architecture seems alright, but the architecture and the stable API won't be rushed. That would jeopardize the entire project. This project aims for a solid design that will serve WordPress for years.
The progress will be communicated in the open, while maintaining feedback loops and using the work to ship new Playground features.
Plans, goals, details
Next steps
Let's start with building a tool to export and import a single WordPress post. Yes! Just one post. The tricky part is that all the URLs will have to be preserved.
From there, let's explore the breadth and depth of the problem, e.g.:
Ideally, each milestone will result in a small, readily reusable tool. For example: "paste a WordPress post, paste a new site URL, get your post migrated".
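To illustrate why even such a small tool needs layered parsing, here is a minimal Python sketch (not PHP, and not code from this project) that rewrites a URL stored inside JSON inside a block comment. The sample markup, the function names, and the BLOCK_COMMENT regex are all assumptions made for this example. The regex is a simplified stand-in for a real HTML parser, and urllib's urlsplit is not a WHATWG URL parser. The slashes in the sample are escaped as "\/", which is what PHP's json_encode() produces by default.

```python
import json
import re
from urllib.parse import urlsplit, urlunsplit

OLD_BASE = "https://my-old-site.com"
NEW_BASE = "https://my-new-site.com/blog"

# Hypothetical post content: JSON attributes inside an HTML comment.
MARKUP = '<!-- wp:image {"url":"https:\\/\\/my-old-site.com\\/uploads\\/cat.jpg"} -->'

# Simplified stand-in for an HTML parser: finds "<!-- wp:name {json} -->".
BLOCK_COMMENT = re.compile(r"<!-- (wp:\S+) (\{.*?\}) -->")


def naive_replace(text):
    """Plain substring replacement; misses URLs whose slashes are escaped."""
    return text.replace(OLD_BASE, NEW_BASE)


def rewrite_url(url):
    """Move a URL from OLD_BASE to NEW_BASE, keeping path, query, and fragment."""
    old, new, parts = urlsplit(OLD_BASE), urlsplit(NEW_BASE), urlsplit(url)
    if (parts.scheme, parts.netloc) != (old.scheme, old.netloc):
        return url  # not on the old site, or not a URL at all
    path = new.path.rstrip("/") + parts.path
    return urlunsplit((new.scheme, new.netloc, path, parts.query, parts.fragment))


def rewrite_json_value(value):
    """Walk decoded JSON and rewrite every string that is an old-site URL."""
    if isinstance(value, str):
        return rewrite_url(value)
    if isinstance(value, list):
        return [rewrite_json_value(item) for item in value]
    if isinstance(value, dict):
        return {key: rewrite_json_value(item) for key, item in value.items()}
    return value


def layered_replace(text):
    """Decode each layer, rewrite the URL at the bottom, then re-encode."""
    def fix(match):
        attrs = rewrite_json_value(json.loads(match.group(2)))
        encoded = json.dumps(attrs, separators=(",", ":"))
        return "<!-- %s %s -->" % (match.group(1), encoded)
    return BLOCK_COMMENT.sub(fix, text)
```

With this input, naive_replace(MARKUP) returns the markup unchanged because the escaped form "https:\/\/my-old-site.com" never matches the search string, while layered_replace(MARKUP) yields a comment pointing at https://my-new-site.com/blog/uploads/cat.jpg. A WXR export would add one more layer (XML) around all of this.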
There's an ample body of existing work. Let's keep the existing codebases (e.g. WXR, site migration plugins) and discussions open in a browser window during this work. Let's involve the authors of these tools, ask them questions, ask them for reviews. Let's publish the progress and the challenges encountered on the way.
Design goals
Prior art
Here are a few codebases that need to be reviewed at minimum, and brought into this project at maximum:
make_clickable() wordpress-develop#7450
sluggify(). wordpress-develop#5466
Related resources
The project structure
The structure of the data-liberation package is an open exploration and will change multiple times. Here's what it aims to achieve.
Structural goals:
composer install required
Logical parts
Ideas:
cc @brandonpayton @bgrgicak @mho22 @griffbrad @akirk @psrpinto @ashfame @ryanwelcher @justintadlock @azaozz @annezazu @mtias @schlessera @swissspidy @eliot-akira @sirreal @obenland @rralian @ockham @youknowriad @ellatrix @mcsf @hellofromtonya @jsnajdr @dawidurbanski @palmiak @JanJakes @luisherranz @naruniec @peterwilsoncc @priethor @zzap @michalczaplinski @danluu