[Data Liberation] Re-entrant WP_Stream_Importer #2004
Exploratory PR to keep track of the import state so that, upon crash, the next run may seamlessly resume where the previous one left off.
cc @zaerl @brandonpayton @dmsnell @sirreal @ellatrix – any thoughts or comments?
I've given this a once-over and it seems well thought out. A lot of this (streaming, WXR, etc.) is outside my day-to-day, so I'm sorry I don't have a lot of insights to share.
I like the It is ok now; there is always room for improvement. Great work. SQLite has relatively strict anomaly testing that a modern PHP app should follow. While the I/O one is pretty much impossible to trigger on a server not made of wood, the other two are more frequent than one thinks with a default
Yes! I'd like to keep tabs on the download progress, too. It seems like a cursor problem, too, doesn't it? As in, downloading a 300GB file might take multiple sessions, and the ability to pause and resume it matters.
Thanks!
Oh, that's a great idea in the context of importers! Would you mind starting a new issue to track that and link to it in the tracking issue? #1894
Adds re-entrancy semantics to the importer API to enable pausing and resuming data imports:
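As a rough illustration of what pausing and resuming looks like from the caller's side, here is a minimal sketch. It does not use the real WP_Stream_Importer: the Sketch_Resumable_Importer class, its next_step() method, the JSON cursor format, and the step budget are all assumptions made up for this example. Only the get_reentrancy_cursor() name and the idea of passing a cursor when creating a new instance come from this description.

```php
<?php
// Hypothetical stand-in, NOT WP_Stream_Importer. Entities are plain strings.
final class Sketch_Resumable_Importer {
    private array $entities;
    private int $next_index;
    public array $imported = array();

    private function __construct( array $entities, int $next_index ) {
        $this->entities   = $entities;
        $this->next_index = $next_index;
    }

    // A cursor can only be supplied when a new instance is created.
    public static function create( array $entities, ?string $cursor = null ): self {
        $next_index = 0;
        if ( null !== $cursor ) {
            // Assumed cursor format: a JSON object holding the next index.
            $state      = json_decode( $cursor, true );
            $next_index = (int) $state['next_index'];
        }
        return new self( $entities, $next_index );
    }

    // Imports one entity; returns false once nothing is left.
    public function next_step(): bool {
        if ( $this->next_index >= count( $this->entities ) ) {
            return false;
        }
        $this->imported[] = $this->entities[ $this->next_index ];
        ++$this->next_index;
        return true;
    }

    // Opaque string that describes where the next run should start.
    public function get_reentrancy_cursor(): string {
        return json_encode( array( 'next_index' => $this->next_index ) );
    }
}

$entities = array( 'post:1', 'post:2', 'attachment:3', 'comment:4' );

// First run: pretend the request only has budget for two steps.
$importer = Sketch_Resumable_Importer::create( $entities );
$budget   = 2;
while ( $budget-- > 0 && $importer->next_step() ) {
    // A real run would write the entity to the site here.
}
$first_run = $importer->imported;
$cursor    = $importer->get_reentrancy_cursor();

// Second run: a fresh instance resumes from the saved cursor.
$importer = Sketch_Resumable_Importer::create( $entities, $cursor );
while ( $importer->next_step() ) {
    // Keep going until the remaining entities are imported.
}
$second_run = $importer->imported;
```

The caller decides where the cursor lives between runs; this sketch keeps it in a variable, but any storage that can hold a string would do.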
Motivation
Most WordPress importers fail because they assume a happy path: we have enough memory, we have enough time, all the assets will be available, and so on.
In Data Liberation, I want to assume the worst possible path through thorny quicksand in full sun with venomous wasps stinging us. We'll run out of memory after the first post, all the assets will be 40GB large, and half of them won't be possible to download.
Pausing, resuming, and recovering from errors should be a basic primitive of the system. The first step to supporting that is the ability to suspend the import operation and restart it from the same spot later on. And that's exactly what this PR adds.
Re-entrancy interface
This PR doesn't store any information in the database yet. It merely adds the plumbing for pausing and resuming the WP_Stream_Importer instance.
WP_Byte_Stream re-entrancy
The WP_Byte_Stream interface directly exposes tell(): int and seek($offset) methods. There's no need for anything fancier than that: we're only interested in an offset in the stream. It seems to work well for simple byte streams.
My only worry is that we may need to revisit this interface later on to support fetching fixed-size chunks from large files using byte ranges.
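To make the tell()/seek() contract concrete, here is a minimal sketch over an in-memory string. Sketch_Memory_Byte_Stream and its read() method are hypothetical and exist only for this example; the real WP_Byte_Stream implementations in this PR may look quite different. The only parts taken from the description are the two method names and the idea that the saved state is a plain integer offset.

```php
<?php
// Hypothetical stand-in, NOT the WP_Byte_Stream class from this PR.
final class Sketch_Memory_Byte_Stream {
    private string $bytes;
    private int $offset = 0;

    public function __construct( string $bytes ) {
        $this->bytes = $bytes;
    }

    // Reads up to $length bytes and advances the offset.
    public function read( int $length ): string {
        $chunk         = substr( $this->bytes, $this->offset, $length );
        $this->offset += strlen( $chunk );
        return $chunk;
    }

    // Reports the offset of the next unread byte.
    public function tell(): int {
        return $this->offset;
    }

    // Moves to an absolute offset, e.g. one reported earlier by tell().
    public function seek( int $offset ): void {
        if ( $offset < 0 || $offset > strlen( $this->bytes ) ) {
            throw new OutOfRangeException( "Offset $offset is outside the stream." );
        }
        $this->offset = $offset;
    }
}

$data = 'abcdefghij';

// First session: consume four bytes, then remember where we stopped.
$first        = new Sketch_Memory_Byte_Stream( $data );
$head         = $first->read( 4 );
$saved_offset = $first->tell();

// Second session: a fresh stream picks up from the saved offset.
$second = new Sketch_Memory_Byte_Stream( $data );
$second->seek( $saved_offset );
$tail = $second->read( 100 );
```

Because the saved state is a single integer, it would map naturally onto the start of a byte range if the interface is later extended to fetch fixed-size chunks of large files.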
WP_XML_Processor re-entrancy
WP_XML_Processor supports exporting state via:
get_reentrancy_cursor() method
create($xml, $options, $cursor=null).
get_token_byte_offset_in_the_input_stream()
No method in the XML processor API will ever accept the cursor or the byte offset as a way of moving to another location in the document. You can only create a new XML processor at $cursor.
This is a measure to:
seek()-ing. We already have named bookmarks for that.
Usage:
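The pattern described above (export a cursor, then create a new instance at that cursor, with no seek() on a live instance) can be sketched roughly as follows. This is not WP_XML_Processor: Sketch_Token_Reader is a hypothetical toy that splits its input on whitespace instead of parsing XML, its create() signature is simplified, and the JSON cursor format is an assumption. Only the method names get_reentrancy_cursor() and create() mirror the description.

```php
<?php
// Hypothetical toy reader, NOT WP_XML_Processor. It splits the input on
// whitespace; only the cursor handling is the point of the sketch.
final class Sketch_Token_Reader {
    private string $input;
    private int $offset;
    private ?string $token = null;

    private function __construct( string $input, int $offset ) {
        $this->input  = $input;
        $this->offset = $offset;
    }

    // The only entry point that accepts a cursor. There is no seek().
    public static function create( string $input, ?string $cursor = null ): self {
        $offset = 0;
        if ( null !== $cursor ) {
            // Assumed cursor format: a JSON object holding a byte offset.
            $state  = json_decode( $cursor, true );
            $offset = (int) $state['offset'];
        }
        return new self( $input, $offset );
    }

    // Advances to the next whitespace-delimited token, if there is one.
    public function next_token(): bool {
        if ( ! preg_match( '/\S+/', $this->input, $match, PREG_OFFSET_CAPTURE, $this->offset ) ) {
            return false;
        }
        $this->token  = $match[0][0];
        $this->offset = $match[0][1] + strlen( $this->token );
        return true;
    }

    public function get_token(): ?string {
        return $this->token;
    }

    // Opaque string describing where the next read would start.
    public function get_reentrancy_cursor(): string {
        return json_encode( array( 'offset' => $this->offset ) );
    }
}

$input = 'alpha beta gamma delta';

// First session: read two tokens, then export the cursor.
$reader = Sketch_Token_Reader::create( $input );
$reader->next_token();
$reader->next_token();
$cursor = $reader->get_reentrancy_cursor();
unset( $reader ); // Simulate the process ending here.

// Second session: a brand-new reader is created at the cursor.
$resumed   = Sketch_Token_Reader::create( $input, $cursor );
$remaining = array();
while ( $resumed->next_token() ) {
    $remaining[] = $resumed->get_token();
}
```

Keeping the cursor out of every instance method means a live reader can never be moved to an arbitrary position, which is the restriction the paragraph above describes.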
WP_WXR_Reader re-entrancy
The WP_WXR_Reader class uses the same get_reentrancy_cursor() interface as WP_XML_Processor.
WP_Stream_Importer re-entrancy
The WP_Stream_Importer class uses the same get_reentrancy_cursor() interface as WP_XML_Processor. See the example at the top of this description.
Testing instructions
TBD. We don't have a good way of running PHPUnit in the WordPress context yet. @zaerl is working on running the import in the CLI; we may need to wait for that before adding tests to this PR and shipping it.