[Data Liberation] First topological sorter draft #2030
base: trunk
 * Quicksort performs badly on already sorted arrays, O(n^2) is the worst case.
 * Let's consider using a different sorting algorithm.
 */
uksort( $elements, $sort_callback );
This requires fitting all $elements into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner rather than later. Is it possible to perform topological sorting at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?
We can save the data in a custom DB table (Jetpack has a custom wp_jetpack_sync_queue table) instead of a simple array in memory. I didn't touch the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. Once the byte offsets are collected, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above, and we are done, even with our new reentrancy model.
When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.
Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.
> It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.
Agree. It's reasonable, let's proceed this way. 👍
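To make the "stream the sorted list" idea concrete, here is a minimal sketch of a generator that yields sorted rows in small batches. Everything in it is an assumption for illustration: `stream_sorted_entities()`, the `$fetch_batch` callable, and the array standing in for the table are hypothetical and not part of this PR. In a real implementation, `$fetch_batch` would presumably wrap a query along the lines of `SELECT ... WHERE sort_order > ? ORDER BY sort_order LIMIT ?`, and `sort_order` is assumed to be unique.

```php
<?php
/**
 * Hypothetical sketch: yield sorted rows batch by batch so the caller
 * never holds the full list in memory.
 *
 * @param callable $fetch_batch function ( int $after, int $limit ): array
 * @param int      $batch_size  Number of rows requested per batch.
 */
function stream_sorted_entities( callable $fetch_batch, int $batch_size = 2 ): Generator {
	$after = -1; // Cursor: the last sort_order that was already yielded.
	while ( true ) {
		$rows = $fetch_batch( $after, $batch_size );
		if ( empty( $rows ) ) {
			return;
		}
		foreach ( $rows as $row ) {
			$after = $row['sort_order'];
			yield $row;
		}
	}
}

// Array-backed stand-in for the sorting table, for demonstration only.
$table = array(
	array( 'sort_order' => 0, 'byte_offset' => 512 ),
	array( 'sort_order' => 1, 'byte_offset' => 128 ),
	array( 'sort_order' => 2, 'byte_offset' => 900 ),
);

$fetch_batch = function ( int $after, int $limit ) use ( $table ): array {
	$matching = array_filter(
		$table,
		function ( $row ) use ( $after ) {
			return $row['sort_order'] > $after;
		}
	);
	return array_slice( array_values( $matching ), 0, $limit );
};

$offsets = array();
foreach ( stream_sorted_entities( $fetch_batch ) as $row ) {
	$offsets[] = $row['byte_offset'];
}
```

The `$after` cursor is the only state the loop needs, so it could plausibly double as the reentrancy cursor mentioned above: persisting it would let a later request resume where the previous one stopped.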
} while ( $importer->advance_to_next_stage() );

$expected_string = '¯\_(ツ)_/¯';
I'm really glad to be seeing multibyte UTF-8 characters in the tests 👍
	$meta_item['comment_id'] = $comment_id;
}

$value = maybe_unserialize( $meta_item['meta_value'] );
Are we unserializing it only for add_comment_meta to serialize it again? Is there any way we could avoid doing that work? I suppose not, since add_comment_meta likely triggers a bunch of hooks – although I wonder if we should disable all the content insertion hooks during the import 🤔
Also, should we worry about __wakeup code execution vulnerabilities here? Or would WordPress call maybe_unserialize on that value after inserting it into the database anyway?
Ideally, WP_IMPORT should block everything not deeply related to the import. I have seen plugins that send an email when a comment is inserted. Imagine what happens when you import millions of comments.
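On the __wakeup question raised above, one option PHP itself offers is the `allowed_classes` option of the built-in `unserialize()`. The sketch below only demonstrates that option in isolation. `Import_Wakeup_Probe` is a hypothetical class made up for the demonstration, and whether the importer could use plain `unserialize()` in place of maybe_unserialize is an open question, not something this PR does.

```php
<?php
// Hypothetical probe class: records whether __wakeup() ever ran.
class Import_Wakeup_Probe {
	public static $woke = false;

	public function __wakeup() {
		self::$woke = true;
	}
}

$payload = serialize( new Import_Wakeup_Probe() );

// With 'allowed_classes' => false, serialized objects are restored as
// __PHP_Incomplete_Class instances, so their __wakeup() is never invoked.
$value = unserialize( $payload, array( 'allowed_classes' => false ) );

$is_incomplete = $value instanceof __PHP_Incomplete_Class;
$woke          = Import_Wakeup_Probe::$woke;

// Plain arrays and scalars are unaffected by the option.
$plain = unserialize( serialize( array( 'rating' => 5 ) ), array( 'allowed_classes' => false ) );
```

The trade-off is that meta values that legitimately contain objects would no longer round-trip as instances of their original classes.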

private $mapped_pre_filters = array(
	// Name of the filter, and the number of arguments it accepts.
	'wxr_importer_pre_process_comment' => 2,
Just noting any filter names I left in are temporary and will need rethinking, e.g. should we have data source-specific filters such as wxr_? Or should we filter at the entity level? Also, we'll need filters for plugin authors to process their custom blocks, e.g. WP Bakery serializes markup as base64 and they'll need to decode it to enable rewriting URLs in there. I only mean this comment to inform, I don't think any of that should hold up the topo sort work.
/**
 * Remove the filters.
 */
public function __destruct() {
I'm always nervous about coupling cleanup logic with GC behavior we don't control. Will __destruct be called while $this reference is present in the list of filters? I don't know, but I'd guess "no". With an explicit cleanup() or so method, we'd never have to guess.
Makes sense, I will add the unregistering to the deactivate method.
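The guess above can be checked with plain PHP reference counting: an object's destructor runs only once the last reference to it is gone, and a callback array such as `array( $object, 'method' )` counts as a reference. The sketch below uses a plain array as a stand-in for the WordPress filter list. `Sorter_Probe` and `$registered_callbacks` are hypothetical names for illustration only.

```php
<?php
// Stand-in for the global filter list; hypothetical, for illustration only.
$registered_callbacks = array();

class Sorter_Probe {
	public static $destroyed = false;

	public function __destruct() {
		self::$destroyed = true;
	}

	public function on_entity() {}
}

$sorter                 = new Sorter_Probe();
$registered_callbacks[] = array( $sorter, 'on_entity' );

// Dropping our own variable is not enough: the callback array still
// references the object, so __destruct() has not run yet.
unset( $sorter );
$destroyed_while_registered = Sorter_Probe::$destroyed;

// Only after the callback is removed does the last reference disappear.
$registered_callbacks    = array();
$destroyed_after_removal = Sorter_Probe::$destroyed;
```

In other words, a destructor that removes the object's own filters would not fire until those filters were already removed by other means or the request ended, which supports the explicit cleanup approach.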
public function __construct( $options = array() ) {
	if ( array_key_exists( 'session_id', $options ) ) {
		$this->current_session = $options['session_id'];
	}

	// The topological sorter needs to know about the mapped IDs for comments, terms, and posts.
	foreach ( $this->mapped_pre_filters as $name => $accepted_args ) {
Can we have an explicit method to register these? Not sure about the name, but ideally we'd use the same name every time we need to run side effects to initialize a class instance. initialize() maybe? Or attach()? Not sure. Anyway – that would give us the ability to create new objects of this class without immediately hooking them into the WordPress filtering system. This would be in line with the rest of the importing system design, where we try to be as lazy as possible. Also, as a human, I would find it easier to reason about. Implicit side effects can be tricky to test, debug, and remember.
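As a rough illustration of that suggestion, the sketch below keeps the constructor free of side effects and moves hook registration into an explicit pair of methods. All names here are assumptions: `Fake_Hooks` is a tiny stand-in for add_filter() and remove_filter() so the example runs on its own, and `Lazy_Sorter`, `attach()`, `detach()`, and `map_id()` are placeholders rather than the PR's API.

```php
<?php
/** Minimal stand-in for add_filter()/remove_filter(); hypothetical. */
class Fake_Hooks {
	public $filters = array();

	public function add( string $name, callable $callback ): void {
		$this->filters[ $name ][] = $callback;
	}

	public function remove( string $name, callable $callback ): void {
		$this->filters[ $name ] = array_values(
			array_filter(
				$this->filters[ $name ] ?? array(),
				function ( $registered ) use ( $callback ) {
					return $registered !== $callback;
				}
			)
		);
	}
}

class Lazy_Sorter {
	private $hooks;
	private $filter_names = array( 'wxr_importer_pre_process_comment' );

	// The constructor only stores dependencies; nothing is hooked yet.
	public function __construct( Fake_Hooks $hooks ) {
		$this->hooks = $hooks;
	}

	// Explicit, repeatable entry point for the side effects.
	public function attach(): void {
		foreach ( $this->filter_names as $name ) {
			$this->hooks->add( $name, array( $this, 'map_id' ) );
		}
	}

	// Mirror image of attach(); no reliance on __destruct().
	public function detach(): void {
		foreach ( $this->filter_names as $name ) {
			$this->hooks->remove( $name, array( $this, 'map_id' ) );
		}
	}

	public function map_id( $data ) {
		return $data;
	}
}

$hooks  = new Fake_Hooks();
$sorter = new Lazy_Sorter( $hooks );

$count_before = count( $hooks->filters );
$sorter->attach();
$count_attached = count( $hooks->filters['wxr_importer_pre_process_comment'] );
$sorter->detach();
$count_detached = count( $hooks->filters['wxr_importer_pre_process_comment'] );
```

A test can then construct the object, inspect it, and decide if and when to call `attach()`.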
 * Called by 'wxr_importer_processed_*' actions. This adds the entity to the
 * sorter table.
public function action_wxr_importer_processed reflects the context where the method is called. Can we, instead, reflect the intended outcome of that method? As in store_imported_entity, or store_imported_entity_for_eventual_sorting or add_importer_entity_to_sorter_table or so. Long names are fine if they add clarity.
I know nitpicking names is annoying, but I think it's worth it in the long run. It makes the difference between encoding the intention directly in the system and building a system where humans need to jump around the codebase to recover the original intentions. As an ADHD individual, I'm very sensitive to these. I forget a lot and I can't hold too much context in my head. Reducing side effects, using human-readable names, and avoiding abstractions really help me navigate the code.
'mapped_id'   => is_null( $id ) ? null : (string) $id,
'parent_id'   => null,
'byte_offset' => 0,
// Items with a parent has at least a sort order of 2.
Can we have a paragraph of text or a few to explain the idea behind sort_order?
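The PR does not spell out its sort_order scheme here, so the following is not a description of it. It is only a generic sketch of one way a numeric sort key can guarantee that parents come before children: use each entity's depth in the parent chain as the key. The function name `order_parents_first()` and the input shape (entity ID mapped to parent ID) are assumptions made for the example.

```php
<?php
/**
 * Hypothetical sketch: order entities by their depth in the parent chain.
 * This is NOT the PR's actual sort_order scheme.
 *
 * @param array $parents Map of entity ID => parent ID (null for roots).
 * @return array Entity IDs ordered so that every parent precedes its children.
 */
function order_parents_first( array $parents ): array {
	$depth_of = function ( $id ) use ( $parents ) {
		$depth = 0;
		$seen  = array();
		// Walk up until a root, an unknown parent, or a cycle is reached.
		while ( isset( $parents[ $id ] ) && ! isset( $seen[ $id ] ) ) {
			$seen[ $id ] = true;
			$id          = $parents[ $id ];
			++$depth;
		}
		return $depth;
	};

	$depths = array();
	foreach ( array_keys( $parents ) as $id ) {
		$depths[ $id ] = $depth_of( $id );
	}

	// A child is always exactly one level deeper than its parent, so
	// sorting by depth places every parent before its children.
	asort( $depths );

	return array_keys( $depths );
}

$ordered = order_parents_first(
	array(
		30 => 20,   // Grandchild.
		20 => 10,   // Child.
		10 => null, // Root.
		40 => null, // Unrelated root.
	)
);
```

Here the root 10 sorts before 20, which sorts before 30, while the relative order of the two roots carries no meaning.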
THIS IS A DRAFT.
What still needs to be done?
test_small_import
wp_get_post_categories in test_small_import unit test
not_test_serialized_term_meta
Motivation for the change, related issues
Topological Sorting of WXR entities before starting the import to ensure parent posts are imported before child posts. Processing WXR in a topological order may require an index with all offsets and lengths of items in the WXR.
Implementation details
Entities preloading/loading/mapping
The topological sort happens during a STAGE_TOPOLOGICAL_SORT phase that runs before STAGE_FRONTLOAD_ASSETS.
This PR removes the WP_Entity_Importer::mapping and WP_Entity_Importer::exists arrays and all in-memory preloading. That approach is slow and memory-consuming with thousands of entries, and it does not support sessions. The PR adds a new table with import IDs and mapped IDs. The table is prefilled during the first phase. During the WP_Entity_Importer import, the imported IDs are mapped using the wxr_importer_* filters and actions.
New WP-CLI script
This PR also introduces the new CLI script and moves the logger there.
New unit tests
Added all WordPress core unit tests and made some changes to make them work. Updated the WXRs to the latest version.
New PHPUnit filter
This PR adds a PHPUNIT_FILTER constant to packages/playground/data-liberation/tests/import/blueprint-import.json. If the value is not falsy, it is passed to PHPUnit when calling npx nx run playground-data-liberation:test:wp-phpunit. For example, you can set "PHPUNIT_FILTER": "WPRewriteUrlsTests", which is the same as running phpunit --filter WPRewriteUrlsTests.
Testing Instructions (or ideally a Blueprint)
Unit tests
Data test
The PR adds the WordPress core importer tests test-serialized-comment-meta.xml and test-serialized-postmeta-no-cdata.xml. The remaining tests found at https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests are about to be added and made to pass.
New script
or:
It accepts a file, a URL, or a folder. Run
wp help data-liberation import
to see all options.