[Data Liberation] First topological sorter draft #2030

Draft: wants to merge 50 commits into base: trunk

@zaerl (Collaborator) commented Nov 26, 2024

THIS IS A DRAFT.

What still needs to be done?

  1. Handle the importing of users to pass the core unit test test_small_import
  2. Investigate the wrong count returned by wp_get_post_categories in the test_small_import unit test
  3. Term meta is not currently imported by Data Liberation. Support needs to be added to pass the core unit test not_test_serialized_term_meta

Motivation for the change, related issues

Topologically sort the WXR entities before starting the import so that parent posts are imported before child posts. Processing a WXR in topological order may require an index holding the offsets and lengths of all items in the WXR.
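
As a rough illustration of the ordering described above, here is a minimal sketch of a parents-first sort. The function name `sort_parents_first` and the ID-to-parent map are assumptions made for this example and are not taken from the PR. The sketch keeps everything in memory and recurses once per ancestor, so it only shows the ordering itself.

```php
<?php
/**
 * Orders entity IDs so that every parent precedes its children.
 * $parents maps entity ID => parent ID (0 or null means "no parent").
 * Hypothetical helper for illustration only.
 */
function sort_parents_first( array $parents ): array {
	$sorted  = array();
	$visited = array();

	$visit = function ( $id ) use ( &$visit, &$sorted, &$visited, $parents ) {
		if ( isset( $visited[ $id ] ) ) {
			return;
		}
		// Mark before recursing so a parent/child cycle cannot loop forever.
		$visited[ $id ] = true;
		$parent         = $parents[ $id ] ?? null;
		// Only follow parents that are present in this export.
		if ( $parent && array_key_exists( $parent, $parents ) ) {
			$visit( $parent );
		}
		$sorted[] = $id;
	};

	foreach ( array_keys( $parents ) as $id ) {
		$visit( $id );
	}
	return $sorted;
}

// Post 10 is a child of 30, which is a child of 20.
$order = sort_parents_first( array( 10 => 30, 20 => 0, 30 => 20 ) );
// $order is array( 20, 30, 10 )
```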

Implementation details

Entities preloading/loading/mapping

The topological sort happens during a STAGE_TOPOLOGICAL_SORT phase that runs before STAGE_FRONTLOAD_ASSETS.

This PR removes the WP_Entity_Importer::mapping and WP_Entity_Importer::exists arrays and all in-memory preloading. That approach is slow and memory-hungry with thousands of entries, and it does not support sessions. The PR adds a new table that stores the import IDs and their mapped IDs. The table is prefilled during the first phase. During the WP_Entity_Importer import, the imported IDs are mapped through the wxr_importer_* filters and actions.

New WP-CLI script

This PR also introduces a new WP-CLI script and moves the logger there.

New unit tests

Added all the WordPress core unit tests and made some changes to make them work. Updated the WXR files to the latest version.

New PHPUnit filter

This PR adds a PHPUNIT_FILTER constant to packages/playground/data-liberation/tests/import/blueprint-import.json. If its value is not falsy, it is passed to PHPUnit when running npx nx run playground-data-liberation:test:wp-phpunit. For example, setting "PHPUNIT_FILTER": "WPRewriteUrlsTests" is equivalent to running phpunit --filter WPRewriteUrlsTests.

Testing Instructions (or ideally a Blueprint)

Unit tests

npx nx run playground-data-liberation:test:wp-phpunit

# Or only tests that do not need a WordPress environment
cd packages/playground/data-liberation
./vendor/bin/phpunit

Data test

  1. Spin up Playground
  2. Create a page
  3. Create another page
  4. Set the second page as the parent of the first page
  5. Export the XML
  6. Spin up Playground again
  7. Import the XML. The pages should have the same hierarchy

The PR adds the WordPress core importer tests test-serialized-comment-meta.xml and test-serialized-postmeta-no-cdata.xml. The plan is to add and pass all the tests found at https://github.com/WordPress/wordpress-importer/tree/master/phpunit/tests.

New script

wp data-liberation import test.xml

or:

cd packages/playground/data-liberation/bin/import
bash import-wxr.sh a-folder-with-xmls

It accepts a file, a URL, or a folder. Run wp help data-liberation import to see all options.

@zaerl force-pushed the add/topological-sort branch from c51d9c4 to 7778714 on November 26, 2024 13:52
@zaerl self-assigned this on Nov 26, 2024
@zaerl force-pushed the add/topological-sort branch 2 times, most recently from 85c850b to 78f1cb3, on November 29, 2024 11:19
@zaerl force-pushed the add/topological-sort branch from 78f1cb3 to 5af5722 on November 29, 2024 13:02
* Quicksort performs badly on already sorted arrays, O(n^2) is the worst case.
* Let's consider using a different sorting algorithm.
*/
uksort( $elements, $sort_callback );

@adamziel (Collaborator) commented Nov 29, 2024

This requires fitting all $elements into memory. This may be fine for v1, since every $element is relatively small, but we'll hit the limits of this approach sooner rather than later. Is it possible to perform topological sorting at a reasonable speed without holding everything in memory? If not, how much RAM would this need to process one of these huge VIP 1TB exports?

@zaerl (Collaborator, Author) replied:

We can save the data in a custom DB table (Jetpack has a custom wp_jetpack_sync_queue table) instead of a simple array in memory. I didn't touch the database in this first phase, but we can. Reproducing this in a table and using the custom sort here is straightforward. Once the byte offsets have been collected, it's a matter of ALTER TABLE wxr_sorting ORDER BY custom_sorting_like_the_one_above, and we are done, even with our new reentrancy model.

When it finds a row in this table, the streamer can jump to the "correct" position (the one it should load before the current one) and proceed to the next one.
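
The jump-by-offset idea could look roughly like the sketch below. It assumes each index row carries a `byte_offset` and a `length`; the row shape, the function name `read_in_sorted_order`, and the in-memory stream are illustrative assumptions rather than the PR's actual API. In a table-backed version the index would come from an ordered query result instead of an array, so only one row needs to be held at a time.

```php
<?php
/**
 * Yields chunks of a seekable stream in the order given by an index.
 * Each index row is array( 'byte_offset' => int, 'length' => int ).
 * Hypothetical helper for illustration only.
 */
function read_in_sorted_order( $stream, iterable $index ): Generator {
	foreach ( $index as $row ) {
		// Jump straight to the entity instead of scanning the file.
		fseek( $stream, $row['byte_offset'] );
		yield fread( $stream, $row['length'] );
	}
}

$stream = fopen( 'php://memory', 'w+' );
fwrite( $stream, '<child/><parent/>' );

// The parent sits after the child in the file, so the index lists it first.
$index = array(
	array( 'byte_offset' => 8, 'length' => 9 ),
	array( 'byte_offset' => 0, 'length' => 8 ),
);

$chunks = iterator_to_array( read_in_sorted_order( $stream, $index ), false );
// $chunks is array( '<parent/>', '<child/>' )
```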

A Collaborator replied:

Lovely! If it's straightforward, let's include it in v1 – it seems like we can ship something that's on the right track with relatively little effort. It would also enforce shaping the API to stream the sorted list instead of loading it into memory, which would have implications for the reentrancy cursor. It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

@zaerl (Collaborator, Author) replied:

> It may seem excessive for v1, but thinking about it more it seems like something that could have a ripple through the entire system if we don't account for streaming early on.

Agree. It's reasonable, let's proceed this way. 👍

@zaerl changed the title from "First topological sorter draft" to "[Data Liberation] First topological sorter draft" on Dec 10, 2024
} while ( $importer->advance_to_next_stage() );

$expected_string = '¯\_(ツ)_/¯';

@adamziel (Collaborator) commented Dec 11, 2024

I'm really glad to be seeing multibyte UTF-8 characters in the tests 👍

$meta_item['comment_id'] = $comment_id;
}

$value = maybe_unserialize( $meta_item['meta_value'] );

@adamziel (Collaborator) commented Dec 11, 2024

Are we unserializing it only for add_comment_meta to serialize it again? Is there any way we could avoid doing that work? I suppose not, since add_comment_meta likely triggers a bunch of hooks – although I wonder if we should disable all the content insertion hooks during the import 🤔

Also, should we worry about __wakeup code execution vulnerabilities here? Or would WordPress call maybe_unserialize on that value after inserting it into the database anyway?
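
On the __wakeup question, one option that PHP itself offers is the `allowed_classes` option of `unserialize()`. The sketch below only demonstrates that option with plain `unserialize()`; it does not claim anything about how `maybe_unserialize` or the PR behaves, and the `Import_Probe` class is made up for the example.

```php
<?php
// Hypothetical class standing in for any class with a magic __wakeup().
class Import_Probe {
	public static $woke = false;

	public function __wakeup() {
		self::$woke = true;
	}
}

$payload = serialize( new Import_Probe() );

// With allowed_classes => false, serialized objects come back as
// __PHP_Incomplete_Class instances and their __wakeup() is not invoked.
$value = unserialize( $payload, array( 'allowed_classes' => false ) );

$is_incomplete = $value instanceof __PHP_Incomplete_Class;

// Arrays and scalars still round-trip as usual.
$array = unserialize(
	serialize( array( 'a' => 1 ) ),
	array( 'allowed_classes' => false )
);
```

The trade-off is that meta values that legitimately store objects would no longer be restored as instances of their original classes.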

@zaerl (Collaborator, Author) replied:

Ideally, WP_IMPORT should block everything not deeply related to the import. I have seen plugins that send an email when a comment is inserted. Imagine what happens when you import millions of comments.


private $mapped_pre_filters = array(
// Name of the filter, and the number of arguments it accepts.
'wxr_importer_pre_process_comment' => 2,

@adamziel (Collaborator) commented Dec 11, 2024

Just noting any filter names I left in are temporary and will need rethinking, e.g. should we have data source-specific filters such as wxr_? Or should we filter at the entity level? Also, we'll need filters for plugin authors to process their custom blocks, e.g. WP Bakery serializes markup as base64 and they'll need to decode it to enable rewriting URLs in there. I only mean this comment to inform, I don't think any of that should hold up the topo sort work.
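
To make the base64 case concrete, a plugin-side callback might decode the value, rewrite it, and encode it again, roughly as sketched below. The function `rewrite_urls_in_base64` is hypothetical, does not reflect any specific plugin's storage format, and uses a naive `str_replace` where a real importer would presumably run its own URL rewriting.

```php
<?php
/**
 * Rewrites a URL inside base64-encoded markup.
 * Hypothetical helper for illustration only.
 */
function rewrite_urls_in_base64( string $encoded, string $from, string $to ): string {
	// Strict mode returns false when the input is not valid base64.
	$markup = base64_decode( $encoded, true );
	if ( false === $markup ) {
		// Not valid base64: leave the value untouched.
		return $encoded;
	}
	return base64_encode( str_replace( $from, $to, $markup ) );
}

$stored    = base64_encode( '<a href="https://old.example/page">Link</a>' );
$rewritten = rewrite_urls_in_base64( $stored, 'https://old.example', 'https://new.example' );
// base64_decode( $rewritten ) is '<a href="https://new.example/page">Link</a>'
```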

/**
* Remove the filters.
*/
public function __destruct() {

A Collaborator commented:

I'm always nervous about coupling cleanup logic with GC behavior we don't control. Will __destruct be called while $this reference is present in the list of filters? I don't know, but I'd guess "no". With an explicit cleanup() or so method, we'd never have to guess.

@zaerl (Collaborator, Author) replied:

Makes sense, I will add the unregistering to the deactivate method.
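
The destructor concern can be checked in isolation with a small sketch. The `$registry` array below is a simplified stand-in for a list of registered callbacks, not WordPress's real hook implementation, and `Sorter_Probe` with its `attach()`/`detach()` methods is hypothetical. Under PHP's reference counting, the object is not destroyed while a callback in the registry still points at it.

```php
<?php
// A stand-in for a global hook registry; not WordPress's real implementation.
$registry = array();

class Sorter_Probe {
	public static $destructed = 0;

	public function attach( array &$registry ) {
		// Storing array( $this, ... ) keeps a reference to this object alive.
		$registry['some_filter'] = array( $this, 'on_filter' );
	}

	public function detach( array &$registry ) {
		unset( $registry['some_filter'] );
	}

	public function on_filter( $value ) {
		return $value;
	}

	public function __destruct() {
		self::$destructed++;
	}
}

$sorter = new Sorter_Probe();
$sorter->attach( $registry );

// Dropping our variable is not enough: the registry still holds the object.
unset( $sorter );
$after_unset = Sorter_Probe::$destructed;

// Explicit cleanup releases the last reference, and only then __destruct runs.
$callback = $registry['some_filter'];
$callback[0]->detach( $registry );
unset( $callback );
$after_detach = Sorter_Probe::$destructed;
// $after_unset is 0 and $after_detach is 1
```

So a `__destruct` that removes its own filters would not run until something else has already removed them, which is why an explicit method is the safer place for that work.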

public function __construct( $options = array() ) {
if ( array_key_exists( 'session_id', $options ) ) {
$this->current_session = $options['session_id'];
}

// The topological sorter needs to know about the mapped IDs for comments, terms, and posts.
foreach ( $this->mapped_pre_filters as $name => $accepted_args ) {

@adamziel (Collaborator) commented Dec 11, 2024

Can we have an explicit method to register these? Not sure about the name, but ideally we'd use the same name every time we need to run side effects to initialize a class instance. initialize() maybe? Or attach()? Not sure. Anyway – that would give us the ability to create new objects of this class without immediately hooking them into the WordPress filtering system. This would be in line with the rest of the importing system design, where we try to be as lazy as possible. Also, as a human, I would find it easier to reason about. Implicit side effects can be tricky to test, debug, and remember.

Comment on lines 284 to 285
* Called by 'wxr_importer_processed_*' actions. This adds the entity to the
* sorter table.

@adamziel (Collaborator) commented Dec 11, 2024

public function action_wxr_importer_processed reflects the context where the method is called. Can we, instead, reflect the intended outcome of that method? As in store_imported_entity, or store_imported_entity_for_eventual_sorting or add_importer_entity_to_sorter_table or so. Long names are fine if they add clarity.

I know nitpicking names is annoying, but I think it's worth it in the long run. It makes the difference between encoding the intention directly in the system and building a system where humans need to jump around the codebase to recover the original intentions. As an ADHD individual, I'm very sensitive to these. I forget a lot and I can't hold too much context in my head. Reducing side effects, using human-readable names, and avoiding abstractions really help me navigate the code.

'mapped_id' => is_null( $id ) ? null : (string) $id,
'parent_id' => null,
'byte_offset' => 0,
// Items with a parent has at least a sort order of 2.

A Collaborator commented:

Can we have a paragraph of text or a few to explain the idea behind sort_order?

3 participants