-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Versions list query performance with large datasets #194
Comments
cc @tractorcow - you might be able to provide some background on the join and whether it can be replaced safely? |
The logic is (from comment in Versioned) "Join on latest version filtered by date". This is the part that doesn't perform under load |
After some investigation, since the join is based on a date filter and we don't have the record object available to use for filtering (record ID is 0) the only option is to mirror the query's current where conditions into the inner join as well. Logically this seems fine in my head, since we're expecting the parent query to return multiple rows without the join, and adding the join is narrowing it down to a single record, so the parent query conditions are valid for the join query too. Here's a patch which works, but I'm unsure that it's the right approach: diff --git a/src/Versioned.php b/src/Versioned.php
index ac1c919..9a33cca 100644
--- a/src/Versioned.php
+++ b/src/Versioned.php
@@ -634,6 +634,22 @@ class Versioned extends DataExtension implements TemplateGlobalProvider, Resetta
$stageCondition = '';
}
+ // Get base query conditions and add them to the inner join as well, this helps with performance
+ // in larger data sets
+ $baseQueryConditions = '';
+ $baseQueryParams = [];
+ foreach ($query->getWhere() as $condition) {
+ foreach ($condition as $key => $value) {
+ $baseQueryConditions .= ' AND ' . $key;
+ // Normalise array values
+ if (is_array($value) && count($value) === 1) {
+ $value = reset($value);
+ }
+ $baseQueryParams[] = $value;
+ }
+ }
+
+
// Join on latest version filtered by date
$query->addInnerJoin(
<<<SQL
@@ -643,6 +659,7 @@ class Versioned extends DataExtension implements TemplateGlobalProvider, Resetta
FROM "{$baseTable}_Versions"
WHERE "{$baseTable}_Versions"."LastEdited" <= ?
{$stageCondition}
+ {$baseQueryConditions}
GROUP BY "{$baseTable}_Versions"."RecordID"
)
SQL
@@ -654,7 +671,7 @@ SQL
,
"{$baseTable}_Versions_Latest",
20,
- [$date]
+ array_merge([$date], $baseQueryParams)
);
} @tractorcow if you get a chance would you mind putting your two cents into this please? |
I've opened a PR for discussion on this #197 |
Also worth asking @tractorcow - does that inner join work with Fluent? |
@robbieaverill what's the performance when used in conjunction with There's really no reason we should ever do such a query without a limit (e.g. gridfield pagination). |
Unfortunately limiting the base query doesn't affect the DB's ability to process it efficiently, and limiting the inner join subquery returns no records =( |
The major issue here as I see it is that the sub-select in the join will probably end up resolving to the latest version of every single block in the site (with MySQLs planner 🗡 ): SELECT `Element_Versions`.`RecordID`, MAX(`Element_Versions`.`Version`) AS `LatestVersion`
FROM `Element_Versions`
WHERE `Element_Versions`.`LastEdited` <= "2018-11-08 10:29:30"
AND `Element_Versions`.`WasDraft` = 1
GROUP BY `Element_Versions`.`RecordID` Pretty nuts. Just moving it to a subselect in the SELECT DISTINCT `Element_Versions`.`ClassName`,
`Element_Versions`.`LastEdited`,
`Element_Versions`.`Created`,
`Element_Versions`.`Visibility`,
`Element_Versions`.`Version`,
`Element_Versions`.`Title`,
`Element_Versions`.`ShowTitle`,
`Element_Versions`.`Sort`,
`Element_Versions`.`ExtraClass`,
`Element_Versions`.`Style`,
`Element_Versions`.`ParentID`,
`Element_Versions`.`RecordID` AS `ID`,
CASE
WHEN `Element_Versions`.`ClassName` IS NOT NULL THEN `Element_Versions`.`ClassName`
ELSE "DNADesign\\Elemental\\Models\\BaseElement" END AS `RecordClassName`,
`Element_Versions`.`RecordID`,
`Element_Versions`.`WasPublished`,
`Element_Versions`.`WasDeleted`,
`Element_Versions`.`WasDraft`,
`Element_Versions`.`AuthorID`,
`Element_Versions`.`PublisherID`
FROM `Element_Versions`
WHERE (`Element_Versions`.`ParentID` = "307341")
AND (`Element_Versions`.`WasDeleted` = "0")
AND (Element_Versions.Version = (
SELECT MAX(b.`Version`)
FROM Element_Versions b
WHERE b.`LastEdited` <= "2018-11-08 10:29:30" AND `Element_Versions`.`WasDraft` = 1 AND b.RecordID = Element_Versions.RecordID
))
ORDER BY `Element_Versions`.`Sort` ASC Time taken: 0.01 seconds. This is essentially no different. Excuse the terrible table alias naming and unquoted table names. This means no duping of filters and it'll work with multiple records being returned etc. I'm not sure it will scale very well when lots of blocks are being returned but it'll certainly be better than aggregating the latest version after a given date for every single block in the CMS. Not sure how possible this is to write with our ORM though. Sorry I ran out of time to look at this further today. FYI the explain for that updated query:
I miss PostgreSQL ( |
@ScopeyNZ re: "will probably end up resolving to the latest version of every single block in the site", you're missing the "ON" conditions for the join. Edit: oh you mean that it'll process every single block version before deciding which to use, yeah that's true |
I had tried the "put it into a where condition" option and didn't get the same performance gain, but will give it another go today |
Alright, I've given it another go and I still can't find a solution that works across the board (some tests still fail). My WIP is at #197 for reference, but I'm going to move on from this issue for now. |
@robbieaverill Given where the discussion on #195 is heading, the introduction of a |
I think the only thing the original PR was missing was to replicate the behaviour of an inner join when no joined row is found, which is to exclude the parent query's row from being returned. Moving the join to a WHERE condition behaves more like a left join, which returns the parent query row when the joined record is null - I've added another condition to ensure that the Version column (matching against subquery) is not null as well and the tests are green now. TestsContext
Running on a SSP vagrant box on a 2015 Macbook Pro. Loading an element's version list in history viewerNote: uses Query beforeSELECT DISTINCT `Element_Versions`.`ClassName`, `Element_Versions`.`LastEdited`, `Element_Versions`.`Created`, `Element_Versions`.`Visibility`, `Element_Versions`.`Version`, `Element_Versions`.`Title`, `Element_Versions`.`ShowTitle`, `Element_Versions`.`Sort`, `Element_Versions`.`ExtraClass`, `Element_Versions`.`Style`, `Element_Versions`.`ParentID`, `Element_Versions`.`RecordID` AS `ID`,
CASE WHEN `Element_Versions`.`ClassName` IS NOT NULL THEN `Element_Versions`.`ClassName`
ELSE "DNADesign\\Elemental\\Models\\BaseElement" END AS `RecordClassName`, `Element_Versions`.`RecordID`, `Element_Versions`.`WasPublished`, `Element_Versions`.`WasDeleted`, `Element_Versions`.`WasDraft`, `Element_Versions`.`AuthorID`, `Element_Versions`.`PublisherID`
FROM `Element_Versions` INNER JOIN (
SELECT `Element_Versions`.`RecordID`,
MAX(`Element_Versions`.`Version`) AS `LatestVersion`
FROM `Element_Versions`
WHERE `Element_Versions`.`WasDeleted` = 0
GROUP BY `Element_Versions`.`RecordID`
) AS `Element_Versions_Latest` ON `Element_Versions_Latest`.`RecordID` = `Element_Versions`.`RecordID`
AND `Element_Versions_Latest`.`LatestVersion` = `Element_Versions`.`Version`
WHERE (`Element_Versions`.`RecordID` = 256309)
AND (`Element_Versions`.`WasDeleted` = 0)
ORDER BY `Element_Versions`.`Sort` ASC
LIMIT 1 Execution time: 3.72s Query afterSELECT DISTINCT `Element_Versions`.`ClassName`, `Element_Versions`.`LastEdited`, `Element_Versions`.`Created`, `Element_Versions`.`Visibility`, `Element_Versions`.`Version`, `Element_Versions`.`Title`, `Element_Versions`.`ShowTitle`, `Element_Versions`.`Sort`, `Element_Versions`.`ExtraClass`, `Element_Versions`.`Style`, `Element_Versions`.`ParentID`, `Element_Versions`.`RecordID` AS `ID`,
CASE WHEN `Element_Versions`.`ClassName` IS NOT NULL THEN `Element_Versions`.`ClassName`
ELSE "DNADesign\\Elemental\\Models\\BaseElement" END AS `RecordClassName`, `Element_Versions`.`RecordID`, `Element_Versions`.`WasPublished`, `Element_Versions`.`WasDeleted`, `Element_Versions`.`WasDraft`, `Element_Versions`.`AuthorID`, `Element_Versions`.`PublisherID`
FROM `Element_Versions`
WHERE (`Element_Versions`.`RecordID` = 256309)
AND (`Element_Versions`.`WasDeleted` = 0)
AND (`Element_Versions`.`Version` = (
SELECT MAX(`Element_Versions_Latest`.`Version`)
FROM `Element_Versions` AS `Element_Versions_Latest`
WHERE `Element_Versions_Latest`.`WasDeleted` = 0
AND `Element_Versions_Latest`.`RecordID` = `Element_Versions`.`RecordID`
))
AND (`Element_Versions`.`Version` IS NOT NULL)
ORDER BY `Element_Versions`.`Sort` ASC
LIMIT 1 Execution time: 0.9ms Loading a specific element version in history viewerNote: uses Query beforeSELECT DISTINCT `ElementalArea_Versions`.`ClassName`, `ElementalArea_Versions`.`LastEdited`, `ElementalArea_Versions`.`Created`, `ElementalArea_Versions`.`Version`, `ElementalArea_Versions`.`OwnerClassName`, `ElementalArea_Versions`.`RecordID` AS `ID`,
CASE WHEN `ElementalArea_Versions`.`ClassName` IS NOT NULL THEN `ElementalArea_Versions`.`ClassName`
ELSE 'DNADesign\\Elemental\\Models\\ElementalArea' END AS `RecordClassName`, `ElementalArea_Versions`.`RecordID`, `ElementalArea_Versions`.`WasPublished`, `ElementalArea_Versions`.`WasDeleted`, `ElementalArea_Versions`.`WasDraft`, `ElementalArea_Versions`.`AuthorID`, `ElementalArea_Versions`.`PublisherID`
FROM `ElementalArea_Versions` INNER JOIN (
SELECT `ElementalArea_Versions`.`RecordID`,
MAX(`ElementalArea_Versions`.`Version`) AS `LatestVersion`
FROM `ElementalArea_Versions`
WHERE `ElementalArea_Versions`.`LastEdited` <= '2018-09-16 02:45:45'
AND `ElementalArea_Versions`.`WasDraft` = 1
GROUP BY `ElementalArea_Versions`.`RecordID`
) AS `ElementalArea_Versions_Latest` ON `ElementalArea_Versions_Latest`.`RecordID` = `ElementalArea_Versions`.`RecordID`
AND `ElementalArea_Versions_Latest`.`LatestVersion` = `ElementalArea_Versions`.`Version`
WHERE (`ElementalArea_Versions`.`RecordID` = 256308)
AND (`ElementalArea_Versions`.`WasDeleted` = 0)
LIMIT 1 Execution time: 0.6s Query afterSELECT DISTINCT `ElementalArea_Versions`.`ClassName`, `ElementalArea_Versions`.`LastEdited`, `ElementalArea_Versions`.`Created`, `ElementalArea_Versions`.`Version`, `ElementalArea_Versions`.`OwnerClassName`, `ElementalArea_Versions`.`RecordID` AS `ID`,
CASE WHEN `ElementalArea_Versions`.`ClassName` IS NOT NULL THEN `ElementalArea_Versions`.`ClassName`
ELSE 'DNADesign\\Elemental\\Models\\ElementalArea' END AS `RecordClassName`, `ElementalArea_Versions`.`RecordID`, `ElementalArea_Versions`.`WasPublished`, `ElementalArea_Versions`.`WasDeleted`, `ElementalArea_Versions`.`WasDraft`, `ElementalArea_Versions`.`AuthorID`, `ElementalArea_Versions`.`PublisherID`
FROM `ElementalArea_Versions`
WHERE (`ElementalArea_Versions`.`RecordID` = 256308)
AND (`ElementalArea_Versions`.`WasDeleted` = 0)
AND (`ElementalArea_Versions`.`Version` = (
SELECT MAX(`ElementalArea_Versions_Latest`.`Version`)
FROM `ElementalArea_Versions` AS `ElementalArea_Versions_Latest`
WHERE `ElementalArea_Versions_Latest`.`LastEdited` <= '2018-09-16 02:45:45'
AND `ElementalArea_Versions_Latest`.`RecordID` = `ElementalArea_Versions`.`RecordID`
AND `ElementalArea_Versions_Latest`.`WasDraft` = 1
))
AND (`ElementalArea_Versions`.`Version` IS NOT NULL)
LIMIT 1 Execution time: 0.9ms Original query from issueOriginalSELECT DISTINCT `Element_Versions`.`ClassName`,
`Element_Versions`.`LastEdited`,
`Element_Versions`.`Created`,
`Element_Versions`.`Visibility`,
`Element_Versions`.`Version`,
`Element_Versions`.`Title`,
`Element_Versions`.`ShowTitle`,
`Element_Versions`.`Sort`,
`Element_Versions`.`ExtraClass`,
`Element_Versions`.`Style`,
`Element_Versions`.`ParentID`,
`Element_Versions`.`RecordID` AS `ID`,
CASE
WHEN `Element_Versions`.`ClassName` IS NOT NULL THEN `Element_Versions`.`ClassName`
ELSE "DNADesign\\Elemental\\Models\\BaseElement" END AS `RecordClassName`,
`Element_Versions`.`RecordID`,
`Element_Versions`.`WasPublished`,
`Element_Versions`.`WasDeleted`,
`Element_Versions`.`WasDraft`,
`Element_Versions`.`AuthorID`,
`Element_Versions`.`PublisherID`
FROM `Element_Versions`
INNER JOIN (SELECT `Element_Versions`.`RecordID`, MAX(`Element_Versions`.`Version`) AS `LatestVersion`
FROM `Element_Versions`
WHERE `Element_Versions`.`LastEdited` <= "2018-11-08 10:29:30"
AND `Element_Versions`.`WasDraft` = 1
GROUP BY `Element_Versions`.`RecordID`) AS `Element_Versions_Latest`
ON `Element_Versions_Latest`.`RecordID` = `Element_Versions`.`RecordID` AND
`Element_Versions_Latest`.`LatestVersion` = `Element_Versions`.`Version`
WHERE (`Element_Versions`.`ParentID` = "307341")
AND (`Element_Versions`.`WasDeleted` = "0")
ORDER BY `Element_Versions`.`Sort` ASC Execution time: 3.8s UpdatedSELECT DISTINCT `Element_Versions`.`ClassName`,
`Element_Versions`.`LastEdited`,
`Element_Versions`.`Created`,
`Element_Versions`.`Visibility`,
`Element_Versions`.`Version`,
`Element_Versions`.`Title`,
`Element_Versions`.`ShowTitle`,
`Element_Versions`.`Sort`,
`Element_Versions`.`ExtraClass`,
`Element_Versions`.`Style`,
`Element_Versions`.`ParentID`,
`Element_Versions`.`RecordID` AS `ID`,
CASE
WHEN `Element_Versions`.`ClassName` IS NOT NULL THEN `Element_Versions`.`ClassName`
ELSE "DNADesign\\Elemental\\Models\\BaseElement" END AS `RecordClassName`,
`Element_Versions`.`RecordID`,
`Element_Versions`.`WasPublished`,
`Element_Versions`.`WasDeleted`,
`Element_Versions`.`WasDraft`,
`Element_Versions`.`AuthorID`,
`Element_Versions`.`PublisherID`
FROM `Element_Versions`
WHERE (`Element_Versions`.`ParentID` = "307341")
AND (`Element_Versions`.`WasDeleted` = "0")
AND (`Element_Versions`.`Version` = (
SELECT MAX(`Element_Versions_Latest`.`Version`)
FROM `Element_Versions` AS `Element_Versions_Latest`
WHERE `Element_Versions_Latest`.`LastEdited` <= '2018-11-08 10:29:30'
AND `Element_Versions_Latest`.`RecordID` = `Element_Versions`.`RecordID`
AND `Element_Versions_Latest`.`WasDraft` = 1
))
AND (`Element_Versions`.`Version` IS NOT NULL)
ORDER BY `Element_Versions`.`Sort` ASC
Execution time: 1.3ms |
New PR: #213 |
Removing 4.4.0 milestone based on @unclecheese recommendation, sounds like this can get into 4.4.1 as a patch level change. We're hitting the deadlines with stabilising 4.4.0, and there's lots of other stuff on. If it turns out not to be a patch level change, it'll have to be a fork for large projects relying on this (and 4.5.0) |
The patch is currently targeted at versioned 1.3 (in line with 4.3). It's currently ready for review. I hope that we can still squeeze this in before the eventual 4.4 release as it's not a "feature" but a "fix". |
@ScopeyNZ all PRs listed in the issue are merged/closed - is this finished now? |
I've updated the OP - there was a missing PR from the list. |
Affected version: SS 4.3.0-rc1
Description
When a versioned DataObject has a large number of versions (e.g. 1,000,000), the query to retrieve the list of versions resorts to ignoring DB indexes and using filesort, which is considerably slower (1.5 seconds or so when compared to 0.0x seconds).
Example query generated by the ORM - the issue can be reproduced in an environment which has a large number of rows in the Element_Versions table:
The analysis of this query:
Notes
Using where; Using temporary; Using filesort
when compared to the second row which isUsing index
. This query between 1.5 and 3.5 seconds on average for me (key project database, 1 million records in the table).Purpose of the join
The join appears to exist purely to return only the "latest draft version" for the object. When removing the join, you get a few rows.
The joined result set isn't used directly in any of the conditions or select fields, so it is used purely to limit the main query's results down.
If this is the purpose, we may be able to replace the join with
ORDER BY Element_Versions.Version DESC LIMIT 1
which appears to return the correct result in "0.00 seconds".Pull requests
WIP: FIX Filtering the version history by a specific date is now more performant in large datasets #197The text was updated successfully, but these errors were encountered: