Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve BaseBuilder::updateBatch() SQL #6373

Merged
merged 13 commits into from
Sep 10, 2022

Conversation

sclubricants
Copy link
Member

@sclubricants sclubricants commented Aug 12, 2022

This PR redesigns the SQL queries generated in _batchUpdate().

This implementation provides more elegant and smaller queries. This method selects the data into a psudo in memory table which provides more flexibility and availability to do more things with.

Because it treats the dataset as a table, a query or a table can be substituted for the dataset. This will be helpful in future ambitions.

Here is a sample of the difference in MySQL:

Old way - 982 Characters - This gets really hard to read when 100 rows long and 15 columns wide:

UPDATE `db_update_batch` SET `column1` = CASE 
WHEN `primary_key` = 1 THEN 'test data string T'
WHEN `primary_key` = 2 THEN 'test data string T'
WHEN `primary_key` = 3 THEN 'test data string T'
ELSE `column1` END, `column2` = CASE 
WHEN `primary_key` = 1 THEN 'test data string T'
WHEN `primary_key` = 2 THEN 'test data string T'
WHEN `primary_key` = 3 THEN 'test data string T'
ELSE `column2` END, `column3` = CASE 
WHEN `primary_key` = 1 THEN 'test data string T'
WHEN `primary_key` = 2 THEN 'test data string T'
WHEN `primary_key` = 3 THEN 'test data string T'
ELSE `column3` END, `column4` = CASE 
WHEN `primary_key` = 1 THEN 'test data string T'
WHEN `primary_key` = 2 THEN 'test data string T'
WHEN `primary_key` = 3 THEN 'test data string T'
ELSE `column4` END, `column5` = CASE 
WHEN `primary_key` = 1 THEN 'test data string T'
WHEN `primary_key` = 2 THEN 'test data string T'
WHEN `primary_key` = 3 THEN 'test data string T'
ELSE `column5` END
WHERE `primary_key` IN(1,2,3)

New way - 791 Characters - This is easy to read no matter the number of rows:

UPDATE `db_update_batch` AS t
INNER JOIN (
SELECT 1 `primary_key`, 'test data string T' `column1`, 'test data string T' `column2`, 'test data string T' `column3`, 'test data string T' `column4`, 'test data string T' `column5` UNION ALL
SELECT 2 `primary_key`, 'test data string T' `column1`, 'test data string T' `column2`, 'test data string T' `column3`, 'test data string T' `column4`, 'test data string T' `column5` UNION ALL
SELECT 3 `primary_key`, 'test data string T' `column1`, 'test data string T' `column2`, 'test data string T' `column3`, 'test data string T' `column4`, 'test data string T' `column5`
) u
ON t.`primary_key` = u.`primary_key`
SET 
t.`column1` = u.`column1`,
t.`column2` = u.`column2`,
t.`column3` = u.`column3`,
t.`column4` = u.`column4`,
t.`column5` = u.`column5`

The size difference varies depending on the query. A long or multiple primary key significantly lengthens the old method. This then is compounded by the length of the dataset.

On MySQL I ran some tests on performance:

With 100 records with 55 character primary_key and 15 columns - Query strlen():
Old Method: 157,605
New Method: 54,821

50,000 records update - 15 Fields 255 characters - Four passes each:
Old Method: 29,368 ms 84.68 mb
New Method: 28,374 ms 83.25 mb

Query execution of batch of 100 Fields 55 Characters - 2 passes:
Old Method: 0.0195 s
New Method: 0.0095 s

There doesn't seem to be a huge gain in performance but no loss either. Smaller queries should perform much better when using a remote database where network speed becomes a factor.

Future ambitions include being able to update from a query or another table. Also, the current implementation limits the use of a single field as a constraint. This makes it useless if you have composite primary keys. I work with tables all the time that have as many as six fields composing the primary key. This PR has already built in the use of multiple constraints. The rest of it will need to be implemented later. I'd like to have it working with the onConstraint() method created in the upsert PR.

All the drivers are updated to use this type of query. Oracle is using MERGE instead of UPDATE. It accomplishes the same thing and works in the same way as the other queries.

The BaseBuilder method was switched from MySQL to a method used by SQLite, MSSQL, Postgre. MySql and Oracle have methods in Builder. SQLite has method in Builder for < version 3.33.0.

Checklist:

  • Securely signed commits
  • Component(s) with PHPDoc blocks, only if necessary or adds value
  • Unit testing, with >80% coverage
  • User guide updated
  • Conforms to style guide

@kenjis kenjis added 4.3 refactor Pull requests that refactor code database Issues or pull requests that affect the database layer labels Aug 12, 2022
@sclubricants
Copy link
Member Author

sclubricants commented Aug 12, 2022

Having some trouble with SQLite. What version are we running?

The UPDATE-FROM idea is an extension to SQL that allows an UPDATE statement to be driven by other tables in the database. The "target" table is the specific table that is being updated. With UPDATE-FROM you can join the target table against other tables in the database in order to help compute which rows need updating and what the new values should be on those rows. UPDATE-FROM is supported beginning in SQLite version 3.33.0 (2020-08-14).
https://www.sqlite.org/lang_update.html

SQLite older than 3.33.0 will use the old SQL.

@kenjis
Copy link
Member

kenjis commented Aug 13, 2022

What version are we running?

Just SQLite3.

SQLite3 via the SQLite3 driver
https://codeigniter4.github.io/CodeIgniter4/intro/requirements.html

@sclubricants
Copy link
Member Author

SQLite version must be determined by the driver.

If driver is 3.33 or newer version than we use new method, otherwise the overide method is the old method.

I'm not sure if you can update the driver to the newer one.. probably depends on version of php.

    protected function _updateBatch(string $table, array $values, string $index): string
    {
        if ((float) $this->db->getVersion() >= 3.33) {
            return parent::_updateBatch($table, $values, $index);
        }

@kenjis
Copy link
Member

kenjis commented Aug 13, 2022

$this->db->getVersion() returns versionString like 3.5.9. It is not a valid number string.
https://www.php.net/manual/en/sqlite3.version.php
I don't know (float) works always as you expect.

@sclubricants
Copy link
Member Author

I tested casting to float. It basically seems to knock everything off from the second decimal.

$this->db->getVersion() could return an int.. would have been more useful.

https://www.php.net/manual/en/sqlite3.version.php

@kenjis
Copy link
Member

kenjis commented Aug 13, 2022

I tested casting to float. It basically seems to knock everything off from the second decimal.

It seems the behavior is not documented.

Unless it is documented, we should not rely on the current behavior.
It is better to use version_compare().
https://www.php.net/manual/en/function.version-compare.php

Or you can use SQLite3::version() directly.

@sclubricants
Copy link
Member Author

Thanks, didn't know about that function.

Copy link
Member

@kenjis kenjis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and this is more elegant!
But I want some other reviews.

@kenjis kenjis changed the title Redesign BatchUpdate() SQL Redesign BaseBuilder::updateBatch() SQL Aug 26, 2022
@kenjis kenjis changed the title Redesign BaseBuilder::updateBatch() SQL Improve BaseBuilder::updateBatch() SQL Aug 26, 2022
@kenjis
Copy link
Member

kenjis commented Aug 26, 2022

I benchmarked with CSV upload and updateBatch().
Update existing records with exactly the same data with updateBatch().
Although not a very rigorous benchmark, I did see an improvement in performance.
And I confirmed all the records are exactly the same before and after the updateBatch().

149,970 records
develop: 319 MB, 157 seconds
this PR: 319 MB, 130 seconds

@sclubricants
Copy link
Member Author

#6407 has a slightly better implementation as most of the query is cached and doesn't need rebuilt for each batch run.

Also, while performance is important, there are other considerations in the design of the new query. It allows multiple keys to update on where the current implementation only allows one. The majority of the tables I work with have composite keys so this is very helpful to me. Another thing is that you can use a query as a datasource. Although its not implemented here you could also join a table to the dataset.

I have made quite a few changes in my all inclusive PR #6407

I could update this PR with some of them.

@sclubricants
Copy link
Member Author

Another thing you can do with this PR is update a field that isn't in the dataset or filter dataset by field and not use it in update:

UPDATE `db_update_batch` AS t
INNER JOIN (
SELECT 1 `primary_key`, 1 `second_key`, 'Name 1-1' `name`, '2022-08-24 12:56:11' `created_date` UNION ALL
SELECT 1 `primary_key`, 2 `second_key`, 'Name 1-2' `name`, '2022-08-26 09:22:36' `created_date`  UNION ALL
SELECT 2 `primary_key`, 1 `second_key`, 'Name 2-1' `name` , '2022-08-26 09:31:57' `created_date`
) u
ON t.`primary_key` = u.`primary_key` AND t.`second_key` = u.`second_key` AND DATE(u.`created_date`) = CURDATE()
SET 
t.`name` = u.`name`,
t.`last_update` = CURRENT_TIMESTAMP()

This only updates records from the dataset that were created today. This query is possible with this PR using RawSql.

@sclubricants
Copy link
Member Author

This one will only update records where the table data doesn't match the update data. Then only if they don't match will update the record and change the last_update.

UPDATE `db_update_batch` AS t
INNER JOIN (
SELECT 1 `primary_key`, 'Cat 3' `category`, 'Name 1-1' `name` UNION ALL
SELECT 1 `primary_key`, 'Cat 5' `category`, 'Name 1-2' `name` UNION ALL
SELECT 2 `primary_key`, 'Cat 5' `category`, 'Name 2-1' `name`
) u
ON t.`primary_key` = u.`primary_key` AND (t.`category` != u.`category` OR t.`name` != u.`name`)
SET 
t.`name` = u.`name`,
t.`last_update` = CURRENT_TIMESTAMP()

This gives you an accurate last_update because there truly was an update.

@kenjis
Copy link
Member

kenjis commented Aug 26, 2022

@sclubricants Thank you for your work.
I would like to get your work into the framework.
But your PRs are relatively too big, and so the most important thing I would like to say is that make your PR small.
The larger it is, the less likely it is to be merged. And the more it will be closed.

#6407 does not help me. Because it is too big to understand.
You need to run git commit more often. One commit, one intent.

Copy link
Member

@michalsn michalsn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these changes. However, the most important element is missing. Users won't use the new features if they don't know about them.

We need to explain and demonstrate the new capabilities with examples in the user guide. We also need to indicate in which situations new features will not work (SQLite < 3.33.0)

As for other PRs - let's stick to one rule: 1 PR = 1 enhancement. This way, it's easier for everyone to follow the main purpose and agree/disagree on the feature.

@kenjis kenjis added the docs needed Pull requests needing documentation write-ups and/or revisions. label Aug 28, 2022
@sclubricants
Copy link
Member Author

We need to explain and demonstrate the new capabilities with examples in the user guide. We also need to indicate in which situations new features will not work (SQLite < 3.33.0)

@michalsn
Well actually all the new capabilities come with my other PR. With this PR there really is no difference to the developer. This just changes the query structure. All the cool new functions require some major changes.. which I have worked out mostly. I suggested creating a new branch to build it out but no luck. #6407

@kenjis
Copy link
Member

kenjis commented Aug 30, 2022

The generated SQL statements will be completely changed. So at least we should add it in the changelog.
And the SQL statements will be greatly improved, so I think we can add it as an enhancement in the changelog.

@michalsn
Copy link
Member

Oh... okay, sorry - my mistake. I was browsing the code from the linked PR and mixed things up.

Please, just add the changelog, and we will be ready. After merging, we can go forward with your next proposals/ideas.

Much appreciate.

@kenjis
Copy link
Member

kenjis commented Sep 6, 2022

Run phpcpd --exclude system/Test --exclude system/ThirdParty --exclude system/Database/SQLSRV/Builder.php --exclude system/Database/SQLSRV/Forge.php -- app/ public/ system/
phpcpd 6.0.3 by Sebastian Bergmann.

Found 1 clones with 54 duplicated lines in 3 files:

  - /home/runner/work/CodeIgniter4/CodeIgniter4/system/Database/BaseBuilder.php:2254-2281 (27 lines)
    /home/runner/work/CodeIgniter4/CodeIgniter4/system/Database/MySQLi/Builder.php:62-89
    /home/runner/work/CodeIgniter4/CodeIgniter4/system/Database/OCI8/Builder.php:235-262

0.07% duplicated lines out of 82900 total lines of code.
Average size of duplication is 54 lines, largest clone has 27 of lines

@kenjis kenjis removed the docs needed Pull requests needing documentation write-ups and/or revisions. label Sep 6, 2022
@kenjis
Copy link
Member

kenjis commented Sep 6, 2022

@sclubricants Can you fix PHPCPD error?

@sclubricants
Copy link
Member Author

sclubricants commented Sep 6, 2022

@sclubricants Can you fix PHPCPD error?

@kenjis Its not as bad as it looks. A single statement can take 11 lines.

            $data = implode(
                " UNION ALL\n",
                array_map(
                    static fn ($value) => 'SELECT ' . implode(', ', array_map(
                        static fn ($key, $index) => $index . ' ' . $key,
                        $keys,
                        $value
                    )),
                    $values
                )
            ) . "\n";

The top 6 lines will go away once things get refactored. I wasn't getting this error until I added these.

        // this is a work around until the rest of the platform is refactored
        if ($index !== '') {
            $this->QBOptions['constraints'] = [$index];
        }
        $keys = array_keys(current($values));

Can we suppress these somehow?

@kenjis
Copy link
Member

kenjis commented Sep 7, 2022

Add --exclude option.

run: phpcpd --exclude system/Test --exclude system/ThirdParty --exclude system/Database/SQLSRV/Builder.php --exclude system/Database/SQLSRV/Forge.php -- app/ public/ system/

@sclubricants
Copy link
Member Author

I wish it had a way to suppress certain lines instead of excluding the whole file.

sebastianbergmann/phpcpd#99

@kenjis
Copy link
Member

kenjis commented Sep 9, 2022

It seems you used git merge, but git rebase is recommended if you can.

Could you rebase?
See https://github.com/codeigniter4/CodeIgniter4/blob/develop/contributing/workflow.md#updating-your-branch
This PR goes into 4.3, so the command would be git rebase upstream/4.3.

@sclubricants
Copy link
Member Author

It seems you used git merge, but git rebase is recommended if you can.

I usually do rebase but I did it through the web interface and it created a merge.

sclubricants and others added 13 commits September 9, 2022 09:04
This will be a place to store different options that can be used on queries. I thought about using $QBData but felt this could be confused with the data in QBSet.
This method removes the key fields from the update fields. It also handles using RawSql.
This is a bit more refined then the previous version. This version caches most of the query after the first run of the batch. It also is ready to accept various options that can be fed to it.. perhaps in a later PR. For instance the fields updated can be extracted from the field set as is currently implemented or they could be explicitly set before hand.  Likewise the data  is set by QBSet data but could potentially be fed a query instead.
entry was in the wrong place
@kenjis
Copy link
Member

kenjis commented Sep 10, 2022

Thank you for rebasing.

@kenjis kenjis merged commit e249460 into codeigniter4:4.3 Sep 10, 2022
@sclubricants sclubricants deleted the BatchUpdateSQL branch September 11, 2022 19:10

return 'UPDATE ' . $this->compileIgnore('update') . $table . ' SET ' . substr($cases, 0, -2) . $this->compileWhereHaving('QBWhere');
return sprintf($sql, $data);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sprintf() was introduced here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4.3 database Issues or pull requests that affect the database layer refactor Pull requests that refactor code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants