Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GRIDFIELD] BUG: pagination in GridField does not work if DataObject default_sort field values are not unique #7249

Open
sunnysideup opened this issue Aug 7, 2017 · 57 comments

Comments

@sunnysideup
Copy link
Contributor

sunnysideup commented Aug 7, 2017

SUMMARY:

Pagination does not work well when the data is provided by MySQL 5.6 / 5.7 (php 7.1) where the data-set is sorted by a field (or fields) that are not enforced to be unique on a database level. The reason for this is that MySQL may return datasets in a random order (beyond what it has been asked to sort by).

You can possibly replicate this bug by yourself by using the tools listed below:

This means, for example, that:

  • batch processing may not work correctly (you will process some records twice and other ones not at all)
  • grid field pagination is broken (some items listed twice or more, on separate pages, others not at all)
  • pagination on the front-end is broken (some items listed twice or more, on separate pages, others not at all)

There are still a lot of questions about this bug, but I have been able to consistently demonstrate this error on several machines with clean installs of 3.6.1

Details

the issue: when the default_sort value for a dataobject contains non-unique values for one or more rows then the GridField ends up NEVER showing some records while showing other records more than once:

How to replicate:

  1. install SS 3.6.1 using standard installer

  2. add the following dataobject (MyDataObject):

<?php


class MyDataObject extends DataObject
{

    private static $db = array(
        'Title' => 'Varchar(20)',
        'SameSame' => 'Varchar(20)'
    );


    private static $default_sort = array(
        'SameSame' => 'ASC'
    );

    function requireDefaultRecords()
    {
        DB::query('DELETE FROM MyDataObject');
        for($i = 1; $i < 47; $i++) {
            $filter = array('Title' => "item #".$i);
            if(! MyDataObject::get()->filter($filter)->first()) {
                $obj = MyDataObject::create($filter);
                $obj->SameSame = 'same';
                $obj->write();
            }
        }
    }

}

then add modeladmin (MyModelAdmin.php):

<?php

class MyModelAdminTest extends ModelAdmin {

    private static $managed_models = array('MyDataObject');

    private static $url_segment = 'test';

    private static $menu_title = 'Test';
}

You get the following results:
image

image

For my client, I was sorting a GridField by EndDate and StartDate. Some of the items has the same EndDate and StartDate and thus were missing. Making it impossible for them to be found by the client.

@sunnysideup sunnysideup changed the title BUG: sameSame default sort leads to bad data in ModelAdmin BUG: pagination in GridField does not work if DataObject sort field contain non-unique values Aug 7, 2017
@sunnysideup sunnysideup changed the title BUG: pagination in GridField does not work if DataObject sort field contain non-unique values BUG: pagination in GridField does not work if DataObject default_sort field values are not unique Aug 7, 2017
@flamerohr
Copy link
Contributor

Have you tried adding ID as the last item in the list?
This does look like a genuine bug, but perhaps adding ID would be a workaround for it

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 7, 2017

@flamerohr - i fixed my code as soon as I noticed the bug by adding ID as a sort key and that should also be the solution for this bug. Data Objects can be sorted in any way, but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

Check this out:

I have now added to Page_Controller:

    public function PaginatedObjects()
    {
        $list = MyDataObject::get();

        return new PaginatedList($list, $this->getRequest());
    }

and I have added to /themes/simple/templates/Layout/Page.ss:

    <ul>
        <% loop $PaginatedObjects %>
            <li>$Title</li>
        <% end_loop %>
    </ul>

    <% if $PaginatedObjects.MoreThanOnePage %>
        <% loop $PaginatedObjects.Pages %>
            <% if $CurrentBool %>
                $PageNum
            <% else %>
                <% if $Link %>
                    <a href="$Link">$PageNum</a>
                <% else %>
                    ...
                <% end_if %>
            <% end_if %>
            <% end_loop %>
    <% end_if %>

PAGE 1:
image

PAGE 2:

image

PAGE 3:
image

AMAZING!

I agree with you that this is a critical bug and I am very surprised to find it.

It is so serious because anytime you do not sort by a field that is enforced to be unique on a database level, i.e. in basically all cases, e.g. sorting by Title, Created, LastEdited, SortNumber (SortableGridField), Sort (SiteTree), by Date, and so on.... ) you will potentially end up with missing / double-up records in a GridField / Paginated List.

@flamerohr
Copy link
Contributor

unfortunately, this is a potential problem in any given relational database, many queries do not explicitly provide a sort (or not a sort with at least a unique column) and has the "implied" sorting by ID, which is what happened here.

I've marked as critical because of data disappearing, but I encourage always adding explicit ID sort to the end of any sort calls.

@unclecheese
Copy link

I can't replicate this on 3.6.1. Can you test it on a plain install?

@sunnysideup
Copy link
Contributor Author

I run this on a plain 3.6.1

@unclecheese
Copy link

I wonder if the auto increment on your ID is isn't working properly. Can you try deleting the entire DB and starting over? I'm curious if you're getting IDs that are non sequential with your Item numbers due to all the deletions. Try adding ID as a summary_field to help debug.

@unclecheese
Copy link

screenshot 2017-08-08 10 53 58

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 7, 2017

No problems with IDs

Title ID
item no. 1 334
item no. 2 335
item no. 3 336
item no. 4 337
item no. 5 338
item no. 6 339
item no. 7 340
item no. 8 341
item no. 9 342
item no. 10 343
item no. 11 344
item no. 12 345
item no. 13 346
item no. 14 347
item no. 15 348
item no. 16 349
item no. 17 350
item no. 18 351
item no. 19 352
item no. 20 353
item no. 21 354
item no. 22 355
item no. 23 356
item no. 24 357
item no. 25 358
item no. 26 359
item no. 27 360
item no. 28 361
item no. 29 362
item no. 30 363
item no. 31 364
item no. 32 365
item no. 33 366
item no. 34 367
item no. 35 368
item no. 36 369
item no. 37 370
item no. 38 371
item no. 39 372
item no. 40 373
item no. 41 374
item no. 42 375
item no. 43 376
item no. 44 377
item no. 45 378
item no. 46 379

This error actually goes back to mysql. We are running

SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 0,10
SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 10,10
SELECT * FROM MyDataObject ORDER BY SameSame LIMIT 20,10

And Mysql returns the records in a random order - although the first page of pagination seems to refute this theory.

but I encourage always adding explicit ID sort to the end of any sort calls.

@flamerohr - I have not come across many (if any) modules that explicitly sort by ID. For example SiteTree does not sort by ID and there will definitely be situations where the Sort values are not unique in SiteTree from what I know as this is not enforced on a Database level.

image

@unclecheese
Copy link

What kind of output do you get when you just run SELECT Title FROM MyDataObject ORDER BY SameSame ASC in the database?

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 8, 2017

I run the following code:

    public function MyDataObjectRaw()
    {
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 0, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 10, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
        $this->html .= '<br />-------------------------------------';
        $rows = DB::query('SELECT Title, ID FROM MyDataObject ORDER BY SameSame ASC LIMIT 20, 10;');
        foreach($rows as $row) {
            $this->html .= '<li>'.$row['Title'].' = '.$row['ID'];
        }
    }

and the output that I get is as follows:

item No.36 = 369
item No.25 = 358
item No.26 = 359
item No.27 = 360
item No.28 = 361
item No.29 = 362
item No.30 = 363
item No.31 = 364
item No.32 = 365
item No.33 = 366
item No.34 = 367
item No.35 = 368
item No.24 = 357
item No.37 = 370
item No.38 = 371
item No.39 = 372
item No.40 = 373
item No.41 = 374
item No.42 = 375
item No.43 = 376
item No.44 = 377
item No.45 = 378
item No.46 = 379
item No.13 = 346
item No.2 = 335
item No.3 = 336
item No.4 = 337
item No.5 = 338
item No.6 = 339
item No.7 = 340
item No.8 = 341
item No.9 = 342
item No.10 = 343
item No.11 = 344
item No.12 = 345
item No.1 = 334
item No.14 = 347
item No.15 = 348
item No.16 = 349
item No.17 = 350
item No.18 = 351
item No.19 = 352
item No.20 = 353
item No.21 = 354
item No.22 = 355
item No.23 = 356

limit stuff 0 -> 9

item No.1 = 334
item No.2 = 335
item No.3 = 336
item No.4 = 337
item No.5 = 338
item No.6 = 339
item No.7 = 340
item No.8 = 341
item No.9 = 342
item No.10 = 343

limit stuff 10 -> 19

item No.12 = 345
item No.1 = 334
item No.10 = 343
item No.9 = 342
item No.8 = 341
item No.7 = 340
item No.6 = 339
item No.5 = 338
item No.4 = 337
item No.3 = 336

limit stuff 19 -> 29

item No.12 = 345
item No.11 = 344
item No.10 = 343
item No.9 = 342
item No.8 = 341
item No.7 = 340
item No.6 = 339
item No.5 = 338
item No.4 = 337
item No.3 = 336

@unclecheese
Copy link

I'm not convinced this is a SilverStripe issue. The way that MySQL resolves an ambiguous sort is highly unpredictable. See (https://dba.stackexchange.com/questions/6051/what-is-the-default-order-of-records-for-a-select-statement-in-mysql).

MySQL sorts the rows any way it wants to, without any guarantee of consistency.

That I can't replicate it should be an indication that we're dealing with some kind of unknown. It could be as simple as differing storage engines, MySQL versions, or both. In short, don't use ambiguous sorts.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 8, 2017

Hi Aaron,

I'm not convinced this is a SilverStripe issue.

It is definitely a MYSQL issue - but we can't change MYSQL so we will have to deal with the weirdness in Silverstripe.

In short, don't use ambiguous sorts.

That is true in theory, but I don't think that is true practice, most of the time you are sorting by an ambiguous sort field: e.g.

  • Sort / SortNumber (which are usually not database enforced uique),
  • Title
  • Name,
  • Calendar Start Date,
  • Created,
  • LastEdited,

etc...

Anytime they are not unique, you could end up with this problem. So, I reckon we should fix this, for the following reasons:

(1)
I am sure that most default_sort entries, such as SiteTree do not sort by a database enforced unique key (as per above). If any of them are displayed in a paginated list (e.g. a gridfield, search results, filtered list, etc... etc... ) there is a real chance that some will show up twice on different pages and that other items will never show.

(2) There is an easy fix - add an ORDER BY BaseTable.ID to the end of all SQL Sort statements

(3) The danger with this bug is that no one will notice until it has real life consequences. There is no error, these lists seem to operate fine, etc...

(4) A perfect developer should remember to add the Sort By ID to the end of all sort statements that do not include an unambigious field - but if
(a) you always need to add this no matter what, AND
(b) few developers actually do this in practice -
then it appears to me that the Silverstripe Core should take care of this.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 8, 2017

I could replicate this behaviour on my lappy:

+-------------------------+-------------------------+
| Variable_name           | Value                   |
+-------------------------+-------------------------+
| innodb_version          | 5.7.19                  |
| protocol_version        | 10                      |
| slave_type_conversions  |                         |
| tls_version             | TLSv1,TLSv1.1           |
| version                 | 5.7.19-0ubuntu0.16.04.1 |
| version_comment         | (Ubuntu)                |
| version_compile_machine | x86_64                  |
| version_compile_os      | Linux                   |
+-------------------------+-------------------------+

and on our server

+-------------------------+------------------------------------------------------+
| Variable_name           | Value                                                |
+-------------------------+------------------------------------------------------+
| innodb_version          | 5.6.34-79.1                                          |
| protocol_version        | 10                                                   |
| slave_type_conversions  |                                                      |
| tls_version             | TLSv1.1,TLSv1.2                                      |
| version                 | 5.6.34-79.1                                          |
| version_comment         | Percona Server (GPL), Release 79.1, Revision 1c589f9 |
| version_compile_machine | x86_64                                               |
| version_compile_os      | debian-linux-gnu                                     |
+-------------------------+------------------------------------------------------+

and here is my work machine (issue is also happening here):


+-------------------------+-------------------------+
| Variable_name           | Value                   |
+-------------------------+-------------------------+
| innodb_version          | 5.7.19                  |
| protocol_version        | 10                      |
| slave_type_conversions  |                         |
| tls_version             | TLSv1,TLSv1.1           |
| version                 | 5.7.19-0ubuntu0.16.04.1 |
| version_comment         | (Ubuntu)                |
| version_compile_machine | x86_64                  |
| version_compile_os      | Linux                   |
+-------------------------+-------------------------+

sunnysideup/ecommerce_merchants | allow many businesses to sell within one shop. | 0.00 per day
sunnysideup/ecommerce_modifier_example | provides an example of a modifier to help you create your ow... | 0.00 per day
sunnysideup/ecommerce_mybusinessworld | This module adds a basic connection between silverstripe e-c... | 0.00 per day
sunnysideup/ecommerce_nz_connectivity | New Zealand Post - Client for the rate finder web API for Si... | 0.00 per day
sunnysideup/ecommerce_product_variation_colours | add colours to your silverstripe e-commerce product variatio... | 0.00 per day
sunnysideup/ecommerce_repeatorders | silverstripe-ecommerce_repeatorders | 0.00 per day

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 8, 2017

Here is another test I ran:

Page:

    function requireDefaultRecords()
    {
        for($i = 1; $i < 400; $i++) {
            DB::alteration_message('Creating Pages: '.$i);
            $filter = ['Title' => 'Page No. '.$i];
            $page = Page::get()->filter($filter)->first();
            if(! $page) {
                $page = Page::create($filter);
            }
            if($i > 10) {
                $parentPage = Page::get()->filter(array('ID:LessThan' => 10))->sort('RAND()')->first();
                if($parentPage) {
                    $page->ParentID = $parentPage->ID;
                }
            } else {
                $page->ParentID = 0;
            }
            $page->writeToStage('Stage');
            $page->publish('Stage', 'Live');
        }
    }

Page_Controller:

    public function UnpaginatedPages()
    {
        return Page::get();
    }

    public function PaginatedPages()
    {
        $list = Page::get();

        $list = new PaginatedList($list, $this->getRequest());
        $list->setPageLength(395);
        return $list;
    }

Page.ss

    <% loop PaginatedPages %>
        <li>$Title</li>
    <% end_loop %>

    <% if $PaginatedPages.MoreThanOnePage %>
        <% loop $PaginatedPages.Pages %>
            <% if $CurrentBool %>
                $PageNum
            <% else %>
                <% if $Link %>
                    <a href="$Link">$PageNum</a>
                <% else %>
                    ...
                <% end_if %>
            <% end_if %>
            <% end_loop %>
    <% end_if %>

Pages show up in random order rather than the order they were added because the Sort value for all pages is 1:
image

Having all the Page.Sort values at 1 means that it is very likely this bug occurs.

@unclecheese
Copy link

unclecheese commented Aug 8, 2017

(2) There is an easy fix - add an ORDER BY BaseTable.ID to the end of all SQL Sort statements

I really think this is dangerous scope creep for the ORM, and it sets a really bad precedent. The ORM is an abstraction layer between you and SQL. It's primary purpose is to make querying more succinct, more secure, and more declarative. It is not the ORM's job to fix your bugs. It should be very possible to write bad queries with the ORM, and sorting by something you know to have a singular value is one of those areas.

(a) you always need to add this no matter what

Personally, I've never had to do this. Keep in mind, you're testing a very extreme case, where all the values of a column are equal, and in any real-world scenario, it would be apparent to the developer that he was sorting on a column with uniform values. If, by contrast, I sort by Title ASC, and 2/100 records have the same title, I'm not sure I've ever noticed any consequences for the logic MySQL uses to resolve that conflict. If I did, I'd add a second column.

@tractorcow
Copy link
Contributor

tractorcow commented Aug 8, 2017

The best practice, in these scenarios, is to encourage developers to implement composite default_sort values for their models, if they are aware that their sort column is not unique. I concur with @unclecheese on this.

E.g.

private static $default_sort = '"BaseTable"."Sort" ASC, "BaseTable"."ID" ASC';

These can also be specified via array syntax.

private static $default_sort = [
	'Sort' => 'ASC',
	'ID' => 'ASC'
];

I believe the array syntax supports ORM to SQL column mapping under the hood, so that might be preferable to raw SQL fragments. :)

Shall we make the outcome of this problem a documentation update?

@flamerohr
Copy link
Contributor

I agree, documentation to encourage this would be the best outcome for this...

I don't believe enforcing an ID sort at the end of all queries which uses a sort is a good idea.

The SiteTree.default_sort also not having ID doesn't mean it's best practice :) in fact if anyone created a pull request to add ID in for SiteTree, I think it'd be a welcomed change.

@sunnysideup
Copy link
Contributor Author

The best practice, in these scenarios, is to encourage developers to implement composite default_sort values for their model

@tractorcow

What about SiteTree.Sort and Blog:

https://github.com/silverstripe/silverstripe-blog/blob/master/src/Model/BlogPost.php#L134-L139

Both models are bound to have non-unique sort values.

@tractorcow
Copy link
Contributor

We could add ID to those models exclusively, but that's still preferable to baking it into datalist for all classes globally.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 8, 2017

example 1:

On an e-commerce website we offer discount vouchers. We had set the default_sort to EndDate DESC, StartDate DESC as that made sense at the time. In the modelAdmin that worked great when we tested it.

Then the client adds a whole bunch of discount vouchers with the same start and end dates.

When he is trying to delete them, there are around 8 / 46 that are not showing up in the ModelAdmin (and another 8 that show up twice) ...

example 2:

Member List for Club, Ms Smith can not find herself in list, calls admin, admin finds her on page three of list, sends her a screenshot and a link to page three. Ms Smith can not find herself on that page (there are a few Smith people, list is sorted by last name).

What can we do?

I am not married to a solution, but it seems to me that this is definitely something we should improve for our users because they have much to loose and nothing to gain from Mysql seemingly random sorts.

I can see the following options:

  1. add ID sort to all ORM get calls that do not use a non-unique Sort Field.

  2. add ORDER BY ID ASC at the end of any ORDER BY statements to all Mysql select statements. If we could work out when / what versions are affected then we could limit it to those versions?

  3. provide a yml option for (2)

  4. improve documentation about this

  5. add ID sort to PaginatedList and GridField pagination

@kinglozzer
Copy link
Member

SiteTree.Sort isn’t supposed to be unique, is it? The sort is relative to the parent page

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 9, 2017

@kinglozzer no, SiteTree.Sort is not supposed to be unique, but what I found is that when you sort by a non-unique field (e.g. SiteTree / BlogPost - and 100s of other DataObjects) then pagination (including GridField pagination) can do pretty crazy things as Mysql is rather unpredictable in its sortorder when there are non-unique values (i.e. it sorts correctly, but rows with the same values show up in a random order). This means that some items show up twice on separate pages and other items never show up at all.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 9, 2017

Here is my code as a repository:

https://github.com/sunnysideup/silverstripe-sort-test

I did some further testing. It does not happen with pages, it only happens with DataObjects.

@unclecheese
Copy link

unclecheese commented Aug 9, 2017

I think this is really getting into a philosophical discussion on what is the role of framework/ORM. That seems to be where we're divided. My thoughts:

If you assign a singular sort clause to a DataObject for a column that is not entirely unique, and it produces unexpected results, that's on you as the developer who wrote that code. To expect the framework to jump in and save you is undue, for the same reason the template parser doesn't close your </p> tags.

If we were to implicitly put ID ASC in the sort of every DataList, we would be punishing every developer who went about it the right way, and used unique columns, or composite clauses. There's a performance impact for having multiple sort columns, and it's nothing I would want to impose on every query without a full opt-in from the user. If I'm sorting on something almost guaranteed to be unique, like Created, I don't need or want ID in the mix as well, and if I have to write special code to tell the framework not to do that, that creates a really suboptimal developer experience.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 9, 2017

Here are some alternative ideas:

  • add sort by ID to GridField paginated lists?
  • add sort by ID to Paginated lists (optional)?

Also - Created is often not unique (e.g. import of data - import through third-party API - high volume sales).

I think firstly that the bug needs more research (exactly when does it happen and how / why).

@patricknelson
Copy link
Contributor

patricknelson commented Aug 11, 2017

That's GridField, however, which allows you to select sort columns. I was replying to this (emphasis mine):

... but sort by ID ASC/DESC should always be added as the last sort criteria, no matter what.

... and then clarified that this shouldn't happen at the ORM (or framework) level when sorting in general. When you're in a GridField, I think it might be fine to enable that feature (possibly by default) and then maybe allow disabling of it via some extra setting on your GridField instance, if it gets in the way.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 11, 2017

Hi Patrick,

Thank you for our reply. I think the first thing is really understand the issue, when does it happen, where does it happen, etc... ? I also jumped to the "solution" but I think the key is to understand the issue first. It does not seem to happen with all DataObjects, so I want to know, for example, what DataObjects are affected, when, etc... Its random presentation also makes it really easy to miss or not to worry about it where it may raise its ugly head when you least expect it.

Basically what I am saying: we are potentially presenting our users with bad data (some data multiple times, other data hidden) - this happens, for example, with GridField - out of the box. I think that is the crux. How we fix it is secondary (we may leave it to developers as suggested by core team). I want to fix it in all my sites, but I want to do it in a smart way.

It seems to me when you add DataObjects programatically (i.e. many at the same time) then it is more likely to happen. Also, I could not replicate it on Pages, why is that?

@unclecheese unclecheese reopened this Aug 12, 2017
@unclecheese
Copy link

unclecheese commented Aug 12, 2017

We've had a bit of a chat about this internally, and there's some consensus that a halfway measure to mitigate this is appropriate.

First, the condition of MySQL returning a non deterministic result set when ambiguous sort columns are used is absolutely a feature, not a bug. From the MySQL docs:

If multiple rows have identical values in the ORDER BY columns, the server is free to return those rows in any order, and may do so differently depending on the overall execution plan. In other words, the sort order of those rows is nondeterministic with respect to the nonordered columns. One factor that affects the execution plan is LIMIT, so an ORDER BY query with and without LIMIT may return rows in different orders.

While MySQL is providing the "expected result" (albeit non deterministic) in this case, it's fair to say DataList is not. As a implementor of the Sortable interface, DataList is expected to sort deterministically. Further, it is the job of the ORM to obscure the "how" from the user, and create the list in a consistent, uniform way, whether you're using MySQL, Postgres, etc, or just plain ArrayData.

All that said, to force ID into the sort is still too much magic, but ensuring default_sort is added seems like a reasonable halfway point. The problem is, with the direction being included as part of the default sort, it makes sort direction toggling (I.e. GridField column headers) unreliable. Therefore, a more sensible setting might be something more like default_sort_column, with null being allowed to disable forced deterministic sorts.

Thoughts?

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 12, 2017

Good thoughts ;-) I agree wit hall of that.

Here are a couple of thoughts I had:

  • limit without reliable order of records is useless (i.e. limit 10, 10 may return exactly the same as limit 0, 10).
  • database enforced unique fields are not a good option in silverstripe: unique index / indexes are error prone #6819.

Putting these two together ... whenever you use limit in an ORM call, you will have to sort by ID as the last sort fragment (e.g. turn ORDER BY "StartDate" DESC into ORDER BY "StartDate" DESC, "ID" ASC).

So, what I would recommend is: whenever we limit a recordset in an ORM call, we add sort by ID as the last sort phrase (with the ability to turn it off) by default. This is how I would do it:

When putting together the final SQL for a database query - IF:

  • limit is included AND
  • there is no sort by a database enforced unique field AND
  • if the database is MySQL (others???) AND
  • code does not specifically turns off this feature THEN
  • add sort by ID

Research questions:

  • why does this random sorting no occur on Pages - AFAIK?
  • does it also occur on other Databases (apart from MySQL)?
  • what is the time difference between SELECT * FROM Table AND SELECT * FROM Table ORDER BY ID - for a table with a few records and a table with a ton of records...? (I can test this)
  • is 4.0 also affected?
  • are there faster ways to ensure a reliable sort order other than adding ORDER BY ID
  • how we do we ensure that there is no ambiguous ID added (i.e. the baseclass / table is always added).

@unclecheese
Copy link

unclecheese commented Aug 13, 2017

So the answer to most of your questions/concerns is that it's not a bug or oddity with MySQL, or Postgres, or pages, or DataObjects. This is something that's encoded in the SQL standard. The server is free to return the rows in any order it wants when identical values are encountered. That means, potentially, all database platforms are affected, and the way that this issue may manifest itself in various contexts, be it with the application of certain clauses, implementation of / version of storage engines, or just plain phases of the moon, is irrelevant. The point is that the SQL standard in this case tells us the result is nondeterministic, and we have an API that promises to be deterministic, so we need to reconcile that conflict somehow.

I think if we were to add a default_sort_column setting, it would have to be applied to all queries -- not just those that use LIMIT. If you had page of results that was all-inclusive, with ambiguous sort, that URL now becomes non-cacheable, as the results could in theory be returned differently on subsequent page loads. The goal should be to make DataList deterministic, not to negotiate the random behaviour of your database platform.

@flamerohr
Copy link
Contributor

are we able to infer default_sort_column from getting the PrimaryKey type fields in DataObjectSchema::databaseFields()?

Probably a bit too much magic for my taste, but it's an idea to bounce around.

@patricknelson
Copy link
Contributor

patricknelson commented Aug 13, 2017

Since I think this is just in reference to DataObjects (where the default_sort_column setting will exist), the ID column should also always exist. Is it possible to change change a DataObject's primary key? I don't think so, but if it is, then maybe that'd be a concern.

@sunnysideup
Copy link
Contributor Author

How does it work in PHPMyAdmin?

I think we need to do more research. I found that only some tables return with duplicates between limited segments of the table, while other tables do not (e.g. SiteTree).

@tractorcow
Copy link
Contributor

A challenge I've considered with a default_sort_column is that there needs to be a way to tell a datalist to reverse it in the case where a non-default sort column (E.g. Title) is reversed, otherwise groups of rows with the duplicate Title column would not reverse. That makes the solution not able to be applied so transparently.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Aug 14, 2017

IMPORTANT

A:

SELECT * FROM `SiteTree` LIMIT 20, 10 

B:

SELECT * FROM `SiteTree` LIMIT 20, 10 ORDER BY `ID`

C:

SELECT * FROM `SiteTree` LIMIT 20, 10 ORDER BY `sort`

NOT sorting and sorting by ID are equally fast, but sorting by Sort is significantly slower ... see: #7272

This evidence shows that sort by ID vs no sort key has no performance impact and sorting by ID is faster, most of the time, than sorting by the default_sort.

@patricknelson
Copy link
Contributor

Yeah it's not just in sorting in ASC/DESC but also if/when you're using functions to sort (e.g. FIELD(...) or IF(...) and maybe some others, mentioned his above). I don't know what the solution to this would be since this is known behavior, generally speaking. You could always suggest people include a secondary sort column in the meantime (separate issue, update doc's). Depending on how users apply sort, it may be possible to infer the direction if you apply ID as the second column. Often time it's just a second parameter passed to ->sort('Sort', 'DESC'). It's when you get into custom sort queries that his coildnt be parsed reliably (you could try) but I think at that level the dev should have more control with less magic.

Ask: Yeah @sunnysideup, typically sorting by indexed fields is a bit faster than unindexed.

@chillu
Copy link
Member

chillu commented Sep 11, 2017

OK, there's about twenty screenpages of discussion here ;) If you care strongly about resolution, can somebody please summarise the state of the discussion with options going forward?

@unclecheese
Copy link

Summary:

  • When an ambiguous sort column is used, the database is free to return the rows in any order it wants, making the operation non-deterministic.
  • This makes pagination unpredictable (e.g. repeated entries across pages)
  • As an implementor of Sortable, DataList should sort deterministically.
  • Therefore, we should patch it
  • Forcing ID ASC into the query is too much magic. Something like default_sort_column would be more appropriate
  • The bug has existed at least since SS2, and no one raised it until now, so impact/low is fitting.

@patricknelson
Copy link
Contributor

patricknelson commented Sep 11, 2017

Forcing ID ASC into the query is too much magic. Something like default_sort_column would be more appropriate

I think we need to disambiguate the use (or the nomenclature) of default_sort_column since we already have default_sort which can define both one or more columns and their corresponding directions. As such, since we're addressing a specific problem (deterministic sorting) I think we should take a specific approach that makes it a bit more transparent, both in nomenclature and in implementation.

So, I'd suggest we do the following:

  • Utilize ID as our secondary sorting column failover. These are all DataObjects and, as such, all will contain this column.
  • Still add a new config option, but instead of default_sort_column (which is somewhat ambiguous in meaning), we use the name such as: sort_deterministic, sort_secondary_id, sort_failover or similar.
    • Note: I'd stay away from the default_sort_ prefix since we're not defaulting anything per se (as you'll see below). Instead, we're enabling/disabling a deterministic sorting algorithm at the database level, which (yes) does mean we have to inject extra fluff into the ORDER BY clause and (yes) this setting is enabled by default. However, this is a boolean (noted below).
  • Per above (and discussion): This should be an on/off setting and this should default to true.
  • Direction: We can automatically adjust the direction of secondary ID sort failover by defaulting to ASC in our algorithm and then parsing the existing order setting in our query prior to final execution for a direction and, if not blank/ASC, we can then set it to DESC.
  • Disabling/overriding: This can be done one of either one or two ways (uncertain on second):
    • Override the sort_deterministic setting and set it to false.
    • Parse final order clause for any secondary sort fields and, if they exist, remove.
      • NOTE 1: I'm unsure if this is necessary, as you can have as many additional sort columns as you like (as far as I'm aware) and a developer with the confidence of adding a secondary sort column may already have a very unique sort, so adding ID may be redundant for several reasons but not necessarily negative or have an impact on performance.
      • NOTE 2: Then again, if we are parsing the first sort column for direction, this is an argument for disabling once we see two columns, since we can't reliably determine what direction to sort in if we have multiple columns already provided, can we?
      • At minimum, if we retain ID even if a secondary column is manually provided, we may still want to do a quick de-dupe check to ensure this ID column isn't already specified (to prevent accidentally overriding a contrary direction, e.g. DESC instead of ASC).

Beyond the uncertainty of those two notes above on automatically disabling, I think we'd be set. Thoughts?

@flamerohr
Copy link
Contributor

I agree with @patricknelson that having another config setting called default_sort_something is ambiguous and conflicts with the existing default_sort, but the idea is the same - to have a failover column to enable deterministic sorting.
Having ID as a default for whatever this setting is called is a good idea, as all DataObjects have this (maybe rare exceptions).

If the idea can be kept simple such as a single config private static $determined_sort = ['ID' => 'ASC']; and that can be extended/disabled as desired by developers through php or yml that would be good - yes you can set a config to false or null in yml in ss4.

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Sep 11, 2017

I think some research may be useful:

  • what are the consequences of adding a sort by ID ASC to all sql statements (after any other sort statements - if any), so that you always have a predictable outcome? This would be my solution and apart from performance reductions, I can't see any other reason why this is not a good solution. [We may consider only doing this lists that are "limited").

  • I noticed that dataobjects do sort randomly but pages do not and I wonder why that is. It would be worthwhile to work out why (maybe to do with joins?)

Also - while no one has noticed this issues, this is not a benign one. At least two of my clients noticed it (and were unable to edit data because of it) in gridfields (where I had set the sort order to a field that was something like a date field).

@sunnysideup
Copy link
Contributor Author

sunnysideup commented Sep 14, 2017

Adding an index to SiteTree.Sort should be a no-brainer AFAIK because it will make all retrievals of Pages faster (whenever there is no custom sort) - see: https://dba.stackexchange.com/questions/11031/order-by-column-should-have-index-or-not.

@patricknelson
Copy link
Contributor

patricknelson commented Dec 21, 2017

Actually, what if we just had a default_sort_secondary field? This could not only be consistent with the existing default_sort field, but can be automatically defaulted to ID ASC (or be empty to retain existing functionality). Let me explain some scenarios and why I think this might work.

Functionality:

  • Default value for default_sort_secondary is ID ASC.
    • EDIT: Alternatively, this could be left empty and when default_sort_secondary is not defined (or no second column is defined in default_sort) we retain existing functionality.
  • Behind the scenes, the secondary column and direction are parsed separately before being handed to the database. This way, we can handle scenarios when the user begins to customize sorting. The sort direction is just an initial value that we start with and then negate when the user negates the initial direction. That way if the user ever selects another primary sort column or ever chooses to negate the direction, we just negate our initial direction. No magic/mystery to this, just keep it logical. So, we're just shifting perspective. Logically, if you reverse the current sort, the secondary sort must also be reversed as well for consistency.
  • If default_sort is defined with a secondary sort value, that secondary sort will be used. e.g. Date ASC, Title ASC will imply a default_sort_secondary value of Title ASC.
  • Conflict resolution: If current sort column (e.g. Title) is equal to the above resolved secondary sort, secondary sort fails over to the ID column (since we know ID will always be unique).

This satisfies:

  • Nomenclature consistency (default_sort_secondary as a natural companion to default_sort). Useful for understanding internally how sorting is being handled and etc.
  • Sane initial defaults which are less magical (e.g. if the user sets up a secondary sort column in default_sort, this is parsed/retained in default_sort_secondary and functionality is extended).
  • Logical, consistent and deterministic results for pagination.

What do you guys think about that? Am I missing any scenarios or edge cases, possibly?

EDIT: Also, I propose that if we do this at all, at minimum it should be performed for GridField's because this is where it keeps arising for me (particularly due to the automation and framework level handling of user controlled sorting, making it difficult to modify/access without changing core code). That's only if we're not comfortable adding this to every SELECT ... SQL query.

@sunnysideup
Copy link
Contributor Author

Any movement ?

@sminnee sminnee changed the title BUG: pagination in GridField does not work if DataObject default_sort field values are not unique [GRIDFIELD] BUG: pagination in GridField does not work if DataObject default_sort field values are not unique Oct 6, 2018
@maxime-rainville
Copy link
Contributor

Just skim read through this delightful thread. While the suggested feature would provide some benefit for some very specific use cases, it sounds like it's somewhat low value for most common scenarios. Those people who would benefit from this have a relatively easy alternative (just manually include a ID sort in their queries).

My instinct would be to close this issue has a "won't fix" since there's no clear agreed upon action to take.

@patricknelson
Copy link
Contributor

patricknelson commented Aug 11, 2020

Reread this (skimmed). That would be fine with me as it's still possible (as the developer) to manually define the secondary sort column via ::$default_sort. However, how do you (as SilverStripe) conclude this and resolve @unclecheese's point that DataList is intended to be deterministic, per #7249 (comment)?

That is ultimately what I think this issue morphed into, which makes it more broad and less about a specific use case like you'd find just in GridField pagination (or whatever). It's probably still low priority since while it may occur frequently, it's easy to fix (i.e. via best practices @tractorcow mentioned), but would still represent an important low level change, considering the breadth of it.

So to summarize: Is DataList deterministic, despite the underlying database layer or not?

@jonom
Copy link
Contributor

jonom commented Jun 30, 2023

FWIW I experienced this bug on SS4 today with 'out of the box' models - in the Security->Users section. A site had many members with no first or last name, so I got repeat/missing members on subsequent pages. Looks like it would remain an issue on SS5.

Putting aside the bigger debate for now, quick win would be to fix up all core models to ensure consistent sorting (e.g. add ID and maybe Email to Member default_sort), and maybe update the docs to flag this issue as suggested previously?

Edit: I guess that would only help for literally the 'default sort' case. So if I applied a sort on first/last name in my example by clicking a column header, the issue would come back because ID would no longer be included in the sort clause.

@jonom
Copy link
Contributor

jonom commented Jun 30, 2023

Think the scope is bigger than the title suggests as it's not just GridField (it's all DataLists) and not just about default_sort. Even if you fix your default_sort to include ID, that won't come for the ride if you choose a different sort by using the sortable headers that are baked in to Silverstripe CMS.

Another pain point is that it doesn't appear that you can sort on multiple fields with the GraphQL module. sort expects an object instead of an array so passing multiple fields already wouldn't let you specify which field should come first, but in any case if you do it seems that only the field name that's alphabetically last is used for the sort. So overall this doesn't appear to be an issue that developers can easily workaround by tweaking a few sort clauses in their own code. Edit: this looks at least partially to be a bug.

@sunnysideup
Copy link
Contributor Author

BUMP BUMP BUMP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants