Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FIX isInDB actually checks if the object is in DB #7799

Closed
wants to merge 2 commits into from

Conversation

dhensby
Copy link
Contributor

@dhensby dhensby commented Jan 24, 2018

Both DataObject::exists() and DataObject::isInDB() claim to check if the current object is in the DB yet neither do so and both provide a slightly different way of determining if the object does in fact "exist in the DB".

This change introduces checking of the DB when checking the DB for an object's existence.

@kinglozzer
Copy link
Member

This seems far less performant than the existing check, is there a bug this fixes?

@dhensby
Copy link
Contributor Author

dhensby commented Jan 24, 2018

Clearly there's an impact in checking the data source for whether an object exists - however the assumption that "if a DO has an ID it exists in the DB" is deeply flawed and (unfortunately) one that is prevalent amongst our code which I'd like to slowly work on removing

@kinglozzer
Copy link
Member

Hypothetical example: dataobjects which may or may not have an image attached.

<% loop $Objects %>
    <% if $Image %>{$Image.Fit(100, 100)}<% end_if %>
    {$Title}
<% end_loop %>

This change is going to result in one extra database query for each dataobject that has an image, right?

I’m a little hesitant about this one, as far as I’m aware we’ve not seen any real-world issues caused by this behaviour. How about meeting halfway - leave exists() to check for ID and update the docs for that method, but update isInDB() to actually check if the object is in the database?

@tractorcow
Copy link
Contributor

I'm extremely nervous about this change, as we use isInDB() in many many places, and there is going to be a serious performance cost here.

The "is actually in the DB" Is meant to be maintained by deleting the object having it's ID reset.

The extra API (which ignores things such a stage, extra tables, etc) feels awkward to have to remember to invoke when doing low level DB methods. E.g. using SQLDelete() to directly delete rows won't automatically update.

I would vote to just close this I'm afraid.

If you can show an example of where we absolutely need a real DB check, I would probably recommend a solution based on that specific example instead.

@dhensby
Copy link
Contributor Author

dhensby commented Jan 25, 2018

we use isInDB() in many many places and there is going to be a serious performance cost here

It hasn't introduced a significant impact from my limited testing. I've found that a default page load on a vanilla install doesn't invoke the SQL query once.

I'm not doubting that checking the DB for existence of a record will introduce an overhead, but to argue that we should just avoid checking the DB when claiming to check the DB on performance grounds seems misguided to me. What other parts of the framework should we stop actually doing what it says because it's faster to "guess"?

The "is actually in the DB" Is meant to be maintained by deleting the object having it's ID reset.

We do nothing to enforce that the ID can't be set by user code, so it's not in any way reliable way of determining if the object exists.

The extra API (which ignores things such a stage, extra tables, etc) feels awkward to have to remember to invoke when doing low level DB methods. E.g. using SQLDelete() to directly delete rows won't automatically update.

Using low-level APIs will always be awkward in terms of clearing cache stores. At the moment if you do DataObject::get_by_id(Page::class, 1); then run an SQLDelete to delete the page DataObject::get_by_id(Page::class, 1); will still return an object and it will still return true for isInDB (unless you remember to clear the caches) - which would also clear the isInDB cache too...

If you can show an example of where we absolutely need a real DB check, I would probably recommend a solution based on that specific example instead.

TBH, if the answer is "don't use our methods that claim to check the DB because they don't do what they say" then that's an argument for deprecating these methods

@dhensby
Copy link
Contributor Author

dhensby commented Jan 25, 2018

I think I'm going to propose an RFC on this because there is clear flaws to assuming that an ID != 0 means presence in the DB (especially as anything can set the value of the ID field) and there are further changes I want to make that all come under a similar umbrella. This may also be too radical for a 4.x release and instead focused on 5.

But there is clearly a mismatch in expectations here - isInDB does nothing to check an object is in the DB, it checks theres a numeric ID > 0 - it should be called hasNumericIDGreaterThanZero(). DataObject::exists() is equally misleading as the documentation states:

The default behaviour for a DataObject is to return true if the object exists in the database, you can override this in subclasses.

The exists() and isInDB() methods also have slightly different implementations that result in different behaviours and (ultimately) misreport presence in the DB.

$page = new Page();
$page->ID = "Oh looks, this is an ID";
$page->isInDB(); // false
$page->exists(); // true
$page->ID = 9999;
$page->isInDB(); // true
$page->exists(); // true

@tractorcow
Copy link
Contributor

Using low-level APIs will always be awkward in terms of clearing cache stores. At the moment if you do DataObject::get_by_id(Page::class, 1); then run an SQLDelete to delete the page DataObject::get_by_id(Page::class, 1); will still return an object and it will still return true for isInDB (unless you remember to clear the caches) - which would also clear the isInDB cache too...

We should ensure that any ORM method for selecting / deleting behaved consistently. We can't protect against raw SQL queries, which is why we don't recommend them if possible.

@dhensby
Copy link
Contributor Author

dhensby commented Jan 26, 2018

exactly ;)

/**
* @param DataObject $object
*/
public function unregister_object($object)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can(should?) be marked as static too?

@tractorcow
Copy link
Contributor

How about isInDB(true) to force a hard check, but default to false?

@sminnee
Copy link
Member

sminnee commented Oct 12, 2018

There's definitely still a place for a method that checks "has this DataObject been written" without actually querying the database. I agree that checking whether ID is set is weak (your other RFC covers why) but we can still remember that the record has been written and return true for the remainder of the PHP request.

We could change this to hasBeenWritten(), or maybe we just change the API docblock? Or we could have two different methods in one, with a binary flag, as Damian suggests.

@tractorcow
Copy link
Contributor

->exists() will return false for any archived items; Which may seem logical at first, but if you are doing a loop over versioned records in a template, for example, that'll suddenly make that loop empty.

<% if $Item %> internally does ->exists() on the underlying object.

}
$class = $this->baseClass();
if (!isset(self::$_cache_in_db[$class][$this->ID])) {
$sqlSelect = new SQLSelect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only way that you can get to this code is if you have manually set the ID and not yet written the record. So, this code is picking up the case when you've manually set the ID to an existing record.

We have the option of simply returning false in this case. If we did this, the functionality would match the "has been written" logic that I've recommended. It would remove the cause for performance / logic concerns that others have had.

My view would be that checking the database for the presence of an ID that has been manually applied is a narrow & different use-case from the current uses of isInDB(). In many cases where this happened it's not so much "this record is in the database" as "I am about to corrupt my database". Code responsible for ID collision would need some careful and separate thought.

So I think that should simply return false in this case, and the update the docblock of isInDB() to

Returns true if this record was read from the database or has been written to it.

Note that it does not query the database for the presence of a manually-applied ID.

Note also that, after this change, the static registry can probably be replaced with an instance-specific private boolean $isInDb variable.

@robbieaverill
Copy link
Contributor

@tractorcow

->exists() will return false for any archived items; Which may seem logical at first, but if you are doing a loop over versioned records in a template, for example, that'll suddenly make that loop empty.

That seems OK, but I think that isInDB() would ideally return true for archived objects. Currently it behaves the same way as exists(). I had an issue recently with unexpected behaviour on archived items because ->isInDB() was false. It is in the database though... Thoughts?

@maxime-rainville
Copy link
Contributor

Doesn't look like a consensus has been reached. Shall we close this for now?

@patricknelson
Copy link
Contributor

patricknelson commented Dec 14, 2019

Just wanted to note here that I also just recently created an issue about this here: #9349

Keep in mind that the issue itself calls out the semantic duplication in the API without diving much into the details of how it should be implemented (albeit it's hard to untangle the two). IMHO, we should:

  1. Point ->exists() to ->isInDB (allow them to coexist): This is because not only do they currently function nearly the same way (notwithstanding the differences noted above by @dhensby, which only adds to the confusion on the API) but to simplify the API for consistency (i.e. they do the same thing) and then clearly document why they both exist. Specifically: For DataObject's, what does it really mean to "exist"? To me, that means it is in the database. For a File (also a DataObject), it exists in the database and on the filesystem, at which point File would override ->exists() and add the extra check to validate presence of a file as well. In this way, the API retains semantic meaning and consistency for descendant classes. I do concede, however, that this duality will only have meaning in a minority of cases (largely just File and it's descendants).
  2. Actually hit the DB when checking: Hear me out; the main reason for this is because accessing the database generally these days is an extremely fast operation, particularly if it's the same query done frequently. This is because caching can be done at the database level. However, not doing this will require caching at the framework level using global/static state which can get complicated, particularly if something happens during the current page execution that isn't properly accounted for (e.g. direct DB queries, race conditions or transactions not committed/rolled back). That is: I believe that asking "Is this in the database?" should actually tell me that it's in the database. I think the penalty for getting that wrong (especially in terms of legacy and ongoing maintenance) could be higher with mistakes made or issues in debugging than the time-penalty paid by issuing a simple SELECT ID FROM BaseTable WHERE ID=123.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants