Skip to content
This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

rgw/sfs: honor retry_raced_bucket_write mechanism #240

Draft
wants to merge 1 commit into
base: s3gw
Choose a base branch
from

Conversation

giubacc
Copy link

@giubacc giubacc commented Oct 26, 2023

Updating bucket's metadata concurrently by two or more threads is allowed in radosgw.
There is a retry mechanism: retry_raced_bucket_write(), that expects the bucket references to fetch the latest data from the persistent store.
rgw/sfs driver didn't implement try_refresh_info() in its bucket class definition; this could cause two references to the same bucket to potentially lead to partial metadata updates.

Fixes: https://github.com/aquarist-labs/s3gw/issues/637

Checklist

  • Tracker (select at least one)
    • References tracker ticket
    • Very recent bug; references commit where it was introduced
    • New feature (ticket optional)
    • Doc update (no ticket needed)
    • Code cleanup (no ticket needed)
  • Documentation (select at least one)
    • Updates relevant documentation
    • No doc update is appropriate
  • Tests (select at least one)

@giubacc giubacc force-pushed the retry-bucket-raced-metadata-write branch from 99d28ed to 8e46b3a Compare October 26, 2023 15:31
@giubacc
Copy link
Author

giubacc commented Oct 26, 2023

@jecluis I need some help to properly configure the formatter in vscode, it seems correct ... but I'm struggling a bit understanding why it keeps formatting a bit not in the right way 😓

@jecluis
Copy link
Member

jecluis commented Oct 26, 2023

@jecluis I need some help to properly configure the formatter in vscode, it seems correct ... but I'm struggling a bit understanding why it keeps formatting a bit not in the right way 😓

Alright. Ping me tomorrow and we can set up a call to figure out what's going on.

@jecluis jecluis requested review from jecluis and irq0 October 26, 2023 18:15
@jecluis jecluis added kind/enhancement Change that positively impacts existing code area/rgw-sfs RGW & SFS related labels Oct 26, 2023
@giubacc giubacc requested review from 0xavi0 and tserong October 27, 2023 10:32
@giubacc giubacc force-pushed the retry-bucket-raced-metadata-write branch from 8e46b3a to ce2bf84 Compare October 27, 2023 12:25
Copy link
Member

@tserong tserong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'm not so familiar with this part of the code, so could do with approval from someone else.

src/rgw/driver/sfs/bucket.cc Outdated Show resolved Hide resolved
@@ -534,11 +534,39 @@ int SFSBucket::abort_multiparts(
return sfs::SFSMultipartUploadV2::abort_multiparts(dpp, store, this);
}

/**
* @brief Refresh this bucket's rep with the state obtained from the store.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 'rep' mean?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something coming from the days at uni, apparently in Pisa all professors kept referring to an object's representation with the term: rep.
An object rep refers generically to how an object type keeps organized its state. The same interface could, normally, be represented with different reps.
Eg: a Map could be implemented with a Tree rep or with a Hash rep depending on the usage that makes one implementation more fit with a certain workload.
I always liked that term, so I tend to use that; I dunno how much could be standard anyway :).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only references I find about "object representation" is about how bytes are organized (see reference). I don't think this translates to the way you are using the term, so maybe just drop it and replace that with something less confusing?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, this can be definitely bad to use.
What about simply use:

Refresh this bucket object with the state obtained from the store.
Indeed it can happen that the state of this bucket is obsolete due to
concurrent threads updating metadata using their own SFSBucket instance.

// maybe excessively safe approach: instead of directly assign the returned
// value to this.bucket let's check first if bref is valid.
if (!bref) {
lsfs_dout(dpp, 0) << fmt::format("no such bucket! {}", get_name()) << dendl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might have been a concurrent bucket delete + GC, right? This would mean this isn't 'excessively safe', just normal way of doing business. -ERR_NO_SUCH_BUCKET sounds sensible to me, logging an error not (because this is normal behavior). Suggest to remote the message or make it debug + improve it. There isn't much to debug on with 'no such bucket! NAME' in hand.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there could be a concurrent delete, I will remove the comment.
I think the "error" logging message has its own dignity to exist, maybe lowering at warning level, this could help to spot a client bug in its application logic (a client issuing a concurrent delete + update metadata on the same bucket could smell a bit) .
How would you improve the message anyway (what fields dump)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From our perspective it is hard to tell if this is a client issue or normal behavior. There might be concurrent clients doing this operation. It is not our job to log warnings for legal but possible weird client behavior. If someone wants to debug this, there is the operation log feature and we already log that the bucket doens't exists with the standard request log that includes the error return. A "no such bucket" warning is redundant at best.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will remove that.

}
bucket = bref;

// update views like we do in the constructor
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is like we do in the constructor this screams refactoring - we should not duplicate code that updates the in memory view.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree! will improve

auto ceph_context = std::make_shared<CephContext>(CEPH_ENTITY_TYPE_CLIENT);
ceph_context->_conf.set_val("rgw_sfs_data_path", getTestDir());
ceph_context->_log->start();
auto store = new rgw::sal::SFStore(ceph_context.get(), getTestDir());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test dir cleanup code missing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies, what do you mean with that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code above creates a new store at getTestDir, but never deletes it. See the other unit tests teardown hooks, there is usually a delete in there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but isn't this function supposed to be called for every test:
void TearDown() override ?
am I missing something here?

Comment on lines 1609 to 1610
// obj lock structure in memory cannot be equal at this point in memory for the
// two b1 and b2 objects; this simulates two threads updating metadata over the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cant they be equal?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per test construction they cannot be equal, b1 memory is modified with merge_and_store_attrs() and b2 memory is modified with put_info().
Maybe better change in the comment the b1 and b2 aliases with their correct names ( bucket_from_store_1 and bucket_from_store_2) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is worth adding a bit more of an explanation in this case. What you just replied on this thread, for instance.

* @brief Refresh this bucket's rep with the state obtained from the store.
* Indeed it can happen that the rep of this bucket is obsolete due to
* concurrent threads updating metadata using their own SFSBucket instance.
*/
int SFSBucket::try_refresh_info(
Copy link
Member

@irq0 irq0 Oct 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be only half the solution. I didn't dig too deep, but hope you know more :)

If I inline and simplify the logic a bit we have:

int retry_raced_bucket_write(const DoutPrefixProvider *dpp, rgw::sal::Bucket* b, const F& f) {
  auto r = f();
  for (auto i = 0u; i < 15u && r == -ECANCELED; ++i) {
    r = b->try_refresh_info(dpp, nullptr);
    if (r >= 0) {
      rgw::sal::Attrs attrs = s->bucket->get_attrs();
      attrs[RGW_ATTR_TAGS] = tags_bl;
      r = s->bucket->merge_and_store_attrs(this, attrs, y);
    }
  }
  return r;
}

What if something goes state between try_refresh_info and merge_and_store_attrs? Is merge_and_store_attrs responsible to return -ECANCELED when it detects staleness? Can we detect merge conflicts there?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I have to say the truth, I don't think the retry_raced_bucket_write can solve the problem for all cases, it can only mitigate the problem.
Imagine 2 physical threads A and B and a bucket with an initial state S0, they both race and run true-parallel on their own CPU core the try_refresh_info function; at this point for both A and B the state of the bucket in memory is S0, they both think the S0 state is the latest one and they happily update the bucket both with their new data.
Now, because in any decent implementation, an update will be done mutual exclusively, A or B will run a portion of critical code in a serialized way, so in the end, either the state of A or the state of B will prevaricate on the other.
Is this an error? I don't think so, it is part of the game when you allow a bucket to be modified concurrently by more than 1 physical thread.
At a certain point, a thread will have to say: "Can I safely assume this state in my cached memory is good enough?"; it doesn't matter how much far you go checking this, at a certain point you simply must go on.
This retry_raced_bucket_write mechanism can work well with logical threads mapped over 1 physical thread.
In this case they can always be serialized in some way; in this case the retry_raced_bucket_write solves something it is correct to solve because the 2 logical threads run over a single core as if they where a single logical sequence of statements.
This is the reasoning I've done myself trying to figure out what were the original "meaning and purposes" of the original authors of this code.
Let's always keep in mind that the actual Truth, until we are working downstream, and dealing with parts developed upstream will be hard to be completely clarified; we would need the support of the original developers to really understand what was the aim of something.
The best we can do now is to hope that our guess is good enough :) .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume we are in s->bucket->merge_and_store_attrs(this, new, y);

State we have: database state db, our current copy in bucket current and the update in new.
Caller based new on current fetched from db.

We must transition from db to new. Right now we transition form current to new. This is a bug, because current may be stale. This is the case if current != db.

If we make the update transaction conditional on current == db, we would transition from db to new in turn. If at time of the update current != db, we fail with -ECANCELED and let the retry logic compute a fresh update.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I think I got what you mean, thanks; I try to explain the limit case so we can discuss from that:
Let's assume both threads A and B reach update() concurrently.
Let's assume that update() is the function that flushes a state on the db and cannot be executed by 2 threads concurrently because mutex protected.

thread A

try_update_data()
	-> fetch_from_db() -> db_state_0
	->  current_state = db_state_0

update(current_state + new_A)

now bucket's state on db is db_state_A

thread B

try_update_data()
	-> fetch_from_db() -> db_state_0
	-> current_state = db_state_0

update(current_state_0 + new_B)

now bucket's state on db is db_state_B

final state on db: db_state_B

db_state_A != db_state_B

new_A is lost.

To avoid the scenario above, we should implement a mechanism inside the update() mutex protected function (and it must be only that way for the following check).
So update() should be something like:

int update(const State &current, const &new_delta)
{
  mutex.lock()
  defer mutex.unlock()

  db_state = _fetch_db_state()
  if current != db_state return -ECANCELED
  return _update(current+new_delta);
}

A good candidate for the update() function could/should be sfs::get_meta_buckets(get_store().db_conn) ->store_bucket(), correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be one layer deeper in the database update. Our Bucket code still keeps a shared copy of the database representation in a map, so this might be a bit more tricky with the database access and mutex around the map.

In essence I think we need something like 'update bucket set attrs = new attrs where attrs = cached attrs'

  • If that transaction didn't change anything, bail out and let the SAL layer update and retry.

Unfortunately I don't think can't say 'attrs = cached attrs', because both are ceph serialized blobs. A transaction that fetches, deserializes, compares and conditionally rolls back / commits may work.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done a patch in this direction that seemed promising, but in the end I stumbled in an issue that IMO requires comments from upstream, so I've opened: https://tracker.ceph.com/issues/63560

@giubacc giubacc force-pushed the retry-bucket-raced-metadata-write branch from ce2bf84 to 1846c13 Compare November 2, 2023 10:10
@giubacc giubacc marked this pull request as draft November 3, 2023 15:01
@github-actions github-actions bot added the needs-rebase Changes need a rebase label Nov 6, 2023
Copy link

github-actions bot commented Nov 6, 2023

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@giubacc giubacc added this to the v0.23.0 milestone Nov 13, 2023
@giubacc giubacc self-assigned this Nov 13, 2023
@vmoutoussamy vmoutoussamy added the priority/1 Should be fixed for next release label Nov 15, 2023
Updating bucket's metadata concurrently by two or more threads is allowed in radosgw.
There is a retry mechanism: retry_raced_bucket_write(), that expects the bucket references to fetch the latest data from the persistent store.
rgw/sfs driver didn't implement try_refresh_info() in its bucket class definition; this could cause two references to the same bucket to potentially lead to partial metadata updates.

Fixes: https://github.com/aquarist-labs/s3gw/issues/637
Signed-off-by: Giuseppe Baccini <[email protected]>
@giubacc giubacc force-pushed the retry-bucket-raced-metadata-write branch from 1846c13 to 06ac4ab Compare November 17, 2023 10:53
@jecluis jecluis modified the milestones: v0.23.0, v0.24.0 Nov 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area/rgw-sfs RGW & SFS related kind/enhancement Change that positively impacts existing code needs-rebase Changes need a rebase priority/1 Should be fixed for next release
Projects
Status: Blocked / On hold
Development

Successfully merging this pull request may close these issues.

[BUG] rgw/sfs: metadata update calls don't honor retry_raced_bucket_write mechanism
5 participants