Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add worker tasks to the agent. #1393

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Add worker tasks to the agent. #1393

wants to merge 4 commits into from

Conversation

leftwo
Copy link
Contributor

@leftwo leftwo commented Aug 15, 2024

Added more tasks to the crucible-agent.

We now spawn up to 5 tasks to handle a region creation operation. This includes a clone operation where we create an empty region, then fill it by copying from another running downstairs.

I attempted to do this in a way that preserved much of how the worker tasks operated, instead of refactoring it.
I moved what was in the main worker() loop for region create into a spawned task, it should be the same set of operations that existed in the previous version.

The existing doorbell mechanism used in the first_in_states()loop is also used to indicate that a worker task has finished. first_in_states()` will also now check for jobs that have completed and remove them from the work queue.

To keep SMF happy, and to not clobber ourselves, I put a lock around the SMF update. This seemed simpler then trying to break apart apply_smf() into the various things it does and then try to only call the part I need when I need it.

Other actions besides region creation will still be processed by the agent's current worker task.

Updated agent-antagonist test to print a little less in some places and a little more in other.
Because agent-antagonist will go much faster than Nexus (and has fewer restrictions) I added a loop in the clone path that will try to connect to the source endpoint and verify the remote downstairs is responding before trying to clone from it. I don't believe this kind of check is needed in Omicron.

agent-antagonist now has a new test that just tests downstairs cloning operations in a loop.

I added a "work queue" API endpoint to the agent. I'll come back when this lands in Omicron and add that to omdb.

This enables crucible agent to have more than one (currently 5)
outstanding region creation operations while still handling creation
and deletion of snapshots (SMF services) and deletion of regions.

Updated agent-antagonist test to print a little less in some places
and a little more in other.

agent-antagonist now has a new test that just tests downstairs cloning
operations in a loop.
Copy link
Contributor

@jmpesp jmpesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some changes needed:

  • adding work to the queue has to be idempotent, as the agent is called from a saga context
  • what happens when there is work on the queue and someone tombstones the region? does that tombstone serialize behind the region creation job?

// room on the work queue. Otherwise, they remain
// requested until we can service them.
if r.state == State::Requested {
assert!(!outer.work_queue.contains_key(&r.id));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this assert shouldn't be required, this function should still hold the lock on outer and we just checked contains_key

@@ -69,7 +84,7 @@ fn command(log: &Logger, bin: &'static str, args: &[&str]) -> Result<String> {
info!(log, "{} {:?} took {:?}", bin, args, elapsed);

if !cmd.status.success() {
bail!("zfs list failed!");
bail!("command: {} {:?} failed {:?}", bin, args, cmd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where's the dunce emoji

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear I meant myself! :o

name.clone(),
rs.clone(),
);
if outer.work_queue.contains_key(rid) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this branch, nothing currently does work on running snapshots right?

@@ -400,15 +404,17 @@ where
*/
let expected_downstairs_instances = regions
.iter()
.filter(|r| r.state == State::Created)
.filter(|r| r.state == State::Created || r.state == State::Requested)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right - if I remember correctly this would start a downstairs process for a region that either hasn't had its dataset created yet, or could still be cloning.

.map(|r| format!("{}-{}", downstairs_prefix, r.id.0))
.collect::<HashSet<_>>();

let expected_snapshot_instances: Vec<String> = running_snapshots
.iter()
.flat_map(|(_, n)| {
n.iter()
.filter(|(_, rs)| rs.state == State::Created)
.filter(|(_, rs)| {
rs.state == State::Created || rs.state == State::Requested
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, I don't think this is right, as it would start a downstairs process for a snapshot's .zfs/snapshot/... directory when that might not exist yet

@@ -799,6 +805,10 @@ where
}
crucible_smf::CommitResult::OutOfDate => {
error!(log, "concurrent modification?!");
panic!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why panic here and not in the other check for concurrent modification?

// that does not already have a job on work queue, so
// now we will create that job and spawn a task for
// it to do the work in.
let log0 = log.new(o!("component" => "worktask"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this log context should also have the region id as a tag / whatever slog calls it :)

@leftwo
Copy link
Contributor Author

leftwo commented Oct 10, 2024

adding work to the queue has to be idempotent, as the agent is called from a saga context

Do you mean calling the agent to take an action has to be idempotent? If we call the agent several times it should only do the work once?

what happens when there is work on the queue and someone tombstones the region? does that tombstone serialize behind the region creation job?

It should, yes (though I'll verify).

@jmpesp
Copy link
Contributor

jmpesp commented Oct 10, 2024

Do you mean calling the agent to take an action has to be idempotent? If we call the agent several times it should only do the work once?

Yeah - if Nexus sends the same request, it should receive the same answer, polling until the state changes to Created. The agent should only do the work once

It should, yes (though I'll verify).

I thought it should as well, yeah

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants