bug(store): deleting compute graphs will now delete all dependencies #987

Merged

Member

@seriousben seriousben commented Oct 29, 2024

Some gaps exist in the delete compute graph code:

  • Diagnostic (exception/stdout/stderr) files were not deleted from blob storage.
  • Tasks were not deleted from storage.
  • TaskOutputs were not deleted from storage.
  • UnallocatedTasks were not deleted from storage.

This PR correctly deletes all these dependencies.
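
The cascade described above can be sketched against an in-memory model. This is not the PR's code: the `Store` struct, `drain_prefix`, and `delete_compute_graph` are hypothetical names, each column family is modeled as a `BTreeMap`, and a task's value is reduced to a newline-separated list of diagnostic URLs. The key layouts for Tasks (`namespace|graph|invocation|fn|task_id`) and TaskOutputs (`namespace|task_id|output_id`) follow the storage dumps in this PR; the UnallocatedTasks key layout is an assumption, since that column family is empty in every dump.

```rust
use std::collections::BTreeMap;

/// Stand-in for one RocksDB column family: an ordered byte-key map.
type ColumnFamily = BTreeMap<Vec<u8>, Vec<u8>>;

/// Hypothetical in-memory model of the column families touched by a delete.
#[derive(Default)]
pub struct Store {
    pub compute_graphs: ColumnFamily,
    pub tasks: ColumnFamily,
    pub task_outputs: ColumnFamily,
    pub unallocated_tasks: ColumnFamily,
    pub gc_urls: ColumnFamily,
}

/// Removes every key starting with `prefix` and returns the removed entries.
fn drain_prefix(cf: &mut ColumnFamily, prefix: &[u8]) -> Vec<(Vec<u8>, Vec<u8>)> {
    let keys: Vec<Vec<u8>> = cf
        .range(prefix.to_vec()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .map(|(k, _)| k.clone())
        .collect();
    keys.into_iter()
        .map(|k| {
            let v = cf.remove(&k).expect("key was just listed");
            (k, v)
        })
        .collect()
}

/// Deletes a compute graph together with its tasks, task outputs and
/// unallocated tasks, and queues diagnostic files for garbage collection.
pub fn delete_compute_graph(store: &mut Store, namespace: &str, name: &str) {
    store
        .compute_graphs
        .remove(format!("{namespace}|{name}").as_bytes());

    // The trailing separator keeps "default|g|" from matching "default|g2|...".
    let graph_prefix = format!("{namespace}|{name}|");
    for (key, urls) in drain_prefix(&mut store.tasks, graph_prefix.as_bytes()) {
        // Assumption of this model: the value is a newline-separated URL list.
        for url in String::from_utf8_lossy(&urls).lines() {
            if !url.is_empty() {
                store.gc_urls.insert(url.as_bytes().to_vec(), Vec::new());
            }
        }
        // TaskOutputs keys do not contain the graph name, so they are
        // removed per task, using the task id (last key segment).
        let key = String::from_utf8_lossy(&key).into_owned();
        let task_id = key.rsplit('|').next().unwrap_or_default();
        let output_prefix = format!("{namespace}|{task_id}|");
        drain_prefix(&mut store.task_outputs, output_prefix.as_bytes());
    }
    // Assumed key layout: same graph prefix as Tasks.
    drain_prefix(&mut store.unallocated_tasks, graph_prefix.as_bytes());
}
```

The point the model makes is that a single prefix delete on the graph name is not enough: TaskOutputs are keyed by task id, so the task list has to be read before it is removed.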

This PR does not:

  • Clean up StateChanges related to a deleted compute graph. This cleanup could happen out of band (in a system task), based on the age of each entry.
  • Move Compute Graph deletion into a System Task. While this might be ideal, it will be done separately from this PR. An issue was created to track it.
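
As an illustration of the out-of-band cleanup mentioned in the first point, here is a minimal sketch of an age-based purge. It is not part of this PR: `StateChange` is a reduced stand-in for the stored value, and `purge_old_state_changes`, `now_ms`, and `max_age_ms` are hypothetical names. The key encoding (the id as 8 big-endian bytes) and the millisecond timestamps match what the storage dumps show.

```rust
use std::collections::BTreeMap;

/// Reduced stand-in for a stored state change; fields mirror the dump's JSON.
#[derive(Debug, Clone, PartialEq)]
pub struct StateChange {
    pub id: u64,
    /// Milliseconds since the Unix epoch.
    pub created_at: u64,
    /// `None` while the change is still unprocessed.
    pub processed_at: Option<u64>,
}

/// Key as it appears in the dump: the id as 8 big-endian bytes, so that
/// lexicographic key order equals numeric id order.
pub fn state_change_key(id: u64) -> [u8; 8] {
    id.to_be_bytes()
}

/// Removes processed entries older than `max_age_ms` and returns how many
/// were removed. Unprocessed entries are always kept.
pub fn purge_old_state_changes(
    cf: &mut BTreeMap<[u8; 8], StateChange>,
    now_ms: u64,
    max_age_ms: u64,
) -> usize {
    let before = cf.len();
    cf.retain(|_, sc| match sc.processed_at {
        Some(t) => now_ms.saturating_sub(t) <= max_age_ms,
        None => true,
    });
    before - cf.len()
}
```

Purging by age rather than by compute graph avoids having to match each entry's `change_type` payload against deleted graphs, at the cost of keeping stale entries around until they expire.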

Verification

Verification was performed using this dump script (manually added to server/state_store/src/lib.rs):

        tracing::info!("Listing all keys in the db");
        IndexifyObjectsColumns::iter().for_each(|col| {
            let cf = &col.cf_db(&db);
            let mut num = 0;
            println!("Column Family: {}", col);
            db.iterator_cf(cf, rocksdb::IteratorMode::End)
                .for_each(|r| match r {
                    Ok((k, v)) => {
                        num += 1;
                        let val = match serde_json::from_slice::<serde_json::Value>(v.as_ref()) {
                            Ok(v) => serde_json::to_string_pretty(&v).unwrap(),
                            Err(_) => String::from_utf8(v.to_vec()).unwrap(),
                        };

                        println!(
                            "\tkey: {:?}\n\tValue: {:#?}",
                            String::from_utf8(k.to_vec()).unwrap(),
                            val,
                        )
                    }
                    Err(e) => println!("Error reading entry: {}", e),
                });
            println!("\tLen = {num}");
        });

Storage at initial server startup

All column families are empty.

Storage dump
2024-10-29T13:39:00.354562Z  INFO state_store: Listing all keys in the db
Column Family: StateMachineMetadata
	Len = 0
Column Family: Executors
	Len = 0
Column Family: Namespaces
	Len = 0
Column Family: ComputeGraphs
	Len = 0
Column Family: Tasks
	Len = 0
Column Family: GraphInvocationCtx
	Len = 0
Column Family: ReductionTasks
	Len = 0
Column Family: GraphInvocations
	Len = 0
Column Family: FnOutputs
	Len = 0
Column Family: TaskOutputs
	Len = 0
Column Family: StateChanges
	Len = 0
Column Family: UnprocessedStateChanges
	Len = 0
Column Family: TaskAllocations
	Len = 0
Column Family: UnallocatedTasks
	Len = 0
Column Family: GcUrls
	Len = 0
Column Family: SystemTasks
	Len = 0
Column Family: Stats
	Len = 0

Storage after second server startup without any workloads

  • All compute graphs related column families are empty.
  • default namespace automatically created.
  • 1 executor
  • 1 StateChange
Storage dump
2024-10-29T13:40:23.864436Z INFO state_store: Listing all keys in the db
Column Family: StateMachineMetadata
	Len = 0
Column Family: Executors
	key: "FVeH-HtHccKoRPLJicFq1"
	Value: "{\n \"addr\": \"\",\n \"executor_version\": \"0.2.19\",\n \"id\": \"FVeH-HtHccKoRPLJicFq1\",\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"labels\": {\n \"architecture\": \"arm64\",\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"os\": \"Darwin\",\n \"python_major_version\": 3,\n \"python_minor_version\": 11\n }\n}"
	Len = 1
Column Family: Namespaces
	key: "default"
	Value: "{\n \"created_at\": 1730209140367,\n \"name\": \"default\"\n}"
	Len = 1
Column Family: ComputeGraphs
	Len = 0
Column Family: Tasks
	Len = 0
Column Family: GraphInvocationCtx
	Len = 0
Column Family: ReductionTasks
	Len = 0
Column Family: GraphInvocations
	Len = 0
Column Family: FnOutputs
	Len = 0
Column Family: TaskOutputs
	Len = 0
Column Family: StateChanges
	key: "\0\0\0\0\0\0\0\0"
	Value: "{\n \"change_type\": \"ExecutorAdded\",\n \"created_at\": 1730209145257,\n \"id\": 0,\n \"object_id\": \"FVeH-HtHccKoRPLJicFq1\",\n \"processed_at\": 1730209145260\n}"
	Len = 1
Column Family: UnprocessedStateChanges
	Len = 0
Column Family: TaskAllocations
	Len = 0
Column Family: UnallocatedTasks
	Len = 0
Column Family: GcUrls
	Len = 0
Column Family: SystemTasks
	Len = 0
Column Family: Stats
	Len = 0

Storage after a compute graph run

  • 1 Compute graph
  • 1 Task
  • 1 GraphInvocationCtx
  • 1 FnOutputs
  • 1 TaskOutputs
Storage dump
2024-10-29T13:43:08.164077Z INFO state_store: Listing all keys in the db
Column Family: StateMachineMetadata
	Len = 0
Column Family: Executors
	key: "4qcYQMcXl5HuSvh3NgErV"
	Value: "{\n \"addr\": \"\",\n \"executor_version\": \"0.2.19\",\n \"id\": \"4qcYQMcXl5HuSvh3NgErV\",\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"labels\": {\n \"architecture\": \"arm64\",\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"os\": \"Darwin\",\n \"python_major_version\": 3,\n \"python_minor_version\": 11\n }\n}"
	Len = 1
Column Family: Namespaces
	key: "default"
	Value: "{\n \"created_at\": 1730209140367,\n \"name\": \"default\"\n}"
	Len = 1
Column Family: ComputeGraphs
	key: "default|object_detection_workflow"
	Value: "{\n \"code\": {\n \"path\": \"file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default_aGF4gRN2uBKH5_5ZfZSFy\",\n \"sha256_hash\": \"0a67e5c6a6814a05b2adfa9bb1013f8d0319f792a3a80ad5d2d4315d236bce1f\",\n \"size\": 13735\n },\n \"created_at\": 0,\n \"description\": \"\",\n \"edges\": {},\n \"name\": \"object_detection_workflow\",\n \"namespace\": \"default\",\n \"nodes\": {\n \"object_detector\": {\n \"Compute\": {\n \"description\": \"\",\n \"fn_name\": \"object_detector\",\n \"image_information\": {\n \"base_image\": \"python:3.10.15-slim-bookworm\",\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"run_strs\": [\n \"pip install indexify\"\n ],\n \"tag\": \"3.10\"\n },\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"name\": \"object_detector\",\n \"payload_encoder\": \"cloudpickle\",\n \"placement_constraints\": [],\n \"reducer\": false\n }\n }\n },\n \"runtime_information\": {\n \"major_version\": 3,\n \"minor_version\": 11\n },\n \"start_fn\": {\n \"Compute\": {\n \"description\": \"\",\n \"fn_name\": \"object_detector\",\n \"image_information\": {\n \"base_image\": \"python:3.10.15-slim-bookworm\",\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"run_strs\": [\n \"pip install indexify\"\n ],\n \"tag\": \"3.10\"\n },\n \"image_name\": \"tensorlake/indexify-executor-default\",\n \"name\": \"object_detector\",\n \"payload_encoder\": \"cloudpickle\",\n \"placement_constraints\": [],\n \"reducer\": false\n }\n },\n \"version\": 1\n}"
	Len = 1
Column Family: Tasks
	key: "default|object_detection_workflow|17eacf2d6ba0bf36|object_detector|47b6599c-8824-4bb1-a9a3-a847645a1856"
	Value: "{\n \"compute_fn_name\": \"object_detector\",\n \"compute_graph_name\": \"object_detection_workflow\",\n \"creation_time\": {\n \"nanos_since_epoch\": 950807000,\n \"secs_since_epoch\": 1730209377\n },\n \"diagnostics\": {\n \"exception\": null,\n \"stderr\": {\n \"path\": \"file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default.object_detection_workflow.object_detector.17eacf2d6ba0bf36.stderr\",\n \"sha256_hash\": \"d7921c2d8ea1fb4e5afc08f1aa7249f6d4325fc9311f061fbc61777d4c5c4c4d\",\n \"size\": 27\n },\n \"stdout\": {\n \"path\": \"file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default.object_detection_workflow.object_detector.17eacf2d6ba0bf36.stdout\",\n \"sha256_hash\": \"7e942ba422f0d4a4af6ecedec681b0bd48294bdb5ffeda0cac788eb1d1ea8c9b\",\n \"size\": 183\n }\n },\n \"graph_version\": 1,\n \"id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\",\n \"input_node_output_key\": \"17eacf2d6ba0bf36\",\n \"invocation_id\": \"17eacf2d6ba0bf36\",\n \"namespace\": \"default\",\n \"outcome\": \"Success\",\n \"reducer_output_id\": null\n}"
	Len = 1
Column Family: GraphInvocationCtx
	key: "default|object_detection_workflow|17eacf2d6ba0bf36"
	Value: "{\n \"completed\": true,\n \"compute_graph_name\": \"object_detection_workflow\",\n \"fn_task_analytics\": {\n \"object_detector\": {\n \"failed_tasks\": 0,\n \"pending_tasks\": 0,\n \"successful_tasks\": 1\n }\n },\n \"graph_version\": 1,\n \"invocation_id\": \"17eacf2d6ba0bf36\",\n \"is_system_task\": false,\n \"namespace\": \"default\",\n \"outstanding_tasks\": 0\n}"
	Len = 1
Column Family: ReductionTasks
	Len = 0
Column Family: GraphInvocations
	key: "default|object_detection_workflow|17eacf2d6ba0bf36"
	Value: "{\n \"compute_graph_name\": \"object_detection_workflow\",\n \"id\": \"17eacf2d6ba0bf36\",\n \"namespace\": \"default\",\n \"payload\": {\n \"path\": \"file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/01e9c656-eb5c-4f14-a6f9-b9d1661aaaca\",\n \"sha256_hash\": \"65ea2ac07eefe9812ba95e962dcffcae10b9bf78a82309650d2884a643027d1f\",\n \"size\": 296293\n }\n}"
	Len = 1
Column Family: FnOutputs
	key: "default|object_detection_workflow|17eacf2d6ba0bf36|object_detector|d6e77e36553267be"
	Value: "{\n \"compute_fn_name\": \"object_detector\",\n \"compute_graph_name\": \"object_detection_workflow\",\n \"errors\": null,\n \"graph_version\": 1,\n \"id\": \"d6e77e36553267be\",\n \"invocation_id\": \"17eacf2d6ba0bf36\",\n \"namespace\": \"default\",\n \"payload\": {\n \"Fn\": {\n \"path\": \"file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default.object_detection_workflow.object_detector.17eacf2d6ba0bf36.47b6599c-8824-4bb1-a9a3-a847645a1856.0\",\n \"sha256_hash\": \"85fcd79e6de768b203eea80693f9f59943b4fe19f87eb41e5db8fe1ca678f072\",\n \"size\": 890871\n }\n },\n \"reduced_state\": false\n}"
	Len = 1
Column Family: TaskOutputs
	key: "default|47b6599c-8824-4bb1-a9a3-a847645a1856|d6e77e36553267be"
	Value: "\"default|object_detection_workflow|17eacf2d6ba0bf36|object_detector|d6e77e36553267be\""
	Len = 1
Column Family: StateChanges
	key: "\0\0\0\0\0\0\0\u{4}"
	Value: "{\n \"change_type\": {\n \"TaskFinished\": {\n \"compute_fn\": \"object_detector\",\n \"compute_graph\": \"object_detection_workflow\",\n \"invocation_id\": \"17eacf2d6ba0bf36\",\n \"namespace\": \"default\",\n \"task_id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\"\n }\n },\n \"created_at\": 1730209381002,\n \"id\": 4,\n \"object_id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\",\n \"processed_at\": 1730209381003\n}"
	key: "\0\0\0\0\0\0\0\u{3}"
	Value: "{\n \"change_type\": \"TaskCreated\",\n \"created_at\": 1730209377950,\n \"id\": 3,\n \"object_id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\",\n \"processed_at\": 1730209377951\n}"
	key: "\0\0\0\0\0\0\0\u{2}"
	Value: "{\n \"change_type\": {\n \"InvokeComputeGraph\": {\n \"compute_graph\": \"object_detection_workflow\",\n \"invocation_id\": \"17eacf2d6ba0bf36\",\n \"namespace\": \"default\"\n }\n },\n \"created_at\": 1730209377950,\n \"id\": 2,\n \"object_id\": \"17eacf2d6ba0bf36\",\n \"processed_at\": 1730209377950\n}"
	key: "\0\0\0\0\0\0\0\u{1}"
	Value: "{\n \"change_type\": \"ExecutorRemoved\",\n \"created_at\": 1730209228876,\n \"id\": 1,\n \"object_id\": \"FVeH-HtHccKoRPLJicFq1\",\n \"processed_at\": 1730209228876\n}"
	key: "\0\0\0\0\0\0\0\0"
	Value: "{\n \"change_type\": \"ExecutorAdded\",\n \"created_at\": 1730209228738,\n \"id\": 0,\n \"object_id\": \"4qcYQMcXl5HuSvh3NgErV\",\n \"processed_at\": 1730209228741\n}"
	Len = 5
Column Family: UnprocessedStateChanges
	Len = 0
Column Family: TaskAllocations
	Len = 0
Column Family: UnallocatedTasks
	Len = 0
Column Family: GcUrls
	Len = 0
Column Family: SystemTasks
	Len = 0
Column Family: Stats
	Len = 0

Storage after compute graph deletion

  • 0 Compute graphs
  • 0 Tasks
  • 0 GraphInvocationCtx
  • 0 FnOutputs
  • 0 TaskOutputs
  • 3 files deleted
API and Storage dump

Files deleted by gc:

2024-10-29T13:54:16.117432Z DEBUG indexify_server::gc: Deleting url "file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default.object_detection_workflow.object_detector.17eacf2d6ba0bf36.47b6599c-8824-4bb1-a9a3-a847645a1856.0"
2024-10-29T13:54:16.120131Z DEBUG indexify_server::gc: Deleting url "file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default.object_detection_workflow.object_detector.17eacf2d6ba0bf36.stderr"
2024-10-29T13:54:16.120785Z DEBUG indexify_server::gc: Deleting url "file:///Users/seriousben/src/github.com/seriousben/indexify-detect-image-objects/indexify_storage/blobs/default.object_detection_workflow.object_detector.17eacf2d6ba0bf36.stdout"
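
The log lines above, together with the empty GcUrls column family in the dump that follows, suggest a queue that is drained as files are removed. The sketch below shows one way such a drain loop could look. It is an assumption, not the server's gc implementation: `run_gc` and the `delete_blob` callback are hypothetical, and the queue is modeled as a `BTreeSet` of URLs.

```rust
use std::collections::BTreeSet;

/// Drains a queue of blob URLs. A URL leaves the queue only after its blob
/// was deleted successfully, so failed deletions are retried on the next run.
/// Returns the URLs that were deleted, in queue order.
pub fn run_gc<F>(gc_urls: &mut BTreeSet<String>, mut delete_blob: F) -> Vec<String>
where
    F: FnMut(&str) -> Result<(), String>,
{
    let mut deleted = Vec::new();
    // Snapshot the queue so it can be mutated while walking it.
    let pending: Vec<String> = gc_urls.iter().cloned().collect();
    for url in pending {
        if delete_blob(&url).is_ok() {
            gc_urls.remove(&url);
            deleted.push(url);
        }
    }
    deleted
}
```

Removing the queue entry only after the blob is gone means a crash between the two steps leaves a retryable entry rather than an orphaned file.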

Storage:

2024-10-29T13:54:24.100036Z  INFO state_store: Listing all keys in the db
Column Family: StateMachineMetadata
	Len = 0
Column Family: Executors
	key: "J8zDo9z-ZPsBKM6jrAqBi"
	Value: "{\n  \"addr\": \"\",\n  \"executor_version\": \"0.2.19\",\n  \"id\": \"J8zDo9z-ZPsBKM6jrAqBi\",\n  \"image_name\": \"tensorlake/indexify-executor-default\",\n  \"labels\": {\n    \"architecture\": \"arm64\",\n    \"image_name\": \"tensorlake/indexify-executor-default\",\n    \"os\": \"Darwin\",\n    \"python_major_version\": 3,\n    \"python_minor_version\": 11\n  }\n}"
	Len = 1
Column Family: Namespaces
	key: "default"
	Value: "{\n  \"created_at\": 1730209140367,\n  \"name\": \"default\"\n}"
	Len = 1
Column Family: ComputeGraphs
	Len = 0
Column Family: Tasks
	Len = 0
Column Family: GraphInvocationCtx
	Len = 0
Column Family: ReductionTasks
	Len = 0
Column Family: GraphInvocations
	Len = 0
Column Family: FnOutputs
	Len = 0
Column Family: TaskOutputs
	Len = 0
Column Family: StateChanges
	key: "\0\0\0\0\0\0\0\u{4}"
	Value: "{\n  \"change_type\": {\n    \"TaskFinished\": {\n      \"compute_fn\": \"object_detector\",\n      \"compute_graph\": \"object_detection_workflow\",\n      \"invocation_id\": \"17eacf2d6ba0bf36\",\n      \"namespace\": \"default\",\n      \"task_id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\"\n    }\n  },\n  \"created_at\": 1730209381002,\n  \"id\": 4,\n  \"object_id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\",\n  \"processed_at\": 1730209381003\n}"
	key: "\0\0\0\0\0\0\0\u{3}"
	Value: "{\n  \"change_type\": \"TaskCreated\",\n  \"created_at\": 1730209377950,\n  \"id\": 3,\n  \"object_id\": \"47b6599c-8824-4bb1-a9a3-a847645a1856\",\n  \"processed_at\": 1730209377951\n}"
	key: "\0\0\0\0\0\0\0\u{2}"
	Value: "{\n  \"change_type\": {\n    \"InvokeComputeGraph\": {\n      \"compute_graph\": \"object_detection_workflow\",\n      \"invocation_id\": \"17eacf2d6ba0bf36\",\n      \"namespace\": \"default\"\n    }\n  },\n  \"created_at\": 1730209377950,\n  \"id\": 2,\n  \"object_id\": \"17eacf2d6ba0bf36\",\n  \"processed_at\": 1730209377950\n}"
	key: "\0\0\0\0\0\0\0\u{1}"
	Value: "{\n  \"change_type\": \"ExecutorRemoved\",\n  \"created_at\": 1730209393166,\n  \"id\": 1,\n  \"object_id\": \"4qcYQMcXl5HuSvh3NgErV\",\n  \"processed_at\": 1730209393167\n}"
	key: "\0\0\0\0\0\0\0\0"
	Value: "{\n  \"change_type\": \"ExecutorAdded\",\n  \"created_at\": 1730209393092,\n  \"id\": 0,\n  \"object_id\": \"J8zDo9z-ZPsBKM6jrAqBi\",\n  \"processed_at\": 1730209393092\n}"
	Len = 5
Column Family: UnprocessedStateChanges
	Len = 0
Column Family: TaskAllocations
	Len = 0
Column Family: UnallocatedTasks
	Len = 0
Column Family: GcUrls
	Len = 0
Column Family: SystemTasks
	Len = 0
Column Family: Stats
	Len = 0

Testing

  • Run make fmt.
  • Run pip install -e ., start the server and executor, cd to python-sdk/tests, then run python test_graph_behaviours.py.
       .
    ----------------------------------------------------------------------
    Ran 7 tests in 1.889s
    
    OK
    

Collaborator

@diptanu diptanu left a comment

Looks good overall, have some questions.

}
None => {}
}
txn.delete_cf(&IndexifyObjectsColumns::Tasks.cf_db(&db), &key)?;
Collaborator

I imagine this needs to be a SystemTask as well. Maybe add a TODO/FIXME here if we want to do this later.

Member Author

I can add a comment, but it is already tracked here: #986

I would make this whole Delete a SystemTask. Let me know if the issue makes sense, I wrote it based on your explanation last week.

)?;
}

delete_cf_prefix(
Collaborator

Does this use an API like this under the hood or are we iterating? https://docs.rs/rocksdb/0.22.0/rocksdb/type.DBWithThreadMode.html#method.delete_range_cf

Member Author

We are iterating right now

        let iter = txn.iterator_cf_opt(cf, read_options, iterator_mode);
        for key in iter {
            let (key, _) = key?;
            // Keys are sorted, so the first non-matching key ends the prefix run.
            if !key.starts_with(prefix) {
                break;
            }
            txn.delete_cf(cf, &key)?;
        }
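
To make the trade-off in this thread concrete, here is a small sketch comparing the two approaches on an ordered in-memory map. It does not call RocksDB: `BTreeMap` stands in for a column family, and `prefix_upper_bound`, `delete_prefix_by_iteration`, and `delete_prefix_by_range` are hypothetical helpers. A range delete takes a start key and an exclusive end key, so deleting by prefix that way first requires computing the smallest key that sorts after every key with that prefix. Whether a range delete can be used inside the transaction here is not settled by this thread.

```rust
use std::collections::BTreeMap;

/// Smallest byte string that sorts after every key starting with `prefix`.
/// Returns `None` when no such bound exists (empty prefix or all 0xFF bytes).
pub fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    // Drop trailing 0xFF bytes, then increment the last remaining byte.
    while let Some(last) = end.pop() {
        if last < u8::MAX {
            end.push(last + 1);
            return Some(end);
        }
    }
    None
}

/// Iterating approach: seek to the prefix, delete until it stops matching.
pub fn delete_prefix_by_iteration(cf: &mut BTreeMap<Vec<u8>, Vec<u8>>, prefix: &[u8]) -> usize {
    let keys: Vec<Vec<u8>> = cf
        .range(prefix.to_vec()..)
        .map(|(k, _)| k)
        .take_while(|k| k.starts_with(prefix))
        .cloned()
        .collect();
    for k in &keys {
        cf.remove(k);
    }
    keys.len()
}

/// Range approach: delete everything in [prefix, upper_bound).
pub fn delete_prefix_by_range(cf: &mut BTreeMap<Vec<u8>, Vec<u8>>, prefix: &[u8]) -> usize {
    let before = cf.len();
    match prefix_upper_bound(prefix) {
        Some(end) => {
            cf.retain(|k, _| !(k.as_slice() >= prefix && k.as_slice() < end.as_slice()))
        }
        // No upper bound: every key at or after the prefix is in range.
        None => cf.retain(|k, _| k.as_slice() < prefix),
    }
    before - cf.len()
}
```

Both functions remove the same keys. The difference in a real store is cost: iteration issues one delete per key, while a range delete is a single operation once the end key is known.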

@seriousben seriousben merged commit 1b5ce58 into main Oct 29, 2024
3 checks passed
@seriousben seriousben deleted the seriousben/correctly-delete-all-compute-graphs-dependencies branch October 29, 2024 18:04