Add watch_index method and ark-cli watch command #36

Merged
merged 3 commits into from
Nov 20, 2024

tareknaser
Collaborator

Description

This pull request adds a new method, watch_index, to the fs-index crate. It monitors file system changes and automatically updates the index.
Additionally, it adds a new command to ark-cli to make this functionality accessible to users.
This change is the first step toward addressing issue #21.

Testing

An example of the new method's usage is in the fs-index crate at fs-index/examples/index_watch.rs.
To run the example, run the following command:

cargo run --example index_watch

This command monitors the index at the test-assets/ directory and automatically updates it upon any file system changes.

Benchmark for 341c426

Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.6±0.51µs 13.3±0.09µs -2.21%
../test-assets/test.pdf/compute_bytes 139.0±2.61µs 107.6±0.80µs -22.59%
compute_bytes_large/compute_bytes 467.9±9.08µs 139.9±1.85µs -70.10%
compute_bytes_medium/compute_bytes 26.8±0.25µs 27.7±0.79µs +3.36%
compute_bytes_small/compute_bytes 127.2±1.07ns 128.0±6.04ns +0.63%
index_build/index_build/../test-assets/ 161.3±5.81µs 160.5±1.53µs -0.50%

@kirillt
Member

kirillt commented May 11, 2024

It's a good PR, and it seems pretty straightforward to complete, but I'm afraid that merging it before the other index refactorings could make porting ARK-Builders/arklib#72 too difficult: we'd need to add one more function to the index while also checking diffs during the port of ARK-Builders/arklib#72.

Benchmark for 0332bd7

Test Base PR %
../test-assets/lena.jpg/compute_bytes 13.3±0.16µs 13.3±0.08µs 0.00%
../test-assets/test.pdf/compute_bytes 109.8±2.17µs 111.6±0.55µs +1.64%
compute_bytes_large/compute_bytes 471.0±0.78µs 139.5±3.27µs -70.38%
compute_bytes_medium/compute_bytes 30.9±0.20µs 27.7±0.21µs -10.36%
compute_bytes_small/compute_bytes 127.8±1.66ns 128.2±3.58ns +0.31%
index_build/index_build/../test-assets/ 163.0±1.45µs 161.0±0.53µs -1.23%

@tareknaser tareknaser marked this pull request as draft May 12, 2024 18:25
@tareknaser tareknaser mentioned this pull request Sep 1, 2024
@tareknaser
Collaborator Author

Updated the watch API to call ResourceIndex::update_one() for files created, removed, or modified based on streams from notify events.

@tareknaser tareknaser marked this pull request as ready for review September 9, 2024 09:23
github-actions bot commented Sep 9, 2024

Benchmark for dc67cc3

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.7±1.19µs 247.7±0.87µs -0.80%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.06µs 15.6±0.08µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1350.7±8.76ns 1358.8±6.23ns +0.60%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.2±0.54µs 197.0±0.67µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1757.1±7.55µs 1763.0±13.37µs +0.34%
crc32_resource_id_creation/compute_from_bytes:large 86.7±0.24µs 86.8±0.34µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.03µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.55ns 92.4±0.33ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.8±0.29µs 64.8±0.86µs 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 946.9±3.53µs 949.6±4.11µs +0.29%
resource_index/index_build//tmp/ark-fs-index-benchmarks94k72W 106.6±3.35ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksYCHMXF 105.0±2.17ms N/A N/A
resource_index/index_get_resource_by_id 97.1±0.25ns 99.2±0.37ns +2.16%
resource_index/index_get_resource_by_path 52.8±0.26ns 55.1±0.33ns +4.36%
resource_index/index_update_all 1135.9±41.95ms 1137.9±59.79ms +0.18%
resource_index/index_update_one 684.1±33.46ms 693.3±33.36ms +1.34%

@tareknaser
Collaborator Author

There appear to be some unexpected events coming from the notify stream. For example, I've identified a potential flaw with the following steps:

  1. Run the watch API on a folder.
  2. Copy a file multiple times (e.g., file copy.txt, file copy 2.txt).
  3. Up until this point, the index updates correctly.
  4. Delete both files simultaneously (select and delete them together).
    This results in a panic in ResourceIndex::update_one().

This situation requires further investigation. Additionally, we need to test this scenario alongside other ResourceIndex tests to be implemented for #88.

@kirillt
Member

kirillt commented Sep 9, 2024

@tareknaser does only simultaneous deletion cause problems? Does simultaneous addition work fine?

github-actions bot commented Sep 9, 2024

Benchmark for b8ed0bd

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 250.8±0.85µs 249.0±1.70µs -0.72%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.03µs 15.6±0.04µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1357.4±3.61ns 1363.3±8.30ns +0.43%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.8±2.52µs 197.6±0.65µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1762.4±4.96µs 1768.8±36.65µs +0.36%
crc32_resource_id_creation/compute_from_bytes:large 86.9±0.69µs 86.9±0.42µs 0.00%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.01µs 5.4±0.02µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.70ns 92.7±1.67ns +0.32%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.5±0.27µs 64.9±1.47µs +0.62%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 945.7±4.82µs 946.3±5.29µs +0.06%
resource_index/index_build//tmp/ark-fs-index-benchmarks61KWbS 106.6±1.98ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksLAAoc8 111.8±0.74ms N/A N/A
resource_index/index_get_resource_by_id 97.1±0.37ns 96.7±0.50ns -0.41%
resource_index/index_get_resource_by_path 52.6±0.24ns 52.7±0.25ns +0.19%
resource_index/index_update_all 1089.8±34.10ms 1115.0±32.55ms +2.31%
resource_index/index_update_one 669.3±24.95ms 660.4±22.62ms -1.33%

@tareknaser
Collaborator Author

does only simultaneous deletion cause problems? Does simultaneous addition work fine?

Yes and yes.
Even simultaneous deletion works fine in some cases, but I was able to reproduce the error more than once.

@kirillt
Member

kirillt commented Sep 9, 2024

README should be updated to explicitly state in which folder this command should be run:

cargo run --example resource_index

If ark-cli can handle a similar scenario, it should be mentioned in the README, too.

github-actions bot commented Sep 9, 2024

Benchmark for 4fe7076

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.4±1.85µs 250.4±3.74µs +0.81%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.15µs 16.9±0.17µs +8.33%
blake3_resource_id_creation/compute_from_bytes:small 1360.0±4.45ns 1356.6±5.93ns -0.25%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.4±1.32µs 197.7±1.08µs +0.15%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1757.9±10.11µs 1769.9±21.12µs +0.68%
crc32_resource_id_creation/compute_from_bytes:large 87.0±0.89µs 86.8±0.62µs -0.23%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.06µs 5.4±0.09µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.6±1.26ns 92.6±1.39ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 65.0±0.47µs 65.0±0.57µs 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 953.0±24.12µs 967.2±2.47µs +1.49%
resource_index/index_build//tmp/ark-fs-index-benchmarks0HA7fz 106.8±2.43ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksf1Yq2s 112.6±1.21ms N/A N/A
resource_index/index_get_resource_by_id 97.4±0.67ns 94.9±1.11ns -2.57%
resource_index/index_get_resource_by_path 52.9±0.60ns 50.2±0.26ns -5.10%
resource_index/index_update_all 1091.9±36.89ms 1118.4±43.34ms +2.43%
resource_index/index_update_one 653.8±22.83ms 668.3±19.57ms +2.22%

@tareknaser
Collaborator Author

README should be updated to explicitly state in which folder this command should be run:

I added a note on how to run the example and mentioned that more can be done with ark-cli watch.

Benchmark for 1bbb897

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.6±1.44µs 249.2±1.63µs -0.16%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.06µs 15.5±0.06µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1362.9±7.31ns 1357.6±7.05ns -0.39%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.5±0.41µs 197.5±0.85µs 0.00%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1760.6±8.18µs 1770.1±29.87µs +0.54%
crc32_resource_id_creation/compute_from_bytes:large 86.6±0.29µs 86.8±0.19µs +0.23%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.03µs 5.4±0.06µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.4±0.54ns 92.3±0.30ns -0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.5±0.30µs 64.9±0.52µs +0.62%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 947.4±4.87µs 952.9±3.74µs +0.58%
resource_index/index_build//tmp/ark-fs-index-benchmarksdlp6ac 117.7±2.70ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksiBuvQD 114.6±2.21ms N/A N/A
resource_index/index_get_resource_by_id 96.8±0.37ns 98.4±0.45ns +1.65%
resource_index/index_get_resource_by_path 52.6±0.15ns 54.4±0.39ns +3.42%
resource_index/index_update_all 1134.7±54.53ms 1169.1±51.93ms +3.03%
resource_index/index_update_one 688.0±29.08ms 701.6±31.71ms +1.98%

fs-index/src/watch.rs

let relative_path = file.strip_prefix(&root_path)?;
log::info!("Relative path: {:?}", relative_path);
index.update_one(relative_path)?;
Member

The result should be used to provide the user with the actual updates.

Member

So, we have 2 approaches to choose from:

  1. Make update_one return the same IndexUpdate type as update_all (simple). Then watch_index would return the same type to the user. We could batch updates made within some interval to pack events together, but that's optional (if you find this idea useful, we can create a follow-up task).
  2. Alternatively, we could specialize update_one, so the Track API and Watch API would become more powerful compared to the Reactive API. By extra power I mean more fine-grained events, similar to what notify-rs provides: not only add/remove, but also rename/modify. This is more difficult, though, and we would need to unify results from update_all and update_one before returning from watch_index. I suggest creating a follow-up issue for future consideration of this approach.
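
To make the first approach concrete, here is a minimal sketch of one update type shared by update_one and update_all, with an optional merge step for batching. It is an illustration only: the struct is a simplified stand-in for the crate's IndexUpdate, ids are plain u64 values, and merge is a hypothetical helper that does not appear in this PR.

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Simplified stand-in for `IndexUpdate` (assumption: ids are plain `u64`
/// and paths carry no timestamp).
#[derive(Debug, Default, Clone, PartialEq)]
pub struct IndexUpdate {
    pub added: HashMap<u64, HashSet<PathBuf>>,
    pub removed: HashSet<u64>,
}

impl IndexUpdate {
    /// Hypothetical batching helper: folds a later update into this one.
    pub fn merge(&mut self, later: IndexUpdate) {
        for id in later.removed {
            // An id that was added earlier in the same batch and removed
            // later nets out to nothing, so drop the pending addition.
            if self.added.remove(&id).is_none() {
                self.removed.insert(id);
            }
        }
        for (id, paths) in later.added {
            // An id can end up in both `removed` and `added` if it was
            // removed first and then added again; both facts are kept.
            self.added.entry(id).or_default().extend(paths);
        }
    }
}
```

A watcher could call merge on every single-file update and hand the accumulated value to the user once per interval.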

Collaborator Author

I chose the first approach to get things running faster and keep it simpler for now. I also plan to add tests soon.

Next, I want to add integration tests for this functionality. We could either add integration tests for fs-index directly or implement CI shell scripts to test ark-cli watch <PATH> for an end-to-end approach—possibly both?

Do you think CI shell scripts for ark-cli watch would be sufficient, or should we also include programmatic tests? For example, running the watcher in a separate thread and doing many create/delete operations to verify the results. Now that I think about it, writing these tests programmatically could be complex. What’s your take?

Collaborator Author

I suggest creating a follow-up issue for future consideration of this approach.

tracked in #89

Member

Do you think CI shell scripts for ark-cli watch would be sufficient, or should we also include programmatic tests? For example, running the watcher in a separate thread and doing many create/delete operations to verify the results. Now that I think about it, writing these tests programmatically could be complex. What’s your take?

Agree, I think we can achieve a proper result with a simple shell script. I imagine it like this:

  1. Run ark-cli watch in background, direct its output to dedicated log file.
  2. Randomly modify folder content and write the performed actions into another log file.
  3. Then compare the two log files.
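
The comparison in step 3 could look roughly like the sketch below. The PR implements the check as a shell script, so this Rust version only illustrates the idea; the function name first_mismatch and the log format (one action per line, blank lines ignored) are assumptions.

```rust
/// Compares two logs line by line, ignoring blank lines and surrounding
/// whitespace. Returns the 1-based position of the first entry that
/// differs, or `None` when both logs record the same sequence.
pub fn first_mismatch(expected: &str, observed: &str) -> Option<usize> {
    let mut exp = expected.lines().map(str::trim).filter(|l| !l.is_empty());
    let mut obs = observed.lines().map(str::trim).filter(|l| !l.is_empty());
    let mut line_no = 0;
    loop {
        line_no += 1;
        match (exp.next(), obs.next()) {
            // Both logs ended at the same time: they match.
            (None, None) => return None,
            // Same entry on both sides: keep going.
            (a, b) if a == b => continue,
            // Different entries, or one log is shorter than the other.
            _ => return Some(line_no),
        }
    }
}
```

An ordered comparison like this only works if the watcher reports events in the order the actions were performed.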

Collaborator Author

Added a shell script integration/ark-cli-watch.sh to check the sanity of ark-cli watch. I think it’s pretty cool because it also checks other parts of the code for sanity along the way, like update_one in this case. Using ark-cli is definitely a great way to write end-to-end shell scripts to verify what we have.

I’ve also set it up to run in the CI with each push/PR to the main branch. You can check out the expected workflow in my fork here. I’ve been using it for debugging

@tareknaser
Collaborator Author

I spent some time today looking into different ways to use notify by going through the docs and examples. Right now, we're using async_monitor, but it might not be the best choice for us.

Order is very important in our case because we need events to happen in the right order (for example, we don’t want to see "file1.txt removed" before "file1.txt created", since this would mess up our update_one() logic). Using an asynchronous watcher could cause issues with keeping the events in order.

Btw, I think this might be why we saw this error. As I reported, the error wasn’t consistent, which could be because asynchronous events don't always happen in the right order.

While looking through the examples, I also noticed that we might want to use the Debouncer. The file system can sometimes send multiple events for what is really just one change, which could cause problems. For example, it might trigger update_one() several times when a file is created.

I'm now testing this in a smaller example and looking at how to set up the event stream properly.
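
As a rough illustration of the debouncing idea, the sketch below drops repeated events for the same path and kind that arrive within a short window, while keeping the remaining events in arrival order. It is not notify's Debouncer: RawEvent, Kind, and debounce are hypothetical names, and timestamps are simply durations since the watcher started.

```rust
use std::collections::HashMap;
use std::path::PathBuf;
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Kind {
    Created,
    Modified,
    Removed,
}

/// Hypothetical raw event; `at` is the time since the watcher started.
#[derive(Debug, Clone, PartialEq)]
pub struct RawEvent {
    pub at: Duration,
    pub path: PathBuf,
    pub kind: Kind,
}

/// Keeps events in arrival order and drops an event when the same
/// (path, kind) pair was already seen less than `window` earlier.
pub fn debounce(events: &[RawEvent], window: Duration) -> Vec<RawEvent> {
    let mut last_seen: HashMap<(PathBuf, Kind), Duration> = HashMap::new();
    let mut out = Vec::new();
    for ev in events {
        let key = (ev.path.clone(), ev.kind);
        let is_repeat = match last_seen.get(&key) {
            Some(prev) => ev.at.saturating_sub(*prev) < window,
            None => false,
        };
        // Refresh the timestamp even for dropped events, so a continuous
        // burst stays suppressed until it pauses for at least `window`.
        last_seen.insert(key, ev.at);
        if !is_repeat {
            out.push(ev.clone());
        }
    }
    out
}
```

With a filter like this, several Modified events emitted while one file is being written would lead to a single update_one() call.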

Benchmark for e4987f5

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.1±0.61µs 248.5±2.46µs -0.24%
blake3_resource_id_creation/compute_from_bytes:medium 15.8±1.14µs 15.6±0.12µs -1.27%
blake3_resource_id_creation/compute_from_bytes:small 1364.5±2.05ns 1365.9±1.64ns +0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.3±3.99µs 197.1±3.13µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1699.3±3.73µs 1718.3±22.19µs +1.12%
crc32_resource_id_creation/compute_from_bytes:large 86.7±1.35µs 86.8±1.13µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.13µs 5.4±0.01µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.19ns 92.3±0.30ns 0.00%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.1±0.32µs 64.2±0.60µs +0.16%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 953.1±88.59µs 933.0±5.78µs -2.11%
resource_index/index_build//tmp/ark-fs-index-benchmarkst4cPto 109.9±1.24ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarkszKpEve 105.9±3.20ms N/A N/A
resource_index/index_get_resource_by_id 99.6±0.38ns 95.0±2.01ns -4.62%
resource_index/index_get_resource_by_path 55.6±2.38ns 50.6±0.75ns -8.99%
resource_index/index_update_all 1117.9±32.54ms 1125.5±52.37ms +0.68%
resource_index/index_update_one 666.8±18.20ms 667.6±28.00ms +0.12%

Benchmark for a95e06a

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 251.3±1.17µs 249.3±0.46µs -0.80%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.06µs 15.6±0.07µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1362.8±6.87ns 1355.7±8.95ns -0.52%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.8±0.35µs 196.8±0.52µs 0.00%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1705.4±4.30µs 1703.1±20.62µs -0.13%
crc32_resource_id_creation/compute_from_bytes:large 86.8±0.38µs 86.7±0.15µs -0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.03µs 5.4±0.05µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.18ns 92.4±0.48ns +0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.5±0.37µs 64.7±1.40µs +0.31%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 939.6±4.55µs 941.8±11.76µs +0.23%
resource_index/index_build//tmp/ark-fs-index-benchmarks4VHJsr 109.1±2.05ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksjA6AY8 108.2±2.18ms N/A N/A
resource_index/index_get_resource_by_id 100.5±0.31ns 94.4±0.22ns -6.07%
resource_index/index_get_resource_by_path 53.8±0.30ns 50.4±0.18ns -6.32%
resource_index/index_update_all 1114.8±29.19ms 1118.7±41.25ms +0.35%
resource_index/index_update_one 660.5±20.57ms 665.9±23.63ms +0.82%

@kirillt
Member

kirillt commented Sep 16, 2024

Nitpick, but we can define aliases IndexUpdate::addition(id, path) and IndexUpdate::removal(id) for these snippets:

result.removed.insert(id.item);
result
    .added
    .insert(id, HashSet::from([timestamped_path]));

Then we could immediately return from update_one once we determined the update:

return Ok(IndexUpdate::removal(id.item));
return Ok(IndexUpdate::addition(id, timestamped_path));

This should be more readable.
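
A minimal sketch of what such constructors might look like is shown below. The struct is a simplified stand-in for the crate's IndexUpdate: ids are plain u64 values and paths carry no timestamp, both of which are assumptions made for brevity.

```rust
use std::collections::{HashMap, HashSet};
use std::path::PathBuf;

/// Simplified stand-in for `IndexUpdate` (assumed field types).
#[derive(Debug, Default, PartialEq)]
pub struct IndexUpdate {
    pub added: HashMap<u64, HashSet<PathBuf>>,
    pub removed: HashSet<u64>,
}

impl IndexUpdate {
    /// Update describing a single added resource.
    pub fn addition(id: u64, path: PathBuf) -> Self {
        IndexUpdate {
            added: HashMap::from([(id, HashSet::from([path]))]),
            removed: HashSet::new(),
        }
    }

    /// Update describing a single removed resource.
    pub fn removal(id: u64) -> Self {
        IndexUpdate {
            added: HashMap::new(),
            removed: HashSet::from([id]),
        }
    }
}
```

Each branch of update_one could then end with a single return statement instead of mutating a shared result value.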


By the way, it seems that we could simplify added field of the IndexUpdate structure. Since we don't distinguish duplicates, we can take any path as a representative of the group, so the app could do something with it. In practice, when a unique resource is detected, we take its path as the representative. When a duplicate appears, we skip it. If several paths are introduced at once during a unique addition, we take an arbitrary one (options: 1) random; 2) the first in the vector; 3) the shortest path).

From the API point of view, we don't need a collection of paths attached to the addition event, only one path (the representative).


We might need separate events to track duplicates, something like DuplicateAdded(id, path) and DuplicateRemoved(id, path). However, I'm not sure that duplicate removal would be useful; maybe duplicate addition is enough. It could be used to let the user select the representative manually. Just an idea for the future.
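
The "shortest path" option for picking a representative could be sketched as follows. The function name representative is hypothetical, and breaking ties by path ordering is an added assumption so that the choice is deterministic.

```rust
use std::collections::HashSet;
use std::path::{Path, PathBuf};

/// Picks one path to represent a group of duplicates: the shortest one,
/// with ties broken by path ordering. Returns `None` for an empty group.
pub fn representative(paths: &HashSet<PathBuf>) -> Option<&Path> {
    paths
        .iter()
        .min_by(|a, b| {
            a.as_os_str()
                .len()
                .cmp(&b.as_os_str().len())
                .then_with(|| a.cmp(b))
        })
        .map(PathBuf::as_path)
}
```

Because the rule depends only on the set of paths, the same representative is chosen regardless of the order in which duplicates were discovered.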

@tareknaser
Collaborator Author

Then we could immediately return from update_one once we determined the update:

update_one() currently has an if statement that checks early on if the update is a removal or an addition, which makes it able to return from each branch as soon as possible.

By the way, it seems that we could simplify added field of the IndexUpdate structure.

I was actually looking at the code today and thought the same thing. The added field definitely has an interesting type, though I can’t quite remember why we landed on it—probably something inherited from older arklib code.

I think we could track this in a separate issue and handle it in its own PR. What do you think?

Benchmark for 94fccc9

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.8±1.07µs 248.6±0.77µs -0.48%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.06µs 15.6±0.07µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1361.1±2.14ns 1361.1±2.28ns 0.00%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.2±1.30µs 196.0±0.83µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1664.5±5.23µs 1679.8±33.90µs +0.92%
crc32_resource_id_creation/compute_from_bytes:large 86.9±0.36µs 87.0±2.44µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.03µs 5.4±0.01µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.3±0.37ns 92.4±0.46ns +0.11%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.1±0.33µs 64.0±0.37µs -0.16%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 908.8±5.02µs 911.2±5.90µs +0.26%
resource_index/index_build//tmp/ark-fs-index-benchmarksRURZ2N 108.3±1.83ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksVbGibi 108.1±1.37ms N/A N/A
resource_index/index_get_resource_by_id 96.6±0.43ns 97.5±1.03ns +0.93%
resource_index/index_get_resource_by_path 53.6±0.27ns 55.5±0.42ns +3.54%
resource_index/index_update_all 1102.3±38.40ms 1116.2±38.74ms +1.26%
resource_index/index_update_one 668.0±23.43ms 676.8±22.17ms +1.32%

Comment on lines 532 to 534
result.removed.insert(id.item);
}

result.removed.insert(id.item);
Member

We need some test cases for this.

As well as test cases demonstrating what happens if the caller violates assumptions:

/// **Note**: The caller must ensure that:

/// - The index is up-to-date with the file system except for the updated
/// resource
/// - In case of an addition, the resource was not already in the index
/// - In case of a modification or removal, the resource was already in the
/// index

I'm thinking now that we should panic when the assumptions are violated.

Collaborator Author

We need some test cases for this.

Added a check in the existing test test_track_removal_with_collision for this particular case.

As well as test cases demonstrating what happens if the caller violates assumptions:

update_one() is meant to be a simpler, more targeted way to update the index compared to update_all(). This comes with a few constraints and assumptions that the caller needs to handle. The caller must ensure the index state is as expected before calling update_one() to avoid issues.

  • For example, if update_one() is called on a non-existent file, it will panic with an error:
Caller must ensure that the resource exists in the index: "file.txt"
  • If the caller mistakenly thinks a file was modified but it wasn’t, update_one() will still reindex it with no impact, as the information remains consistent.

Adding checks to enforce conditions—like verifying the index is current with the file system except for the updated resource—would essentially turn update_one() into update_all(), as this would require a full reindex. To keep update_one() efficient, it’s better to leave it as is. We are including a clear warning in the documentation emphasizing that the caller should confirm these conditions. I think this is more than sufficient

Member

Adding checks to enforce conditions—like verifying the index is current with the file system except for the updated resource—would essentially turn update_one() into update_all(), as this would require a full reindex. To keep update_one() efficient, it’s better to leave it as is.

Yeah, for sure we don't want to rescan the folder during update_one. We could assert some simple invariants, like that the id is present in the index, but this is actually already done (implicitly).

We are including a clear warning in the documentation emphasizing that the caller should confirm these conditions. I think this is more than sufficient

Yes, but if we can make it fool-proof, that would be the best.

  1. I see that if the user calls update_one(removed_path) and the id of the removed file isn't there, unwrap will panic, ensuring no inconsistent state is introduced. This is good, but even better would be to provide error message. User could forget to call update_one when resource was introduced, but make the call when it was removed.
  2. If the user calls update_one(added_path), we insert the path into self.id_to_paths.entry(id.clone()).or_default(), which handles both a new addition and a duplicate addition (regardless of whether the path was there or not). This looks good; no need to check anything.

Collaborator Author

This is good, but even better would be to provide error message. User could forget to call update_one when resource was introduced, but make the call when it was removed.

Updated the code to gracefully return an error if this case happens
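
The general shape of that change, returning an error instead of unwrapping, is sketched below on a toy index. ToyIndex, UpdateError, and the u64 ids are hypothetical stand-ins; the actual error type and messages used in fs-index may differ.

```rust
use std::collections::{HashMap, HashSet};
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical error for a removal of a path the index never saw.
#[derive(Debug, PartialEq)]
pub enum UpdateError {
    UnknownPath(PathBuf),
}

impl fmt::Display for UpdateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            UpdateError::UnknownPath(p) => {
                write!(f, "path is not in the index: {}", p.display())
            }
        }
    }
}

impl std::error::Error for UpdateError {}

/// Toy index with the two maps discussed in the review.
#[derive(Default)]
pub struct ToyIndex {
    path_to_id: HashMap<PathBuf, u64>,
    id_to_paths: HashMap<u64, HashSet<PathBuf>>,
}

impl ToyIndex {
    /// Handles both a new addition and a duplicate addition.
    pub fn add(&mut self, path: &Path, id: u64) {
        self.path_to_id.insert(path.to_path_buf(), id);
        self.id_to_paths.entry(id).or_default().insert(path.to_path_buf());
    }

    /// Returns an error instead of panicking when the path is unknown.
    pub fn remove(&mut self, path: &Path) -> Result<u64, UpdateError> {
        let id = self
            .path_to_id
            .remove(path)
            .ok_or_else(|| UpdateError::UnknownPath(path.to_path_buf()))?;
        let now_empty = match self.id_to_paths.get_mut(&id) {
            Some(paths) => {
                paths.remove(path);
                paths.is_empty()
            }
            None => false,
        };
        if now_empty {
            // Last path for this id is gone, so drop the id entry as well.
            self.id_to_paths.remove(&id);
        }
        Ok(id)
    }
}
```

The failed lookup leaves both maps untouched, so a mistaken call cannot put the index into an inconsistent state.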

@kirillt
Member

kirillt commented Oct 30, 2024

Then we could immediately return from update_one once we determined the update:

update_one() currently has an if statement that checks early on if the update is a removal or an addition, which makes it able to return from each branch as soon as possible.

I'm talking only about code clarity/maintainability.

@kirillt
Member

kirillt commented Oct 30, 2024

I was actually looking at the code today and thought the same thing. The added field definitely has an interesting type, though I can’t quite remember why we landed on it—probably something inherited from older arklib code.

We can have multiple paths for the same id, and we can have many ids when using update_all.

I think we could track this in a separate issue and handle it in its own PR. What do you think?

Agree, created issue:

Benchmark for 4dd461c

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 250.4±2.08µs 249.7±2.22µs -0.28%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.13µs 15.6±0.35µs 0.00%
blake3_resource_id_creation/compute_from_bytes:small 1357.0±10.67ns 1362.7±5.55ns +0.42%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.3±1.19µs 196.1±1.11µs -0.10%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1672.1±22.35µs 1670.4±18.00µs -0.10%
crc32_resource_id_creation/compute_from_bytes:large 86.8±0.45µs 86.9±0.42µs +0.12%
crc32_resource_id_creation/compute_from_bytes:medium 5.4±0.02µs 5.4±0.03µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 92.9±2.24ns 92.4±0.47ns -0.54%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.7±0.70µs 64.2±0.56µs -0.77%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 907.9±5.89µs 908.4±5.79µs +0.06%
resource_index/index_build//tmp/ark-fs-index-benchmarksDcxEpT 113.5±1.25ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksIfqZ8c 106.8±2.65ms N/A N/A
resource_index/index_get_resource_by_id 96.7±0.71ns 97.4±1.06ns +0.72%
resource_index/index_get_resource_by_path 55.2±2.57ns 55.8±0.75ns +1.09%
resource_index/index_update_all 1126.8±45.57ms 1106.3±45.55ms -1.82%
resource_index/index_update_one 660.4±22.35ms 672.1±25.42ms +1.77%

@tareknaser
Collaborator Author

I made a few updates to the PR:

  • Added logging for the index_watch example
  • Added a note in the update_one() doc comment to specify that for rename or move operations, update_one() should be called twice
  • Added a couple of tests to cover these two cases
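
The rename rule from the doc comment can be illustrated with a toy model, shown below. Everything here is an assumption made for the sketch: FakeFs stands in for the real file system, ids are plain u64 values, and this update_one takes the fake file system as an argument, which the crate's method does not.

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};

/// Stand-in for the file system: path -> content id.
pub type FakeFs = HashMap<PathBuf, u64>;

#[derive(Debug, PartialEq)]
pub enum Change {
    Added(u64),
    Removed(u64),
}

#[derive(Default)]
pub struct ToyIndex {
    path_to_id: HashMap<PathBuf, u64>,
}

impl ToyIndex {
    /// Looks at one path only: if it exists on "disk" it is added or
    /// refreshed, otherwise it is removed from the index.
    pub fn update_one(&mut self, fs: &FakeFs, path: &Path) -> Result<Change, String> {
        match fs.get(path) {
            Some(&id) => {
                self.path_to_id.insert(path.to_path_buf(), id);
                Ok(Change::Added(id))
            }
            None => self
                .path_to_id
                .remove(path)
                .map(Change::Removed)
                .ok_or_else(|| format!("path not in index: {}", path.display())),
        }
    }
}

/// A rename looks like "old path gone" plus "new path present", so it
/// takes two calls: one for the old path and one for the new path.
pub fn handle_rename(
    index: &mut ToyIndex,
    fs: &FakeFs,
    from: &Path,
    to: &Path,
) -> Result<[Change; 2], String> {
    let removed = index.update_one(fs, from)?;
    let added = index.update_one(fs, to)?;
    Ok([removed, added])
}
```

A single call with only the new path would leave the old path in the index, which is why the doc comment asks for two calls.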

Benchmark for 5c4137d

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.4±2.19µs 248.0±0.93µs -0.56%
blake3_resource_id_creation/compute_from_bytes:medium 15.5±0.11µs 15.6±0.19µs +0.65%
blake3_resource_id_creation/compute_from_bytes:small 1364.9±2.48ns 1370.8±13.09ns +0.43%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 196.9±2.37µs 201.6±0.38µs +2.39%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1688.5±5.19µs 1734.0±24.14µs +2.69%
crc32_resource_id_creation/compute_from_bytes:large 87.0±1.35µs 86.7±0.35µs -0.34%
crc32_resource_id_creation/compute_from_bytes:medium 5.5±0.40µs 5.4±0.07µs -1.82%
crc32_resource_id_creation/compute_from_bytes:small 92.9±2.77ns 92.4±0.77ns -0.54%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.4±0.33µs 64.3±0.16µs -0.16%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 915.0±4.78µs 915.6±5.14µs +0.07%
resource_index/index_build//tmp/ark-fs-index-benchmarks4CiNdu 110.7±0.74ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarks9xZzKq 112.2±1.63ms N/A N/A
resource_index/index_get_resource_by_id 100.3±1.52ns 97.2±0.59ns -3.09%
resource_index/index_get_resource_by_path 60.9±0.22ns 54.4±0.54ns -10.67%
resource_index/index_update_all 1097.1±28.25ms 1099.0±44.13ms +0.17%
resource_index/index_update_one 664.7±15.27ms 663.4±17.03ms -0.20%

Benchmark for 4d69b11

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 249.7±1.05µs 249.0±2.75µs -0.28%
blake3_resource_id_creation/compute_from_bytes:medium 15.8±2.31µs 15.5±0.05µs -1.90%
blake3_resource_id_creation/compute_from_bytes:small 1355.3±35.24ns 1345.9±4.30ns -0.69%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.5±3.48µs 197.8±2.29µs +0.15%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1681.1±11.75µs 1690.6±20.60µs +0.57%
crc32_resource_id_creation/compute_from_bytes:large 92.0±1.47µs 91.7±0.46µs -0.33%
crc32_resource_id_creation/compute_from_bytes:medium 5.7±0.11µs 5.7±0.06µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 96.4±0.43ns 96.5±1.05ns +0.10%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 64.8±0.92µs 64.9±0.66µs +0.15%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 920.3±16.31µs 910.3±11.27µs -1.09%
resource_index/index_build//tmp/ark-fs-index-benchmarksAQUXEx 103.6±2.56ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksq7cThX 105.5±2.89ms N/A N/A
resource_index/index_get_resource_by_id 102.7±3.49ns 99.6±1.64ns -3.02%
resource_index/index_get_resource_by_path 55.8±2.81ns 53.2±0.53ns -4.66%
resource_index/index_update_all 1101.9±34.51ms 1122.1±49.65ms +1.83%
resource_index/index_update_one 662.7±21.92ms 677.0±22.89ms +2.16%

Member

@kirillt kirillt left a comment

Thanks, great job

Benchmark for 5031b4a

Test Base PR %
blake3_resource_id_creation/compute_from_bytes:large 248.4±0.35µs 249.5±1.34µs +0.44%
blake3_resource_id_creation/compute_from_bytes:medium 15.6±0.23µs 15.5±0.06µs -0.64%
blake3_resource_id_creation/compute_from_bytes:small 1345.5±1.71ns 1346.0±2.10ns +0.04%
blake3_resource_id_creation/compute_from_path:../test-assets/lena.jpg 197.5±0.75µs 197.5±1.17µs 0.00%
blake3_resource_id_creation/compute_from_path:../test-assets/test.pdf 1692.2±12.89µs 1692.1±30.31µs -0.01%
crc32_resource_id_creation/compute_from_bytes:large 91.8±0.19µs 92.6±5.35µs +0.87%
crc32_resource_id_creation/compute_from_bytes:medium 5.7±0.01µs 5.7±0.03µs 0.00%
crc32_resource_id_creation/compute_from_bytes:small 96.4±0.69ns 96.5±1.59ns +0.10%
crc32_resource_id_creation/compute_from_path:../test-assets/lena.jpg 65.2±2.36µs 65.6±0.96µs +0.61%
crc32_resource_id_creation/compute_from_path:../test-assets/test.pdf 913.1±6.70µs 917.4±4.90µs +0.47%
resource_index/index_build//tmp/ark-fs-index-benchmarksNNAGfc 108.3±1.74ms N/A N/A
resource_index/index_build//tmp/ark-fs-index-benchmarksm7c99V 107.3±1.55ms N/A N/A
resource_index/index_get_resource_by_id 103.2±3.77ns 100.7±2.39ns -2.42%
resource_index/index_get_resource_by_path 54.0±1.92ns 57.3±2.15ns +6.11%
resource_index/index_update_all 1118.0±45.37ms 1114.5±52.34ms -0.31%
resource_index/index_update_one 690.5±25.55ms 655.5±25.81ms -5.07%

@tareknaser tareknaser merged commit 66c9362 into main Nov 20, 2024
4 checks passed
@tareknaser tareknaser deleted the watch branch November 20, 2024 16:10