[GraphBolt] Check return values of liburing calls and refactor. #7518

Merged
mfbalin merged 29 commits into dmlc:master from gb_io_uring_safer on Jul 17, 2024

Conversation

mfbalin
Collaborator

@mfbalin mfbalin commented Jul 11, 2024

Description

The old code did not check the return values of liburing calls or of the completed reads. If a read is short (N bytes requested but only M < N bytes read), we have to continue from where it left off and submit another request for the remaining N - M bytes. With this in place, we can write a regression test that measures speed first and checks correctness only after the timer has stopped, so the check does not affect timing.
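
A minimal Python sketch of the resume-after-short-read logic described above. It is not the C++ liburing code changed in this PR: `read_fully` and `short_pread` are hypothetical names, and `os.pread` stands in for a single read request whose result must be checked before the data is used.

```python
import os
import tempfile


def read_fully(fd, nbytes, offset, read_fn=os.pread):
    """Read exactly nbytes starting at offset, resubmitting after short reads."""
    chunks = []
    done = 0
    while done < nbytes:
        # Ask only for the bytes still missing, starting where the last read stopped.
        chunk = read_fn(fd, nbytes - done, offset + done)
        if not chunk:
            # Zero bytes means end of file, so the request can never complete.
            raise EOFError(f"expected {nbytes} bytes, got {done}")
        chunks.append(chunk)
        done += len(chunk)
    return b"".join(chunks)


def short_pread(fd, nbytes, offset):
    # Hypothetical reader that returns at most 3 bytes per call,
    # used only to force the short-read path.
    return os.pread(fd, min(nbytes, 3), offset)


with tempfile.TemporaryDirectory() as test_dir:
    path = os.path.join(test_dir, "data.bin")
    payload = bytes(range(32))
    with open(path, "wb") as f:
        f.write(payload)
    fd = os.open(path, os.O_RDONLY)
    try:
        data = read_fully(fd, 10, 5, read_fn=short_pread)
    finally:
        os.close(fd)
```

Here a 10-byte request is served in chunks of at most 3 bytes, and the loop keeps resubmitting until all bytes have arrived. With io_uring the same check would apply to the result field of each completion, which holds the number of bytes read or a negative error code.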

Regression test:

import os
import tempfile
from functools import partial
import time

import torch
import dgl.graphbolt as gb

assert_equal = partial(torch.testing.assert_close, rtol=0, atol=0)

def to_on_disk_numpy(test_dir, name, t):
    path = os.path.join(test_dir, name + ".npy")
    gb.numpy_save_aligned(path, t.numpy())
    return path

def test_index_select_throughput_and_iops(shape, dtype, indices, num_threads_list):
    tensor = torch.randint(0, 127, shape, dtype=dtype)

    skip_first = 10

    results = []
    IOPSs = []

    with tempfile.TemporaryDirectory() as test_dir:
        path = to_on_disk_numpy(test_dir, "tensor", tensor)

        for num_threads in num_threads_list:
            feature = gb.DiskBasedFeature(path=path, num_threads=num_threads)

            throughput_sum = 0
            iops_sum = 0

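            for i, idx in enumerate(indices):
                # Time only the read itself.
                start = time.time()
                result = feature.read(idx)
                duration = time.time() - start
                # Check correctness after the timer has stopped.
                assert_equal(result, tensor[idx])

                # Skip the first few iterations as warm-up.
                if i >= skip_first:
                    throughput_sum += result.nbytes / duration
                    iops_sum += idx.numel() / duration
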
            throughput = throughput_sum / (len(indices) - skip_first)
            iops = iops_sum / (len(indices) - skip_first)
            print(num_threads, int(throughput / (2 ** 20)), "MiB/s", int(iops), "IOPS")
            results.append(throughput)
            IOPSs.append(iops)

    return results, IOPSs

shape = [2500000, 4096]
dtype = torch.int8
batch_size = 100000
indices = [torch.randint(0, shape[0], [batch_size], dtype=torch.int32) for _ in range(25)]
num_threads_list = list(range(1, 9))

throughputs, IOPSs = test_index_select_throughput_and_iops(shape, dtype, indices, num_threads_list)
print(list((num_threads, int(throughput / (2 ** 20)), int(iops)) for num_threads, throughput, iops in zip(num_threads_list, throughputs, IOPSs)))

Benchmark results with a 4 KiB feature dimension (4096 int8 values per row); the first column is the number of threads:

(venv) mfbalin@BALIN-PC:~/dgl-1$ python graphbolt/benchmarks/disk_based_feature.py
1 715 MiB/s 183271 IOPS
2 1142 MiB/s 292576 IOPS
3 1571 MiB/s 402287 IOPS
4 1785 MiB/s 457213 IOPS
5 1873 MiB/s 479614 IOPS
6 2094 MiB/s 536154 IOPS
7 2185 MiB/s 559578 IOPS
8 2228 MiB/s 570410 IOPS
[(1, 715, 183271), (2, 1142, 292576), (3, 1571, 402287), (4, 1785, 457213), (5, 1873, 479614), (6, 2094, 536154), (7, 2185, 559578), (8, 2228, 570410)]
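
As a sanity check on the numbers above: each index reads one row of 4096 bytes, so the reported throughput should equal IOPS × 4096 bytes. The sketch below verifies that relationship against the reported tuples and computes the speedup relative to one thread. The names `ROW_BYTES`, `reported`, `consistent`, and `speedups` are introduced here for illustration only.

```python
ROW_BYTES = 4096  # one row: shape[1] = 4096 int8 values
MIB = 2 ** 20

# (num_threads, MiB/s, IOPS) tuples copied from the benchmark output above.
reported = [
    (1, 715, 183271), (2, 1142, 292576), (3, 1571, 402287), (4, 1785, 457213),
    (5, 1873, 479614), (6, 2094, 536154), (7, 2185, 559578), (8, 2228, 570410),
]

base_iops = reported[0][2]
consistent = []
speedups = {}
for num_threads, mib_per_s, iops in reported:
    # Throughput implied by the IOPS figure, truncated like the benchmark's int().
    consistent.append(int(iops * ROW_BYTES / MIB) == mib_per_s)
    speedups[num_threads] = iops / base_iops
    print(num_threads, mib_per_s, "MiB/s", f"{speedups[num_threads]:.2f}x")
```

All eight rows are consistent with the 4096-byte row size, and eight threads deliver roughly 3.1 times the single-thread IOPS, with diminishing returns beyond four threads.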

Checklist

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • I've leveraged the tools to beautify the Python and C++ code.
  • The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code changes small (examples, tests, and documentation can be exempted).
  • All changes have test coverage
  • Code is well-documented
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • Related issue is referenced in this PR
  • If the PR is for a new model/paper, I've updated the example index here.

Changes

@mfbalin mfbalin marked this pull request as draft July 11, 2024 21:37
@dgl-bot
Collaborator

dgl-bot commented Jul 11, 2024

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch];
    For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

@dgl-bot
Collaborator

dgl-bot commented Jul 11, 2024

Commit ID: abba7f6

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 11, 2024

Commit ID: bfe1981

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: 45d3522

Build ID: 3

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: 0fb9a5b

Build ID: 4

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@mfbalin mfbalin requested a review from frozenbugs July 12, 2024 06:18
@mfbalin mfbalin marked this pull request as ready for review July 12, 2024 06:18
@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: 82f1013

Build ID: 5

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: a60d979

Build ID: 6

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: 2a858db

Build ID: 7

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: 01b0b84

Build ID: 8

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: 8d9c8df

Build ID: 9

Status: ❌ CI test failed in Stage [CPU Build (Win64)].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: e8ee557

Build ID: 10

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 12, 2024

Commit ID: ee36568

Build ID: 11

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 14, 2024

Commit ID: ca8da895f52b48eb340b65da409e8b7e150785a5

Build ID: 12

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: 4c877c6

Build ID: 24

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: e9e9dd7

Build ID: 25

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

graphbolt/src/cnumpy.h (review thread, outdated, resolved)
@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: 1affb480c5118bf4190a05a3639122f13000f15a

Build ID: 26

Status: ❌ CI test failed in Stage [Torch CPU Example test].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: e4f93ae

Build ID: 27

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: c84a7f0a13c51d43f52a01ebf50e7d6b64cb4aec

Build ID: 28

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: fcb09ed36b6800cc994ab6556dde2985463206e1

Build ID: 29

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: 4167fa339f3986b72ed871f033467248b035e09a

Build ID: 30

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: 68b75661a829a875e2da6f93f3d01e6283b055ca

Build ID: 31

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: 042df7debdfbe00d61f838f6b4ab172e551222dc

Build ID: 32

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: e596c721a01d6dabdcdfffd3f50f4627d6709e78

Build ID: 33

Status: ❌ CI test failed in Stage [CPU Build].

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: 6c5c035

Build ID: 34

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 16, 2024

Commit ID: ba7c89bda3077c7da1e07767885d4e1f3784317e

Build ID: 35

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@dgl-bot
Collaborator

dgl-bot commented Jul 17, 2024

Commit ID: 0b2a630

Build ID: 36

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

graphbolt/src/cnumpy.cc (review thread, outdated, resolved)
graphbolt/src/cnumpy.cc (review thread, outdated, resolved)
graphbolt/src/cnumpy.cc (review thread, outdated, resolved)
@mfbalin mfbalin requested a review from frozenbugs July 17, 2024 08:13
graphbolt/src/cnumpy.cc (review thread, resolved)
graphbolt/src/cnumpy.cc (review thread, resolved)
@dgl-bot
Collaborator

dgl-bot commented Jul 17, 2024

Commit ID: ebf0e2b

Build ID: 37

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

@mfbalin mfbalin merged commit d37d516 into dmlc:master Jul 17, 2024
2 checks passed
@mfbalin mfbalin deleted the gb_io_uring_safer branch July 17, 2024 08:50
3 participants