A question about point cloud data reading #70

Open
r-sy opened this issue Feb 3, 2021 · 2 comments

r-sy commented Feb 3, 2021

I would like to ask you a question about point cloud data reading:

    data_provider = DataProvider(
        fetch_data, batch_data,
        load_dataset_to_mem=train_config['load_dataset_to_mem'],
        load_dataset_every_N_time=train_config['load_dataset_every_N_time'],
        capacity=train_config['capacity'],
        num_workers=train_config['num_load_dataset_workers'],
        preload_list=list(range(NUM_TEST_SAMPLE)))

    # Training session ==========================================================
    batch_size = train_config.get('batch_size', 1)
    print('batch size=' + str(batch_size))
    frame_idx_list = np.random.permutation(NUM_TEST_SAMPLE)
    print(f"frame_idx_list: {frame_idx_list.shape}")
    print(frame_idx_list)

    batchs = data_provider.provide_batch([1545,1546])
    # print(f"batchs: {batchs}")
    input_v, vertex_coord_list, keypoint_indices_list, edges_list, \
        cls_labels, encoded_boxes, valid_boxes = batchs

    print(f"input_v: {input_v.shape}")
    for i, vertex_coord in enumerate(vertex_coord_list):
        print(f"vertex_coord: {i}: {vertex_coord.shape}")

    for i, indices in enumerate(keypoint_indices_list):
        print(f"indices: {i}: {indices.shape}")
        print(indices)
    for i, edge in enumerate(edges_list):
        print(f"edge: {i}: {edge.shape}")
        print(edge)
        #for item in edge:
        #    if item[0]==item[1]: print(item)
    print(f"cls_labels:{cls_labels.shape}")
    print(f"encoded_boxes: {encoded_boxes.shape}")
    print(f"valid_boxes: {valid_boxes.shape}")
    print(valid_boxes)
    print(f"max: {valid_boxes.max()}, min:{valid_boxes.min()}, sum: {valid_boxes.sum()}")
    print(f"NUM_TEST_SAMPLE: {NUM_TEST_SAMPLE}")

I tried to inspect the shape and content of the data read by the DataProvider. I can see that there are 3260 related files for "car_class", and that each GPU processes a randomly drawn batch with batch_size=2. However, I found that Point-GNN merges the two frames of a batch together after preprocessing. These are two different scenes, so why are they computed together? Shouldn't they be processed separately, in parallel? It made me wonder whether I had misunderstood something.

For example, as shown in the code above, I read the point cloud bin files for the two indices [1545, 1546]. I ran both the joint read with batch_size = 2 and the separate reads with batch_size = 1. The results are as follows:

# batchs = data_provider.provide_batch([1545,1546])
frame_idx_list: (3260,)
[ 754 1470  368 ... 3196 3056  788]
input_v: (37255, 1)
vertex_coord: 0: (37255, 3)
vertex_coord: 1: (3011, 3)
vertex_coord: 2: (3011, 3)
indices: 0: (3011, 1)
[[    0]
 [    9]
 [  364]
 ...
 [37001]
 [37064]
 [37141]]
indices: 1: (3011, 1)
[[   0]
 [   1]
 [   2]
 ...
 [3008]
 [3009]
 [3010]]
edge: 0: (212397, 2)
[[    9     0]
 [    7     0]
 [   10     0]
 ...
 [36149  3010]
 [36148  3010]
 [36146  3010]]
edge: 1: (162771, 2)
[[1097    0]
 [1096    0]
 [1093    0]
 ...
 [2967 3010]
 [2977 3010]
 [2937 3010]]
# batchs = data_provider.provide_batch([1545])
frame_idx_list: (3260,)
[2113 1298 1673 ... 2438  605  722]
input_v: (19893, 1)
vertex_coord: 0: (19893, 3)
vertex_coord: 1: (2040, 3)
vertex_coord: 2: (2040, 3)
indices: 0: (2040, 1)
[[    1]
 [    8]
 [   12]
 ...
 [19666]
 [19741]
 [19746]]
indices: 1: (2040, 1)
[[   0]
 [   1]
 [   2]
 ...
 [2037]
 [2038]
 [2039]]
edge: 0: (108590, 2)
[[    8     0]
 [    9     0]
 [   10     0]
 ...
 [19152  2039]
 [19149  2039]
 [19148  2039]]
edge: 1: (126976, 2)
[[1259    0]
 [1104    0]
 [1186    0]
 ...
 [1999 2039]
 [2024 2039]
 [2034 2039]]
# batchs = data_provider.provide_batch([1546])
frame_idx_list: (3260,)
[1981 1082 1125 ... 2300  824 3203]
input_v: (17362, 1)
vertex_coord: 0: (17362, 3)
vertex_coord: 1: (1072, 3)
vertex_coord: 2: (1072, 3)
indices: 0: (1072, 1)
[[    1]
 [  386]
 [  387]
 ...
 [17043]
 [17291]
 [17222]]
indices: 1: (1072, 1)
[[   0]
 [   1]
 [   2]
 ...
 [1069]
 [1070]
 [1071]]
edge: 0: (115093, 2)
[[    0     0]
 [  385     0]
 [    1     0]
 ...
 [16225  1071]
 [16227  1071]
 [16226  1071]]
edge: 1: (46338, 2)
[[ 203    0]
 [   0    0]
 [ 114    0]
 ...
 [1007 1071]
 [1039 1071]
 [1017 1071]]

It can be seen that when the batch size is 2, the vertices and edges of two different scenes are mixed together for the computation.

If the scenes are supposed to be computed separately and in parallel, could you please point me to where that happens in the code? This is very important to me. I hope you can help me.

WeijingShi (Owner) commented

Hi @r-sy, you are correct. We do put multiple point clouds into a single batch, kind of like the batch operation for images.

Here is an example of how we do it:

Say we are given two point clouds, from frame 1 and frame 2.

Frame 1 is a point cloud with two vertices, A and B. Let's say we are doing pooling and select B as a key-point, so we have the source set [A, B] and the destination set [B]; therefore indices = [[1]] (destination vertex B's index in the source set is 1). We also create edges connecting source A and source B to destination B: edge_1 = [[0, 0], [1, 0]], where each pair is (index in the source set, index in the destination set).

Frame 2 is another point cloud with three vertices: C, D, E, and we want no downsampling. So the source set is [C, D, E] and the destination set is exactly the same set [C, D, E], giving indices = [[0], [1], [2]]. Let's create some edges connecting C to D and D to E: edges_2 = [[0, 1], [1, 2]].

Now, we could run the GNN separately for each frame just as you said, but we can also batch the frames and run the GNN once. We batch the graphs by incrementing the vertex ids.

A batch of frame 1 and frame 2: the concatenated source set is [A, B, C, D, E] and the concatenated destination set is [B, C, D, E]. The indices that link the source to the destination are now indices = [[1], [0+2], [1+2], [2+2]]. edge_1 remains the same, but we increment the indices in edges_2, giving edges_2 = [[0+2, 1+1], [1+2, 2+1]]. The combined edge list is np.concatenate([edge_1, edges_2]). The result is basically one larger disjoint graph with two separate pieces, and we can run the GNN on it. No edges cross between the vertices of different frames.
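
To make the index arithmetic concrete, here is a minimal NumPy sketch of this example (the arrays and offsets follow the description above; it is an illustration, not the repository's exact code):

    import numpy as np

    # Frame 1: source set [A, B], destination set [B]
    indices_1 = np.array([[1]])              # B's index in frame 1's source set
    edge_1 = np.array([[0, 0], [1, 0]])      # A->B, B->B as (src_idx, dst_idx)

    # Frame 2: source set [C, D, E], destination set [C, D, E]
    indices_2 = np.array([[0], [1], [2]])
    edges_2 = np.array([[0, 1], [1, 2]])     # C->D, D->E

    # Offsets = how many vertices frame 1 already contributed to each combined set
    src_offset = 2   # frame 1 has 2 source vertices (A, B)
    dst_offset = 1   # frame 1 has 1 destination vertex (B)

    indices = np.concatenate([indices_1, indices_2 + src_offset])
    edges = np.concatenate([edge_1, edges_2 + [src_offset, dst_offset]])

    print(indices.T)  # [[1 2 3 4]]
    print(edges)      # [[0 0] [1 0] [2 2] [3 3]]

Every edge of frame 2 now points past the end of frame 1's vertex ranges, so the two sub-graphs stay disjoint.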

This batching process is done by:

    def batch_data(batch_list):
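
As a rough sketch of what such a function does for the graph fields (illustrative only, written against the per-frame tuple shown in the printout above; the real batch_data also handles the labels and boxes):

    import numpy as np

    def batch_frames(frame_list):
        # Each frame: (input_v, vertex_coord_list, keypoint_indices_list, edges_list)
        input_v = np.concatenate([f[0] for f in frame_list])
        num_levels = len(frame_list[0][1])
        vertex_coord_list = [
            np.concatenate([f[1][lvl] for f in frame_list])
            for lvl in range(num_levels)]
        keypoint_indices_list, edges_list = [], []
        for g in range(num_levels - 1):  # graph g connects level g -> level g+1
            src_offset, dst_offset = 0, 0
            idx_parts, edge_parts = [], []
            for f in frame_list:
                idx_parts.append(f[2][g] + src_offset)
                edge_parts.append(f[3][g] + [src_offset, dst_offset])
                src_offset += len(f[1][g])      # source vertices added by this frame
                dst_offset += len(f[1][g + 1])  # destination vertices added by this frame
            keypoint_indices_list.append(np.concatenate(idx_parts))
            edges_list.append(np.concatenate(edge_parts))
        return input_v, vertex_coord_list, keypoint_indices_list, edges_list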

This batching operation is just an optimization. If it sounds like too much trouble, you can add COPY_PER_GPU to train_config and make sure COPY_PER_GPU * NUM_GPU == batch_size. This lets train.py run COPY_PER_GPU copies of the network in parallel on each GPU, and each copy gets a batch size of 1, so the whole batching operation is skipped.
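
For example, with NUM_GPU = 2 and batch_size = 2, one copy per GPU gives every network copy a single frame (hypothetical values; the key names follow the description above, so check your config file for the exact spelling):

    # hypothetical train_config entries
    train_config['batch_size'] = 2
    train_config['COPY_PER_GPU'] = 1
    # With NUM_GPU == 2: COPY_PER_GPU * NUM_GPU == batch_size,
    # so each copy receives one frame and the graph merging is skipped.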

Hope it helps.


r-sy commented Mar 21, 2021

Thank you for your answer, which is very clear!
