
[Bug] PD GraphCache range attribute has a missing coverage range of 0-65535 #2634

Closed
haohao0103 opened this issue Aug 13, 2024 · 11 comments · Fixed by #2641
Labels
bug Something isn't working

Comments

@haohao0103
Contributor

Bug Type

performance (performance degradation)

Before submit

  • I have confirmed and searched the existing issues and FAQ, and there is no similar or duplicate issue.

Environment

  • Server Version: 1.5.0 (Apache Release Version)
  • Backend: HStore

Expected & Actual behavior

Vertex write speed suddenly drops and P99 latency rises; on investigation, the GraphCache range map was found to be missing part of its coverage of the 0-65535 key space.

Actual range:

result=@ArrayList[
    @TreeRangeMap[
        entriesByLowerBound=@TreeMap[
            @BelowValue[\4096/]:@RangeMapEntry[[4096..8192)=1],
            @BelowValue[\8192/]:@RangeMapEntry[[8192..12288)=2],
            @BelowValue[\12288/]:@RangeMapEntry[[12288..16384)=3],
            @BelowValue[\36864/]:@RangeMapEntry[[36864..40960)=9],
            @BelowValue[\40960/]:@RangeMapEntry[[40960..45056)=10],
            @BelowValue[\45056/]:@RangeMapEntry[[45056..49152)=11],
        ],
        EMPTY_SUB_RANGE_MAP=@[
        ],
    ],
]

Expected range:

@TreeRangeMap[
        entriesByLowerBound=@TreeMap[
            @BelowValue[\0/]:@RangeMapEntry[[0..4096)=0],
            @BelowValue[\4096/]:@RangeMapEntry[[4096..8192)=1],
            @BelowValue[\8192/]:@RangeMapEntry[[8192..12288)=2],
            @BelowValue[\12288/]:@RangeMapEntry[[12288..16384)=3],
            @BelowValue[\16384/]:@RangeMapEntry[[16384..20480)=4],
            @BelowValue[\20480/]:@RangeMapEntry[[20480..24576)=5],
            @BelowValue[\24576/]:@RangeMapEntry[[24576..28672)=6],
            @BelowValue[\28672/]:@RangeMapEntry[[28672..32768)=7],
            @BelowValue[\32768/]:@RangeMapEntry[[32768..36864)=8],
            @BelowValue[\36864/]:@RangeMapEntry[[36864..40960)=9],
            @BelowValue[\40960/]:@RangeMapEntry[[40960..45056)=10],
            @BelowValue[\45056/]:@RangeMapEntry[[45056..49152)=11],
            @BelowValue[\49152/]:@RangeMapEntry[[49152..53248)=12],
            @BelowValue[\53248/]:@RangeMapEntry[[53248..57344)=13],
            @BelowValue[\57344/]:@RangeMapEntry[[57344..61440)=14],
            @BelowValue[\61440/]:@RangeMapEntry[[61440..65536)=15],
        ],
        EMPTY_SUB_RANGE_MAP=@[
        ],
]

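For context, a minimal, hypothetical sketch (not HugeGraph code) of why such a gap hurts writes: with Guava's TreeRangeMap, any key that falls into an uncovered range resolves to null, so the client cannot map that key to a partition from its cache and has to fall back to PD.

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;

public class RangeGapDemo {
    public static void main(String[] args) {
        RangeMap<Long, Integer> range = TreeRangeMap.create();
        // Only part of the key space is covered, as in the "Actual range" above
        range.put(Range.closedOpen(4096L, 8192L), 1);
        range.put(Range.closedOpen(8192L, 12288L), 2);

        System.out.println(range.get(5000L)); // 1    -> cache hit
        System.out.println(range.get(100L));  // null -> cache miss for [0..4096), must re-query PD
    }
}
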
Vertex/Edge example

No response

Schema [VertexLabel, EdgeLabel, IndexLabel]

No response

haohao0103 added the bug (Something isn't working) label on Aug 13, 2024

dosubot bot commented Aug 13, 2024

The missing coverage range in the GraphCache of the HStore backend in HugeGraph version 1.5.0 is likely due to the RangeMap<Long, Integer> range not being properly updated or managed. This can cause a decrease in vertex writing speed and an increase in P99 latency because the system may not be able to efficiently locate the correct partitions for the vertices, leading to increased contention and delays.

The GraphCache class uses a TreeRangeMap to manage ranges, but there is no code provided that shows how this range map is being populated or updated. If the range map is not correctly reflecting the actual ranges of the partitions, it can lead to inefficiencies in writing operations.

Here is the relevant part of the GraphCache class:

@Data
public class GraphCache {

    private Graph graph;
    private AtomicBoolean initialized = new AtomicBoolean(false);
    private AtomicBoolean writing = new AtomicBoolean(false);
    private ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private Map<Integer, AtomicBoolean> state = new ConcurrentHashMap<>();
    private Map<Integer, Partition> partitions = new ConcurrentHashMap<>();
    // Maps key ranges to partition ids; Guava's TreeRangeMap is not thread-safe by itself
    private RangeMap<Long, Integer> range = TreeRangeMap.create();

    public GraphCache(Graph graph) {
        this.graph = graph;
    }

    public GraphCache() {
    }

    public Partition getPartition(Integer id) {
        return partitions.get(id);
    }

    public Partition addPartition(Integer id, Partition p) {
        return partitions.put(id, p);
    }

    public Partition removePartition(Integer id) {
        return partitions.remove(id);
    }
}

To address this issue, you need to ensure that the range map is correctly updated whenever partitions are added or removed. This involves adding methods to manage the range map appropriately.
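
A minimal sketch of what such range-aware methods could look like inside GraphCache, guarded by the existing lock. The exact shape of the fix is not shown in this thread, so the bodies below are illustrative only (Partition is assumed to expose getStartKey()/getEndKey(), as used later in this thread):

// Illustrative only: keep `range` in sync with `partitions` under the write lock.
public Partition addPartition(Integer id, Partition p) {
    lock.writeLock().lock();
    try {
        Partition old = partitions.put(id, p);
        range.put(Range.closedOpen(p.getStartKey(), p.getEndKey()), id);
        return old;
    } finally {
        lock.writeLock().unlock();
    }
}

public Partition removePartition(Integer id) {
    lock.writeLock().lock();
    try {
        Partition old = partitions.remove(id);
        if (old != null) {
            range.remove(Range.closedOpen(old.getStartKey(), old.getEndKey()));
        }
        return old;
    } finally {
        lock.writeLock().unlock();
    }
}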

Additionally, the HstoreTable class in the HStore backend provides methods for querying, inserting, deleting, and updating entries in the table. It also handles metadata and manages shards. The HstoreTable class includes methods like insert, delete, append, eliminate, and various query methods that interact with the backend store. Ensuring that these methods correctly handle the partition ranges and update the range map in GraphCache can help mitigate the issue of missing coverage range and improve performance [1][2].


@haohao0103
Contributor Author

@dosu
Is there a thread-safety issue with the range remove/put operations in the public boolean update(String graphName, int partId, Partition partition) method in ClientCache?


dosubot bot commented Aug 13, 2024

Yes, there is a potential thread safety issue with the range remove/put operation in the update method of the ClientCache class. The range map is being modified without any synchronization, which can lead to inconsistent states and missing ranges.

To address this, you should synchronize the operations on the range map to ensure thread safety. Here is a potential fix for the update method:

public boolean update(String graphName, int partId, Partition partition) {
    GraphCache graph = getGraphCache(graphName);
    // Take the write lock so concurrent updates cannot interleave remove/put on the range map
    graph.getLock().writeLock().lock();
    try {
        Partition p = graph.getPartition(partId);
        if (p != null && p.equals(partition)) {
            return false;
        }
        RangeMap<Long, Integer> range = graph.getRange();
        graph.addPartition(partId, partition);
        if (p != null) {
            // Only drop the old entry if the cached range still points at this partition
            if (Objects.equals(partition.getId(), range.get(partition.getStartKey())) &&
                Objects.equals(partition.getId(), range.get(partition.getEndKey() - 1))) {
                range.remove(range.getEntry(partition.getStartKey()).getKey());
            }
        }
        range.put(Range.closedOpen(partition.getStartKey(), partition.getEndKey()), partId);
    } catch (Exception e) {
        throw new RuntimeException(e);
    } finally {
        graph.getLock().writeLock().unlock();
    }
    return true;
}

By using a ReentrantReadWriteLock and acquiring the write lock before modifying the range map, you can ensure that the operations are thread-safe and that the range map is updated correctly. This should help resolve the performance issues and the missing coverage range in the GraphCache class [1][2].
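
For the same reason, any lookup that resolves a key to a partition id would need to take the read lock; a hedged sketch of what that read path could look like (the method name getPartitionIdByKey is hypothetical, not an existing ClientCache method):

// Hypothetical read-side counterpart: resolve a key under the read lock, so lookups
// never observe the range map in the middle of a remove/put pair.
public Integer getPartitionIdByKey(String graphName, long key) {
    GraphCache graph = getGraphCache(graphName);
    graph.getLock().readLock().lock();
    try {
        return graph.getRange().get(key); // null means the key falls into an uncovered gap
    } finally {
        graph.getLock().readLock().unlock();
    }
}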


@haohao0103
Contributor Author

@dosu
For this range read/write path, synchronization would have to be added everywhere to completely solve the problem.
How about replacing the range attribute in GraphCache with a thread-safe RangeMap instead?


dosubot bot commented Aug 13, 2024

Yes, replacing the range attribute in GraphCache with a thread-safe RangeMap would likely help in solving the performance issue and missing coverage range problem in the HStore backend of HugeGraph version 1.5.0. The current implementation uses RangeMap<Long, Integer> from the Guava library, which is not inherently thread-safe.

To ensure thread safety, you can use a ConcurrentHashMap or other thread-safe collections. However, Guava does not provide a thread-safe RangeMap out of the box. You would need to synchronize access to the RangeMap or use a custom implementation that ensures thread safety.

The PartitionCache class from HugeGraph provides a good example of how to manage thread safety with RangeMap using a ReentrantReadWriteLock. Here is an adapted example based on the PartitionCache class:

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class GraphCache {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final ConcurrentHashMap<String, RangeMap<Long, Integer>> keyToPartIdCache = new ConcurrentHashMap<>();

    public void putRange(String graphName, Range<Long> range, Integer value) {
        lock.writeLock().lock();
        try {
            keyToPartIdCache.computeIfAbsent(graphName, k -> TreeRangeMap.create()).put(range, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public Integer getRange(String graphName, Long key) {
        lock.readLock().lock();
        try {
            RangeMap<Long, Integer> rangeMap = keyToPartIdCache.get(graphName);
            return rangeMap != null ? rangeMap.get(key) : null;
        } finally {
            lock.readLock().unlock();
        }
    }

    public void removeRange(String graphName, Range<Long> range) {
        lock.writeLock().lock();
        try {
            RangeMap<Long, Integer> rangeMap = keyToPartIdCache.get(graphName);
            if (rangeMap != null) {
                rangeMap.remove(range);
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Other methods to manipulate the range map
}

In this example, a ReentrantReadWriteLock is used to ensure that reads and writes to the RangeMap are thread-safe. This approach should help mitigate the performance issues and ensure that the coverage range is correctly maintained. Additionally, the use of ConcurrentHashMap for managing multiple RangeMap instances per graph ensures thread safety at a higher level [1][2].
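
A short usage sketch of the class above (the graph name and keys are made up for illustration):

// Illustrative usage: concurrent callers can safely mix putRange and getRange.
GraphCache cache = new GraphCache();
cache.putRange("hugegraph", Range.closedOpen(0L, 4096L), 0);
cache.putRange("hugegraph", Range.closedOpen(4096L, 8192L), 1);

Integer partId = cache.getRange("hugegraph", 5000L);   // -> 1
Integer missing = cache.getRange("hugegraph", 70000L); // -> null (key not covered yet)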


@haohao0103
Contributor Author

@dosu
In PartitionCache, methods such as getPartitionById and getPartitionByCode call waitGraphLock(graphName). As I understand it, this just spin-waits and never actually acquires a lock, right? Why is there no concern about thread-safety issues?


dosubot bot commented Aug 13, 2024

There is no concern about thread safety issues in the PartitionCache class when methods like getPartitionById and getPartitionByCode call waitGraphLock(graphName) because the waitGraphLock method ensures that the operations are synchronized using a lock mechanism. Specifically, the waitGraphLock method uses an AtomicBoolean lock for each graph to ensure that only one thread can proceed with operations on a particular graph at a time, while other threads spin-wait until the lock is released.

Here is the relevant code:

public void waitGraphLock(String graphName) {
    var lock = getOrCreateGraphLock(graphName);
    while (lock.get()) {
        Thread.onSpinWait();
    }
}

private AtomicBoolean getOrCreateGraphLock(String graphName) {
    var lock = this.locks.get(graphName);
    if (lock == null) {
        try {
            writeLock.lock();
            if ((lock = this.locks.get(graphName)) == null) {
                lock = new AtomicBoolean();
                locks.put(graphName, lock);
            }
        } finally {
            writeLock.unlock();
        }
    }
    return lock;
}

The getOrCreateGraphLock method ensures that a unique AtomicBoolean lock is created for each graph if it does not already exist. The waitGraphLock method then uses this lock to ensure that only one thread can proceed with operations on the graph, thus preventing thread safety issues [1].


@haohao0103
Contributor Author

@dosu
The waitGraphLock method only waits until the lock is released; it does not hold the lock itself. In getPartitionById, when the subsequent logic performs the get query, is it possible for another thread to make modifications in the meantime?

haohao0103 changed the title from "[Bug] HStore GraphCache range attribute has a missing coverage range of 0-65535" to "[Bug] PD GraphCache range attribute has a missing coverage range of 0-65535" on Aug 14, 2024

dosubot bot commented Aug 14, 2024

Yes, it is possible for another thread to make modifications in the getPartitionById method after the waitGraphLock method is called, because waitGraphLock only waits until the lock is released and does not hold the lock itself.

To ensure thread safety in this scenario, you should use the lockGraph method to acquire the lock before performing the get query and then release the lock using the unlockGraph method after the query is done. Here is how you can modify the getPartitionById method to ensure thread safety:

public KVPair<Metapb.Partition, Metapb.Shard> getPartitionById(String graphName, int partId) {
    try {
        lockGraph(graphName); // Acquire the lock
        GraphCache graph = initGraph(graphName);
        Partition partition = graph.getPartition(partId);
        if (partition == null) {
            return null;
        }
        KVPair<ShardGroup, Shard> group = groups.get(partId);
        if (group == null) {
            return null;
        }
        Shard shard = group.getValue();
        if (shard == null) {
            return null;
        }
        return new KVPair<>(partition, shard);
    } finally {
        unlockGraph(graphName); // Release the lock
    }
}

By acquiring the lock before accessing the partitionCache and releasing it afterward, you ensure that no other thread can modify the cache while the current thread is performing the get query. This approach relies on the per-graph lockGraph/unlockGraph pair (an AtomicBoolean-based spin lock) to manage concurrent access to the partitions, ensuring thread safety [1][2][3].


@haohao0103
Contributor Author

@dosu
The waitGraphLock method only waits until the lock is released and does not hold the lock itself. What, then, is the significance of the waitGraphLock method?


dosubot bot commented Aug 14, 2024

The waitGraphLock method in the PartitionCache class ensures that a thread waits until the lock for a specific graph is released before proceeding. This method contributes to thread safety by preventing concurrent modifications to the graph's data structures while another thread is performing operations on them.

Here is the relevant code snippet:

public void waitGraphLock(String graphName) {
    var lock = getOrCreateGraphLock(graphName);
    while (lock.get()) {
        Thread.onSpinWait();
    }
}

The method works by checking the state of an AtomicBoolean lock associated with the graph. If the lock is set (i.e., true), the thread will spin-wait until the lock is released (i.e., set to false). This ensures that no other thread can proceed with operations on the graph until the current operation is complete.

Although waitGraphLock does not hold the lock during subsequent operations, it ensures that the thread only proceeds when it is safe to do so, thus maintaining thread safety. The actual locking and unlocking are handled by the lockGraph and unlockGraph methods, respectively, which set and reset the AtomicBoolean lock.

Here is how lockGraph and unlockGraph work:

public void lockGraph(String graphName) {
    var lock = getOrCreateGraphLock(graphName);
    while (!lock.compareAndSet(false, true)) {
        Thread.onSpinWait();
    }
}

public void unlockGraph(String graphName) {
    var lock = getOrCreateGraphLock(graphName);
    lock.set(false);
}

In summary, waitGraphLock ensures that a thread waits for the lock to be released before proceeding, thereby contributing to thread safety by preventing concurrent modifications [1].
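
To make the intended division of labor concrete, here is a hedged sketch of the usage pattern described above (the method names updatePartitionSafely and readPartitionSafely are placeholders, not actual PartitionCache methods):

// Illustrative pattern only: a writer holds the per-graph flag while mutating the cache;
// a reader calls waitGraphLock() so it does not start in the middle of a mutation.
public void updatePartitionSafely(String graphName, Metapb.Partition partition) {
    lockGraph(graphName);       // spin until the AtomicBoolean flag is acquired
    try {
        // ... mutate partitions / range map for graphName ...
    } finally {
        unlockGraph(graphName); // clear the flag so waiting threads can proceed
    }
}

public void readPartitionSafely(String graphName, int partId) {
    waitGraphLock(graphName);   // only waits; does not hold the lock afterwards
    // ... read from the cache; note a new writer could still start right here,
    // which is exactly the window questioned earlier in this thread ...
}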


haohao0103 added a commit to haohao0103/incubator-hugegraph that referenced this issue Aug 14, 2024