-
Notifications
You must be signed in to change notification settings - Fork 391
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
util/mr_cache: Add framework for memory registration cache #3617
Conversation
@gladkovdmitry17 - This is a quick implementation for a much simplified version of the mr cache. This does not hook into the mr monitor code yet. There is no limit to the number of registrations that can be performed. The only limit is on the number of entries that are allowed to be cached. So, if the cache is full, the user is allowed to continue with the registration, which would allow them to proceed with the corresponding data transfer. Effectively, the cache size limits the size of the rb tree. Uncacheable registrations are not inserted into the tree and destroyed when unregistered. The API has search/delete calls, which mimic the standard binary tree interface (i.e. tsearch/tdelete). I went back and forth on whether to make cache statistics available as debug only or all the time. I ended up going with always available, since the overhead is small (simple integer increments). I think this captures the desired functionality. Please provide any feedback. I can finish adding any missing requirements tomorrow (Wed.). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the reduction looks nice. A simplification of the mechanisms that were in place, as well as a simplification of logic that had long needed to be refactored and reduced.
However, without the memory notifier, without more detailed information to the underlying layer, I don't know that this could work with providers such as the GNI provider.
I'd like you to walk me through how you see providers using this new structure. How would a registration function interact with this cache at a high level? What is the 'add_region' code supposed to do? What is the provider going to do in the 'add_region' and 'del_region' code?
prov/util/src/util_mr_cache.c
Outdated
@@ -0,0 +1,228 @@ | |||
/* | |||
* Copyright (c) 2017 Intel Corporation, Inc. All rights reserved. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand this has been heavily modified, but I think the Copyright from Cray should still remain.
include/ofi_mr.h
Outdated
uint64_t delete_cnt; | ||
uint64_t hit_cnt; | ||
|
||
int (*add_region)(struct ofi_mr_cache *cache, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally, the fi_mr_reg context was passed along so that the registration could be created if it did not exist in the cache. I do not understand how this model is supposed to behave.
I believe that this function lacks sufficient information for the underlying implementation to create a registration if one is needed. It looks like provider tie-in to memory registration has been stripped and this is intended to function as a modified RB tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dmitry's changes add more fields to the mr_entry. What data is missing? The cache is intended to hide registrations from the user when they are not needed. Regions that require explicit application registration (e.g. region registered as an RMA target) are not expected to go through the cache.
Because of how the merge feature works, there's not necessarily a direct relation between the input passed into search and the region that is added.
include/ofi_mr.h
Outdated
struct ofi_mr_entry *entry); | ||
void (*delete_region)(struct ofi_mr_cache *cache, | ||
struct ofi_mr_entry *entry); | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flush call was originally provided to allow the underlying implementation to remove stale entries in the event that the NIC was resource bound in terms of memory registrations. Specifically with the GNI provider, there is a hard limit of 4096 memory registrations. As such, if stale entries needed to be discarded to free space, the provider could call flush to release stale registrations and attempt to register again. I believe such a mechanism is necessary for more providers than just the GNI provider. Flushing generally means going through the LRU and removing a stale entry to free space for the new registration. Flushing would only be necessary if a current or stale entry couldn't satisfy the registration request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exposing flush should be straightforward. Are we needing to flush all lru entries, or just one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always included a number argument and left it to the upper layer to decide. 1 to x, or -1 to flush them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an internal interface, so we can change it anytime, but really only need to be concerned with what the callers will actually use. Flush-one seems like it would be the best option unless we end up adding some sort of time stamps. (I'd like to avoid the overhead of taking time stamps.)
diff --git a/prov/util/src/util_mr_cache.c b/prov/util/src/util_mr_cache.c
index 0bd4826..8c9c59b 100644
--- a/prov/util/src/util_mr_cache.c
+++ b/prov/util/src/util_mr_cache.c
@@ -94,7 +94,7 @@ util_mr_cache_create(struct ofi_mr_cache *cache, const struct iovec *iov,
int ret;
FI_DBG(cache->domain->prov, FI_LOG_MR,
- "creating %p\n", attr->iov.iov_base);
+ "creating %p\n", iov->iov_base);
*entry = calloc(1, sizeof(**entry) + cache->entry_data_size);
if (!*entry)
return -FI_ENOMEM;
@@ -159,7 +159,7 @@ int ofi_mr_cache_search(struct ofi_mr_cache *cache, const struct fi_mr_attr *att
assert(attr->iov_count == 1);
FI_DBG(cache->domain->prov, FI_LOG_MR,
- "search %p\n", attr->iov.iov_base);
+ "search %p\n", attr->mr_iov->iov_base);
cache->search_cnt++; These errors cause compilation failures in debug builds. Above is a patch to resolve it. |
I've implemented caching for the verbs provider. It doesn't work well right now. You may play w/ this to reproduce issues. |
@gladkovdmitry17 - thanks for adding the monitor support. I'll pull that in with the other updates mentioned and actually start trying to run the code. :) @jswaro - I didn't have time to get to adding monitor support or verifying anything. I was wanting to get the code out for review by the end of yesterday, so Dmitry could look at it. My goal is to try to get the caching upstream by the end of the week. |
@shefty Please, check the code before pulling this to your changes. It wasn't tested, but I guess it should work logically |
c4d7c62
to
6f2e338
Compare
Updated PR with all changes requested from reviews. For the flush() call, I went with flushing a single registration. Flush returns true if an entry was removed, and false if no flushing could be done. The flush() call fell out of what was needed internally. A caller could simply call flush() in a loop to flush all entries or x number of entries. I am now looking at integrating Dmitry's changes. |
@gladkovdmitry17 - I merged in your changes to add monitor support. (As an aside, there was a bug in the entry free code that might have been responsible for problems that you were seeing. By the time free is called, the entry is no longer in the rbtree. But the find() call will return any entry that has any overlap with iovec being searched for. This could result in removing the wrong entry from the tree.) Along those lines, I added a patch that renamed and inverted the 'retired' flag to 'cached'. This better indicates that there is a relationship between that flag being set and the entry being in the rbtree. Note that prior to merging, I plan on squashing all 3 commits into one. This series should be ready for merging. I'll work on the verbs piece tomorrow unless Dmitry beats me to it. :) |
This is derived from work by Dmitry Gladkov, which was based on the registration cache in the gni provider. The interface for the cache is comprised of initialization and cleanup routines, plus two calls: search and delete. Search will first search the cache for a region that contains the provided input region. If no existing region is found, the new region is added to the cache. Delete marks that the user is done with the region. Every search call should be paired with a delete call. If caching is enabled, the freeing of a delete region will be deferred until it is both no longer being accessed, and is the region that has the oldest access time. Signed-off-by: Sean Hefty <[email protected]> Signed-off-by: Dmitry Gladkov <[email protected]>
@shefty Thanks for this. |
This is an error that produces on IMB over MPI/OFI/verbs due to merging multiple MR cache entries into single entry. When we merging, the MR descriptors and keys of merged entries are no longer valid. Should we merge only those regions that are not being used at the time of merging? |
Ops, it is already done :) nevermind -- The erroneous case here is we have two regions merged ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@shefty with proposed fixes, the OFI/verbs + MR caching works and provides acceptable performance results
I'm going to fix some problem on OFI/verbs level and we would obtain what we need
I thought I added your DBG changes to my commit (updated manually). I agree that they're useful. Did you see some that were left out? |
ec72c27
to
50298ed
Compare
Squashed all commits into one. Added a couple more debug prints plus merge fix from Dmitry's code. Added a couple minor simplifications and asserts and fix. The MR entry flags (cached and subscribed) should now always be in sync with rb tree track and monitoring. |
585e4d1
to
c3b3a00
Compare
I think the upper layer should dictate whether merging is acceptable. I imagine it would be hard for the cache to determine whether a registration was actively being used for a transfer or not. We had a case for this in the GNI provider where we were initially observing issues with the merge system and eventually came to the conclusion that we might not want to merge registrations all the time. It would take some burden onto the registration cache system, but I think it would be a more flexible system overall. It was something we had wanted to do when it was eventually generalized, so if we could get that in here, I'd be thrilled. |
include/ofi_mr.h
Outdated
|
||
struct ofi_mr_entry { | ||
struct iovec iov; | ||
uint64_t access; /* TODO */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the GNI provider, I found it was difficult to handle registrations of different types within the same cache. For the GNI provider, we split those registrations into two different caches -- one for read only regions, and one for read-write. For providers such as the verbs provider, this could get trickier in the event of read-only local, read-write, and read-remote registrations. Merging a read-only with a read-write could result in unexpected behavior from the fabric if an application tries to land data into a merged entry where it might have been read-only for a specific region of that registration prior to the merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to test the expectations of these functions with a set of tests before they get committed.
I wouldn't expect a READ-only registration from 0xa000 to 0xb000 to be merged with a 0xa800 to 0xc000 READ-WRITE registration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think separate caches will end up being preferable as well. I'm inclined to remove the field until it is actually used.
I've verified this patch, this works well with verbs/RDM and RxM/verbs/MSG. But I'm still hitting an error:
The verbs provider hangs when trying to register MR via |
while ((cache->cached_cnt >= cache->size) && ofi_mr_cache_flush(cache)) | ||
; | ||
|
||
iter = rbtFind(cache->mr_tree, (void *) attr->mr_iov); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cache going to deal with access permissions as well when attempting to find a registration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments. I'll wait for your responses before posting more. Thanks!
@gladkovdmitry17 - did you run with debug enabled? Maybe one of the asserts in the code will point to the issue. |
Sure, this is debug build of libfabric. No asserts were triggered |
Dmitry, Sean. One of the rationales behind the 'retired' flag, was that registrations in use from the perspective of the application (an existing fi_mr object) require that the registration persist until the application actually calls fi_close on the object. This is to allow for safe usage of that handle until such time as it closed. This also protects the provider so that it can continue to use a registration handle until it is no longer used by the application as well. The l_key and r_key for verbs should stay the same until the region is no longer actively used by the application. This argument changes slightly for the FI_MR_LOCAL/FI_LOCAL_MR case but should remain roughly the same in terms of operation. Is the application providing the registration, or is it relying on the cache to fetch it each time under the premise of !FI_LOCAL_MR? |
The 'retired' concept is still there. It's just called 'cached' now. An entry remains valid as long as its use_cnt is > 0. It may not be part of the cache for a couple of reasons. An application must match all search() calls with delete(). There is a direct relationship between the value of cached and whether the entry can be found in the rbtree (i.e. the cache) -- this is why I renamed the field. |
Ok, that sounds right. Is there enough debug in the verbs provider to detect if the l_key and/or r_key are changed between when the transaction is initiated from the perspective of the application, and when it is posted to the libibverbs library from the verbs provider? Given the nature of the problem, that would be my first thought. |
Updated to remove unused 'access' field. I agree that we're likely to need separate caches based on access permission. When we start getting to access permissions, we may need to add a new call ofi_mr_cache_find() - that searches for an entry without adding it. But I prefer to do that only when it's actually needed and we know it makes sense. |
@jswaro @shefty The ibv_reg_mr sets This is log that I can obtain before
Seems the verbs driver detects that address range that verbs provider is trying to register isn't mapped to physical pages. It occurs when two address ranges are merging into single entry. |
@gladkovdmitry17 - for your latest run, did you use my updated merged patch? There was at least one one fix that I added to that beyond your changes. Hanging in ibv_reg_mr() seems unlikely unless there's been some sort of memory corruption. |
I've updated to the latest one, I hit this issue again. |
The reason why you are hitting this issue is likely due to the fact the MR monitor may not be hooked up to the cache. Without any notion that the page is freed, and the fact that lazy deregistration is probably enabled, the cache is grabbing the stale entry, merging it with the new request and attempting to register some pages that are mapped and some that are not. I recommend disabling the lazy deregistration feature until you have a memory monitor that you are comfortable with. |
Yes, that's why I'm hitting this issue. this 0x2b8f3238d010 address (from my example) is freed by MPI rank. Let's not spent more time to investigate this issue and continue to get this PR merged. Btw, @jswaro will we integrate this to the GNI provider? Seems, it will be changed to be able work with the GNI |
My intent from day 1 was to get this ported to common core to be used by GNI and by any other interested provider (verbs being the current target). It may need some changes and @shefty has already said that this will be open to additional changes after the initial merge, so I/we will be integrating this with the GNI provider. |
The lazy deregistration feature should be able to be turned off at init time by the requesting provider. If so, when the registration is removed from deleted, it should be deregistered as well -- which will prevent the error that you are seeing. Is the verbs provider missing a matching call to 'delete'? |
I can merge this once CI completes. It's a safe change, since nothing actually uses it currently. :) I agree, adding in the first monitor implementation would be a good next step. |
cache->delete_cnt++; | ||
|
||
util_mr_cache_process_events(cache); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To handle the case where the cache is not required to support lazy deregistration, an additional flag should be added to the cache structure which should indicate whether the feature is in use. After that, the following should be added to the condition below.
if (--entry->use_cnt == 0) {
if (entry->cached && cache->lazy_deregistration) {
dlist_insert_tail(&entry->lru_entry, &cache->lru_list);
} else {
util_mr_free_entry(cache, entry);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Primarily, this is to support application cases where the monitor systems available via libfabric will not be sufficient, in which case the cache uses best-effort caching and aggressively purges entries on calls to 'delete'.
The case that Dmitry is observing is a similar to the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Setting the size = 0 will disable the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the cache::size = 1 provides MR cache w/o lazy deregistration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache::size = 1 will still cache one entry, and has the potential to create the same problem -- it would just be incredibly unlikely.
No, there is matching calls to Btw, I don't follow the idea why the MR caching functionality w/o lazy deregistration is needed. Seems it works like a storage for the MR entries and nothing else @jswaro @shefty Thank you for great collaboration and your efforts to implement this! |
I agree with Dmitry -- disabling lazy deregistration disables caching, and a cache that doesn't actually cache doesn't seem all that useful. :) But such a model is effectively supported by setting the cache size to 0. |
Again, the reason why you wouldn't enable lazy deregistration would be for cases where the memory monitoring system isn't sufficient, and you require that registrations be coherent with application usage (search/delete). Specifically, I've observed applications that bypass simple memory monitoring systems that hook into the glibc allocation systems, and so those monitors may not be sufficient. KDREG for the GNI provider is sufficient to detect this page map changes at the kernel level, so we don't need to worry about it. However, if KDREG isn't available on the customer system, then we disable lazy deregistration at the cache to maintain registration coherence with respect to the application usage. When the registration is created, the entry exists until the use count drops to zero. In this cache, if multiple requests to the same region were created, several libfabric fi_mr objects could be mapped to the same underlying cache entry... which still serves some benefit to the application -- it wouldn't just be a registration storage tree. |
I would generally agree, but look at the case I provided above. It is a sub-optimal case where the cache simply serves as a best-effort to minimize time in the hardware memory registration calls given that it doesn't have sufficient information to reconcile registration entries with the underlying memory management system. Edit: It would still 'cache' the entries in the tree so that subsequent calls could use the existing registration while it hadn't yet been closed by the upper layer. In this way, multiple fi_mr requests can utilize the same underlying memory registration without requiring multiple calls into the hardware registration functions. |
Sure, that makes sense. I missed that. |
This is derived from work by Dmitry Gladkov, which was based on
the registration cache in the gni provider.
The interface for the cache is comprised of initialization and
cleanup routines, plus two calls: search and delete. Search
will first search the cache for a region that contains the
provided input region. If no existing region is found, the
new region is added to the cache. Delete marks that the user
is done with the region. Every search call should be paired
with a delete call.
If caching is enabled, the freeing of a delete region will be
deferred until it is both no longer being accessed, and is the
region that has the oldest access time.
Signed-off-by: Sean Hefty [email protected]