Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/mr_cache: Add framework for memory registration cache #3617

Merged
merged 1 commit into from
Dec 15, 2017

Conversation

shefty
Copy link
Member

@shefty shefty commented Dec 13, 2017

This is derived from work by Dmitry Gladkov, which was based on
the registration cache in the gni provider.

The interface for the cache is comprised of initialization and
cleanup routines, plus two calls: search and delete. Search
will first search the cache for a region that contains the
provided input region. If no existing region is found, the
new region is added to the cache. Delete marks that the user
is done with the region. Every search call should be paired
with a delete call.

If caching is enabled, the freeing of a delete region will be
deferred until it is both no longer being accessed, and is the
region that has the oldest access time.

Signed-off-by: Sean Hefty [email protected]

@shefty
Copy link
Member Author

shefty commented Dec 13, 2017

@gladkovdmitry17 - This is a quick implementation for a much simplified version of the mr cache. This does not hook into the mr monitor code yet.

There is no limit to the number of registrations that can be performed. The only limit is on the number of entries that are allowed to be cached. So, if the cache is full, the user is allowed to continue with the registration, which would allow them to proceed with the corresponding data transfer. Effectively, the cache size limits the size of the rb tree. Uncacheable registrations are not inserted into the tree and destroyed when unregistered.

The API has search/delete calls, which mimic the standard binary tree interface (i.e. tsearch/tdelete). I went back and forth on whether to make cache statistics available as debug only or all the time. I ended up going with always available, since the overhead is small (simple integer increments).

I think this captures the desired functionality. Please provide any feedback. I can finish adding any missing requirements tomorrow (Wed.).

@dmitrygx
Copy link
Member

@shefty Kudos to you!
I've prepared a PR with MR monitoring functionality
shefty#7

I'm going to re-implement OFI/verbs to use this implementation and measure performance that we can achieve with this.

Copy link
Contributor

@jswaro jswaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the reduction looks nice. A simplification of the mechanisms that were in place, as well as a simplification of logic that had long needed to be refactored and reduced.

However, without the memory notifier, without more detailed information to the underlying layer, I don't know that this could work with providers such as the GNI provider.

I'd like you to walk me through how you see providers using this new structure. How would a registration function interact with this cache at a high level? What is the 'add_region' code supposed to do? What is the provider going to do in the 'add_region' and 'del_region' code?

@@ -0,0 +1,228 @@
/*
* Copyright (c) 2017 Intel Corporation, Inc. All rights reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand this has been heavily modified, but I think the Copyright from Cray should still remain.

include/ofi_mr.h Outdated
uint64_t delete_cnt;
uint64_t hit_cnt;

int (*add_region)(struct ofi_mr_cache *cache,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally, the fi_mr_reg context was passed along so that the registration could be created if it did not exist in the cache. I do not understand how this model is supposed to behave.

I believe that this function lacks sufficient information for the underlying implementation to create a registration if one is needed. It looks like provider tie-in to memory registration has been stripped and this is intended to function as a modified RB tree.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dmitry's changes add more fields to the mr_entry. What data is missing? The cache is intended to hide registrations from the user when they are not needed. Regions that require explicit application registration (e.g. region registered as an RMA target) are not expected to go through the cache.

Because of how the merge feature works, there's not necessarily a direct relation between the input passed into search and the region that is added.

include/ofi_mr.h Outdated
struct ofi_mr_entry *entry);
void (*delete_region)(struct ofi_mr_cache *cache,
struct ofi_mr_entry *entry);
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The flush call was originally provided to allow the underlying implementation to remove stale entries in the event that the NIC was resource bound in terms of memory registrations. Specifically with the GNI provider, there is a hard limit of 4096 memory registrations. As such, if stale entries needed to be discarded to free space, the provider could call flush to release stale registrations and attempt to register again. I believe such a mechanism is necessary for more providers than just the GNI provider. Flushing generally means going through the LRU and removing a stale entry to free space for the new registration. Flushing would only be necessary if a current or stale entry couldn't satisfy the registration request.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exposing flush should be straightforward. Are we needing to flush all lru entries, or just one?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always included a number argument and left it to the upper layer to decide. 1 to x, or -1 to flush them all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an internal interface, so we can change it anytime, but really only need to be concerned with what the callers will actually use. Flush-one seems like it would be the best option unless we end up adding some sort of time stamps. (I'd like to avoid the overhead of taking time stamps.)

@jswaro
Copy link
Contributor

jswaro commented Dec 13, 2017

diff --git a/prov/util/src/util_mr_cache.c b/prov/util/src/util_mr_cache.c
index 0bd4826..8c9c59b 100644
--- a/prov/util/src/util_mr_cache.c
+++ b/prov/util/src/util_mr_cache.c
@@ -94,7 +94,7 @@ util_mr_cache_create(struct ofi_mr_cache *cache, const struct iovec *iov,
        int ret;

        FI_DBG(cache->domain->prov, FI_LOG_MR,
-              "creating %p\n", attr->iov.iov_base);
+              "creating %p\n", iov->iov_base);
        *entry = calloc(1, sizeof(**entry) + cache->entry_data_size);
        if (!*entry)
                return -FI_ENOMEM;
@@ -159,7 +159,7 @@ int ofi_mr_cache_search(struct ofi_mr_cache *cache, const struct fi_mr_attr *att

        assert(attr->iov_count == 1);
        FI_DBG(cache->domain->prov, FI_LOG_MR,
-              "search %p\n", attr->iov.iov_base);
+              "search %p\n", attr->mr_iov->iov_base);
        cache->search_cnt++;

These errors cause compilation failures in debug builds. Above is a patch to resolve it.

@dmitrygx
Copy link
Member

I've implemented caching for the verbs provider. It doesn't work well right now.

You may play w/ this to reproduce issues.
https://github.com/gladkovdmitry17/libfabric/tree/mr_cache_verbs

@shefty
Copy link
Member Author

shefty commented Dec 13, 2017

@gladkovdmitry17 - thanks for adding the monitor support. I'll pull that in with the other updates mentioned and actually start trying to run the code. :)

@jswaro - I didn't have time to get to adding monitor support or verifying anything. I was wanting to get the code out for review by the end of yesterday, so Dmitry could look at it.

My goal is to try to get the caching upstream by the end of the week.

@dmitrygx
Copy link
Member

thanks for adding the monitor support

@shefty Please, check the code before pulling this to your changes. It wasn't tested, but I guess it should work logically

@shefty shefty force-pushed the master branch 2 times, most recently from c4d7c62 to 6f2e338 Compare December 13, 2017 19:38
@shefty
Copy link
Member Author

shefty commented Dec 13, 2017

Updated PR with all changes requested from reviews. For the flush() call, I went with flushing a single registration. Flush returns true if an entry was removed, and false if no flushing could be done. The flush() call fell out of what was needed internally. A caller could simply call flush() in a loop to flush all entries or x number of entries.

I am now looking at integrating Dmitry's changes.

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

@gladkovdmitry17 - I merged in your changes to add monitor support. (As an aside, there was a bug in the entry free code that might have been responsible for problems that you were seeing. By the time free is called, the entry is no longer in the rbtree. But the find() call will return any entry that has any overlap with iovec being searched for. This could result in removing the wrong entry from the tree.)

Along those lines, I added a patch that renamed and inverted the 'retired' flag to 'cached'. This better indicates that there is a relationship between that flag being set and the entry being in the rbtree.

Note that prior to merging, I plan on squashing all 3 commits into one. This series should be ready for merging. I'll work on the verbs piece tomorrow unless Dmitry beats me to it. :)

@shefty shefty requested review from jswaro and dmitrygx December 14, 2017 02:34
This is derived from work by Dmitry Gladkov, which was based on
the registration cache in the gni provider.

The interface for the cache is comprised of initialization and
cleanup routines, plus two calls: search and delete.  Search
will first search the cache for a region that contains the
provided input region.  If no existing region is found, the
new region is added to the cache.  Delete marks that the user
is done with the region.  Every search call should be paired
with a delete call.

If caching is enabled, the freeing of a delete region will be
deferred until it is both no longer being accessed, and is the
region that has the oldest access time.

Signed-off-by: Sean Hefty <[email protected]>
Signed-off-by: Dmitry Gladkov <[email protected]>
@dmitrygx
Copy link
Member

@shefty Thanks for this.
I've fixed a problem that causes a core dump while running fi_pingpong over verbs/RDM provider. An appropriate PR is prepared to your GH fork's master branch. could you take a look, please? and if you are okay with that, please, merge. Btw, can we stay proposed DBG traces from my patch? Ut helps to investigate problems.

@dmitrygx
Copy link
Member

libfabric:verbs:mr:ofi_mr_cache_search():231<debug> search 0x13edc50 (len: 65536)
libfabric:verbs:mr:util_mr_cache_merge():208<debug> merging 0x13edc50 (len: 65536) with 0x13ef6f0 (lne: 32768)to 0x13edc50 (lne: 65536)
libfabric:verbs:mr:util_mr_free_entry():58<debug> free 0x13ef6f0 (len: 32768)
libfabric:core:mr:ofi_monitor_unsubscribe():101<debug> unsubscribing addr=0x13ef6f0 len=32768 subscription=0x13edac0 nq=0x13d2150
libfabric:verbs:mr:util_mr_cache_merge():208<debug> merging 0x13edc50 (len: 65536) with 0x13e5a80 (lne: 40032)to 0x13e5a80 (lne: 65536)
libfabric:verbs:mr:util_mr_free_entry():58<debug> free 0x13e5a80 (len: 40032)
libfabric:core:mr:ofi_monitor_unsubscribe():101<debug> unsubscribing addr=0x13e5a80 len=40032 subscription=0x13e59d0 nq=0x13d2150
libfabric:verbs:mr:util_mr_cache_create():143<debug> create 0x13e5a80 (len: 65536)
libfabric:core:mr:ofi_monitor_subscribe():75<debug> subscribing addr=0x13e5a80 len=65536 subscription=0x13e57d0 nq=0x13d2150
...
mlx5: nnlmpibdw03: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 10002fa4 34b8bfd3

This is an error that produces on IMB over MPI/OFI/verbs due to merging multiple MR cache entries into single entry. When we merging, the MR descriptors and keys of merged entries are no longer valid.

Should we merge only those regions that are not being used at the time of merging?

@dmitrygx
Copy link
Member

dmitrygx commented Dec 14, 2017

Should we merge only those regions that are not being used at the time of merging?

Ops, it is already done :) nevermind

--
I've prepared fix for this problem (this is second commit in my PR prepared against your master branch)

The erroneous case here is we have two regions merged (0x13edc50 (len: 65536) and 0x13e5a80 (len: 40032)) to 0x13e5a80 (len: 65536) region. Those two regions are intersected, but the resulting buffer isn't large enough to include the first one (the resulting buffer starts at the address of the second one).

Copy link
Member

@dmitrygx dmitrygx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shefty with proposed fixes, the OFI/verbs + MR caching works and provides acceptable performance results

I'm going to fix some problem on OFI/verbs level and we would obtain what we need

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

I thought I added your DBG changes to my commit (updated manually). I agree that they're useful. Did you see some that were left out?

@shefty shefty force-pushed the master branch 2 times, most recently from ec72c27 to 50298ed Compare December 14, 2017 18:39
@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

Squashed all commits into one. Added a couple more debug prints plus merge fix from Dmitry's code. Added a couple minor simplifications and asserts and fix. The MR entry flags (cached and subscribed) should now always be in sync with rb tree track and monitoring.

@shefty shefty force-pushed the master branch 2 times, most recently from 585e4d1 to c3b3a00 Compare December 14, 2017 18:48
@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

Should we merge only those regions that are not being used at the time of merging?

I think the upper layer should dictate whether merging is acceptable. I imagine it would be hard for the cache to determine whether a registration was actively being used for a transfer or not. We had a case for this in the GNI provider where we were initially observing issues with the merge system and eventually came to the conclusion that we might not want to merge registrations all the time. It would take some burden onto the registration cache system, but I think it would be a more flexible system overall.

It was something we had wanted to do when it was eventually generalized, so if we could get that in here, I'd be thrilled.

include/ofi_mr.h Outdated

struct ofi_mr_entry {
struct iovec iov;
uint64_t access; /* TODO */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the GNI provider, I found it was difficult to handle registrations of different types within the same cache. For the GNI provider, we split those registrations into two different caches -- one for read only regions, and one for read-write. For providers such as the verbs provider, this could get trickier in the event of read-only local, read-write, and read-remote registrations. Merging a read-only with a read-write could result in unexpected behavior from the fabric if an application tries to land data into a merged entry where it might have been read-only for a specific region of that registration prior to the merge.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to test the expectations of these functions with a set of tests before they get committed.

I wouldn't expect a READ-only registration from 0xa000 to 0xb000 to be merged with a 0xa800 to 0xc000 READ-WRITE registration.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think separate caches will end up being preferable as well. I'm inclined to remove the field until it is actually used.

@dmitrygx
Copy link
Member

I've verified this patch, this works well with verbs/RDM and RxM/verbs/MSG.

But I'm still hitting an error:

#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0            1         2.50         0.00
            1            1         2.98         0.32
            2            1         2.50         0.76
            4            1         2.50         1.52
            8            1         2.98         2.56
           16            1         2.50         6.10
           32            1         2.98        10.24
           64            1         2.98        20.48
          128            1         3.10        39.38
          256            1         2.98        81.92
          512            1         3.58       136.53
         1024            1         3.58       273.07
         2048            1         4.41       442.81
         4096            1         5.01       780.19
         8192            1         6.56      1191.56
        16384            1        11.44      1365.33
        32768            1        13.47      2319.86
        65536            1        16.09      3883.61
       131072            1        22.41      5577.53
       262144            1        34.57      7231.56
Fatal error in MPI_Recv: Other MPI error, error stack:
MPI_Recv(204)............: MPI_Recv(buf=0x2ad2433c0010, count=524288, MPI_BYTE, src=MPI_ANY_SOURCE, tag=MPI_ANY_TAG, comm=0x84000000, status=0x7fff04de5660) failed
PMPIDI_CH3I_Progress(780):
(unknown)(): Other MPI error

The verbs provider hangs when trying to register MR via ibv_reg_mr. I don't know why and investigating this currently

while ((cache->cached_cnt >= cache->size) && ofi_mr_cache_flush(cache))
;

iter = rbtFind(cache->mr_tree, (void *) attr->mr_iov);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this cache going to deal with access permissions as well when attempting to find a registration?

Copy link
Contributor

@jswaro jswaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments. I'll wait for your responses before posting more. Thanks!

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

@gladkovdmitry17 - did you run with debug enabled? Maybe one of the asserts in the code will point to the issue.

@dmitrygx
Copy link
Member

did you run with debug enabled? Maybe one of the asserts in the code will point to the issue.

Sure, this is debug build of libfabric. No asserts were triggered

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

But I'm still hitting an error:

Dmitry, Sean. One of the rationales behind the 'retired' flag, was that registrations in use from the perspective of the application (an existing fi_mr object) require that the registration persist until the application actually calls fi_close on the object. This is to allow for safe usage of that handle until such time as it closed. This also protects the provider so that it can continue to use a registration handle until it is no longer used by the application as well. The l_key and r_key for verbs should stay the same until the region is no longer actively used by the application.

This argument changes slightly for the FI_MR_LOCAL/FI_LOCAL_MR case but should remain roughly the same in terms of operation.

Is the application providing the registration, or is it relying on the cache to fetch it each time under the premise of !FI_LOCAL_MR?

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

The 'retired' concept is still there. It's just called 'cached' now. An entry remains valid as long as its use_cnt is > 0. It may not be part of the cache for a couple of reasons. An application must match all search() calls with delete(). There is a direct relationship between the value of cached and whether the entry can be found in the rbtree (i.e. the cache) -- this is why I renamed the field.

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

The 'retired' concept is still there. It's just called 'cached' now. An entry remains valid as long as its use_cnt is > 0. It may not be part of the cache for a couple of reasons. An application must match all search() calls with delete(). There is a direct relationship between the value of cached and whether the entry can be found in the rbtree (i.e. the cache) -- this is why I renamed the field.

Ok, that sounds right. Is there enough debug in the verbs provider to detect if the l_key and/or r_key are changed between when the transaction is initiated from the perspective of the application, and when it is posted to the libibverbs library from the verbs provider? Given the nature of the problem, that would be my first thought.

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

Updated to remove unused 'access' field. I agree that we're likely to need separate caches based on access permission. When we start getting to access permissions, we may need to add a new call ofi_mr_cache_find() - that searches for an entry without adding it. But I prefer to do that only when it's actually needed and we know it makes sense.

@dmitrygx
Copy link
Member

@jswaro @shefty The ibv_reg_mr sets EFAULT errno.

This is log that I can obtain before ibv_reg_mr fails

libfabric:verbs:mr:ofi_mr_cache_search():231<debug> search 0x2b8f3228c010 (len: 4194304)
libfabric:verbs:mr:util_mr_cache_merge():202<debug> merge 0x2b8f3228c010 (len: 4194304) with 0x2b8f3258e010 (len: 2097152)
libfabric:verbs:mr:util_mr_free_entry():58<debug> free 0x2b8f3258e010 (len: 2097152)
libfabric:verbs:mr:util_mr_free_entry():65<debug> 111
libfabric:verbs:mr:util_mr_free_entry():67<debug> 222
libfabric:verbs:mr:util_mr_cache_merge():202<debug> merge 0x2b8f3228c010 (len: 5251072) with 0x2b8f3238d010 (len: 2097152)
libfabric:verbs:mr:util_mr_free_entry():58<debug> free 0x2b8f3238d010 (len: 2097152)
libfabric:verbs:mr:util_mr_free_entry():65<debug> 111
libfabric:verbs:mr:util_mr_free_entry():67<debug> 222
libfabric:verbs:mr:util_mr_cache_create():144<debug> create 0x2b8f3228c010 (len: 5251072)
libfabric:verbs:mr:util_mr_cache_create():154<debug> 111
libfabric:verbs:mr:fi_ibv_mr_cache_entry_reg():405<debug> RET = -14
IMB-MPI1: prov/verbs/src/verbs_mr.c:406: fi_ibv_mr_cache_entry_reg: Assertion `!ret' failed.

Seems the verbs driver detects that address range that verbs provider is trying to register isn't mapped to physical pages. It occurs when two address ranges are merging into single entry.

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

@gladkovdmitry17 - for your latest run, did you use my updated merged patch? There was at least one one fix that I added to that beyond your changes. Hanging in ibv_reg_mr() seems unlikely unless there's been some sort of memory corruption.

@dmitrygx
Copy link
Member

for your latest run, did you use my updated merged patch? There was at least one one fix that I added to that beyond your changes. Hanging in ibv_reg_mr() seems unlikely unless there's been some sort of memory corruption.

I've updated to the latest one, I hit this issue again.

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

I've updated to the latest one, I hit this issue again.

The reason why you are hitting this issue is likely due to the fact the MR monitor may not be hooked up to the cache. Without any notion that the page is freed, and the fact that lazy deregistration is probably enabled, the cache is grabbing the stale entry, merging it with the new request and attempting to register some pages that are mapped and some that are not. I recommend disabling the lazy deregistration feature until you have a memory monitor that you are comfortable with.

@dmitrygx
Copy link
Member

some pages that are mapped and some that are not.

Yes, that's why I'm hitting this issue. this 0x2b8f3238d010 address (from my example) is freed by MPI rank. Let's not spent more time to investigate this issue and continue to get this PR merged.
After that I'm going to upstream Memory hooks. Seems it will be next round that we will discuss :)
Thanks for helping me out.

Btw, @jswaro will we integrate this to the GNI provider? Seems, it will be changed to be able work with the GNI

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

Btw, @jswaro will we integrate this to the GNI provider? Seems, it will be changed to be able work with the GNI

My intent from day 1 was to get this ported to common core to be used by GNI and by any other interested provider (verbs being the current target). It may need some changes and @shefty has already said that this will be open to additional changes after the initial merge, so I/we will be integrating this with the GNI provider.

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

Let's not spent more time to investigate this issue and continue to get this PR merged.

The lazy deregistration feature should be able to be turned off at init time by the requesting provider. If so, when the registration is removed from deleted, it should be deregistered as well -- which will prevent the error that you are seeing.

Is the verbs provider missing a matching call to 'delete'?

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

I can merge this once CI completes. It's a safe change, since nothing actually uses it currently. :)

I agree, adding in the first monitor implementation would be a good next step.

cache->delete_cnt++;

util_mr_cache_process_events(cache);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To handle the case where the cache is not required to support lazy deregistration, an additional flag should be added to the cache structure which should indicate whether the feature is in use. After that, the following should be added to the condition below.

     if (--entry->use_cnt == 0) {
		if (entry->cached && cache->lazy_deregistration) {
			dlist_insert_tail(&entry->lru_entry, &cache->lru_list);
		} else {
			util_mr_free_entry(cache, entry);
		}
	}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Primarily, this is to support application cases where the monitor systems available via libfabric will not be sufficient, in which case the cache uses best-effort caching and aggressively purges entries on calls to 'delete'.

The case that Dmitry is observing is a similar to the above.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Setting the size = 0 will disable the cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the cache::size = 1 provides MR cache w/o lazy deregistration

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache::size = 1 will still cache one entry, and has the potential to create the same problem -- it would just be incredibly unlikely.

@dmitrygx
Copy link
Member

Is the verbs provider missing a matching call to 'delete'?

No, there is matching calls to delete for each search.
There is no possbility to turn off the lazy deregistration in the current implementation. The only way to do this is to set cache::size to 1. It works great for the verbs/RDM and the RxM/verbs providers!

Btw, I don't follow the idea why the MR caching functionality w/o lazy deregistration is needed. Seems it works like a storage for the MR entries and nothing else

@jswaro @shefty Thank you for great collaboration and your efforts to implement this!

@shefty
Copy link
Member Author

shefty commented Dec 14, 2017

I agree with Dmitry -- disabling lazy deregistration disables caching, and a cache that doesn't actually cache doesn't seem all that useful. :) But such a model is effectively supported by setting the cache size to 0.

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

Btw, I don't follow the idea why the MR caching functionality w/o lazy deregistration is needed. Seems it works like a storage for the MR entries and nothing else

Again, the reason why you wouldn't enable lazy deregistration would be for cases where the memory monitoring system isn't sufficient, and you require that registrations be coherent with application usage (search/delete). Specifically, I've observed applications that bypass simple memory monitoring systems that hook into the glibc allocation systems, and so those monitors may not be sufficient. KDREG for the GNI provider is sufficient to detect this page map changes at the kernel level, so we don't need to worry about it. However, if KDREG isn't available on the customer system, then we disable lazy deregistration at the cache to maintain registration coherence with respect to the application usage. When the registration is created, the entry exists until the use count drops to zero. In this cache, if multiple requests to the same region were created, several libfabric fi_mr objects could be mapped to the same underlying cache entry... which still serves some benefit to the application -- it wouldn't just be a registration storage tree.

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

I agree with Dmitry -- disabling lazy deregistration disables caching, and a cache that doesn't actually cache doesn't seem all that useful.

I would generally agree, but look at the case I provided above. It is a sub-optimal case where the cache simply serves as a best-effort to minimize time in the hardware memory registration calls given that it doesn't have sufficient information to reconcile registration entries with the underlying memory management system.

Edit: It would still 'cache' the entries in the tree so that subsequent calls could use the existing registration while it hadn't yet been closed by the upper layer. In this way, multiple fi_mr requests can utilize the same underlying memory registration without requiring multiple calls into the hardware registration functions.

@jswaro
Copy link
Contributor

jswaro commented Dec 14, 2017

But such a model is effectively supported by setting the cache size to 0.

Sure, that makes sense. I missed that.

@dmitrygx
Copy link
Member

Sorry, it's time to go to bed. I've update the PR #3602 w/ new implementation of MR caching for the verbs provider.
If this PR will be merged, I'll update #3602 and will wokr on the next step - to implement Memory notifier over memory hooks.

@shefty shefty merged commit bd22d58 into ofiwg:master Dec 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants