
Support to use heap for pool for large systems #7871

Closed
wants to merge 1 commit into from

Conversation

yufengwangca
Contributor

Problem

What is being fixed?

Large systems (e.g. Linux) do not need actual 'pools' of data and would rather use the heap. We could always preallocate large amounts (e.g. say a Linux controller can connect to 100 devices at a time) and virtual memory will ensure we do not actually use that much physical RAM.

Pools always need to be over-provisioned for the worst case; reserving space in the data section for each pool wastes RAM, and the waste adds up as we add more pools.

This is the initial change for pool management optimization in chip. I would like to collect feedback and make sure this is the right direction before scaling it further.

Change overview

Add a configuration option to use the heap for pools on large systems.

Testing

How was this tested? (at least one bullet point required)

  1. Ran $ ./TestSystemObject and confirmed all tests pass.
  2. Ran the Echo ping test and confirmed it succeeds with this change.

#if CHIP_SYSTEM_CONFIG_POOL_USE_HEAP
unsigned int oldCount = __sync_fetch_and_add(&this->mRefCount, 1);

if (oldCount == 0)
Contributor

I see how the atomic operations ensure correctness for mRefCount. However, how do we ensure that the pool is actually initialized after the second call to Init returns? I don't think we actually can.

It seems like this is unsafe unless all of the pools are carefully initialized at the start of the program before anything actually uses them. That seems to somewhat defeat the purpose of using atomics for mRefCount, or having the reference counting scheme at all.

Contributor

Given that we're moving to purely external synchronization, I would just leave it out unless there's an actual race (and I don't see how there can be; I think we'll have bigger problems if somebody tries to initialize the sdk twice), but if necessary, prefer std::atomic to gcc intrinsics.
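For illustration, the snippet above could use std::atomic instead of the GCC __sync_* intrinsics. This is a hypothetical sketch, not the SDK code; PoolRefCount and the Init contract here are stand-ins for the real class, and the default seq_cst ordering of fetch_add matches the full-barrier semantics of the intrinsic:

```cpp
#include <atomic>

// Hypothetical stand-in for the class holding mRefCount in the snippet
// above; the real SDK type and Init signature differ.
struct PoolRefCount
{
    std::atomic<unsigned int> mRefCount{0};

    // Returns true only for the call that observed a count of zero,
    // i.e. the one that should perform the one-time initialization.
    bool Init()
    {
        unsigned int oldCount = mRefCount.fetch_add(1); // seq_cst, like __sync_fetch_and_add
        return oldCount == 0;
    }
};
```

Note this still has the initialization-visibility problem described above: the second caller may see oldCount != 0 before the first caller has finished initializing the pool, so atomics on the counter alone do not make concurrent Init safe.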

@msandstedt
Contributor

I agree that unconstrained platforms do not share the same concerns that motivated creation of the object pools. There aren't the same risks around fragmentation that can occur with dynamic allocation of a large number of small objects, and particularly objects that may have a long lifetime.

However, moving the pools from the bss segment to the heap doesn't seem functionally different. My preference would instead be that when System::TryCreate is called, this is just a thin wrapper for malloc. This would allow unconstrained platforms to leverage the fact they have abundant memory and not exhaust the sPools at runtime simply because we developers did not anticipate a particular use case.

Unfortunately, I don't know how this would be easily achieved. One current use of the sPools is for layers to walk all objects in the pool to perform particular operations. Apparently objects do need to be known to the layer that allocated them. I think to refactor in the way I'm suggesting, we would need to organize the objects into some type of accessible, but expandable structure like a linked list.

Alternatively, we would need to remove the use cases where layers require immediate access to all objects in the pool. This seems to be the biggest user of that pattern:

void InetLayer::HandleSelectResult(int selectRes, fd_set * readfds, fd_set * writefds, fd_set * exceptfds)

Does anybody know why we need to walk every single pool object in this function? That seems fundamentally unscalable. I admit I haven't looked at the inet code very closely. But if we can remove this type of behavior, then a refactor where System::TryCreate becomes a thin wrapper for malloc is very straightforward.

@kpschoedel
Contributor

Alternatively, we would need to remove the use cases where layers require immediate access to all objects in the pool. This seems to be the biggest user of that pattern:

void InetLayer::HandleSelectResult(int selectRes, fd_set * readfds, fd_set * writefds, fd_set * exceptfds)

That case is going away (#5556 / #6561).

Contributor

@bzbarsky-apple bzbarsky-apple left a comment

I'm trying to understand the actual goal here.

So far what I am seeing is that we still have a pool size template argument; it's just that instead of allocating the pool memory inline in the object we are heap-allocating that memory. What is the benefit of doing this?

I thought the idea we had been discussing was to allow heap-allocation of the individual objects. This PR does not really seem to be doing that...

@yufengwangca
Contributor Author

I'm trying to understand the actual goal here.

So far what I am seeing is that we still have a pool size template argument; it's just that instead of allocating the pool memory inline in the object we are heap-allocating that memory. What is the benefit of doing this?

I thought the idea we had been discussing was to allow heap-allocation of the individual objects. This PR does not really seem to be doing that...

The goal of this PR is to add a flag that allows reserving pool space from the heap instead of global static memory on big systems. The pool still uses one big chunk of contiguous memory consisting of fixed-size blocks. Pools always need to be over-provisioned for the worst case; virtual memory will ensure we do not actually use that much physical RAM until we access the data. It also allows the system to return the memory reserved by the pool back to the OS after use.

Could you explain more about heap-allocation of the individual objects? Then we would have extra metadata to track those individual, discrete objects in memory, right? How do we handle memory fragmentation if the pool memory is not contiguous?

@msandstedt
Contributor

I'm trying to understand the actual goal here.
So far what I am seeing is that we still have a pool size template argument; it's just that instead of allocating the pool memory inline in the object we are heap-allocating that memory. What is the benefit of doing this?
I thought the idea we had been discussing was to allow heap-allocation of the individual objects. This PR does not really seem to be doing that...

The goal of this PR is to add a flag that allows reserving pool space from the heap instead of global static memory on big systems. The pool still uses one big chunk of contiguous memory consisting of fixed-size blocks. Pools always need to be over-provisioned for the worst case; virtual memory will ensure we do not actually use that much physical RAM until we access the data. It also allows the system to return the memory reserved by the pool back to the OS after use.

OK, this wasn't obvious to me. But can't the kernel map the bss to the zero virtual page? I think this copy-on-write virtual memory optimization is possible regardless of where we allocate the pools.

Could you explain more about heap-allocation of the individual objects? Then we would have extra metadata to track those individual, discrete objects in memory, right? How do we handle memory fragmentation if the pool memory is not contiguous?

I believe @bzbarsky-apple is suggesting the same thing I was: make System::TryCreate a shallow wrapper for malloc such that each individual object is malloc'ed.

The only remaining question is whether the layers need to track objects themselves. @kpschoedel's comment implies they may not. If not, can't we just map System::TryCreate to malloc and map System::Release to free? It would be really nice if we could.
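The "thin wrapper for malloc" idea could look roughly like the sketch below. This is illustrative only: TryCreate and Release here are free-function stand-ins for the real System::Object entry points, whose actual signatures differ, and ExampleObject is a made-up type.

```cpp
#include <cstdlib>
#include <new>

// Hypothetical example type; not an SDK class.
struct ExampleObject
{
    int value = 42;
};

// Sketch of TryCreate as a shallow malloc wrapper: allocate raw memory,
// then construct the object in place with placement new.
template <typename T>
T * TryCreate()
{
    void * mem = std::malloc(sizeof(T));
    return (mem != nullptr) ? new (mem) T() : nullptr;
}

// Sketch of Release as the matching free wrapper: run the destructor,
// then hand the memory back to the allocator.
template <typename T>
void Release(T * obj)
{
    if (obj != nullptr)
    {
        obj->~T();
        std::free(obj);
    }
}
```

With this shape, the only pool-level state that disappears is the ability to walk all live objects, which is exactly the open question in the comment above.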

@mspang
Contributor

mspang commented Jun 25, 2021

I'm trying to understand the actual goal here.
So far what I am seeing is that we still have a pool size template argument; it's just that instead of allocating the pool memory inline in the object we are heap-allocating that memory. What is the benefit of doing this?
I thought the idea we had been discussing was to allow heap-allocation of the individual objects. This PR does not really seem to be doing that...

That's certainly what I've been pushing for.

The goal of this PR is to add a flag that allows reserving pool space from the heap instead of global static memory on big systems. The pool still uses one big chunk of contiguous memory consisting of fixed-size blocks. Pools always need to be over-provisioned for the worst case; virtual memory will ensure we do not actually use that much physical RAM until we access the data. It also allows the system to return the memory reserved by the pool back to the OS after use.

Virtual memory works the same whether it's a large static allocation or a large dynamic one (but you should use calloc rather than malloc + memset if you want lazy allocation).
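The calloc point above can be illustrated with two hypothetical allocation helpers (the function names and sizes are made up for this sketch). calloc can return pages the kernel backs with the shared zero page until first write, while malloc followed by memset dirties every page immediately and commits physical RAM up front:

```cpp
#include <cstdlib>
#include <cstring>
#include <cstddef>

// Lazy: calloc returns zero-filled memory; on typical virtual-memory
// systems physical pages are only committed when a block is first written.
void * AllocatePoolLazy(std::size_t blockCount, std::size_t blockSize)
{
    return std::calloc(blockCount, blockSize);
}

// Eager: the memset writes every page, so physical memory for the whole
// pool is committed immediately, defeating the lazy-commit optimization.
void * AllocatePoolEager(std::size_t blockCount, std::size_t blockSize)
{
    void * p = std::malloc(blockCount * blockSize);
    if (p != nullptr)
    {
        std::memset(p, 0, blockCount * blockSize);
    }
    return p;
}
```

Both return zeroed memory; they differ only in when physical pages get committed.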

Could you explain more about heap-allocation of the individual objects? Then we would have extra metadata to track those individual, discrete objects in memory, right? How do we handle memory fragmentation if the pool memory is not contiguous?

If objects are relocatable (which requires not storing pointers to pool-allocated objects) and we really did want contiguous storage so we can iterate over the objects with good locality, it's an option to allow use of std::vector... But we'd have to store indexes instead of pointers.

@bzbarsky-apple
Contributor

reserve pool space from heap instead of global static memory for big systems.

Why does it matter? As @msandstedt says, there's not much of a difference here in that .bss likewise involves just a virtual memory reservation on these systems until you actually write nonzero data in.

It also allow the system return the memory reserved by pool back to system after usage.

That is true, but in practice do we deallocate these pools short of shutting down the entire Matter stack?

Could you explain more about heap-allocation of the individual objects?

Sure. The basic problem we are running into, as I understood it, is that picking the right size for some of these pools for some of these "large" systems is very hard because we have no idea up front what the workload will look like. Maybe we transiently need 200 exchanges when we boot up to send subscribe requests to everything on the fabric, but then we'll never need that many again. That sort of thing.

One plausible solution to that problem is to just not use a fixed-size pool at all and instead heap-allocate the objects, so the only constraint on how many we can have is the actual memory of the device. On small devices where we would run up against these limits a lot this is not a great strategy, but on the "large" systems it seems like it should be fine.

Then we will have extra meta data to track those individual and discrete objects in memory, right?

Yes. Well, malloc will. We only need this sort of metadata if we need to find all the objects. That is the only part of the whole thing that seems complicated and annoying...

how we handle the memory fragmentation if the pool memory is not continuous?

By just accepting that some amount of loss to fragmentation is fine. In practice, on "large" systems:

  1. Virtual memory means that physical fragmentation is really only a problem for large allocations that span many pages: if every physical page has something on it already, then you might have an issue for such an allocation. But I don't expect us to make such large allocations in Matter.
  2. Virtual fragmentation is almost entirely a non-issue on 64-bit systems due to the size of the address space. On 32-bit systems it's possible to end up with significant fragmentation, but again in practice that is usually a problem if you have to make a large virtual allocation. Which, again, we don't plan to do.

So in practice, fragmentation on "large" systems comes down to a percentage tax on your memory capacity. And while losing 10% or so of a constrained device's memory is a significant problem, the same loss on a "large" device is a much smaller issue, no?

@stale

stale bot commented Jul 2, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale Stale issue or PR label Jul 2, 2021
@mspang
Contributor

mspang commented Jul 5, 2021

Actually I think even with std::vector<PoolObj> we would still be overcomplicating things.

How about just switching to std::vector<std::unique_ptr<PoolObj>> on sufficiently capable systems?

  • Sizing is dynamic
  • No need to invalidate pointers
  • Doesn't require immediate redesign of code that assumes we can iterate over the pool
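The suggestion above could be sketched as follows. PoolObj, HeapPool, and the inUse flag are hypothetical illustration, not SDK code; the point is that the unique_ptr indirection keeps each object's address stable even when the vector reallocates, while iteration over the pool still works:

```cpp
#include <memory>
#include <vector>

// Hypothetical pooled object; a real one would carry actual state.
struct PoolObj
{
    bool inUse = false;
};

class HeapPool
{
public:
    PoolObj * TryCreate()
    {
        // Reuse a released slot if one exists.
        for (auto & slot : mObjects)
        {
            if (!slot->inUse)
            {
                slot->inUse = true;
                return slot.get();
            }
        }
        // Otherwise grow dynamically. The unique_ptr indirection means the
        // object's address survives vector reallocation, so no pointers
        // held by callers are invalidated.
        mObjects.push_back(std::make_unique<PoolObj>());
        mObjects.back()->inUse = true;
        return mObjects.back().get();
    }

    void Release(PoolObj * obj) { obj->inUse = false; }

    // Code that assumed it could iterate over the pool still can.
    template <typename F>
    void ForEachActive(F func)
    {
        for (auto & slot : mObjects)
        {
            if (slot->inUse)
            {
                func(*slot);
            }
        }
    }

private:
    std::vector<std::unique_ptr<PoolObj>> mObjects;
};
```

Released slots are recycled before the vector grows, which bounds the container's size at the historical peak of live objects.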

@stale stale bot removed the stale Stale issue or PR label Jul 5, 2021
@yufengwangca
Contributor Author

Actually I think even with std::vector<PoolObj> we would still be overcomplicating things.

How about just switching to std::vector<std::unique_ptr<PoolObj>> on sufficiently capable systems?

  • Sizing is dynamic
  • No need to invalidate pointers
  • Doesn't require immediate redesign of code that assumes we can iterate over the pool

I think this is a good idea; I will pilot this solution, thanks.

@yufengwangca
Contributor Author

I believe @bzbarsky-apple is suggesting the same thing I was: make System::TryCreate a shallow wrapper for malloc such that each individual object is malloc'ed.

The only remaining question is whether the layers need to track objects themselves. @kpschoedel's comment implies they may not. If not, can't we just map System::TryCreate to malloc and map System::Release to free? It would be really nice if we could.

We actually have two pool systems in our codebase: the memory pool class BitMapObjectPool defined in support/Pool.h, and the memory pool class SystemObject defined in system/SystemObject.h. We plan to eventually merge these two systems.

Even though we might not need immediate access to all objects in the pool for SystemObject with the latest update, BitMapObjectPool provides an interface to iterate over all items within the pool.

@yufengwangca
Contributor Author

This draft PR was only for feedback collection. Based on the initial feedback, I need to reset the direction based on what we really want to achieve for pool optimization.

Closing this draft PR; I will post a new PR for pool optimization.

Successfully merging this pull request may close these issues.

Allow more control over Pool sizes
5 participants