Velox memory consumption #9008
-
@spershin @xiaoxmeng @oerling @mbasmanova @pedroerp @czentgr @majetideepak @yingsu00
-
Folks, I had a chat with Orri yesterday about this. Also, when initializing memory, we have two parameters: memory for the memory map and memory for the arbitrator. We also have indications of memory being leaked, in the sense that mapped memory is not being released. @bikramSingh91 will work on these leaks after he is back from PTO.
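For context, here is a rough sketch of how those two parameters map onto Velox's memory initialization. The option and field names below (MemoryManagerOptions, allocatorCapacity, arbitratorCapacity, and the initialize entry point) are written from memory and may differ across Velox versions, so treat this as an illustration rather than the exact API.

```cpp
#include "velox/common/memory/Memory.h"

using namespace facebook::velox;

// Illustrative sketch only: the two capacities discussed above are set
// independently when the worker brings up its memory subsystem.
void initWorkerMemory() {
  memory::MemoryManagerOptions options;
  options.useMmapAllocator = true;         // back large allocations with MmapAllocator
  options.allocatorCapacity = 40UL << 30;  // "memory for memory map": pages the allocator may mmap
  options.arbitratorCapacity = 32UL << 30; // "memory for arbitrator": what queries may be granted
  options.arbitratorKind = "SHARED";       // shared arbitrator that reclaims memory between queries
  memory::MemoryManager::initialize(options);
}
```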
-
@aditi-pandit, per our offline discussion, we could leave headroom for non-Velox-controlled memory usage through the system-gb-memory config, which is our practice at Meta as @spershin mentioned above. We might also need to investigate the non-trivial Parquet reader memory allocated through std::malloc and change it to allocate from a Velox memory pool, which also provides an STL-allocator-compatible interface (see the sketch below).

Velox doesn't have capacity enforcement for the async data cache; the actual Velox memory capacity is enforced by the memory allocator. The memory allocator dynamically adjusts the cache memory usage based on query memory demand (see the Velox memory doc for details). We have built a memory pushback mechanism in Prestissimo to shrink the cache when the server is under memory pressure. Since the actual server memory usage detection is platform specific, the pushback component is not in OSS. But we could consider moving the control logic, such as the connection with the async data cache, to OSS, and different setups can customize the server memory pressure detection logic cc @bikramSingh91 @tanjialiang. I am not sure it is a good idea to detect the server memory condition inside the async data cache; it is better to keep them separate.
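As an illustration of the std::malloc point above, here is a minimal sketch of routing reader scratch memory through a Velox MemoryPool. MemoryPool::allocate/free are existing APIs, but the StlAllocator adapter name and header are written from memory and may differ by version.

```cpp
#include <vector>
#include "velox/common/memory/Memory.h"

using namespace facebook::velox;

// Sketch: allocate reader scratch space from the query's MemoryPool so the
// usage is tracked and enforced by Velox, instead of going through std::malloc.
void decodeColumnChunk(memory::MemoryPool& pool, size_t bytes) {
  // Raw allocation accounted to the pool (replaces std::malloc/std::free).
  void* buffer = pool.allocate(bytes);
  // ... decode into 'buffer' ...
  pool.free(buffer, bytes);

  // STL container whose memory is accounted to the same pool via the
  // STL-allocator-compatible interface mentioned above.
  StlAllocator<char> alloc(pool);
  std::vector<char, StlAllocator<char>> scratch(alloc);
  scratch.resize(bytes);
}
```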
-
Thanks @spershin and @xiaoxmeng. Some follow-up:
-
@xiaoxmeng: We had also spoken about the assumptions around "fair" memory use between queries during arbitration. You said that Velox expects users to over-provision per-query memory and that all queries are of roughly similar shape, so picking the biggest memory consumer is most efficient. Could you please elaborate in case I missed something? We might need to refine this assumption for use at IBM.
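To make that assumption concrete, here is an illustrative-only sketch of "pick the biggest memory consumer" victim selection. It is not taken from Velox's SharedArbitrator; it just paraphrases the policy described above.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Toy model of the reclaim policy described above: when the arbitrator needs
// memory, it takes it from the query pool with the largest current usage.
// This is efficient when queries are over-provisioned and similarly shaped,
// but can look "unfair" when one large query dominates a mixed workload.
struct QueryPoolUsage {
  std::string queryId;
  int64_t usedBytes;
};

const QueryPoolUsage* pickVictim(const std::vector<QueryPoolUsage>& pools) {
  auto it = std::max_element(
      pools.begin(), pools.end(), [](const auto& a, const auto& b) {
        return a.usedBytes < b.usedBytes;
      });
  return it == pools.end() ? nullptr : &*it;
}
```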
-
When running multiple tests I've noticed that the MmapAllocator will eventually unadvise the mapped pages if no allocation has taken place for a while (hours). I looked into the code to find out what the parameters for this are, but neither in the Prestissimo code (which creates the memory manager and runs periodic tasks) nor in the Velox code could I find what is unadvising the pages. E.g. we might start out with a non-zero number of mapped pages after finishing a query, but after a long time of running, the mapped pages are eventually unadvised back to 0 (I didn't collect the evidence but saw it on the console). We talked about forcing the allocator to unadvise all pages after finishing a query; I looked into the code but there is no wrapper for that (a hedged sketch of what such a hook could look like is at the end of this comment).

On the overall investigation:
Tried the Q1 subquery that only does a table scan, since we suspected the ParquetReader to be involved. Next steps:
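Regarding the "force the allocator to unadvise after a query" idea above, here is a hedged sketch of what such a hook might look like if wired into the worker's query-completion path. It assumes a shrink(targetBytes) API on AsyncDataCache (name and signature written from memory, and not necessarily what actually madvise()s pages away), so it is an illustration rather than a tested fix.

```cpp
#include <cstdint>
#include "velox/common/caching/AsyncDataCache.h"

using namespace facebook::velox;

// Hypothetical post-query hook: ask the cache to give back up to
// 'targetBytes' of cached data so the underlying allocator has a chance to
// advise the freed pages away. Whether and when the MmapAllocator actually
// madvise()s them is exactly the open question in this comment.
void releaseCacheMemoryAfterQuery(
    cache::AsyncDataCache* cache,
    uint64_t targetBytes) {
  if (cache == nullptr) {
    return;
  }
  cache->shrink(targetBytes);
}
```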
-
Credits: @czentgr
The "Velox Memory issue" presentation summarizes particular Prestissimo behavior we observe at IBM.
The issue happens when running consecutive TPC-DS 10K queries on a cluster with 4 workers of the following configuration:
The worker was OOM killed after 3 consecutive queries (though each was run one at a time).
As described in the presentation:
The profile of memory after Q2 looks like:
After Q2 there was a sudden spike of memory consumption of almost 850 MB. This spike cannot be explained; the symptoms point to a memory leak. Has anyone else observed such scenarios?
But beyond this specific situation, general observations about Velox memory consumption raise some questions:

1. 3rd-party libraries like proxygen allocate memory that can't be controlled. Meta throttled exchanges for this. At IBM, we observed the thrift library in the Parquet reader allocating memory as well. In general, operators (new connectors, file/table format readers) are likely to allocate memory. Can Velox leave some head-room for these?
2. The AsyncDataCache has greedy behavior and consumes all of system-gb-memory. This memory stays in use even if the system is idle for a while, which comes across as strange. It seems more natural for cache entries to expire and leave space for new queries to consume. Meta observes external memory use and shrinks the cache periodically, but could the AsyncDataCache do this more organically? (A rough sketch of such a periodic shrink is after this list.)
3. The AsyncDataCache using the entire system-gb-memory makes it seem like Velox requires much more memory than Java Presto. Folks often comment that Prestissimo needed 2x the JVM requirements for the same query. IBM uses beefy machines, so this makes memory use look very bloated. Can the AsyncDataCache be bound to a limit?

Would be great to hear more thoughts on this.