
Out of memory Crash - 1.6.0-9.22789.0 #3840

Open · 1 task done
haron4igg opened this issue Nov 3, 2024 · 29 comments
Labels
bug Something isn't working

Comments

@haron4igg

Describe the bug

My players are experiencing a huge spike in crashes due to memory usage (low memory, memory access, and so on).
My game mode is pretty memory-intensive because of the many custom models. We spent days looking for the problem, and so far we have only found that
older MTA versions have almost no problems compared to the latest ones:

Crashes in the last 30 days, per version:

1381 | 1.6.0-9.22789.0
675 | 1.6.0-9.22780.0
611 | 1.6.0-9.22763.0
247 | 1.6.0-9.22746.0
23 | 1.6.0-9.22650.0
17 | 1.6.0-9.22684.0
10 | 1.6.0-9.22771.0
9 | 1.6.0-9.22751.0

We are now suggesting players roll back to the older 22746 / 22650 versions, and they report no problems with those.

We had two crash scenarios. One is the "good" one: the player reconnects to the server 2-3 times, and on the 3rd-4th time he may get a low-memory warning and crash. This has been ~99% of our cases during the last 10 years :)

The new one: the player starts MTA and joins the server for the first time, plays 15-20 minutes (or minimizes 2-3 times), then gets a low-memory warning and crashes.
This is what happens primarily with 22789.0.

The moment before the player crashes, after the first low-memory warnings (textures/fonts not created), his memstat looks totally OK-ish:
Image

The crashes are different, but mostly:
Image
Image
Image

Going to update ticket as soon, as we identify exact version where problem appeared first time.

Steps to reproduce

  1. Join our server: mtasa://83.222.116.88:22003
  2. Just play casually for 10-15 minutes.
  3. Minimize and maximize MTA a few times.
  4. Eventually you get a low-memory warning.
  5. A few minutes later, the game crashes. (People also report that enabling showmemstat after the memory warning causes an immediate crash.)

Version

Client: 1.6.0-9.22763.0 - 1.6.0-9.22789.0

Additional context

No response

Relevant log output

No response

Security Policy

  • I have read and understood the Security Policy and this issue is not security related.
@haron4igg haron4igg added the bug Something isn't working label Nov 3, 2024
@Xenius97 (Contributor) commented Nov 3, 2024

0x003C91CC is the most famous crash; it is caused by unoptimized mods/scripts.

https://wiki.multitheftauto.com/wiki/Famous_crash_offsets_and_their_meaning

@haron4igg (Author)

Unfortunately we can't proceed with checking MTA versions to find which one introduced the issue, since the latest 22789 is now enforced.

The last tested-stable version for us was 22746.

Also, I collected a bit more data on crashes over the last 90 days; it seems 22780.0 was the latest 'good' one for us.

CRASH COUNT | VERSION

1543 | 1.6.0-9.22684.0
1437 | 1.6.0-9.22789.0
 675 | 1.6.0-9.22780.0
 612 | 1.6.0-9.22763.0
 593 | 1.6.0-9.22746.0
 559 | 1.6.0-9.22650.0
 529 | 1.6.0-9.22741.0

@PlatinMTA (Contributor) commented Nov 3, 2024

> Unfortunatelly we cant proceed with checking MTA versions to find which one introduced the issue, since latest 22789 is now enforced. […]

I think your own data shows that this is not related to MTA whatsoever (most likely the higher amount of crashes on certain versions is related to the minversion being updated)... Using almost 2.8 GB when you have 3.2 GB available for MTA is too much. You shouldn't be using that much memory.

You did mention players reverting to older versions not having those issues... but then again, you have 1100+ crashes with both mentioned versions, so that just sounds like a placebo effect. You can surely optimize your models and scripts to bring that 2.8 GB use down to at least 2 GB (160 MB in vertices is too much; you can surely lower it without it being noticeable, and the same with the textures: instead of 1024x1024, maybe try 512x512 textures, which will still look nice).


As a comparison, my server is right now hosting 260 players, with a lot of custom vehicles, ped skins and models, and these are the memory values in the busiest part of the map.

Screenshot (Also, the values get cropped, and my resolution isn't that low... just 1600x900. We should take a look at that.)

I know this is kind of an unfair comparison, because we try to maintain the GTA: San Andreas aesthetic, so our models tend to be low-poly. I know that DayZ servers don't really strive for that, but that doesn't mean you can't optimize the models you already have. I'm positive that if you halve most of the textures for the skins, your memory usage will drop drastically (500 MB allocated just for textures is a lot). RenderWare is old... and only capable of using 32-bit addresses. Compromises have to be made.
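The arithmetic behind that suggestion is worth spelling out: texture memory scales with the square of the edge length, so halving each dimension quarters the footprint. A quick pure-Lua sketch (the mipmap factor is an approximation):

```lua
-- Rough texture footprint in MB; bytesPerPixel is 4 for uncompressed RGBA,
-- 0.5 for DXT1. The 4/3 factor approximates a full mipmap chain.
local function textureMB(width, height, bytesPerPixel)
    return width * height * bytesPerPixel * (4 / 3) / (1024 * 1024)
end

print(("1024x1024 RGBA: ~%.1f MB"):format(textureMB(1024, 1024, 4)))   -- ~5.3 MB
print(("512x512 RGBA:   ~%.1f MB"):format(textureMB(512, 512, 4)))     -- ~1.3 MB
print(("1024x1024 DXT1: ~%.1f MB"):format(textureMB(1024, 1024, 0.5))) -- ~0.7 MB
```

With hundreds of skins, the difference between those rows adds up to the hundreds of megabytes discussed above.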

@haron4igg (Author) commented Nov 3, 2024

> I think your own data shows that this is not related to MTA whatsoever […] Compromises have to be made.

1543 | 1.6.0-9.22684.0 - there was a spike here (1), which got quickly fixed in the next patch.
1437 | 1.6.0-9.22789.0 - the current problematic version (2), which causes crashes during the first clean run of MTA.
Image

The rest:
675 | 1.6.0-9.22780.0
612 | 1.6.0-9.22763.0
593 | 1.6.0-9.22746.0
559 | 1.6.0-9.22650.0
529 | 1.6.0-9.22741.0

  • are the normal crash rate, caused by frequent reconnects within the same MTA session.

But I totally agree with the optimization points; we actually do a lot of that, with each update during the last ~10 years of working with DayZ :D

@PlatinMTA (Contributor)

> 1543 | 1.6.0-9.22684.0 - was spike here (1), which got fast fixed in next patch

I remember this crash (the game crashed on disconnect). Really annoying.

> 1437 | 1.6.0-9.22789.0 - is current problematic version (2) which causing crashes during first clean run of MTA

Did r22787 work, for instance? If I'm not confused, you guys still haven't found a version where the amount of crashes did not spike. Does your server use CEF?

@haron4igg (Author)

> Did r22787 work for instance? […] Does your server use CEF?

Yeah, we haven't found the exact one, but the statistics suggest that .22780.0 was the last normal version.
Yes, we use CEF.

@PlatinMTA (Contributor)

> Yea, we haven't found exact. but statistic advise that .22780.0 were last normal. Yes, we using CEF

AFAIK the last forced minclientversion before r22789 was r27763, according to my logs. Do you have, for instance, data on the number of users on r22780 vs the number of crashes? That would be really useful. Recently some changes were made in CEF (#2933), so maybe that's the issue. Maybe disabling GPU rendering could stop the crashes?

@haron4igg (Author)

> Maybe disabling GPU rendering could stop the crashes?

Can't find any API to disable it; isn't it a compilation flag?

@PlatinMTA (Contributor)

> Cant find any API to disable it, isn't it compilation flag?

You can disable it from the settings, and you can check whether the client has it enabled with isBrowserGPUEnabled.
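A server cannot change this setting, but it can detect it and nudge affected players. A minimal client-side sketch using the isBrowserGPUEnabled function mentioned above (the chat message is illustrative):

```lua
addEventHandler("onClientResourceStart", resourceRoot, function()
    if isBrowserGPUEnabled() then
        -- There is no scripting API to change this setting,
        -- so we can only ask the player to toggle it manually.
        outputChatBox("If you get out-of-memory crashes, try disabling Settings > Web Browser > Enable GPU rendering.")
    end
end)
```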

@Lpsd (Member) commented Nov 3, 2024

It's a client setting; it's entirely up to the user. There is no API to control this, as with most other client settings (the server has no authority over them).

As I mentioned in Discord, CEF GPU rendering was introduced in 22771 but was broken due to compositing being re-enabled, then fixed in 22789 (by disabling compositing, while still having GPU enabled by default).

I doubt that disabling GPU in CEF will resolve your issue, but you can ask players to try it out (MTA settings > Web Browser > Enable GPU rendering).

@Lpsd (Member) commented Nov 3, 2024

If 22780 was good for your players, then it's 99% not CEF, since CEF was (mostly) broken from 22771 to 22789, as mentioned above.

@haron4igg (Author)

We have 3 users so far who got a lot of crashes and, since disabling GPU rendering, have had no more issues.

So I assume that, because we are on the edge with models/textures, adding CEF to video memory takes all the remaining space...
Is it possible to show how much video memory CEF uses in 'showmemstat'?
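showmemstat does not break CEF out separately, but dxGetStatus exposes several video-memory fields, so comparing VideoMemoryFreeForMTA before and after creating a browser gives a rough per-browser cost. A sketch (field names as per the dxGetStatus docs; the 800x600 browser size is arbitrary):

```lua
local function reportVideoMemory(label)
    local s = dxGetStatus()
    outputConsole(("[%s] free=%d MB  textures=%d MB  fonts=%d MB  rendertargets=%d MB"):format(
        label,
        s.VideoMemoryFreeForMTA,
        s.VideoMemoryUsedByTextures,
        s.VideoMemoryUsedByFonts,
        s.VideoMemoryUsedByRenderTargets))
end

reportVideoMemory("before-cef")
local browser = createBrowser(800, 600, false)      -- remote-capable browser
setTimer(reportVideoMemory, 5000, 1, "after-cef")   -- give CEF time to allocate
```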

@Fernando-A-Rocha (Contributor)

Wouldn't it be nice to have a client function to disable CEF GPU rendering, so certain servers can control whether their clients need it or not?

@Lpsd (Member) commented Nov 5, 2024

> Wouldn't it be nice to have a client function to disable cef gpu rendering, so certain servers can control whether their clients should need it or not?

Not viable, for two reasons:

  1. It requires a client restart.
  2. The principle of not allowing servers to modify client settings.

@Lpsd (Member) commented Nov 5, 2024

In my opinion it's not up to a server to decide that a client can't use GPU in CEF, just because that server wants to push memory limits to breaking point.

@haron4igg (Author) commented Nov 8, 2024

After a week of research:

  1. We found a few users who were able to reproduce the issue within ~30 minutes of a gaming session, on this hardware:
    - NVIDIA GeForce RTX 3060, RAM: 32661
    - NVIDIA GeForce GTX 1650, RAM: 16334.94921875
    - Intel(R) Arc(TM) A750 Graphics, RAM: 16208.4765625

  2. We turned off resource packs one by one, to see if any of them were responsible for the problem.

  3. We disabled newly added resources and shaders, to exclude in-shader memory leakage.

  4. We also ran a test without any of the textures/shaders, so ~500 MB less memory and no shader rendering.

As a result, testers now get this crash not after 30 minutes but after ~1.5-2 hours (even with the GPU CEF setting turned off),
which now looks like a memory leak to me. Considering that we don't have this problem with older MTA versions, I now really doubt that the cause of this leak is my resources.

I also released some patches to the resource pack, reducing the size of the textures by ~200 MB in total, for public test purposes.
As a result, the crash rate in production dropped, but only because a normal gaming session is below 1 hour; players who stay longer still crash.

Counting items in the client-side element tree shows no growth over the session time, so we are not leaking elements, shaders, or textures.
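An element-tree check like the one described can be made continuous: a sketch that logs per-type counts once a minute, so monotonic growth stands out in the console (the list of tracked types is arbitrary):

```lua
local trackedTypes = { "object", "vehicle", "ped", "marker", "shader", "texture" }

local function logElementCounts()
    local parts = {}
    for _, elementType in ipairs(trackedTypes) do
        parts[#parts + 1] = elementType .. "=" .. #getElementsByType(elementType)
    end
    outputConsole(("[census %d] %s"):format(getTickCount(), table.concat(parts, " ")))
end

setTimer(logElementCounts, 60000, 0)  -- sample once a minute, forever
```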

And this is not the first time, to be honest, that CEF gets an update and we start getting crashes: #2446

@PlatinMTA (Contributor)

Could it be a table (or a series of tables) that is not being cleaned up properly? Global tables, not local ones, because those get picked up by the garbage collector.

Memory leaks can also happen because you are not properly clearing some global variables: for example, not clearing them when elements get destroyed, when players disconnect, or when they are no longer needed. You can check the memory usage of a resource in the performance browser.

This is a little script that should raise your RAM usage (an exaggeration, but you could easily make this mistake with an onClientRender event):

function tableCopy(orig)
    local orig_type = type(orig)
    local copy
    if orig_type == 'table' then
        copy = {}
        for orig_key, orig_value in pairs(orig) do
            copy[orig_key] = orig_value
        end
    else -- number, string, boolean, etc
        copy = orig
    end
    return copy
end

---------------------

allElements = {}
global = {}

function onStart()
	getChildren(root)
	
	utilizeRAM()
end
addEventHandler("onClientResourceStart", resourceRoot, onStart)

function getChildren(element)
	local children = getElementChildren(element)
	if #children == 0 then
		return
	end
	
	for key,element in ipairs(children) do
		local elementType = getElementType(element)
		if not allElements[elementType] then
			allElements[elementType] = {}
		end
		
		local i = #allElements[elementType]+1
		allElements[elementType][i] = element
		
		getChildren(element)
	end
end

function utilizeRAM()
	local iMax = 500000
	for i=1,iMax do
		global[i] = tableCopy(allElements)
	end
end

Image

BEFORE:
Image

AFTER:
Image

So, if you could check the memory usage of your resources, that would be great. Realistically speaking, I doubt you have a massive memory leak in one of your resources, but that could be the case, so it would be nice to rule that out.
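Conversely, the cleanup being described, dropping global references when the underlying element dies, can be sketched like this ('elementData' is a hypothetical per-element cache a gamemode might keep):

```lua
local elementData = {}

addEventHandler("onClientElementDestroy", root, function()
    -- 'source' is the element being destroyed; nil-ing the key lets the
    -- garbage collector reclaim the associated table.
    elementData[source] = nil
end)

-- Alternative: a weak-keyed cache is cleared by the GC automatically once
-- nothing else references the element.
local weakCache = setmetatable({}, { __mode = "k" })
```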

@haron4igg (Author) commented Nov 8, 2024

> Could it be a table (or a series of tables) that are not being cleaned properly? […] So, if you could check your memory usage for your resources that would be great.

  1. I do check the per-script memory consumption in the performance browser, and it's normal:
    For a player who stayed connected for ~1 hour:
    Image
    A freshly joined player:
    Image

  2. Why, then, does rolling back to an older MTA version resolve the problem? A Lua leak would be version-independent…

@haron4igg (Author)

Could you guys please allow an MTA downgrade to r22771, so we can precisely retest the latest versions with my most problematic users, to locate the issue?

@TracerDS (Contributor) commented Nov 8, 2024

> Could you guys please allow mta downgrade to r22771, so we retest precisely latest versions with my most problematic users, to locate the issue?

Did you try the nightly?

@haron4igg (Author)

> Did you try nightly?

Nightly 22789, a.k.a. the current force-update version, is the reason for this issue :)

@haron4igg (Author) commented Nov 10, 2024

So, I think we found it:

My mode depends on CEF; some UI menus are made with it.
I also have a Google Analytics integration, using fetchRemote on the client side to send events, and a small CEF window to start the user session with the proper GEO data collected.

When a user joins, I briefly create a CEF browser with the GA init code to obtain a session, and once that's done I delete the browser.
I also call requestBrowserDomains in advance, to be prepared for fetchRemote when sending GA events. So every user who joins the server, even if he never opens the CEF UI, still gets CEF loaded.

In both cases, whether I call requestBrowserDomains or create CEF for the page-hit event, CEF gets loaded and attached to the MTA process:
Image

And the memory leak starts... Whenever MTA has CEF attached, my client will crash after 30-50 minutes of play.
If I remove all requestBrowserDomains calls and CEF windows, then there are no more problems.

So we ran a test with one of my problematic clients. I removed everything related to CEF.
We played for ~1 hour: no issues, no memory leaks, no out-of-memory problems. We touched everything that exists in my gamemode.

Afterwards, I just enabled the analytics plugin with only the requestBrowserDomains and fetchRemote features on; CEF got attached to the process and he got an instant crash, without any crash-info window.

Summing up with the previous research above, it should now be clear that the features recently introduced to CEF have a memory leak,
which is not noticeable in low-size projects but is noticeable in bigger ones.
#3840 (comment)

@Lpsd (Member) commented Nov 10, 2024

I don't think we changed anything in CEF recently, outside of adding a setting for enabling GPU (which is just a command-line option on CEF instantiation) and vendor updates.

Did you test this with the GPU option enabled or disabled?

If you tested with GPU disabled and still have the issue, then it's nothing to do with those recent GPU changes (as I said, the MTA implementation side just passes a command-line option to CEF itself on launch; it's a 3-line change where we don't allocate or change anything on our side, so not a memory leak in the MTA implementation).

It could be related to a recent update in CEF whereby we are missing something from MTA's existing implementation; but it's nothing to do with the GPU changes if the issue still exists with GPU disabled in CEF.

@Lpsd (Member) commented Nov 10, 2024

Also, if you have a way to replicate a crash in CEF, then please provide a resource/script that we can use.

@haron4igg (Author) commented Nov 10, 2024

It is worth mentioning that not everyone experiences this crash. For example, on my machine I have never had this problem so far.
But analytics suggest that it is also hardware-independent.

This is the minimal code that caused my tester to crash after ~45 minutes of play on my 'memory-heavy' gamemode, once it was started.

GPU rendering was disabled in the latest tests with this user.

function onClientResourceStart()
	sendPageView()
end
addEventHandler( "onClientResourceStart", resourceRoot, onClientResourceStart)

function sendPageView()
	sendTracking("page_view", {
		screen_resolution = "1024x1024",
		language = getLocalization()["code"],
		page_title = "test",
		page_location = "https://nonstopz.com/gameplay/",
		client_version = getVersion().sortable,
	})
end

function sendTracking(event, ...)
	local params = {...}

	-- Note: the original snippet passed an undefined 'request' variable here;
	-- requestBrowserDomains expects a table of domains as its first argument.
	requestBrowserDomains({"exmaple.com"}, false, function (success)
		if success then
			local status, err = pcall(sendTrackingInternal, event, unpack(params))
			if not status then
				error(err)
			end
			return status
		else
			--sendTrackingRemote(event, unpack(params))
		end
	end)
end

function sendTrackingInternal(event, ...)
	-- ELog and inspect (used below) are the gamemode's own helpers, not MTA API

	local url = "https://exmaple.com/"
	local payload = {some="data"}

	local post = {
		queueName = "analytics",
		connectionAttempts = 1,
		connectTimeout = 5000,
		postData = toJSON(payload, true):sub(2, -2),
		method = "POST",
	}

	fetchRemote( url, post, function(responseData, errno)
		if tostring(responseData) == "ERROR" then
			ELog("[Analytics] error: %s", inspect(responseData, errno))
		end
	end)
end

@Lpsd (Member) commented Nov 14, 2024

So you don't even need to create a browser to cause the crash, only requestBrowserDomains? Or is there code missing from above?

@Lpsd (Member) commented Nov 14, 2024

I also spent some time yesterday trying to replicate a memory-leak scenario in CEF, but had no luck. Also, I'm sure there would be far more reports about this from other popular servers if there were a general memory-leak issue in CEF.

I'm not saying there isn't a memory-leak issue in CEF, as it seems you have identified CEF as a problem by eliminating its usage in tests, but we don't really have enough info yet. Ideally we need to be able to replicate the issue. It may be something quite particular in one of your scripts that utilize CEF; the more info we have, the better chance we have of tracking it down.

Can you provide more analysis of the memory usage in this leak? Is the memory increasing in the CEF subprocess(es), or in gta_sa.exe itself?

Can you provide memory usage over time (at more than two time points) for a player who eventually crashed due to OOM, with detailed reporting from the performance browser (showing all available metrics relating to memory usage)?

How many CEF browser instances are you creating at any given time?
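A client-side sampler along these lines could produce the over-time data being asked for; a sketch (the interval, ring size, and the dumpmem command name are arbitrary choices):

```lua
local samples = {}

-- Every 30s, snapshot memory-related dxGetStatus fields into a ring buffer.
setTimer(function()
    local s = dxGetStatus()
    samples[#samples + 1] = {
        t = getTickCount(),
        freeMB = s.VideoMemoryFreeForMTA,
        texturesMB = s.VideoMemoryUsedByTextures,
        fontsMB = s.VideoMemoryUsedByFonts,
    }
    if #samples > 240 then
        table.remove(samples, 1)  -- keep roughly the last two hours
    end
end, 30000, 0)

-- The player runs /dumpmem after a low-memory warning and pastes the JSON into the issue.
addCommandHandler("dumpmem", function()
    outputConsole(toJSON(samples))
end)
```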

@haron4igg (Author) commented Nov 15, 2024

> So you don't even need to create a browser to cause the crash, only requestBrowserDomains? Or is there code missing from above?

The code above is complete; it was enough to just call requestBrowserDomains() (which attaches CEF) to cause a crash for the tester.

> I also tried yesterday for some time to replicate a memory leak scenario in CEF but I had no luck with that. Also I'm sure there would be far more reports about this from other popular servers, if there was a general issue with memory leak in CEF.

As I said, I also don't have this crash locally and have never been able to reproduce this kind of memory issue; I only crash due to multiple reconnects.

> How many CEF browser instances are you creating at any given time?

Normally not more than one, or zero. But requestBrowserDomains is executed for every connected player, so CEF is always in the process.

@Lpsd (Member) commented Nov 15, 2024

> My mode is depent on CEF, some UI menus maden there.

So all of your UI is done in a single browser instance?

Also, this mention of an instant crash is confusing things; I thought we were talking about a memory leak here? What you mentioned earlier, about playing for an hour with no memory issues, then enabling a CEF resource and getting an instant crash, sounds completely unrelated to memory leaks, unless you are saying that within the time the resource started, memory usage spiraled out of control and caused that crash. Otherwise, let's try to be clearer about which issue is being referred to.

Until I get my hands on some proper analysis of the memory usage throughout the session of a player who experiences this, I'm afraid I have nothing to work with regarding the claims that this is a CEF-related issue.
