
xa30a cmrr_mbep2d_bold crashes after 55 volumes #338

Open
lfxnat opened this issue Nov 4, 2023 · 12 comments
@lfxnat

lfxnat commented Nov 4, 2023

Dear CMRR,

I have run into a very strange problem where the EPI acquisition aborts on the 55th volume when running on a Prisma (not sure if this is important) with XA30A software.
rfMRI_REST_PA.pdf

Attached to this post is the PDF file of the protocol.

This is very reproducible, but not 100%: sometimes the acquisition finishes normally. Increasing TE or TR by a few ms does not resolve the problem, and rotating the slices does not either.

I could not find anything obvious in the scanner log files, but there was a message about an unhandled exception.

What is so magical about 511? Or is it 96 that is magical?

Thanks

Lazar

@eauerbach
Member

There is nothing unusual about that protocol, so I would assume it is most likely some issue specific to your scanner. If you can get a savelog right after one of these crashes, I can look at it and see if there is a useful error message.

@lfxnat
Author

lfxnat commented Nov 19, 2023 via email

@lfxnat
Author

lfxnat commented Dec 21, 2023 via email

@lfxnat
Author

lfxnat commented Jan 4, 2024 via email

@eauerbach
Member

That is interesting. In the logs I see a memory allocation error. If I am reading the logs correctly it looks like not quite 64 GB of memory is in use when it throws the error, which is a strange limit since your MaRS has 128 GB of memory installed. I am not entirely familiar with how XA30 is set up so maybe the other 64 GB is reserved for the system and this is normal.

But it is strange that so much memory is in use. Normally over the course of the scan as each repetition is acquired the system allocates enough memory for that repetition, reconstructs the repetition, then releases the memory once the images are sent to the database. So even though your per-repetition data size is large (~344 MB per repetition), at any given time there should not be more than a few repetitions in memory. Even accounting for multiple copies of each repetition being held in memory temporarily for intermediate calculations, that should only be a few GB total, not anything close to 64 GB.

Your observation that the images are not shown in real time in the inline display would seem to indicate that the images are not being sent to the database at the end of the chain for some reason, which could mean that they are piling up in memory until it runs out and crashes. 54 volumes * 344 MB is under 20 GB, so even that doesn't seem right, but it isn't impossible that three temporary copies are in use for calculations, which would get up into the 60 GB range. So maybe that is it.

It looks like you are using the 64-channel head coil, so a quick way to check if this is the case would be to try the same protocol with a coil with fewer channels that uses less memory (e.g. the 32-channel or 20-channel coil). Or enable the "Matrix optimization" option to compress the channels (we generally recommend always doing this with the 64-channel head coil for performance reasons). If memory is the limit then with fewer coil elements it should crash instead after 90-100 volumes or more.
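As a back-of-the-envelope check of the arithmetic above, a short sketch. The per-repetition size, the usable pool size, the factor of three temporary copies, and the channel scaling are all assumptions taken from this discussion, not measured values:

```python
# Rough check of the memory arithmetic above. The 344.6 MB per-repetition
# size, the ~60 GB usable pool, the factor of 3 temporary copies, and the
# linear scaling with channel count are assumptions from this thread.
POOL_MB = 60_000            # assumed usable reconstruction memory on the MaRS
BASE_MB_PER_REP = 344.6     # per-repetition data size with 64 channels
COPIES = 3                  # assumed temporary copies held per repetition

for channels in (64, 32, 20):
    rep_mb = BASE_MB_PER_REP * channels / 64
    volumes = POOL_MB / (rep_mb * COPIES)
    print(f"{channels:2d} channels: {rep_mb:6.1f} MB/rep, "
          f"~{volumes:.0f} volumes before the pool is exhausted")
```

Under these assumptions, 64 channels lands right around the observed crash point (~58 volumes), and halving the channel count roughly doubles how far the scan gets, consistent with the 90-100+ estimate above.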

But the real question is why the reconstruction is hanging up. I have only seen this happen before when physiology logging is enabled with the DICOM export option. But that issue was fixed in R017pre7. So if you have the DICOM physiology export option enabled and are using R017pre6 or older, that may be the problem. Otherwise, there must be a new problem, and to troubleshoot further I would probably need a meas.dat file from one of your failed acquisitions.

@lfxnat
Author

lfxnat commented Jan 16, 2024 via email

@ryanwillo

We experienced this same issue on our Prisma at UAB running XA30a. We were using the 64-channel head coil, running HCP-like resting state fMRI with MB factor = 8.
HEALTH-Cog.pdf

We wound up doubling the TR and halving the MB factor. I plan to try matrix optimization and 20-channel coil in the future.

I can share .dat files and savelogs if it is helpful.

Here is relevant text from UTrace file:
102|2024/02/22-14:55:02.766816|mars|1638|MrMCIRContainer|1816|SCTSeqRun|32|Always|reserved, do not use|1|MrImagingFW.cmrr_mbep2d_bold|MrImagingFW|cmrr_mbep2d_bold|/tfs/C_MIDEA_NXVA30A_EJA/src/MrImagingFW/seq/eja_common/eja.XA20A.a_ep_Feedback.h|707|SyncFeedback|62958|0|Received a valid feedback [53]. Synchronization with volume no.: 54|
102|2024/02/22-14:55:02.810999|mars|1635|MrIrisContainer|1648|PARC32.onlinetse_ps.PoolThread[0]|32|Error|DLL|1|ICE.IceBasic|MrVista|IceBasic|/build/3318/src/MrVista/Ice/IceBasic/IceObj.cpp|1131|allocateDataAreaInMemory|6251|0|Memory allocation failed.Tried to allocate memory bloc of 361340928 bytes (344.602 Mb)|
102|2024/02/22-14:55:02.811046|mars|1635|MrIrisContainer|1648|PARC32.onlinetse_ps.PoolThread[0]|32|Notice|DLL|1|ICE.IceBasic|MrVista|IceBasic|/build/3318/src/MrVista/Ice/IceBasic/IceObj.cpp|1132|allocateDataAreaInMemory|6252|0|MrParc memory footprint:¬totalSize = 66051899392 (62992 Mb)¬protectedSize = 3355443200 (3200 Mb)¬protectedAvail_ = 3355443200 (3200 Mb)¬normalSize_ = 62696456192 (59792 Mb)¬normalAvail_ = 235435328 (224.529 Mb)¬poolPageSize_ = 4194304 (4 Mb)¬fallBackCnt_ = 24¬protectedPeak_ = 3355357664 (3199.92 Mb)¬normalPeak_ = 62463085248 (59569.4 Mb)¬bottomUpSize_ = 947358400 (903.471 Mb)¬topDownSize_ = 61513662464 (58664 Mb)¬bottomUpPeak_ = 952830656 (908.69 Mb)¬topDownPeak_ = 61513662464 (58664 Mb)¬protectedUsage_ = 3355357664 (3199.92 Mb)¬normalUsage_ = 62463085248 (59569.4 Mb)¬section_ = 0¬|

Thank you,
Ryan
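The MrParc footprint in the UTrace log above packs key = value pairs separated by the `¬` character. A small sketch to pull the byte counts out of an excerpt (the separator and field names are inferred from the pasted log, not from any documented format):

```python
# Parse a few fields from the MrParc footprint excerpt (format inferred
# from the UTrace log pasted above; '¬' appears to separate the fields).
footprint = ("totalSize = 66051899392 (62992 Mb)¬"
             "normalSize_ = 62696456192 (59792 Mb)¬"
             "normalAvail_ = 235435328 (224.529 Mb)")

fields = {}
for entry in footprint.split("¬"):
    name, _, rest = entry.partition(" = ")
    fields[name.strip()] = int(rest.split()[0])

avail_mb = fields["normalAvail_"] / 2**20
print(f"normalAvail_ at crash: {avail_mb:.1f} MB, "
      f"less than the 344.6 MB needed for the next repetition")
```

This makes the failure mode visible: only ~224 MB remained in the normal pool when the reconstruction tried to allocate the next ~344 MB repetition.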

@bkossows

bkossows commented Apr 19, 2024

We have exactly the same problem on XA30A with the 64-channel coil. By rotating the FOV I can easily reproduce the error after 2-3 iterations. However, with Matrix Optimization set to Performance it seems to work correctly, even without restarting the MaRS. Please confirm whether we can go on with the study like that.

@eauerbach
Member

Yes, Matrix Optimization = Performance is recommended with the 64 channel coil.

@lfxnat
Author

lfxnat commented May 10, 2024 via email

@eauerbach
Member

Matrix optimization is channel compression, which combines non-overlapping coils early on in the reconstruction, reducing memory and computational requirements. For the 64-channel head coil we usually find that channel compression reduces the number of logical coils to 34 or so, which makes a significant difference in the reconstruction performance.

As I tried to describe above, the images are reconstructed one repetition at a time on the MaRS and then sent over the network to the host to be stored in the database. Normally there should only be a couple repetitions in memory at a time on the MaRS, since the memory is released after each repetition is sent on its way to the host.

But in XA versions (XA20/30/50/60) the host is unusually slow in receiving images. It is a serious fundamental architectural problem that we are working with Siemens to try to fix. It is absolutely a new issue with XA; it did not happen like this in VB/VD/VE.

I believe the problem here is the delay on the host computer in receiving images accumulates over the course of the scan, and the images pile up on the MaRS. Eventually the MaRS runs out of memory to hold them all and the reconstruction stops. Matrix optimization does not directly fix this problem, but it cuts the memory used by the reconstruction in half, leaving much more of a buffer on the MaRS.

My image reconstruction code is essentially identical across all versions (VB/VD/VE/XA). This problem only exists in XA, so the idea that it is due to a memory leak in the reconstruction on the MaRS does not make sense: we do not see this in VB/VD/VE, where in most cases the image reconstruction can keep up in real time.

One of the new problematic things that happens on the host in XA is that it apparently creates an additional resampled version of every volume on the fly, I guess for use in View&Go. This clearly must slow things down, but currently there is no way to disable it. It is plausible that this resampling is even slower for volumes acquired with an oblique rotation (e.g. if orthogonal images use a simple linear interpolation while obliques require a more complicated resampling).

@lfxnat
Author

lfxnat commented May 10, 2024 via email
