When running with F1.16xlarge on all FPGAs, PCIE access to one of them is stuck #656
Comments
Hi @NoamDualBird, Thank you for reaching out. In order for us to better investigate this, can you please provide more info about your application?
In addition, as you're using the small shell, it's unclear to me why the XDMA driver is used, since there is no DMA engine in the shell. (This is not directly related to the timeout issue, but I just want to call it out.) Thanks, Chen |
Hi Chen,
Regarding the XDMA driver: it is needed mainly to enable the registers in the shell interrupt controller (otherwise MSI-X does not work). We do not use it for DMA. Since we had to use it to enable interrupts, we also use its devices (_user/_events) to map BAR0 and interrupts to userspace. We were wondering whether there may be an issue in the shell when the CL AW/W channel works with burst sizes that are larger than the PCIe MTU. If the PCIe interface receives backpressure, it may be propagated to the AW/W channel multiple times during a single transaction, so the timeout is reached although the protocol wasn't violated. Thanks, Noam |
Hi Noam, Thank you for the details about your application and use of the XDMA driver. A burst size exceeding the PCIe MTU isn't problematic, as packets will be automatically fragmented. However, multiple cards sending simultaneous bursts to the host may exceed the hardware interface's bandwidth capacity, causing back pressure to the CL and triggering AW/W channel timeouts. If reducing traffic request frequency isn't a preferred option, please consider implementing staggered transmissions from the cards to see if this helps minimize the peak BW spikes and potentially eliminate the timeout issues. Hope this helps. Chen |
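The staggering idea suggested above can be illustrated with a small software model. This is only a sketch under stated assumptions: the function names, the period, and the burst duration are hypothetical, and a real design would implement the offsets in the CL logic or the host scheduler rather than in Python.

```python
def staggered_offsets(num_cards: int, period_us: float) -> list[float]:
    """Give each card a start offset so bursts are spread evenly over one period.

    Hypothetical helper: num_cards and period_us are illustration parameters,
    not values taken from the shell or the driver.
    """
    if num_cards <= 0:
        raise ValueError("num_cards must be positive")
    slot = period_us / num_cards
    return [card * slot for card in range(num_cards)]


def peak_concurrent_bursts(offsets: list[float], burst_us: float, period_us: float) -> int:
    """Return the largest number of bursts that overlap in time.

    Every card is assumed to send one burst of burst_us per period. The peak
    overlap of periodic intervals always occurs at the start of some burst,
    so only those instants are checked.
    """
    peak = 0
    for t in offsets:
        active = sum(1 for o in offsets if (t - o) % period_us < burst_us)
        peak = max(peak, active)
    return peak


if __name__ == "__main__":
    period_us, burst_us, cards = 80.0, 10.0, 8
    aligned = [0.0] * cards
    spread = staggered_offsets(cards, period_us)
    print("aligned peak:", peak_concurrent_bursts(aligned, burst_us, period_us))
    print("staggered peak:", peak_concurrent_bursts(spread, burst_us, period_us))
```

In this model, staggering only helps when the per-card duty cycle leaves room for it; if each burst is longer than `period_us / num_cards`, some overlap remains.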
Hi Chen, Thank you for your response. We will implement all available mechanisms to avoid this timeout; however, I still think there's an issue here.
Thanks, |
Hi Noam, The timeout protects the shell and facilitates debugging if, for example, a misbehaving AXI master continuously drives the AW/W channels without properly terminating the transactions. Your application is experiencing the timeout for a totally different reason. However, I doubt the issue is related to the PCIe MTU: your tests on 1 or 2 FPGAs would have shown similar problems if that were the case. It's more probable that your application is encountering a system bottleneck that occurs when the peak bandwidth from all FPGAs exceeds the system's capacity. The crash needs more investigation. A timeout shouldn't cause a system crash, because the shell should return OKAY even in a timeout event. Thanks, Chen |
Hi Chen, Please advise how you propose to further investigate the system crash. Thanks |
Hi Noam, Please first check whether any critical error is reported by the kernel or the driver. Thanks, |
Hi Chen, There are no errors in the kernel or the drivers at all. The only indication is in the shell status. Thanks |
Hi Chen, This is the same setup as the previous case, but with an f2.48xlarge instance (small shell, XDMA driver). Error: In addition to the above data we also noticed some other errors: Our internal debug shows the same issue as the one on F1: the timeout is caused by the shell logic in response to external backpressure, even though our custom logic is AXI spec compliant. If this problem is similar to the previous one from F1, then to speed up this debug I'll state that I believe there's a bug in the shell logic HW. When backpressure is applied from the PCIe interface, the shell logic should buffer a whole AXI write transaction in order to close the transaction gracefully. The timeout counter should indicate an underrun from the master side, which is not the case here. Thanks |
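Please note that the kernel changed the PCIe Max Payload Size on all FPGA devices (from 512 to 128 bytes). It looks like, since the PCIe switch that is connected to all the FPGAs exposes a max capability of 128B MTU, the kernel changed the MTU of the devices to match the whole hierarchy (but the shell is not aware)? In addition, we also tried working with a 64B MTU and the system still failed with the same timeout (pcim-axi-protocol-error=1, pcim-axi-protocol-wchannel-error=1). Thanks |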
Just a note that for PCIe VFs, the MPS doesn’t matter, as the PF MPS overrides it and is the one that is actually set in the design.
On Monday, January 6, 2025 at 3:02 AM, NoamDualBird wrote:
Please note that the kernel changed the PCI MAX payload on all FPGA devices:
[ 2.099447] pci 0000:9f:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.100270] pci 0000:9f:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.103184] pci 0000:a1:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.103997] pci 0000:a1:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.106930] pci 0000:a3:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.107746] pci 0000:a3:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.110710] pci 0000:a5:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.111527] pci 0000:a5:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.123475] pci 0000:ae:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.124277] pci 0000:ae:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.127231] pci 0000:b0:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.128025] pci 0000:b0:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.131001] pci 0000:b2:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.131798] pci 0000:b2:00.1: Max Payload Size set to 128 (was 512, max 1024)
[ 2.134771] pci 0000:b4:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.135562] pci 0000:b4:00.1: Max Payload Size set to 128 (was 512, max 1024)
It looks like, since the PCIe switch that is connected to all the FPGAs exposes a max capability of 128B MTU, the kernel changed the MTU of the devices to match the whole hierarchy (but the shell is not aware)?
In addition, we also tried working with a 64B MTU and the system still failed with the same timeout:
pcim-axi-protocol-error=1
pcim-axi-protocol-wchannel-error=1
Thanks
Noam
|
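For readers who want to check their own instance, the kernel messages quoted above can be summarized with a short script. This is a rough sketch: the regular expression is written only against the log format shown in this thread, and `tlps_per_burst` is a hypothetical helper that assumes an aligned write whose payload is split into full MPS-sized packets. The 4KB figure is the typical burst size mentioned in the issue description.

```python
import re

# Matches lines of the form quoted in this thread, e.g.
# "pci 0000:9f:00.0: Max Payload Size set to 128 (was 512, max 1024)"
MPS_RE = re.compile(
    r"pci (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Max Payload Size set to (?P<new>\d+) \(was (?P<old>\d+), max (?P<max>\d+)\)"
)


def parse_mps_changes(dmesg_text: str) -> dict[str, tuple[int, int, int]]:
    """Map each PCI function to (old MPS, new MPS, max supported MPS) in bytes."""
    changes = {}
    for m in MPS_RE.finditer(dmesg_text):
        changes[m["bdf"]] = (int(m["old"]), int(m["new"]), int(m["max"]))
    return changes


def tlps_per_burst(burst_bytes: int, mps_bytes: int) -> int:
    """Number of write packets needed for one burst (ceiling division).

    Simplification: assumes an aligned burst and ignores header overhead.
    """
    return -(-burst_bytes // mps_bytes)


SAMPLE = """\
[ 2.099447] pci 0000:9f:00.0: Max Payload Size set to 128 (was 512, max 1024)
[ 2.100270] pci 0000:9f:00.1: Max Payload Size set to 128 (was 512, max 1024)
"""

if __name__ == "__main__":
    for bdf, (old, new, _max) in parse_mps_changes(SAMPLE).items():
        print(bdf, f"{old} -> {new} bytes,",
              tlps_per_burst(4096, new), "packets per 4KB burst")
```

Under these assumptions a 4KB burst becomes 32 packets at a 128-byte MPS and 64 packets at 64 bytes, compared with 8 at the original 512 bytes.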
That's interesting. Do you have any solution to the broken system? |
Hi Noam, The shell should not buffer the transaction because, from the shell's perspective, there is no guarantee that the data source in the CL will terminate the traffic properly. If the shell interfaces with malfunctioning logic that keeps driving a data interface, the internal buffer would overflow and the shell must terminate the traffic anyway. What is the traffic BW that causes this timeout issue on the PCIM interface? Is it still the case that the problem disappears after you lower the aggregate BW, just like on F1? If you reload the AFI, does that bring the slot back to its operational state? Thanks, Chen |
Hi Chen, Thank you for your reply. Regarding the buffering I suggested: I meant buffering of a single legal transaction, which doesn't exceed the AXI MTU. In case of an illegal AXI transaction, of course the shell must abort and proceed to the error flow. Sorry if I wasn't clear on the subject. After the previous issue with F1 we inserted rate limiters into our design to try to mitigate this issue. We are currently set to a limit of ~6GB/s. We observe an average BW of ~1GB/s with short peaks of ~6GB/s. On F1 this, in conjunction with setting the internal AXI MTU equal to the PCIe MTU, solved the issue. We don't see the same effect here. We will try to limit our system further to see if it helps. Regarding AFI reload: yes, it's operational again; however, it will hit the timeout again in the next run. I believe we have to address 2 problems here. One is technical: the timeout counter and whether it's counting correctly. The second is more concerning: how come the FPGA receives such backpressure from the PCIe interface? From my understanding, a healthy system shouldn't get 100us of backpressure from the PCIe interface on every run. This implies that even if the FPGA does not get into the timeout state, we're facing an unexpected performance degradation. Thanks |
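To put the figures from this comment in perspective, here is a back-of-the-envelope calculation. It is only a sketch: it assumes 1 GB = 10^9 bytes, takes the ~6GB/s peak, ~1GB/s average, and 100us stall from the comment above, and uses 8 FPGAs and 4KB bursts as mentioned in the issue description. The function names are hypothetical.

```python
GB = 10**9  # assumption: decimal gigabytes


def stalled_bytes(rate_bytes_per_s: int, stall_us: int) -> int:
    """Bytes a source running at the given rate would produce during a stall."""
    return rate_bytes_per_s * stall_us // 1_000_000


def aggregate_rate(num_fpgas: int, per_fpga_bytes_per_s: int) -> int:
    """Combined rate if every FPGA runs at the same rate at the same time."""
    return num_fpgas * per_fpga_bytes_per_s


if __name__ == "__main__":
    peak, average, stall_us, burst = 6 * GB, 1 * GB, 100, 4096
    backlog = stalled_bytes(peak, stall_us)
    print("backlog during a 100us stall at peak rate:", backlog, "bytes")
    print("equivalent 4KB bursts:", backlog // burst)
    print("aggregate average, 8 FPGAs:", aggregate_rate(8, average) // GB, "GB/s")
    print("aggregate peak if all 8 coincide:", aggregate_rate(8, peak) // GB, "GB/s")
```

The gap between the aggregate average and the coincident aggregate peak is what the rate limiting and staggering discussed in this thread are trying to close; whether the host side can absorb either figure is not stated in the thread.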
When we run our workload on 1 or 2 FPGAs we do not have any issues, but when we try to run on 4 or 8 FPGAs
we usually get an indication of a shell PCI master timeout error in one of the FPGA slots during high-bandwidth DMA.
our setup:
From our internal debug this is what we see:
Our PCI AXI master (CL) writes AXI transactions to the shell with a typical burst size of 4KB.
At some point we see that the shell reports a timeout error on the W channel (i.e. pcim-axi-protocol-wchannel-error).
After debugging it, we see that there is indeed a timeout violation between some WDATA transfers,
but this violation occurs because WREADY is de-asserted during this period (while WVALID is asserted).
As a result of the WREADY backpressure, the CL can’t complete the transaction within the timeout period.
Some time after the timeout occurs, all writes and reads from the FPGA towards PCI are stuck, including interrupts.
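The behaviour described above can be reproduced in a toy cycle-level model. This is not the shell's actual implementation, which is not shown in this thread; the function name, the `count_slave_stalls` switch, and the 8-cycle limit are all hypothetical and exist only to contrast a counter that runs during any gap between W beats with one that runs only when the master fails to supply data.

```python
def first_timeout_cycle(trace, limit, count_slave_stalls=True):
    """Return the cycle index at which the idle counter reaches `limit`, or None.

    trace: sequence of (wvalid, wready) pairs, one per clock cycle, for a
           single open write burst.
    count_slave_stalls: if True, cycles with WVALID high and WREADY low also
           advance the counter (the behaviour reported in this issue); if
           False, only cycles with WVALID low (a master underrun) do.
    """
    idle = 0
    for cycle, (wvalid, wready) in enumerate(trace):
        if wvalid and wready:
            idle = 0  # a data beat was transferred, restart the counter
            continue
        master_stalling = not wvalid
        if master_stalling or count_slave_stalls:
            idle += 1
        if idle >= limit:
            return cycle
    return None


if __name__ == "__main__":
    # Compliant master: WVALID stays high while WREADY drops for 10 cycles.
    backpressure = [(True, True)] * 4 + [(True, False)] * 10 + [(True, True)] * 4
    print(first_timeout_cycle(backpressure, limit=8))
    print(first_timeout_cycle(backpressure, limit=8, count_slave_stalls=False))
```

In this model, the first policy flags the compliant master as soon as the WREADY stall exceeds the limit, while the second policy never fires on that trace and only reacts to a genuine master underrun.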