PCIe Root Complex deadlock by PCIe Endpoint device


Is it possible for a PCIe Endpoint device, with a high volume of Ingress and Egress traffic between the device and Host DRAM, to cause a deadlock situation at the Root Complex, i.e. where the acceptance by the Host of ingress traffic from the device will be blocked until egress traffic from the Host to the same device makes forward progress, even though the ingress and egress data streams are relatively independent?

Consider a case where a DMA engine, independent of and separate from the Endpoint device, is transferring such a high volume of traffic from Host DRAM to the device (MMIO space) that PCIe flow-control credits along the downstream path to the device are depleted, stalling egress traffic until credits become available. Per the PCIe protocol, credits should eventually become available and egress (downstream) traffic should progress. My question is whether this egress/downstream blocking could impact the (independent) ingress/upstream traffic from the same device destined for Host DRAM. Would that ingress traffic be blocked at the Root Complex because of backed-up egress transactions to the same device that have not yet been retired?
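To make the scenario concrete, here is a minimal toy sketch in Python of the dependency I am worried about. It is not a model of any real Root Complex or Endpoint: `CreditedLink`, `simulate`, and the `host_couples_directions` flag are hypothetical names, each direction has a single credit pool (real PCIe tracks posted, non-posted, and completion credits separately, for headers and data), and both the "host refuses upstream writes while a downstream write is stalled" rule and the "device stops draining when it has no upstream credit" rule are assumptions made only to show how a circular wait could form.

```python
from collections import deque


class CreditedLink:
    """Toy model of one direction of a link with credit-based flow control."""

    def __init__(self, credits):
        self.credits = credits    # credits currently available to the sender
        self.rx_buffer = deque()  # TLPs held by the receiver, not yet drained

    def try_send(self, tlp):
        """Send one TLP if a credit is available; return False if stalled."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.rx_buffer.append(tlp)
        return True

    def drain_one(self):
        """Receiver consumes one TLP and returns its credit to the sender."""
        if not self.rx_buffer:
            return None
        self.credits += 1
        return self.rx_buffer.popleft()


def simulate(host_couples_directions, steps=50):
    """Return (upstream writes accepted by host, downstream writes drained)."""
    down = CreditedLink(credits=4)  # host -> device
    up = CreditedLink(credits=4)    # device -> host
    accepted_up = 0
    delivered_down = 0
    for step in range(steps):
        # Host/DMA engine pushes one downstream write per step.
        downstream_stalled = not down.try_send(("MWr-down", step))
        # Device pushes one upstream write per step (if it has a credit).
        up.try_send(("MWr-up", step))
        # ASSUMPTION: a "coupled" host refuses upstream writes while a
        # downstream write is stalled waiting for credits.
        if not (host_couples_directions and downstream_stalled):
            if up.drain_one() is not None:
                accepted_up += 1
        # ASSUMPTION: the device is slow (drains every other step) and
        # stops draining entirely once it has no upstream credit left.
        if up.credits > 0 and step % 2 == 0:
            if down.drain_one() is not None:
                delivered_down += 1
    return accepted_up, delivered_down


if __name__ == "__main__":
    print("independent directions:", simulate(False))
    print("coupled directions:    ", simulate(True))
```

With independent directions, the downstream side stalls repeatedly on credits but the upstream counter keeps growing every step. With the hypothetical coupling enabled, both counters stop increasing after a few steps, because each side ends up waiting on the other. That circular wait is the situation I am asking about.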

I think the question comes down to the transition from the Host coherent domain, where there are strict ordering rules, to the PCIe I/O (Root Complex) domain, where the ordering is not quite as strict.

What guarantees that such a deadlock will not occur, i.e. that both the ingress/upstream and egress/downstream flows will be able to make progress? Is it even accurate to postulate that ingress traffic would be impacted at all by backed-up egress traffic to the same device?

I have exercised such traffic to a device and, on some Intel Hosts, have encountered CPU CATERR (catastrophic error) failures, which result in the Host crashing. I believe that the CPU, which may itself be doing PIO write operations to the same device while this high volume of ingress/egress traffic is taking place, runs into a stall, and the CPU transactions time out, resulting in the CATERR.

The problem seems to require a sufficient volume of traffic that the PCIe flow-control credits become fully depleted (this has been observed with a TLP analyzer). The same traffic has also been exercised on AMD CPU-based platforms, which exhibit only a Host crash/reboot, with no CATERR.

