What can cause a simulation hang?

A simulation hang is a condition in which the simulation stops making forward progress. From the outside, the only symptom is that nothing meaningful has happened for a long time.

A simulation hang can be caused by RTL bugs, testbench (TB) issues, or C-model errors.

RTL Bugs

A combinational loop in RTL can prevent simulation time from advancing. When a combinational loop exists, the “always_comb” block re-triggers itself continuously within the same time step, much like a ring oscillator, so the simulator never moves on to the next time step. Modern RTL lint tools can easily identify combinational loops, so RTL designers should always run lint checks.
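To illustrate why time cannot advance, here is a minimal Python sketch of a delta-cycle settle loop. It is a toy model and not any simulator's actual algorithm; the names `settle` and `MAX_DELTAS` and the dictionary-of-lambdas representation of combinational blocks are assumptions made for this example. A real simulator without an iteration limit would spin forever where this sketch raises an error.

```python
MAX_DELTAS = 1000  # hypothetical iteration limit, for illustration only


def settle(signals, comb_blocks, max_deltas=MAX_DELTAS):
    """Re-evaluate combinational blocks until no signal changes.

    Returns the number of delta cycles used, or raises RuntimeError if
    the signals never stabilize within max_deltas iterations.
    """
    for delta in range(1, max_deltas + 1):
        changed = False
        for target, func in comb_blocks.items():
            new_value = func(signals)
            if new_value != signals[target]:
                signals[target] = new_value
                changed = True
        if not changed:
            return delta
    raise RuntimeError("signals did not settle: possible combinational loop")


# A well-formed block: y depends only on the input a.
ok_signals = {"a": 0, "y": 0}
ok_blocks = {"y": lambda s: 1 - s["a"]}

# A combinational loop: y is its own inverted input, like a ring oscillator.
loop_signals = {"y": 0}
loop_blocks = {"y": lambda s: 1 - s["y"]}
```

Calling `settle(ok_signals, ok_blocks)` returns after two delta cycles, while `settle(loop_signals, loop_blocks)` never stabilizes and raises `RuntimeError`.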

A simulation hang may be due to an RTL deadlock. FSM bugs, such as a missing transition condition, can leave the design stuck in one state. A credit leak in credit-based flow control will eventually result in deadlock, because the sender runs out of credits that are never returned.
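The credit-leak case can be sketched with a toy Python model of a credit-based link. The function `run_link`, its parameters, and the "drop every Nth credit return" bug are all hypothetical, chosen only to show how a slow leak ends in a stalled sender.

```python
def run_link(total_packets, credits, leak_every=None, max_cycles=10_000):
    """Toy credit-based link: send one packet per cycle when a credit is free.

    The receiver returns one credit per packet consumed. If leak_every is
    set, every leak_every-th credit return is dropped (the hypothetical bug).
    Returns (packets_sent, cycles_used, hung).
    """
    sent = 0
    returns = 0
    for cycle in range(1, max_cycles + 1):
        if sent == total_packets:
            return sent, cycle - 1, False
        if credits > 0:
            credits -= 1      # sender spends a credit
            sent += 1
            returns += 1      # receiver consumes the packet in the same cycle
            if leak_every is None or returns % leak_every != 0:
                credits += 1  # credit arrives back at the sender
        # With zero credits the sender waits, and nothing can restore them.
    return sent, max_cycles, sent < total_packets
```

With four initial credits and no leak, `run_link(100, 4)` returns `(100, 100, False)`. With `leak_every=10`, one credit is lost per ten packets, so the sender stalls after 40 packets and `run_link(100, 4, leak_every=10)` returns `(40, 10000, True)`.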

A simulation hang may also be due to an RTL livelock. In a livelock, the design is not stuck in a single state as in a deadlock, and time continues to advance, but no meaningful progress is ever made. Livelocks are commonly seen in Network-on-Chip (NoC) simulations when the network routing algorithm is not implemented properly.
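One rough way to tell the two apart when triaging is to sample both the design state and a progress counter over a window. The sketch below is a heuristic written for illustration; `classify_hang` and the `(state, completed_count)` sample format are assumptions, not part of any tool.

```python
def classify_hang(trace):
    """Classify a window of (state, completed_count) samples.

    Returns "progress" if completed_count grew over the window,
    "deadlock" if the state never changed, and "livelock" if the
    state kept changing without any completed work.
    """
    if len(trace) < 2:
        raise ValueError("need at least two samples")
    states = [state for state, _ in trace]
    done = [count for _, count in trace]
    if done[-1] > done[0]:
        return "progress"
    if len(set(states)) == 1:
        return "deadlock"
    return "livelock"
```

For example, a packet that keeps hopping between routers without ever being delivered, such as `[("N", 0), ("E", 0), ("S", 0), ("W", 0)]`, is classified as a livelock, while `[("WAIT", 3)] * 5` is classified as a deadlock.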

TB Issues

To model the physical characteristics of logic gates, interconnects, and SRAMs, RTL designs and VIPs often have delays associated with them. If the TB does not set the time precision properly, a module's delays can round to zero, leading to a zero-delay loop in simulation. For example, if the TB defines the timescale as “1ns/1ps”, a delay of “#0.01” specifies a latency of 10ps. However, if the TB uses “1ns/1ns” or “1ps/1ps”, the same delay is rounded to 0: 0.01ns is below the 1ns precision in the first case, and 0.01ps is below the 1ps precision in the second. In addition, certain simulators offer switches that simulate modules in zero-delay mode. Blindly enabling such a switch can lead to simulation hangs.
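The rounding arithmetic above can be checked with a small Python sketch. The helper `effective_delay_ps` and the `UNIT_IN_PS` table are hypothetical names for this example, only the two units used in the text are supported, and rounding ties away from zero is an assumption of the sketch.

```python
from decimal import Decimal, ROUND_HALF_UP

# Only the units that appear in the text; extend as needed.
UNIT_IN_PS = {"1ps": Decimal(1), "1ns": Decimal(1000)}


def effective_delay_ps(delay, time_unit, time_precision):
    """Return the delay in picoseconds after rounding to the time precision.

    delay is the literal written after '#', given as a string such as "0.01".
    Rounding half away from zero is an assumption of this sketch.
    """
    delay_ps = Decimal(delay) * UNIT_IN_PS[time_unit]
    step_ps = UNIT_IN_PS[time_precision]
    steps = (delay_ps / step_ps).quantize(Decimal(1), rounding=ROUND_HALF_UP)
    return int(steps * step_ps)


print(effective_delay_ps("0.01", "1ns", "1ps"))  # 10
print(effective_delay_ps("0.01", "1ns", "1ns"))  # 0
print(effective_delay_ps("0.01", "1ps", "1ps"))  # 0
```

The delay literal is passed as a string so that `Decimal` keeps it exact and binary floating-point error cannot affect the rounding.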

Simulation hangs can happen with complex random constraints. DV engineers should determine which constraint is the root cause by checking whether the constraint solver is stuck on a particular partition of a randomize call. Modern simulators often offer run command switches that enable constraint debug features.

Race conditions can cause simulation hangs. A race condition happens when the execution order between two or more concurrent processes is not defined, so the simulator is free to execute them in any order. One example is a race between two independent blocks of force-release statements: the force and release in these two blocks can execute in arbitrary order. Another example is using blocking assignments to model sequential logic, or nonblocking assignments to model combinational logic. Either can lead to race conditions in which multiple processes read and write the same variable in the same SystemVerilog event region.
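The order dependence behind such races can be modeled in a few lines of Python. This is a simplified sketch and not SystemVerilog scheduling semantics: `run_blocking` and `run_nonblocking` are hypothetical helpers, and each process is reduced to a `(target, source)` pair that copies one variable into another on the same clock edge.

```python
from itertools import permutations


def run_blocking(state, processes):
    """Each process reads and writes the shared state immediately."""
    state = dict(state)
    for target, source in processes:
        state[target] = state[source]
    return state


def run_nonblocking(state, processes):
    """All right-hand sides are sampled first, then all updates are applied."""
    state = dict(state)
    sampled = [(target, state[source]) for target, source in processes]
    for target, value in sampled:
        state[target] = value
    return state


# Two-stage shift register: b <- a and c <- b on the same clock edge.
initial = {"a": 1, "b": 0, "c": 0}
processes = [("b", "a"), ("c", "b")]

# Collect the distinct final states over every possible execution order.
blocking_results = {tuple(sorted(run_blocking(initial, order).items()))
                    for order in permutations(processes)}
nonblocking_results = {tuple(sorted(run_nonblocking(initial, order).items()))
                       for order in permutations(processes)}
```

Here `blocking_results` contains two different final states (`c` ends up 1 or 0 depending on which process runs first), while `nonblocking_results` contains exactly one. The first case is the race: the outcome depends on an ordering the simulator is free to choose.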

Similar to RTL, deadlock can happen in the TB when one process waits on a shared resource held by another process that is itself waiting. It is important to use appropriate locking mechanisms, e.g., semaphores, to protect shared resources from concurrent access. In addition, to detect and debug deadlock conditions, DV engineers should consider:

Implementing a timeout mechanism when acquiring locks. If a lock cannot be acquired within the timeout period, the participating thread should give up and retry later, or
Using a UVM heartbeat scheme to terminate the simulation after a specified timeout period. A heartbeat scheme can exit the simulation gracefully, or
Writing safety assertions with a large time window, e.g., some event is expected to happen within X cycles.
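The first option, acquiring a lock with a timeout and retrying, can be sketched in Python as follows. Python's `threading.Lock` stands in for a TB semaphore here; `acquire_with_retry` and its parameters are hypothetical names for this example, and a real TB would express the same idea in its own language.

```python
import threading
import time


def acquire_with_retry(lock, timeout_s, retries, backoff_s=0.0):
    """Try to take the lock, giving up after each timeout and retrying.

    Returns True with the lock held, or False if every attempt timed out.
    """
    for _ in range(retries):
        if lock.acquire(timeout=timeout_s):
            return True
        time.sleep(backoff_s)  # back off before the next attempt
    return False


shared = threading.Lock()
if acquire_with_retry(shared, timeout_s=0.01, retries=3):
    try:
        pass  # access the shared resource here
    finally:
        shared.release()
else:
    print("lock not acquired: report a possible deadlock instead of hanging")
```

Because every attempt is bounded, a thread that cannot get the lock returns control to the caller, which can then log the failure or end the test rather than wait forever.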

C-model Errors

Occasionally, a simulation hang may be caused by a C-model that interfaces with the simulator. One can use OSS (Oracle Solaris Studio) software to profile and debug it. By gathering data about function execution and memory usage, OSS can pinpoint where the C-model is spending too much time or is stuck in a loop.