How to Boost Emulation Performance & Efficiency?

Emulation is commonly used in hardware verification and validation. Unlike simulation, where all RTL/hardware activity is modeled in the software domain, emulation runs on both software and hardware. The table below compares RTL/hardware simulation with emulation:

Aspect        | Simulation                                               | Emulation
--------------|----------------------------------------------------------|------------------------------------------------------
Speed         | Slow                                                     | Often ~1000x faster than simulation
Cost          | Lower                                                    | Higher; professional branded emulators are very expensive
Build time    | Short (minutes)                                          | Long (hours)
Flexibility   | Flexible testbench and software code                     | Code must be synthesizable
Debuggability | High; all RTL signals are visible to the design engineer | Lower; debugging often requires more effort

Both RTL simulation and emulation have their own strengths and weaknesses. In the early design phase, simulation is better at finding RTL bugs; at later stages, where larger-scale or longer-running tests are required, emulation comes in handy.

Typically, the testbench and software code run on the host machine, while the RTL code is mapped to and runs on the emulator. The host machine and the emulator exchange data via the SystemVerilog Direct Programming Interface (DPI):

Host Machine  ← → DPI ← → Emulator

Every data exchange carries a time penalty, and this overhead often becomes the bottleneck of overall emulation efficiency. In addition, as in simulation, the testbench code running on the host machine is far slower than the emulator, so the emulator spends time waiting on the host, which drags down hardware utilization on the emulator side.

To boost emulation performance and efficiency, one should reduce unnecessary data exchange, shift more of the work to the emulator side, or both.
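A back-of-the-envelope Python model makes this concrete. It assumes, purely for illustration, that the host and emulator run in lock-step: the emulator stalls during every sync and while host testbench code runs, so wall-clock time is emulator run time plus a fixed cost per sync plus host time. The class `EmulationRun` and every parameter value are hypothetical, not measurements of any real emulator.

```python
from dataclasses import dataclass


@dataclass
class EmulationRun:
    """Hypothetical lock-step cost model of one emulation test."""
    dut_cycles: int          # clock cycles the design must execute
    emu_hz: float            # emulator clock rate in Hz (assumed value)
    num_syncs: int           # number of Host <-> Emulator data exchanges
    sync_overhead_s: float   # assumed fixed time penalty per exchange, seconds
    host_work_s: float       # time the emulator stalls on host testbench code

    def wall_time_s(self) -> float:
        emu_time = self.dut_cycles / self.emu_hz
        return emu_time + self.num_syncs * self.sync_overhead_s + self.host_work_s

    def emulator_utilization(self) -> float:
        # Fraction of wall-clock time the emulator is actually advancing the design.
        return (self.dut_cycles / self.emu_hz) / self.wall_time_s()


if __name__ == "__main__":
    baseline = EmulationRun(dut_cycles=10_000_000, emu_hz=1e6,
                            num_syncs=1_000_000, sync_overhead_s=50e-6,
                            host_work_s=30.0)
    # Same test with 100x fewer syncs and half of the host work moved to hardware.
    optimized = EmulationRun(dut_cycles=10_000_000, emu_hz=1e6,
                             num_syncs=10_000, sync_overhead_s=50e-6,
                             host_work_s=15.0)
    for name, run in (("baseline", baseline), ("optimized", optimized)):
        print(f"{name}: {run.wall_time_s():.1f} s, "
              f"utilization {run.emulator_utilization():.0%}")
```

With these made-up numbers, the optimized run drops from 90 s to 25.5 s of wall time, and emulator utilization rises from about 11% to about 39%, even though the design executes the same number of cycles.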

For example, when emulating the system fabric, instead of performing a Host ← → Emulator sync for every response, one may move part of the response handling to the hardware side and reduce the sync frequency. Sometimes it is even possible to implement the entire memory model in hardware and eliminate Host ← → Emulator syncs entirely.
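The effect of these options on sync count can be tallied with a small Python sketch. The strategy names, the `syncs_needed` helper and the batch size are hypothetical illustrations. It assumes a batched design syncs once per `batch_size` responses, and that a fully hardware memory model needs no syncs while the test runs.

```python
from enum import Enum


class ResponseHandling(Enum):
    """Hypothetical strategies for handling fabric responses in emulation."""
    HOST_PER_RESPONSE = "host handles every response"
    HW_BATCHED = "hardware handles responses, host syncs once per batch"
    HW_MEMORY_MODEL = "entire memory model implemented in hardware"


def syncs_needed(strategy: ResponseHandling, num_responses: int,
                 batch_size: int = 1) -> int:
    """Count Host <-> Emulator syncs needed while the test runs."""
    if strategy is ResponseHandling.HOST_PER_RESPONSE:
        return num_responses                       # one sync per response
    if strategy is ResponseHandling.HW_BATCHED:
        if batch_size < 1:
            raise ValueError("batch_size must be >= 1")
        return -(-num_responses // batch_size)     # ceiling division
    # HW_MEMORY_MODEL: responses never leave the emulator, so no runtime
    # syncs (any one-time preload of memory contents is not counted here).
    return 0


if __name__ == "__main__":
    n = 1_000_000
    for strategy in ResponseHandling:
        print(f"{strategy.value}: {syncs_needed(strategy, n, batch_size=256)} syncs")
```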

Another example is implementing the data checkers on the hardware side instead of in the testbench. One may load a large chunk of expected data into emulator memory at once (a single Host ← → Emulator sync), implement the data checker in hardware, and let the test run. Once the emulator has consumed all the expected data without finding a mismatch, the next chunk is loaded. This approach also improves debuggability: as soon as a data mismatch is detected, the test can be stopped and a waveform dump can be generated automatically around the failing timestamp.
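The checker flow can be modeled behaviorally in Python. This is a sketch under stated assumptions: `ChunkedChecker` is a hypothetical name, the emulator-side memory is modeled as a Python list, and each chunk load is counted as one Host ← → Emulator sync. A real checker would be synthesizable RTL, not Python; the model only shows the control flow.

```python
from typing import Iterable, List, Optional, Sequence


class ChunkedChecker:
    """Behavioral model of a hypothetical on-emulator data checker.

    Expected data is loaded in chunks (one sync per chunk); the checker then
    compares DUT output locally until the chunk is consumed or a value differs.
    """

    def __init__(self, expected: Sequence[int], chunk_size: int) -> None:
        if chunk_size < 1:
            raise ValueError("chunk_size must be >= 1")
        self._expected = expected        # full expected stream, kept on the host
        self._chunk_size = chunk_size
        self._buffer: List[int] = []     # models the emulator-side memory
        self._next_load = 0              # host-side index of the next chunk
        self.syncs = 0                   # number of chunk loads performed

    def _load_next_chunk(self) -> None:
        start = self._next_load
        self._buffer = list(self._expected[start:start + self._chunk_size])
        self._next_load = start + len(self._buffer)
        self.syncs += 1

    def run(self, actual: Iterable[int]) -> Optional[int]:
        """Return the index of the first mismatch (wrong, extra, or missing
        value), or None if the DUT output matches the expected data exactly."""
        pos = 0        # read position within the emulator-side buffer
        checked = 0    # number of DUT values compared so far
        for value in actual:
            if pos == len(self._buffer):   # chunk consumed: fetch the next one
                self._load_next_chunk()
                pos = 0
                if not self._buffer:       # DUT produced more data than expected
                    return checked
            if value != self._buffer[pos]:
                return checked             # stop early; dump waves around here
            pos += 1
            checked += 1
        if checked < len(self._expected):  # DUT produced less data than expected
            return checked
        return None
```

For example, `ChunkedChecker(list(range(10)), chunk_size=4).run(range(10))` returns `None` after 3 chunk loads, instead of the 10 exchanges a per-value host-side check would need.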

The caveat is that loading data into the emulator consumes memory and gates, and emulator resources are limited. It is important to strike a good balance between the Host ← → Emulator sync frequency and the data chunk size.
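One simple way to explore this balance is to sweep candidate chunk sizes against a memory budget, as in the Python sketch below. It assumes, for illustration only, that a chunk's cost is just `chunk_words * word_bits` of emulator memory. Real gate and memory costs depend on the emulator and the checker design. `chunk_tradeoff`, `pick_chunk` and all numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple[int, int, int]  # (chunk_words, memory_bits, num_syncs)


def chunk_tradeoff(total_words: int, word_bits: int, mem_budget_bits: int,
                   candidates: Sequence[int]) -> List[Row]:
    """List the candidate chunk sizes that fit the memory budget, with the
    emulator memory each one needs and the number of chunk loads (syncs)."""
    rows: List[Row] = []
    for chunk_words in sorted(candidates):
        mem_bits = chunk_words * word_bits
        if chunk_words < 1 or mem_bits > mem_budget_bits:
            continue                                   # does not fit; skip
        num_syncs = -(-total_words // chunk_words)     # ceiling division
        rows.append((chunk_words, mem_bits, num_syncs))
    return rows


def pick_chunk(total_words: int, word_bits: int, mem_budget_bits: int,
               candidates: Sequence[int]) -> Row:
    """Fewest syncs first; among ties, the smallest memory footprint."""
    rows = chunk_tradeoff(total_words, word_bits, mem_budget_bits, candidates)
    if not rows:
        raise ValueError("no candidate chunk size fits the memory budget")
    return min(rows, key=lambda row: (row[2], row[1]))


if __name__ == "__main__":
    # Hypothetical numbers: 1000 expected 64-bit words, 4096-word budget.
    sizes = [256, 512, 1024, 2048]
    for row in chunk_tradeoff(1000, 64, 64 * 4096, sizes):
        print(row)
    print("picked:", pick_chunk(1000, 64, 64 * 4096, sizes))
```

Here a 2048-word chunk doubles the memory of a 1024-word chunk but saves no syncs, so the sketch picks 1024 words. Bigger chunks only pay off while they still reduce the sync count.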

References:

  1. https://en.wikipedia.org/wiki/SystemVerilog_DPI
  2. https://www.doulos.com/knowhow/systemverilog/systemverilog-tutorials/systemverilog-dpi-tutorial/
