Different RTL coding styles for the same functionality often result in different PPA (Power, Performance, Area). Let us dive into a few examples.
Use Register Enable Conditions
When coding data pipelines, it is good practice to always include an explicit enable condition, because the synthesis tool can infer clock gaters from these enable conditions.
For example, instead of writing something like this:
always_ff @(posedge clk)
  data_q <= data;
Code it like this:
always_ff @(posedge clk)
  if (data_en) data_q <= data;
If the “data” bus is wide, a single inferred clock gater shuts off the clock to many flops at once, which saves a significant amount of dynamic power.
Minimize Signal Toggling When Not Needed
Reducing unnecessary signal toggling is a common technique to save dynamic power. In addition to the register enable conditions described above, RTL designers sometimes use data gaters to hold a data path quiet while it is supposed to be idle.
For example, the inputs of a multiplier may change every cycle even though it only produces a valid result every few cycles. When the multiplier is not doing any meaningful work, RTL designers can gate its inputs so that the multiplication circuitry is quiesced.
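A minimal sketch of such a data gater is shown below. The module name, port names, and the `mul_en` signal are hypothetical; the sketch assumes `mul_en` is high only in cycles where a product is actually needed, and it uses simple AND gating, which is only one of several ways to isolate operands.

```systemverilog
// Hypothetical example: AND-based data gating in front of a multiplier.
module gated_multiplier #(
  parameter int W = 16
) (
  input  logic           clk,
  input  logic           mul_en,   // assumed: high only when a result is needed
  input  logic [W-1:0]   a,
  input  logic [W-1:0]   b,
  output logic [2*W-1:0] prod_q
);
  logic [W-1:0] a_gated, b_gated;

  // Data gaters: force both operands to zero while the multiplier is idle,
  // so toggling on a/b does not propagate into the multiplier array.
  assign a_gated = a & {W{mul_en}};
  assign b_gated = b & {W{mul_en}};

  // The result register uses the same enable, so a clock gater can be inferred.
  always_ff @(posedge clk)
    if (mul_en) prod_q <= a_gated * b_gated;
endmodule
```

AND gating still causes one transition each time `mul_en` changes. Holding the previous operand values, for example with enabled input registers, avoids that at the cost of extra storage.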
Shift Registers without Actual Shifting
A shift register is a common circuit for maintaining a finite window of a data sequence, updated cycle by cycle. Instead of shifting data stage by stage, consider keeping the data in place and using data muxing. See diagram below:

Scheme b) does not shift data from stage to stage, so it saves dynamic power compared to Scheme a).
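One possible way to realize the muxing approach is a small circular buffer with a write pointer. The sketch below is an illustration under assumptions, not necessarily the exact circuit in the diagram: the module name, ports, and the `age` read index are hypothetical, and `DEPTH` is assumed to be a power of two (at least 2) so the pointers wrap naturally.

```systemverilog
// Hypothetical example: a data window kept in place and read through a mux.
module window_buffer #(
  parameter  int W     = 8,
  parameter  int DEPTH = 4,             // assumed power of two, >= 2
  localparam int PW    = $clog2(DEPTH)
) (
  input  logic          clk,
  input  logic          rst_n,
  input  logic          push,   // a new sample arrives this cycle
  input  logic [W-1:0]  din,
  input  logic [PW-1:0] age,    // 0 selects the newest stored sample
  output logic [W-1:0]  dout
);
  logic [W-1:0]  mem [DEPTH];
  logic [PW-1:0] wr_ptr_q;
  logic [PW-1:0] rd_idx;

  // Only the pointer is reset; it advances by one entry per push.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)    wr_ptr_q <= '0;
    else if (push) wr_ptr_q <= wr_ptr_q + PW'(1);

  // Exactly one entry is written per push; all other entries keep their value.
  always_ff @(posedge clk)
    if (push) mem[wr_ptr_q] <= din;

  // The newest sample sits just behind the write pointer (modulo DEPTH).
  assign rd_idx = wr_ptr_q - PW'(1) - age;
  assign dout   = mem[rd_idx];
endmodule
```

In a conventional shift register every stage can toggle on each push. Here only one entry and the pointer change, and the cost moves to the read mux.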
Use Proper Register Slice Type
In our book “Crack the Hardware Interview – Architecture & Micro-Architecture Design”, we discussed several types of register slices. A full slice breaks the timing paths on both the valid path and the ready path.
However, when only the valid path has a timing violation and the ready path has no issues, a forward slice needs only half of the data storage. Likewise, when only the ready path has a timing violation, a backward slice also needs only half of the data storage.
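As an illustration, a forward slice can be sketched as follows, assuming a standard valid/ready handshake. The module and port names are hypothetical and this is not necessarily the implementation described in the book.

```systemverilog
// Hypothetical example: forward register slice (valid/data registered,
// ready passed through combinationally).
module fwd_slice #(
  parameter int W = 32
) (
  input  logic         clk,
  input  logic         rst_n,
  input  logic         in_valid,
  output logic         in_ready,
  input  logic [W-1:0] in_data,
  output logic         out_valid,
  input  logic         out_ready,
  output logic [W-1:0] out_data
);
  // Ready is not registered: accept new data when the single entry
  // is empty or is being drained in this cycle.
  assign in_ready = !out_valid || out_ready;

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n)        out_valid <= 1'b0;
    else if (in_ready) out_valid <= in_valid;

  // A single data entry, written only on an accepted transfer.
  always_ff @(posedge clk)
    if (in_valid && in_ready) out_data <= in_data;
endmodule
```

This slice stores a single data entry. Breaking the ready path as well requires a second entry to absorb the data that arrives before the registered ready deasserts.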
Use Cache to Suppress Redundant SRAM Reads
Unlike reading from flops, reading from an SRAM always activates SRAM peripheral circuitry such as output buffers and sense amplifiers, so every SRAM read consumes dynamic power.
If RTL designers know beforehand that the SRAM read addresses follow a certain pattern, for example, that a few consecutive reads access the same SRAM address, they can use a cache to suppress the “redundant” SRAM reads.
When an SRAM address is read for the first time, the read data can be stored in a small cache. Subsequent reads to the same address can retrieve the data from the cache instead of triggering an actual SRAM read.
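The idea can be sketched with a single-entry cache in front of the SRAM read port. Everything here is an assumption for illustration: the module and signal names are hypothetical, the SRAM is assumed to return data one cycle after `sram_rd_en`, and the surrounding logic is assumed to assert `invalidate` whenever the cached address may have been written.

```systemverilog
// Hypothetical example: one-entry read cache that suppresses repeated
// SRAM reads to the same address.
module sram_read_cache #(
  parameter int AW = 10,
  parameter int DW = 64
) (
  input  logic          clk,
  input  logic          rst_n,
  input  logic          rd_en,
  input  logic [AW-1:0] rd_addr,
  input  logic          invalidate,    // assumed: set when cached data may be stale
  output logic [DW-1:0] rd_data,       // valid one cycle after rd_en
  output logic          sram_rd_en,    // to the SRAM (1-cycle read latency assumed)
  output logic [AW-1:0] sram_rd_addr,
  input  logic [DW-1:0] sram_rd_data
);
  logic          tag_valid_q, hit, hit_q, sram_rd_q;
  logic [AW-1:0] tag_q;
  logic [DW-1:0] cache_data_q;

  // A hit is a read to the address of the most recent SRAM read.
  assign hit          = rd_en && tag_valid_q && (rd_addr == tag_q);
  assign sram_rd_en   = rd_en && !hit;   // the SRAM is only activated on a miss
  assign sram_rd_addr = rd_addr;

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) begin
      tag_valid_q <= 1'b0;
      hit_q       <= 1'b0;
      sram_rd_q   <= 1'b0;
    end else begin
      hit_q     <= hit;
      sram_rd_q <= sram_rd_en;
      if (invalidate)      tag_valid_q <= 1'b0;
      else if (sram_rd_en) tag_valid_q <= 1'b1;
    end

  // Tag and data storage need no reset; tag_valid_q qualifies them.
  always_ff @(posedge clk) begin
    if (sram_rd_en) tag_q        <= rd_addr;
    if (sram_rd_q)  cache_data_q <= sram_rd_data;  // capture the returned data
  end

  // Response cycle: a hit returns cached data, a miss returns SRAM data.
  assign rd_data = hit_q ? cache_data_q : sram_rd_data;
endmodule
```

The cache only pays off when the power saved by suppressed reads exceeds the power of the added tag compare and data flops, so the hit rate of the expected access pattern matters.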
Remove Reset from Data Path Flops
Certain flops in the design, such as data storage flops, do not require a reset. Removing the reset from these flops reduces overall design area, because flops without a reset pin are typically smaller and need no reset routing.
However, non-resettable flops may cause DFT coverage loss or increase test time, as scan-based ASIC testing must explicitly initialize these flops during the scan shift-in phase.
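A common way to apply this is to reset only the control flops and let them qualify the data. The short sketch below uses hypothetical module and signal names and assumes that downstream logic only consumes `data_q` when `valid_q` is high.

```systemverilog
// Hypothetical example: reset on the control flop, no reset on the data flops.
module pipe_stage #(
  parameter int W = 64
) (
  input  logic         clk,
  input  logic         rst_n,
  input  logic         valid_d,
  input  logic [W-1:0] data_d,
  output logic         valid_q,
  output logic [W-1:0] data_q
);
  // Control path: must come out of reset in a known state.
  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) valid_q <= 1'b0;
    else        valid_q <= valid_d;

  // Data path: no reset; the value is only meaningful when valid_q is high.
  always_ff @(posedge clk)
    if (valid_d) data_q <= data_d;
endmodule
```

In simulation `data_q` starts as X, so the design must ensure that the X never propagates into control logic.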
Remove Unused RTL
Unused RTL should not be synthesized, as it wastes area. For example, RTL designers often write behavioral modeling code to support assertions, and such code should not become part of the real silicon.
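One common way to keep such code out of synthesis is to wrap it in a preprocessor guard. The sketch below assumes the synthesis flow defines a macro named `SYNTHESIS` (some tools predefine it, otherwise the flow's scripts must define it); the module, the counter, and the assertion are hypothetical.

```systemverilog
// Hypothetical example: assertion helper logic excluded from synthesis.
module req_tracker (
  input logic clk,
  input logic rst_n,
  input logic req,
  input logic ack
);
`ifndef SYNTHESIS
  // Behavioral model used only by the assertion below.
  int unsigned outstanding_q;

  always_ff @(posedge clk or negedge rst_n)
    if (!rst_n) outstanding_q <= 0;
    else        outstanding_q <= outstanding_q + 32'(req) - 32'(ack);

  // An ack must never arrive when no request is outstanding.
  a_no_underflow: assert property (
    @(posedge clk) disable iff (!rst_n) ack |-> (outstanding_q > 0)
  );
`endif
endmodule
```

Without the guard, a tool that ignores the assertion may still synthesize the 32-bit counter before pruning it as unloaded logic.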
There are several ways to detect unused RTL, for example:
- Use SpyGlass Lint: warnings like W120, W240, and FlopEConst flag unused variables
- Use JasperGold: its comprehensive structural lint check can flag dead code and unreachable states
- Use Design Compiler: DC uses “OPT-1206, 1207” to report constant or unloaded flops, and “ELAB-976, 982, 984, 985” to report unused always blocks; designers can also rely on DC’s final report to review unloaded flops and constant flops
- Use Conformal LEC: it can report unreachable endpoints, and designers should carefully review the report
- Use Formality: similar to Conformal LEC, it can report endpoints without fanout