

### **TIMING DESIGN**



TUD/EE ET4293 - digic - 1213 - © NvdM 06 Timing

13/03/28

# Outline

- Timing Design Background and Motivation
  - Delay variations, impact
  - Sequential circuits, synchronous design
  - Pipelining, metrics reminder
- The Clock Skew Problem
- Controlling Clock Skew
- Case Study

# Get basic appreciation of some system level design issues

# **Design of LARGE Integrated Circuits**

Correct signal

Logic value

- Right level (restoring logic, ...)
- At right place
  - Interconnect (R, C, L)
  - Busses
  - Off-chip drivers, and receivers
- At right time
  - How to cope with (uncertain) delay

# For Reference: IBM Power 7 Chip, 775 supercomputer




- 2010
- 1.2 B transistors
- 45 nm
- 567 mm<sup>2</sup>
- Max 4.25 GHz
- 4 threads / core
- 8 cores / chip
- 4 chips / module
- 8 modules / drawer
- 12 drawers / rack
- 170 racks / system

524,288 cores / system, 2,097,152 threads / system ☺


# **Uncertain Delay**

- Data-dependent delay
- Short and long combinational paths
- Device parameter variations (§3.4)
  - Batch to batch, wafer to wafer, die to die
  - V<sub>t</sub> (threshold voltage), k' (transconductance), W and L (dimensions)
- Supply variations
  - IR drop, dI/dt drop, ringing
- Interconnect delay
  - Don't know length of line during logic design
  - Delay at begin of line smaller than at end
  - Interconnect parameter variability

# **Delay Along a Wire**



# **Delay of Clock Wire**



#### A 5 ns delay compares with one full clock period at 200 MHz

# **Canonical Clock Tree Network**




# **Impact of Uncertain Delay**

- Combinational circuits will eventually settle at correct output values when inputs are stable
- Sequential circuits
  - Have state
  - Must guarantee storing of correct signals at correct time
  - Require ordered computations

# **Sequential Circuits**

- Sequential circuits require ordered computation
- Several ways for imposing ordering
  - ✓ **Synchronous** (clock)
  - **Asynchronous** (unstructured)
  - **Self-timed** (negotiation)

Clock works like an orchestra conductor



# **Synchronous Design**

- Global Clock Signal
- Synchronicity may be defeated by
  - Delay uncertainty in clock signal
  - Relative timing errors: clock skew
  - Slow logic paths
  - Fast logic paths

# **Timing Metrics Reminder**









- t<sub>c-q</sub>: delay from clock (edge) to Q
- t<sub>su</sub>: setup time
- t<sub>hold</sub>: hold time
- t<sub>plogic</sub>: worst case propagation delay of logic
- t<sub>cd</sub>: best case propagation delay (contamination delay)
- T: clock period

$$\begin{array}{l} T \geq t_{c-q} + t_{plogic} + t_{su} \\ t_{cdregister} + t_{cdlogic} \geq t_{hold} \end{array}$$
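The two constraints can be checked numerically. The following is a minimal Python sketch, not a tool from the course: the class name `RegisterTiming`, the function names, and all delay values (in picoseconds) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RegisterTiming:
    """Register timing parameters in picoseconds (illustrative values only)."""
    t_cq: int       # clock-to-Q delay
    t_su: int       # setup time
    t_hold: int     # hold time
    t_cd_reg: int   # contamination (best case) delay of the register


def min_clock_period(reg: RegisterTiming, t_plogic: int) -> int:
    """Setup constraint: T >= t_cq + t_plogic + t_su."""
    return reg.t_cq + t_plogic + reg.t_su


def hold_ok(reg: RegisterTiming, t_cd_logic: int) -> bool:
    """Hold constraint: t_cd_reg + t_cd_logic >= t_hold."""
    return reg.t_cd_reg + t_cd_logic >= reg.t_hold


# Hypothetical numbers, assumed for this example.
reg = RegisterTiming(t_cq=100, t_su=50, t_hold=60, t_cd_reg=40)

T_min = min_clock_period(reg, t_plogic=850)         # 100 + 850 + 50
hold_met = hold_ok(reg, t_cd_logic=30)              # 40 + 30 >= 60
hold_violated = not hold_ok(reg, t_cd_logic=10)     # 40 + 10 < 60

print(f"minimum clock period: {T_min} ps")
print(f"hold met with 30 ps logic contamination delay: {hold_met}")
print(f"hold violated with 10 ps logic contamination delay: {hold_violated}")
```

Note that the hold check does not involve T at all: a fast path that violates hold cannot be repaired by slowing the clock.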

# **Sequential Circuit Timing**



#### How to reduce T<sub>clk</sub>?

# **Pipelined Laundry System**



From http://cse.stanford.edu/class/sophomore-college/projects-00/risc/pipelining/index.html which credited http://www.ece.arizona.edu/~ece462/Lec03-pipe/

# **Pipelining**



| Clock Period | Adder       | Absolute Value | Logarithm           |
|--------------|-------------|----------------|---------------------|
| 1            | $a_1 + b_1$ |                |                     |
| 2            | $a_2 + b_2$ | $ a_1 + b_1 $  |                     |
| 3            | $a_3 + b_3$ | $ a_2 + b_2 $  | $\log( a_1 + b_1 )$ |
| 4            | $a_4+b_4$   | $ a_3 + b_3 $  | $\log( a_2 + b_2 )$ |
| 5            | $a_5 + b_5$ | $ a_4 + b_4 $  | $\log( a_3 + b_3 )$ |



$$T_{clk} \geq t_{c-q} + \max(t_{p,add}, t_{p,abs}, t_{p,log}) + t_{su}$$

- Improve resource utilization
- Increase functional throughput
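To make the trade-off concrete, the sketch below compares the adder / absolute value / logarithm datapath with and without pipeline registers. It is a minimal illustration in Python; the stage delays and register parameters (in picoseconds) are assumed values, not figures from the slides, and the function names are hypothetical.

```python
def pipelined_period(t_cq, stage_delays, t_su):
    """Clock period is set by the slowest stage between two registers."""
    return t_cq + max(stage_delays) + t_su


def unpipelined_period(t_cq, stage_delays, t_su):
    """Without internal registers all stages sit in one combinational path."""
    return t_cq + sum(stage_delays) + t_su


# Assumed stage delays in ps (illustrative only).
stages = {"add": 400, "abs": 150, "log": 600}
t_cq, t_su = 100, 50

T_pipe = pipelined_period(t_cq, stages.values(), t_su)
T_flat = unpipelined_period(t_cq, stages.values(), t_su)

# One result leaves the pipeline every T_pipe, but each individual result
# needs one clock period per stage to travel through it.
latency_pipe = len(stages) * T_pipe

print(f"unpipelined: T = {T_flat} ps, latency = {T_flat} ps")
print(f"pipelined:   T = {T_pipe} ps, latency = {latency_pipe} ps")
```

With these assumed numbers the pipelined clock period is shorter than the unpipelined one, while the latency of a single result is longer. The slowest stage (here the logarithm) dominates, so balancing the stage delays is what makes pipelining pay off.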


# **Pipelining Observations**

- Very popular/effective measure to increase functional throughput and resource utilization
- At the cost of increased *latency*
- All high performance microprocessors use pipelining extensively in the instruction fetch-decode-execute sequence
- Pipelining efficiency may fall dramatically because of branches in program flow
  - Requires emptying of pipeline and restarting
  - Partially remedied by advanced branch prediction techniques
- There was an era when all was dictated by GHz marketing drive
  - All a customer asked was: "How many GHz?"
  - Or said: "Mine is ... GHz!"

# Bottom line: more flip-flops, greater timing design problems

# **The Clock Skew Problem**

- In Single Phase Edge Triggered Clocking
- In Two Phase Master-Slave Clocking




# **The Clock Skew Problem**

#### Clock Rates >> 1 GHz in CMOS

[Figure: registers R1, R2, R3 and combinational logic blocks CL1, CL2, CL3 between In and Out, clocked by φ; timing labels t<sub>φ'</sub>, t<sub>φ''</sub>, t<sub>φ'''</sub>, t<sub>i</sub>, t<sub>l,min</sub>, t<sub>l,max</sub>, t<sub>r,min</sub>, t<sub>r,max</sub>]

- Clock Edge Timing Depends upon Position
  - Because clock network forms distributed RC line with lumped load capacitances at multiple sites (see earlier slide)
- **(Relative)** Clock Skew $\delta = t_{\phi''} - t_{\phi'}$
- Clock skew can take significant portion of T<sub>clk</sub>

### **Positive and Negative Skew**



### **Edge-Triggered Slow Path Skew Constraint**



### Minimum Clock Period Determined by Maximum Delay between Latches minus skew

### **Edge-Triggered Fast Path Skew Constraint**



### **Clock Constraints in Edge-Triggered Logic**

$T \ge t_{max} - \delta$

$\delta \le t_{min}$

#### Observe:

- Minimum Clock Period Determined by Maximum Delay between Registers minus clock skew
- Maximum Clock Skew Determined by Minimum Delay between Registers
- Conclude:
  - Positive skew must be bounded
  - Negative skew reduces maximum performance
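The two edge-triggered constraints can be explored with a small Python sketch. It assumes that `t_max` and `t_min` lump together the maximum and minimum register-to-register delays, as in the inequalities above; the function names and the numeric values (in picoseconds) are hypothetical.

```python
def min_period(t_max, skew):
    """Slow-path constraint: T >= t_max - skew."""
    return t_max - skew


def skew_is_safe(t_min, skew):
    """Fast-path constraint: skew <= t_min."""
    return skew <= t_min


# Assumed register-to-register delays in ps (illustrative only).
t_max, t_min = 900, 120

for skew in (-100, 0, 100, 200):
    status = "ok" if skew_is_safe(t_min, skew) else "RACE: skew exceeds t_min"
    print(f"skew = {skew:5d} ps  ->  T_min = {min_period(t_max, skew):5d} ps  ({status})")
```

A negative skew always passes the fast-path check but lengthens the minimum period, whereas a positive skew shortens the period until it exceeds `t_min`, at which point the circuit fails regardless of the clock frequency.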

### Clock Skew in Master-Slave 2-Phase Design



### **2-Phase M/S Slow Path Skew Constraint**




# **Clock Constraints in 2-Phase Design**

$$T \geq t_{max} - \delta + T_{\phi 12}$$

$\delta \leq t_{min} + T_{\phi 12}$

#### Observe:

- Minimum Clock Period Determined by Maximum Logic Delay $t_{max}$ minus clock skew $\delta$ plus T<sub>φ12</sub> (φ1-φ2 non-overlap time)
- Maximum Clock Skew Determined by Minimum Logic Delay plus T<sub>φ12</sub>
- Conclude again:
  - Negative skew reduces maximum performance
- However:
  - Positive skew can be made harmless by increasing T<sub>φ12</sub>
  - T<sub>φ12</sub> reduces maximum performance
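The role of the non-overlap time can be illustrated with a short Python sketch of the two-phase constraints. The function names and the delay values (in picoseconds) are assumptions made for this example only.

```python
def min_period_2phase(t_max, skew, t_phi12):
    """Slow-path constraint: T >= t_max - skew + T_phi12."""
    return t_max - skew + t_phi12


def required_nonoverlap(t_min, skew):
    """Smallest T_phi12 that satisfies skew <= t_min + T_phi12."""
    return max(0, skew - t_min)


# Assumed logic delays in ps (illustrative only).
t_max, t_min = 900, 120

for skew in (100, 200, 400):
    t_phi12 = required_nonoverlap(t_min, skew)
    period = min_period_2phase(t_max, skew, t_phi12)
    print(f"skew = {skew} ps: need T_phi12 >= {t_phi12} ps, T_min = {period} ps")
```

Any positive skew can be tolerated by choosing a large enough non-overlap time, but every picosecond of non-overlap is added directly to the minimum clock period.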

# Controlling Clock Skew, Case Study




# **Countering Clock Skew Problems**

- Routing the clock in opposite direction of data (negative skew)
  - Hampers performance
  - Dataflow not always uni-directional
  - Maybe at sub circuit (e.g. datapath) level
  - Other approaches needed at global chip-level
  - Useful skew (or beneficial skew) is serious concept
- Enlarging non-overlap periods of clock [only with two-phase clocking]
  - Hampers performance
  - Can theoretically always be made to work
  - Delay in clock network may require impractical/excessively large scheduled T<sub>φ12</sub> to guarantee minimum T<sub>φ12</sub> everywhere across chip
  - Is becoming less popular for large high performance chips

# **Dataflow not unidirectional**



#### Data and Clock Routing

- Cannot unambiguously route clock in opposite direction of data
- Need bounded skew

# **Need bounded Skew**

- Bounded skew most practical measure to guarantee functional correctness without reducing performance
- Clock Network Design
  - Interconnect material
  - Shape of clock-distribution network
  - Clock driver, buffers
  - Clock-line load
  - Clock signal rise and fall times
  - ...

# **H-tree Clock Network**



- All blocks equidistant from clock source ⇒ zero (relative) skew
- Sub blocks should be small enough to ignore intra-block skew
  - In practice perfect H-tree shape not realizable
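The equidistance property of an ideal H-tree can be verified with a small recursive construction. This is a minimal Python sketch under idealized assumptions: branches alternate between horizontal and vertical, the segment length halves after every vertical branch, and equal wire length is taken as a stand-in for equal delay (which only holds for identical wires and loads). The function name `h_tree_leaves` and the unit lengths are hypothetical.

```python
def h_tree_leaves(levels, x=0, y=0, length=None, horizontal=True, dist=0):
    """Return (x, y, wire_length_from_root) for every leaf of an ideal H-tree.

    levels: number of binary branching levels (2**levels leaves).
    length: length of each of the two wire segments at the current branch.
    """
    if length is None:
        length = 2 ** levels          # integer units keep the arithmetic exact
    if levels == 0:
        return [(x, y, dist)]
    leaves = []
    for sign in (-1, 1):
        nx = x + sign * length if horizontal else x
        ny = y if horizontal else y + sign * length
        # Halve the segment length after each vertical branch.
        next_length = length if horizontal else length // 2
        leaves += h_tree_leaves(levels - 1, nx, ny, next_length,
                                not horizontal, dist + length)
    return leaves


leaves = h_tree_leaves(4)
wire_lengths = {d for _, _, d in leaves}

print(f"{len(leaves)} leaves, distinct root-to-leaf wire lengths: {wire_lengths}")
```

All leaves report the same root-to-leaf wire length, which is the geometric reason an ideal H-tree has zero relative skew. Real layouts deviate from this because of blockages, unequal loads, and process variation.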

#### **Observe: Only Relative Skew Is Important**

### **Clock Network with Distributed Buffering**



## **Case Study: IBM Power 4 Chip**



- f<sub>c</sub> > 1.3 GHz
- 0.18 μm (0.09 μm L<sub>eff</sub>)
- 174,000,000 transistors
- 6380 C4 pins (2200 signal)
- 115 W (@1.1 GHz, 1.5 V)

IBM J. Res. Dev. Vol 46 No. 1 Jan 2002





#### Figure 6

Schematic diagram of global clock generation and distribution.



#### Figure 7

3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.



#### Figure 8

Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.

## **Power6 Clock Distribution**





#### Latency ~ cycle time

#### Friedrich, ISSCC 2007


### **Power6 Clock Distribution**



#### Stolt, JSSC 2008

**Philip Restle animations** 


# **IBM Power6 Physical Design Flow**



# **Timing Design**

- Clocking Scheme is important design decision
- Influences
  - Power
  - Robustness
  - Ease of design, design time
  - Performance
  - Area, shape of floor plan
- Needs to be planned early in design phase
- But is becoming a design bottleneck nevertheless
  - Clock frequencies increase
  - Die sizes increase
  - Clock skew significant fraction of T<sub>clk</sub>
- Alternatives
  - Asynchronous or self-timed





# **More Clocking Issues**

- Clock power (~30 % of total chip power)
- Asynchronous design for reasons of power and variability
- Resonant clocking
  - Use principles of (buck) converters to recycle energy
- Multiple clock domains, power domains, sleep states
  - for standby modes
  - for integrating IP

# Summary

- Timing Design Background and Motivation
  - Delay variations, impact
  - Sequential circuits, synchronous design
  - Pipelining, metrics reminder
- The Clock Skew Problem
  - In Single Phase Edge Triggered Clocking
  - In Two Phase Master-Slave Clocking
- Controlling Clock Skew
- Case Study

# Got basic appreciation of some system level design issues?