# Design and performance parameters of an ultra-low voltage, single supply 32bit processor implemented in 28nm FDSOI technology

Sylvain Clerc, Fady Abouzeid, Darayus Adil Patel, Jean-Marc Daveau, Cyril Bottoni, Lorenzo Ciampolini, Fabien Giner, David Meyer, Robin M. Wilson, Philippe Roche, et al.

#### To cite this version:

Sylvain Clerc, Fady Abouzeid, Darayus Adil Patel, Jean-Marc Daveau, Cyril Bottoni, et al. Design and performance parameters of an ultra-low voltage, single supply 32bit processor implemented in 28nm FDSOI technology. ISQED 2015 - 16th International Symposium on Quality Electronic Design, Apr 2015, Santa Clara, United States. pp.366-370, 10.1109/ISQED.2015.7085453. lirmm-01272913

## HAL Id: lirmm-01272913

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01272913

Submitted on 10 Feb 2022

S. Clerc<sup>1</sup>, F. Abouzeid<sup>1</sup>, D.A. Patel<sup>1</sup>, J-M. Daveau<sup>1</sup>, C. Bottoni<sup>1</sup>, L. Ciampolini<sup>1</sup>, F. Giner<sup>1</sup>, D. Meyer<sup>1</sup>, R. Wilson<sup>1</sup>, P. Roche<sup>1</sup>, S. Naudet<sup>1</sup>, A. Virazel<sup>2</sup>, A. Bosio<sup>2</sup>, P. Girard<sup>2</sup>

<sup>1</sup>STMicroelectronics, 850, rue Jean Monnet, Crolles, FRANCE

<sup>2</sup>LIRMM, University of Montpellier 2 / CNRS, Montpellier, FRANCE

<sup>1</sup>E-mail: sylvain.clerc@st.com

#### **Abstract**

This work presents a single-supply 32b SPARC V8 microprocessor designed with standard cells and memories adapted to Ultra Low Voltage (ULV), targeting low operating energy and low standby power. The microprocessor, equipped with an 8KB SRAM cache built from 10-transistor (10T) ULV bitcells, has been fabricated in 28nm Fully Depleted Silicon On Insulator (FDSOI) technology. A comparative analysis with similar implementations highlights the performance gain and power savings achieved by our design methodology and implementation technology. Wafer-level tests show that the ULV-adapted microprocessor is functional down to 0.33V and that the ULV-capable cache can save from 30% to 62% of the energy.

#### **Keywords**

FDSOI, ULV, RISC, Energy efficient Cache

#### 1. Introduction

The growing complexity and performance requirements of modern multimedia devices are pushing chipmakers to achieve better logic performance, especially at low voltages. In this context, FDSOI technology has conclusively demonstrated its ability to achieve high speed at low operating voltages [4]. Moreover, in the race for higher energy efficiency driven by the Internet of Things and battery-powered systems, lowering the device supply to Ultra Low Voltage (ULV) has been intensively explored [1] [2]. This is the strongest lever to achieve low energy operation, at the expense of low frequency and increased variability. However, the latest technologies widen the application range by enabling ULV frequencies from 10 to 100 MHz with a lower variability penalty. This paper presents a 32b SPARC microprocessor targeting low operating energy and standby power, designed with specific standard cells, a 10-transistor memory cell, and adapted place-and-route tool margins.
Fabricated in 28nm FDSOI, the design has been tested at wafer level and is functional from 1.2V down to 0.33V.

The rest of the paper is organized as follows: Section 2 discusses the design details and fabrication platform. Section 3 provides an overview of the microprocessor architecture. Section 4 presents the test methodology and results. Section 5 concludes the paper.

#### 2. Design and Fabrication Platform

This section first provides relevant information on the implementation technology. It then presents the design and performance parameters of the standard cells and memories employed in the processor. Finally, it details the design flow adaptation.

#### 2.1. Technology

The chip has been manufactured in 28nm FDSOI technology [4] [5] with a 10-metal stack. The thin-oxide logic transistor channel is undoped and its volume is isolated from the bulk by an ultra-thin buried oxide layer. This isolation makes it possible to control the transistors' threshold voltage (VT) through the type of well lying beneath the channel. Several design options are possible:

- Dual-Well Regular VT (RVT), where the well positions are similar to those of bulk, allowing full supply range reverse body bias.
- Dual-Well Low VT (LVT), where the well positions are inverted with respect to those of bulk (i.e. P-well under PMOS, N-well under NMOS), allowing full supply range forward body bias.
- Single-Well (SW), where RVT PMOS together with LVT NMOS, or vice versa, can be used, but with no body bias possible.

#### 2.2. Standard Cells

The standard cells used in the processor presented in this work have been laid out so that the source and drain contact spacing stays the same for any drawn gate length (L) from 30nm to 46nm. Cells with a longer L are then derived simply by adding a CAD marker layer that denotes the increase of L from nominal. This technique is called poly biasing.
In the rest of the article, Poly Bias 10 (PB10) denotes standard cells in which L is systematically increased by 10 nanometers for all transistors. The technique can also be applied asymmetrically, on selected transistors of a cell, to achieve specific performance. In the rest of the article, the cells denoted Asymmetric Poly Bias 10 (APB10) have most transistors at nominal size, with some transistors lengthened by 10nm. APB offers additional operating points in the energy-frequency space, whose extremes are reached with symmetric poly bias cells.

The Power Delay Product (PDP), leakage, current and delay values of the various symmetric and asymmetric poly bias cells have been compared using the critical path methodology developed in [6]. As can be seen in Fig. 1, when the frequency is lower than 3000MHz, the optimal leakage is reached with the largest symmetric PB. Above this frequency, the configurations spread out in frequency and the APB configurations provide intermediate leakage-frequency alternatives to the symmetric PB configurations. For example, at 3800MHz, APB16PB4 and APB10PB4 offer an alternative to PB4 with lower leakage.

**Figure 1:** Leakage of the various poly bias cells plotted against qFO4 critical path frequency. Asymmetric cells provide intermediate points, which make it possible to lower the leakage for a given frequency.

The poly bias configurations giving the lowest leakage current or energy for a given delay were selected for derivation into a standard cell library, resulting in a set of 5 poly bias configurations and a total of 300 standard cells. In the reported design, 21% of the combinational cells are asymmetric.

The flip-flop functionality has been validated using $6\sigma$ Monte-Carlo simulations across the ultra-wide voltage range. The same method has been applied to the input and output level shifters designed to interface with nominal-voltage (around 1V) signals.

#### 2.3. Memories

This work uses a single-pwell (SPW) evolution, with limited sizing modifications, of the dual-well ULV bitcell previously presented in [7]. The single-pwell gives a higher NMOS threshold voltage and lower leakage, at the cost of a 50mV higher Vmin. The schematic of the bitcell is displayed in Fig. 2.

**Figure 2:** 10T ULV bitcell; red circles show the transistors activated during read to drive the bit lines to 0.

**Figure 3:** Comparison of leakage between the ULV 10T Dual Well (DW label, in blue) and Single Well (SPW label, in pink) bitcells, expressed as a ratio to the leakage of the reference 6T bitcell.

While the dual-well bitcell showed a leakage penalty of up to 8X compared to the reference 6T bitcell available in this technology, the new SPW bitcell reduces the leakage penalty to between 66% (typical process, 25C) and 25% (fast process, 125C) (Fig. 3), while maintaining the read (SNM) and write (WM) stability.

#### 2.4. Design Flow Adaptation

Ultra-low voltage supply scaling is taken into account during implementation by using a specific set of Process-Voltage-Temperature-Slope-Capacitance (PVTSC) corners centered around 0.33V, in addition to the nominal ones. These ULV operating conditions increase both the nominal value and the dispersion of the cells' delays and slopes. Our experience is that brute-force scaling of the implementation tool margins (i.e. clock and data de-rating factors) to match the evolution of the cell delay dispersion from 1V to ULV leads to unacceptable area congestion and frequency decrease, due to the high number of hold-fix buffers inserted. Instead, we used extra skewed corners in timing analysis at ULV together with the nominal-voltage tool margin settings. This led to lower area congestion and better Si-CAD matching.

#### 3. Microprocessor Architecture

The microprocessor presented in this work is a SPARC V8 32-bit LEON3 synthesizable core, available at [3].

**Figure 4:** SPARC V8 32b LEON3 microprocessor system and FDSOI 28nm reticle picture with the LEON ULV processor marked in green.

**Table 1:** Microprocessor Comparison

| Parameter | This Work | [8] | [9] | [10] | [11] |
|--------------------|--------------------|-------------------------|-----------------|--------------------|-----------------|
| Architecture | 32b<br>SPARCV8 | 32b C64x | 32b ReISC | 16b MSP430 | 16b MSP430 |
| SRAM | 4KB I\$<br>4KB D\$ | 32KB L1\$<br>128KB L2\$ | 256B \$<br>16KB | 256B \$<br>18KB | 8B I\$<br>176KB |
| Peripherals | DSU, UART,<br>GPIO | I2C, SPI, UART,<br>MMU | I2C, SPI, JTAG | SPI, UART,<br>GPIO | GPIO |
| Technology | 28nm FDSOI | 28nm CMOS | 65nm CMOS | 65nm CMOS | 65nm CMOS |
| Gate Count | 100k | 600k | 12k | 8k | N.A. |
| Cycle Energy (pJ) | 7.6 | 200 | 10.2 | 7 | 27.3 |
| F @ MEP (MHz) | 11@0.44V | 3.6@0.34V | 0.54@0.54V | 25@0.4V | 0.44@0.5V |
| Fmax (MHz) | 606@0.92V | 500@0.9V | 83@1.2V | N.A. | 1@0.6V |
| 25MHz power (mW) | 0.241 | 5.9 | 0.5 | 0.175 | N.A. |
| Standby Power (µW) | 14.7 | 4000 | 1.65 | 1.5 | 1 |

The core has a Harvard architecture with split 4 KByte data and 4 KByte instruction caches, using a random replacement policy. The processor uses register-window context swapping with a 136-entry, 2-read 1-write register file (8 windows of 16 registers, plus 8 global registers). The register file is implemented with inferred flip-flops that are kept out of the scan chain to limit routing congestion; it is tested by a single register program at the beginning of the circuit test. The caches are implemented with single-port memory instances built from 10T bitcells. The implementation results in approximately 100k gates and 64Kb of memory. It occupies 0.45mm<sup>2</sup> of a total chip area of 5.5mm<sup>2</sup>, as shown in Fig. 4.

#### 4. Circuit Measurements

#### 4.1. Test Setup

The circuits have been measured on one wafer using the Verigy 93k automated test equipment at ambient temperature.
The chip is exercised with an application program that is serially loaded into the AHB-RAM and then run from the caches (if activated) or from the AHB-RAM. The microprocessor under test features two dedicated supplies, one for the CPU core and the other for the caches. While this split supply enables selective current characterization, the chip can be considered single-supply because no voltage modulation was used to assist operation: both voltages are constant and equal to each other during the whole execution of the programs.

The low-voltage results are extracted in two steps: first, a minimum voltage (Vmin) search is done at a fixed frequency; then, at the extracted Vmin, the maximum frequency (Fmax) is searched. This filters out hold violations and may lead to a higher frequency at the voltage floor. The leakage and dynamic currents are extracted at Fmax and Vmin.

**Figure 5:** Average microprocessor minimum voltage extracted at 1MHz.

#### 4.2. Electrical Performance Results and Comparative Analysis

The microprocessor exhibits a minimum energy point of 7.6pJ per instruction at 11MHz and 0.44V. For low-voltage operation, the frequency ranges from 1MHz@0.33V to 25MHz@0.51V, while the design can reach 165MHz@0.65V, 606MHz@0.92V and 914MHz@1.2V; this last voltage is an overdrive that cannot be sustained for continuous operation. The Vmin extracted on 193 dies at 10MHz and 1MHz is 0.43V and 0.35V respectively, with a best die at 0.33V. The Vmin population at 1MHz is displayed in Fig. 5. The standby power extracted at the 1MHz Vmin is 14.7μW, and the lowest energy needed to execute a Dhrystone is 6.5nJ@10MHz (detailed in the next section).

The measured results are displayed and compared with similar work in Table 1. Compared to the work in [8], if we scale by a factor of 6X to 32X to normalize to either gate count or memory capacity, the standby power of the proposed design is improved by 37X to 7X.
The design in [9] suffers from very low speed at low voltage, and both designs in [10] and [11] are outperformed in energy per datapath bit. Both this work and [12] are manufactured in 28nm FDSOI; while [12] targets high speed and reaches 115MHz at 0.5V with 0V FBB, it consumes 62pJ per instruction where the proposed design needs 7.6pJ. Comparison between different architectures is always difficult, but we believe the proposed CPU is a well-performing alternative for low-energy processing in the tens of MHz speed range.

#### 4.3. Cache Impact on Energy and Leakage Power

The caches of the CPU have dedicated supplies and can be deactivated by software, which enables a simple evaluation of the effect of the caches on the energy required to execute a program. We have evaluated three different programs: the Dhrystone test [13], a 256-point Fast Fourier Transform and Atkin's sieve prime number search [14], either without any cache or with both instruction and data caches activated, and extracted the number of cycles needed.

**Table 2:** Comparison of execution energy with and without cache activation, extrapolated from measurements of the Atkin sieve program on 193 dies at 10 MHz

| Program (energy unit) | Cycles, \$ active | Cycles, no \$ | Energy, \$ active | Energy, no \$ |
|-----------------------|-------------------|---------------|-------------------|---------------|
| Dhrystone (nJ) | 960 | 1978 | 12.8 | 18.3 |
| FFT (µJ) | 192758 | 654053 | 2.43 | 6.04 |
| Atkin (µJ) | 86889 | 287334 | 1.03 | 2.66 |

It was not possible to test all program and cache combinations. However, following [15], we have assumed that the relative proportion of RISC CPU instructions is the same from one program to another, and reused the dynamic cycle energy measurements of the Atkin program.
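
As a quick cross-check of Table 2, the following Python sketch recomputes the cycle reduction and the relative energy saving obtained by activating the caches. It uses only the rounded values printed in the table; the names `TABLE_2` and `cache_gains` are illustrative and do not come from the original work. With these rounded entries the savings come out at roughly 30% to 61%, close to the 30% to 62% range quoted in this paper (the small difference is presumably a rounding effect).

```python
# Values copied from Table 2:
# (cycles with caches, cycles without caches, energy with caches, energy without caches).
# Energies use the unit given in the table (nJ for Dhrystone, uJ for FFT and Atkin);
# only ratios are computed, so the unit cancels out.
TABLE_2 = {
    "Dhrystone": (960, 1978, 12.8, 18.3),
    "FFT": (192758, 654053, 2.43, 6.04),
    "Atkin": (86889, 287334, 1.03, 2.66),
}


def cache_gains(cycles_on, cycles_off, energy_on, energy_off):
    """Return (cycle reduction factor, fractional energy saving) from enabling the caches."""
    cycle_factor = cycles_off / cycles_on
    energy_saving = 1.0 - energy_on / energy_off
    return cycle_factor, energy_saving


gains = {name: cache_gains(*row) for name, row in TABLE_2.items()}

for name, (cycle_factor, energy_saving) in gains.items():
    print(f"{name}: {cycle_factor:.2f}x fewer cycles, {energy_saving:.1%} energy saved")
```

The cycle count drops by a larger factor than the energy does, which is what one would expect if the caches add leakage and some dynamic energy per cycle of their own.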
Each program's energy is computed from the dynamic energy per cycle scaled by the number of cycles, plus the leakage power scaled by the program duration. The results are displayed in Table 2. The presence of caches on the same power and clock domain as the CPU can save from 30% to 62% of the energy, because the dynamic energy saved by the reduced CPU cycle count more than compensates for the extra leakage caused by the ULV caches. Since these cycle count and dynamic energy gains hold for any operating voltage, in our opinion they justify equipping the CPU with ULV caches.

#### 5. Conclusion

A 28nm FDSOI 32b SPARC microprocessor has been designed with adapted standard cells, memories and flow to withstand ULV operation. The validation circuit with split core/cache supplies makes it possible to explore memory architecture options and demonstrates that designing a ULV-capable cache can save from 30% to 62% of the energy. The operating range extends from a 0.33V minimum voltage at 1MHz, with an associated 14.7µW standby power, to the Minimum Energy Point at 0.44V/11MHz, where it consumes 7.6pJ per instruction and 6.5nJ per Dhrystone. At 25MHz, a frequency fast enough to enable audio signal processing for hearing aids, the operating power is 241µW. Duty-cycled operation with peak processing power can also be addressed with the 606MHz@0.92V operating point.

#### 6. References

- [1] Vittoz, E.; Fellrath, J., "CMOS analog integrated circuits based on weak inversion operations," Solid-State Circuits, IEEE Journal of, vol.12, no.3, pp.224-231, Jun 1977
- [2] Wang, A.; Chandrakasan, A., "A 180-mV subthreshold FFT processor using a minimum energy design methodology," Solid-State Circuits, IEEE Journal of, vol.40, no.1, pp.310-319, Jan. 2005
- [3] Aeroflex Gaisler AB, 2012. Available Online: http://www.gaisler.com/index.php/products/processors
- [4] Planes, N.; et al., "28nm FDSOI technology platform for high-speed low-voltage digital applications," VLSI Technology (VLSIT), 2012 Symposium on, pp.133-134, 12-14 June 2012
- [5] C. Fenouillet-Beranger, et al., "Fully-depleted SOI technology using high-k and single-metal gate for 32 nm node LSTP applications featuring 0.179 um2 6T-SRAM bitcell," in Electron Devices Meeting, 2007. IEDM 2007. IEEE International, pp.267-270, 10-12 Dec. 2007
- [6] F. Abouzeid, et al., "28nm CMOS, energy efficient and variability tolerant, 350mV-to-1.0V, 10MHz/700MHz, 252bits frame error-decoder," ESSCIRC (ESSCIRC), 2012 Proceedings of the, pp.153-156, 17-21 Sept. 2012
- [7] F. Abouzeid, A. Bienfait, K. C. Akyel, S. Clerc, et al., "Scalable 0.35V to 1.2V SRAM bitcell design from 65nm CMOS to 28nm FDSOI," ESSCIRC (ESSCIRC), 2013 Proceedings of the, pp.205-208, 16-20 Sept. 2013
- [8] N. Ickes, G. Gammie, M. Sinangil, R. Rithe, et al., "A 28 nm 0.6 V low power DSP for mobile applications," Solid-State Circuits, IEEE Journal of, vol.47, no.1, pp.35-46, Jan. 2012
- [9] N. Ickes, Y. Sinangil, F. Pappalardo, E. Guidetti, and A. Chandrakasan, "A 10 pJ/cycle ultra-low-voltage 32-bit microprocessor system-on-chip," ESSCIRC (ESSCIRC), 2011 Proceedings of the, pp.159-162, 12-16 Sept. 2011
- [10] D. Bol, et al., "A 25MHz 7 uW/MHz ultra-low-voltage microcontroller SoC in 65nm LP/GP CMOS for low-carbon wireless sensor nodes," Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, pp.490-492, 19-23 Feb. 2012
- [11] J. Kwong, et al., "A 65nm sub-Vt microcontroller with integrated SRAM and switched-capacitor DC-DC converter," Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pp.318,616, 3-7 Feb. 2008
- [12] R. Wilson, E. Beigne, P. Flatresse, A. V. F. Abouzeid, et al., "A 460MHz at 397mV, 2.6GHz at 1.3V, 32b VLIW DSP, embedding Fmax tracking," Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pp.452-453, 9-13 Feb. 2014
- [13] R. P. Weicker. (1988, May) Dhrystone test suite. Siemens AG. Available Online: http://en.wikipedia.org/wiki/Dhrystone
- [14] A. O. L. Atkin and D. J. Bernstein. Sieve of Atkin. Available Online: http://en.wikipedia.org/wiki/Atkin sieve
- [15] J. Hennessy and D. A. Patterson, Computer architecture, a quantitative approach. Morgan Kaufmann, 2003.