## Synthesis of fixed-point programs based on instruction selection, the case of polynomial evaluation

Mohamed Amine Najahi

## To cite this version:

Mohamed Amine Najahi. Synthesis of fixed-point programs based on instruction selection, the case of polynomial evaluation. RAIM: Rencontres Arithmétiques de l'Informatique Mathématique, Jun 2012, Dijon, France. 5ièmes Rencontres Arithmétique de l'Informatique Mathématique, 2012. lirmm01277361

## HAL Id: lirmm-01277361

https://hal-lirmm.ccsd.cnrs.fr/lirmm-01277361
Submitted on 22 Feb 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

L'archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d'enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

## Synthesis of fixed-point programs based on instruction selection

... the case of polynomial evaluation

## Amine Najahi

Advisors: Matthieu Martel and Guillaume Revy
Joint work with Christophe Mouilleron

> Équipe-projet DALI, Univ. Perpignan Via Domitia
> LIRMM, CNRS: UMR 5506 - Univ. Montpellier 2

## Motivation

■ Embedded systems are ubiquitous

- microprocessors and/or DSPs dedicated to one or a few specific tasks
- satisfy constraints: area, energy consumption, conception cost


## Motivation

■ Embedded systems are ubiquitous

- microprocessors and/or DSPs dedicated to one or a few specific tasks
- satisfy constraints: area, energy consumption, conception cost

■ Some embedded systems do not have any FPU (floating-point unit)


- Highly used in audio and video applications
- demanding on floating-point computations


## Motivation

■ Embedded systems are ubiquitous

- microprocessors and/or DSPs dedicated to one or a few specific tasks
- satisfy constraints: area, energy consumption, conception cost

■ Some embedded systems do not have any FPU (floating-point unit)
No FPU
No FPU
ヨ



- Highly used in audio and video applications
- demanding on floating-point computations


## Motivation

■ Embedded systems are ubiquitous

- microprocessors and/or DSPs dedicated to one or a few specific tasks
- satisfy constraints: area, energy consumption, conception cost

■ Some embedded systems do not have any FPU (floating-point unit)


- Highly used in audio and video applications
- demanding on floating-point computations


## Motivation

- Embedded systems are ubiquitous
- microprocessors and/or DSPs dedicated to one or a few specific tasks
- satisfy constraints: area, energy consumption, conception cost

■ Some embedded systems do not have any FPU (floating-point unit)


- Highly used in audio and video applications
- demanding on floating-point computations


## How to use floating-point programs on embedded systems?

- Two approaches to continue using numerical algorithms on these cores:

1. convert the entire numerical application from floating to fixed-point arithmetic
2. write a floating-point emulation library and link the numerical application against it

## Fixed-point conversion

produces a fast code
consumes less energy
$X$ machine specific: no standard
$x$ smaller dynamic range than floating-point
$x$ tedious and time consuming

## Floating-point support design

tons of code are written using floating-point
an algorithm can be synthesized on a PC and then transferred to the device without modifications
$x$ slower
$x$ tedious and time consuming
$\hookrightarrow$ There is a need for the automation of both processes.

## Fixed-point conversion vs. floating-point emulation design

- Floating to fixed-point conversion tools:
- addressed by the ANR project DEFIS, with IRISA, LIP6, CEA, THALES, INPIXAL
- some tools are currently developed: ID.Fix, ...
- two main approaches:

1. statistical methods: perform well, but provide no guarantees and may be slow.
2. analytical methods: usually quite pessimistic, but they are safer to use.

- Floating-point emulation support:
- a number of high quality emulation libraries exist: FLIP, SoftFloat,...
- more or less compliant with the IEEE-754 standard
- FLIP: relies on polynomial evaluation to evaluate division and square root
- a huge number of schemes for evaluating a given polynomial $\rightsquigarrow$ development of CGPE
- $\approx 50 \%$ of FLIP's code was generated by CGPE.


## Outline of the talk

1. The CGPE tool
2. Instruction selection: an extension of CGPE
3. Conclusion and perspectives

## Outline of the talk

1. The CGPE tool
2. Instruction selection: an extension of CGPE

## 3. Conclusion and perspectives

## Overview of CGPE

- Goal of CGPE: automate the design of fast and certified C codes for evaluating univariate/bivariate polynomials
- in fixed-point arithmetic
- by using the target architecture features (as much as possible)
- Remarks:
- fast $\rightsquigarrow$ that reduce the evaluation latency on a given target
- certified $\rightsquigarrow$ for which we can bound the error entailed by the evaluation within the given target's arithmetic


## Global architecture of CGPE

- Input of CGPE

1. polynomial coefficients and variables: value intervals, fixed-point format, ...
2. set of criteria: maximum error bound and bound on latency (or the lowest)
3. some architectural constraints: operator cost, parallelism, ..
```
<polynomial>
    <coefficient x="0" y="0" inf="0\times000000020" sup="000000000020n
    <coefficient x="0" y=" I" inf=" 0\times80000000" sup=" 0 < 800000000"
    <coefficient x="1" y="1" inf="00\times400000000"
    <coefficient x="2" y="1" inf="0\times10000000"
    <coefficient }x="3" y="1" inf="0\times07fe93e4
    <coefficient x="4" y="1" inf="0x04eef694"
    <coefficient x="5" y="1" inf="0\times032d6643"
    <coefficient x="6" y="1" inf="0x01c6cebd"
    <coefficient x="7" y="1" inf="0x00aebe7d"
    ccoefficient x="8" v="1" inf="0000200000"
    <variable x="1" y="0" inf="0x00000000" sup="0xfffffe00" sign="0" integer part="0" fraction part=" 32"/> />
```



```
    <absolute_evalerror value="25081373483158693012463053528118040380976733198921b-191" strict="false"/>
</polynomial>
```

```
cgpe --degree="[8,1]" --xml-input=cgpe-test1.xml --coefs="[100000000111111111]"
    --latency=lowest --gappa-certificate --output
    --schedule="[4,2]" --max-kept=5 --operators="[1111111111111111111:133333333111333331]"
```


## Global architecture of CGPE (cont'd)

■ Internals of CGPE
CGPE proceeds in two steps:

1. Computation step:

- computes evaluation schemes while reducing their latency on unbounded parallelism
- considers only two possible arithmetic operations: addition and multiplication
- produces DAGs that represent the computed efficient schemes



## 2. Filtering step:

- prunes the evaluation schemes that do not satisfy different criteria: latency ( $\rightsquigarrow$ scheduling filter), accuracy ( $\rightsquigarrow$ numerical filter), ...


## Global architecture of CGPE (cont'd)

■ Output of CGPE

```
uint32_t func_d9_0(uint32_t T, uint32_t S)
|
    uint32_t r0 = T >> 2; 
    (+) Q[2.30]
    uint32_t r3 = 0x00000020 + r2; // (+) Q[2.30]
    uint32_t r4 = mul(T, T); // (+) Q [0.32]
    uint32_t r5 = mul(S, r4);
    uint32_t r6 = mul(T, 0x07fe93e4);
    uint32_t r7 = 0 <10000000 - r6;
    uint32_t r8 = mul(r5, r7);
    uint32_t r9 = r3 - r8;
    uint32_t r10 = mul(r4, r4);
    uint32_t r11 = mul(S, r10);
    uint32_t r12 = mul(T, 0x032d6643);
    uint32_t r13 = 0x04eef694 - r12;
    uint32_t r14 = mul(T, 0x00aebe7d);
    uint32_t r15 = 0x01c6cebd - r14;
    uint32_t r16 = r4 >> 11;
    uint32_t r17 = r15 + r16;
    uint32_t r18 = mul(r4, r17);
    uint32_t r19 = r13 + r18;
    uint32_t r20 = mul(r11, r19);
```



```
    return r21;
}
```

Listing 1: C code

```
```


## Coefficients and variables definition

```
```


## Coefficients and variables definition

a0 = fixed<-30, dn> (0x000000 20p-30);
a0 = fixed<-30, dn> (0x000000 20p-30);
a1 = fixed<-31,dn>(0\times800000000p-31);
a1 = fixed<-31,dn>(0\times800000000p-31);
a2 = fixed<-31,dn>(0\times40000000p-31);
a2 = fixed<-31,dn>(0\times40000000p-31);
a8 = fixed<-31,dn>(0\times00aebe7dp-31);
a8 = fixed<-31,dn>(0\times00aebe7dp-31);
a9 = fixed<-31,dn>(0\times00200000p-31);
a9 = fixed<-31,dn>(0\times00200000p-31);
T = fixed< <-32,dn>(fixed <-23,dn>(var0));
T = fixed< <-32,dn>(fixed <-23,dn>(var0));
S = fixed<-31,dn>(varl);
S = fixed<-31,dn>(varl);
CertifiedBound =
CertifiedBound =
25081373483158693012463053528118040380976733198921b-191;
25081373483158693012463053528118040380976733198921b-191;

## Evaluation scheme

## Evaluation scheme

r0 fixed<-31,dn>= T * a2; Mr0 = T * a2;
r0 fixed<-31,dn>= T * a2; Mr0 = T * a2;
r1 fixed<-31,dn\rangle= a1 +r0; Mr1 = a1 + Mr0;
r1 fixed<-31,dn\rangle= a1 +r0; Mr1 = a1 + Mr0;
r21 fixed<-30, dn>= r9 - r20; Mr21 = Mr9 - Mr20;
r21 fixed<-30, dn>= r9 - r20; Mr21 = Mr9 - Mr20;

## Results

## Results

I
I
l
l
var0 in [0x000000000p-32,0xffffffe00p-32]
var0 in [0x000000000p-32,0xffffffe00p-32]
/\var1 in [0\times800000000p-31,0\timesb504f334p-31]
/\var1 in [0\times800000000p-31,0\timesb504f334p-31]
->
->
/\ ro in [0,0xfffffffffp-31]
/\ ro in [0,0xfffffffffp-31]
/\ r0 - Mro in ?
/\ r0 - Mro in ?
/\ r21 in [0,0xfffffffffp-30]
/\ r21 in [0,0xfffffffffp-30]
\ |r21 - Mr21| - CertifiedBound <= 0
\ |r21 - Mr21| - CertifiedBound <= 0
/\ CertifiedBound in ?
/\ CertifiedBound in ?
)
)
}

```
```

}

```
```


## Listing 2: GAPPA certificate

## Global architecture of CGPE (cont'd)

- Output of CGPE

```
```

uint32_t func_d9_0(uint32_t T, uint32_t S)

```
```

uint32_t func_d9_0(uint32_t T, uint32_t S)
l
l
uint32_t r0 = T >> 2;
uint32_t r0 = T >> 2;
uint3.2_t r1 = 0\times80000000 + r0; // (t) Q [1.31]
uint3.2_t r1 = 0\times80000000 + r0; // (t) Q [1.31]
uint32-t r2 =mul(S, r1); % %// (+) Q[2.30]
uint32-t r2 =mul(S, r1); % %// (+) Q[2.30]
uint32_t r3 = 0x00000020 + r2;
uint32_t r3 = 0x00000020 + r2;
uint32-t r4 = mul(T, T); , // (+) Q [0.32]
uint32-t r4 = mul(T, T); , // (+) Q [0.32]
uint32_t r5 = mul(S, r4);
uint32_t r5 = mul(S, r4);
uint32_t r6 = mul(T, 0x07fe93e4);
uint32_t r6 = mul(T, 0x07fe93e4);
uint32_t r7 = 0\times10000000 - r6;
uint32_t r7 = 0\times10000000 - r6;
uint32_t r8 = mul(r5, r7);
uint32_t r8 = mul(r5, r7);
uint32_t r9 = r3 - r8;
uint32_t r9 = r3 - r8;
uint32_t r10=mul(r4,r4);
uint32_t r10=mul(r4,r4);
uint32_t r11 = mul(S, r10);
uint32_t r11 = mul(S, r10);
uint32_t r12 = mul(T, 0x032d6643);
uint32_t r12 = mul(T, 0x032d6643);
uint32_tr12 = mul(T, ox032d6643)
uint32_tr12 = mul(T, ox032d6643)
uint32_t r14 = mul(T, 0\times00aebe7d);
uint32_t r14 = mul(T, 0\times00aebe7d);
uint32_t r15 = 0x01c6cebd - r14;
uint32_t r15 = 0x01c6cebd - r14;
uint32_t r16 = r4 >> 11;
uint32_t r16 = r4 >> 11;
uint32_t r17 = r15 + r16;
uint32_t r17 = r15 + r16;
uint32_t r18 = mul(r4, r17);
uint32_t r18 = mul(r4, r17);
uint32_t r19 = r13 + r18;
uint32_t r19 = r13 + r18;
uint32_t r20 = mul(r11, r19);
uint32_t r20 = mul(r11, r19);
uint32_t r21 = r9 - r20;
uint32_t r21 = r9 - r20;
return r21;
return r21;
}

```
}
```

```
l/
```

l/
1;
1;
/
/
l/
l/
//
//
(-) Q [2.30]
(-) Q [2.30]
// (+) Q [2.30]
// (+) Q [2.30]
//(+) (+) [0.32]
//(+) (+) [0.32]
(+) Q[1.31]
(+) Q[1.31]
// (+) Q[1.31]
// (+) Q[1.31]
//
//
//
//
//
//
//
//
//
//
// (-)
// (-)
// (+) Q[2.30]

```
// (+) Q[2.30]
```

Listing 3: C code

13 12 11 10 9 8
11
10
12
10
9
8


## Achievements and lacking features of CGPE

## Features achieved by CGPE

$\checkmark$ validated on the ST200 core
$\checkmark$ so far, no ambushes were encountered for $\sqrt{ }, \sqrt[3]{ }, \frac{1}{\sqrt{ }}, \frac{1}{\sqrt[3]{3}} \cdots$ produced optimal schemes for some of the above functions such as $\sqrt{ }$

## Features lacking in CGPE

$X$ simplistic description of the underlying architecture (ex. no handling of advanced operators such as ST200 shift_and_add instruction)
$x$ the only shifts handled correspond to the multiplication by a power of 2
$x$ hypotheses are made on the format of the input coefficients

## Achievements and lacking features of CGPE

## Features achieved by CGPE

$\checkmark$ validated on the ST200 core
so far, no ambushes were encountered for $\sqrt{ }, \sqrt[3]{ }, \frac{1}{\sqrt{ }}, \frac{1}{\sqrt[3]{3}} \cdots$
produced optimal schemes for some of the above functions such as $\sqrt{ }$

## Features lacking in CGPE

$X$ simplistic description of the underlying architecture (ex. no handling of advanced operators such as ST200 shift_and_add instruction)
$x$ the only shifts handled correspond to the multiplication by a power of 2
$x$ hypotheses are made on the format of the input coefficients

Problem: without hypotheses on the formats of the input coefficients, CGPE fails Solution: add the handling of multiple shifts to CGPE

## Shift handling in CGPE

- There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts: allows to gain a few cycles

- shifting is usually less costly than multiplication


## Shift handling in CGPE

- There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts: allows to gain a few cycles

- shifting is usually less costly than multiplication

2. alignment shifts: used to align commas for an arithmetic operation

- addition of a $Q[1.31]$ and a $Q[2.30]$


## Shift handling in CGPE

- There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts: allows to gain a few cycles

- shifting is usually less costly than multiplication

2. alignment shifts: used to align commas for an arithmetic operation

- addition of a $Q[1.31]$ and a $Q[2.30]$

3. leading zeros' elimination shifts: used to gain some bits of precision

- $0 \times 40000000$ in the $Q[2.30]$ format $\rightsquigarrow 0 \times 80000000$ in the $Q[1.31]$ format


## Shift handling in CGPE

- There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts: allows to gain a few cycles

- shifting is usually less costly than multiplication

2. alignment shifts: used to align commas for an arithmetic operation

- addition of a $Q[1.31]$ and a $Q[2.30]$

3. leading zeros' elimination shifts: used to gain some bits of precision

- $0 \times 40000000$ in the $Q[2.30]$ format $\rightsquigarrow 0 \times 80000000$ in the $Q[1.31]$ format

4. overflow prevention shifts: used before an arithmetic operation to prevent it from overflowing

- to prevent the addition of a $Q[1.31]$ and a $Q[1.31]$ from overflowing the $Q[1.31]$ format, both operands are shifted to the $Q[2.30]$ format

■ Remark: to detect whether one of these shifts is needed, we rely on:

- fixed-point arithmetic rules (for case 2)
- MPFI computations (for cases 1, 3 and 4).


## Shift handling in CGPE

- There are 4 types of shifts to consider:

1. multiplication by a power of 2 shifts: allows to gain a few cycles

- shifting is usually less costly than multiplication

2. alignment shifts: used to align commas for an arithmetic operation

- addition of a $Q[1.31]$ and a $Q[2.30]$

3. leading zeros' elimination shifts: used to gain some bits of precision

- $0 \times 40000000$ in the $Q[2.30]$ format $\rightsquigarrow 0 \times 80000000$ in the $Q[1.31]$ format

4. overflow prevention shifts: used before an arithmetic operation to prevent it from overflowing

- to prevent the addition of a $Q[1.31]$ and a $Q[1.31]$ from overflowing the $Q[1.31]$ format, both operands are shifted to the $Q[2.30]$ format

Problem: shifts may affect the critical path, potentially increasing the latency of the DAG Solution: use more advanced instructions to help absorb this increase

- ex: shift-and-add instruction available on some fixed-point processors like the ST231


## Outline of the talk

## 1. The CGPE tool

2. Instruction selection: an extension of CGPE

## 3. Conclusion and perspectives

## The problem of instruction selection

- A well known problem in compilation that was proven to be NP-complete on DAGs.

■ Usually solved using a tiling algorithm:

- input:
- a DAG representing an arithmetic expression.
- a set of tiles, with a cost for each.
- a function that associates a cost to a subtree.
- output:
- a set of covering tiles that minimize the cost function.



FmaLeft $\left(x_{1}, x_{2},\left(x_{3} \cdot x_{4}\right)\right)$


FmaRight $\left(\left(x_{1} \cdot x_{2}\right), x_{3}, x_{4}\right)$

## The problem of instruction selection

- A well known problem in compilation that was proven to be NP-complete on DAGs.

■ Usually solved using a tiling algorithm:

- input:
- a DAG representing an arithmetic expression.
- a set of tiles, with a cost for each.
- a function that associates a cost to a subtree.
- output:
- a set of covering tiles that minimize the cost function.



FmaLeft $\left(x_{1}, x_{2},\left(x_{3} \cdot x_{4}\right)\right)$


FmaRight $\left(\left(x_{1} \cdot x_{2}\right), x_{3}, x_{4}\right)$

## Remark on instruction selection

## Some work in the area

Voronenko and Püschel from the Spiral group (2004):

- Automatic Generation of Implementations for DSP Transforms on Fused Multiply-Add Architectures.
$\checkmark$ They provide a short proof of optimality in the case of trees.
$x$ Their method handles FMAs in DAGs but is not generic.

■ We wish to integrate numerical verification in the process of instruction selection.

## The NOLTIS tiling algorithm

```
Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)
    1: BottomUpDP()
    2: TopDownSelect()
    3: ImproveCSEDecision()
    4: BottomUpDP()
    5: TopDownSelect()
```


the progress step by step of the tiling algorithm on the expression $\left(a_{0}^{2}+\left(\left(a_{1} \times a_{2}\right)+\left(a_{3} \ll \alpha\right)\right)\right)$

## The NOLTIS tiling algorithm

```
Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)
    1: BottomUpDP()
    2: TopDownSelect()
    3: ImproveCSEDecision()
    4: BottomUpDP()
    5: TopDownSelect()
```


the progress step by step of the tiling algorithm on the expression $\left(a_{0}^{2}+\left(\left(a_{1} \times a_{2}\right)+\left(a_{3} \ll \alpha\right)\right)\right)$

## The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)
1: BottomUpDP()
2: TopDownSelect()
3: ImproveCSEDecision()
4: BottomUpDP()
5: TopDownSelect()

the progress step by step of the tiling algorithm

$$
\text { on the expression }\left(a_{0}^{2}+\left(\left(a_{1} \times a_{2}\right)+\left(a_{3} \ll \alpha\right)\right)\right)
$$

## The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)
1: BottomUpDP()
2: TopDownSelect()
3: ImproveCSEDecision()
4: BottomUpDP()
5: TopDownSelect()

the progress step by step of the tiling algorithm

$$
\text { on the expression }\left(a_{0}^{2}+\left(\left(a_{1} \times a_{2}\right)+\left(a_{3} \ll \alpha\right)\right)\right)
$$

## The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)
1: BottomUpDP()
2: TopDownSelect()
3: ImproveCSEDecision()
4: BottomUpDP()
5: TopDownSelect()

the progress step by step of the tiling algorithm

$$
\text { on the expression }\left(a_{0}^{2}+\left(\left(a_{1} \times a_{2}\right)+\left(a_{3} \ll \alpha\right)\right)\right)
$$

## The NOLTIS tiling algorithm

Near-Optimal Instruction Selection algorithm (Koes and Goldstein in CGO-2008)
1: BottomUpDP()
2: TopDownSelect()
3: ImproveCSEDecision()
4: BottomUpDP()
5: TopDownSelect()

the progress step by step of the tiling algorithm

$$
\text { on the expression }\left(a_{0}^{2}+\left(\left(a_{1} \times a_{2}\right)+\left(a_{3} \ll \alpha\right)\right)\right)
$$

## Instruction tiles considered in CGPE

- Classical tiles

1. addition tile.
2. multiplication tile.
3. shift tile.


## Instruction tiles considered in CGPE

■ Classical tiles

1. addition tile.
2. multiplication tile.
3. shift tile.

- Advanced tiles


4. fma tiles (left and right).
5. add3 tiles (left and right).
6. shiftAdd tiles (available on the ST200 core).
7. square tile.


## Simple example

■ Original code


Listing 4：Original C code
－With the fma in 3 cycles and the shift in 1 cycle
}

```
```

```
```

uint32_t func_tiled(uint32_t T, uint 32_t S)

```
```

```
uint32_t func_tiled(uint32_t T, uint 32_t S)
```

```
```

uint32_t func_tiled(uint32_t T, uint 32_t S)
1
1
1
uint32_t r0 = power (T, -2);
uint32_t r0 = power (T, -2);
uint32_t r0 = power (T, -2);
uint32_t r1 = add (0\times800000000, r0);
uint32_t r1 = add (0\times800000000, r0);
uint32_t r1 = add (0\times800000000, r0);
uint32_t r2 = fma_right (0\times00000020, S, r1);
uint32_t r2 = fma_right (0\times00000020, S, r1);
uint32_t r2 = fma_right (0\times00000020, S, r1);
uint32_t r3 = square(T);
uint32_t r3 = square(T);
uint32_t r3 = square(T);
uint32_t r4 = mul(S, r3);
uint32_t r4 = mul(S, r3);
uint32_t r4 = mul(S, r3);
uint32_t r5 = mul(T, 0x07fe93e4);
uint32_t r5 = mul(T, 0x07fe93e4);
uint32_t r5 = mul(T, 0x07fe93e4);
uint32_t r6 = sub (0\times100000000, r5);
uint32_t r6 = sub (0\times100000000, r5);
uint32_t r6 = sub (0\times100000000, r5);
uint32_t r7 = mul(r4, r6);
uint32_t r7 = mul(r4, r6);
uint32_t r7 = mul(r4, r6);
uint32_t r8 = sub(r2, r7);
uint32_t r8 = sub(r2, r7);
uint32_t r8 = sub(r2, r7);
uint32_t r9 = square(r3);
uint32_t r9 = square(r3);
uint32_t r9 = square(r3);
uint32_t r10 = mul(S, r9);
uint32_t r10 = mul(S, r9);
uint32_t r10 = mul(S, r9);
uint32_t r11 = mul(T, 0x032d6643);
uint32_t r11 = mul(T, 0x032d6643);
uint32_t r11 = mul(T, 0x032d6643);
uint32_t r12 = sub(0\times04eef694, r11);
uint32_t r12 = sub(0\times04eef694, r11);
uint32_t r12 = sub(0\times04eef694, r11);
uint32_t r13 = mul(T, 0 x00aebe7d);
uint32_t r13 = mul(T, 0 x00aebe7d);
uint32_t r13 = mul(T, 0 x00aebe7d);
uint32_t r14 = sub(0\times01c6cebd, r13);
uint32_t r14 = sub(0\times01c6cebd, r13);
uint32_t r14 = sub(0\times01c6cebd, r13);
uint32-t r15 = power(r3, -11);
uint32-t r15 = power(r3, -11);
uint32-t r15 = power(r3, -11);
uint32_t r16 = add(r14, r15);
uint32_t r16 = add(r14, r15);
uint32_t r16 = add(r14, r15);
uint32_t r17 = fma_right(r12, r3, r16);
uint32_t r17 = fma_right(r12, r3, r16);
uint32_t r17 = fma_right(r12, r3, r16);
uint32_t r18 = mul(r10, r17);
uint32_t r18 = mul(r10, r17);
uint32_t r18 = mul(r10, r17);
uint32_t r18= mul(r10,r17);
uint32_t r18= mul(r10,r17);
uint32_t r18= mul(r10,r17);
return r19;

```
    return r19;
```

    return r19;
    ```
```

    エeturn rlainol
    ```
```

    エeturn rlainol
    ```
```

    エeturn rlainol
    ```

\section*{Simple example}

■ Original code


Listing 6: Original C code
    uint 32 _t r0 \(=T \gg 2\);
    (+) \(Q[1.31]\)
    uint \(32 \_\mathrm{t}=\mathrm{t}=0 \times 80000000+\mathrm{r} 0 ; \quad / /(+) \mathrm{Q}[1.31]\)
    uint 32 _t \(r 2=\operatorname{mul}(S, r 1) ; \quad / /(t) Q[2.30]\)
    uint 32 _t \(r 3=0 \times 00000020+r 2 ; \quad / /(+) Q[2.30]\)
    uint \(32 \_\mathrm{t} 4=\operatorname{mul}(\mathrm{T}, \mathrm{T}) ; \quad / /(+) \mathrm{Q}[0.32]\)
    uint \(32_{\text {_t }} \mathrm{r} 5=\operatorname{mul}(\mathrm{S}, \mathrm{r} 4) ; \quad / /(+) \mathrm{Q}[1.31]\)
    uint32_t r6 \(=\operatorname{mul}(\mathrm{T}, 0 \times 07 \mathrm{feg} 3 \mathrm{e} 4) ; \quad / /(+) \quad \mathrm{Q}[1.31]\)
    uint \(32-r 7=0 \times 10000000-r 6 ; \quad / /(-) Q[1.31]\)
    alnt32_tr8 \(\quad\) mul \((r 5, r 7) ; \quad / /(-) Q[2.30]\)
    uint32_t \(r 9=r 3-r 8 ; \quad / /(+) Q[2.30]\)
    uint \(32-\mathrm{t} r 10=\operatorname{mul}(r 4, r 4) ; \quad / /(+) \quad Q[0.32]\)
    aint32_t r11 \(=\operatorname{mul}(S, r 10) ;\)
    uint \(32 \_r 12=\operatorname{mul}(T, 0 \times 032 \mathrm{~d} 6643) ; \quad / /(+) \quad Q[1.31]\)
    uint 32 _t \(r 13=0 \times 04\) eef694-r12; \(\quad / /(-) \quad Q[1.31]\)
    uint 32 _t \(r 14=\operatorname{mul}(T, 0 \times 00\) aebe7d); \(/ /(+) ~ Q[1.31]\)
    uint 32 _t r15 \(=0 \times 01 \mathrm{c} 6 \mathrm{cebd}-\mathrm{r} 14 ; \quad / /(-) \mathrm{Q}[1.31]\)
    uint 32 _t \(r 16=r 4 \gg 11 ; \quad / /(-) Q[1.31]\)
    uint32_t r17 \(=r 15+r 16 ; \quad / /(-) Q[1.31]\)
    uint 32 _t \(r 18=\operatorname{mul}(r 4, r 17) ; \quad / /(-) \quad 0[1.31]\)
    uint32_t \(r 19=r 13+r 18 ; \quad / /(-) Q[1.31]\)
    uint 32 _t \(r 20=\operatorname{mul}(r 11, r 19) ; \quad / /(-) Q[2.30]\)
    uint 32 _t \(r 21=r 9-r 20 ; \quad / /(+) Q[2.30]\)
    return r21;
\}
- With the fma in 3 cycles and the shift in 3 cycle
```

uint32_t func_tiled(uint32_t T, wint32_t S)

```
uint32_t func_tiled(uint32_t T, wint32_t S)
{
{
    uint32_t r0 = fma_right (0x80000000, T, 0\times40000000);
    uint32_t r0 = fma_right (0x80000000, T, 0\times40000000);
    uint32_t r1 = fma_right (0x00000020, S, r0);
    uint32_t r1 = fma_right (0x00000020, S, r0);
    uint32_t r2 = square(T);
    uint32_t r2 = square(T);
    uint32_t r3 = mul(S, r2);
    uint32_t r3 = mul(S, r2);
    uint32_t r4 = mul(T, 0x07fe93e4);
    uint32_t r4 = mul(T, 0x07fe93e4);
    uint32_t r5 = sub (0\times10000000, r4);
    uint32_t r5 = sub (0\times10000000, r4);
    uint32_t r6 = mul(r3, r5);
    uint32_t r6 = mul(r3, r5);
    uint32_t r7 = sub(r1, r6);
    uint32_t r7 = sub(r1, r6);
    uint32_t r8 = square(r2);
    uint32_t r8 = square(r2);
    uint32_t r9 = mul(S, r8);
    uint32_t r9 = mul(S, r8);
    uint32_t r10=mul(T, 0x032d6643);
    uint32_t r10=mul(T, 0x032d6643);
    uint32_t r11 = sub (0x04eef694, r10);
    uint32_t r11 = sub (0x04eef694, r10);
    uint32_t r12 = mul(T, 0x00aebe7d);
    uint32_t r12 = mul(T, 0x00aebe7d);
    uint32_t r13= sub (0x01c6cebd, r12);
    uint32_t r13= sub (0x01c6cebd, r12);
    uint32_t r14 = power(r2, -11);
    uint32_t r14 = power(r2, -11);
    uint32_t r15 = add (r13, r14);
    uint32_t r15 = add (r13, r14);
    uint32_t r16 = fma_right(r11, r2, r15);
    uint32_t r16 = fma_right(r11, r2, r15);
    uint32_t r17 = mul(r9, r16);
    uint32_t r17 = mul(r9, r16);
    uint32_t r18=sub(r7, r17);
    uint32_t r18=sub(r7, r17);
    return r18;
    return r18;
}
```

}

```

\section*{Listing 7: Code after tiling}

\section*{Remarks on instruction selection in CGPE}
- A separation is achieved between the computation of DAGs (Intermediate Representation) and the code generation process
- the code can be generated according different criteria \(\rightsquigarrow\) cost function
- this general approach allows to tackle other problems (sum, dot-product, ...)


\section*{Remarks on instruction selection in CGPE}
- A separation is achieved between the computation of DAGs (Intermediate Representation) and the code generation process
- the code can be generated according different criteria \(\rightsquigarrow\) cost function
- this general approach allows to tackle other problems (sum, dot-product, ...)
- We are not bound to use these tiles, we can add many others
- CGPE can thus serve as a platform of simulation
- this general approach allows to give some feedback on the eventual need or usefulness of some tiles


\section*{Outline of the talk}

\section*{1. The CGPE tool}
2. Instruction selection: an extension of CGPE
3. Conclusion and perspectives

\section*{Conclusion}

■ Code synthesis for fast and certified polynomial evaluation
- fast and certified C codes, in fixed point arithmetic
- tool to automate polynomial evaluation implementation, using at best architectural features
- implemented in the tool CGPE (Code Generation for Polynomial Evaluation)
http://cgpe.gforge.inria.fr/

■ Extension of CGPE based on instruction selection:
- automatic handling of all input formats.
- better usage of the advanced architectural features (such as fma, add-3, shift-and-add, ...)
- using a tiling algorithm implies more modularity, as code generation is now an independant process.

\section*{Current work and perspectives}

■ Current work
- keep working on instruction selection in CGPE
- make CGPE more general to tackle other problems, like matrix inversion and multiplication, ...

\section*{Current work and perspectives}

■ Current work
- keep working on instruction selection in CGPE
- make CGPE more general to tackle other problems, like matrix inversion and multiplication, ...

■ Further extensions of CGPE
- handle other arithmetics like floating-point arithmetic, where the fma tile is more and more ubiquitous
- target other architectures (like FPGAs)

\section*{Synthesis of fixed-point programs based on instruction selection}
... the case of polynomial evaluation

\section*{Amine Najahi}

Advisors: Matthieu Martel and Guillaume Revy
Joint work with Christophe Mouilleron

> Équipe-projet DALI, Univ. Perpignan Via Domitia
> LIRMM, CNRS: UMR 5506 - Univ. Montpellier 2```

