Welcome to mirror list, hosted at ThFree Co, Russian Federation.

example_manual_matmul.html « www « polly - github.com/llvm/llvm-project.git - Unnamed repository; edit this file 'description' to name the repository.
summaryrefslogtreecommitdiff
blob: 50faa6d97912b9e7e9596327dc29df0bf9e6cb75 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
          "http://www.w3.org/TR/html4/strict.dtd">
<!-- Material used from: HTML 4.01 specs: http://www.w3.org/TR/html401/ -->
<html>
<head>
  <META http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
  <title>Polly - Examples</title>
  <link type="text/css" rel="stylesheet" href="menu.css">
  <link type="text/css" rel="stylesheet" href="content.css">
</head>
<body>
<div id="box">
<!--#include virtual="menu.html.incl"-->
<div id="content">
<!--=====================================================================-->
<h1>Execute the individual Polly passes manually</h1>
<!--=====================================================================-->

<p>
This example presents the individual passes that are involved when optimizing
code with Polly. We show how to execute them individually and explain for each
which analysis is performed or what transformation is applied. In this example
the polyhedral transformation is user-provided to show how much performance
improvement can be expected by an optimal automatic optimizer.</p>

The files used and created in this example are available in the Polly checkout
in the folder <em>www/experiments/matmul</em>. They can be created automatically
by running the <em>www/experiments/matmul/runall.sh</em> script.

<ol>
<li><h4>Create LLVM-IR from the C code</h4>

Polly works on LLVM-IR. Hence it is necessary to translate the source files into
LLVM-IR. If more than on file should be optimized the files can be combined into
a single file with llvm-link.

<pre class="code">clang -S -emit-llvm matmul.c -o matmul.s</pre>
</li>


<li><h4>Load Polly automatically when calling the 'opt' tool</h4>

Polly is not built into opt or bugpoint, but it is a shared library that needs
to be loaded into these tools explicitally. The Polly library is called
LVMPolly.so. For a cmake build it is available in the build/lib/ directory,
autoconf creates the same file in
build/tools/polly/{Release+Asserts|Asserts|Debug}/lib. For convenience we create
an alias that automatically loads Polly if 'opt' is called.
<pre class="code">
export PATH_TO_POLLY_LIB="~/polly/build/lib/"
alias opt="opt -load ${PATH_TO_POLLY_LIB}/LLVMPolly.so"</pre>
</li>

<li><h4>Prepare the LLVM-IR for Polly</h4>

Polly is only able to work with code that matches a canonical form. To translate
the LLVM-IR into this form we use a set of canonicalication passes. They are
scheduled by using '-polly-canonicalize'.
<pre class="code">opt -S -polly-canonicalize matmul.s &gt; matmul.preopt.ll</pre></li>

<li><h4>Show the SCoPs detected by Polly (optional)</h4>

To understand if Polly was able to detect SCoPs, we print the
structure of the detected SCoPs. In our example two SCoPs were detected. One in
'init_array' the other in 'main'.

<pre class="code">opt -basicaa -polly-ast -analyze -q matmul.preopt.ll</pre>

<pre>
init_array():
for (c2=0;c2&lt;=1023;c2++) {
  for (c4=0;c4&lt;=1023;c4++) {
    Stmt_5(c2,c4);
  }
}

main():
for (c2=0;c2&lt;=1023;c2++) {
  for (c4=0;c4&lt;=1023;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6&lt;=1023;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
</pre>
</li>
<li><h4>Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)</h4>

Polly can use graphviz to graphically show a CFG in which the detected SCoPs are
highlighted. It can also create '.dot' files that can be translated by
the 'dot' utility into various graphic formats.

<pre class="code">opt -basicaa -view-scops -disable-output matmul.preopt.ll
opt -basicaa -view-scops-only -disable-output matmul.preopt.ll</pre>
The output for the different functions<br />
view-scops:
<a href="experiments/matmul/scops.main.dot.png">main</a>,
<a href="experiments/matmul/scops.init_array.dot.png">init_array</a>,
<a href="experiments/matmul/scops.print_array.dot.png">print_array</a><br />
view-scops-only:
<a href="experiments/matmul/scopsonly.main.dot.png">main</a>,
<a href="experiments/matmul/scopsonly.init_array.dot.png">init_array</a>,
<a href="experiments/matmul/scopsonly.print_array.dot.png">print_array</a>
</li>

<li><h4>View the polyhedral representation of the SCoPs</h4>
<pre class="code">opt -basicaa -polly-scops -analyze matmul.preopt.ll</pre>
<pre>
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
'for.cond =&gt; for.end19' in function 'init_array':
   Context:
   { [] }
   Statements {
   	Stmt_5
           Domain&nbsp;:=
               { Stmt_5[i0, i1]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 };
           Schedule&nbsp;:=
               { Stmt_5[i0, i1] -&gt; schedule[0, i0, 0, i1, 0] };
           WriteAccess&nbsp;:=
               { Stmt_5[i0, i1] -&gt; MemRef_A[1037i0 + i1] };
           WriteAccess&nbsp;:=
               { Stmt_5[i0, i1] -&gt; MemRef_B[1047i0 + i1] };
   	FinalRead
           Domain&nbsp;:=
               { FinalRead[0] };
           Schedule&nbsp;:=
               { FinalRead[i0] -&gt; schedule[200000000, o1, o2, o3, o4] };
           ReadAccess&nbsp;:=
               { FinalRead[i0] -&gt; MemRef_A[o0] };
           ReadAccess&nbsp;:=
               { FinalRead[i0] -&gt; MemRef_B[o0] };
   }
[...]
Printing analysis 'Polly - Create polyhedral description of Scops' for region:
'for.cond =&gt; for.end30' in function 'main':
   Context:
   { [] }
   Statements {
   	Stmt_4
           Domain&nbsp;:=
               { Stmt_4[i0, i1]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 };
           Schedule&nbsp;:=
               { Stmt_4[i0, i1] -&gt; schedule[0, i0, 0, i1, 0, 0, 0] };
           WriteAccess&nbsp;:=
               { Stmt_4[i0, i1] -&gt; MemRef_C[1067i0 + i1] };
   	Stmt_6
           Domain&nbsp;:=
               { Stmt_6[i0, i1, i2]&nbsp;: i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023 };
           Schedule&nbsp;:=
               { Stmt_6[i0, i1, i2] -&gt; schedule[0, i0, 0, i1, 1, i2, 0] };
           ReadAccess&nbsp;:=
               { Stmt_6[i0, i1, i2] -&gt; MemRef_C[1067i0 + i1] };
           ReadAccess&nbsp;:=
               { Stmt_6[i0, i1, i2] -&gt; MemRef_A[1037i0 + i2] };
           ReadAccess&nbsp;:=
               { Stmt_6[i0, i1, i2] -&gt; MemRef_B[i1 + 1047i2] };
           WriteAccess&nbsp;:=
               { Stmt_6[i0, i1, i2] -&gt; MemRef_C[1067i0 + i1] };
   	FinalRead
           Domain&nbsp;:=
               { FinalRead[0] };
           Schedule&nbsp;:=
               { FinalRead[i0] -&gt; schedule[200000000, o1, o2, o3, o4, o5, o6] };
           ReadAccess&nbsp;:=
               { FinalRead[i0] -&gt; MemRef_C[o0] };
           ReadAccess&nbsp;:=
               { FinalRead[i0] -&gt; MemRef_A[o0] };
           ReadAccess&nbsp;:=
               { FinalRead[i0] -&gt; MemRef_B[o0] };
   }
[...]
</pre>
</li>

<li><h4>Show the dependences for the SCoPs</h4>
<pre class="code">opt -basicaa -polly-dependences -analyze matmul.preopt.ll</pre>
<pre>Printing analysis 'Polly - Calculate dependences for SCoP' for region:
'for.cond =&gt; for.end19' in function 'init_array':
   Must dependences:
       {  }
   May dependences:
       {  }
   Must no source:
       {  }
   May no source:
       {  }
Printing analysis 'Polly - Calculate dependences for SCoP' for region:
'for.cond =&gt; for.end30' in function 'main':
   Must dependences:
       {  Stmt_4[i0, i1] -&gt; Stmt_6[i0, i1, 0]&nbsp;:
              i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023;
          Stmt_6[i0, i1, i2] -&gt; Stmt_6[i0, i1, 1 + i2]&nbsp;:
              i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1022;
          Stmt_6[i0, i1, 1023] -&gt; FinalRead[0]&nbsp;:
              i1 &lt;= 1091540 - 1067i0 and i1 &gt;= -1067i0 and i1 &gt;= 0 and i1 &lt;= 1023;
          Stmt_6[1023, i1, 1023] -&gt; FinalRead[0]&nbsp;:
              i1 &gt;= 0 and i1 &lt;= 1023
       }
   May dependences:
       {  }
   Must no source:
       {  Stmt_6[i0, i1, i2] -&gt; MemRef_A[1037i0 + i2]&nbsp;:
              i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023;
          Stmt_6[i0, i1, i2] -&gt; MemRef_B[i1 + 1047i2]&nbsp;:
              i0 &gt;= 0 and i0 &lt;= 1023 and i1 &gt;= 0 and i1 &lt;= 1023 and i2 &gt;= 0 and i2 &lt;= 1023;
          FinalRead[0] -&gt; MemRef_A[o0];
          FinalRead[0] -&gt; MemRef_B[o0]
          FinalRead[0] -&gt; MemRef_C[o0]&nbsp;:
              o0 &gt;= 1092565 or (exists (e0 = [(o0)/1067]: o0 &lt;= 1091540 and o0 &gt;= 0
              and 1067e0 &lt;= -1024 + o0 and 1067e0 &gt;= -1066 + o0)) or o0 &lt;= -1;
       }
   May no source:
       {  }
</pre></li>

<li><h4>Export jscop files</h4>

Polly can export the polyhedral representation in so called jscop files. Jscop
files contain the polyhedral representation stored in a JSON file.
<pre class="code">opt -basicaa -polly-export-jscop matmul.preopt.ll</pre>
<pre>Writing SCoP 'for.cond =&gt; for.end19' in function 'init_array' to './init_array___%for.cond---%for.end19.jscop'.
Writing SCoP 'for.cond =&gt; for.end30' in function 'main' to './main___%for.cond---%for.end30.jscop'.
</pre></li>

<li><h4>Import the changed jscop files and print the updated SCoP structure
(optional)</h4>
<p>Polly can reimport jscop files, in which the schedules of the statements are
changed. These changed schedules are used to descripe transformations.
It is possible to import different jscop files by providing the postfix
of the jscop file that is imported.</p>
<p> We apply three different transformations on the SCoP in the main function.
The jscop files describing these transformations are hand written (and available
in <em>www/experiments/matmul</em>).

<h5>No Polly</h5>

<p>As a baseline we do not call any Polly code generation, but only apply the
normal -O3 optimizations.</p>

<pre class="code">
opt matmul.preopt.ll -basicaa \
    -polly-import-jscop \
    -polly-ast -analyze
</pre>
<pre>
[...]
main():
for (c2=0;c2&ltg;=1535;c2++) {
  for (c4=0;c4&ltg;=1535;c4++) {
    Stmt_4(c2,c4);
    for (c6=0;c6&ltg;=1535;c6++) {
      Stmt_6(c2,c4,c6);
    }
  }
}
[...]
</pre>
<h5>Interchange (and Fission to allow the interchange)</h5>
<p>We split the loops and can now apply an interchange of the loop dimensions that
enumerate Stmt_6.</p>
<pre class="code">
opt matmul.preopt.ll -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged \
    -polly-ast -analyze
</pre>
<pre>
[...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    for (c6=0;c6&lt;=1535;c6++) {
      Stmt_6(c2,c6,c4);
    }
  }
}
[...]
</pre>
<h5>Interchange + Tiling</h5>
<p>In addition to the interchange we tile now the second loop nest.</p>

<pre class="code">
opt matmul.preopt.ll -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
    -polly-ast -analyze
</pre>
<pre>
[...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2&lt;=1535;c2+=64) {
  for (c3=0;c3&lt;=1535;c3+=64) {
    for (c4=0;c4&lt;=1535;c4+=64) {
      for (c5=c2;c5&lt;=c2+63;c5++) {
        for (c6=c4;c6&lt;=c4+63;c6++) {
          for (c7=c3;c7&lt;=c3+63;c7++) {
            Stmt_6(c5,c7,c6);
          }
        }
      }
    }
  }
}
[...]
</pre>
<h5>Interchange + Tiling + Strip-mining to prepare vectorization</h5>
To later allow vectorization we create a so called trivially parallelizable
loop. It is innermost, parallel and has only four iterations. It can be
replaced by 4-element SIMD instructions.
<pre class="code">
opt matmul.preopt.ll -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
    -polly-ast -analyze </pre>

<pre>
[...]
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
[...]
main():
for (c2=0;c2&lt;=1535;c2++) {
  for (c4=0;c4&lt;=1535;c4++) {
    Stmt_4(c2,c4);
  }
}
for (c2=0;c2&lt;=1535;c2+=64) {
  for (c3=0;c3&lt;=1535;c3+=64) {
    for (c4=0;c4&lt;=1535;c4+=64) {
      for (c5=c2;c5&lt;=c2+63;c5++) {
        for (c6=c4;c6&lt;=c4+63;c6++) {
          for (c7=c3;c7&lt;=c3+63;c7+=4) {
            for (c8=c7;c8&lt;=c7+3;c8++) {
              Stmt_6(c5,c8,c6);
            }
          }
        }
      }
    }
  }
}
[...]
</pre>

</li>

<li><h4>Codegenerate the SCoPs</h4>
<p>
This generates new code for the SCoPs detected by polly.
If -polly-import-jscop is present, transformations specified in the imported
jscop files will be applied.</p>
<pre class="code">opt matmul.preopt.ll | opt -O3 &gt; matmul.normalopt.ll</pre>
<pre class="code">
opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged \
    -polly-codegen matmul.preopt.ll \
   | opt -O3 &gt; matmul.polly.interchanged.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged'.
</pre>
<pre class="code">
opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled \
    -polly-codegen matmul.preopt.ll \
   | opt -O3 &gt; matmul.polly.interchanged+tiled.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged+tiled'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged+tiled'.
</pre>
<pre class="code">
opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
    -polly-codegen -polly-vectorizer=polly matmul.preopt.ll \
   | opt -O3 &gt; matmul.polly.interchanged+tiled+vector.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
</pre>
<pre class="code">
opt -basicaa \
    -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector \
    -polly-codegen -polly-vectorizer=polly -polly-parallel matmul.preopt.ll \
  | opt -O3 &gt; matmul.polly.interchanged+tiled+openmp.ll</pre>
<pre>
Reading JScop 'for.cond =&gt; for.end19' in function 'init_array' from
    './init_array___%for.cond---%for.end19.jscop.interchanged+tiled+vector'.
File could not be read: No such file or directory
Reading JScop 'for.cond =&gt; for.end30' in function 'main' from
    './main___%for.cond---%for.end30.jscop.interchanged+tiled+vector'.
</pre>

<li><h4>Create the executables</h4>

Create one executable optimized with plain -O3 as well as a set of executables
optimized in different ways with Polly. One changes only the loop structure, the
other adds tiling, the next adds vectorization and finally we use OpenMP
parallelism.
<pre class="code">
llc matmul.normalopt.ll -o matmul.normalopt.s &amp;&amp; \
    gcc matmul.normalopt.s -o matmul.normalopt.exe
llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s &amp;&amp; \
    gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s &amp;&amp; \
    gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s &amp;&amp; \
    gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s &amp;&amp; \
    gcc -lgomp matmul.polly.interchanged+tiled+vector+openmp.s -o matmul.polly.interchanged+tiled+vector+openmp.exe </pre>

<li><h4>Compare the runtime of the executables</h4>

By comparing the runtimes of the different code snippets we see that a simple
loop interchange gives here the largest performance boost. However by adding
vectorization and by using OpenMP we can further improve the performance
significantly.
<pre class="code">time ./matmul.normalopt.exe</pre>
<pre>42.68 real, 42.55 user, 0.00 sys</pre>
<pre class="code">time ./matmul.polly.interchanged.exe</pre>
<pre>04.33 real, 4.30 user, 0.01 sys</pre>
<pre class="code">time ./matmul.polly.interchanged+tiled.exe</pre>
<pre>04.11 real, 4.10 user, 0.00 sys</pre>
<pre class="code">time ./matmul.polly.interchanged+tiled+vector.exe</pre>
<pre>01.39 real, 1.36 user, 0.01 sys</pre>
<pre class="code">time ./matmul.polly.interchanged+tiled+vector+openmp.exe</pre>
<pre>00.66 real, 2.58 user, 0.02 sys</pre>
</li>
</ol>

</div>
</div>
</body>
</html>