00:00 Okay, so today, as it says on this slide, I will try to tell you a little bit about what compilers are trying to do, and what you, as a user of our systems and tools, can do to help the compiler do what one hopes it would.

00:23 So first I'll talk just a little bit, very generally, about compilers and code efficiency. Then I'll give some examples that you may recognize of where issues come up when one looks at what the compiler can and cannot do; I'll mention that very briefly. And then we'll move on to talk about what compilers are trying to do to improve the code while respecting how it's supposed to behave on the platforms. First, a couple of quotes.

01:01 That's why I chose to start this course by focusing on understanding and learning about single-node performance: going first from just single-threaded performance, and then trying to figure out the multi-threaded part on a single node using OpenMP.

01:21 So, for anyone that may not be familiar with the name: he is a quite well-known scientist who was instrumental, as one of the leaders, in developing pretty much every generation of the MPI standard, for instance, for programming clusters. He is now at the University of Illinois, but he worked at Argonne National Labs for many years and is a recipient of many awards and honors.

02:03 And, um, he tries to emphasize, as I have done many times in the past, that to get efficient code one needs to understand the platform, take good algorithms, and then try to make sure that one also gets a good implementation. So it's one of these efforts which is kind of vertical: one needs to understand the layers that one cuts across, kind of contrary to the common practice of using layered abstractions, where you don't necessarily need to worry about the layers below or the one above and can just work in your own layer. That tends not to work too well when it comes to getting good efficiency out of the system.

03:05 Just as a little bit of an aside, since it is often perhaps not said: it's not the focus of today's lecture, but we'll talk about it later in the course. A lot of the progress in terms of getting useful results out of computer systems also comes from progress in algorithms, and I thought it was nice just to include that and show what has happened. This comes from the scientific computing domain.

03:31 And, uh, in terms of, yes, code issues: as this says, you know, it's not uncommon that you can get a factor of 10, and in fact it's not unusual to get a lot more than that. That's part of the reason why, in this course, I start by giving you the tools to get an understanding of the actual performance in terms of resource utilization, and then get you an understanding of what the resources actually are. Um, so I think that's enough comments on this slide.

04:18 Here is a quote from another very well-known person in the computer science field, Leiserson, who is a professor at MIT and has also won many awards. He is the creator, with his group I should say, of this Cilk, um, programming language, if you like, and some kind of runtime system. This effort was later acquired by Intel, and it's now part of the Intel set of tools. But anyway, it shows the difference, in this case, of going from straightforward code to highly optimized C code using the vector instructions. So it's orders of magnitude.

05:06 Um, the point is that the code you begin with might be very good in terms of programmer productivity, but in terms of resources, non-human resources, that means the hardware, it may not be all that good. So it depends upon what your intent is for the code: if it's going to be a production code, it likely needs to be very efficient, whereas if it's something you're just doing as a sort of tool to understand something, that matters less.

05:52 So then, I'm including here some examples of what the compiler flags are, in terms of what type of optimizations, or lack thereof, are enabled at the various levels of optimization.

06:15 I guess the zero level is basically: do nothing and take the code as it is, whereas level one tries to reduce execution time by allowing a number of optimizations that are listed here. I'm not going to go through them all, but one of the main restrictions on the optimizations being applied at level one is that there's a lot of attention paid to the extent to which optimizations may increase the code size, and we'll talk about some of these things in some detail later on in this course. So sometimes there is a trade-off between speed and code size.

07:07 And then, with this level two optimization, one is still, you know, trying to speed things up, but now also starting to include vectorization-type features. And I will talk about vectorization a little bit later in this class today and also next time.

07:31 Um, so in this case, since it cannot get it completely right, the compiler just tries to make a guesstimate as to what the potential increase in code size is, and whether that is eventually leading to reduced running time; that may or may not be the case. And then there's the third level, where typically there are further optimizations that are listed here by name. We'll talk about some of these things, just to give you a sense of what a compiler, most of the time, is reasonably successful at doing, at least in a state-of-the-art compiler.

08:19 But whether there is success or not depends on the context of the code, and on whether the code analysis can establish that the result of these transformations of the code is safe. If it is not able to conclude that it's safe, the compiler cannot do these optimizations. On the other hand, you as the programmer may know that it is safe, and then, by structuring your code, you can make the compiler successful. So here is a list of optimizations by name, and we will talk about some of these, though not all of them; again, I have focused on things that are, uh, easy to do in terms of how you write your code.
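To make the flag discussion concrete, here is a minimal sketch of how one might compare the levels on a small kernel; the file name and the loop are my own stand-ins, but the gcc flags themselves are standard:

    /* triad.c -- a tiny stream-style kernel for comparing optimization levels.
     * Build it several ways and compare run time and binary size:
     *   gcc -O0 triad.c -o triad_O0   (no optimization)
     *   gcc -O1 triad.c -o triad_O1   (basic optimizations, code size respected)
     *   gcc -O2 triad.c -o triad_O2   (adds vectorization-type features)
     *   gcc -O3 triad.c -o triad_O3   (most aggressive, may grow the code)
     *   gcc -O1 -S triad.c            (emit the assembly, to inspect what was done)
     */
    #include <stdio.h>
    #define N 10000000
    static double a[N], b[N], c[N];

    int main(void) {
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
        for (long i = 0; i < N; i++)        /* the "triad": a = b + s*c */
            a[i] = b[i] + 3.0 * c[i];
        printf("%f\n", a[N / 2]);           /* use the result so it is not dead code */
        return 0;
    }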

09:20 And here is an example of that, in terms of the stream benchmark. I hope you can see the numbers; this is just compiling the same code with different optimization flags, using the GCC compiler, as it says. So look at the top piece of the illustration here on the slide: the first column is the names of the stream functions, and the second column is the data rates measured for those functions.

10:00 If you look at the run that used no optimization and just pick one of the functions, say the triad, it says about seven, in terms of the data rate. Then the level one optimization jumped to 13.4, so it almost doubled in terms of the data rate, and the code size even went down to about two thirds. So that level paid off very handsomely, both in speed and code size. The level two optimization, looking at the triad, went from 13.4 to a little above that, so it gave a slightly better data rate for that particular stream function, while the code size now grew a little bit, not much.

11:05 But then, if you went to the most aggressive optimization, the code size pretty much became double that of the level one optimization, and the performance in fact actually dropped. So this is again the trade-off: the optimization was intended to be good, but in large part it sometimes may not have been successful, and that may also have been affected by the code growth it produced. That can reduce performance, because code has to be loaded through the same data paths, except very close to the functional units; everything through the level-two caches is shared in terms of data paths, so code must compete with data for some of the shared levels. So, does anyone want to ask a question or comment on this kind of trade-off and how it worked out in this particular case?

12:10 If not — so, this is pretty much what I already said: the compiler writers have to make sure that the correctness of the code is preserved. The compiler cannot do anything that's somewhat risky and has to be seemingly conservative; it's not always possible in the analysis to figure out which source code transformations are possible without things potentially going wrong in some extreme cases.

12:48 So here is, I guess — I'm sure this is familiar to most computer science students, but maybe to some of you not so much. Compilers consist of what is known as the front end and the back end. The front end is the thing that takes the program, whether it's C or Fortran, and translates it into what's known as an intermediate language, or intermediate format, and that, many times, is common to several programming languages. Whether you use C or Fortran, the intermediate format is the same, and that means one can then use the same kind of analysis tools working on this intermediate format.

13:36 Then one has a code generator that is decoupled to some degree from the programming language, but it is dependent on the hardware platform. So it has the instruction sets for the particular platforms and knows about the features of that platform, and in that sense the optimizations that also happen at that level are much more machine dependent, as opposed to dependent on the language or the logical structure of the code.

14:17 I'll talk mostly about the machine-independent transformations today. Um, the other thing is that doing what's known as whole-program analysis is very expensive, as hopefully anyone can imagine: if you have hundreds of thousands of lines of code, the whole-program analysis is an incredibly complex data-dependence task. So many, uh, compilers, they all look at what's known as basic blocks, that I mentioned, uh, last time.

15:06 Yeah, this just repeats the picture that I showed last time in terms of what basic blocks are. So, transformations within those blocks are reasonably safe to do. But then, once one moves beyond the basic blocks to do the analysis, it becomes a lot more complex and time consuming to do the analysis.

15:36 So one of my colleagues, when I was at Yale University, was one of the first to try to exploit this idea of instruction-level parallelism, and for this the code has to be in the right format. So he designed an architecture and started a company to build the machine for it; it was known as Multiflow, a VLIW machine. But then, as people said, you know, you need a supercomputer to actually compile the code for your supercomputer: compiling the code could take hours when you tried to do the analysis. So, even today, the extent to which compilers do interprocedural analysis is kind of limited.

16:40 Um, so now I will talk about some of what these optimization names mean, and talk through an example of what happens on one particular processor, one particular Intel architecture, and show what they mean.

17:00 Inlining of functions is essentially this: instead of having the function call, which makes, in some ways, a nicer and cleaner code, you basically remove the call and substitute the explicit text of the body of the function into the line of code where the call was. Here is just an example: before inlining, you basically call this print function three times, whereas after inlining, you replace each one of the calls with the statements that were used to define the print function. So it's nothing magic, but what it does is it avoids the overhead of doing function calls, and that can be a big gain. So that's known as inlining, and compilers are reasonably successful at it, at least as long as the functions are not too complex and it's deemed by the compiler to be an optimization that's likely to win. But clearly, inlining will increase the code size. This is part of what compilers do. But if you're concerned about the overhead, or not certain that the compiler will be successful at it on your own code, and you know that this is okay, then your code may become a little less, you know, pretty, but it is potentially significantly more efficient.
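To make that concrete, here is a minimal sketch; the helper function is invented for illustration:

    #include <stdio.h>

    /* Before inlining: a small helper called three times. */
    static void print_point(int x, int y) {
        printf("(%d, %d)\n", x, y);
    }

    void before(void) {
        print_point(1, 2);
        print_point(3, 4);
        print_point(5, 6);
    }

    /* After inlining: each call is replaced by the body of the helper,
     * so the call/return overhead and stack-frame setup disappear. */
    void after(void) {
        printf("(%d, %d)\n", 1, 2);
        printf("(%d, %d)\n", 3, 4);
        printf("(%d, %d)\n", 5, 6);
    }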

18:51 Student: Yeah, is this primarily, um, eliminating the overhead of a stack frame? Because I can't imagine anything else where it would be very helpful.

19:07 Instructor: It is basically setting up the stack frame and maintaining it; that's what it saves. So it takes some cycles away.

19:18 This is another one that, again, to me, is good programming practice. I guess the official name is code movement, or code hoisting: essentially, don't do things inside loops that are invariant with respect to the loop. Do them once, instead of doing them every time in the loop. In this case, it is basically an operation on quantities that don't change — neither the number two nor, you know, pi keeps changing — so that's something you could just as well compute outside the loop and then just use inside the loop. So, in actual fact, yes, it costs you a temp variable, but, uh, this case is very trivial, and it reduces the number of operations inside the loop.
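A minimal sketch of this kind of hoisting; the 2*pi factor is my guess at the slide's example:

    #include <math.h>
    #define N 1000

    /* Before: the loop-invariant factor 2*pi/N is recomputed every iteration. */
    void fill_before(double *a) {
        for (int i = 0; i < N; i++)
            a[i] = sin(2.0 * M_PI / N * i);
    }

    /* After code motion: compute the invariant once, at the cost of a temp. */
    void fill_after(double *a) {
        double factor = 2.0 * M_PI / N;   /* hoisted out of the loop */
        for (int i = 0; i < N; i++)
            a[i] = sin(factor * i);
    }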

20:22 And here is a little bit more of this, a detailed example that I borrowed from colleagues at CMU. It shows, a little bit again, being careful about how you write things. I would say there's nothing wrong here: all of these things are just source-to-source transformations that are aimed at reducing the amount of work that has to be done, sometimes at the expense of some additional temporary variables or memory locations. So in this case, the example is just that: there is again something inside the loop, in this case an array reference, and instead of doing the n times i computation in every iteration in the assignment statement, you just compute it outside, thereby avoiding that computation in every assignment.
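A sketch of what I take that example to be, in the style of the CMU materials, hoisting the n*i index computation out of the inner loop:

    /* Before: n*i is recomputed for every element of the row. */
    void set_row_before(double *a, const double *b, long i, long n) {
        for (long j = 0; j < n; j++)
            a[n * i + j] = b[j];
    }

    /* After: the invariant n*i is computed once per row. */
    void set_row_after(double *a, const double *b, long i, long n) {
        long ni = n * i;              /* hoisted out of the loop */
        for (long j = 0; j < n; j++)
            a[ni + j] = b[j];
    }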

21:23 And this was, yes, an example of what, in this case, the compiler managed to do. I'm not going to try to explain all of the assembly code, even though I recommend anyone that wants to do good code optimization to look at the assembly that the compiler generates and try to understand it, because that shows what actually happens. So the two yellow boxes are before and after compilation with, in this case, the level one optimization, and it shows that in this case the compiler did the right thing: it assigned a new variable to the product of n and i, and then that's in fact what is used in the loop. It also did a little bit more, as you can see: it also used a pointer, and then used a simple increment of the pointer instead of having to do the full array reference.

22:44 Yeah, um, the next example has to do with, as it says at the top of the slide, trying to keep things local in registers. So, if one deals with arrays, sometimes the compiler has difficulty figuring out how to keep array variables local in registers, and it may in fact update things in memory in every iteration of the particular loop. In this case, the code sums all the elements of each row of the array A into a vector b: b[i] is one element, and for each row of A we just add up the elements in this little example. So what may seem simple becomes many array references that are hard for the, uh, compiler to sometimes optimize out by creating a local variable.

23:50 So let's see what happens in these situations. The code basically says that the A's are initialized, and then we call this function to compute the row sums and store them in b. The value of b[i] is stored in every iteration, and that means there's memory traffic, again to the arrays, for b in the inner loop as well. So the thing to do, to try to avoid that, is to use a local variable.
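A sketch of the row-sum example as I understand it (array sizes and names invented):

    #define N 1000

    /* Before: b[i] is updated in memory on every inner iteration, since the
     * compiler cannot easily prove that b does not overlap a. */
    void row_sums_before(double a[N][N], double *b) {
        for (int i = 0; i < N; i++) {
            b[i] = 0.0;
            for (int j = 0; j < N; j++)
                b[i] += a[i][j];
        }
    }

    /* After: accumulate into a local scalar, which can live in a register;
     * b[i] is written once per row. */
    void row_sums_after(double a[N][N], double *b) {
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += a[i][j];
            b[i] = sum;
        }
    }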

24:43 We talked about this when we talked about OpenMP and the sharing of variables. In that case, the concern was with respect to potential, um, cache lines, um, moving between the caches, and this is similar, but now sort of focusing on the memory traffic. So in this case, with a local variable, the compiler doesn't have to worry about any aliasing issues, or whether some other processor, you know, stores something into b already that it's not aware of. It just works on its local variable, and it keeps it in a register. So the number of memory references is significantly reduced. The kind of programming concept, if you like, is the same as we talked about in terms of OpenMP and avoiding the cache-line ping-ponging between caches.

26:06 Now, in the examples I'm going to use, um, cycles per element is the measure; it is a common one when you're trying to get a good idea of how efficient things are, and it's being used in the examples I'm going to talk about in the next many slides. So maybe I should, uh, pause for a minute and see if there are any questions so far on the fairly simple transformations, basically coding styles, that are encouraged to make sure you get good, efficient code.

26:49 Student: Um, on the example of updating an array position — I think it was 'array position i plus-equals some value' — so you said that to optimize that away, we should be introducing a temporary variable to sum up, and then access the array only once at the end?

27:04 Instructor: Okay. So yes: if it's a single scalar variable, even though it's the same memory location, the compiler is usually very successful in keeping that variable in a register instead of allocating memory traffic for it.

27:31 Okay, so this is just a simple model of the running time: the total time is just some overhead, plus the cycles, or instruction time, per element, times the number of elements — total time roughly equals overhead plus cycles-per-element times n. So clearly one wants to get as few cycles per element as possible. This is what I want to try to talk about.

28:01 And I guess you can see the two versions on this slide that I just went through: version two, with the local variable, and version one, where you had the explicit array reference in the innermost loop. You can see how the two cases worked out as a function of the number of elements, and there is a clear, if not giant, saving from avoiding the array expense in this case. And the slopes of the lines are then the cycles per element.

28:50 So now there's this example that shows, a little bit again, how to be careful in how you write things. I would say, as I told you, there's nothing wrong with any one of the code versions in terms of correctness or its, uh, you know, logic, with respect to what you want to get done, if you're focused on just writing it a particular way. But one of the issues in this case is that you have, in the loop statement, the for statement, basically a function evaluation querying the vector length in every iteration of the loop, and you also then have another function call to get the vector element to use in the body of the loop, where you then do something simple with the element retrieved from the array.

30:10 So in this case there are, you know, timing results, doing it for some sizes of the arrays, to figure out what the rates are. Also, just take the numbers for what they are.

30:31 So this shows what happens when you don't do any optimization at all on this particular tiny piece of code, but it also shows that the level one type optimization in this case had a good effect, and we'll talk more about what that was in the next few slides. We'll also get to see how even the level one optimization can be improved upon significantly; in fact, by the time we're done with this example, the code will have improved by more than an order of magnitude.

31:14 So there is a sort of obvious thing one can do, because the vector length doesn't change in the loop. This is just kind of being clean: there is no point in evaluating, or calling, the vector-length function, as shown on these lines, in every iteration of the loop. Since it's an invariant, just do it once.

31:38 And so that is the first level of optimization in this example. Um — and I'm sorry about the outside noise; that does not usually happen at this time. Somebody decided to burn something, apparently. Hopefully it's not too annoying. Um, anyway, the slide just says that it is again sometimes hard for the compiler to tell that it's safe to move these things out, so it may or may not succeed in doing this at level two optimization. But it's not too cumbersome for you to actually do it yourself: think of what's invariant and put such things outside the loop. So that was kind of a fairly simple, um, optimization.

32:37 The other one is a little bit perhaps less intuitive, but maybe not. It also uses simpler operations than calling this function to get the element at the actual location you want in every iteration of the loop. You can do better than that: get the starting array location, the memory location of the array, and then just increment it. In this case things were fairly simple: you were just going to walk through the array from start to finish, or to the point you wanted, so it is a very simple stride-one, uh, walk through the array. So there's no point in calling this get-vector-element, or get-start, in every iteration. So, um, I think that covers that transformation.

33:50 So there's one more, and then there will be some more, um, timing results, I think, on the next slide. We also then use, uh, a temporary for the accumulation: so now, again, not using an array reference for the sum; instead the accumulation is done in a local variable.
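Here is a sketch of the whole progression on one loop, modeled on the CMU-style 'combine' examples; the vector type and the function names are my stand-ins:

    /* A tiny vector abstraction, standing in for the one on the slides. */
    typedef struct { long len; double *data; } vec_t;

    long vec_length(const vec_t *v) { return v->len; }

    int get_vec_element(const vec_t *v, long i, double *out) {
        if (i < 0 || i >= v->len) return 0;
        *out = v->data[i];
        return 1;
    }

    /* Starting point: vec_length() is called in the loop test, and every
     * element goes through a bounds-checked accessor. */
    void combine_v1(const vec_t *v, double *dest) {
        *dest = 0.0;
        for (long i = 0; i < vec_length(v); i++) {
            double x;
            get_vec_element(v, i, &x);
            *dest += x;
        }
    }

    /* All three transformations applied: length hoisted, direct stride-one
     * access through a pointer, and accumulation in a local variable. */
    void combine_v4(const vec_t *v, double *dest) {
        long n = vec_length(v);       /* invariant, computed once */
        double *p = v->data;          /* starting memory location */
        double acc = 0.0;             /* local accumulator        */
        for (long i = 0; i < n; i++)
            acc += p[i];              /* simple stride-one walk   */
        *dest = acc;
    }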

34:36 Um, so, summing it all up: first there was the default time, and the gain from optimization level one; then the three code restructurings that I talked about. The first was to move the call to the vector-length function outside the loop, since it is an invariant; the second was to just get the pointer to the array and then walk through it by incrementing by one; and the third was to use a local variable for the accumulation in the loop. This code restructuring now improved the performance quite a bit with respect to even the level one optimization, which was itself already a good factor better than the no-optimization case.

35:30 So these few simple changes to this loop, conceptually simple transformations, gave, what, a factor of three or better, uh, for the double precision multiply, and a factor of eight, I think, more or less, for the integer add. We'll talk a little bit more about these differences in the improvement in the coming slides, but they are all, you know, due to, um, removing calls and, um, making the address arithmetic very simple: just incrementing by one, instead of more complex arithmetic. So, any questions so far?

36:36 So these are, to me, not necessarily at odds with keeping code clean, but one needs to understand it. It is a little bit more difficult, but it can have huge performance gains. So now I'm going to get to the topic of loop unrolling, and it is going to take quite a few slides before I'm done with this example.

37:08 Here's the most trivial example first, and then we'll do a more elaborate example second. To the point: the idea is basically to reduce the number of loop iterations, and instead do a few iterations' worth of work in the loop body. So that means two things may happen. One is reducing the looping overhead, because you don't have to do the test of whether you are at the end of the loop as many times. And it also potentially helps the compiler, when it has all the statements in the loop body, to keep things in registers. So there are, yeah, two gains, and there are others that I will show you in the next several slides. But the first order of business is the looping overhead, and helping the compiler keep things in registers.
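A minimal sketch of unrolling by a factor of five, matching the factor mentioned below; the cleanup loop handles lengths that are not a multiple of five:

    /* Before: one element per iteration, one end-of-loop test per element. */
    double sum_simple(const double *x, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* After: unrolled by five, so the loop test runs a fifth as often and
     * the compiler sees five statements it can schedule together. */
    double sum_unrolled(const double *x, long n) {
        double s = 0.0;
        long i;
        for (i = 0; i + 4 < n; i += 5) {
            s += x[i];
            s += x[i + 1];
            s += x[i + 2];
            s += x[i + 3];
            s += x[i + 4];
        }
        for (; i < n; i++)   /* cleanup for leftover elements */
            s += x[i];
        return s;
    }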

38:33 Student: Dr. Johnson — so, is this loop unrolling aimed more at parallelizing at the core level, or is it not necessarily parallelism? And so it will help you even on a single thread?

38:49 Instructor: Okay, I think I like 'parallelizing' in the sense that we're getting the maximum effectiveness out of each cycle.

38:54 Student: Because, in terms of, um, data dependencies, is that the kind of thing we're looking at with loop unrolling?

39:05 Instructor: So, um, the data dependencies are the same. What happens in this case — let me take the example here. Um, this is, I guess, a messed-up slide, I'm sorry: where it should have used the sum variable it has another variable, y or something, and y should be outside, not a loop variable. I didn't spot that; I apologize. What it means is, um, the compiler may generate loads and stores, in the simple loop case, in each iteration: it loads the value, then it does whatever this function does to it, or an assignment statement, and then there's memory traffic generated, potentially, for each iteration in the loop. That is then reduced, in this case by a factor of five, with the unrolling. So it saves that. But the dependency is the same; it's just that instead of a one-cycle access to a register, it may be hundreds of cycles to memory for the data.

40:42 Student: Okay, so I'm not sure I would call that, um, parallelism...

40:50 Instructor: That's why I say it even helps on a single thread: you avoid round trips to memory, most likely, when you do loop unrolling. And for every iteration in the loop, when you're at the end of the loop and get to the test of whether you have to go around again or not, that takes — not much, but it takes a few cycles to do that test. So it eliminates some, um, test arithmetic: the conditional operation, the comparison, uh, in the loop structure; and it potentially eliminates round trips to memory.

41:50 So here is just an example. It is a matrix multiply, which you will see more times than you care for in this course, because it's a simple loop structure. But it shows the effect; this is just one example of doing some unrolling that was used on a particular machine, and it shows what the compute rate was, depending on how the unrolling was applied to the loops. It was, you know, a very old machine, and the numbers are not very interesting in themselves. But the difference is: you got a factor of three improvement in the performance for this particular loop by doing the loop unrolling, and again, it is the round trips to memory avoided that did most of the gain here, plus some reduction in looping overhead. But most of it is due to reduced memory traffic.

42:58 So now I'm going to come to talk about a more detailed example. It was done on an Intel Haswell CPU, which is the machine we will be using for the next assignment at the Pittsburgh Supercomputing Center; the Bridges machine uses the Haswell. So here's a little bit about the number of functional units it has, for both integer and floating point operations, and, um, how some of the functional units operate. When I talked about processors, I don't think I said too much about functional-unit-type designs in terms of latency and ports and cycles per issue. I talked about the latency to the various levels of caches and main memory,

44:01 and I talked about processors, but I don't think I mentioned these pipeline depths, as it is sometimes called, or latencies, of the functional units, whether, uh, multipliers or adders. Execution is done in the form of pipelined units in modern processors. So what that means: it's like, um, I guess, a typical cafeteria counter, right? You go and pick up a tray, and then you go by the various pick-up stations and eventually get to the paying end. And as you pass through the different stations, other people can, you know, be at your previous station — as opposed to a version where one person would have to pick up all the different pieces and pay before the next person can pick up a tray. So that is what hardware designers do.

45:24 Instead of taking, you know, a floating point add or multiply as one indivisible operation — you have some notion of floating point numbers, right: they have a mantissa, and they have an exponent, so they create sort of a dynamic range by modifying the exponent. So if you want to add two floating point numbers, one step is to line them up so that they basically have the same exponent; otherwise, you can't deal with what's known as the mantissa, or the fraction. So there are several different operations to be performed in order to do an add or multiply on such numbers, and that means a fair amount of logic.

46:12 So what hardware designers then do is make sort of stages: you may do, in one stage, one part of the operation, and that's a certain amount of logic, and then there's kind of a register, a buffer, where you keep the result of that. And then it moves down to the next thing you want to do, and another pair of numbers can be aligned, right? So what it means is that the clock rate is dependent on the longest, kind of slowest, path for each one of the stages, as opposed to the whole operation. So it's again the cafeteria analogy: how fast people move into the cafeteria line depends upon the slowest station along the cafeteria counter.

47:11 So what this idea of pipelining does is that it allows designers to increase the clock rate, compared to doing a complete operation as a single entity. So it increases clock rates, and by that — if you follow the cafeteria analogy — it increases the throughput. So that's what you see on this slide: you know, load, store, multiply and add, and even the floating point multiply-add, um, can accept new operations into the pipeline every cycle; that's what the cycles per issue being one means. But it takes more than one cycle to do the operation: the store in this case takes four, and in the case of the floating point multiply it takes five cycles to get the result of the multiply, but you can start a new one again every cycle. So this is the way it works.

48:13 Now, the one to pay attention to, which is unfortunate in most cases, is division: yeah, the division is a fairly complex operation, and many times it is not a pipelined function. That's why, I think, in some lecture in the past, I mentioned that division is very expensive, by no means comparable to multiplication, in modern processors. And the way division is done is often through some iterative algorithm; the one that is typically used is a variation on Newton's iteration, being done to carry out the division. All right, there is a lot about this. Any questions on this before I continue with the example?

49:07 Okay, so here is just, you know, a cartoonish example of what it means for the pipeline, with respect to a particular, still tiny, code in the yellow box: first you compute the product of a and b, and another product of a and c, and then you compute the product of the two products that were created. So in this case, when you have this pipeline, it takes, in this case, three cycles to do a multiplication. As a times b moves through the pipeline, you know, uh, in cycle four, sort of, a times b is ready to be used; but already one cycle in, one can use the multiplier to start the second product, a times c, which just follows a times b by one stage. And so then the a-times-c product becomes available, and now one can use it to compute the product of the two partial results, and so that is started in cycle five. So in this case, that's what it takes to do these three statements. So this is where the notion of loop unrolling, as shown on the next slides, comes in, and how it reduces the start-up overhead.

50:46 Okay, now let's see what happens here. This was, I think, what was on the previous slide: 'combine4' was the result of the three optimizations that were explicitly made by the programmer to this simple loop, and the outcome is the fourth line in the table in the upper left-hand corner. And then, at the bottom of the table, it shows what the latency-bound times would be. Those come from the table on the lower right-hand side, which is, again, with respect to the pipeline depth, the latency for each one of the functional units. So this means that the result from this combine4 optimization is pretty close to what one can expect if one is bound by latency. But ideally it would be better, in this case, to be bound by the throughput, the cycles per issue. Ideally, you would like to get close to one result per cycle at least, or even better with your multiple functional units.

52:11 So here's the reason why things are not throughput bound but latency bound. This is illustrated with this kind of, uh, multiplication — if it were, you know, an add, the same thing would be true — namely, that the result of one of these multiplications is used in the next iteration of the loop. So if you look at this loop, what we have in the yellow boxes is t: the new t is depending on the old t combined with the result computed from the x element, using multiply as the operation. So that means one has to wait until that result is ready before you can apply it to t. So that's why you get this dependency graph that is, ah, limited by the latency of the functional units.
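In code form, the loop-carried dependence looks something like this (a sketch; the exact loop on the slide may differ). With a multiply latency of, say, five cycles, each iteration must wait for the previous one, so the loop runs at roughly five cycles per element no matter how many multipliers exist:

    /* Latency-bound: each update needs the previous acc, so the pipelined
     * multiplier accepts a new operation from this loop only once the
     * previous result comes out. */
    double combine_mult(const double *x, long n) {
        double acc = 1.0;
        for (long i = 0; i < n; i++)
            acc = acc * x[i];   /* serial dependence chain through acc */
        return acc;
    }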

53:25 So one is obviously trying to work around that somehow, and one way would be to try to do something like this, um: that is, do loop unrolling, as we can see here, um, where you do two iterations of the loop per iteration statement. So now let's see what happens. Would anyone actually volunteer: did this help or not?

54:11 Student: It should help in terms of overhead, but in terms of the latency, it can't — it didn't actually take away the latency.

54:20 Instructor: Let's see what happened. Um, so here is now what it did, the unrolling that was just done on the previous slide — and nothing really changed much, except a little bit for one of the cases, and I won't go into that too much.

54:48 The point is, we wanted to get beyond the latency bound, and the reason things didn't improve is because we still have a statement that has the same latency dependence as we had before. So in this case, the loop unrolling didn't help. But that doesn't mean that loop unrolling is not a useful thing to do to figure out how to use pipelines better. So we're doing a different version. Um, now we do the unrolling differently: in this unrolling, we change the association, so the operation on the two elements is done first, and then we apply it to the accumulation variable. So now, if we do that, um, we're going to see — this is the 'two times one a' version — that things actually improve quite a bit.
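A sketch of that reassociated unrolling (the '2 x 1a' naming follows the CMU-style slides): the two elements are combined first, so that part of the work is independent of the accumulator and can overlap with the previous iteration's update. Note that floating point arithmetic is not strictly associative, which is exactly the kind of thing a conservative compiler will not change on its own:

    /* Unrolled by two with reassociation: (x[i] * x[i+1]) does not depend
     * on acc, so it can execute while the previous update is in flight. */
    double combine_2x1a(const double *x, long n) {
        double acc = 1.0;
        long i;
        for (i = 0; i + 1 < n; i += 2)
            acc = acc * (x[i] * x[i + 1]);   /* parentheses change the association */
        for (; i < n; i++)                   /* cleanup */
            acc = acc * x[i];
        return acc;
    }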

56:02 One then kind of wonders: how come, what is so different between associating the operations in the two different ways? And that actually depends on the underlying architecture, um, which has two load/store units and, in this case, also two functional units for floating point operations. So this is kind of what actually now happens in the code: a pair of variables is loaded in a single cycle, because of the two load/store units, and they are pipelined. They have a cycles-per-issue of one, so basically every cycle you can load new things. And then the fact that there is a latency before the result is available — that latency applies to each of the multiplications, in this case, being done; so they basically follow each other, offset by just one cycle. There is a latency of three for the first one, but then results come every cycle, so the accumulation is done every cycle. So this structure then allows using both load/store units concurrently, and the accumulation does not become latency bound. So now, on this, we got the factor of two improvement, as we showed on the previous slide. Any questions on that?

57:55 I've got a couple more tricks. So this is again where it's important to understand the architecture: most of the architectures today have at least two, and many times more, replications of the load/store units, um, to keep things somewhat balanced with the replication of functional units in these VLIW-type architectures. So here's another trick: try to use two separate accumulators. So now, still loop unrolling, but instead of using a single accumulation variable, we now add every other element into one variable and the remaining elements into the other. So we can certainly still use the two load/store units, one for the one value and one for the other in the same cycle.
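A sketch of the two-accumulator version (the '2 x 2' form in the CMU-style naming):

    /* Two independent dependence chains: acc0 and acc1 can both be in
     * flight at the same time, using the replicated functional units. */
    double combine_2x2(const double *x, long n) {
        double acc0 = 1.0, acc1 = 1.0;
        long i;
        for (i = 0; i + 1 < n; i += 2) {
            acc0 = acc0 * x[i];
            acc1 = acc1 * x[i + 1];
        }
        for (; i < n; i++)        /* cleanup */
            acc0 = acc0 * x[i];
        return acc0 * acc1;       /* combine the two chains at the end */
    }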

59:08 So do we, um, think we'll get the same result as with the other version, the single variable with the better association, using the two load/store units? Well, we did: because we are still using the two load/store units, we didn't get beyond that. So, at least, this slide also shows that while we have moved beyond the latency of things, we still have some distance to go to something that is limited by the throughput.

60:17 So, um, that is what it says, right? We still have the, um, latency for each one of these two chains, right? So we had enough functional units to do things in parallel, but we still have the latency dependency in each one of these variables. So here's a small diagram, right, that we use for illustrating these kinds of pipelines, because we have multiple functional units that can do the same thing in the same cycle; but then, again, we have a latency in each of the units. So now I'm going to try to combine these two, to get to the things that are bound by throughput. And that is to combine the loop unrolling and using multiple, um, accumulators, and do the association in a proper way.

61:22 So now I'm not showing a code example, but simply what happens with the use of unrolling and multiple accumulators, in this case for the multiplication. With a single accumulator, in this case, then regardless of the loop unrolling depth, you're still latency bound. And you can see, in the best case, one is getting to the throughput-bound limitation, and that is 0.5 — because, again, you have multiple functional units. So even though the throughput, or rather the cycles per issue, was one, and you can issue, you know, one pair into the pipeline each cycle, you have two of them. So that's why the throughput bound is 0.5, and you get very close to it.

62:41 It also shows, a little bit, that if you start overdoing it, unrolling to a very high degree with multiple accumulators, you may actually become less efficient. So there's usually a trade-off in how far you want to unroll, and it's all depending on, you know, the resources — the number of registers you have accessible — because for the accumulators you have several variables to keep track of, and at some point you get diminishing returns. And then this is, I guess, for the integer part, to show a similar trend: the numbers come out a little differently, given the different setup in terms of latencies and the rest, but the principle is the same.

63:34 Yes. So this was working through how to use loop unrolling, where we also got a benefit in the case where the units are pipelined — in this case it was both the load/store units and the arithmetic units being pipelined. Any questions on that?

64:05 So, loop unrolling is something compilers definitely try to do, and they're usually really good at it. But, um, it is also something one can do by hand. The depth of the unrolling, what is the best, is something that is platform dependent, because, as I mentioned, it depends on the size, or the number, of registers you have — or, as it is called, you know, the size of the register file. Um, so it depends, you know, on how much can be kept close to the functional units without having to go to the caches or, uh, main memory. If you're too ambitious and go too far, things may not fit in registers anymore, and then things, as it is said, get spilled to memory.

65:05 So now a few more things that compilers do. They try to kill code that is defined but not used; that reduces the code size, obviously, if it's never executed. Like in this case: this variable, I say, has no use, so just get rid of this piece of code — don't execute it, don't use it. This other one, common subexpression elimination, is to just evaluate an expression once and then reuse it. In particular, if there are complex functions — like log, as I also mentioned: it's an expensive operation, it is not a one-cycle operation — log and the intrinsics, sine, cosine and the other trig functions, and exponentiations: all those functions that are built in to most programming languages, they are quite expensive compared to an add or a multiplication. So evaluating them once, instead of repeatedly for the same argument, is good practice.
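A sketch of common subexpression elimination with an expensive call (the function is invented for illustration):

    #include <math.h>

    /* Before: log(x) is evaluated three times for the same argument. */
    double f_before(double x) {
        return log(x) + 1.0 / log(x) + sqrt(log(x));
    }

    /* After: the common subexpression is computed once and reused. */
    double f_after(double x) {
        double lx = log(x);        /* one expensive call instead of three */
        return lx + 1.0 / lx + sqrt(lx);
    }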

66:16 And I think this — now, this is kind of another strength reduction, that maybe should have been shown earlier, but it is used together with common subexpression elimination, in this case for the array index arguments. The left-hand side is maybe nicer to read, to understand the code, and the right-hand side economizes on the number of arithmetic operations that it does, by using a couple of temp variables. And this is another thing compilers try to do: if the compiler happens to know, at compile time, a certain variable's value, then it may actually just totally eliminate operations — adding zero doesn't do anything; multiplying by zero just gives zero; multiplying by one doesn't do anything — so it may eliminate those operations from your code.

67:13 And this is, I think, what I already said: what compilers try to do, and what I show in this example, is source-to-source transformations. Whether one wants to do this as a programmer — or do it in the first place, so you don't have to later rewrite — being aware of what compilers try to do lets you do the proper source transformation in order to make code that, uh, can execute efficiently.
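A sketch of those last two ideas, strength reduction on an array index and constant folding (the variables are made up):

    /* Strength reduction: replace the multiply in the index computation
     * with a running value that is updated by addition. */
    void fill_column_before(double *a, long n, double s) {
        for (long i = 0; i < n; i++)
            a[n * i] = s;          /* a multiply on every iteration */
    }

    void fill_column_after(double *a, long n, double s) {
        long idx = 0;
        for (long i = 0; i < n; i++) {
            a[idx] = s;
            idx += n;              /* an add instead of a multiply */
        }
    }

    /* Constant folding: with the values known at compile time, the add of
     * zero and the multiply by one are simply removed. */
    double fold(double x) {
        double zero = 0.0, one = 1.0;
        return (x + zero) * one;   /* compiles down to just x */
    }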

67:46 Now, I had a couple of more things, um, in terms of instruction-level parallelism. So this was touched on a little bit in terms of potentially using multiple functional units at the same time, but this is more of a — still a very simplistic — picture of a modern processor. As we know, Stampede, you know, has up to 24 cores, if I remember right, in what makes up the Skylake version — no, 28, I guess; there is more than one Skylake version and I don't remember off the top of my head which one Stampede has, but that's beside the point. And then each one of the cores also has replicated functional units of the same kind. So in this case, this little graph shows that the arithmetic units have been replicated, and that there's a separate load and a separate store unit — the previous example actually had load/store units that can do both loads and stores, but anyway. So these so-called wide instructions, that is, word-wide instructions that may contain a single operation for each of several units, allow you to potentially schedule operations in the same cycle for several of the units. So this is, I guess, what I already said. And sometimes it's called VLIW;

69:39 sometimes it's called superscalar; it has been used for quite some time. And here is a repeat, for the Haswell, in terms of arithmetic units: there are, uh, two floating point multipliers, one add, one divide, and four integer units, and then there are the load/store units.

70:02 Um, so now, for the Skylake, it has what's known as the advanced vector instructions, version two — the AVX2, for short, when it comes to Intel's vocabulary — and also the 512-bit wide instructions. And it is just a little bit different when it comes to the data paths, or what data may be contained in the 512 bits. Um, so, um, they can have, you know — for double precision — this doesn't come out right in my mind; part of the graph doesn't quite fit: it should be an eight when the width is 512. Uh, it's not right — so, you see, every time you look at something you find something; I made that one myself, and that's what happens when you work at the board, kind of thing. Um, but the AVX-512 is 512 bits, and that allows you to have eight double precision numbers, or 16 single precision numbers; but the principle I'll talk about is, um, the same.

71:59 Now, SIMD is not exactly the same as, well, the wide instructions we talked about before. SIMD is the same operation — so, the same opcode, or instruction — for the whole array. That's why, for instance, in the stream benchmark, if you take the add function: we add corresponding elements of two arrays, so for each pair of elements we use the add instruction. So that is amenable to SIMD operations. The wide instruction, on the other hand, allows you to combine different operations in the same wide instruction word. So the SIMD operation is, you know, something that is usable when there are multiple adders or multipliers, for instance; then you can have this SIMD instruction. And this just shows that, depending upon the width of the data, you get fewer or more of them in one wide register.
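As a sketch of what such an instruction does, here is the stream-style add written with AVX2 intrinsics (assuming a CPU with AVX2, compiled with -mavx2, and n a multiple of four to keep it short):

    #include <immintrin.h>

    /* c[i] = a[i] + b[i], four doubles per SIMD add instruction. */
    void stream_add(const double *a, const double *b, double *c, long n) {
        for (long i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[i]);   /* load 4 doubles  */
            __m256d vb = _mm256_loadu_pd(&b[i]);
            __m256d vc = _mm256_add_pd(va, vb);    /* one SIMD add    */
            _mm256_storeu_pd(&c[i], vc);           /* store 4 doubles */
        }
    }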

73:02 So now, I think this, I believe, is the final point on this example that I borrowed from my colleagues at CMU: if you now also use the fact that you have this SIMD capability in each, um, of the units, then in this case you get a factor of, well, four in terms of throughput — um, more or less; it depends on the particular operation. It is basically a factor of four for the floating point, and it's a little bit different for the integer add, because there are more of those units: there were four integer units in this particular processor architecture. So in this case, what one can see is: using the tricks of unrolling, et cetera, and then using the SIMD feature on top of that to exploit all the parallel functional units, the outcome is very close to the lower bound, at least for the floating point. The integer math was, I guess, quite successful in this case too. So, I think — um, any questions on that?

74:56 So I think the punch line, all along this example, is that, I would say, in general it normally pays to be aware of this, and to not count on the compiler being successful in restructuring a fairly nice and clean code — as opposed to writing something where you are prepared for what the compiler will do, by rewriting your code accordingly. Because, again, you know what is safe to do, and the compiler has to infer it from potentially quite different code. Okay. And I had a couple more slides in terms of branching.

75:46 And branching is a little bit tricky, so in case one is not familiar with it, I will just quickly show what modern processors do. Most of them have what's known as branch prediction. So when there are conditionals in the code, they try to guess which way the execution is going to go, and based on the guess of where the execution is going to go, they load instructions and data assuming that's going to happen.

76:32 And if the guess turns out wrong, they, in a way, need to undo that, and basically, on the error, either backtrack or redo the execution, um, past the branching point, um, and figure out which way to go. So if it has executed ahead, then there's a need to roll back, because then some of what has been computed is wrong. Um, at the very least, the preparation for the branch that was readied is wasted, and it basically then has to start to load the things according to the, uh, branch that the execution actually wants to take, as opposed to the way it guessed. So this is, kind of, a little bit what these several slides show: what the branching does and kind of how it works.

77:54 And nowadays, processors have quite sophisticated record keeping: they sort of keep statistics for branches and see, for the particular code and the data set it is being run with at run time — you know, how many times the branching was going this way and how many times it went that way. And they continuously update this history of branches taken, in order to try, in the end, uh, to minimize execution time.
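A small sketch of the kind of branch the predictor has to deal with: with random data, the condition below is guessed wrong about half the time, while with sorted data the accumulated history makes the same branch almost perfectly predictable:

    /* Sum only the large elements: the if is a data-dependent branch.
     * On random data the predictor misses often; on sorted data the
     * consistent history makes the branch nearly free. */
    long sum_large(const int *a, long n, int threshold) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (a[i] >= threshold)   /* predicted from its recent history */
                sum += a[i];
        }
        return sum;
    }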

78:36 So, since my time is up, I will just show you what slides are in there, and you can take a look at them. But it's just explicitly showing what I just said: how it behaves in terms of branch prediction, and what is lost when a prediction is wrong. So with that, I think we'll stop for today — because, again, to summarize: we went over the machine architecture; don't challenge the compiler more than necessary, structure your code, and then use good tools to understand what actually happens. And there's a note on this slide, or maybe on the next slide, that one should look at the assembly and see what happens. Yes. Okay, we'll stop there and see if there are questions; I'll stop sharing the screen for the moment. What — any questions?

80:17 Student: I had a question on the last part, that is, the branching. So, the effect of loops and branching on performance — can you show that last slide again? ... Yeah, the one you showed just now. Okay.

81:04 Student: So, this branching on this slide — the prediction: I did not get the last part. How is the prediction done? Like, how are the performance and the prediction linked with respect to the loops?

81:27 Instructor: Okay. So branching does not necessarily need to be inside loops; they're independent concepts. Um, so you may have conditionals without loops — it's just, uh, straight-line code, no loops — and depending, certainly, on data values, you want to do one thing or another. So the branching has to do with what code is going to be executed, depending upon whether a condition is true or false, for instance. That's what the branching is about. And if one has no prior knowledge, then, you know, the best guess is basically equal probability for each branch.

82:28 If, as in this case, the conditional is inside a loop at some level, then there can be a history, saying that in the first, say, 30 times this conditional was, uh, evaluated — if we call the two outcomes just the left and the right branch, you know — in 25 of these 30 cases the left branch was taken, and only in five the right branch was taken. So do I then set myself up for assuming, the next time that this condition is evaluated, that it will most likely go the same way as the 25 other ones? Then I can preload all the instructions and the data for that branch of the condition. If, on the other hand, when it actually evaluates the condition, it turns out it was going in the other direction, where only five instances have been seen, then, obviously, I kind of have to restart things from where the condition is, in the state that is proper at that point in the code — not in some, um, later state that assumed it was going the other way. So you have to, in that case, in effect backtrack the computations — or do the things you can do without using the variables coming in from the condition.

84:28 Student: Okay, thank you.

84:38 Instructor: So that's what the branching does. And the next slides show, in some detail, how it works in a particular example. Yeah. Okay.

84:53 Anything else? So I guess I'll stop the recording, then. Thank you.
