00:00 Okay, so today, as it says on this slide, I will try to tell you a little bit about what compilers are trying to do, and what you, as a user of our systems and tools, can do to help the compiler do what one hopes it would.

00:23 So first I'll talk just a little bit, very generally, about compilers and code efficiency. Then I'll give some examples that you may recognize of where issues come up when one looks at what the compiler can and cannot do; I'll mention that very briefly. And then we'll move on to talk about what compilers are trying to do to improve the code while respecting how it's supposed to behave on the platforms. First, a couple of quotes.

01:01 That's why I chose to start this course by focusing on understanding and learning about single-node performance: going first from just single-threaded performance, and then trying to figure out the multi-threaded part on a single node using OpenMP.

01:21 So, for anyone that may not be familiar with the name: he is a quite well-known scientist who was instrumental, as one of the leaders, in developing pretty much every generation of the MPI standard, for instance, for programming clusters. He is now at the University of Illinois, but he worked at Argonne National Labs for many years and is a recipient of many awards and honors.

02:03 And, um, he tries to emphasize, as I have done many times in the past, that to get efficient code one needs to understand the platform, take good algorithms, and then try to make sure that one also gets a good implementation. So it's one of these efforts which is kind of vertical: one needs to understand the layers that one cuts across, kind of contrary to the common practice of using layered abstractions, where you don't necessarily need to worry about the layers below or the one above and can just work in your own layer. That tends not to work too well when it comes to getting good efficiency out of the system.

03:05 Just as a little bit of an aside, since it is often perhaps not said: it's not the focus of today's lecture, but we'll talk about it later in the course. A lot of the progress in terms of getting useful results out of computer systems also comes from progress in algorithms, and I thought it was nice just to include that and show what has happened. This comes from the scientific computing domain.

03:31 And, uh, in terms of, yes, code issues: as this says, you know, it's not uncommon that you can get a factor of 10, and in fact it's not unusual to get a lot more than that. That's part of the reason why, in this course, I start by giving you the tools to get an understanding of the actual performance in terms of resource utilization, and then get you an understanding of what the resources actually are. Um, so I think that's enough comments on this slide.

04:18 Here is a quote from another very well-known person in the computer science field, Leiserson, who is a professor at MIT and has also won many awards. He is the creator, with his group I should say, of this Cilk, um, programming language, if you like, and some kind of runtime system. This effort was later acquired by Intel, and it's now part of the Intel set of tools. But anyway, it shows the difference, in this case, of going from straightforward code to highly optimized C code using the vector instructions. So it's orders of magnitude.

05:06 Um, the point is that the code you begin with might be very good in terms of programmer productivity, but in terms of resources, non-human resources, that means the hardware, it may not be all that good. So it depends upon what your intent is for the code: if it's going to be a production code, it likely needs to be very efficient, whereas if it's something you're just doing as a sort of tool to understand something, that matters less.

05:52 So then, I'm including here some examples of what the compiler flags are, in terms of what type of optimizations, or lack thereof, are enabled at the various levels of optimization.

06:15 I guess the zero level is basically: do nothing and take the code as it is, whereas level one tries to reduce execution time by allowing a number of optimizations that are listed here. I'm not going to go through them all, but one of the main restrictions on the optimizations being applied at level one is that there's a lot of attention paid to the extent to which optimizations may increase the code size, and we'll talk about some of these things in some detail later on in this course. So sometimes there is a trade-off between speed and code size.

07:07 And then, with this level two optimization, one is still, you know, trying to speed things up, but now also starting to include vectorization-type features. And I will talk about vectorization a little bit later in this class today and also next time.

07:31 Um, so in this case, since it cannot get it completely right, the compiler just tries to make a guesstimate as to what the potential increase in code size is, and whether that is eventually leading to reduced running time; that may or may not be the case. And then there's the third level, where typically there are further optimizations that are listed here by name. We'll talk about some of these things, just to give you a sense of what a compiler, most of the time, is reasonably successful at doing, at least in a state-of-the-art compiler.

08:19 But whether there is success or not depends on the context of the code, and on whether the code analysis can establish that the result of these transformations of the code is safe. If it is not able to conclude that it's safe, the compiler cannot do these optimizations. On the other hand, you as the programmer may know that it is safe, and then, by structuring your code, you can make the compiler successful. So here is a list of optimizations by name, and we will talk about some of these, though not all of them; again, I have focused on things that are, uh, easy to do in terms of how you write your code.
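To make the flag discussion concrete, here is a minimal sketch of how one might compare the levels on a small kernel; the file name and the loop are my own stand-ins, but the gcc flags themselves are standard:

    /* triad.c -- a tiny stream-style kernel for comparing optimization levels.
     * Build it several ways and compare run time and binary size:
     *   gcc -O0 triad.c -o triad_O0   (no optimization)
     *   gcc -O1 triad.c -o triad_O1   (basic optimizations, code size respected)
     *   gcc -O2 triad.c -o triad_O2   (adds vectorization-type features)
     *   gcc -O3 triad.c -o triad_O3   (most aggressive, may grow the code)
     *   gcc -O1 -S triad.c            (emit the assembly, to inspect what was done)
     */
    #include <stdio.h>
    #define N 10000000
    static double a[N], b[N], c[N];

    int main(void) {
        for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }
        for (long i = 0; i < N; i++)        /* the "triad": a = b + s*c */
            a[i] = b[i] + 3.0 * c[i];
        printf("%f\n", a[N / 2]);           /* use the result so it is not dead code */
        return 0;
    }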

09:20 And here is an example of that, in terms of the stream benchmark. I hope you can see the numbers; this is just compiling the same code with different optimization flags, using the GCC compiler, as it says. So look at the top piece of the illustration here on the slide: the first column is the names of the stream functions, and the second column is the data rates measured for those functions.

10:00 If you look at the run that used no optimization and just pick one of the functions, say the triad, it says about seven, in terms of the data rate. Then the level one optimization jumped to 13.4, so it almost doubled in terms of the data rate, and the code size even went down to about two thirds. So that level paid off very handsomely, both in speed and code size. The level two optimization, looking at the triad, went from 13.4 to a little above that, so it gave a slightly better data rate for that particular stream function, while the code size now grew a little bit, not much.

11:05 But then, if you went to the most aggressive optimization, the code size pretty much became double that of the level one optimization, and the performance in fact actually dropped. So this is again the trade-off: the optimization was intended to be good, but in large part it sometimes may not have been successful, and that may also have been affected by the code growth it produced. That can reduce performance, because code has to be loaded through the same data paths, except very close to the functional units; everything through the level-two caches is shared in terms of data paths, so code must compete with data for some of the shared levels. So, does anyone want to ask a question or comment on this kind of trade-off and how it worked out in this particular case?

12:10 If not — so, this is pretty much what I already said: the compiler writers have to make sure that the correctness of the code is preserved. The compiler cannot do anything that's somewhat risky and has to be seemingly conservative; it's not always possible in the analysis to figure out which source code transformations are possible without things potentially going wrong in some extreme cases.

12:48 So here is, I guess — I'm sure this is familiar to most computer science students, but maybe to some of you not so much. Compilers consist of what is known as the front end and the back end. The front end is the thing that takes the program, whether it's C or Fortran, and translates it into what's known as an intermediate language, or intermediate format, and that, many times, is common to several programming languages. Whether you use C or Fortran, the intermediate format is the same, and that means one can then use the same kind of analysis tools working on this intermediate format.

13:36 Then one has a code generator that is decoupled to some degree from the programming language, but it is dependent on the hardware platform. So it has the instruction sets for the particular platforms and knows about the features of that platform, and in that sense the optimizations that also happen at that level are much more machine dependent, as opposed to dependent on the language or the logical structure of the code.

14:17 I'll talk mostly about the machine-independent transformations today. Um, the other thing is that doing what's known as whole-program analysis is very expensive, as hopefully anyone can imagine: if you have hundreds of thousands of lines of code, the whole-program analysis is an incredibly complex data-dependence task. So many, uh, compilers, they all look at what's known as basic blocks, that I mentioned, uh, last time.

15:06 Yeah, this just repeats the picture that I showed last time in terms of what basic blocks are. So, transformations within those blocks are reasonably safe to do. But then, once one moves beyond the basic blocks to do the analysis, it becomes a lot more complex and time consuming to do the analysis.

15:36 So one of my colleagues, when I was at Yale University, was one of the first to try to exploit this idea of instruction-level parallelism, and for this the code has to be in the right format. So he designed an architecture and started a company to build the machine for it; it was known as Multiflow, a VLIW machine. But then, as people said, you know, you need a supercomputer to actually compile the code for your supercomputer: compiling the code could take hours when you tried to do the analysis. So, even today, the extent to which compilers do interprocedural analysis is kind of limited.

16:40 Um, so now I will talk about some of what these optimization names mean, and talk through an example of what happens on one particular processor, one particular Intel architecture, and show what they mean.

17:00 Inlining of functions is essentially this: instead of having the function call, which makes, in some ways, a nicer and cleaner code, you basically remove the call and substitute the explicit text of the body of the function into the line of code where the call was. Here is just an example: before inlining, you basically call this print function three times, whereas after inlining, you replace each one of the calls with the statements that were used to define the print function. So it's nothing magic, but what it does is it avoids the overhead of doing function calls, and that can be a big gain. So that's known as inlining, and compilers are reasonably successful at it, at least as long as the functions are not too complex and it's deemed by the compiler to be an optimization that's likely to win. But clearly, inlining will increase the code size. This is part of what compilers do. But if you're concerned about the overhead, or not certain that the compiler will be successful at it on your own code, and you know that this is okay, then your code may become a little less, you know, pretty, but it is potentially significantly more efficient.
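To make that concrete, here is a minimal sketch; the helper function is invented for illustration:

    #include <stdio.h>

    /* Before inlining: a small helper called three times. */
    static void print_point(int x, int y) {
        printf("(%d, %d)\n", x, y);
    }

    void before(void) {
        print_point(1, 2);
        print_point(3, 4);
        print_point(5, 6);
    }

    /* After inlining: each call is replaced by the body of the helper,
     * so the call/return overhead and stack-frame setup disappear. */
    void after(void) {
        printf("(%d, %d)\n", 1, 2);
        printf("(%d, %d)\n", 3, 4);
        printf("(%d, %d)\n", 5, 6);
    }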

18:51 Student: Yeah, is this primarily, um, eliminating the overhead of a stack frame? Because I can't imagine anything else where it would be very helpful.

19:07 Instructor: It is basically setting up the stack frame and maintaining it; that's what it saves. So it takes some cycles away.

19:18 This is another one that, again, to me, is good programming practice. I guess the official name is code movement, or code hoisting: essentially, don't do things inside loops that are invariant with respect to the loop. Do them once, instead of doing them every time in the loop. In this case, it is basically an operation on quantities that don't change — neither the number two nor, you know, pi keeps changing — so that's something you could just as well compute outside the loop and then just use inside the loop. So, in actual fact, yes, it costs you a temp variable, but, uh, this case is very trivial, and it reduces the number of operations inside the loop.
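A minimal sketch of this kind of hoisting; the 2*pi factor is my guess at the slide's example:

    #include <math.h>
    #define N 1000

    /* Before: the loop-invariant factor 2*pi/N is recomputed every iteration. */
    void fill_before(double *a) {
        for (int i = 0; i < N; i++)
            a[i] = sin(2.0 * M_PI / N * i);
    }

    /* After code motion: compute the invariant once, at the cost of a temp. */
    void fill_after(double *a) {
        double factor = 2.0 * M_PI / N;   /* hoisted out of the loop */
        for (int i = 0; i < N; i++)
            a[i] = sin(factor * i);
    }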

20:22 And here is a little bit more of this, a detailed example that I borrowed from colleagues at CMU. It shows, a little bit again, being careful about how you write things. I would say there's nothing wrong here: all of these things are just source-to-source transformations that are aimed at reducing the amount of work that has to be done, sometimes at the expense of some additional temporary variables or memory locations. So in this case, the example is just that: there is again something inside the loop, in this case an array reference, and instead of doing the n times i computation in every iteration in the assignment statement, you just compute it outside, thereby avoiding that computation in every assignment.
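A sketch of what I take that example to be, in the style of the CMU materials, hoisting the n*i index computation out of the inner loop:

    /* Before: n*i is recomputed for every element of the row. */
    void set_row_before(double *a, const double *b, long i, long n) {
        for (long j = 0; j < n; j++)
            a[n * i + j] = b[j];
    }

    /* After: the invariant n*i is computed once per row. */
    void set_row_after(double *a, const double *b, long i, long n) {
        long ni = n * i;              /* hoisted out of the loop */
        for (long j = 0; j < n; j++)
            a[ni + j] = b[j];
    }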

21:23 And this was, yes, an example of what, in this case, the compiler managed to do. I'm not going to try to explain all of the assembly code, even though I recommend anyone that wants to do good code optimization to look at the assembly that the compiler generates and try to understand it, because that shows what actually happens. So the two yellow boxes are before and after compilation with, in this case, the level one optimization, and it shows that in this case the compiler did the right thing: it assigned a new variable to the product of n and i, and then that's in fact what is used in the loop. It also did a little bit more, as you can see: it also used a pointer, and then used a simple increment of the pointer instead of having to do the full array reference.

22:44 Yeah, um, the next example has to do with, as it says at the top of the slide, trying to keep things local in registers. So, if one deals with arrays, sometimes the compiler has difficulty figuring out how to keep array variables local in registers, and it may in fact update things in memory in every iteration of the particular loop. In this case, the code sums all the elements of each row of the array A into a vector b: b[i] is one element, and for each row of A we just add up the elements in this little example. So what may seem simple becomes many array references that are hard for the, uh, compiler to sometimes optimize out by creating a local variable.

23:50 So let's see what happens in these situations. The code basically says that the A's are initialized, and then we call this function to compute the row sums and store them in b. The value of b[i] is stored in every iteration, and that means there's memory traffic, again to the arrays, for b in the inner loop as well. So the thing to do, to try to avoid that, is to use a local variable.
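A sketch of the row-sum example as I understand it (array sizes and names invented):

    #define N 1000

    /* Before: b[i] is updated in memory on every inner iteration, since the
     * compiler cannot easily prove that b does not overlap a. */
    void row_sums_before(double a[N][N], double *b) {
        for (int i = 0; i < N; i++) {
            b[i] = 0.0;
            for (int j = 0; j < N; j++)
                b[i] += a[i][j];
        }
    }

    /* After: accumulate into a local scalar, which can live in a register;
     * b[i] is written once per row. */
    void row_sums_after(double a[N][N], double *b) {
        for (int i = 0; i < N; i++) {
            double sum = 0.0;
            for (int j = 0; j < N; j++)
                sum += a[i][j];
            b[i] = sum;
        }
    }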

24:43 We talked about this when we talked about OpenMP and the sharing of variables. In that case, the concern was with respect to potential, um, cache lines, um, moving between the caches, and this is similar, but now sort of focusing on the memory traffic. So in this case, with a local variable, the compiler doesn't have to worry about any aliasing issues, or whether some other processor, you know, stores something into b already that it's not aware of. It just works on its local variable, and it keeps it in a register. So the number of memory references is significantly reduced. The kind of programming concept, if you like, is the same as we talked about in terms of OpenMP and avoiding the cache-line ping-ponging between caches.

26:06 Now, in the examples I'm going to use, um, cycles per element is the measure; it is a common one when you're trying to get a good idea of how efficient things are, and it's being used in the examples I'm going to talk about in the next many slides. So maybe I should, uh, pause for a minute and see if there are any questions so far on the fairly simple transformations, basically coding styles, that are encouraged to make sure you get good, efficient code.

26:49 Student: Um, on the example of updating an array position — I think it was 'array position i plus-equals some value' — so you said that to optimize that away, we should be introducing a temporary variable to sum up, and then access the array only once at the end?

27:04 Instructor: Okay. So yes: if it's a single scalar variable, even though it's the same memory location, the compiler is usually very successful in keeping that variable in a register instead of allocating memory traffic for it.

27:31 Okay, so this is just a simple model of the running time: the total time is just some overhead, plus the cycles, or instruction time, per element, times the number of elements — total time roughly equals overhead plus cycles-per-element times n. So clearly one wants to get as few cycles per element as possible. This is what I want to try to talk about.

28:01 And I guess you can see the two versions on this slide that I just went through: version two, with the local variable, and version one, where you had the explicit array reference in the innermost loop. You can see how the two cases worked out as a function of the number of elements, and there is a clear, if not giant, saving from avoiding the array expense in this case. And the slopes of the lines are then the cycles per element.

28:50 So now there's this example that shows, a little bit again, how to be careful in how you write things. I would say, as I told you, there's nothing wrong with any one of the code versions in terms of correctness or its, uh, you know, logic, with respect to what you want to get done, if you're focused on just writing it a particular way. But one of the issues in this case is that you have, in the loop statement, the for statement, basically a function evaluation querying the vector length in every iteration of the loop, and you also then have another function call to get the vector element to use in the body of the loop, where you then do something simple with the element retrieved from the array.

30:10 So in this case there are, you know, timing results, doing it for some sizes of the arrays, to figure out what the rates are. Also, just take the numbers for what they are.

30:31 So this shows what happens when you don't do any optimization at all on this particular tiny piece of code, but it also shows that the level one type optimization in this case had a good effect, and we'll talk more about what that was in the next few slides. We'll also get to see how even the level one optimization can be improved upon significantly; in fact, by the time we're done with this example, the code will have improved by more than an order of magnitude.

31:14 So there is a sort of obvious thing one can do, because the vector length doesn't change in the loop. This is just kind of being clean: there is no point in evaluating, or calling, the vector-length function, as shown on these lines, in every iteration of the loop. Since it's an invariant, just do it once.

31:38 And so that is the first level of optimization in this example. Um — and I'm sorry about the outside noise; that does not usually happen at this time. Somebody decided to burn something, apparently. Hopefully it's not too annoying. Um, anyway, the slide just says that it is again sometimes hard for the compiler to tell that it's safe to move these things out, so it may or may not succeed in doing this at level two optimization. But it's not too cumbersome for you to actually do it yourself: think of what's invariant and put such things outside the loop. So that was kind of a fairly simple, um, optimization.

32:37 The other one is a little bit perhaps less intuitive, but maybe not. It also uses simpler operations than calling this function to get the element at the actual location you want in every iteration of the loop. You can do better than that: get the starting array location, the memory location of the array, and then just increment it. In this case things were fairly simple: you were just going to walk through the array from start to finish, or to the point you wanted, so it is a very simple stride-one, uh, walk through the array. So there's no point in calling this get-vector-element, or get-start, in every iteration. So, um, I think that covers that transformation.

33:50 So there's one more, and then there will be some more, um, timing results, I think, on the next slide. We also then use, uh, a temporary for the accumulation: so now, again, not using an array reference for the sum; instead the accumulation is done in a local variable.
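Here is a sketch of the whole progression on one loop, modeled on the CMU-style 'combine' examples; the vector type and the function names are my stand-ins:

    /* A tiny vector abstraction, standing in for the one on the slides. */
    typedef struct { long len; double *data; } vec_t;

    long vec_length(const vec_t *v) { return v->len; }

    int get_vec_element(const vec_t *v, long i, double *out) {
        if (i < 0 || i >= v->len) return 0;
        *out = v->data[i];
        return 1;
    }

    /* Starting point: vec_length() is called in the loop test, and every
     * element goes through a bounds-checked accessor. */
    void combine_v1(const vec_t *v, double *dest) {
        *dest = 0.0;
        for (long i = 0; i < vec_length(v); i++) {
            double x;
            get_vec_element(v, i, &x);
            *dest += x;
        }
    }

    /* All three transformations applied: length hoisted, direct stride-one
     * access through a pointer, and accumulation in a local variable. */
    void combine_v4(const vec_t *v, double *dest) {
        long n = vec_length(v);       /* invariant, computed once */
        double *p = v->data;          /* starting memory location */
        double acc = 0.0;             /* local accumulator        */
        for (long i = 0; i < n; i++)
            acc += p[i];              /* simple stride-one walk   */
        *dest = acc;
    }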

34:36 Um, so, summing it all up: first there was the default time, and the gain from optimization level one; then the three code restructurings that I talked about. The first was to move the call to the vector-length function outside the loop, since it is an invariant; the second was to just get the pointer to the array and then walk through it by incrementing by one; and the third was to use a local variable for the accumulation in the loop. This code restructuring now improved the performance quite a bit with respect to even the level one optimization, which was itself already a good factor better than the no-optimization case.

35:30 So these few simple changes to this loop, conceptually simple transformations, gave, what, a factor of three or better, uh, for the double precision multiply, and a factor of eight, I think, more or less, for the integer add. We'll talk a little bit more about these differences in the improvement in the coming slides, but they are all, you know, due to, um, removing calls and, um, making the address arithmetic very simple: just incrementing by one, instead of more complex arithmetic. So, any questions so far?

36:36 So these are, to me, not necessarily at odds with keeping code clean, but one needs to understand it. It is a little bit more difficult, but it can have huge performance gains. So now I'm going to get to the topic of loop unrolling, and it is going to take quite a few slides before I'm done with this example.

37:08 Here's the most trivial example first, and then we'll do a more elaborate example second. To the point: the idea is basically to reduce the number of loop iterations, and instead do a few iterations' worth of work in the loop body. So that means two things may happen. One is reducing the looping overhead, because you don't have to do the test of whether you are at the end of the loop as many times. And it also potentially helps the compiler, when it has all the statements in the loop body, to keep things in registers. So there are, yeah, two gains, and there are others that I will show you in the next several slides. But the first order of business is the looping overhead, and helping the compiler keep things in registers.
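A minimal sketch of unrolling by a factor of five, matching the factor mentioned below; the cleanup loop handles lengths that are not a multiple of five:

    /* Before: one element per iteration, one end-of-loop test per element. */
    double sum_simple(const double *x, long n) {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += x[i];
        return s;
    }

    /* After: unrolled by five, so the loop test runs a fifth as often and
     * the compiler sees five statements it can schedule together. */
    double sum_unrolled(const double *x, long n) {
        double s = 0.0;
        long i;
        for (i = 0; i + 4 < n; i += 5) {
            s += x[i];
            s += x[i + 1];
            s += x[i + 2];
            s += x[i + 3];
            s += x[i + 4];
        }
        for (; i < n; i++)   /* cleanup for leftover elements */
            s += x[i];
        return s;
    }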

38:33 Student: Dr. Johnson — so, is this loop unrolling aimed more at parallelizing at the core level, or is it not necessarily parallelism? And so it will help you even on a single thread?

38:49 Instructor: Okay, I think I like 'parallelizing' in the sense that we're getting the maximum effectiveness out of each cycle.

38:54 Student: Because, in terms of, um, data dependencies, is that the kind of thing we're looking at with loop unrolling?

39:05 Instructor: So, um, the data dependencies are the same. What happens in this case — let me take the example here. Um, this is, I guess, a messed-up slide, I'm sorry: where it should have used the sum variable it has another variable, y or something, and y should be outside, not a loop variable. I didn't spot that; I apologize. What it means is, um, the compiler may generate loads and stores, in the simple loop case, in each iteration: it loads the value, then it does whatever this function does to it, or an assignment statement, and then there's memory traffic generated, potentially, for each iteration in the loop. That is then reduced, in this case by a factor of five, with the unrolling. So it saves that. But the dependency is the same; it's just that instead of a one-cycle access to a register, it may be hundreds of cycles to memory for the data.

40:42 Student: Okay, so I'm not sure I would call that, um, parallelism...

40:50 Instructor: That's why I say it even helps on a single thread: you avoid round trips to memory, most likely, when you do loop unrolling. And for every iteration in the loop, when you're at the end of the loop and get to the test of whether you have to go around again or not, that takes — not much, but it takes a few cycles to do that test. So it eliminates some, um, test arithmetic: the conditional operation, the comparison, uh, in the loop structure; and it potentially eliminates round trips to memory.

41:50 So here is just an example. It is a matrix multiply, which you will see more times than you care for in this course, because it's a simple loop structure. But it shows the effect; this is just one example of doing some unrolling that was used on a particular machine, and it shows what the compute rate was, depending on how the unrolling was applied to the loops. It was, you know, a very old machine, and the numbers are not very interesting in themselves. But the difference is: you got a factor of three improvement in the performance for this particular loop by doing the loop unrolling, and again, it is the round trips to memory avoided that did most of the gain here, plus some reduction in looping overhead. But most of it is due to reduced memory traffic.

42:58 So now I'm going to come to talk about a more detailed example. It was done on an Intel Haswell CPU, which is the machine we will be using for the next assignment at the Pittsburgh Supercomputing Center; the Bridges machine uses the Haswell. So here's a little bit about the number of functional units it has, for both integer and floating point operations, and, um, how some of the functional units operate. When I talked about processors, I don't think I said too much about functional-unit-type designs in terms of latency and ports and cycles per issue. I talked about the latency to the various levels of caches and main memory,

44:01 and I talked about processors, but I don't think I mentioned these pipeline depths, as it is sometimes called, or latencies, of the functional units, whether, uh, multipliers or adders. Execution is done in the form of pipelined units in modern processors. So what that means: it's like, um, I guess, a typical cafeteria counter, right? You go and pick up a tray, and then you go by the various pick-up stations and eventually get to the paying end. And as you pass through the different stations, other people can, you know, be at your previous station — as opposed to a version where one person would have to pick up all the different pieces and pay before the next person can pick up a tray. So that is what hardware designers do.

45:24 Instead of taking, you know, a floating point add or multiply as one indivisible operation — you have some notion of floating point numbers, right: they have a mantissa, and they have an exponent, so they create sort of a dynamic range by modifying the exponent. So if you want to add two floating point numbers, one step is to line them up so that they basically have the same exponent; otherwise, you can't deal with what's known as the mantissa, or the fraction. So there are several different operations to be performed in order to do an add or multiply on such numbers, and that means a fair amount of logic.

46:12 So what hardware designers then do is make sort of stages: you may do, in one stage, one part of the operation, and that's a certain amount of logic, and then there's kind of a register, a buffer, where you keep the result of that. And then it moves down to the next thing you want to do, and another pair of numbers can be aligned, right? So what it means is that the clock rate is dependent on the longest, kind of slowest, path for each one of the stages, as opposed to the whole operation. So it's again the cafeteria analogy: how fast people move into the cafeteria line depends upon the slowest station along the cafeteria counter.

47:11 So what this idea of pipelining does is that it allows designers to increase the clock rate, compared to doing a complete operation as a single entity. So it increases clock rates, and by that — if you follow the cafeteria analogy — it increases the throughput. So that's what you see on this slide: you know, load, store, multiply and add, and even the floating point multiply-add, um, can accept new operations into the pipeline every cycle; that's what the cycles per issue being one means. But it takes more than one cycle to do the operation: the store in this case takes four, and in the case of the floating point multiply it takes five cycles to get the result of the multiply, but you can start a new one again every cycle. So this is the way it works.

48:13 Now, the one to pay attention to, which is unfortunate in most cases, is division: yeah, the division is a fairly complex operation, and many times it is not a pipelined function. That's why, I think, in some lecture in the past, I mentioned that division is very expensive, by no means comparable to multiplication, in modern processors. And the way division is done is often through some iterative algorithm; the one that is typically used is a variation on Newton's iteration, being done to carry out the division. All right, there is a lot about this. Any questions on this before I continue with the example?

49:07 Okay, so here is just, you know, a cartoonish example of what it means for the pipeline, with respect to a particular, still tiny, code in the yellow box: first you compute the product of a and b, and another product of a and c, and then you compute the product of the two products that were created. So in this case, when you have this pipeline, it takes, in this case, three cycles to do a multiplication. As a times b moves through the pipeline, you know, uh, in cycle four, sort of, a times b is ready to be used; but already one cycle in, one can use the multiplier to start the second product, a times c, which just follows a times b by one stage. And so then the a-times-c product becomes available, and now one can use it to compute the product of the two partial results, and so that is started in cycle five. So in this case, that's what it takes to do these three statements. So this is where the notion of loop unrolling, as shown on the next slides, comes in, and how it reduces the start-up overhead.

50:46 Okay, now let's see what happens here. This was, I think, what was on the previous slide: 'combine4' was the result of the three optimizations that were explicitly made by the programmer to this simple loop, and the outcome is the fourth line in the table in the upper left-hand corner. And then, at the bottom of the table, it shows what the latency-bound times would be. Those come from the table on the lower right-hand side, which is, again, with respect to the pipeline depth, the latency for each one of the functional units. So this means that the result from this combine4 optimization is pretty close to what one can expect if one is bound by latency. But ideally it would be better, in this case, to be bound by the throughput, the cycles per issue. Ideally, you would like to get close to one result per cycle at least, or even better with your multiple functional units.

52:11 So here's the reason why things are not throughput bound but latency bound. This is illustrated with this kind of, uh, multiplication — if it were, you know, an add, the same thing would be true — namely, that the result of one of these multiplications is used in the next iteration of the loop. So if you look at this loop, what we have in the yellow boxes is t: the new t is depending on the old t combined with the result computed from the x element, using multiply as the operation. So that means one has to wait until that result is ready before you can apply it to t. So that's why you get this dependency graph that is, ah, limited by the latency of the functional units.
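In code form, the loop-carried dependence looks something like this (a sketch; the exact loop on the slide may differ). With a multiply latency of, say, five cycles, each iteration must wait for the previous one, so the loop runs at roughly five cycles per element no matter how many multipliers exist:

    /* Latency-bound: each update needs the previous acc, so the pipelined
     * multiplier accepts a new operation from this loop only once the
     * previous result comes out. */
    double combine_mult(const double *x, long n) {
        double acc = 1.0;
        for (long i = 0; i < n; i++)
            acc = acc * x[i];   /* serial dependence chain through acc */
        return acc;
    }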

53:25 So one is obviously trying to work around that somehow, and one way would be to try to do something like this, um: that is, do loop unrolling, as we can see here, um, where you do two iterations of the loop per iteration statement. So now let's see what happens. Would anyone actually volunteer: did this help or not?

54:11 Student: It should help in terms of overhead, but in terms of the latency, it can't — it didn't actually take away the latency.

54:20 Instructor: Let's see what happened. Um, so here is now what it did, the unrolling that was just done on the previous slide — and nothing really changed much, except a little bit for one of the cases, and I won't go into that too much.

54:48 The point is, we wanted to get beyond the latency bound, and the reason things didn't improve is because we still have a statement that has the same latency dependence as we had before. So in this case, the loop unrolling didn't help. But that doesn't mean that loop unrolling is not a useful thing to do to figure out how to use pipelines better. So we're doing a different version. Um, now we do the unrolling differently: in this unrolling, we change the association, so the operation on the two elements is done first, and then we apply it to the accumulation variable. So now, if we do that, um, we're going to see — this is the 'two times one a' version — that things actually improve quite a bit.
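A sketch of that reassociated unrolling (the '2 x 1a' naming follows the CMU-style slides): the two elements are combined first, so that part of the work is independent of the accumulator and can overlap with the previous iteration's update. Note that floating point arithmetic is not strictly associative, which is exactly the kind of thing a conservative compiler will not change on its own:

    /* Unrolled by two with reassociation: (x[i] * x[i+1]) does not depend
     * on acc, so it can execute while the previous update is in flight. */
    double combine_2x1a(const double *x, long n) {
        double acc = 1.0;
        long i;
        for (i = 0; i + 1 < n; i += 2)
            acc = acc * (x[i] * x[i + 1]);   /* parentheses change the association */
        for (; i < n; i++)                   /* cleanup */
            acc = acc * x[i];
        return acc;
    }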

56:02 One then kind of wonders: how come, what is so different between associating the operations in the two different ways? And that actually depends on the underlying architecture, um, which has two load/store units and, in this case, also two functional units for floating point operations. So this is kind of what actually now happens in the code: a pair of variables is loaded in a single cycle, because of the two load/store units, and they are pipelined. They have a cycles-per-issue of one, so basically every cycle you can load new things. And then the fact that there is a latency before the result is available — that latency applies to each of the multiplications, in this case, being done; so they basically follow each other, offset by just one cycle. There is a latency of three for the first one, but then results come every cycle, so the accumulation is done every cycle. So this structure then allows using both load/store units concurrently, and the accumulation does not become latency bound. So now, on this, we got the factor of two improvement, as we showed on the previous slide. Any questions on that?

57:55 I've got a couple more tricks. So this is again where it's important to understand the architecture: most of the architectures today have at least two, and many times more, replications of the load/store units, um, to keep things somewhat balanced with the replication of functional units in these VLIW-type architectures. So here's another trick: try to use two separate accumulators. So now, still loop unrolling, but instead of using a single accumulation variable, we now add every other element into one variable and the remaining elements into the other. So we can certainly still use the two load/store units, one for the one value and one for the other in the same cycle.
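A sketch of the two-accumulator version (the '2 x 2' form in the CMU-style naming):

    /* Two independent dependence chains: acc0 and acc1 can both be in
     * flight at the same time, using the replicated functional units. */
    double combine_2x2(const double *x, long n) {
        double acc0 = 1.0, acc1 = 1.0;
        long i;
        for (i = 0; i + 1 < n; i += 2) {
            acc0 = acc0 * x[i];
            acc1 = acc1 * x[i + 1];
        }
        for (; i < n; i++)        /* cleanup */
            acc0 = acc0 * x[i];
        return acc0 * acc1;       /* combine the two chains at the end */
    }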

59:08 So do we, um, think we'll get the same result as with the other version, the single variable with the better association, using the two load/store units? Well, we did: because we are still using the two load/store units, we didn't get beyond that. So, at least, this slide also shows that while we have moved beyond the latency of things, we still have some distance to go to something that is limited by the throughput.

60:17 So, um, that is what it says, right? We still have the, um, latency for each one of these two chains, right? So we had enough functional units to do things in parallel, but we still have the latency dependency in each one of these variables. So here's a small diagram, right, that we use for illustrating these kinds of pipelines, because we have multiple functional units that can do the same thing in the same cycle; but then, again, we have a latency in each of the units. So now I'm going to try to combine these two, to get to the things that are bound by throughput. And that is to combine the loop unrolling and using multiple, um, accumulators, and do the association in a proper way.

61:22 So now I'm not showing a code example, but simply what happens with the use of unrolling and multiple accumulators, in this case for the multiplication. With a single accumulator, in this case, then regardless of the loop unrolling depth, you're still latency bound. And you can see, in the best case, one is getting to the throughput-bound limitation, and that is 0.5 — because, again, you have multiple functional units. So even though the throughput, or rather the cycles per issue, was one, and you can issue, you know, one pair into the pipeline each cycle, you have two of them. So that's why the throughput bound is 0.5, and you get very close to it.

62:41 It also shows, a little bit, that if you start overdoing it, unrolling to a very high degree with multiple accumulators, you may actually become less efficient. So there's usually a trade-off in how far you want to unroll, and it's all depending on, you know, the resources — the number of registers you have accessible — because for the accumulators you have several variables to keep track of, and at some point you get diminishing returns. And then this is, I guess, for the integer part, to show a similar trend: the numbers come out a little differently, given the different setup in terms of latencies and the rest, but the principle is the same.

63:34 Yes. So this was working through how to use loop unrolling, where we also got a benefit in the case where the units are pipelined — in this case it was both the load/store units and the arithmetic units being pipelined. Any questions on that?

64:05 So, loop unrolling is something compilers definitely try to do, and they're usually really good at it. But, um, it is also something one can do by hand. The depth of the unrolling, what is the best, is something that is platform dependent, because, as I mentioned, it depends on the size, or the number, of registers you have — or, as it is called, you know, the size of the register file. Um, so it depends, you know, on how much can be kept close to the functional units without having to go to the caches or, uh, main memory. If you're too ambitious and go too far, things may not fit in registers anymore, and then things, as it is said, get spilled to memory.

65:05 So now a few more things that compilers do. They try to kill code that is defined but not used; that reduces the code size, obviously, if it's never executed. Like in this case: this variable, I say, has no use, so just get rid of this piece of code — don't execute it, don't use it. This other one, common subexpression elimination, is to just evaluate an expression once and then reuse it. In particular, if there are complex functions — like log, as I also mentioned: it's an expensive operation, it is not a one-cycle operation — log and the intrinsics, sine, cosine and the other trig functions, and exponentiations: all those functions that are built in to most programming languages, they are quite expensive compared to an add or a multiplication. So evaluating them once, instead of repeatedly for the same argument, is good practice.
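A sketch of common subexpression elimination with an expensive call (the function is invented for illustration):

    #include <math.h>

    /* Before: log(x) is evaluated three times for the same argument. */
    double f_before(double x) {
        return log(x) + 1.0 / log(x) + sqrt(log(x));
    }

    /* After: the common subexpression is computed once and reused. */
    double f_after(double x) {
        double lx = log(x);        /* one expensive call instead of three */
        return lx + 1.0 / lx + sqrt(lx);
    }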

66:16 And I think this — now, this is kind of another strength reduction, that maybe should have been shown earlier, but it is used together with common subexpression elimination, in this case for the array index arguments. The left-hand side is maybe nicer to read, to understand the code, and the right-hand side economizes on the number of arithmetic operations that it does, by using a couple of temp variables. And this is another thing compilers try to do: if the compiler happens to know, at compile time, a certain variable's value, then it may actually just totally eliminate operations — adding zero doesn't do anything; multiplying by zero just gives zero; multiplying by one doesn't do anything — so it may eliminate those operations from your code.

67:13 And this is, I think, what I already said: what compilers try to do, and what I show in this example, is source-to-source transformations. Whether one wants to do this as a programmer — or do it in the first place, so you don't have to later rewrite — being aware of what compilers try to do lets you do the proper source transformation in order to make code that, uh, can execute efficiently.
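A sketch of those last two ideas, strength reduction on an array index and constant folding (the variables are made up):

    /* Strength reduction: replace the multiply in the index computation
     * with a running value that is updated by addition. */
    void fill_column_before(double *a, long n, double s) {
        for (long i = 0; i < n; i++)
            a[n * i] = s;          /* a multiply on every iteration */
    }

    void fill_column_after(double *a, long n, double s) {
        long idx = 0;
        for (long i = 0; i < n; i++) {
            a[idx] = s;
            idx += n;              /* an add instead of a multiply */
        }
    }

    /* Constant folding: with the values known at compile time, the add of
     * zero and the multiply by one are simply removed. */
    double fold(double x) {
        double zero = 0.0, one = 1.0;
        return (x + zero) * one;   /* compiles down to just x */
    }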

67:46 Now, I had a couple of more things, um, in terms of instruction-level parallelism. So this was touched on a little bit in terms of potentially using multiple functional units at the same time, but this is more of a — still a very simplistic — picture of a modern processor. As we know, Stampede, you know, has up to 24 cores, if I remember right, in what makes up the Skylake version — no, 28, I guess; there is more than one Skylake version and I don't remember off the top of my head which one Stampede has, but that's beside the point. And then each one of the cores also has replicated functional units of the same kind. So in this case, this little graph shows that the arithmetic units have been replicated, and that there's a separate load and a separate store unit — the previous example actually had load/store units that can do both loads and stores, but anyway. So these so-called wide instructions, that is, word-wide instructions that may contain a single operation for each of several units, allow you to potentially schedule operations in the same cycle for several of the units. So this is, I guess, what I already said. And sometimes it's called VLIW;

69:39 sometimes it's called superscalar; it has been used for quite some time. And here is a repeat, for the Haswell, in terms of arithmetic units: there are, uh, two floating point multipliers, one add, one divide, and four integer units, and then there are the load/store units.

70:02 Um, so now, for the Skylake, it has what's known as the advanced vector instructions, version two — the AVX2, for short, when it comes to Intel's vocabulary — and also the 512-bit wide instructions. And it is just a little bit different when it comes to the data paths, or what data may be contained in the 512 bits. Um, so, um, they can have, you know — for double precision — this doesn't come out right in my mind; part of the graph doesn't quite fit: it should be an eight when the width is 512. Uh, it's not right — so, you see, every time you look at something you find something; I made that one myself, and that's what happens when you work at the board, kind of thing. Um, but the AVX-512 is 512 bits, and that allows you to have eight double precision numbers, or 16 single precision numbers; but the principle I'll talk about is, um, the same.

71:59 Now, SIMD is not exactly the same as, well, the wide instructions we talked about before. SIMD is the same operation — so, the same opcode, or instruction — for the whole array. That's why, for instance, in the stream benchmark, if you take the add function: we add corresponding elements of two arrays, so for each pair of elements we use the add instruction. So that is amenable to SIMD operations. The wide instruction, on the other hand, allows you to combine different operations in the same wide instruction word. So the SIMD operation is, you know, something that is usable when there are multiple adders or multipliers, for instance; then you can have this SIMD instruction. And this just shows that, depending upon the width of the data, you get fewer or more of them in one wide register.
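As a sketch of what such an instruction does, here is the stream-style add written with AVX2 intrinsics (assuming a CPU with AVX2, compiled with -mavx2, and n a multiple of four to keep it short):

    #include <immintrin.h>

    /* c[i] = a[i] + b[i], four doubles per SIMD add instruction. */
    void stream_add(const double *a, const double *b, double *c, long n) {
        for (long i = 0; i < n; i += 4) {
            __m256d va = _mm256_loadu_pd(&a[i]);   /* load 4 doubles  */
            __m256d vb = _mm256_loadu_pd(&b[i]);
            __m256d vc = _mm256_add_pd(va, vb);    /* one SIMD add    */
            _mm256_storeu_pd(&c[i], vc);           /* store 4 doubles */
        }
    }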

73:02 So now, I think this, I believe, is the final point on this example that I borrowed from my colleagues at CMU: if you now also use the fact that you have this SIMD capability in each, um, of the units, then in this case you get a factor of, well, four in terms of throughput — um, more or less; it depends on the particular operation. It is basically a factor of four for the floating point, and it's a little bit different for the integer add, because there are more of those units: there were four integer units in this particular processor architecture. So in this case, what one can see is: using the tricks of unrolling, et cetera, and then using the SIMD feature on top of that to exploit all the parallel functional units, the outcome is very close to the lower bound, at least for the floating point. The integer math was, I guess, quite successful in this case too. So, I think — um, any questions on that?

74:56 So I think the punch line, all along this example, is that, I would say, in general it normally pays to be aware of this, and to not count on the compiler being successful in restructuring a fairly nice and clean code — as opposed to writing something where you are prepared for what the compiler will do, by rewriting your code accordingly. Because, again, you know what is safe to do, and the compiler has to infer it from potentially quite different code. Okay. And I had a couple more slides in terms of branching.

75:46 And branching is a little bit tricky, so in case one is not familiar with it, I will just quickly show what modern processors do. Most of them have what's known as branch prediction. So when there are conditionals in the code, they try to guess which way the execution is going to go, and based on the guess of where the execution is going to go, they load instructions and data assuming that's going to happen.

76:32 And if the guess turns out wrong, they, in a way, need to undo that, and basically, on the error, either backtrack or redo the execution, um, past the branching point, um, and figure out which way to go. So if it has executed ahead, then there's a need to roll back, because then some of what has been computed is wrong. Um, at the very least, the preparation for the branch that was readied is wasted, and it basically then has to start to load the things according to the, uh, branch that the execution actually wants to take, as opposed to the way it guessed. So this is, kind of, a little bit what these several slides show: what the branching does and kind of how it works.

77:54 And nowadays, processors have quite sophisticated record keeping: they sort of keep statistics for branches and see, for the particular code and the data set it is being run with at run time — you know, how many times the branching was going this way and how many times it went that way. And they continuously update this history of branches taken, in order to try, in the end, uh, to minimize execution time.
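A small sketch of the kind of branch the predictor has to deal with: with random data, the condition below is guessed wrong about half the time, while with sorted data the accumulated history makes the same branch almost perfectly predictable:

    /* Sum only the large elements: the if is a data-dependent branch.
     * On random data the predictor misses often; on sorted data the
     * consistent history makes the branch nearly free. */
    long sum_large(const int *a, long n, int threshold) {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (a[i] >= threshold)   /* predicted from its recent history */
                sum += a[i];
        }
        return sum;
    }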

78:36 So, since my time is up, I will just show you what slides are in there, and you can take a look at them. But it's just explicitly showing what I just said: how it behaves in terms of branch prediction, and what is lost when a prediction is wrong. So with that, I think we'll stop for today — because, again, to summarize: we went over the machine architecture; don't challenge the compiler more than necessary, structure your code, and then use good tools to understand what actually happens. And there's a note on this slide, or maybe on the next slide, that one should look at the assembly and see what happens. Yes. Okay, we'll stop there and see if there are questions; I'll stop sharing the screen for the moment. What — any questions?

80:17 Student: I had a question on the last part, that is, the branching. So, the effect of loops and branching on performance — can you show that last slide again? ... Yeah, the one you showed just now. Okay.

81:04 Student: So, this branching on this slide — the prediction: I did not get the last part. How is the prediction done? Like, how are the performance and the prediction linked with respect to the loops?

81:27 Instructor: Okay. So branching does not necessarily need to be inside loops; they're independent concepts. Um, so you may have conditionals without loops — it's just, uh, straight-line code, no loops — and depending, certainly, on data values, you want to do one thing or another. So the branching has to do with what code is going to be executed, depending upon whether a condition is true or false, for instance. That's what the branching is about. And if one has no prior knowledge, then, you know, the best guess is basically equal probability for each branch.

82:28 If, as in this case, the conditional is inside a loop at some level, then there can be a history, saying that in the first, say, 30 times this conditional was, uh, evaluated — if we call the two outcomes just the left and the right branch, you know — in 25 of these 30 cases the left branch was taken, and only in five the right branch was taken. So do I then set myself up for assuming, the next time that this condition is evaluated, that it will most likely go the same way as the 25 other ones? Then I can preload all the instructions and the data for that branch of the condition. If, on the other hand, when it actually evaluates the condition, it turns out it was going in the other direction, where only five instances have been seen, then, obviously, I kind of have to restart things from where the condition is, in the state that is proper at that point in the code — not in some, um, later state that assumed it was going the other way. So you have to, in that case, in effect backtrack the computations — or do the things you can do without using the variables coming in from the condition.

84:28 Student: Okay, thank you.

84:38 Instructor: So that's what the branching does. And the next slides show, in some detail, how it works in a particular example. Yeah. Okay.

84:53 Anything else? So I guess I'll stop the recording, then. Thank you.
