© Distribution of this video is restricted by its owner
00:01 So now, coming back to my comments: a little bit of a reminder in terms of trying to understand the performance of GPUs. Some of this is specific to what's called Pascal, which is a few generations old among the GPUs, but it is the GPU on Bridges, so it's relevant for our purpose. Some of the details on other GPUs may be a little different. Just remember this kind of hierarchical clustering of things: there is what is called the GPC, the graphics processing cluster — big chunks, and there were six of them. I don't remember all the pieces; it's on the slides anyway. Each of them has ten of these streaming multiprocessors, and then each one of those has a bunch of these CUDA cores in it.

01:12 And I'm using the CUDA terminology, basically, which is warps and thread blocks, instead of vectors. The native width that was around for a long time was just 32. Then they got more silicon, so to speak, or they got finer feature sizes, so at some point — I think Pascal — they got to 64 cores per streaming multiprocessor. But they still have this 32-wide grouping of threads — SIMD threads grouped into a sort of vector, the warp — so that's still there. And then they group warps into what they call thread blocks, and they manage the blocks independently of each other.

01:59 And here's a little bit of what that looks like — more or less what I said. Again, if you take the GPU seriously, and maybe you have a chance to do it for the assignment, play around with it and see how this kind of structure gets reflected in the performance. You see the structure of what's happening underneath.

02:24 And back to the thing that I have tried to stress all through the course: memory hierarchies. In this case — yes, it is Kepler, which was the generation prior to Pascal in terms of GPUs — you see what the memory latencies are in terms of where things hit: whether it's in the shared memory within the streaming multiprocessor, or the unit that is shared between the different units in a streaming multiprocessor, or whether you have to go outside and go to your device memory. You see a little bit what the difference is. So even for GPUs, going to the device memory is quite expensive compared to staying on the silicon close to the streaming multiprocessor.

03:30 For that, they have what is known as memory coalescing. The way they have designed their memory controller, as it seems to me, is that they can have one outstanding request, or two, to the memory at the same time. Otherwise, things get serialized. So if one of the SIMD threads that gets grouped up into a warp makes a request to memory for some data, then there are 32 requests issued, because it's SIMD — all at the same time, synchronously. So 32 requests hit the controller at the same time, and if the controller can handle, say, two requests, that means it takes 16 rounds for the memory to serve these 32 guys. That's no good. So that's why they do what they call coalescing: trying to merge these requests into no more than two. So in this case, in the example here where the requests are eight bytes each, the eight requests shown then turn into one request for 64 bytes. And I claim that's a good thing.

04:49 And the way I interpret the slide — yes, I can claim that this is a good thing based on what NVIDIA already wrote down. If you go back and look at the architecture that they use, there are 384 bits, or wires, to their memory. So that is 384, and the memory is GDDR — the first D is "double data rate." So per clock period of the memory, you get two rounds of 384 bits. And 64 bytes is 512 bits. So, fine: with two rounds in one memory cycle, they can serve the 512 bits. So that's why I think they use this particular 64-byte granularity.
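The bus arithmetic behind that claim can be sketched in a few lines (the figures — warp width 32, 8-byte accesses, 64-byte transactions, a 384-bit double-data-rate interface — are the ones quoted above, assumed rather than checked against a datasheet):

```python
warp_size = 32          # threads issuing one SIMD memory access together
bytes_per_thread = 8    # the 8-byte requests in the slide's example
transaction = 64        # coalesced transaction size in bytes

# Uncoalesced: 32 separate requests at ~2 outstanding per controller
# means 16 serialized rounds to the memory.
assert warp_size / 2 == 16

# Coalesced: 8 neighboring 8-byte requests merge into one 64-byte
# transaction, so the whole warp needs only 4 transactions.
assert transaction // bytes_per_thread == 8
assert warp_size * bytes_per_thread // transaction == 4

# A 384-bit double-data-rate interface delivers 2 * 384 = 768 bits per
# memory clock: enough to cover one 512-bit (64-byte) transaction.
bus_bits, ddr_factor = 384, 2
assert bus_bits * ddr_factor >= transaction * 8
```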

05:40 But it's again coming back to what I tried to stress before, and then repeated when I talked a little bit more about GPUs: understand the hardware architecture, understand the data paths in the system, in order to figure out how many rounds it takes to get things in and out. When you try to interpret how come things behave the way they do, go back and look: how wide are the data paths, how much data was I trying to transfer, and how does it get transferred? Then you can get an idea of what it takes.

06:17 That's what I was going to say about OpenACC. And again, it's a very small portion of what you can do — just an intro to give you a flavor of what is there. The slides have URLs for you to find things: more information in terms of tutorials and the current specification, other things. So for interest beyond doing the assignment, you know, go and look at the URLs. Any questions? No? Okay, I think so.

07:06 So now a little bit of — I wouldn't say algorithms, but it's kind of about algorithms. Coming back to the memory hierarchy again, and figuring out the impact of code, and trying to analyze a little bit what to expect in some very simple cases. So today it's basically simple things: how code acts in terms of caches.

07:34 So here is this kind of plot, from a researcher — one of the compiler guys, now at UT Austin. He did this work quite some time ago, and so the data is old, but qualitatively it is still highly relevant. In this case, don't worry about all the different versions that they tried out to make things run well. The only take-home message for this class is basically this: the things at the very bottom are just barely above zero, and if you do something really clever, you can get something that, in this case, is probably about 2000 times better performing.

08:26 And it's not about being clever in terms of the algorithm. You people have probably heard about Strassen's algorithm, which does matrix multiplication in n to the 2.7 or so operations instead of n cubed. That's not the game here. It's just playing around with your three nested loops — that's it. And that can have a dramatic impact on performance.

08:55 So we're not going to do all the different versions, but basically: start with a matrix-vector multiplication and figure out how you can understand what's happening with caches, depending upon how you write the loops for your matrix-vector multiplication.

09:14 So here's a typical way you'd write a matrix-vector multiplication: you do a collection of inner products, one per row of the matrix. And then you look at the number of memory references. We're trying to understand the memory performance again; we're not trying to do anything about the algorithm itself. In this case, it says four n squared memory references. What that means is: in the inner statement, you need to read an element of x, then you need to read a, and then you need to read y — that's three — and you need to write y, so that's four. And if nothing good is happening, you need to do that for as many statement instances as are being executed, and there are n squared of them. So there are four references in the statement, and n squared executions of the statement: that's four n squared memory references that you need to do. Hopefully, if you have caches, you ought to be able to do better. That's what we're going to figure out: how much better it can be.

10:30 And it says here — the naive thing here is just the point that, you know, this is the total count. Yes, any decent compiler will not read and write y back to memory every iteration. And if your compiler is not that great and misses this, you can fix it by introducing a temporary yourself, and then it doesn't get confused. But we'll talk about that a bit later.
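As a sanity check on that count, here is a toy sketch (my own names, not the lecture's code) that runs the naive inner-product loop nest and tallies the four references made by each execution of the inner statement:

```python
def matvec_reference_count(n):
    """Naive y = A @ x, counting 4 memory references per inner statement:
    read A[i][j], read x[j], read y[i], write y[i]."""
    A = [[1.0] * n for _ in range(n)]
    x = [1.0] * n
    y = [0.0] * n
    refs = 0
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]   # the statement executed n^2 times
            refs += 4                # its 4 memory references
    return refs

assert matvec_reference_count(100) == 4 * 100 * 100
```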

11:01 So I guess maybe I should slow down a little bit here. How many of you have had a course talking about caches and how they work? One, two, three... okay, alright. So I showed you pictures in lecture four: you know, level one, level two, level three, maybe even a level four — a hierarchy of things. And the cache closest to the functional units tends to operate at the same clock rate as the functional units, so it can transfer every cycle. Typical designs — for the case we're driven by, scientific and engineering stuff — do a lot of matrix-vector, matrix-matrix, and similar operations. So that means, like the vector example, you tend to have two or three inputs: in this case a and x, which is two loads, and then you have y, which eventually needs to be stored. So the typical design is for two concurrent loads and one store: the level-one cache can ship two 64-bit arguments per cycle to the functional units, and it can take one result back into the cache from the register file. So they tend to be designed for two loads and one store per cycle. Different vendors do it a little bit differently, but that's one typical setup.

12:39 Then, once you move to level two and level three, typically they don't have the same capability as level one, so things may run a little bit slower — they may run at half the clock, or a quarter of the clock. So even though the path may be about the same width, the data transfer rate is not the same.

12:59 Now, there is more. I think I also mentioned in those slides that most caches today are set-associative caches. The other versions are direct-mapped and fully associative. Direct-mapped means that every memory location in the main memory has its own designated place in the cache; it can only go to one place, and if that place is already taken, then somebody needs to decide what to do. A fully associative cache is the other extreme: a memory location can go anywhere in the cache — anything that is free, just go there and grab it. And the practical thing, in order not to have to search for where the free places are, is that you just look up a few places and say, okay, fine, choose between four, eight, or sixteen places. That's considerably less complex than looking at all of them. So most caches are 4-way, 8-way, or 16-way set-associative.

14:15 . When things are already occupied, do you do so well, do

14:20 just sit around and wait until somebody that that's not needed anymore? Or

14:26 you then decided to say, fine. I'm going to kick

14:33 If it's never been updated, you necessarily need to kick it up.

14:36 that cash line have been updated, need to write it back somewhere.

14:42 then they're different policies for the cash as to which one's gets kicked

14:50 and the most common one is, at least recently used. So that

14:57 the things that has been sitting untouched the cash for the longest time gets

15:03 out, or and then over then there's a process. So when

15:08 get this cash, miss to write back to memory and then load whatever

15:13 want into memory. So L You policy is what's being used in

15:18 here, and that's something Thio again attention to on most of these

15:29 At least they are on the slides telling which processors use what kind of

15:33 four walking, Um um, so the cash vocabulary, then, depending
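For intuition, a fully associative LRU cache can be modeled in a few lines (a teaching sketch with made-up capacity, not how hardware implements it):

```python
from collections import OrderedDict

def lru_access(cache, capacity, line, stats):
    """Model one access to a fully associative LRU cache.
    `cache` keeps line tags ordered least-recently-used first."""
    if line in cache:
        cache.move_to_end(line)          # hit: refresh recency
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        if len(cache) == capacity:
            cache.popitem(last=False)    # evict least recently used line
        cache[line] = True

cache, stats = OrderedDict(), {"hits": 0, "misses": 0}
for line in [1, 2, 3, 1, 4, 2]:          # capacity 3: line 2 is evicted
    lru_access(cache, 3, line, stats)     # before its reuse
```

On this trace only the second touch of line 1 hits; with capacity 3, line 4 pushes out line 2 just before line 2 is needed again.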

15:43 So, the cache vocabulary, then, depending upon what happens when you try to load — which is to say, read — something. One thing is what is known as compulsory misses. That means it's never been in the cache; it needs to migrate from main memory to the functional units, so there is no option, you have to get it in there. That's what the notion of compulsory misses is. The other kind of simple case is that the cache is full — because it's a lot smaller than the main memory, at some point it gets filled up. And then, when it's full, you get capacity misses: misses because it's full, okay. And then there's the thing that has to deal with the associativity, in particular when places are occupied: those are conflict misses. So today we'll do the first two cases, which are relatively simple and clean to analyze. It gets more complicated when you look at the conflict misses.

16:49 So the obvious thing is that a large cache is good. And for the analysis, I'm going to use two models. One says the problem size is small enough that the whole problem, so to speak, fits in the cache: you have a small matrix and a small vector, and it all fits in cache. The cache is kind of large relative to the problem you have. So that's the notion of a large cache I'm going to use. The other one is a small cache — that is, the cache is relatively small compared to whatever arrays of data you have.

17:28 Now, when you design caches: the larger the cache, obviously the fewer capacity and conflict misses, but the longer the hit time, because you need to look over the larger address range and figure out the right entry and retrieve it. Higher associativity reduces conflict misses but increases hit time, as I mentioned before. And larger block sizes — cache line size, in this case — yes, the more you load at a time, in some sense, the better it could be, because if you need the rest of the data in the line, that's a good thing. If you didn't need it, you loaded a lot more data than you need. So there is a trade-off between the number of cache misses and the access time spent loading stuff you don't want. That's the trade-off, and that's why you see pretty much all the architectures out there ending up at 64-byte cache lines, at least close to the functional units. They may have 128 in some of the cases as you get closer to main memory; in that case, you take 128 bytes out of memory, and then you sort of partition it and do 64 bytes in the level-one cache.

18:41 Now, this is the one thing one needs to remember — and you should remember this one, because it's sort of important for trying to understand cache behavior. And that's this notion of, um, stack distance, or reuse distance: how many cache lines do you touch between two events you're interested in — which in our case is going to be trying to read the same data again. It tells you, basically, how many things get stuck into the cache before you try to get back to that data. So if it's a lot, most likely what you want again has been kicked out and you don't have it in there. So that's where we're going to use this notion.
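A rough sketch of computing that quantity from an access trace (the names are mine; the items stand for cache lines). With a fully associative LRU cache of capacity C lines, an access hits exactly when its reuse distance is less than C:

```python
def reuse_distances(trace):
    """For each access, the number of distinct items touched since the
    previous access to the same item (None = first touch, a compulsory miss)."""
    last_pos = {}
    dists = []
    for i, item in enumerate(trace):
        if item in last_pos:
            dists.append(len(set(trace[last_pos[item] + 1 : i])))
        else:
            dists.append(None)
        last_pos[item] = i
    return dists

# 'x' is re-read only after 2 distinct other lines were touched
assert reuse_distances(["x", "a", "b", "a", "x"]) == [None, None, None, 1, 2]
```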

19:31 So this is now used to figure out, you know, what you can expect: will the system kind of behave as in the large cache case, where you don't have to worry about caches, or does the cache kind of get in the way of getting good performance, so that you have to worry about cache misses? That's what this kind of analysis tries to say. And for the matrix-vector multiply, I think I already said it's quite simple: in the large cache model, again, you get the compulsory misses, because you do need to read the data once, but there are no capacity misses, because, you know, it holds the whole problem. In the small cache model, you need to worry about how the cache is operating.

20:24 So here are basically five scenarios, and I'm going to see if I can get through four of them for the matrix-vector case today. The first two are basically doing the analysis when the cache line is just one number. And then you try to figure out what's the benefit of having a larger cache line, and whether it's good. But it depends, as we will see. So back now to the matrix-vector code and its two nested loops.

21:10 obviously that stand square data elements we to read X way we need to

21:15 light and we need to write So, um so in that

21:22 uh, in terms off the once we have loaded why it's again

21:26 the large cash it's there. So don't miss them. We need to

21:30 it. So we basically and Mrs X and Y and n squared for

21:36 , eso that gives you the total of this is and relative to the

21:41 model that just did the four in thing. Then you get best

21:47 It's 0.25 plus, 05 divided by rights. Just like small end is

21:55 then eventually goes down to missing about quarter off Now, so the next
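Plugging in the numbers — a quick check of the miss-ratio formula, nothing more:

```python
def large_cache_miss_ratio(n):
    """Large cache model for y = A x: n^2 misses for A, n each for
    x and y, out of 4 n^2 total memory references."""
    return (n * n + 2 * n) / (4 * n * n)

assert abs(large_cache_miss_ratio(10) - (0.25 + 0.5 / 10)) < 1e-12
assert abs(large_cache_miss_ratio(10**6) - 0.25) < 1e-5   # tends to 1/4
```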

22:04 The next case is the small cache model. So now what happens? At the top, there is basically the string of memory references that the loops do, and everything that is shaded in one way denotes going through the first row from start to finish; then the first differently shaded item after that is going on to the next row. So, still no surprise: the misses for A — A is never reused anyway, so we need n squared misses, one per matrix element, regardless of cache size. x is being reused, right? Every row uses the same vector x, so there's potential reuse of x. And y is reused too — within a row — depending upon how you write the loops.

23:02 So if it's a small cache, the idea is: you go through the shaded references, and, you know, an element of A gets loaded and sits in cache, x gets loaded and sits in cache, you have y[0] and it sits there the whole time it gets iterated on, so that's not really a problem. But you keep loading elements of A and x, and at some point, the cache is full — it's a small cache. That means, when you're at the end of the row and you want to start with the next row, the idea is: it's a small cache, so the beginning of x is no longer left in cache; it got overwritten when you were, you know, some fraction of the way through the row, depending upon the size of the row. So x is gone. So first I had to load it for the first row — that's the n elements in x — and then I need to reload it for all the other rows as well: that's n loads for every other row, and there are n minus one of those after the first. So that gives you basically the n squared misses for x.

24:18 So now we have n squared misses for A, we have n squared for x, and y is not a problem, because we just use one element for an entire row; it sits there until we move to the next one. So that's how you get to two n squared plus n as the number of cache misses. And then you look at the ratio compared to the first model, and it's 0.5 plus 0.25 over n.
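The same quick check for the small cache count:

```python
def small_cache_miss_ratio(n):
    """Small cache model for y = A x: n^2 misses for A, n^2 for x
    (x is evicted before every reuse), n for y, out of 4 n^2 references."""
    return (2 * n * n + n) / (4 * n * n)

assert abs(small_cache_miss_ratio(10) - (0.5 + 0.25 / 10)) < 1e-12
```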

24:52 So then you try to figure out: what problem sizes can I run and still stay in the large cache regime, which is the one that performs best, compared to the other one? So then you basically, you know, plug things in and try to figure out how large the cache needs to be for me to stay in the large cache regime. Right?

25:16 If the replacement policy — the choice of what to replace — were very smart, it would know that I'm never going to touch my A element again, so I don't need to worry about keeping it. Then the only thing I need is to keep x in there, plus one space for an a element and space for y. So I only need space for x and a couple of other variables: that's n plus two. On the other hand, if you have LRU replacement, the A elements stay in the cache, while the x that you'd like to still be there at the end of the row gets pushed out. So you need space for one row of A and a complete x, plus the y element — and maybe, for safety, one more slot for, you know, plugging a new element in without messing things up. So that's why you need two n plus two, not n plus two: you need both one row of A and all of x, because of the policy used. So that means, if you look at when things change, it's basically when n is about half the size of the cache.

26:50 So this is what you see: if you start out in the large cache regime, the miss ratio goes down toward the asymptote. On the other hand, when n grows past the threshold — all of a sudden the cache is small relative to the problem size — then the cache miss ratio jumps up again. And that's how you get that curve. Okay, we'll see where we get.

27:23 Now, this one is the other loop order, which tends to be favored in some settings — and this depends upon how you store your matrix: one of the two orders is favorable to row-major storage and the other one to column-major storage. This is the saxpy formulation, because the i loop — the one that actually goes through a column of A — is the innermost loop. So from the innermost loop it goes down within a column, you know, straight down, and then the outermost loop scans across the columns. For certain architectures this is an ideal pattern. But now, for this example, we assume row-major storage order. So that means, when you go from — hmm, I'm ahead of myself; that comes later. If you look at this ordering with a cache line size of one, it ends up being the same miss ratio as the inner product. So forget about that — that's what I had in my head when I started to talk.

28:29 So now we're going to get to the line size. We're going to do the same thing for both the inner-product and the saxpy orderings, and figure out what happens when the cache line size is larger than one value. So, it's row-major order. That means, in the inner-product form, the innermost loop goes through a row, and in row-major order I get B elements of A for one miss. So the number of cache misses for A gets divided by the cache line size: it's n squared over B. The same good news holds for x and y — those are just vectors of consecutive elements, so I get B at a time, and there's no problem. So that means everything gets divided by B, and I think this pretty much makes sense.

29:23 So then you look at this model — the large cache model as in the previous case, and the small cache model — and now, for the number of misses, the calculation is the same as in case one, the inner product with cache line size one, except everything gets divided by B. And the point where things transition is, I think, what you would have guessed. The analysis of the LRU part is a little different, because again you fetch whole lines of B elements, but what you need to keep is still a whole row of A and the whole x — modulo things being divisible by B — so nothing much happens there: the transition is at the same point. But the misses, if you look at the numbers — yes, the miss ratio goes down, because you get B elements per line, so the basic counts are divided by B.
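A quick sketch of that effect (a toy formula; `B` is the line size in elements, and I assume B divides n evenly):

```python
def inner_product_misses(n, B, small_cache):
    """Miss count for the inner-product order, row-major storage,
    line size B: every count from the B = 1 case is divided by B."""
    a = n * n // B                                  # one miss per line of A
    x = (n * n // B) if small_cache else (n // B)   # x refetched per row if small
    y = n // B
    return a + x + y

n, B = 1024, 8
assert inner_product_misses(n, B, small_cache=False) == (n * n + 2 * n) // B
assert inner_product_misses(n, B, small_cache=True) == (2 * n * n + n) // B
```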

30:55 Now, for the saxpy version, things are a bit different, because things are again allocated in row-major order: you always fetch B elements of A, but you actually only wanted one of them, because the next element of A that you want is in the same column but a different row, and that's not in the same cache line. What is in the same cache line are elements in the same row of A, not elements in the same column. So the cache misses for A are still n squared. For x, it doesn't matter much — it's still a vector. But since you go down the columns, the same thing happens with y: you get B elements of y per line, so the number of cache misses per pass is not n but n over B — except you have to go back and repeat them, because when you're through with one column, the beginning of y may be gone.

32:10 So what happens now is that the transition happens for a much smaller n. Because what you need to keep around for reuse is a whole column's worth of cache lines of A: the number of cache lines of A you touch in one pass is n, the number of rows, and each of those lines holds B elements, not just one. So that's why you get the picture where things transition to the small cache model at a much smaller n value, and hence the other curves. So in this case — and that's the point — a cache line size of one would be much better than 64 or 128 bytes.
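The asymmetry between the two traversal directions shows up even in a tiny fully associative LRU model (a sketch with made-up sizes — 8 lines of 8 words against a 32-by-32 row-major array — not any real processor):

```python
def count_misses(addresses, num_lines, line_size):
    """Fully associative LRU cache model: count misses for an address trace."""
    resident = []                    # line tags, least recently used first
    misses = 0
    for addr in addresses:
        tag = addr // line_size
        if tag in resident:
            resident.remove(tag)     # hit: re-append below as most recent
        else:
            misses += 1
            if len(resident) == num_lines:
                resident.pop(0)      # evict the least recently used line
        resident.append(tag)
    return misses

n = 32                               # n x n row-major matrix, words 0 .. n*n-1
by_rows = [i * n + j for i in range(n) for j in range(n)]
by_cols = [i * n + j for j in range(n) for i in range(n)]

assert count_misses(by_rows, 8, 8) == n * n // 8   # one miss per 8-word line
assert count_misses(by_cols, 8, 8) == n * n        # every single access misses
```

Along the rows, each fetched line is paid for once; down the columns, the line fetched for one element is evicted long before its neighbors are wanted.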

33:03 So what else do people do in real life? Instead of just doing the naive looping, they do blocked algorithms. And that's, I think, what you were asked to do in one of the matrix-multiply exercises. So instead of doing element-wise operations, you try to fit a block of the matrix in cache. It has some marginal benefit for matrix-vector, but it has a significant impact in the matrix-matrix multiplication case. In this case, the idea is that you take a small, capital-B by capital-B block of the matrix and do one such block at a time. And if you look at it, you need two times B elements — of x and y — in the cache at the same time to take care of doing one block of the matrix. So in this case you get some benefit. And there are two more slides, and I think I can still get through them.
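A blocked matrix-vector multiply can be sketched like this (a toy version; the block size `BS` is a stand-in for whatever keeps the active pieces of A, x, and y resident in cache):

```python
def blocked_matvec(A, x, BS):
    """y = A @ x computed block by block, so each BS-long slice of x
    is reused against BS rows before the next slice is brought in."""
    n = len(A)
    y = [0.0] * n
    for ii in range(0, n, BS):             # block of rows
        for jj in range(0, n, BS):         # block of columns
            for i in range(ii, min(ii + BS, n)):
                for j in range(jj, min(jj + BS, n)):
                    y[i] += A[i][j] * x[j]
    return y

assert blocked_matvec([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], BS=1) == [3.0, 7.0]
```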

34:18 And this is a thing that is, um, important — it's actually used in a number of settings that go way beyond dealing with vectors and matrices. So once you have the blocks, you still need to decide in which order you want to traverse the blocks, right? You can do row-major order: you go through, you know, a block row, block by block within the row. Or you can go down in the column, block by block. Or you can try to do some other orders. What is in the upper right corner — there it's twisted around a little bit, but that order — is what's known as Morton ordering. You can see the Z's in it, and some fellow by the name of Morton got his name attached to this ordering. So, has anyone heard about Morton ordering? No. Has anyone heard about space-filling curves? No?

35:35 So I'll come back to that when I talk later about partitioning data. One technique that is being used is these space-filling curves: you try to walk across your entire data structure, and the idea is, just as if I were a traveling salesman, I visit every place, and each place only once, and visit all of them. In effect, it converts a multidimensional problem into a one-dimensional problem, and then you have a traversal in memory that supposedly will be good in some sense. So it's used for doing load balancing in clusters, and it is used for trying to figure out how to make the memory hierarchy work well. And the Morton ordering, as you can see from the Z line, is a recursive ordering, so you can apply it recursively from the coarsest down to the finest level. And some of the compilers, in fact, make use of it: when they do blocking for you, they do blocking, and then they do another level of blocking, and maybe more. And the Morton ordering, being recursive, fits that — because, and I'll talk about this a little bit more, recursive algorithms are claimed to be intrinsically good for caches. In real life, it turns out not to be quite true. So watch out: the compiler guys got very excited about this about 15 or 20 years ago, and they built compilers doing it, and then when we tried them out — hmm, why doesn't this work? I'll say a bit more later about why it doesn't work in all cases, and why, as always, you have to be careful where you use it. But this space-filling curve concept, if it's something new for those of you in computational science, is something you may want to know; it may be useful in some cases.
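The Morton (Z-order) position of a block is just the bit-interleaving of its row and column indices — a small sketch:

```python
def morton_index(row, col):
    """Interleave the bits of (row, col) to get the Z-order position."""
    z = 0
    for bit in range(16):                         # enough for 16-bit coordinates
        z |= (col >> bit & 1) << (2 * bit)        # even bits come from col
        z |= (row >> bit & 1) << (2 * bit + 1)    # odd bits come from row
    return z

# Traversing a 4x4 grid of blocks in Morton order traces out the Z pattern:
order = sorted(((r, c) for r in range(4) for c in range(4)),
               key=lambda rc: morton_index(*rc))
assert order[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)]
```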

37:55 This just shows a little bit of what you can get if you do the blocking for the matrix-matrix multiply, as a function of cache sizes and other parameters. So I think that's pretty much it. I just wanted to show strip mining as well: that is what the compiler guys talk about when they partition a loop into sections that fit some block size or some vector length and so on. Say the loop runs over a bunch of elements, and you want to use, you know, vectors of length 32, but you have a hundred thousand iterations. You still want to use the vector units, so you turn it into an inner loop that uses 32 elements at a time. And that's what they call strip mining and tiling, which I talked about in terms of the GPUs.
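Strip mining in its simplest form splits one long loop into an outer loop over strips and an inner loop of fixed length (here `VL = 32` is an assumed vector length, not any particular machine's):

```python
VL = 32  # assumed strip / vector length

def summed_in_strips(data):
    """Process a long array strip by strip, as a strip-mined loop would."""
    total = 0.0
    for start in range(0, len(data), VL):    # outer loop over strips
        strip = data[start:start + VL]       # inner loop sees <= VL elements
        total += sum(strip)                  # stand-in for a vectorized kernel
    return total

assert summed_in_strips(list(range(100))) == sum(range(100))
```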

38:54 So that was it for today — our time is up. So next time I'll talk about the matrix-matrix multiplication, and then I'll give a little bit of an intro to how clusters are put together, and move on to kind of aggregates of nodes, um, as the basis for talking about message passing — MPI for short. That will be the assignment after the GPU assignment we're talking about. But I thought understanding a little bit of cache behaviors, and how to structure algorithms for memory — in our case still node-based so far, not in terms of cluster-based partitioned memory or distributed memory, just local memory making use of the cache hierarchy — I'll try to do that now and finish up next time, and then talk about clusters and how they're put together, basically very quickly, so you have a mental picture, if you don't already have one, of why message passing makes sense. Okay, all right.

40:07 So, last time we talked about the matrix-vector multiply, and, um, I had shown a little bit the impact of loop order in terms of what one could expect in terms of performance. And I showed the thing that just gives you the wide range of performance on some older machines, depending upon what you do to something simple. In that case, it wasn't matrix-vector, it was matrix-matrix, where you have a little bit more degrees of freedom and opportunities. And the whole story again is about the memory system, and I want to make sure you understand, or remember — it should get sort of really ingrained into your heads — how the memory system works, and then how your code actually uses the memory system. And, um, to me, it's not just the qualitative aspect — we know the levels are kind of different — it's actually the quantitative aspect that you need to worry about, too. So understanding the metrics is really important for what I'm talking about. This is just a reminder about the huge difference in terms of the collective capabilities of the cores on a CPU versus the memory channels associated with the CPU. It's typically more than an order of magnitude.

41:43 So that means, um, if you don't have locality of reference in the application to start with, then there's kind of no hope of getting anything close to the peak performance in terms of what the arithmetic and logic units can do, because the code will just wait for memory. Um, if you do have data reuse, then, depending upon how you managed it when you wrote your code, you may or may not get the ability to fully benefit from the cache. Because, again, if the code mostly sees the caches, it runs at cache speed — functional-unit speed, kind of. If it sees the main memory, it's another, you know, order of magnitude, or worse, off in terms of the performance.

42:34 So today's lecture is basically trying to figure out, you know, when you have a matrix multiply, what can you do. And in part, the assignment that you did with OpenMP and matrix multiply — the problem mostly comes out of the frustration you maybe experienced. Um, I wasn't really intent on getting you all frustrated, but the MKL that was the reference case is a well-optimized library, so it gives you a reference as to what can be done. And there is no terribly clever magic being used — it's kind of very basic insight into how the memory system is used, and I'll tell you exactly what kind of techniques are being used in MKL, and so why they get good performance. And the other thing that, I guess hopefully, you noticed — or it has been a good experience, probably — is that, yes, you can port sequential code to make it parallel by using, um, OpenMP, and it's not too hard to do that. It may compile and run, but it may kind of not run very well. So, um, it's all pretty much about how the memory system and the data accesses give you the difference in performance. So, sorry, but sometimes lessons are hard to learn.

44:12 So, just a reminder about the two kinds of misses we talked about last time: the compulsory and capacity misses. And we kind of ignored the ones that are a bit more complex — well, not so much complex, but I just reminded you about associativity and small caches. Today we'll just do the simple things: look at compulsory and capacity misses, and how they are affected by what is called the stack distance. This is the one thing that I think everybody should try to remember, since it is one of the key issues in trying to understand the behavior of your code: to try to figure out the reference pattern and how it relates to your caches.

44:56 your cash is, um, so the large cash model is pretty

45:05 Everything fits in cash, or you expect too many problems. That's kind

45:08 easy. Um, and the small is what I'm kind of tired will

45:14 on today and try to understand what when the cash is relatively small compared

45:20 to your problem at hand. For those of you who do, let's say,

45:29 finite element codes, this is what I care about. For those, if you try

45:39 to represent the operator, then as a matrix it would be a sparse

45:44 matrix. But the matrix then has block structure. So even if you don't

45:49 store the zeros, you still have kind of a large number of matrices that

45:56 are quite small, in a sense, because it depends on the number of degrees of freedom in each node and

46:01 things like that. But they're modest in size, so each one of those may

46:06 very well fit in, certainly, the level-two cache, if not level one. So understanding

46:12 a little bit of the large cache model may be beneficial if you try to figure out how

46:17 to deal with the block structure in some of these sparse matrix cases, and

46:22 that's what we played around with: matrix-vector with the different loop orders

46:25 and tiling and stuff, and we'll try to do the same now for the

46:30 matrix-matrix multiplication. And this is pretty much what it says:

46:37 matrix-matrix multiplication is not much different than a collection of matrix-vector

46:44 multiplications. So if you have A times B, then B is, as in the

46:49 matrix-vector case, a classic collection of vectors. So for each column of B, you

46:54 do a matrix-vector multiplication. So the two are, kind of at the first order,

47:02 very close. It's just one more loop wrapped around to take care of a

47:05 collection of vectors. So the canonical way people tend to write

47:11 matrix-matrix multiply is so: three nested loops; the innermost loop does an

47:19 inner product: it takes one row of A times a column of B, nothing

47:25 else. Then, what this particular code does: it goes through and,

47:30 that means, it computes one inner product and accumulates the elements of the

47:36 sum into the same element of C. And when it's done with an element

47:41 of C, then it goes on to do the next element of C. So

47:45 that's what it does. Now, as this slide says, the objects of the

47:50 computation are basically n-squared things. So the data is, in total, three n

47:57 squared.

48:02 On the other hand, if you count the operations, it is basically two n

48:08 cubed: n cubed for the multiplies and n cubed for the adds. And sometimes people use a fused multiply-add

48:13 count, so they just count an add and a multiply as one. But we do it at the

48:19 individual operation level, so there are basically two n cubed. So in this case,

48:19 unlike the matrix-vector multiplication, where you had two n squared operations on n squared

48:24 data, so in principle every data item is used only a constant number of

48:31 times, in this case you have a much better opportunity for getting

48:36 data reuse than you have for matrix-vector. For matrix-vector, each

48:41 element of the matrix A, that is each a_ij, is only used once. Each element

48:46 of x in y = A times x is used n times, once for every

48:52 row of A, so you can get reuse of x, but you can't get reuse of A

48:58 in that case. So if you look at it, it's basically, in that case, n

49:04 squared data plus two n for the vectors, and two n squared operations. So

49:11 basically, on average, two operations per data item. In the matrix-matrix case it is instead of

49:15 order n. So the trick is: how do you benefit from the fact that

49:18 you can get data reuse, and how do you structure your code so data stays in

49:24 cache as much as possible. And then you can see the cache performance improve.

49:30 But this is pretty much all I already said; this is where we

49:38 want to spend the time, right? So this is back to the same slide

49:43 that I think we saw earlier. So let's look at what happens in this

49:48 code again, a little bit more in detail. So if you look at it, the innermost loop

49:51 is an inner product. So what does that mean? You go through a row

49:54 of A and a column of B, and then you move on to the next

49:59 element of C in the same row of C, and you go back and use the

50:07 same row of A, sorry, with another column of B to get the next element of

50:15 C. So, in fact, you're kind of doing a vector-matrix multiplication, as opposed to a

50:21 matrix-vector, because you use the same row of A for the different columns of B.

50:27 So it becomes a row vector times a matrix, as opposed to a matrix times a column

50:33 vector. So this is what this thing does. And then the outer loop just goes

50:36 through a bunch of these vector-matrix multiplications. I think this is kind of

50:46 explained on this one. So now we are kind of trying to figure out how

50:51 these things work. And so, for each row of A, we read

51:02 the whole matrix B. So that's n squared. And then we do it for

51:06 all the different rows. So it is n times n squared reads

51:12 of B, that is, n cubed reads of B. A you only read once: you go

51:18 through a whole row of A and then you move on to the next row of A, so A is

51:22 only seen, or loaded, once. So that's n squared loads for A. And for C,

51:28 you have to, in this case, get the initial C from memory, and

51:32 then you write it back. But you only load and store C once

51:37 each. So that's the two n squared memory references for C. So you add

51:43 it all up, and then you get n cubed plus three n squared memory

51:48 references. So now, if you look at the ratio between the number of operations

51:54 and the number of memory references, you'll see what you get. It is basically two n

52:00 cubed operations, and then the n cubed plus three n squared data references.
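The triple loop and the counts just described can be sketched in code; this is an illustrative Python version (the lecture itself shows slides, not code), with the operation and memory-reference counts from the discussion expressed as a small helper:

```python
def matmul_ijk(A, B, n):
    """Naive three-nested-loop multiply: C[i][j] accumulates the inner
    product of row i of A with column j of B, one element of C at a time."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):          # innermost loop: one inner product
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def op_and_ref_counts(n):
    """Counts as derived above: 2n^3 flops; B is re-read for every row of A
    (n^3), A is read once (n^2), C is loaded and stored once (2n^2)."""
    flops = 2 * n**3
    refs = n**3 + 3 * n**2
    return flops, refs, flops / refs    # ratio tends to 2 for large n
```

For large n the ratio of operations to references approaches two, which is why this loop order by itself exposes essentially no reuse.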

52:05 And it is no surprise, because the two innermost loops are exactly the

52:13 vector-matrix multiply, exactly that operation, and then there's just another loop around

52:18 it. So no surprise that you don't exploit any potential for data reuse, because you

52:24 just do vector-matrix multiplication a bunch of times. On the other hand, as we

52:30 saw on the previous slide, in principle one should be able to engineer something where

52:36 you can get much better data reuse; there's a potential for it, and,

52:43 in principle, you can reuse some data. So this is this slide, and again, this quotient

52:50 between operations and memory references is called arithmetic intensity, and I talked about

52:57 that a little bit back in, I guess, lecture three. So there is

53:02 this roofline model that people use quite frequently to look at when things are

53:08 bandwidth limited and when they are basically arithmetic-logic-unit, that is, compute,

53:14 limited. And for that, one uses this quotient of operations and memory references. Now, so

53:24 let's see. But you know, when we did the matrix-vector thing, we played

53:28 around with different loop orders. There were two loops in that case. Now we

53:32 have three. And for the matrix-vector case, the performance was quite different;

53:38 in particular, when the cache holds only some number of elements, then it

53:45 is the case that the performance depends upon the loop order. And so you should expect the same

53:48 to happen with matrix-matrix. So now I'm going to try to put you

53:55 to work a little bit. So here is the i-j-k loop order.

54:00 Typically, when people talk about matrix multiplication, since it is always, you

54:06 know, two n cubed operations, people tend to, I would say, call each loop order an

54:12 algorithm. I don't know; they only differ through loop order, and algorithmically, I

54:17 wouldn't say there's much difference. So they tend to refer to things by the

54:22 index order you have, in this case i, j, k, going from the

54:27 outermost loop to the innermost. So this first row of the slide says,

54:34 okay, the two loop orders that interchange the i and j loops but

54:42 keep k as the innermost. So one can then try to figure out

54:49 what the data locality of the reference pattern is when k is the innermost

54:56 loop, regardless of what else happens. So, in a sense, we focus on the innermost

55:00 loop; that happens all the time, and it doesn't matter too much what the order of

55:05 the other two loops is. So what kind of locality do we have for

55:13 A, B, and C in this case? How about A: what's its locality?

55:29 When you go from one k to the next, what happens? Right,

55:37 you don't reuse the same element of A for the next k, so there's no temporal

55:43 locality. So then it's a question: the other option is spatial locality.

55:48 Is there spatial locality for A? Yes, because it happens to be row-major

55:54 order, and you walk along a row of A. So yes, you get spatial locality for

55:58 A. How about B? No, because k marches down a column, and when

56:06 you have row-major order, that means no spatial locality. How about C,

56:15 right? Yes, because every iteration of the loop reuses the same C element

56:19 and just accumulates the result of the inner product, or the partial sum, in the

56:27 accumulator. So this is what you get. That's what it is. So

56:31 now, how about if you put i in the innermost loop instead? Then what happens?

56:39 How about A? Right: temporal locality, spatial locality? Nothing,

56:59 because you use a new element of A, and it's in a different row. So that

57:04 means there's neither spatial nor temporal locality. How about B? Well, what happens

57:26 to B when i changes? Exactly: you get temporal locality. And

57:40 now on to C. i is innermost, so you go down

57:45 a column. Yes, no spatial locality. So this is what happens. So the last case:

57:55 j is innermost. Correct: what happens with B when j increments? You stay in

58:17 the same row. Yes, so now we get spatial locality for B.

58:20 We get temporal for A. And how about C? Yes, spatial. Yes.
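The three innermost-index choices just analyzed can be written out side by side; this is an illustrative Python sketch (assuming row-major storage, as in the lecture), with the locality conclusions recorded as comments. Only the loop nesting differs; all three compute the same C:

```python
def matmul_k_inner(A, B, C, n):
    """k innermost (e.g. i-j-k order)."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # A[i][k]: spatial (walks along a row); B[k][j]: none
                # (walks down a column); C[i][j]: temporal (accumulator).
                C[i][j] += A[i][k] * B[k][j]

def matmul_i_inner(A, B, C, n):
    """i innermost (e.g. j-k-i order)."""
    for j in range(n):
        for k in range(n):
            for i in range(n):
                # A[i][k]: none (new row each step); B[k][j]: temporal
                # (fixed); C[i][j]: none (walks down a column).
                C[i][j] += A[i][k] * B[k][j]

def matmul_j_inner(A, B, C, n):
    """j innermost (e.g. i-k-j order) -- best locality of the three."""
    for i in range(n):
        for k in range(n):
            for j in range(n):
                # A[i][k]: temporal (fixed); B[k][j]: spatial (walks a
                # row); C[i][j]: spatial (walks a row).
                C[i][j] += A[i][k] * B[k][j]
```

Since the arithmetic is identical, any performance difference between these variants comes purely from the cache behavior of the access patterns.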

58:29 So that was it, I think. Let's see. So

58:37 that's kind of it; this is just a simple summary. It's probably a little bit out of

58:43 order, but this is, if you do this analysis in the large-cache scenario, where you

58:48 figure everything fits, what you can figure out in that case. So if you go back

58:55 to the slide, then it is what we have said. I just realized

58:58 the animations are out of order. So this is exactly the case we went through, and,

59:02 figuring out exactly what the number of misses is: because for B there

59:10 was, when you had k in the innermost loop, as you can see, right,

59:14 no locality for B, so you miss every time for B. And

59:18 there was spatial locality for A, so basically you only miss once for every

59:23 cache line's worth of elements of A. And C is temporal locality, just reuse, nothing. Now,

59:32 if you go back and look at the other two loop orders that I just showed,

59:35 one can do the same analysis for each of them. In that case, one

59:41 can just do the simplistic thing and assume n is very large; if it is

59:48 not, then the miss ratio becomes about 0.25, and for the next one it is basically

59:55 one. So that means it's higher with i innermost than it is for k

60:00 innermost. And then we had the last one, which was j in the

60:04 innermost loop, and in that case, if n is a reasonable number,

60:09 it beats the others. So, in that sense, one can, based on this

60:14 simple analysis of how data is accessed, figure out that the last

60:20 loop order ought to be the best one in terms of performance. Second will be the

60:25 k-innermost one, and the worst one will be with i innermost. So then these

60:31 guys were nice enough to run this code on a platform, and you can see

60:38 three very distinct behaviors in terms of miss ratios: the worst curves, where the miss ratio is around 0.5

60:45 and rises even higher in terms of the misses; and then, at the other end,

60:50 you see the purple one at the bottom. So you can see that, depending

60:54 on what you do, yes, it behaves as you expect.

61:02 And here is basically the loop order shown for the different curves.

61:05 And you can also see that it's not a nice, smooth behavior;

61:09 there are a lot of spikes. So, anyone want to guess

61:14 why the spikes? Right. So it has to do with the associativity

61:28 of the cache and the cache conflicts, what gets kicked out, because the cache is not fully associative;

61:33 that's where the spiking comes from. But the essential behavior qualitatively is not affected;

61:39 the curves are very much in line with what we predicted. So this

61:42 is, again, how a very simple analysis of your code allows you to get a

61:50 good idea of what loop order is good. An alternative, of course, is for the compiler

61:54 to do that for you, and on something simple enough, yes, a traditional

62:01 compiler can interchange loops and so on. So the compiler's analysis may very well end up

62:04 doing the right thing, but the safe way is to reason about it yourself and make

62:10 sure things are set up so you don't depend on the compiler to figure it out

62:13 for you. Any volunteers in terms of questions on these things? Yes, is

62:20 there an outlier here; is it a fluke of the measurement? Yes, exactly.

62:32 And then there is the horizontal part of the curves. So when the matrices are small,

62:39 basically they all fit in cache and you have no problem with memory, so it's better. And

62:44 then, again, as the matrices grow, less and less of the data fits, so the

62:50 miss ratio becomes higher and higher, as you would expect. But then

62:57 the cache is not protecting you anymore. Okay, and you can go

63:03 back and do the simple analysis as we did before. And you can figure out

63:10 exactly when you start to lose the large-cache behavior as a function of the size

63:15 of your matrices. So, this is what I guess I

63:27 had. So the one caveat about this is the fact that the optimal loop order,

63:42 as we have kind of been arguing and figured out,

63:48 depends on the shape of the matrices. So if the matrix is a

63:51 square, this is the right thing to do. But if you have rectangular matrices,

63:54 you may have different loop orders that are better. So it's not given that this

64:02 is always the right thing to do, the preferred loop order. And so the loop

64:06 order I just showed you for, sorry, I'm a little bit ahead

64:13 of myself; I'm getting on to the next thing anyway, so I'll come back

64:17 to that, since I'm ahead of myself. So, yes, I just put this

64:20 in as a reminder: way back, some lecture showed you get to play with

64:25 the codes in the assignment, with all the loop orders, and this is just as shown.

64:32 So this is again pretty much what I said there. One thing also: when we worked

64:42 with the matrix-vector multiplication, we looked at the tiled, the blocked, version.

64:47 And that is, in fact, far more important for the matrix-matrix

64:52 multiplication, because that's where you actually start to get some benefits from it.

64:59 So now, basically, the structure of the blocked code: what you

65:06 see in this loop order is kind of the same as the i-j-k order.

65:11 But the difference is: now, instead of individual elements, you work with

65:17 small sub-matrices; but otherwise everything is the same. So you can

65:22 think of your i-j-k algorithm, and just, instead of working on

65:27 individual elements, think of it as a block matrix, and then it's all the

65:32 same. So if you do that, then one can figure out, as

65:39 a function of the block size, where we assume capital B, in this case, rows

65:46 and columns for the blocks, so each sub-block or sub-matrix is a B by

65:52 B matrix; but then otherwise it's kind of the same, except that now,

65:58 instead of grabbing individual elements, you grab whole blocks. So if you

66:09 look through it: you basically do an inner product of a block row of A times a

66:17 block column of B as the kind of innermost thing. Then

66:24 that means you have, again, n over B blocks for

66:36 both A and B. So, for B, you have n over B blocks in a block column,

66:42 and each one is B squared elements, and then you go and do it for all

66:49 the different ones. Then you have n divided by B block columns of B.

66:54 So before you're done with one block row of A, you have read n

66:59 over B, squared, times B squared in terms of elements. So that's what is

67:07 inside the square-bracket expression. And then you have the number of block rows

67:13 of A that you need to go through, and there are n over B block rows

67:16 of A until you're done with A times B. So, in the

67:22 end, you basically get something that is n cubed divided by

67:29 B for the loads of B; and then you can do the same for A, because in

67:33 this case it becomes symmetric with respect to A and B. And then C: it's still the

67:38 same; it doesn't matter that you compute it block by block;

67:42 you only need to load C once and store it once, so you can ignore

67:48 the block sizes; they are immaterial for C. The loop order details may matter a little

67:53 in terms of hardware, mappings to cache lines and such details. But in terms of

67:57 the number of elements loaded, it is not affected by the blocking, and it's

68:03 still two n cubed operations. So you can see, if you do this blocked

68:09 matrix multiply, you have the option of keeping the B-squared blocks of A, B, and C

68:16 in the cache. Then you can get reuse that is basically proportional to the number

68:22 of rows or columns in the blocks you're using. So by using this blocked

68:28 approach, and keeping the blocks of a size where they fit in cache,
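The blocked scheme being described can be sketched as follows; this is an illustrative Python version (not the lecture's code), using the same i-j-k structure but with each index stepping over B-by-B sub-matrices:

```python
def matmul_blocked(A, B, n, bs):
    """Blocked (tiled) multiply: same i-j-k structure as the naive code,
    but each loop walks bs-by-bs sub-matrices. While one block of A, B,
    and C sits in cache, each of its elements is reused about bs times,
    so total traffic drops to roughly 2*n^3/bs + 2*n^2 elements."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):            # block row of C (and A)
        for j0 in range(0, n, bs):        # block column of C (and B)
            for k0 in range(0, n, bs):    # block inner product
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a = A[i][k]       # reused across the j loop
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The `min(...)` bounds just handle matrix sizes that are not a multiple of the block size; the counting argument above ignores those edge blocks.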

68:33 you get data reuse that you didn't get before. And, of course,

68:38 then the larger the block size is, the better it is for the reuse.

68:44 And then, I guess, the bottom line shows that, basically,

68:49 modulo a factor of a few and other details, essentially the three blocks of A,

68:54 B, and C all need to fit in the cache. So that means the block

68:59 size kind of depends upon how much cache you have, but it is proportional to

69:04 the square root of the cache size in terms of elements. And then you can

69:08 figure out, if you want something that is kind of balanced, so you want the

69:18 memory not to get in the way: that means you look at the number of

69:22 data loads and the time for each data load, and then you look at the

69:26 time it takes to do an individual floating-point operation, and how many of

69:31 them you do, and you look at the ratio. And then it turns out,

69:35 if you have this arithmetic intensity, you can figure out, you know, what block

69:41 size and cache size you need for this computation, in terms of matrix

69:46 multiply, not to be limited by the speed of memory that you have. So it's

69:52 a fairly simple thing, and then you can figure out also, then, whether it's

69:58 going to be, even if you have a cache, is it going to be

70:01 memory limited or not. You can figure it out from your data

70:06 sheet, because we have, from when we talked about the processor, that

70:10 it has a clock rate, and it has an instruction width that may be

70:18 either scalar or, in this case, vector-wide, which the compiler should be capable

70:23 of finding. So if you take that, and it can do 16 operations, say

70:28 multiplies, in a single instruction, and it runs at whatever clock rate on

70:34 however many cores, then you can basically plug it in here and figure out the time per

70:39 floating-point operation. And you also have the memory bandwidth that you get from the number

70:45 of memory channels and the speed of the memory that you have. So you can

70:49 figure out what the time per memory access is, and you can figure out what the time per flop

70:54 is. And you can then figure out whether, with the optimum block size, you are going

70:59 to be limited by the cache or by memory.
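The balance check sketched here can be put into a couple of lines; this is an illustrative Python sketch, and the specific numbers used below (100 Gflop/s peak, 25 GB/s bandwidth, 8-byte doubles) are made-up examples, not a particular machine:

```python
def min_reuse_for_compute_bound(peak_gflops, bandwidth_gbs, bytes_per_elem=8.0):
    """Back-of-the-envelope balance: the blocked multiply performs about
    B flops per element loaded, so to avoid being memory limited we need
    the reuse factor B to satisfy
        B >= (peak flop rate) * (bytes per element) / (memory bandwidth).
    Returns the minimum B (flops per loaded element)."""
    return (peak_gflops / bandwidth_gbs) * bytes_per_elem

def cache_needed_bytes(block, bytes_per_elem=8):
    """Three block-by-block tiles (one each of A, B, C) must fit at once."""
    return 3 * block * block * bytes_per_elem
```

For example, a hypothetical 100 Gflop/s machine with 25 GB/s of bandwidth would need a reuse factor of at least 32, and three 32-by-32 double-precision blocks occupy 24 KiB, which would have to fit in the cache level being targeted.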

71:05 Which is just to say, I think it comes up on some slide here. So yeah, I guess this

71:11 slide makes a comment: it turns out, for those who are kind of theoretically inclined, that

71:16 for this particular scheme one can also prove it is the optimal scheme to do things, in

71:22 terms of basically getting the maximum reuse out of each level in the memory

71:28 hierarchy. So you apply it to level one, you apply it to the level

71:32 two, and you apply it to the level three and whatever else you have; and

71:37 if you have disk beyond your main memory, then you apply it, you know, to the

71:42 main memory and to the disk and so on. So it applies to every level of

71:47 the memory hierarchy, which is kind of what the next slide says. So that means

71:53 that, you know, everything you got taught in software engineering goes out the window, and

71:59 you're going to mess up your code by making it much less intuitive and harder

72:04 to maintain, because you break things up. You essentially do block sizes for every level of

72:11 the memory hierarchy: instead of your three nested loops,

72:15 you know, you get three times the number of memory levels you have in the

72:20 system. And if you're really serious, then the innermost level is not the

72:27 level-one cache, but it's actually the registers. So that's what, you know,

72:34 people like the MKL guys do, and what I do. And they get down to

72:41 register tiles, and there, in most cases, things are kind of fixed,

72:44 because the number of registers tends to be 32 or something on most processors, so you

72:51 get tiles of about four by four. That's what you do. So pretty

73:00 much, this is, you know, what I said: what MKL does, playing

73:04 with the loop orders, is put in place the blocking for the various levels in the

73:07 memory hierarchy; that is the way they get good performance. Now, that's obviously painful.

73:18 So some people want to automate the process. These days, I guess, you can try

73:24 to be clever about it; these days they use machine learning to figure it

73:30 out. In the old days, you did it the brute-force way and just tried all

73:34 the different options. The brute-force way is what's done by the

73:40 so-called ATLAS library that you can download from netlib. And what it does is

73:46 basically write all kinds of blockings, loop partitionings, and loop

73:53 interchanges, then generate the code for all these different variants, and then just try them

73:58 out on your platform and figure out which one is the winner.
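The brute-force idea can be sketched in a few lines; this is only a toy analogue of what ATLAS does (ATLAS generates and times many full kernel variants at install time), with a hypothetical `autotune_block_size` helper that times candidate blockings and keeps the winner:

```python
import time

def autotune_block_size(blocked_matmul, A, B, n, candidates):
    """Brute-force tuning in the spirit of ATLAS: time each candidate
    block size on this machine and return the fastest one. The cost is
    paid once, up front, and only pays off if the tuned kernel is then
    reused many times."""
    best_bs, best_time = None, None
    for bs in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, bs)           # run one timed trial
        elapsed = time.perf_counter() - t0
        if best_time is None or elapsed < best_time:
            best_bs, best_time = bs, elapsed
    return best_bs
```

A real autotuner would of course repeat and average the timings, and vary loop orders and unrolling as well as block sizes, but the search structure is the same.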

74:02 Of course, that means that if you just do one matrix multiply, you spend a lot more time trying to

74:07 figure out which one is the fastest than doing the multiply itself. So

74:11 it doesn't really pay unless you repeat the same operation many times. But it does give

74:18 insight into the range of performance you can get as a function of partitioning and loop

74:27 orders. Well, this is pretty much the summary. One more thing I want to mention:

74:38 as I mentioned, I think last time, when I talked about the

74:43 tiled matrix-vector thing, I talked about this notion of space-filling curves and other ideas.

74:52 So some people say, again, this blocking is a pain. So how can you

75:00 kind of do something that does it automatically? Or, in addition to

75:09 the brute force, is there another way? Then it turns out, in principle, that

75:16 you can do it with recursive algorithms instead of the iterative things that I've shown all the

75:22 time. So if you want to do so, it's kind of like a

75:29 block matrix at the high level. So suppose you have your n by n

75:35 matrices; then you partition each into four blocks, basically n over two by n over two

75:40 each one, and then you multiply these blocks pairwise. So now, at the next

75:43 branch, you have smaller instances, and then you recursively partition them and do smaller

75:49 and smaller matrix multiplies, and eventually you have something reduced to the element level. So you

75:55 recursively partition the matrices, maybe down to two-by-two block matrix

76:01 multiplies. So that's what this presumably is. So that means, when you get down to the

76:10 element level, the blocks definitely fit in cache, and they stay in cache as you kind of go

76:15 up to larger and larger block sizes. And eventually you run out of

76:20 cache, run out of blocks that fit in cache, and then, depending on which order you

76:24 process the blocks, you can use these space-filling curves to be clever and try

76:28 to preserve more and more of the locality. So, yes.
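The recursive scheme just described can be sketched as follows; this is an illustrative Python version (assuming, for simplicity, that n is a power of two), where each level splits the problem into eight half-size multiplies over the four quadrants:

```python
def rmatmul(A, B, C, i0, j0, k0, n):
    """Recursive (cache-oblivious style) multiply: accumulates A*B into C
    on the n-by-n sub-problem at row offset i0, column offset j0, inner
    offset k0. Because every block size occurs at some recursion level,
    the working set automatically fits each cache level at some depth,
    with no machine-specific block-size parameter."""
    if n == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
        return
    h = n // 2
    for di in (0, h):               # quadrant row of C
        for dj in (0, h):           # quadrant column of C
            for dk in (0, h):       # the two products summed into C_ij
                rmatmul(A, B, C, i0 + di, j0 + dj, k0 + dk, h)
```

The order in which the eight sub-calls are visited is exactly where a space-filling (e.g. Morton/Z-order) traversal can be plugged in to keep consecutive sub-problems sharing operands.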

76:38 So this is what it is. People have tried this, as I said, and, unfortunately, in

76:47 practice, this kind of automatic procedure hasn't been able to win. So the

76:55 iterative ones are still what's being used in the libraries out there, if

77:01 you want to do something that makes maximum use of the hardware. So in this

77:05 case, on one platform, I think they got about two-thirds, and on another one

77:09 they got to about half the peak performance using the recursive approach. And that

77:13 has to do with a lot of how the memory system actually works, and how the

77:17 cache lines work, and a lot of things underneath. And this is just a reminder of

77:24 some of these references that, you know, I put in there. So in

77:30 case anyone is interested: the first one is Charles Leiserson at MIT, and his

77:39 student was one of the early ones, if not the earliest, that,

77:46 you know, came to think about recursive matrix multiplication, being kind of a

77:50 theoretician, and proved a bunch of its properties. And he

77:55 has very nice lectures. And if you are into this, he lectures

78:01 on and on about the recursive approach, and it's great. But in

78:08 practice it is kind of inferior, and you kind of lose. So that's what I

78:17 had on matrix multiplication.
