© Distribution of this video is restricted by its owner
00:01 So now, coming back to my comments: a little bit of a reminder in terms of trying to understand the performance of GPUs. Some of this is specific to what's called Pascal, which is a few generations old among the GPUs, but it is the GPU on Bridges, so it's relevant for our purpose. Some of the details on other GPUs may be a little different. Just remember this kind of hierarchical clustering of things: there is what is called the GPC, the graphics processing cluster — big chunks, and there were six of them. I don't remember all the pieces; it's on the slides anyway. Each of them has ten of these streaming multiprocessors, and then each one of those has a bunch of these CUDA cores in it.

01:12 And I'm using the CUDA terminology, basically, which is warps and thread blocks, instead of vectors. The native width that was around for a long time was just 32. Then they got more silicon, so to speak, or they got finer feature sizes, so at some point — I think Pascal — they got to 64 cores per streaming multiprocessor. But they still have this 32-wide grouping of threads — SIMD threads grouped into a sort of vector, the warp — so that's still there. And then they group warps into what they call thread blocks, and they manage the blocks independently of each other.

01:59 And here's a little bit of what that looks like — more or less what I said. Again, if you take the GPU seriously, and maybe you have a chance to do it for the assignment, play around with it and see how this kind of structure gets reflected in the performance. You see the structure of what's happening underneath.

02:24 And back to the thing that I have tried to stress all through the course: memory hierarchies. In this case — yes, it is Kepler, which was the generation prior to Pascal in terms of GPUs — you see what the memory latencies are in terms of where things hit: whether it's in the shared memory within the streaming multiprocessor, or the unit that is shared between the different units in a streaming multiprocessor, or whether you have to go outside and go to your device memory. You see a little bit what the difference is. So even for GPUs, going to the device memory is quite expensive compared to staying on the silicon close to the streaming multiprocessor.

03:30 For that, they have what is known as memory coalescing. The way they have designed their memory controller, as it seems to me, is that they can have one outstanding request, or two, to the memory at the same time. Otherwise, things get serialized. So if one of the SIMD threads that gets grouped up into a warp makes a request to memory for some data, then there are 32 requests issued, because it's SIMD — all at the same time, synchronously. So 32 requests hit the controller at the same time, and if the controller can handle, say, two requests, that means it takes 16 rounds for the memory to serve these 32 guys. That's no good. So that's why they do what they call coalescing: trying to merge these requests into no more than two. So in this case, in the example here where the requests are eight bytes each, the eight requests shown then turn into one request for 64 bytes. And I claim that's a good thing.

04:49 And the way I interpret the slide — yes, I can claim that this is a good thing based on what NVIDIA already wrote down. If you go back and look at the architecture that they use, there are 384 bits, or wires, to their memory. So that is 384, and the memory is GDDR — the first D is "double data rate." So per clock period of the memory, you get two rounds of 384 bits. And 64 bytes is 512 bits. So, fine: with two rounds in one memory cycle, they can serve the 512 bits. So that's why I think they use this particular 64-byte granularity.
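The bus arithmetic behind that claim can be sketched in a few lines (the figures — warp width 32, 8-byte accesses, 64-byte transactions, a 384-bit double-data-rate interface — are the ones quoted above, assumed rather than checked against a datasheet):

```python
warp_size = 32          # threads issuing one SIMD memory access together
bytes_per_thread = 8    # the 8-byte requests in the slide's example
transaction = 64        # coalesced transaction size in bytes

# Uncoalesced: 32 separate requests at ~2 outstanding per controller
# means 16 serialized rounds to the memory.
assert warp_size / 2 == 16

# Coalesced: 8 neighboring 8-byte requests merge into one 64-byte
# transaction, so the whole warp needs only 4 transactions.
assert transaction // bytes_per_thread == 8
assert warp_size * bytes_per_thread // transaction == 4

# A 384-bit double-data-rate interface delivers 2 * 384 = 768 bits per
# memory clock: enough to cover one 512-bit (64-byte) transaction.
bus_bits, ddr_factor = 384, 2
assert bus_bits * ddr_factor >= transaction * 8
```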

05:40 But it's again coming back to what I tried to stress before, and then repeated when I talked a little bit more about GPUs: understand the hardware architecture, understand the data paths in the system, in order to figure out how many rounds it takes to get things in and out. When you try to interpret how come things behave the way they do, go back and look: how wide are the data paths, how much data was I trying to transfer, and how does it get transferred? Then you can get an idea of what it takes.

06:17 That's what I was going to say about OpenACC. And again, it's a very small portion of what you can do — just an intro to give you a flavor of what is there. The slides have URLs for you to find things: more information in terms of tutorials and the current specification, other things. So for interest beyond doing the assignment, you know, go and look at the URLs. Any questions? No? Okay, I think so.

07:06 So now a little bit of — I wouldn't say algorithms, but it's kind of about algorithms. Coming back to the memory hierarchy again, and figuring out the impact of code, and trying to analyze a little bit what to expect in some very simple cases. So today it's basically simple things: how code acts in terms of caches.

07:34 So here is this kind of plot, from a researcher — one of the compiler guys, now at UT Austin. He did this work quite some time ago, and so the data is old, but qualitatively it is still highly relevant. In this case, don't worry about all the different versions that they tried out to make things run well. The only take-home message for this class is basically this: the things at the very bottom are just barely above zero, and if you do something really clever, you can get something that, in this case, is probably about 2000 times better performing.

08:26 And it's not about being clever in terms of the algorithm. You people have probably heard about Strassen's algorithm, which does matrix multiplication in n to the 2.7 or so operations instead of n cubed. That's not the game here. It's just playing around with your three nested loops — that's it. And that can have a dramatic impact on performance.

08:55 So we're not going to do all the different versions, but basically: start with a matrix-vector multiplication and figure out how you can understand what's happening with caches, depending upon how you write the loops for your matrix-vector multiplication.

09:14 So here's a typical way you'd write a matrix-vector multiplication: you do a collection of inner products, one per row of the matrix. And then you look at the number of memory references. We're trying to understand the memory performance again; we're not trying to do anything about the algorithm itself. In this case, it says four n squared memory references. What that means is: in the inner statement, you need to read an element of x, then you need to read a, and then you need to read y — that's three — and you need to write y, so that's four. And if nothing good is happening, you need to do that for as many statement instances as are being executed, and there are n squared of them. So there are four references in the statement, and n squared executions of the statement: that's four n squared memory references that you need to do. Hopefully, if you have caches, you ought to be able to do better. That's what we're going to figure out: how much better it can be.

10:30 And it says here — the naive thing here is just the point that, you know, this is the total count. Yes, any decent compiler will not read and write y back to memory every iteration. And if your compiler is not that great and misses this, you can fix it by introducing a temporary yourself, and then it doesn't get confused. But we'll talk about that a bit later.
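As a sanity check on that count, here is a toy sketch (my own names, not the lecture's code) that runs the naive inner-product loop nest and tallies the four references made by each execution of the inner statement:

```python
def matvec_reference_count(n):
    """Naive y = A @ x, counting 4 memory references per inner statement:
    read A[i][j], read x[j], read y[i], write y[i]."""
    A = [[1.0] * n for _ in range(n)]
    x = [1.0] * n
    y = [0.0] * n
    refs = 0
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]   # the statement executed n^2 times
            refs += 4                # its 4 memory references
    return refs

assert matvec_reference_count(100) == 4 * 100 * 100
```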

11:01 So I guess maybe I should slow down a little bit here. How many of you have had a course talking about caches and how they work? One, two, three... okay, alright. So I showed you pictures in lecture four: you know, level one, level two, level three, maybe even a level four — a hierarchy of things. And the cache closest to the functional units tends to operate at the same clock rate as the functional units, so it can transfer every cycle. Typical designs — for the case we're driven by, scientific and engineering stuff — do a lot of matrix-vector, matrix-matrix, and similar operations. So that means, like the vector example, you tend to have two or three inputs: in this case a and x, which is two loads, and then you have y, which eventually needs to be stored. So the typical design is for two concurrent loads and one store: the level-one cache can ship two 64-bit arguments per cycle to the functional units, and it can take one result back into the cache from the register file. So they tend to be designed for two loads and one store per cycle. Different vendors do it a little bit differently, but that's one typical setup.

12:39 Then, once you move to level two and level three, typically they don't have the same capability as level one, so things may run a little bit slower — they may run at half the clock, or a quarter of the clock. So even though the path may be about the same width, the data transfer rate is not the same.

12:59 Now, there is more. I think I also mentioned in those slides that most caches today are set-associative caches. The other versions are direct-mapped and fully associative. Direct-mapped means that every memory location in the main memory has its own designated place in the cache; it can only go to one place, and if that place is already taken, then somebody needs to decide what to do. A fully associative cache is the other extreme: a memory location can go anywhere in the cache — anything that is free, just go there and grab it. And the practical thing, in order not to have to search for where the free places are, is that you just look up a few places and say, okay, fine, choose between four, eight, or sixteen places. That's considerably less complex than looking at all of them. So most caches are 4-way, 8-way, or 16-way set-associative.

14:15 . When things are already occupied, do you do so well, do

14:20 just sit around and wait until somebody that that's not needed anymore? Or

14:26 you then decided to say, fine. I'm going to kick

14:33 If it's never been updated, you necessarily need to kick it up.

14:36 that cash line have been updated, need to write it back somewhere.

14:42 then they're different policies for the cash as to which one's gets kicked

14:50 and the most common one is, at least recently used. So that

14:57 the things that has been sitting untouched the cash for the longest time gets

15:03 out, or and then over then there's a process. So when

15:08 get this cash, miss to write back to memory and then load whatever

15:13 want into memory. So L You policy is what's being used in

15:18 here, and that's something Thio again attention to on most of these

15:29 At least they are on the slides telling which processors use what kind of

15:33 four walking, Um um, so the cash vocabulary, then, depending
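For intuition, a fully associative LRU cache can be modeled in a few lines (a teaching sketch with made-up capacity, not how hardware implements it):

```python
from collections import OrderedDict

def lru_access(cache, capacity, line, stats):
    """Model one access to a fully associative LRU cache.
    `cache` keeps line tags ordered least-recently-used first."""
    if line in cache:
        cache.move_to_end(line)          # hit: refresh recency
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        if len(cache) == capacity:
            cache.popitem(last=False)    # evict least recently used line
        cache[line] = True

cache, stats = OrderedDict(), {"hits": 0, "misses": 0}
for line in [1, 2, 3, 1, 4, 2]:          # capacity 3: line 2 is evicted
    lru_access(cache, 3, line, stats)     # before its reuse
```

On this trace only the second touch of line 1 hits; with capacity 3, line 4 pushes out line 2 just before line 2 is needed again.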

15:43 So, the cache vocabulary, then, depending upon what happens when you try to load — which is to say, read — something. One thing is what is known as compulsory misses. That means it's never been in the cache; it needs to migrate from main memory to the functional units, so there is no option, you have to get it in there. That's what the notion of compulsory misses is. The other kind of simple case is that the cache is full — because it's a lot smaller than the main memory, at some point it gets filled up. And then, when it's full, you get capacity misses: misses because it's full, okay. And then there's the thing that has to deal with the associativity, in particular when places are occupied: those are conflict misses. So today we'll do the first two cases, which are relatively simple and clean to analyze. It gets more complicated when you look at the conflict misses.

16:49 So the obvious thing is that a large cache is good. And for the analysis, I'm going to use two models. One says the problem size is small enough that the whole problem, so to speak, fits in the cache: you have a small matrix and a small vector, and it all fits in cache. The cache is kind of large relative to the problem you have. So that's the notion of a large cache I'm going to use. The other one is a small cache — that is, the cache is relatively small compared to whatever arrays of data you have.

17:28 Now, when you design caches: the larger the cache, obviously the fewer capacity and conflict misses, but the longer the hit time, because you need to look over the larger address range and figure out the right entry and retrieve it. Higher associativity reduces conflict misses but increases hit time, as I mentioned before. And larger block sizes — cache line size, in this case — yes, the more you load at a time, in some sense, the better it could be, because if you need the rest of the data in the line, that's a good thing. If you didn't need it, you loaded a lot more data than you need. So there is a trade-off between the number of cache misses and the access time spent loading stuff you don't want. That's the trade-off, and that's why you see pretty much all the architectures out there ending up at 64-byte cache lines, at least close to the functional units. They may have 128 in some of the cases as you get closer to main memory; in that case, you take 128 bytes out of memory, and then you sort of partition it and do 64 bytes in the level-one cache.

18:41 Now, this is the one thing one needs to remember — and you should remember this one, because it's sort of important for trying to understand cache behavior. And that's this notion of, um, stack distance, or reuse distance: how many cache lines do you touch between two events you're interested in — which in our case is going to be trying to read the same data again. It tells you, basically, how many things get stuck into the cache before you try to get back to that data. So if it's a lot, most likely what you want again has been kicked out and you don't have it in there. So that's where we're going to use this notion.
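A rough sketch of computing that quantity from an access trace (the names are mine; the items stand for cache lines). With a fully associative LRU cache of capacity C lines, an access hits exactly when its reuse distance is less than C:

```python
def reuse_distances(trace):
    """For each access, the number of distinct items touched since the
    previous access to the same item (None = first touch, a compulsory miss)."""
    last_pos = {}
    dists = []
    for i, item in enumerate(trace):
        if item in last_pos:
            dists.append(len(set(trace[last_pos[item] + 1 : i])))
        else:
            dists.append(None)
        last_pos[item] = i
    return dists

# 'x' is re-read only after 2 distinct other lines were touched
assert reuse_distances(["x", "a", "b", "a", "x"]) == [None, None, None, 1, 2]
```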

19:31 So this is now used to figure out, you know, what you can expect: will the system kind of behave as in the large cache case, where you don't have to worry about caches, or does the cache kind of get in the way of getting good performance, so that you have to worry about cache misses? That's what this kind of analysis tries to say. And for the matrix-vector multiply, I think I already said it's quite simple: in the large cache model, again, you get the compulsory misses, because you do need to read the data once, but there are no capacity misses, because, you know, it holds the whole problem. In the small cache model, you need to worry about how the cache is operating.

20:24 So here are basically five scenarios, and I'm going to see if I can get through four of them for the matrix-vector case today. The first two are basically doing the analysis when the cache line is just one number. And then you try to figure out what's the benefit of having a larger cache line, and whether it's good. But it depends, as we will see. So back now to the matrix-vector code and its two nested loops.

21:10 obviously that stand square data elements we to read X way we need to

21:15 light and we need to write So, um so in that

21:22 uh, in terms off the once we have loaded why it's again

21:26 the large cash it's there. So don't miss them. We need to

21:30 it. So we basically and Mrs X and Y and n squared for

21:36 , eso that gives you the total of this is and relative to the

21:41 model that just did the four in thing. Then you get best

21:47 It's 0.25 plus, 05 divided by rights. Just like small end is

21:55 then eventually goes down to missing about quarter off Now, so the next
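Plugging in the numbers — a quick check of the miss-ratio formula, nothing more:

```python
def large_cache_miss_ratio(n):
    """Large cache model for y = A x: n^2 misses for A, n each for
    x and y, out of 4 n^2 total memory references."""
    return (n * n + 2 * n) / (4 * n * n)

assert abs(large_cache_miss_ratio(10) - (0.25 + 0.5 / 10)) < 1e-12
assert abs(large_cache_miss_ratio(10**6) - 0.25) < 1e-5   # tends to 1/4
```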

22:04 The next case is the small cache model. So now what happens? At the top, there is basically the string of memory references that the loops do, and everything that is shaded in one way denotes going through the first row from start to finish; then the first differently shaded item after that is going on to the next row. So, still no surprise: the misses for A — A is never reused anyway, so we need n squared misses, one per matrix element, regardless of cache size. x is being reused, right? Every row uses the same vector x, so there's potential reuse of x. And y is reused too — within a row — depending upon how you write the loops.

23:02 So if it's a small cache, the idea is: you go through the shaded references, and, you know, an element of A gets loaded and sits in cache, x gets loaded and sits in cache, you have y[0] and it sits there the whole time it gets iterated on, so that's not really a problem. But you keep loading elements of A and x, and at some point, the cache is full — it's a small cache. That means, when you're at the end of the row and you want to start with the next row, the idea is: it's a small cache, so the beginning of x is no longer left in cache; it got overwritten when you were, you know, some fraction of the way through the row, depending upon the size of the row. So x is gone. So first I had to load it for the first row — that's the n elements in x — and then I need to reload it for all the other rows as well: that's n loads for every other row, and there are n minus one of those after the first. So that gives you basically the n squared misses for x.

24:18 So now we have n squared misses for A, we have n squared for x, and y is not a problem, because we just use one element for an entire row; it sits there until we move to the next one. So that's how you get to two n squared plus n as the number of cache misses. And then you look at the ratio compared to the first model, and it's 0.5 plus 0.25 over n.
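The same quick check for the small cache count:

```python
def small_cache_miss_ratio(n):
    """Small cache model for y = A x: n^2 misses for A, n^2 for x
    (x is evicted before every reuse), n for y, out of 4 n^2 references."""
    return (2 * n * n + n) / (4 * n * n)

assert abs(small_cache_miss_ratio(10) - (0.5 + 0.25 / 10)) < 1e-12
```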

24:52 So then you try to figure out: what problem sizes can I run and still stay in the large cache regime, which is the one that performs best, compared to the other one? So then you basically, you know, plug things in and try to figure out how large the cache needs to be for me to stay in the large cache regime. Right?

25:16 If the replacement policy — the choice of what to replace — were very smart, it would know that I'm never going to touch my A element again, so I don't need to worry about keeping it. Then the only thing I need is to keep x in there, plus one space for an a element and space for y. So I only need space for x and a couple of other variables: that's n plus two. On the other hand, if you have LRU replacement, the A elements stay in the cache, while the x that you'd like to still be there at the end of the row gets pushed out. So you need space for one row of A and a complete x, plus the y element — and maybe, for safety, one more slot for, you know, plugging a new element in without messing things up. So that's why you need two n plus two, not n plus two: you need both one row of A and all of x, because of the policy used. So that means, if you look at when things change, it's basically when n is about half the size of the cache.

26:50 So this is what you see: if you start out in the large cache regime, the miss ratio goes down toward the asymptote. On the other hand, when n grows past the threshold — all of a sudden the cache is small relative to the problem size — then the cache miss ratio jumps up again. And that's how you get that curve. Okay, we'll see where we get.

27:23 Now, this one is the other loop order, which tends to be favored in some settings — and this depends upon how you store your matrix: one of the two orders is favorable to row-major storage and the other one to column-major storage. This is the saxpy formulation, because the i loop — the one that actually goes through a column of A — is the innermost loop. So from the innermost loop it goes down within a column, you know, straight down, and then the outermost loop scans across the columns. For certain architectures this is an ideal pattern. But now, for this example, we assume row-major storage order. So that means, when you go from — hmm, I'm ahead of myself; that comes later. If you look at this ordering with a cache line size of one, it ends up being the same miss ratio as the inner product. So forget about that — that's what I had in my head when I started to talk.

28:29 So now we're going to get to the line size. We're going to do the same thing for both the inner-product and the saxpy orderings, and figure out what happens when the cache line size is larger than one value. So, it's row-major order. That means, in the inner-product form, the innermost loop goes through a row, and in row-major order I get B elements of A for one miss. So the number of cache misses for A gets divided by the cache line size: it's n squared over B. The same good news holds for x and y — those are just vectors of consecutive elements, so I get B at a time, and there's no problem. So that means everything gets divided by B, and I think this pretty much makes sense.

29:23 So then you look at this model — the large cache model as in the previous case, and the small cache model — and now, for the number of misses, the calculation is the same as in case one, the inner product with cache line size one, except everything gets divided by B. And the point where things transition is, I think, what you would have guessed. The analysis of the LRU part is a little different, because again you fetch whole lines of B elements, but what you need to keep is still a whole row of A and the whole x — modulo things being divisible by B — so nothing much happens there: the transition is at the same point. But the misses, if you look at the numbers — yes, the miss ratio goes down, because you get B elements per line, so the basic counts are divided by B.
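A quick sketch of that effect (a toy formula; `B` is the line size in elements, and I assume B divides n evenly):

```python
def inner_product_misses(n, B, small_cache):
    """Miss count for the inner-product order, row-major storage,
    line size B: every count from the B = 1 case is divided by B."""
    a = n * n // B                                  # one miss per line of A
    x = (n * n // B) if small_cache else (n // B)   # x refetched per row if small
    y = n // B
    return a + x + y

n, B = 1024, 8
assert inner_product_misses(n, B, small_cache=False) == (n * n + 2 * n) // B
assert inner_product_misses(n, B, small_cache=True) == (2 * n * n + n) // B
```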

30:55 Now, for the saxpy version, things are a bit different, because things are again allocated in row-major order: you always fetch B elements of A, but you actually only wanted one of them, because the next element of A that you want is in the same column but a different row, and that's not in the same cache line. What is in the same cache line are elements in the same row of A, not elements in the same column. So the cache misses for A are still n squared. For x, it doesn't matter much — it's still a vector. But since you go down the columns, the same thing happens with y: you get B elements of y per line, so the number of cache misses per pass is not n but n over B — except you have to go back and repeat them, because when you're through with one column, the beginning of y may be gone.

32:10 So what happens now is that the transition happens for a much smaller n. Because what you need to keep around for reuse is a whole column's worth of cache lines of A: the number of cache lines of A you touch in one pass is n, the number of rows, and each of those lines holds B elements, not just one. So that's why you get the picture where things transition to the small cache model at a much smaller n value, and hence the other curves. So in this case — and that's the point — a cache line size of one would be much better than 64 or 128 bytes.
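The asymmetry between the two traversal directions shows up even in a tiny fully associative LRU model (a sketch with made-up sizes — 8 lines of 8 words against a 32-by-32 row-major array — not any real processor):

```python
def count_misses(addresses, num_lines, line_size):
    """Fully associative LRU cache model: count misses for an address trace."""
    resident = []                    # line tags, least recently used first
    misses = 0
    for addr in addresses:
        tag = addr // line_size
        if tag in resident:
            resident.remove(tag)     # hit: re-append below as most recent
        else:
            misses += 1
            if len(resident) == num_lines:
                resident.pop(0)      # evict the least recently used line
        resident.append(tag)
    return misses

n = 32                               # n x n row-major matrix, words 0 .. n*n-1
by_rows = [i * n + j for i in range(n) for j in range(n)]
by_cols = [i * n + j for j in range(n) for i in range(n)]

assert count_misses(by_rows, 8, 8) == n * n // 8   # one miss per 8-word line
assert count_misses(by_cols, 8, 8) == n * n        # every single access misses
```

Along the rows, each fetched line is paid for once; down the columns, the line fetched for one element is evicted long before its neighbors are wanted.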

33:03 So what else do people do in real life? Instead of just doing the naive looping, they do blocked algorithms. And that's, I think, what you were asked to do in one of the matrix-multiply exercises. So instead of doing element-wise operations, you try to fit a block of the matrix in cache. It has some marginal benefit for matrix-vector, but it has a significant impact in the matrix-matrix multiplication case. In this case, the idea is that you take a small, capital-B by capital-B block of the matrix and do one such block at a time. And if you look at it, you need two times B elements — of x and y — in the cache at the same time to take care of doing one block of the matrix. So in this case you get some benefit. And there are two more slides, and I think I can still get through them.
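A blocked matrix-vector multiply can be sketched like this (a toy version; the block size `BS` is a stand-in for whatever keeps the active pieces of A, x, and y resident in cache):

```python
def blocked_matvec(A, x, BS):
    """y = A @ x computed block by block, so each BS-long slice of x
    is reused against BS rows before the next slice is brought in."""
    n = len(A)
    y = [0.0] * n
    for ii in range(0, n, BS):             # block of rows
        for jj in range(0, n, BS):         # block of columns
            for i in range(ii, min(ii + BS, n)):
                for j in range(jj, min(jj + BS, n)):
                    y[i] += A[i][j] * x[j]
    return y

assert blocked_matvec([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], BS=1) == [3.0, 7.0]
```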

34:18 And this is a thing that is, um, important — it's actually used in a number of settings that go way beyond dealing with vectors and matrices. So once you have the blocks, you still need to decide in which order you want to traverse the blocks, right? You can do row-major order: you go through, you know, a block row, block by block within the row. Or you can go down in the column, block by block. Or you can try to do some other orders. What is in the upper right corner — there it's twisted around a little bit, but that order — is what's known as Morton ordering. You can see the Z's in it, and some fellow by the name of Morton got his name attached to this ordering. So, has anyone heard about Morton ordering? No. Has anyone heard about space-filling curves? No?

35:35 So I'll come back to that when I talk later about partitioning data. One technique that is being used is these space-filling curves: you try to walk across your entire data structure, and the idea is, just as if I were a traveling salesman, I visit every place, and each place only once, and visit all of them. In effect, it converts a multidimensional problem into a one-dimensional problem, and then you have a traversal in memory that supposedly will be good in some sense. So it's used for doing load balancing in clusters, and it is used for trying to figure out how to make the memory hierarchy work well. And the Morton ordering, as you can see from the Z line, is a recursive ordering, so you can apply it recursively from the coarsest down to the finest level. And some of the compilers, in fact, make use of it: when they do blocking for you, they do blocking, and then they do another level of blocking, and maybe more. And the Morton ordering, being recursive, fits that — because, and I'll talk about this a little bit more, recursive algorithms are claimed to be intrinsically good for caches. In real life, it turns out not to be quite true. So watch out: the compiler guys got very excited about this about 15 or 20 years ago, and they built compilers doing it, and then when we tried them out — hmm, why doesn't this work? I'll say a bit more later about why it doesn't work in all cases, and why, as always, you have to be careful where you use it. But this space-filling curve concept, if it's something new for those of you in computational science, is something you may want to know; it may be useful in some cases.
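The Morton (Z-order) position of a block is just the bit-interleaving of its row and column indices — a small sketch:

```python
def morton_index(row, col):
    """Interleave the bits of (row, col) to get the Z-order position."""
    z = 0
    for bit in range(16):                         # enough for 16-bit coordinates
        z |= (col >> bit & 1) << (2 * bit)        # even bits come from col
        z |= (row >> bit & 1) << (2 * bit + 1)    # odd bits come from row
    return z

# Traversing a 4x4 grid of blocks in Morton order traces out the Z pattern:
order = sorted(((r, c) for r in range(4) for c in range(4)),
               key=lambda rc: morton_index(*rc))
assert order[:4] == [(0, 0), (0, 1), (1, 0), (1, 1)]
```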

37:55 This just shows a little bit of what you can get if you do the blocking for the matrix-matrix multiply, as a function of cache sizes and other parameters. So I think that's pretty much it. I just wanted to show strip mining as well: that is what the compiler guys talk about when they partition a loop into sections that fit some block size or some vector length and so on. Say the loop runs over a bunch of elements, and you want to use, you know, vectors of length 32, but you have a hundred thousand iterations. You still want to use the vector units, so you turn it into an inner loop that uses 32 elements at a time. And that's what they call strip mining and tiling, which I talked about in terms of the GPUs.
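Strip mining in its simplest form splits one long loop into an outer loop over strips and an inner loop of fixed length (here `VL = 32` is an assumed vector length, not any particular machine's):

```python
VL = 32  # assumed strip / vector length

def summed_in_strips(data):
    """Process a long array strip by strip, as a strip-mined loop would."""
    total = 0.0
    for start in range(0, len(data), VL):    # outer loop over strips
        strip = data[start:start + VL]       # inner loop sees <= VL elements
        total += sum(strip)                  # stand-in for a vectorized kernel
    return total

assert summed_in_strips(list(range(100))) == sum(range(100))
```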

38:54 So that was it for today — our time is up. So next time I'll talk about the matrix-matrix multiplication, and then I'll give a little bit of an intro to how clusters are put together, and move on to kind of aggregates of nodes, um, as the basis for talking about message passing — MPI for short. That will be the assignment after the GPU assignment we're talking about. But I thought understanding a little bit of cache behaviors, and how to structure algorithms for memory — in our case still node-based so far, not in terms of cluster-based partitioned memory or distributed memory, just local memory making use of the cache hierarchy — I'll try to do that now and finish up next time, and then talk about clusters and how they're put together, basically very quickly, so you have a mental picture, if you don't already have one, of why message passing makes sense. Okay, all right.

40:07 So, last time we talked about the matrix-vector multiply, and, um, I had shown a little bit the impact of loop order in terms of what one could expect in terms of performance. And I showed the thing that just gives you the wide range of performance on some older machines, depending upon what you do to something simple. In that case, it wasn't matrix-vector, it was matrix-matrix, where you have a little bit more degrees of freedom and opportunities. And the whole story again is about the memory system, and I want to make sure you understand, or remember — it should get sort of really ingrained into your heads — how the memory system works, and then how your code actually uses the memory system. And, um, to me, it's not just the qualitative aspect — we know the levels are kind of different — it's actually the quantitative aspect that you need to worry about, too. So understanding the metrics is really important for what I'm talking about. This is just a reminder about the huge difference in terms of the collective capabilities of the cores on a CPU versus the memory channels associated with the CPU. It's typically more than an order of magnitude.

41:43 So that means, um, if you don't have locality of reference in the application to start with, then there's kind of no hope of getting anything close to the peak performance in terms of what the arithmetic and logic units can do, because the code will just wait for memory. Um, if you do have data reuse, then, depending upon how you managed it when you wrote your code, you may or may not get the ability to fully benefit from the cache. Because, again, if the code mostly sees the caches, it runs at cache speed — functional-unit speed, kind of. If it sees the main memory, it's another, you know, order of magnitude, or worse, off in terms of the performance.

42:34 So today's lecture is basically trying to figure out, you know, when you have a matrix multiply, what can you do. And in part, the assignment that you did with OpenMP and matrix multiply — the problem mostly comes out of the frustration you maybe experienced. Um, I wasn't really intent on getting you all frustrated, but the MKL that was the reference case is a well-optimized library, so it gives you a reference as to what can be done. And there is no terribly clever magic being used — it's kind of very basic insight into how the memory system is used, and I'll tell you exactly what kind of techniques are being used in MKL, and so why they get good performance. And the other thing that, I guess hopefully, you noticed — or it has been a good experience, probably — is that, yes, you can port sequential code to make it parallel by using, um, OpenMP, and it's not too hard to do that. It may compile and run, but it may kind of not run very well. So, um, it's all pretty much about how the memory system and the data accesses give you the difference in performance. So, sorry, but sometimes lessons are hard to learn.

44:12 So, just a reminder about the two kinds of misses we talked about last time: the compulsory and capacity misses. And we kind of ignored the ones that are a bit more complex — well, not so much complex, but I just reminded you about associativity and small caches. Today we'll just do the simple things: look at compulsory and capacity misses, and how they are affected by what is called the stack distance. This is the one thing that I think everybody should try to remember, since it is one of the key issues in trying to understand the behavior of your code: to try to figure out the reference pattern and how it relates to your caches.

44:56 your cash is, um, so the large cash model is pretty

45:05 Everything fits in cash, or you expect too many problems. That's kind

45:08 easy. Um, and the small is what I'm kind of tired will

45:14 on today and try to understand what when the cash is relatively small compared

45:20 to your problem at hand. For those of you who do, let's say,

45:29 finite element codes, this is what I care about. For those, if you try

45:39 to represent the operator, then as a matrix it would be a sparse

45:44 matrix. But the matrix then has block structure. So even if you don't

45:49 store the zeros, you still have kind of a large number of matrices that

45:56 are quite small, in a sense, because it depends on the number of degrees of freedom in each node and

46:01 things like that. But they're modest in size, so each one of those may

46:06 very well fit in, certainly, the level-two cache, if not level one. So understanding

46:12 a little bit of the large cache model may be beneficial if you try to figure out how

46:17 to deal with the block structure in some of these sparse matrix cases, and

46:22 that's what we played around with: matrix-vector with the different loop orders

46:25 and tiling and stuff, and we'll try to do the same now for the

46:30 matrix-matrix multiplication. And this is pretty much what it says:

46:37 matrix-matrix multiplication is not much different than a collection of matrix-vector

46:44 multiplications. So if you have A times B, then B is, as in the

46:49 matrix-vector case, a classic collection of vectors. So for each column of B, you

46:54 do a matrix-vector multiplication. So the two are, kind of at the first order,

47:02 very close. It's just one more loop wrapped around to take care of a

47:05 collection of vectors. So the canonical way people tend to write

47:11 matrix-matrix multiply is so: three nested loops; the innermost loop does an

47:19 inner product: it takes one row of A times a column of B, nothing

47:25 else. Then, what this particular code does: it goes through and,

47:30 that means, it computes one inner product and accumulates the elements of the

47:36 sum into the same element of C. And when it's done with an element

47:41 of C, then it goes on to do the next element of C. So

47:45 that's what it does. Now, as this slide says, the objects of the

47:50 computation are basically n-squared things. So the data is, in total, three n

47:57 squared.

48:02 On the other hand, if you count the operations, it is basically two n

48:08 cubed: n cubed for the multiplies and n cubed for the adds. And sometimes people use a fused multiply-add

48:13 count, so they just count an add and a multiply as one. But we do it at the

48:19 individual operation level, so there are basically two n cubed. So in this case,

48:19 unlike the matrix-vector multiplication, where you had two n squared operations on n squared

48:24 data, so in principle every data item is used only a constant number of

48:31 times, in this case you have a much better opportunity for getting

48:36 data reuse than you have for matrix-vector. For matrix-vector, each

48:41 element of the matrix A, that is each a_ij, is only used once. Each element

48:46 of x in y = A times x is used n times, once for every

48:52 row of A, so you can get reuse of x, but you can't get reuse of A

48:58 in that case. So if you look at it, it's basically, in that case, n

49:04 squared data plus two n for the vectors, and two n squared operations. So

49:11 basically, on average, two operations per data item. In the matrix-matrix case it is instead of

49:15 order n. So the trick is: how do you benefit from the fact that

49:18 you can get data reuse, and how do you structure your code so data stays in

49:24 cache as much as possible. And then you can see the cache performance improve.

49:30 But this is pretty much all I already said; this is where we

49:38 want to spend the time, right? So this is back to the same slide

49:43 that I think we saw earlier. So let's look at what happens in this

49:48 code again, a little bit more in detail. So if you look at it, the innermost loop

49:51 is an inner product. So what does that mean? You go through a row

49:54 of A and a column of B, and then you move on to the next

49:59 element of C in the same row of C, and you go back and use the

50:07 same row of A, sorry, with another column of B to get the next element of

50:15 C. So, in fact, you're kind of doing a vector-matrix multiplication, as opposed to a

50:21 matrix-vector, because you use the same row of A for the different columns of B.

50:27 So it becomes a row vector times a matrix, as opposed to a matrix times a column

50:33 vector. So this is what this thing does. And then the outer loop just goes

50:36 through a bunch of these vector-matrix multiplications. I think this is kind of

50:46 explained on this one. So now we are kind of trying to figure out how

50:51 these things work. And so, for each row of A, we read

51:02 the whole matrix B. So that's n squared. And then we do it for

51:06 all the different rows. So it is n times n squared reads

51:12 of B, that is, n cubed reads of B. A you only read once: you go

51:18 through a whole row of A and then you move on to the next row of A, so A is

51:22 only seen, or loaded, once. So that's n squared loads for A. And for C,

51:28 you have to, in this case, get the initial C from memory, and

51:32 then you write it back. But you only load and store C once

51:37 each. So that's the two n squared memory references for C. So you add

51:43 it all up, and then you get n cubed plus three n squared memory

51:48 references. So now, if you look at the ratio between the number of operations

51:54 and the number of memory references, you'll see what you get. It is basically two n

52:00 cubed operations, and then the n cubed plus three n squared data references.
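The triple loop and the counts just described can be sketched in code; this is an illustrative Python version (the lecture itself shows slides, not code), with the operation and memory-reference counts from the discussion expressed as a small helper:

```python
def matmul_ijk(A, B, n):
    """Naive three-nested-loop multiply: C[i][j] accumulates the inner
    product of row i of A with column j of B, one element of C at a time."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):          # innermost loop: one inner product
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def op_and_ref_counts(n):
    """Counts as derived above: 2n^3 flops; B is re-read for every row of A
    (n^3), A is read once (n^2), C is loaded and stored once (2n^2)."""
    flops = 2 * n**3
    refs = n**3 + 3 * n**2
    return flops, refs, flops / refs    # ratio tends to 2 for large n
```

For large n the ratio of operations to references approaches two, which is why this loop order by itself exposes essentially no reuse.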

52:05 And it is no surprise, because the two innermost loops are exactly the

52:13 vector-matrix multiply, exactly that operation, and then there's just another loop around

52:18 it. So no surprise that you don't exploit any potential for data reuse, because you

52:24 just do vector-matrix multiplication a bunch of times. On the other hand, as we

52:30 saw on the previous slide, in principle one should be able to engineer something where

52:36 you can get much better data reuse; there's a potential for it, and,

52:43 in principle, you can reuse some data. So this is this slide, and again, this quotient

52:50 between operations and memory references is called arithmetic intensity, and I talked about

52:57 that a little bit back in, I guess, lecture three. So there is

53:02 this roofline model that people use quite frequently to look at when things are

53:08 bandwidth limited and when they are basically arithmetic-logic-unit, that is, compute,

53:14 limited. And for that, one uses this quotient of operations and memory references. Now, so

53:24 let's see. But you know, when we did the matrix-vector thing, we played

53:28 around with different loop orders. There were two loops in that case. Now we

53:32 have three. And for the matrix-vector case, the performance was quite different;

53:38 in particular, when the cache holds only some number of elements, then it

53:45 is the case that the performance depends upon the loop order. And so you should expect the same

53:48 to happen with matrix-matrix. So now I'm going to try to put you

53:55 to work a little bit. So here is the i-j-k loop order.

54:00 Typically, when people talk about matrix multiplication, since it is always, you

54:06 know, two n cubed operations, people tend to, I would say, call each loop order an

54:12 algorithm. I don't know; they only differ through loop order, and algorithmically, I

54:17 wouldn't say there's much difference. So they tend to refer to things by the

54:22 index order you have, in this case i, j, k, going from the

54:27 outermost loop to the innermost. So this first row of the slide says,

54:34 okay, the two loop orders that interchange the i and j loops but

54:42 keep k as the innermost. So one can then try to figure out

54:49 what the data locality of the reference pattern is when k is the innermost

54:56 loop, regardless of what else happens. So, in a sense, we focus on the innermost

55:00 loop; that happens all the time, and it doesn't matter too much what the order of

55:05 the other two loops is. So what kind of locality do we have for

55:13 A, B, and C in this case? How about A: what's its locality?

55:29 When you go from one k to the next, what happens? Right,

55:37 you don't reuse the same element of A for the next k, so there's no temporal

55:43 locality. So then it's a question: the other option is spatial locality.

55:48 Is there spatial locality for A? Yes, because it happens to be row-major

55:54 order, and you walk along a row of A. So yes, you get spatial locality for

55:58 A. How about B? No, because k marches down a column, and when

56:06 you have row-major order, that means no spatial locality. How about C,

56:15 right? Yes, because every iteration of the loop reuses the same C element

56:19 and just accumulates the result of the inner product, or the partial sum, in the

56:27 accumulator. So this is what you get. That's what it is. So

56:31 now, how about if you put i in the innermost loop instead? Then what happens?

56:39 How about A? Right: temporal locality, spatial locality? Nothing,

56:59 because you use a new element of A, and it's in a different row. So that

57:04 means there's neither spatial nor temporal locality. How about B? Well, what happens

57:26 to B when i changes? Exactly: you get temporal locality. And

57:40 now on to C. i is innermost, so you go down

57:45 a column. Yes, no spatial locality. So this is what happens. So the last case:

57:55 j is innermost. Correct: what happens with B when j increments? You stay in

58:17 the same row. Yes, so now we get spatial locality for B.

58:20 We get temporal for A. And how about C? Yes, spatial. Yes.
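The three innermost-index choices just analyzed can be written out side by side; this is an illustrative Python sketch (assuming row-major storage, as in the lecture), with the locality conclusions recorded as comments. Only the loop nesting differs; all three compute the same C:

```python
def matmul_k_inner(A, B, C, n):
    """k innermost (e.g. i-j-k order)."""
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # A[i][k]: spatial (walks along a row); B[k][j]: none
                # (walks down a column); C[i][j]: temporal (accumulator).
                C[i][j] += A[i][k] * B[k][j]

def matmul_i_inner(A, B, C, n):
    """i innermost (e.g. j-k-i order)."""
    for j in range(n):
        for k in range(n):
            for i in range(n):
                # A[i][k]: none (new row each step); B[k][j]: temporal
                # (fixed); C[i][j]: none (walks down a column).
                C[i][j] += A[i][k] * B[k][j]

def matmul_j_inner(A, B, C, n):
    """j innermost (e.g. i-k-j order) -- best locality of the three."""
    for i in range(n):
        for k in range(n):
            for j in range(n):
                # A[i][k]: temporal (fixed); B[k][j]: spatial (walks a
                # row); C[i][j]: spatial (walks a row).
                C[i][j] += A[i][k] * B[k][j]
```

Since the arithmetic is identical, any performance difference between these variants comes purely from the cache behavior of the access patterns.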

58:29 So that was it, I think. Let's see. So

58:37 that's kind of it; this is just a simple summary. It's probably a little bit out of

58:43 order, but this is, if you do this analysis in the large-cache scenario, where you

58:48 figure everything fits, what you can figure out in that case. So if you go back

58:55 to the slide, then it is what we have said. I just realized

58:58 the animations are out of order. So this is exactly the case we went through, and,

59:02 figuring out exactly what the number of misses is: because for B there

59:10 was, when you had k in the innermost loop, as you can see, right,

59:14 no locality for B, so you miss every time for B. And

59:18 there was spatial locality for A, so basically you only miss once for every

59:23 cache line's worth of elements of A. And C is temporal locality, just reuse, nothing. Now,

59:32 if you go back and look at the other two loop orders that I just showed,

59:35 one can do the same analysis for each of them. In that case, one

59:41 can just do the simplistic thing and assume n is very large; if it is

59:48 not, then the miss ratio becomes about 0.25, and for the next one it is basically

59:55 one. So that means it's higher with i innermost than it is for k

60:00 innermost. And then we had the last one, which was j in the

60:04 innermost loop, and in that case, if n is a reasonable number,

60:09 it beats the others. So, in that sense, one can, based on this

60:14 simple analysis of how data is accessed, figure out that the last

60:20 loop order ought to be the best one in terms of performance. Second will be the

60:25 k-innermost one, and the worst one will be with i innermost. So then these

60:31 guys were nice enough to run this code on a platform, and you can see

60:38 three very distinct behaviors in terms of miss ratios: the worst curves, where the miss ratio is around 0.5

60:45 and rises even higher in terms of the misses; and then, at the other end,

60:50 you see the purple one at the bottom. So you can see that, depending

60:54 on what you do, yes, it behaves as you expect.

61:02 And here is basically the loop order shown for the different curves.

61:05 And you can also see that it's not a nice, smooth behavior;

61:09 there are a lot of spikes. So, anyone want to guess

61:14 why the spikes? Right. So it has to do with the associativity

61:28 of the cache and the cache conflicts, what gets kicked out, because the cache is not fully associative;

61:33 that's where the spiking comes from. But the essential behavior qualitatively is not affected;

61:39 the curves are very much in line with what we predicted. So this

61:42 is, again, how a very simple analysis of your code allows you to get a

61:50 good idea of what loop order is good. An alternative, of course, is for the compiler

61:54 to do that for you, and on something simple enough, yes, a traditional

62:01 compiler can interchange loops and so on. So the compiler's analysis may very well end up

62:04 doing the right thing, but the safe way is to reason about it yourself and make

62:10 sure things are set up so you don't depend on the compiler to figure it out

62:13 for you. Any volunteers in terms of questions on these things? Yes, is

62:20 there an outlier here; is it a fluke of the measurement? Yes, exactly.

62:32 And then there is the horizontal part of the curves. So when the matrices are small,

62:39 basically they all fit in cache and you have no problem with memory, so it's better. And

62:44 then, again, as the matrices grow, less and less of the data fits, so the

62:50 miss ratio becomes higher and higher, as you would expect. But then

62:57 the cache is not protecting you anymore. Okay, and you can go

63:03 back and do the simple analysis as we did before. And you can figure out

63:10 exactly when you start to lose the large-cache behavior as a function of the size

63:15 of your matrices. So, this is what I guess I

63:27 had. So the one caveat about this is the fact that the optimal loop order,

63:42 as we have kind of been arguing and figured out,

63:48 depends on the shape of the matrices. So if the matrix is a

63:51 square, this is the right thing to do. But if you have rectangular matrices,

63:54 you may have different loop orders that are better. So it's not given that this

64:02 is always the right thing to do, the preferred loop order. And so the loop

64:06 order I just showed you for, sorry, I'm a little bit ahead

64:13 of myself; I'm getting on to the next thing anyway, so I'll come back

64:17 to that, since I'm ahead of myself. So, yes, I just put this

64:20 in as a reminder: way back, some lecture showed you get to play with

64:25 the codes in the assignment, with all the loop orders, and this is just as shown.

64:32 So this is again pretty much what I said there. One thing also: when we worked

64:42 with the matrix-vector multiplication, we looked at the tiled, the blocked, version.

64:47 And that is, in fact, far more important for the matrix-matrix

64:52 multiplication, because that's where you actually start to get some benefits from it.

64:59 So now, basically, the structure of the blocked code: what you

65:06 see in this loop order is kind of the same as the i-j-k order.

65:11 But the difference is: now, instead of individual elements, you work with

65:17 small sub-matrices; but otherwise everything is the same. So you can

65:22 think of your i-j-k algorithm, and just, instead of working on

65:27 individual elements, think of it as a block matrix, and then it's all the

65:32 same. So if you do that, then one can figure out, as

65:39 a function of the block size, where we assume capital B, in this case, rows

65:46 and columns for the blocks, so each sub-block or sub-matrix is a B by

65:52 B matrix; but then otherwise it's kind of the same, except that now,

65:58 instead of grabbing individual elements, you grab whole blocks. So if you

66:09 look through it: you basically do an inner product of a block row of A times a

66:17 block column of B as the kind of innermost thing. Then

66:24 that means you have, again, n over B blocks for

66:36 both A and B. So, for B, you have n over B blocks in a block column,

66:42 and each one is B squared elements, and then you go and do it for all

66:49 the different ones. Then you have n divided by B block columns of B.

66:54 So before you're done with one block row of A, you have read n

66:59 over B, squared, times B squared in terms of elements. So that's what is

67:07 inside the square-bracket expression. And then you have the number of block rows

67:13 of A that you need to go through, and there are n over B block rows

67:16 of A until you're done with A times B. So, in the

67:22 end, you basically get something that is n cubed divided by

67:29 B for the loads of B; and then you can do the same for A, because in

67:33 this case it becomes symmetric with respect to A and B. And then C: it's still the

67:38 same; it doesn't matter that you compute it block by block;

67:42 you only need to load C once and store it once, so you can ignore

67:48 the block sizes; they are immaterial for C. The loop order details may matter a little

67:53 in terms of hardware, mappings to cache lines and such details. But in terms of

67:57 the number of elements loaded, it is not affected by the blocking, and it's

68:03 still two n cubed operations. So you can see, if you do this blocked

68:09 matrix multiply, you have the option of keeping the B-squared blocks of A, B, and C

68:16 in the cache. Then you can get reuse that is basically proportional to the number

68:22 of rows or columns in the blocks you're using. So by using this blocked

68:28 approach, and keeping the blocks of a size where they fit in cache,
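The blocked scheme being described can be sketched as follows; this is an illustrative Python version (not the lecture's code), using the same i-j-k structure but with each index stepping over B-by-B sub-matrices:

```python
def matmul_blocked(A, B, n, bs):
    """Blocked (tiled) multiply: same i-j-k structure as the naive code,
    but each loop walks bs-by-bs sub-matrices. While one block of A, B,
    and C sits in cache, each of its elements is reused about bs times,
    so total traffic drops to roughly 2*n^3/bs + 2*n^2 elements."""
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, bs):            # block row of C (and A)
        for j0 in range(0, n, bs):        # block column of C (and B)
            for k0 in range(0, n, bs):    # block inner product
                for i in range(i0, min(i0 + bs, n)):
                    for k in range(k0, min(k0 + bs, n)):
                        a = A[i][k]       # reused across the j loop
                        for j in range(j0, min(j0 + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The `min(...)` bounds just handle matrix sizes that are not a multiple of the block size; the counting argument above ignores those edge blocks.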

68:33 you get data reuse that you didn't get before. And, of course,

68:38 then the larger the block size is, the better it is for the reuse.

68:44 And then, I guess, the bottom line shows that, basically,

68:49 modulo a factor of a few and other details, essentially the three blocks of A,

68:54 B, and C all need to fit in the cache. So that means the block

68:59 size kind of depends upon how much cache you have, but it is proportional to

69:04 the square root of the cache size in terms of elements. And then you can

69:08 figure out, if you want something that is kind of balanced, so you want the

69:18 memory not to get in the way: that means you look at the number of

69:22 data loads and the time for each data load, and then you look at the

69:26 time it takes to do an individual floating-point operation, and how many of

69:31 them you do, and you look at the ratio. And then it turns out,

69:35 if you have this arithmetic intensity, you can figure out, you know, what block

69:41 size and cache size you need for this computation, in terms of matrix

69:46 multiply, not to be limited by the speed of memory that you have. So it's

69:52 a fairly simple thing, and then you can figure out also, then, whether it's

69:58 going to be, even if you have a cache, is it going to be

70:01 memory limited or not. You can figure it out from your data

70:06 sheet, because we have, from when we talked about the processor, that

70:10 it has a clock rate, and it has an instruction width that may be

70:18 either scalar or, in this case, vector-wide, which the compiler should be capable

70:23 of finding. So if you take that, and it can do 16 operations, say

70:28 multiplies, in a single instruction, and it runs at whatever clock rate on

70:34 however many cores, then you can basically plug it in here and figure out the time per

70:39 floating-point operation. And you also have the memory bandwidth that you get from the number

70:45 of memory channels and the speed of the memory that you have. So you can

70:49 figure out what the time per memory access is, and you can figure out what the time per flop

70:54 is. And you can then figure out whether, with the optimum block size, you are going

70:59 to be limited by the cache or by memory.
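The balance check sketched here can be put into a couple of lines; this is an illustrative Python sketch, and the specific numbers used below (100 Gflop/s peak, 25 GB/s bandwidth, 8-byte doubles) are made-up examples, not a particular machine:

```python
def min_reuse_for_compute_bound(peak_gflops, bandwidth_gbs, bytes_per_elem=8.0):
    """Back-of-the-envelope balance: the blocked multiply performs about
    B flops per element loaded, so to avoid being memory limited we need
    the reuse factor B to satisfy
        B >= (peak flop rate) * (bytes per element) / (memory bandwidth).
    Returns the minimum B (flops per loaded element)."""
    return (peak_gflops / bandwidth_gbs) * bytes_per_elem

def cache_needed_bytes(block, bytes_per_elem=8):
    """Three block-by-block tiles (one each of A, B, C) must fit at once."""
    return 3 * block * block * bytes_per_elem
```

For example, a hypothetical 100 Gflop/s machine with 25 GB/s of bandwidth would need a reuse factor of at least 32, and three 32-by-32 double-precision blocks occupy 24 KiB, which would have to fit in the cache level being targeted.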

71:05 Which is just to say, I think it comes up on some slide here. So yeah, I guess this

71:11 slide makes a comment: it turns out, for those who are kind of theoretically inclined, that

71:16 for this particular scheme one can also prove it is the optimal scheme to do things, in

71:22 terms of basically getting the maximum reuse out of each level in the memory

71:28 hierarchy. So you apply it to level one, you apply it to the level

71:32 two, and you apply it to the level three and whatever else you have; and

71:37 if you have disk beyond your main memory, then you apply it, you know, to the

71:42 main memory and to the disk and so on. So it applies to every level of

71:47 the memory hierarchy, which is kind of what the next slide says. So that means

71:53 that, you know, everything you got taught in software engineering goes out the window, and

71:59 you're going to mess up your code by making it much less intuitive and harder

72:04 to maintain, because you break things up. You essentially do block sizes for every level of

72:11 the memory hierarchy: instead of your three nested loops,

72:15 you know, you get three times the number of memory levels you have in the

72:20 system. And if you're really serious, then the innermost level is not the

72:27 level-one cache, but it's actually the registers. So that's what, you know,

72:34 people like the MKL guys do, and what I do. And they get down to

72:41 register tiles, and there, in most cases, things are kind of fixed,

72:44 because the number of registers tends to be 32 or something on most processors, so you

72:51 get tiles of about four by four. That's what you do. So pretty

73:00 much, this is, you know, what I said: what MKL does, playing

73:04 with the loop orders, is put in place the blocking for the various levels in the

73:07 memory hierarchy; that is the way they get good performance. Now, that's obviously painful.

73:18 So some people want to automate the process. These days, I guess, you can try

73:24 to be clever about it; these days they use machine learning to figure it

73:30 out. In the old days, you did it the brute-force way and just tried all

73:34 the different options. The brute-force way is what's done by the

73:40 so-called ATLAS library that you can download from netlib. And what it does is

73:46 basically write all kinds of blockings, loop partitionings, and loop

73:53 interchanges, then generate the code for all these different variants, and then just try them

73:58 out on your platform and figure out which one is the winner.
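The brute-force idea can be sketched in a few lines; this is only a toy analogue of what ATLAS does (ATLAS generates and times many full kernel variants at install time), with a hypothetical `autotune_block_size` helper that times candidate blockings and keeps the winner:

```python
import time

def autotune_block_size(blocked_matmul, A, B, n, candidates):
    """Brute-force tuning in the spirit of ATLAS: time each candidate
    block size on this machine and return the fastest one. The cost is
    paid once, up front, and only pays off if the tuned kernel is then
    reused many times."""
    best_bs, best_time = None, None
    for bs in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, n, bs)           # run one timed trial
        elapsed = time.perf_counter() - t0
        if best_time is None or elapsed < best_time:
            best_bs, best_time = bs, elapsed
    return best_bs
```

A real autotuner would of course repeat and average the timings, and vary loop orders and unrolling as well as block sizes, but the search structure is the same.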

74:02 Of course, that means that if you just do one matrix multiply, you spend a lot more time trying to

74:07 figure out which one is the fastest than doing the multiply itself. So

74:11 it doesn't really pay unless you repeat the same operation many times. But it does give

74:18 insight into the range of performance you can get as a function of partitioning and loop

74:27 orders. Well, this is pretty much the summary. One more thing I want to mention:

74:38 as I mentioned, I think last time, when I talked about the

74:43 tiled matrix-vector thing, I talked about this notion of space-filling curves and other ideas.

74:52 So some people say, again, this blocking is a pain. So how can you

75:00 kind of do something that does it automatically? Or, in addition to

75:09 the brute force, is there another way? Then it turns out, in principle, that

75:16 you can do it with recursive algorithms instead of the iterative things that I've shown all the

75:22 time. So if you want to do so, it's kind of like a

75:29 block matrix at the high level. So suppose you have your n by n

75:35 matrices; then you partition each into four blocks, basically n over two by n over two

75:40 each one, and then you multiply these blocks pairwise. So now, at the next

75:43 branch, you have smaller instances, and then you recursively partition them and do smaller

75:49 and smaller matrix multiplies, and eventually you have something reduced to the element level. So you

75:55 recursively partition the matrices, maybe down to two-by-two block matrix

76:01 multiplies. So that's what this presumably is. So that means, when you get down to the

76:10 element level, the blocks definitely fit in cache, and they stay in cache as you kind of go

76:15 up to larger and larger block sizes. And eventually you run out of

76:20 cache, run out of blocks that fit in cache, and then, depending on which order you

76:24 process the blocks, you can use these space-filling curves to be clever and try

76:28 to preserve more and more of the locality. So, yes.
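The recursive scheme just described can be sketched as follows; this is an illustrative Python version (assuming, for simplicity, that n is a power of two), where each level splits the problem into eight half-size multiplies over the four quadrants:

```python
def rmatmul(A, B, C, i0, j0, k0, n):
    """Recursive (cache-oblivious style) multiply: accumulates A*B into C
    on the n-by-n sub-problem at row offset i0, column offset j0, inner
    offset k0. Because every block size occurs at some recursion level,
    the working set automatically fits each cache level at some depth,
    with no machine-specific block-size parameter."""
    if n == 1:
        C[i0][j0] += A[i0][k0] * B[k0][j0]
        return
    h = n // 2
    for di in (0, h):               # quadrant row of C
        for dj in (0, h):           # quadrant column of C
            for dk in (0, h):       # the two products summed into C_ij
                rmatmul(A, B, C, i0 + di, j0 + dj, k0 + dk, h)
```

The order in which the eight sub-calls are visited is exactly where a space-filling (e.g. Morton/Z-order) traversal can be plugged in to keep consecutive sub-problems sharing operands.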

76:38 So this is what it is. People have tried this, as I said, and, unfortunately, in

76:47 practice, this kind of automatic procedure hasn't been able to win. So the

76:55 iterative ones are still what's being used in the libraries out there, if

77:01 you want to do something that makes maximum use of the hardware. So in this

77:05 case, on one platform, I think they got about two-thirds, and on another one

77:09 they got to about half the peak performance using the recursive approach. And that

77:13 has to do with a lot of how the memory system actually works, and how the

77:17 cache lines work, and a lot of things underneath. And this is just a reminder of

77:24 some of these references that, you know, I put in there. So in

77:30 case anyone is interested: the first one is Charles Leiserson at MIT, and his

77:39 student was one of the early ones, if not the earliest, that,

77:46 you know, came to think about recursive matrix multiplication, being kind of a

77:50 theoretician, and proved a bunch of its properties. And he

77:55 has very nice lectures. And if you are into this, he lectures

78:01 on and on about the recursive approach, and it's great. But in

78:08 practice it is kind of inferior, and you kind of lose. So that's what I

78:17 had on matrix multiplication.
