© Distribution of this video is restricted by its owner
00:00 Let me start lecturing at this point. It looks like it is recording. Yes.

00:11 Um, then just a comment to connect this lecture to what came before. I started

00:24 talking a little bit about, oh, hardware, in order to understand what resources there are

00:36 and what it takes to get potentially good performance. And then we talked about tools

00:43 to actually find out what the code's behavior is. And then after that, I talked about

00:51 three different programming models that are being used in HPC and pretty much anything in

00:59 the area of parallel computing: that is, OpenMP and shared-memory programming, programming heterogeneous

01:07 nodes, and then the programming of clusters with MPI. So now I'm going to

01:18 return to talking about techniques to actually get good performance for applications, since that,

01:28 I think, is important for your projects. Um, and so, as I pointed

01:35 out in some of the feedback to the proposals, it's essential that in your

01:42 projects you have a good understanding, and good measurements, of how efficient your code is,

01:50 again relative to the basic hardware resources. So I'll do that: return to

01:58 how to get efficiency, by doing very simple, one or two, simple examples,

02:04 matrix-vector and matrix-matrix multiplication, and try to understand how to make such simple

02:11 computations work well on current types of hardware. Yeah. So, uh, yeah.

02:23 Okay, so first a little bit about background, in terms of to what extent,

02:31 or why, to bother, essentially. And um, I'll give some background and I'll talk about

02:39 the concept known as stack distance that is really important to try to get an

02:44 idea of what to expect in terms of performance, and how to restructure codes in order

02:51 for them to work well with memory hierarchies and cache systems, which is pretty much all

02:58 systems today. And after doing that, I'll go through these two examples, matrix-

03:05 vector and matrix-matrix multiplication, so you get an insight into how to

03:12 think about your approach in order to get good performance. Okay, so first,

03:20 the why-bother part, and this is kind of an example. You don't need to worry too

03:24 much about all the different labels on the plot on the right-hand

03:31 side of the slide. But the main thing to pay attention to is kind of

03:37 the difference between the worst-performing code, which is at the bottom, the blue dots

03:44 that are barely above zero, and the good versions, which are the purplish squares on top.

03:52 As you can see, it's a huge difference. Um, it was at least two orders

03:58 of magnitude difference, and this is for matrix-matrix multiplication, a very simple computation.

04:07 So, depending upon how the code is structured, in this particular example, for this particular platform,

04:14 there was more than two orders of magnitude difference in performance. So that's the reason

04:19 why this matters: how things are being done, and understanding how caches actually

04:26 work, and how to make good use of them. So this slide just reminds you

04:33 a little bit about metrics, in terms of latency and bandwidth, and the sizes that are typical.

04:40 And I'll come back to that later. Mm. So then, in terms

04:47 of caches, there were the three different types of cache misses that affect

04:56 performance. The compulsory misses, or cold-start misses: the very first reference

05:00 to data in a cache line. There's no other way; there will be a miss,

05:05 because data starts out in main memory, and eventually you want data back in main

05:14 memory. So initially, data is typically not in cache. So that's the compulsory, or cold-

05:20 start, miss. Then there are the capacity misses. We talked a little bit, when we talked about

05:26 memory systems, about the fact that typically the cache, any cache, is too small to hold

05:37 the entire problem. So that means there are capacity misses, regardless of how

05:44 things are being done. And then there were the additional things we talked about:

05:49 because typical caches are set-associative, that means that even if there is potentially

06:01 space, because of the way each cache line

06:09 in main memory is mapped to the cache, there can be conflict misses. So

06:15 these are the kind of three types of misses. And the examples I'm going to talk

06:23 about will at first be mostly focused on compulsory and capacity misses. So that's kind

06:30 of the first, and least complicated, way to understand cache behavior for codes. Um, so this is essentially what I am

06:40 going to use: a very simple model to understand cache behavior. And uh, so I

06:48 will use matrix-vector and matrix-matrix multiplication to try to understand what one can do,

06:54 and the consequences for performance, for a very simple cache model. Okay, so the...

07:09 sorry, sure. The very simplest model is to ignore the conflict misses, as I said,

07:16 so just focus on compulsory misses and capacity misses. And to do so, in the

07:23 following several slides, that I've borrowed from Keshav Pingali at UT Austin,

07:31 who is a compiler-type person, we look at first the so-called large cache model, which

07:42 means that there are no capacity misses, and then the more realistic small cache model,

07:50 where there are both compulsory and capacity misses. But for the first several examples, don't worry

08:00 about conflict misses. Mhm. So this slide is pretty much again what I already said:

08:08 the large cache model means there are no capacity misses, only the compulsory

08:15 misses. Now, the cache, whatever it is, also has a finite size. So a number of

08:25 these slides follow this model and try to figure out when the cache can

08:32 be considered large with respect to the problem at hand, or when it is

08:39 more like a small cache relative to the problem at hand. What we're going to use

08:46 to try to understand the cache behavior is this notion of stack distance that I

08:54 will talk more about on coming slides. On this slide, the point of the stack

09:02 distance is that it tries to account for the fact that typically there is a number of data

09:13 items being read, and at some point one may want to use the same data

09:21 values again. Which is certainly the case in, for instance, matrix-vector multiplication, because

09:28 the vector in the matrix-vector multiplication is used for every row in the multiplication. So in

09:36 that case, for the elements of X, depending on the access order, eventually a given element

09:44 of X will be read several times, and the stack distance tries to account for how

09:51 many other cache lines are being accessed before one tries to use the same cache line

09:56 again. Right. So that is kind of the notion of the stack distance.

10:04 So, if one looks at sort of a timeline of references to memory, then a

10:15 particular cache line is, you know, accessed at some time R1, and then

10:20 there are a number of other cache lines, or memory addresses, being accessed, and

10:25 eventually, at a time R2, the same cache line that was accessed at time R

10:32 1 is accessed again. So then this stack distance has an impact on

10:42 how much gets kind of loaded into the cache from memory, and clearly, the further away

10:50 the reference R2 is from R1, the more

10:56 data has kind of been loaded into the cache, and at some point, depending on the cache

11:02 size, the cache gets full, and then you get the capacity misses. Yeah.

11:14 I guess I should stop before going into the examples: are there any questions on

11:20 this basic concept of large and small caches, and the stack distance? Okay, so here is our

11:40 generic matrix-vector multiplication: a doubly nested loop, using what is the typical inner-product form.

11:50 A row of A times the vector X gives an element of the result

11:55 Y. So in this case, per execution of the statement, there are four memory references:

12:08 one for X, one for A, one for reading Y, and

12:13 there's another one for storing, sorry, for reading Y and then storing Y. So that

12:19 becomes four references, uh, to executing the statement, which is executed n squared times,

12:27 so 4 n squared memory references in total. Uh, and this particular way of doing it, four

12:31 n squared memory references, will be kind of the baseline in this example. Now, what's pointed

12:39 out on this slide here is also that a

12:47 good compiler will probably notice exactly what's happening and make sure that Y doesn't get read and written

12:56 with the same value for every iteration of the inner-

13:06 most loop, since Y of I is invariant

13:12 for all J. So that means a good compiler would actually keep it in a register instead

13:19 of writing it and reading it from memory for every iteration. But for simplicity, in

13:24 this study, I use this version of the code as the reference case, and then

13:36 try to figure out how various, uh, loop orders and other, uh, manipulations of this computation

13:48 affect things. Yeah. And note that I used row-major storage for A, which

13:55 is important, as was pointed out, I think, early on in this course; and I guess one assignment, I believe, involves playing with loop orders and the major-order storage questions.
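The loop nest just described can be written out concretely. The following is a minimal Python sketch of the C-style inner-product matrix-vector multiply from the slide (the function name is mine, not from the lecture); the statement in the inner loop is the one with the four memory references per execution.

```python
# Naive (inner-product) matrix-vector multiply, y = A x, as on the slide.
# The assignment in the inner loop makes four memory references
# (read a[i][j], read x[j], read y[i], write y[i]) and runs n^2 times,
# giving the 4 n^2 baseline reference count from the lecture.
def matvec_ij(A, x):
    n = len(A)
    y = [0.0] * n
    for i in range(n):          # one row of A per outer iteration
        for j in range(n):      # inner product: row i of A times x
            y[i] = y[i] + A[i][j] * x[j]
    return y
```

A compiler would keep `y[i]` in a register across the inner loop, as the lecture notes, but this form is the reference case for the miss counting that follows.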

14:04 So, there are the four scenarios, or five,

14:13 that, uh, I've used to try to illustrate

14:21 what happens. In the first two, we do the very simplistic thing of just assuming that

14:28 elements are accessed element-wise; basically, there are no cache line size effects in those. And

14:35 then we can kind of show what is the consequence if one has, as in any

14:43 architecture today, cache line sizes that are not one element but typically 64 or 1

14:49 28 bytes. As you can recall from Lecture 4, 64 bytes is common; the significance of the line size will show in the next few slides. So,

15:01 I would say it is a little bit difficult to see the interaction on this one, but over the slides

15:06 we'll see. So we have again the generic matrix-vector multiply

15:13 code, which is also known as either

15:21 sdot or ddot, um, in its inner-

15:26 product form: the first letter is D if it's double precision, and if it's single precision it's

15:39 S. So generically it's a dot product being executed. So now, there, we have a

15:42 cache line size of one, as I said: we read everything one element at a time,

15:48 whether it's A, or X, or Y. Um, so then we try to figure out,

15:55 in the large cache model — again, remember, that was only compulsory

16:04 misses — the number of misses. To start with A: how many cache misses will there be in this code?

16:16 Mm, any takers? So, no one is willing to speak up. Uh,

16:40 okay. Again, A starts out in memory, right? So that means every load of

16:47 an element of A incurs a cache miss. There is no

16:55 other way; it has to be loaded, since every element of A is not in the

17:03 cache beforehand. So basically, one gets n squared misses for A. Now let's move on

17:15 to X. Anyone who is willing to come up with the number of cache misses for X?

17:26 One answer from a student: n.

17:46 Okay, good. And why? Note that the statement is executed n

18:01 squared times. Yeah. Um, the answer was that it is n, I'd say n, cache misses for X.

18:29 A student points out that there are only n elements in all of X. Yeah,

18:41 that's true. But the statement is executed n squared times. All right,

19:04 unless there is any comment... I don't think anyone else was going to answer. So,

19:10 yes, it's n misses. Now, the reason it is only n misses is because we

19:18 assumed it was a large cache model. So that means, when X has been read,

19:29 you know, in the innermost loop, going over J from one to n, then

19:35 all of X has been loaded. And since it's a large cache, it all fits

19:39 in the cache. So when the outer loop

19:49 for I is executed again, then X is already in the cache. So that's the difference between A and X: A

19:57 happens to have n squared elements, each of which is only used once, so

20:04 there is no reuse; but there is a potential reuse, and there is actual reuse, of

20:10 X. And if the cache is large enough, the entire vector X fits in

20:16 the cache, so each element is only read once from memory. But the statement is executed,

20:22 again, n squared times, so each element of X is used n times; there's only a miss

20:30 the first time it's being used. And one gets something similar for Y.

20:40 So, summing it all up, this is the total number of cache misses in the

20:45 case of a large cache: n squared plus 2n. So, again, only the first touch misses, because once data has

20:52 been read, the cache can hold it, and there is no issue. So in this

20:58 case, if one looks at the miss ratio, it is something that, depending on

21:05 the size of n, approaches one quarter: about one in four of the readings of data values is

21:19 a miss, relative to the naive 4 n squared reference count. Any questions on this, or does it make sense?
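The large-cache counting above can be checked mechanically. The following is a small sketch (my own, not from the lecture) of a never-evicting "large cache": replaying the inner-product loop's accesses and counting only first touches reproduces the n squared plus 2n total.

```python
# "Large cache" model: no evictions, so the only misses are compulsory
# (first touch of each word). For the inner-product matrix-vector loop
# this gives n^2 misses for A, n for x, and n for y: n^2 + 2n in total.
def cold_misses_matvec(n):
    seen = set()
    misses = 0
    def touch(addr):
        nonlocal misses
        if addr not in seen:   # first touch of this word: compulsory miss
            seen.add(addr)
            misses += 1
    for i in range(n):
        for j in range(n):
            touch(('a', i, j))   # read a[i][j]
            touch(('x', j))      # read x[j]
            touch(('y', i))      # read y[i]; the write hits the same word
    return misses
```

Out of the 4 n squared references, only n squared plus 2n miss, which is where the roughly one-quarter miss ratio comes from.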

21:32 Okay, I presume it makes sense. Now, suppose the cache is

21:48 small relative to the problem size. Then what happens? Oh, let's see. There

22:00 is no difference for what A does: it has to be read. Um, so there's no way around

22:06 it, for that. But what happens to X and Y? Now, the

22:21 vector X, because the cache is small, will not fit in the cache.

22:30 So if you look at this loop, that is kind of illustrated by

22:35 the boxes on top there. In the first situation, being in the inner

22:43 loop, we were at X with index J equal to one, and then the next one was J equal

22:50 to two, et cetera. So that, by the time one comes to J equal

23:01 to two, we have four elements already in there, and after J equals two is done,

23:08 there are an additional two elements: uh, X one is still there, but now we

23:14 have A one two, and X two also gets loaded into the cache. And so on:

23:20 for every J, we add two more elements, uh, into the cache. And at some

23:29 point, that means one runs out of cache size, if it's a small cache.

23:37 So then, when one comes around to computing Y two, that means, if the cache

23:50 uses the typical, you know, model of least-recently-used replacement, that

23:59 X one is no longer in the cache; it has been overwritten, because the

24:05 cache was small. The same applies to A, when you get to the next row.

24:10 But A is not being reused anyway, so that doesn't really matter in terms of

24:16 cache misses. So here's what happens, yes. Um, the number of misses

24:27 for X: as we go through the first row, yeah, with index J in the inner-

24:34 most loop, there are the n real compulsory misses. But then, when you come to doing the computation

24:42 for the next I value, the, uh, the X values you need are no longer

24:50 in the cache. So that means, for each additional row we're going to compute, one

24:58 has to reload the entire vector X. So now one basically gets

25:06 n squared misses for X. For Y, the case is better, because if you

25:16 do it in this particular order, Y only occupies one place, and it stays there

25:25 as you keep going over a given row. So now, these were the compulsory

25:34 misses, and these were the capacity misses, and the capacity misses are only a concern for

25:41 X. So now we have a total number of misses that is

25:48 2 n squared plus n, instead of n squared plus 2n. This slide basically just shows this.
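The small-cache case can be checked the same way, by replaying the access stream through a tiny fully associative LRU cache. This is my own sketch of the replacement model the lecture assumes (one word per line, least-recently-used eviction), not code from the course:

```python
from collections import OrderedDict

# Small fully associative cache with LRU replacement, one word per line.
# With a capacity far below n, x is evicted between rows and reloaded
# every time, so the miss total grows to 2 n^2 + n, as derived in the
# lecture (n^2 for A, n^2 for x, n for y).
def lru_misses_matvec(n, capacity):
    cache = OrderedDict()
    misses = 0
    def touch(addr):
        nonlocal misses
        if addr in cache:
            cache.move_to_end(addr)        # hit: mark most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    for i in range(n):
        for j in range(n):
            touch(('a', i, j))   # read a[i][j]
            touch(('x', j))      # read x[j]
            touch(('y', i))      # read/write y[i]
    return misses
```

With a very large capacity the same simulator reproduces the large-cache count of n squared plus 2n, which is a handy sanity check on the two models.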

25:55 Um, uh, as I said towards the bottom of the slide, and I'll talk about

26:02 that on the next slide, the analysis here assumes an LRU

26:09 replacement strategy. But it's important to realize that all of this is because the cache is small, so that

26:17 the entire vector X will not fit. One only needs one element at a time

26:23 of Y and A, so those are not such a big issue. So here is,

26:32 a little bit more carefully now, doing the LRU type of analysis. And,

26:40 um, I want to figure out, as I said, you know,

26:43 for a given problem size, can the cache be viewed as a small or large

26:48 cache in this case? Um, when does the cache become small relative to the problem

26:57 size? When the cache is smaller than one row of A, plus the entire vector X,

27:10 and, um, the element of Y that one is working on. So that

27:17 is 2n plus two, roughly. So, when

27:27 the cache size is roughly 2n, that is when the cache starts to behave as small, as opposed to a large

27:37 cache. Now, to be a bit more careful, the threshold is actually 2n plus a couple of elements:

27:42 because both one row of A and the row's worth of X previously used are held, plus Y,

27:53 and one more element needs to be in the cache, because, you know, one wants

27:57 to, um, still have the first element of X there when moving to the next iteration. Roughly

28:08 speaking, so, that's when it becomes small: when the matrix dimension n is

28:18 greater than half of the cache size,

28:22 then the cache starts to act as in the small cache case, rather than the large cache case. Any comments or

28:34 questions on this analysis? So that's kind of the behavior.

28:52 So, what do you do about it, to try to, oh, make, uh,

28:59 your code, mm hmm, make effective use of kind of a small cache?

29:10 So here is kind of, then, what you see in the different scenarios: when n is

29:15 smaller than cache size over two, it kind of behaves as a large cache,

29:21 and once n becomes greater than half of the cache size, then it starts to behave

29:27 as a small cache. Um, now, scenario two was still using the simplistic

29:37 cache line size of one element, but the difference is to interchange the

29:45 loop order, and it's then known as an axpy-

29:54 type computation, after the operation being used: "axpy" stands for "a times x plus y". And in this

30:01 case the inner loop order is such that what happens is, instead of computing

30:07 inner products one at a time, one scales the columns of A, and that means one gets

30:15 a partial contribution to the entire result Y. And then one keeps adding up

30:22 these partial contributions to the result Y, as scaled columns of A, I

30:33 should say. But, um, marching through the same analysis, it turns out

30:41 that the behavior in the end ends up being the same. So, no further analysis of

30:49 that; you can work it out on your own.
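Scenario two, the interchanged loop order, looks like this in a minimal sketch (again my own illustration, mirroring the C-style loops of the slides): the inner loop now adds a scaled column of A into all of Y, instead of forming one inner product.

```python
# Loop-interchanged ("axpy"-style) matrix-vector multiply: the inner loop
# accumulates x[j] times column j of A into the whole result vector y,
# so y is built up from partial contributions, one scaled column at a time.
def matvec_ji(A, x):
    n = len(A)
    y = [0.0] * n
    for j in range(n):          # pick one element of x
        for i in range(n):      # y += x[j] * (column j of A)
            y[i] = y[i] + A[i][j] * x[j]
    return y
```

Both loop orders compute the same result; as the lecture says, the miss analysis for this version comes out the same as for the inner-product version under the element-wise cache model.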

30:53 Now, the next thing is to make it a little bit more realistic, and see what happens if

30:59 one now has a cache line size that can be greater than one: some arbitrary line size

31:07 b. Commonly 64 bytes, or b equivalent to eight double-precision numbers, which is common today;

31:15 maybe 128 is used in some cases. So, still the large cache model. But

31:25 now the difference is, for every miss, one gets b elements, because cache line sizes are

31:34 b; so the number of misses simply just gets scaled by the cache line size

31:39 b. Say. So that's true for all of them; otherwise the logic is the same.

31:44 So now, what I want to get across is that the miss count of the cache

31:50 simply gets scaled down by the size of the cache line. So that's part of

31:55 the reason why one has long cache lines: so, 64-byte, and one can have longer,

32:03 cache lines. But then, what happens is, depending upon how you use your data,

32:11 the penalty to bring in a cache line increases with its size, so it's not true

32:19 that longer is universally good. If, you know, you use all the elements in

32:23 the cache line, it is good; but if you only use a few, or one element,

32:29 in a cache line, a big line can, uh, then can in fact be detrimental

32:35 to performance. So there is a trade-

32:41 off, and I discussed that a little bit when I talked about memory systems: the trade-off in cache line sizes and cache

32:46 sizes. But for now, this holds: yes, it's effective in reducing the number of

32:52 misses. And, um, the same thing one can do with the small cache

32:59 model, and there essentially all the terms except one get scaled by the cache line size b,

33:08 so I won't do a long dive into that any further. And so, this just shows

33:18 the scaling with the corresponding line sizes; in the

33:24 various cases the graphs used four elements per line, or the eight elements that is typical today. Um, and this was, you know,

33:33 also done for the axpy loop order, and the analysis is similar. Sorry.

33:45 I don't think there is much more worth trying to do with that, other than that, of

33:54 course, once one has a cache line size b, the b shows

34:02 up in a number of places, and the transition now happens, if LRU is used,

34:11 at, um, basically the cache size divided by the cache

34:21 line size; whereas if one really had the full degree of freedom of figuring out what

34:27 to replace in the cache, then it would still be similar: the transition happens, um, then, at about

34:34 cache size over line size. So one gets a little bit of this type of picture, which

34:41 shows, depending upon whether the, uh, cache line size is one or greater than one,

34:54 when the transition happens. So, any questions on this? Mhm. All

35:12 right. So that is a potential thing to watch for in the projects, and to try to understand.

35:20 Um, and it's a useful thing to know what the cache line size is when trying

35:27 to interpret your results. No? Okay. So, what can one do to try

35:39 to improve, potentially, the behavior of the code? Well, option

35:50 number one was blocking. So the basic idea is to, well, partition the two-

36:01 dimensional index space into, um, tiles, or blocks, and instead do a collection of small matrix-

36:12 vector multiplies, where each one of those small matrix-vector multiplies fits in the cache,

36:21 and thus behaves as if there was a large

36:29 cache, even though it's not. And then the outer loops scan over the blocks

36:38 in order to get the whole job done. All right. So, when one performs the computation this way, then

36:47 one will notice that one does get better performance, and I'll show more of that when

36:54 I talk about matrix-matrix multiplication later, where it's much more significant; then there

37:00 is more for me to comment on. But this is just the very simple example that's good to know

37:04 from the matrix-vector multiplication case. Let's see, next. Yes. Um,
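The blocking idea just described can be sketched as follows (a minimal illustration of mine; the tile size name T is an assumption, to be tuned so a tile of A plus the matching pieces of X and Y fit in cache):

```python
# Blocked matrix-vector multiply: the index space is partitioned into
# T-by-T tiles, and each small sub-multiply touches only about T elements
# of x and y, so each tile behaves as if the cache were large.
def matvec_blocked(A, x, T):
    n = len(A)
    y = [0.0] * n
    for ii in range(0, n, T):              # outer loops scan over blocks
        for jj in range(0, n, T):
            for i in range(ii, min(ii + T, n)):     # one small tile:
                for j in range(jj, min(jj + T, n)): # fits in cache
                    y[i] += A[i][j] * x[j]
    return y
```

The result is identical to the unblocked version; only the traversal order of the index space changes.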

37:20 here is, a little bit, I guess, a comment on this blocking algorithm. So the

37:30 order in which one traverses the blocks is also important for the performance. So, depending upon

37:39 how the layout is in memory, one can traverse one way or another. What I mean is, one can

37:46 do things by column or by row, in terms of the elements inside the blocks,

37:54 and then, when one moves from one block to another, it's also a choice whether one moves along

38:00 block rows or block columns. And it's the traversal that affects the memory performance as

38:09 well: it does affect the reuse of data in the cache, but it also affects

38:16 the DRAM performance, depending on how things are laid out in memory. So the

38:24 particular orderings, um, that you see on the right-hand side there are kinds known as

38:33 Morton orderings. The people that, I guess, first investigated this for real

38:42 computers tried to systematically change the traversal of the index space for I and J,

38:50 such that one or the other, or some blend, of the traversals of memory results in attempts

38:58 to optimize the DRAM performance. And this is kind of what actually happens then,

39:09 shown here: um, miss ratios for a very simple blocked cache model. So low is good,

39:19 because misses are not good. And this shows that, ideally, for the proper block size,

39:27 things behave according to the prediction. Uh, well, this is just the same thing

39:39 summarized; there's nothing that I haven't already said. Um, so now, um,

39:51 I will, I guess, I should mention that this strip-mining, this way of partitioning the loops,

39:58 is a piece of vocabulary that compiler people tend to use quite often: um, one partitions a

40:09 loop into, as I said, still two loops, where the inner loop's

40:16 length, that is, its iteration space, is determined by the cache size that it is

40:29 intended for. So the discussion I just made was for a very simple model where

40:35 there's one cache and one main memory, but in practice it has to be applied at

40:42 each level of the cache. That is, if I'm trying to optimize for, um, the three

40:50 levels of cache that are common in today's CPUs, one does this tiling for each one of

40:56 the three caches, for each loop level. So that is, it kind of destroys

41:08 what one may have learned about how to write nice clean code, because the code

41:17 gets restructured and starts to look, well, all dependent on the particular underlying architecture, in the

41:25 real world, to get performance. Um, the depth of the nested loop levels, as I

41:33 said, is a function of the number of caches and the iteration range. And the lengths of the

41:41 inner loops depend on the cache sizes at the various levels, but they can be parameterized,

41:46 so that doesn't need to be hard-coded. There is a dependence on the particular parameters

41:54 of the processor, though. Okay. Next is matrix-matrix multiplication. So,
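Strip-mining in the compiler sense just mentioned can be shown on the simplest possible loop. This is a sketch of mine, with the strip length T standing in for the cache-derived parameter; nesting this transformation once per cache level gives the multi-level tiling the lecture describes.

```python
# Strip-mining: one loop of length n is split into an outer loop over
# strips and an inner loop whose iteration space is bounded by the
# cache-derived parameter T. The computation itself is unchanged.
def strip_mined_sum(v, T):
    n = len(v)
    total = 0.0
    for jj in range(0, n, T):                  # outer loop over strips
        for j in range(jj, min(jj + T, n)):    # inner loop of length <= T
            total += v[j]
    return total
```

Because T is a parameter rather than a hard-coded constant, the same code can be retuned for a different processor without restructuring.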

42:05 any questions? I don't see anything. Okay, well, let's now

42:24 do the thing for which this idea of tailoring things to the cache structure is more

42:33 important, compared to the matrix-vector multiplication. So, so here is the generic

42:44 matrix-matrix multiplication code, written with, um, the two innermost loops doing a matrix-

42:50 vector multiplication, and the inner loop itself doing an inner product: it takes a row of

42:55 A times a column of B, to generate one element of, you know,

43:02 the product matrix C. And the difference compared to the matrix-vector multiplication is

43:10 kind of the outermost loop, on I, which basically carries things out as a sequence of

43:20 matrix-vector, um, multiplications. So then we have, as we know, there are

43:31 2 n cubed operations to do a matrix multiply of n-by-n matrices. Um, you

43:41 can see where the two comes in: one of them, uh, is the multiply, and the other

43:47 is the add. And there's, uh, each inner product, each row of A times a

43:56 column of B — a row of A times a column of B is 2n operations — and then that's

44:02 repeated for every column of B and every row of A, so all of a sudden we get

44:10 to 2 n cubed. And you just note, you know, the matrices A, B,

44:19 and C each have n squared elements. So in total there are 3 n

44:26 squared, 3 n squared, words that need to be loaded from memory, in principle.

44:36 So that means that there is, ideally, um, an arithmetic intensity of basically order n:

44:48 basically two n divided by three, comparing the 2 n cubed operations and the 3 n squared data

44:57 words. So, so if one can come up

45:04 with algorithms such that data reuse works well, one

45:14 should be able to, yeah, kind of reduce the required memory bandwidth, um, correspondingly,

45:21 proportional to n, potentially. And we'll go through more about it. But the idea is

45:28 that, since now there is data reuse that is significant — it's order n — I would

45:38 like to take advantage of it. For the matrix-vector multiplication, there isn't much:

45:46 A is n squared words, and X and Y are order n. And I can reuse

45:57 X. But again, there are 2 n squared operations and there are n squared

46:08 words, so the maximum reuse is approximately two. So that's why matrix-matrix multiplication is

46:15 a much more interesting example for figuring out how to get data reuse

46:22 and to use caches. Yeah, now we're going to apply the ideas from the matrix-vector multiplication case.
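The triple loop nest just described, in the same sketch style as before (my own Python rendering of the C-style code on the slide):

```python
# Generic matrix-matrix multiply, C = A B: three nested loops, where the
# innermost loop forms the inner product of row i of A with column j of B.
# It performs 2 n^3 arithmetic operations on 3 n^2 words of data, which is
# why a reuse of order n is available in principle.
def matmul_ijk(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):      # one multiply and one add per iteration
                C[i][j] += A[i][k] * B[k][j]
    return C
```

For fixed i, the two inner loops are exactly a (vector-times-matrix) matrix-vector multiply, which is the structural link back to the previous analysis.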

46:31 So, um, there is the performance annotated for this particular loop structure. Yeah,

46:40 yes, there are the small and large cache models, and we'll talk about that. But

46:46 first, here is kind of what's happening: we'll annotate this code a little bit in

46:53 terms of data references, um, and depending upon whether the cache is small or large.

47:01 So, um, and I will say a little bit more, more in terms

47:13 of that structure, during the analysis. So it's essentially, just as on this slide: this makes

47:24 nothing, um, more of this code

47:32 than putting in the outer loop, on top, to do this sequence of matrix-vector multiplies; everything behaves as a matrix-vector multiply in this

47:39 particular case. So let's look at it a little bit more, depending upon what the

47:46 assumptions are on how large the cache is. So, for the arithmetic, the calculation in

47:57 terms of memory references on this slide: yeah, from the loop nest we know

48:06 that this is the inner-product version, so one goes through a row of

48:16 A, and then, one by one, the columns of B. And then, when the

48:24 outer loop, the J loop, then, um, moves on, one reuses the same row of A but

48:37 uses a different column of B, so it's a full matrix-vector multiply for each iteration of the outermost loop.

48:47 And then you do a bunch of these matrix-vector multiplications. Okay, so in

48:58 that case, what happens is that, if the cache is not really large, so that not everything fits

49:08 in cache, what's happening is that when you go through the J loop, as well

49:18 as the K loop, the two inner loops, then you have effectively read the

49:24 entire matrix B. And then you have to do the I loop. So, in this

49:34 scenario, basically, the matrix B gets read every time: it's read n times. And so the

49:41 references, memory references, for B in this case are n cubed. With the assumption that

49:46 a row of A fits in the cache,

49:53 that means A is only read once: a row doesn't need to be reloaded for every

49:58 column of B that you use. So that's why A is read n squared words, no more. And similarly for C: it kind of

50:03 stays in cache, so, um, the loads and stores of C is, uh,

50:08 n squared. So that's what you get. So now, if one looks at the

50:14 arithmetic intensity for this particular loop nest, with the assumption, um, of the cache being large enough to hold

50:22 essentially a row of A, plus a few more elements, one gets an arithmetic intensity of two, or so,

50:32 for the number of arithmetic operations per word

50:38 of memory reference: the same thing as for the matrix-vector multiplication. So that is not so

50:45 great, because in principle one should be able to get, um, an arithmetic intensity that approaches two-thirds n. It's just, um, a matter of doing

50:57 a good job in terms of figuring out, oh, how to orchestrate the

51:02 loops, or partition the loops. So,

51:13 the typical way of looking at this is that one talks about the various orders of

51:19 the loops. You know, in the matrix-vector case, we looked at the inner-

51:27 product and the axpy versions, where the difference was to interchange the loop order

51:32 of the two loops we had in that case. Now we have three loops, for

51:37 which I am going to label them I, J, K, for the loop index that

51:42 is being used, and the order in which they appear, with the first being the outer-

51:47 most loop and the last, the innermost loop. So one of these six permutations is how I'm going to order the loops.

51:58 So, to comment on what happens in, say, I J K, or J K

52:11 I: so, one thing to consider is that now one needs to worry about the access

52:18 order, and how it affects cache behavior. So, in that case we have the

52:27 locality measures we talked about, way back, when I talked about memory systems:

52:33 that is, temporal and spatial locality. So, if you look at K being the inner-

52:43 most loop, regardless of the order of the two other loops, we can notice,

52:54 for A, for instance, that as the index K changes, we

53:04 proceed from one element in a row to the next element in the same

53:12 row. So that means A actually has good spatial locality. For C, it turns

53:23 out it's good as well, okay, in this case, because in this case it also holds

53:27 for the loop over J, because we have C I J, and the next time, C I

53:35 J plus one. So basically, C is accessed in order, in the row-major

53:41 order. That means C also has good spatial locality, plus temporal locality, and this is

53:47 because C I J is used in every iteration of the K loop. Now, for B, things are

54:01 not so good. I didn't list B on this slide, which listed the good cases,

54:07 because with B in the innermost loop, we go traversing B column by column.

54:18 So it goes from one element of a column to the next element in the

54:22 same column. But those are separated, in the row-major order, by the length

54:30 of the row of B. So, in that case, the spatial locality for B is

54:38 poor, and there's no temporal locality for B either. Similarly, we can then look at I:

54:50 so, I put I in the innermost loop, and it turns out then, if you

54:56 look at, uh, the temporal locality in this case, uh, for B: if I is

55:08 in the innermost loop, um, B stays the same every iteration; there's no

55:14 dependence on I. So it has good temporal locality. Um, and that's pretty

55:22 much all one can say that is good, because as you step through from I to I plus one,

55:28 with I in the innermost loop, that

55:35 traverses both A and C by columns, that is, from an element in one

55:41 column to the next element in the same column. That is an access with a

55:46 stride equal to the length of the rows of A and C. So that's bad

55:53 spatial locality, as well as no temporal locality: so those things are bad. And the final

55:59 case was to look at J in the innermost loop, and you can figure out what the spatial and temporal locality is there by the

56:06 same reasoning. So these orderings are different, and that is going to result in different

56:16 cache behaviors and performance. There's a question in the chat.
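For reference, two of the other orderings just compared can be sketched like this (my own illustrations; all six permutations compute the same C, and only the access pattern changes):

```python
# With j innermost (here the k-i-j order), a[i][k] is loop-invariant in
# the inner loop, and both B and C are swept along rows, which is the
# favourable pattern for row-major storage.
def matmul_kij(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):      # B and C accessed row-wise
                C[i][j] += A[i][k] * B[k][j]
    return C

# With i innermost (the j-k-i order), b[k][j] is loop-invariant (good
# temporal locality), but A and C are walked down columns with a stride
# of a full row length, the bad case from the analysis.
def matmul_jki(A, B):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for i in range(n):      # A and C accessed column-wise
                C[i][j] += A[i][k] * B[k][j]
    return C
```

Both produce the same product as the i-j-k version; the difference shows up only in cache behavior.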

56:23 Rodman is asking if transposing the matrix B can help in an I J K loop order.

56:30 Yes. So, if you transpose B,

56:37 then you pay the price for the transpose, which then has a cost, and you change

56:44 when and how you stride through the matrix. And, depending upon how you do it, well,

56:51 it also costs you extra memory, if

56:56 you don't do it kind of in place. So, yeah. So there is this notion,

57:08 which happens practically, that the compiler has one rule for laying multi-dimensional arrays out in

57:14 memory; it doesn't have different rules for the different operands. So this set of

57:21 slides is assuming that everything is kind of laid out, probably, in the same way. What

57:30 changes — the actual, what's good spatial and temporal locality — changes if this were column-

57:37 major order. But we would get kind of the same effects; it's just a difference in

57:42 which orders are good and which ones are bad. You see the same behavior also

57:47 in column-major order; but it's the same point: if one operand is transposed relative to

57:56 another, then, yeah, it affects what's good in terms of spatial and temporal locality. Yeah.
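The idea raised in the chat question can be sketched as follows (my own illustration): B is transposed once, at an order n squared cost in time and extra memory if done out of place, so that the innermost k loop strides along a row of the transposed copy instead of down a column of B.

```python
# i-j-k multiply with B transposed up front: after the O(n^2) transpose,
# the inner loop reads both A[i][k] and Bt[j][k] row-wise, avoiding the
# column-stride accesses to B that the plain i-j-k order suffers from.
def matmul_ijk_bt(A, B):
    n = len(A)
    Bt = [[B[k][j] for k in range(n)] for j in range(n)]  # Bt[j][k] = B[k][j]
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):      # both operands now stride along rows
                C[i][j] += A[i][k] * Bt[j][k]
    return C
```

Whether the transpose pays off depends on how often the product amortizes the one-time copy, which is the trade-off the answer points out.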

58:08 working through this thing with scenario that just talked about what for this I

58:16 K case. What? Yeah, cache misses are for cash flow and

58:24 of B. Um there's a large scenario and then now if one does

58:31 little bit from a small cash scenario I can work it through and one

58:36 uh numbers are similar to what was case they made to expect the multiplication

58:45 and a quarter of the memory accesses mrs and they go back and look

58:58 the slides from the matrix vector multiplication because again this jK lu porter is

59:04 , yes, a collection of accepting and you should get a similar

59:13 Now, what's being done on this

59:21 slide is to work it out for all six loop orders,

59:28 or rather the three cases that matter: which loop is the

59:34 innermost. One gets these expressions on the right-

59:39 hand side, and you can try to work them out yourself before looking at, I think, the

59:45 next slide. But if you look at them in terms of the miss ratio: with i

59:52 in the innermost loop is the worst case, about 0.5 misses per access; with k in the

60:00 innermost loop it is somewhat better, about half the number

60:08 of misses; and with j in the innermost loop it is the best, because one gets

60:15 the benefit from the cache line size. So for a common line size,

60:23 that's eight doubles or sixteen singles, which is a significantly lower number of cache misses than

60:32 either of the other two. So j kind of is the best to have innermost,

60:42 k is worse but not the worst, and i would be the worst to have in the innermost loop.
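To make the loop-order point concrete, here is a minimal sketch of the worst and best orders for row-major storage. It is in Python rather than the C/Fortran of the lecture, so it shows the access pattern rather than actual cache timing:

```python
def matmul_jki(A, B):
    """C = A * B with i innermost: C[i][j] and A[i][k] are accessed
    down a column, stride n in memory -- the worst case in the
    miss-ratio analysis, since each access touches a new cache line."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for k in range(n):
            for i in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B):
    """C = A * B with j innermost: C[i][j] and B[k][j] are accessed
    along a row, stride 1 -- the best case, since each miss is
    amortized over a full cache line (8 doubles for a 64-byte line)."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C
```

Both orders compute the same product; only the stride of the innermost loop, and hence the miss ratio, differs.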

60:51 Here is something, a slide I borrowed from way back. So it's an old

60:56 Pentium, but in principle the behavior is similar for any modern processor like

61:05 this one. So what we have: with j in the innermost loop, which was the

61:12 best case in the analysis, if you look, there is kind of

61:23 a purplish line that is very different from the other colors, towards the bottom

61:29 of this slide, and that shows the difference. Then we had the

61:37 intermediate ones, with k in the innermost loop okay, in sort of the bluish

61:46 and yellowish colors, depending upon the order of the outer two loops. But there it is

61:51 okay; k innermost is also okay, as I said, but with

61:55 different numbers, right, by a clear factor. So that shows again considerably more misses

62:01 than the best case. And i in the inner loop was the worst case.

62:08 So i in the innermost loop is the worst case, and those lines are

62:15 hard to see; there are actually two colors, a kind of brownish

62:19 and then a purplish that does not stand out much. And then we can see a little

62:28 bit of strange behavior at the beginning here. So, first of all, I

62:37 guess the simple analysis that was done is a good indication of what one should do in

62:46 terms of ordering the loops. Now, that was for row-major order, and it

62:53 doesn't hold for column-major order; the good loop order would be different there.

62:59 One just needs to do the similar analysis to figure out which loop order would

63:04 be best with respect to cache misses. Now, anyone have a comment on what the strange behavior

63:12 is, um, close to the origin in the plot? Well, it's simply that for

63:30 small values of n the cache behaves like a large cache.

63:36 There are not many misses. So for this particular case, this is

63:40 where a transition happens between the cache behaving as a small versus a large cache,

63:50 and one can figure out this transition point from the cache size. The fact is

63:59 there are three operands, A, B and C, and each one should fit in

64:04 the cache for the cache to behave as a large cache. So one can

64:13 very easily figure out when this transition is about to happen, and then match it against

64:19 one's performance data and see if it makes sense with what you observe. Um, in
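As a back-of-the-envelope sketch of this transition point (the 32 KiB cache size and 8-byte doubles in the example are assumptions for illustration, not numbers from the lecture):

```python
def transition_n(cache_bytes, elem_bytes=8):
    """Largest n for which three n x n matrices (the operands A, B
    and C) of elem_bytes-sized elements still fit in a cache of
    cache_bytes. Beyond this n the cache starts behaving as a
    'small' cache and the large-cache miss counts no longer apply."""
    n = 1
    while 3 * (n + 1) ** 2 * elem_bytes <= cache_bytes:
        n += 1
    return n

# Example: a 32 KiB L1 data cache with double-precision elements.
# 3 * n^2 * 8 <= 32768  ->  n <= sqrt(32768 / 24), i.e. n = 36.
```

One can then look for a knee near that n in the measured performance curve and check whether it matches what is observed.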

64:26 this case. So, a few more slides to try to figure out how to make this

64:33 really good. This is again what I said in terms of figuring out the

64:38 transition, and this is just a summary of what they did way back in terms

64:46 of testing loop orders as an assignment. Anyway, now one more thing about this

64:57 loop-order business. Now, going back to what we learned from the matrix-vector multiplication,

65:04 which was that one should do a tiled version: tiling did make better use of the

65:12 cache. So the previous point was just about making good use of the cache lines,

65:19 and that's still relevant, that doesn't change, but now there's the additional feature of

65:26 trying to minimize the number of cache misses by partitioning loops into tiles. So in

65:38 this case the best one can do is treat the matrix-matrix multiplication as blocks, and

65:44 then traverse those blocks in an order that is similar to the one

65:54 we talked about when the tile size was one. Instead of — okay,

66:01 there's kind of no conceptual difference: instead of thinking of element multiplications

66:07 in the matrix-matrix multiplication, one now thinks of each element being replaced by a

66:15 submatrix of size b by b, and if one does that, one can go through

66:23 the same analysis in terms of the number of loads and stores one needs to do,

66:32 and then one actually discovers that if, again, the blocks of size b by b fit in

66:41 the cache, then one now has an arithmetic intensity that is proportional to

66:53 b. So, what I wanted to mention about blocks: the larger the block size

66:58 one can get, the better, and using this will reduce the need for memory

67:07 bandwidth in proportion to b. So this is what's being done in most codes today.

67:17 One then essentially figures it out from here: you know that A, B

67:24 and C are sort of the three operands, each block of size b squared. A simple analysis,

67:30 and one chooses b such that three b squared is less than the cache

67:35 size. And then one can play a bit with the numbers and figure out what b should

67:41 be. But that's the way one makes effective use of caches. And then, as

67:48 I mentioned, one needs to repeat this for each level of the cache.

67:52 So basically one does blocking for level one, blocking for level two cache, and blocking for

67:57 level three cache, to get things to work well. This is,
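A minimal single-level sketch of this tiling (pure Python for readability; the block size b is the tuning parameter one chooses so that three b×b blocks fit in the target cache level):

```python
def blocked_matmul(A, B, b):
    """C = A * B computed block by block. For each (i0, j0) a b-by-b
    block of C stays hot while b-by-b blocks of A and B stream
    through; if 3 * b * b elements fit in cache, each element loaded
    is reused about b times, so the memory-bandwidth requirement
    drops in proportion to b."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # multiply the (i0,k0) block of A by the (k0,j0) block of B
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Blocking for multiple cache levels repeats this pattern: an outer set of block loops with a larger b for L2/L3 around an inner set with a smaller b for L1.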

68:05 as it turns out, provably a good thing to do. For anyone with a theoretical interest, there is

68:10 something called the red-blue pebble game, which was done by Kung and one

68:17 of his students way back, and it shows that this is indeed the optimal way

68:24 of doing business. And this concept of blocking applies to pretty much any type

68:33 of computation that you do. And so even though it's easier to

68:40 understand for the matrix computations than for many other scenarios, because they are

68:48 fairly simple — and I know, for the one group that wants to do

68:55 FFTs for the project — we applied this recently also to FFT

69:04 codes, and how the blocking is done there in order to get good performance for

69:11 these, and I'll try to talk about that next time. Let's see what else.

69:16 And this is just saying to do the same thing for all the

69:20 different caches to get good performance. This is a little bit about

69:30 what happens in terms of optimization, something people sometimes like to see: again,

69:39 both what happens at the transition to small-cache behavior and the effect of blocking.

69:48 And yes, there are software packages out there. One of the oldest in

69:58 this case is known as ATLAS, which stands for Automatically Tuned Linear Algebra Software, and

70:07 what it does is basically an exhaustive search of loop orders and tilings for

70:17 all cache levels in the processor. So, um, it's a pretty extensive search

70:27 that is being done, but the next slide here shows how it

70:35 competes with vendor-optimized libraries. It is an old slide — I haven't seen a

70:41 newer comparison — but it tells you that the blue is the ATLAS version, the

70:52 exhaustive search to try to find the best loop ordering and tiling, and then

71:02 it compares to, for one, the vendor library for this particular platform,

71:08 and then there's another old Fortran version, so I guess one can compare the

71:15 search-based optimization against what can be done in different ways. Obviously this fairly exhaustive

71:23 search procedure did indeed come up with near-optimal tiling and loop order compared to the

71:35 vendors, and sometimes it actually even beat what the vendor library did. Now the

71:40 horizontal axis is different machines that were around at this particular time.
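The ATLAS idea can be caricatured in a few lines. This is an extremely reduced sketch, not ATLAS itself: ATLAS searches loop orders and tilings for all cache levels, while here we only time one block-size parameter of a simple blocked kernel on the actual machine and keep the fastest:

```python
import time

def blocked_kernel(A, B, b):
    # single-level blocked C = A * B, the candidate being tuned
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune_b(n=64, candidates=(4, 8, 16, 32)):
    """Empirical search: time each candidate tile size on this
    machine and return the fastest, instead of deriving the tile
    size analytically from cache specifications."""
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    best_b, best_t = None, float("inf")
    for b in candidates:
        t0 = time.perf_counter()
        blocked_kernel(A, B, b)
        t = time.perf_counter() - t0
        if t < best_t:
            best_b, best_t = b, t
    return best_b
```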

71:48 So I think that's pretty much it, but one more comment. This approach is,

71:58 all right, tailored to a particular architecture, even though it can be parameter-

72:04 ized, so it's not totally — I shouldn't say it is totally architecture-specific;

72:15 that's a better way of saying it. But it does introduce a certain

72:21 amount of architecture dependency that can be parameterized. So part of the community said

72:29 they wanted to eliminate all of this, not the ATLAS way by doing search,

72:33 but instead by writing code that is cache-oblivious. That means that they were

72:43 trying to come up with ways of writing codes such that

72:49 the compiler would then do the right thing, and the code would end up with

72:55 optimal performance, or close to it, without an extensive search through parameter space to figure it

73:04 out. So in doing so, the idea was to do matrix-matrix multiplication,

73:13 for instance, as a recursive procedure, which one can certainly do. So you think of

73:20 the matrix multiplication as being a multiplication of matrices of two-by-

73:30 two blocks, and then each one of these two-by-two block multiplications is

73:35 recursively divided into smaller two-by-two blocks, and eventually, at the bottom level of

73:43 the recursion, things would fit in the cache and things would run

73:48 well; things would just happen without anyone worrying about what's going on. It turned out

73:54 that in practice it didn't do so well. So that's why, in the end, most of the

74:02 libraries are not using the cache-oblivious approach to writing code; they do the

74:09 parameterized tiling and loop reordering. So here is
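The recursive scheme described above can be sketched as follows (an illustrative Python sketch, assuming n is a power of two times the base size; the base-case cutoff is an implementation convenience, not a cache parameter):

```python
def rec_matmul(A, B, C, i0, j0, k0, n, base=16):
    """Cache-oblivious C += A * B on the n x n sub-blocks starting
    at (i0, k0) of A, (k0, j0) of B and (i0, j0) of C. Each call
    splits the problem into eight half-sized block multiplications;
    at some recursion depth the working set fits each cache level,
    without any cache size appearing in the code."""
    if n <= base:
        # bottom of the recursion: plain triple loop on a small block
        for i in range(i0, i0 + n):
            for k in range(k0, k0 + n):
                a = A[i][k]
                for j in range(j0, j0 + n):
                    C[i][j] += a * B[k][j]
        return
    h = n // 2
    for di in (0, h):
        for dj in (0, h):
            for dk in (0, h):
                rec_matmul(A, B, C, i0 + di, j0 + dj, k0 + dk, h, base)
```

In principle the recursion blocks for every cache level at once; as the lecture notes, in practice explicitly tiled code has tended to win.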

74:19 kind of, uh, a summary of what I said, and why it didn't work.

74:26 So this is kind of the bottom line of this:

74:33 the cache-oblivious approach is a smart way of writing

74:39 codes, but it didn't work so well, so the winners always ended

74:42 up being the explicitly blocked versions. But there are some good references

74:49 for both the cache-oblivious and the cache-conscious approaches.

74:58 So at that point, time is up, I think, and I'll take some questions. Uh,

75:05 yeah. So I think the message is that loop orders are important and tiling is important,

75:19 and to get good performance we need to think about them. Sometimes one is lucky and the

75:25 compilers actually do the job automatically, but most of the time, unfortunately, even

75:33 though in principle it should work, in practice, because of the way the caches and instruction

75:39 scheduling work, it doesn't. So I will stop sharing the screen at this

76:00 point. We will also, mm, stop the recording.
