00:03 So today I will start leading into programming on GPUs. First, I'll

00:17 talk about what's known as vectorization, which is something one needs to do for

00:25 the vector units in today's processors. Um, but first, just a little bit of

00:37 a recap of where we are and what we've done so far.

00:43 Basically, the first lectures and assignments were about tools, and talked about

00:51 processor and memory architectures, to get the basis for the platforms and their

00:57 behaviors. And then we discussed power and, in that case, measurement.

01:05 Then we talked a little bit about the energy management that is now happening in

01:11 processors, at all levels. And the demonstration was about shared-memory programming

01:20 using OpenMP as the tool for programming, and we talked a little bit about

01:28 what the compilers attempt to do, and how one can help the compilers to be effective

01:33 in doing the code transformations to get efficient codes. Today we move on,

01:42 more or less continuing in terms of the source transformations, but

01:49 with the specific objective to use the vector instructions. And then maybe, if

01:57 there is time, I will start to talk a little bit about how to use,

02:02 ah, accelerators, and for this course particularly GPUs. And, just to say:

02:13 in order to get efficient code, one kind of needs to be able to

02:17 understand all the different layers; one can't just go by the typical, um,

02:25 abstractions and layers. All right. So today, since I haven't yet talked about it, I'll

02:34 explain what this SIMD notion means, in case anyone is not familiar with it,

02:43 and then give examples of source-to-source transformations that, um, are illustrative of what

02:51 compilers try to do. And, as when I talked about the compilers and what

02:56 they try to do: how one can potentially write the code from the start to

03:02 reduce the need for program transformations, or for the compiler to be able to find out what

03:09 changes it can make to the code. And then, at the end, I hope

03:17 to talk about the programming of these vector units, in particular how they are

03:23 put together, as outlined at the top of the slide. So: SIMD

03:30 is also often called data parallelism, and I may use the terms interchangeably, and

03:38 they will start to make sense, if not now, then at the end of the

03:41 lecture. And just for the record, more or less:

03:49 these are the notions that are being used. Um, in the computer design community

03:56 there are various classifications of different types of architectures. Er, um: single instruction,

04:03 single data, your vanilla single-threaded, um, programming model. Um, SIMD is

04:11 what we're talking about today, which is single instruction on multiple data; or

04:17 that's where the notion of vector comes in: there's one instruction that applies to a

04:22 range of data. And then there's multiple instruction, single data, just to cover all

04:29 permutations; that's not really something in the world that is actually being used.

04:36 MIMD is what is used in most of the OpenMP-type architectures and

04:44 OpenMP-style programming models. That means you have unique instructions for, maybe,

04:52 the different statements that you have in your code, or combinations of data, as

05:00 opposed to SIMD, where there was a single instruction applied to a range of data. And the last one

05:06 is fairly new. So the first four were notions that

05:12 Flynn at Stanford came out with many years ago, and they have been sort of persistent over

05:19 decades of computer architectures. SPMD is something that started to happen when

05:26 clusters started to emerge; it stands simply for single program, multiple

05:32 data. So it is the typical model for clusters, where you use the same

05:39 program on all the different nodes. I'll talk more about that when it

05:44 comes to, ah, programming clusters. So what is the basic notion of

05:53 SIMD? Uh, it is basically one instruction, one instruction unit, on this particular

06:05 slide, and the same instruction is used by a range of functional units, which

06:12 sometimes may have their own memory as well. So an instruction, in this kind

06:19 of architecture, er, is in a sense broadcast to a number of execution units.

06:27 This something first emerged in the context of image processing, where typically the

06:34 same operation may be applied, at least in some stages, to all the pixels

06:40 in an image. So it was a very efficient way of doing, or architecting,

06:45 image processing systems. But it also has advantages compared to MIMD, because it

06:54 leads to simpler processing cores, if you like, and that's what's being done:

07:01 indeed, GPUs in particular basically use SIMD cores, as do many other

07:08 processors. It also is, uh, a simpler design, and has a very favorable

07:18 impact on the memory system, since you don't need to load new instructions for

07:26 kind of each combination of data; so there is less pressure on the memory system whenever you

07:33 can use, uh, this mode of operation. And here is just the difference between

07:42 SIMD and SPMD, where, again, the code is replicated across different processing cores, or

07:50 nodes, in the cluster case. So now: any questions on these general things, before we move on to code transformations?
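The SIMD idea just described, one operation conceptually applied to a whole data set at once, can be sketched in NumPy, whose array expressions are exactly this data-parallel notation. The image and the "brighten" operation here are made-up illustrations, not taken from the slides:

```python
import numpy as np

# Hypothetical 4x4 grayscale "image" (values are made up).
image = np.arange(16, dtype=np.float64).reshape(4, 4)

# SIMD / data-parallel view: ONE operation (add 10) applied
# conceptually to ALL pixels at once.
brightened = image + 10.0

# Equivalent scalar, one-element-at-a-time loop, for comparison.
scalar = image.copy()
for i in range(4):
    for j in range(4):
        scalar[i, j] = scalar[i, j] + 10.0

assert np.array_equal(brightened, scalar)
```

The array expression states the per-pixel operation once; how it is mapped onto vector lanes is left to the library and hardware, which is the point of the SIMD abstraction.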

08:01 Yes? Yeah, I think I asked this a couple of lectures ago, I think it was lecture three

08:08 or four. But just to be clear: single instruction, multiple data, versus single

08:10 program, multiple data. Um, the "instruction" refers to, I guess... a really

08:15 boiled-down example would be, say, an add instruction on, uh, two

08:20 locations in memory, and then doing that same add instruction on different things.

08:26 Uh, when you say "same instruction," are we talking about literally the same,

08:32 uh, the same assembly instruction, applied to different memory locations? Um...

08:37 OK. Ah. And for single program, I guess it would be:

08:44 you sort of... you replicate the instruction stream, or not exactly, because you

08:50 could have conditionals or what have you, um, if you're doing it on two

08:55 data sets. So SPMD: could that be viewed as,

08:59 uh, sort of a multi-instruction SIMD, if that makes sense?

09:05 Uh huh. All right. Um, I'd be a bit cautious saying yes,

09:10 because it's so different, in terms of replicating the whole code, ah, compared

09:20 to using the same opcode for a range of memory addresses. So maybe

09:32 the examples will at least make clear how SIMD hardware works. Okay, it will become

09:40 clear. Which is not downloading the same code into multiple processors; or, you know,

09:47 it could be the same code. But when we talk about MPI:

09:57 MPI is basically tying together whole applications, whole programs, as opposed to...

10:04 that is, as I took this image processing example: applying the same operator to all the

10:17 pixels in an image. Okay. Yeah, that I think is a good

10:25 answer. So, so I'll, you know, talk through a number of examples, and

10:33 then

10:40 hopefully the notions will soon become clear. So here is kind of the notion

10:46 that is being used... well, in order to use the,

10:55 uh, vector units for SIMD instructions. So in a way... I

11:04 think this has been on some slides coming up. But in a way,

11:10 vectorization means that you wrap a loop around each statement; that's the main idea, as

11:20 opposed to having a loop with a bunch of statements inside. And kind of trying to

11:27 figure out how to wrap the loop around each individual statement, that will

11:37 hopefully become clear in what I'm talking about. So I'll illustrate some of these transformations on

11:45 the next several slides: uh, how one can initially write code that isn't

11:54 compatible with what we need, and what we need to do to figure out how to change the code

11:58 in order to do that, to invoke the vector instructions. But first I'll talk a

12:07 little bit about things that make it hard, that one needs to take care of. Typical code is

12:15 written for one of the scalar-type units, as opposed to the vector units of vector

12:25 architectures. And there are then different data dependences, overall, that I'll illustrate on the next several

12:31 slides through a bunch of examples. It's a concept that I think one needs

12:37 to be at least familiar with, what they mean. So, a true

12:43 dependence is specifically what it means when one reads a variable after having written

12:49 to it. And the anti-dependence is kind of the opposite: instead, you write

12:56 a variable after having read it. And I'll give examples of how these things

13:02 happen in codes. And then there's what's known as output dependence, where you make

13:09 several writes to the same memory location. So these are kind of the three

13:15 dependences one talks about. And when things are embedded in loops, then some

13:21 of these dependences may look different, because of looping through the sequence of statements,

13:30 and that's what one now calls a loop-carried dependence. So some of these

13:35 three kinds may then occur because of the loop iterations.
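The three dependence kinds can be written out in a few scalar statements. This is a made-up straight-line sequence for illustration, not the code on the lecture slide:

```python
# Made-up straight-line code, annotated with the dependence each pair forms.
y = 2.0          # S1: writes y
x = y + 1.0      # S2: reads y            -> read after write (true) S1 -> S2
y = 3.0          # S3: writes y after the read in S2
                 #                        -> write after read (anti) S2 -> S3
y = x * 2.0      # S4: writes y again     -> write after write (output) S3 -> S4
```

Reordering any of these pairs would change the final values of `x` or `y`, which is exactly what the dependences forbid.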

13:47 So here are now some little examples to try to illustrate, um, these three concepts: the read after

13:53 write and the write after read. So we'll see if you guys are with me so

13:58 far. Maybe some volunteer can figure out the read-after-write dependences in this

14:06 very simple code. Where do we have such dependencies among these four statements in this

14:15 loop body? You've got one on line two, because X was written to on line one

14:24 and then you read it, so there would be a RAW from two on one.

14:30 Say that again? So on line two we have... Right. You have

14:35 a read after write, because X was written on line one. So you would say that line

14:41 two has a read-after-write dependency on line one. So the statement works such that

14:46 the left side is assigned... Yeah. Write after read... I remember...

14:52 okay, maybe I should use arrows instead of the mathematical equals

14:59 signs. But this is an assignment, so the outcome of the evaluation

15:03 of the right-hand side is assigned to the left-hand side. So,

15:14 yeah, so in statement two, obviously, one has a new assignment:

15:19 X is written to after it's read. So, any other dependences of any

15:29 kind, or just... um, okay. And since no one of the three

15:39 has volunteered here, I will. I guess you have a write after read

15:51 from 4 to 1, right, since they're supposedly executed in this sequence.

16:00 So, um, if you were to reorder them, yes, that would

16:06 have consequences. So there was also 1 to 3, right? We have

16:16 the read after write: we read, in statement three, the thing that was written to in statement

16:22 one. And then you also have the read after write from 1 to 4, because you

16:27 read Y in statement four, which was written to in statement one. So you

16:34 have a clear precedence order: you need to execute one before you can execute three

16:40 or four, in order to preserve the semantics of the sequence of statements. So then

16:50 we have... there's something written to after it was read. There we have,

16:56 from 1 to 2: you read X in one, and then you modify X

17:01 in two, so you cannot reorder and put two ahead of one, because then you

17:08 get a very different outcome. So I think we have... and then I think

17:14 there was another one, from 3 to 4, right: statement four updates the variable

17:21 Z, so you cannot put four ahead of three, or you get different results. And

17:29 then we had 1 to 4, likewise. So this applies to how you can reorder

17:35 the statements. And so these, uh, dependences lock in certain constraints

17:43 on reordering of statements, and the reordering

17:53 may be advantageous, again, for memory accesses or register use or something. I'll get lots of examples, maybe

17:59 four or five, into this lecture, where reordering can help the use of vector instructions.

18:11 Now, because of this, one cannot easily wrap... um, basically compute all

18:25 the iterations for statement one first and then all the iterations of statement two, etcetera. Because if

18:34 you want to do all statements, um, every instance of statement one, for

18:43 all the hundred iterations in the loop, then you will get very different results

18:48 compared to going through the four statements in each iteration of the loop. So it

18:56 would not kind of work to wrap a loop around each one of these individual statements

19:03 and get the same outcome. So... then, there was no loop-carried dependence here

19:10 because, uh, in a situation like this, um, in the for loop the

19:17 same index I is used for every statement. So nothing happens when you go

19:22 to the next iteration; the dependences are not from one iteration

19:30 to another, all dependences are within the same iteration. So here is a

19:38 way, then, to try to help a little bit in terms of being able to do

19:47 the vectorization: by using temporary variables, what's called variable renaming. So in this case,

19:57 temp variables. And if you use this set of temp variables, you have eliminated the

20:02 dependencies. Um, because now, if you were to, uh, say, first execute the

20:17 second statement, that should work without impacting correctness. So let's see it.
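The renaming idea can be sketched like this, with made-up statements and with `t1`/`t2` as the introduced temporaries (the lecture's actual statements are not shown, so this is only a shape-alike):

```python
a, b, c, d = 1.0, 2.0, 3.0, 4.0

# Original form: the scalar x is reused, so the third statement must not
# move ahead of the first two (anti and output dependences on x):
#   x = a + b
#   y = x * 2.0
#   x = c + d
#   z = x + 1.0

# Renamed form: each value of x gets its own temporary, which removes the
# anti/output dependences; the (t2, z) pair could now even run first.
t1 = a + b
y = t1 * 2.0
t2 = c + d
z = t2 + 1.0

assert (y, z) == (6.0, 8.0)
```

After renaming, only the true dependences remain (`t1 -> y`, `t2 -> z`), so the two pairs are free to be reordered or vectorized independently.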

20:29 I think I have an example, maybe, on the next slide... nope. So let's see:

20:34 this shows a little bit how one potentially can reduce some of the dependencies.

20:45 So here is now an example of a loop-carried dependence. And so, um,

20:54 so in this case, we'll see that the second statement writes to the variable X,

21:06 and then that particular instance of X is being read in statement one in

21:13 the next iteration, when I is incremented. So that is, this is what's

21:22 known as a loop-carried dependence: the value of X written in one iteration

21:35 is read in the next. And the other thing, um: the U value is being computed

21:44 in each iteration, but the U computed in statement one is only

21:51 used in the second statement. So if we wanted to use vector instructions,

22:02 to wrap the loop, the for loop, around each one of the statements, we would

22:10 not get the correct result if we first did all the instances of the first statement

22:18 and then all the instances of the second statement. So, is there anything one can

22:26 see that one can do to, in fact, enable the use of SIMD instructions

22:35 by wrapping loops around each one of the statements? And I've tried, for

22:42 this, to kind of draw pictures, tried to illustrate the access pattern in the

22:48 graph to the right. In this case, the first thing that comes to mind

22:55 would be to make a copy of the X array. I'm still thinking about

23:09 it. Okay. So if you copy the X array... so that means you

23:22 keep the original X array, is that what you have in mind, and then have

23:27 the updated X array? Yes. That second statement, um, doesn't interact with

23:36 statement one, but the first statement needs the new X values. So having the

23:48 old X values around doesn't help. So for I equals one... well, let's say

24:00 X zero is a value, an argument, that's there, all right. So in the first

24:08 iteration, you compute X one in the second statement. For the next iteration,

24:14 I equals two, you need X one, which was just computed in the previous iteration

24:21 by the second statement. So there, in fact, from what was proposed, here's

24:34 what you can do (and this is correctness-preserving), because there's nothing to

24:40 stop you from computing all the X values first. Uh huh: what was statement

24:48 one only uses the new values, and the new values of X are computed based

24:57 on two independent arrays, Y and Z; X itself was not updated there. So in

25:05 this case, now, there's nothing to stop you from computing all the instances

25:12 of X, and then computing all the instances of... updating the Y values using those

25:22 new X values. So in this case, this is kind of what can be

25:31 done, in this case by doing statement reordering, because of the way these dependences are.

25:37 So now this particular code, by doing reordering of statements, can be

25:46 vectorized. So now you can see, this is again the notion of this wrapping

25:53 thing, that you have, in this case, one statement and you do all the iterations.

26:00 Um, so in this case, the code you need is the loop,

26:06 and in this particular case it is simple: you start with the Y address,

26:12 the Z address and the X address, and in this case you just increment

26:17 those three addresses by one for each iteration of the loop. But you don't change

26:21 the opcode, or even, um, explicitly compute new memory addresses, except

26:33 through incrementing by one, which is a very cheap operation compared to, say,

26:37 sending a totally new address to memory. So in this particular case,

26:46 because it's increment by one, one saves a lot in terms of not needing to fetch

26:53 new opcodes, and in this case not even more than basically adding one to

27:02 each one of the array addresses for fetching or writing data.
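The reordering-plus-distribution just described can be sketched as follows. The statements are made up but shaped like the example: statement two computes `x` from two untouched arrays, and statement one reads the `x` of the previous iteration:

```python
import numpy as np

n = 8
y = np.ones(n)
z = np.full(n, 2.0)

# Original loop: S1 reads the x written by S2 one iteration earlier,
# so neither statement's iterations can simply be run all at once.
x = np.zeros(n)
a = np.zeros(n)
for i in range(1, n):
    a[i] = x[i - 1] + 1.0      # S1: uses x from the previous iteration
    x[i] = y[i] + z[i]         # S2: depends only on y and z

# Reordered and distributed version: all of x can be computed first
# (vectorizable), and then all of a (also vectorizable).
x2 = np.zeros(n)
a2 = np.zeros(n)
x2[1:] = y[1:] + z[1:]
a2[1:] = x2[:-1] + 1.0

assert np.array_equal(x, x2) and np.array_equal(a, a2)
```

Each of the two slice assignments is one "loop wrapped around one statement", which is exactly the form a vector unit can consume.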

27:15 So this array notation is very common as a way of illustrating the notion; I

27:24 use it to try to make it clear that these things can be

27:28 vectorized. And those of you that might be familiar with either MATLAB or some

27:35 other kind of array processing or programming languages: they tend to use this kind of

27:42 triplet notation, where you have a starting address, an end address and potentially a stride. And if

27:48 you omit the stride, the stride is one. But all my examples use stride

27:55 one, so I just use this kind of MATLAB-style, or whatever,

28:00 triplet-type notation to try to point out what the vector instructions kind of do.
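NumPy slices give roughly the same triplet notation; note the mapping is approximate, since MATLAB/Fortran triplets are written start:stride:end and are inclusive, while Python slices are zero-based and end-exclusive:

```python
import numpy as np

a = np.arange(10.0)

# Triplet notation start/end/stride as a Python slice start:end:stride;
# omitting the stride means stride one.
b = a[2:8]       # elements 2..7, stride 1 (stride omitted)
c = a[2:8:2]     # elements 2, 4, 6 with stride 2

assert b.tolist() == [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
assert c.tolist() == [2.0, 4.0, 6.0]
```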

28:11 So, any questions on this part? Mhm. So again, the fact is:

28:20 compilers these days, if the code parts are sufficiently simple, then they

28:28 tend to be, at least for some cases, reasonably successful at doing this form of reordering and then

28:36 using vector instructions. But it is, for me, a good programming practice to

28:43 kind of be aware of all these things, and not leave it to the challenging, um, program analysis to figure it out

28:49 for you. All right, so here is another example. And now we have

29:02 also the loop-carried dependence, right, from one to three. Except now it's kind of

29:12 two loop iterations away, in terms of the loop-carried dependence. And then we have

29:18 also the read after write from 1 to 3. Um, I guess: any other

29:32 dependences anyone spots or observes, or anything one notices otherwise about this code, in terms

29:52 of wrapping the for loop around each one of the statements? You know, there

30:03 is a true dependence here, right, from one... so the previous value of X...

30:11 there is a, uh, dependence from one to three. There is a reuse of

30:17 Z, of Z of I. But Y... has Y...

30:24 Y of I is used, right? Y is never modified; it's just read,

30:32 right? Um... what? I didn't hear you. That's okay. So when it

30:42 comes to Y: it's never modified in this code, it is just being accessed, or read. It

30:51 is never updated. You can reorder two and one. Correct. So then we

31:02 still have the read after write that's there between one and three,

31:09 yeah. So there are dependencies between one and three, but two is the kind of

31:17 simple one. So that's correct. And this is, well, I think what we

31:26 talked about. So in this case, I guess what I did was put that

31:32 second statement first; it's similar to what you suggested first, but it

31:38 doesn't matter. But the next thing one can do, then, because the second statement is

31:45 independent of the other two statements (again, neither: U is not used in

31:54 one or three, and Y is not being updated), so what one can do is

32:01 basically pull the second statement out, and that particular statement can be vectorized on its

32:09 own, right. Then one still has to figure out what to do with the other

32:15 two statements, now the one-and-three statement pair, that have these

32:20 dependencies, in terms of three depending on the outcome of one for Z, and one

32:28 depending on the outcome of X two iterations earlier. In terms of

32:37 this thing: so part of this can be vectorized, but that's a

32:44 gain that is not the full gain one would hope for. And then one would have

32:49 one loop left. And it turns out that, in this particular loop, the first loop

32:58 with the one and three statements, the expression on the right-hand sides happens to be

33:06 the same in these two statements. If that is, in fact, recognized... but

33:13 a very, quite complex reordering is required, and I'm not going to

33:23 go into that in this kind of simple way. But there are ways

33:28 for this particular loop to exploit the common expression. Uh, so it's not just a

33:36 simple way of dealing with the loop to be able to do some level of vectorization,

33:43 but one can basically try to treat it, uh, as a tree in some

33:52 way, and then yet get the correct results. So, modulo that the arithmetic operations need to

34:01 be associative, then it can indeed be done. But as I mentioned, it's complex,

34:08 to say the least. Um, so... all right. So this is

34:19 a different example. So this is a different way of trying to figure out

34:27 how to be able to, again, wrap the loop around each, or at least some of,

34:33 the statements. So, anyone see what the problem is in terms of trying to vectorize

34:42 this particular construct, or set of statements? Uh, well, X is read after

34:54 being written, in both lines two and three. Yes, that's fine.

35:04 Um, but we also use X before writing it on line three.

35:13 Right. Um, yes, yeah: read after write from 2 to

35:20 3. Um, and then we have the read after write,

35:31 um, from 1 to 2, with respect to X. Um, there

35:40 is no loop-carried dependence, because in this case... right. Um, but

35:51 there is another problem. So, in fact, can we vectorize this?

36:06 But that requires something... um, we don't use the value of

36:19 X in the first statement; we just write to it. So if we...

36:23 um, no, hold on, sorry... would it preserve the correctness

36:27 of the program? I mean... So, in fact, if one were to compute all the

36:50 instances of the first statement and then keep those available, then the correctness of the program

37:10 will be preserved. So that means... so I'll try to... here's what we

37:18 talked about. So the thing that prevents these things from being

37:24 vectorizable is that X is just a single memory location, a

37:30 scalar, as it is now. So if we were to promote it to an array,

37:41 everything could be vectorized. So this is what's known as, um, promoting the

37:49 scalar to a vector. So this is how one would do it; it's like this thing here.

37:59 Um, we call it a temporary array in the code on the right now. So we

38:03 keep all the instances of the outcome of the first statement, and then, um, if we

38:16 then use the proper instance in statements two and three, things will be perfectly fine.

38:26 So this is now what can be vectorized, at the expense of promoting a

38:35 scalar and doing the extra writes. So this is, um, one thing, again, that

38:46 compilers tend to be pretty good at; they need then to figure out, um,

38:51 what the dependences are, and see that if you make an array out

38:56 of X, instead of keeping it as a scalar, then things can be

39:06 vectorized. But as a programmer, one can pay attention and then decide to do this.
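Scalar expansion can be sketched as follows; the loop body is a made-up stand-in for the slide's code, with `xt` as the array the scalar is promoted to:

```python
import numpy as np

b = np.array([1.0, 2.0, 3.0, 4.0])
c = np.array([4.0, 3.0, 2.0, 1.0])

# Original: x is a single scalar memory location, rewritten every
# iteration, which serializes the loop.
y = np.zeros(4)
for i in range(4):
    x = b[i] + c[i]      # scalar x: anti/output dependences across iterations
    y[i] = x * 2.0

# Scalar expansion: promote x to an array xt so every iteration has its
# own instance; each statement can now be vectorized on its own.
xt = b + c
y2 = xt * 2.0

assert np.array_equal(y, y2)
```

The cost is exactly what the lecture notes: extra storage and extra writes for `xt`, traded for the ability to vectorize.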

39:11 Or one can do it from the start. All right, so here is

39:20 another type: now an anti-dependence, so it's a write after

39:32 read that happens due to the loop. Um, so that's right: so the

39:48 problem is this: X of I plus one is read in statement two, um, and in

40:02 the next iteration X of I plus one is written to in statement one.

40:14 Um, so in statement two one basically uses one old X value,

40:31 the I plus one, and one new value, the X of I. So: any

40:46 suggestion for how you fix this one? So, okay, so again, kind of...

41:05 I gave it away, right? Statement two uses old and new values.

41:11 And in order to preserve the correct results, you need to have the old

41:17 one around, in order to make sure to get the correct result. So in this case,

41:23 by introducing a temp variable, or a temporary array, that keeps the old values,

41:32 then you can preserve the correctness, and then you can vectorize the whole

41:39 thing. You can first do a copy of X into the temporary, and then

41:44 you use it in the two statements, or the second statement, at the correct places.
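This anti-dependence fix can be sketched like so; the statement shapes are assumed, with `old` as the temporary array holding the previous X values:

```python
import numpy as np

n = 6
y = np.arange(float(n))

# Original loop:
#   S1: x[i] = y[i] * 2          (writes x[i])
#   S2: w[i] = x[i] + x[i+1]     (x[i] is the NEW value from S1; x[i+1]
#       must be the OLD value, since next iteration's S1 overwrites it:
#       a write-after-read / anti dependence on x[i+1])
x = np.arange(10.0, 10.0 + n)
w = np.zeros(n)
for i in range(n - 1):
    x[i] = y[i] * 2.0
    w[i] = x[i] + x[i + 1]

# Fix: keep the old values of x in a temporary array first, then both
# statements vectorize.
x2 = np.arange(10.0, 10.0 + n)
old = x2.copy()                 # temporary holding the old x values
x2[:-1] = y[:-1] * 2.0
w2 = np.zeros(n)
w2[:-1] = x2[:-1] + old[1:]

assert np.array_equal(x, x2) and np.array_equal(w, w2)
```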

41:55 So this is temporaries. So here is a kind of different type

42:02 of dependence that happens. So in this case the Z variable is just

42:19 read, but X is updated in the second statement and sort of read in

42:32 the first statement. But it is quite... a little bit more complex, the loop-carried

42:41 dependence here, than in the previous example. So, who wants to comment on this one?

42:55 Yeah. Let the array go from one to 100... because you're not gonna...

43:01 you're not gonna cross X until you get to the middle, right? Very good.

43:05 Yes, exactly. So you split the loop into half, basically.

43:11 So in this case you can split the index, and then you get the two

43:16 halves of the loop, and then you can vectorize each part.

43:22 So this is what happens. So that's one kind of dependence that happens.
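The midpoint split can be sketched as follows. The loop shape is a guess at the kind of code on the slide (each element combined with its mirror element), not the actual slide code:

```python
import numpy as np

n = 6
b = np.arange(1.0, n + 1)

# Original loop:  for i in 0..n-1:  a[i] = a[n-1-i] + b[i]
# The first half reads elements not yet written (old values); the second
# half reads elements already updated (new values). The dependence
# "crosses" only at the middle.
a = np.arange(10.0, 10.0 + n)
for i in range(n):
    a[i] = a[n - 1 - i] + b[i]

# Split at the midpoint: within each half there is no dependence, so
# each half can be vectorized on its own.
a2 = np.arange(10.0, 10.0 + n)
h = n // 2
a2[:h] = a2[n - 1:h - 1:-1] + b[:h]   # reads the untouched second half
a2[h:] = a2[h - 1::-1] + b[h:]        # reads the updated first half

assert np.array_equal(a, a2)
```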

43:29 So that is how one deals with these types of dependences. All right, so this is what

43:39 they tend to call loop splitting. So let's see what happens in this case.

43:47 So this is, um... just look at the outcome. So this is

43:53 kind of half analogous to when I started to talk about compiler optimizations:

43:59 to take things out of the loop that do not need to be

44:07 in the loop. So in this case, the part of the right-hand side

44:17 in these two statements can in itself be vectorized; then, yeah, one has to

44:25 figure out how to deal with the rest, which has, now, the dependence

44:30 of the loop-carried kind, I think for both X and Y in this case.

44:37 But at least, in this case, what's known as loop splitting: you take

44:43 the code block of the statements, um, in the for loop that you once

44:48 had, all right, you take out some of the operations and use

44:55 vector instructions for those, and then try to figure out how to deal with the remaining

44:58 part. Um, and this is just kind of trivial: again, looping has overhead associated with

45:10 it, because you need to test loop bounds, etcetera. So just reducing it to

45:17 a single loop can be beneficial for performance.
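Pulling the vectorizable part of the right-hand side out of a loop that still carries a recurrence can be sketched like this (a made-up recurrence, with `t` as the introduced temporary array):

```python
import numpy as np

n = 8
b = np.arange(1.0, n + 1)
c = np.full(n, 2.0)

# Original: the product b[i]*c[i] has no dependence at all, but the
# accumulation into a[i] is a loop-carried recurrence.
a = np.zeros(n)
for i in range(1, n):
    a[i] = a[i - 1] + b[i] * c[i]

# Split the loop: vectorize the independent part into a temporary, then
# run the (short, scalar) recurrence over the precomputed values.
t = b * c                      # vectorizable part, one slice operation
a2 = np.zeros(n)
for i in range(1, n):          # remaining recurrence, kept scalar
    a2[i] = a2[i - 1] + t[i]

assert np.array_equal(a, a2)
```

The remaining loop does strictly less work per iteration, and (if the arithmetic may be treated as associative) the recurrence itself could further be replaced by a tree-style reduction, as the lecture mentions.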

45:29 So, yet another technique is known as loop peeling, and it may be worth spending a little bit of time

45:36 figuring out how this construct works and how you can help, um, getting the

45:45 code a little bit, potentially, more recognizable. So, loop peeling: one can look

46:03 at this. So what it tries to do is, in some ways, take advantage of

46:14 what one can infer from the conditionals in this code, the two if statements. So here

46:29 it's two nested loops, right? And with some luck, at least, one can

46:35 potentially do something about vectorizing one level of the two loops. So in this

46:49 case, the best prospect is the inner loop, that is, on the index I,

46:56 but in this case the conditionals depend on the other loop index.

47:04 Basically, the first iteration and the last iteration on the other

47:09 loop are kind of special: the J equals one is special, and the J equals N is

47:16 special. But all the other J values are treated in the same way. So

47:24 what loop peeling kind of is, um, is to recognize that fact and restructure the

47:30 code for the three cases: the J equals one, the J equals N,

47:37 and then the rest of the J values. So now one can see, at

47:46 least for the J equals one, it just has one loop level, the

47:54 I loop, and it has some dependence on X. Uh, but yeah, there's no

48:11 loop-carried dependence, so one should be able to do something with that, as well

48:16 as the other two loops. So we then, on the next slide, try to

48:22 figure out how to now do the vectorization, given the dependencies. And doing all of

48:29 J equals one, doing all the iterations of the first statement, is kind of what

48:37 one wants, and can be vectorized. In this case, the same for the J equals N.

48:44 And there's some vectorization also possible for all of them, I guess: the

48:58 remainder, all the Js except J equals one and N, which has some true dependences in

49:03 part of the statements. But this is just trying to show, again, that while

49:10 sometimes the original code here, on the left, is perhaps a natural way of writing it,

49:19 from the functions you want to compute, with some awareness of vectorizability you

49:30 may want... you might want to consider writing something more like the right-hand side, for

49:36 compilers to have an easier time trying to figure out how to vectorize things.
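Loop peeling can be sketched as follows; the nested loop and its two conditionals are hypothetical stand-ins for the slide's code, chosen only so that `j == 0` and `j == n-1` are the special cases:

```python
import numpy as np

n, m = 5, 6
x = np.arange(float(m))

# Original nested loop with conditionals on the outer index j.
y = np.zeros((n, m))
for j in range(n):
    for i in range(m):
        if j == 0:
            y[j, i] = x[i] * 2.0        # first outer iteration is special
        elif j == n - 1:
            y[j, i] = x[i] + 1.0        # last outer iteration is special
        else:
            y[j, i] = x[i]

# Peeled version: the j == 0 and j == n-1 cases are pulled out, and the
# remaining branch-free inner body vectorizes cleanly over i.
y2 = np.zeros((n, m))
y2[0, :] = x * 2.0
y2[n - 1, :] = x + 1.0
for j in range(1, n - 1):
    y2[j, :] = x

assert np.array_equal(y, y2)
```

The branches have been paid for once, at restructuring time, instead of once per element at run time.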

49:43 I think this is the last element of the vectorization that I wanted to bring

49:48 up. So vectorizing codes with conditionals, like was used in the last example, is,

49:57 um, a tricky business. And again, from the kind of early picture in this

50:03 lecture about SIMD: what I showed, there was one instruction and multiple execution

50:09 units that supposedly used the same instruction for all data. But when you have conditionals,

50:20 that means you get data dependencies in terms of executing that instruction. So the

50:28 way this has been dealt with, um, or is dealt with traditionally, is to

50:33 use what's known as mask mode, or masking. Another version is compress mode.

50:40 Then I'll give examples of what these two mean. As I said, mask mode is:

50:47 just try to pretend there is no conditional, but then rectify what actually happened. And

50:56 here is a kind of ridiculous, cartoonish example, um, of the details of doing this

51:06 statement on top of the slide, where you divide one array with another array, element

51:14 by element. And of course, one doesn't like to divide by zero.

51:20 So, um, we only want to do this statement as long as, um,

51:27 the array element of B is not zero. And then, in order to compare

51:32 the two, mask and compress mode, I assume some, uh, very cartoonish type architecture,

51:40 in terms of overheads or latencies associated with the instructions. And so the idea is,

51:48 again, that the vector instructions have some kind of setup, and do the operations.

51:56 And I think earlier, last lecture, I talked about instruction latencies for adds and

52:03 multiplies and so on, in terms of functional units. And the idea is, in setting

52:09 up: when you issue the same instruction, you pay the overhead, the latency,

52:15 once, and then there's only one cycle per additional add or multiply you

52:23 need to do. So this is the notion. And then you have a vector

52:29 length in terms of characterizing the code, and that means you load the vector into the

52:40 vector register file. So normally the array is much longer than what you can fit

52:46 in the register file. So then the array gets carved up into chunks that

52:53 you can load into the register file at a time. And... I'll show you

53:02 a little outcome of evaluating these two modes in a couple of slides. But

53:10 the notion here is that you pay the overhead, or the associated... the latency

53:17 associated with the functional unit, for the type of instruction you do, and then one cycle

53:24 for each element in the vector that fits into the register file; and that's the

53:31 L that I have in the cycle counts here. And then, so here: when you

53:40 use the mask mode, then you first need to access the two vectors. And in

53:49 this case, I assumed that the load of each one of the two

53:54 vectors, of the size that fits into the register file, has an overhead of

53:59 12, and then whatever the vector length is, that's the number of cycles it

54:05 takes to do these two loads. And then you have to prepare to check

54:12 whether the value is zero or not, and that is the setting of the

54:20 variable. And then there is what's known as an extra mask register, where you

54:24 want to test all the elements of the vector, for B in this case, against the

54:37 value you happen to have loaded, in this case zero. And so basically, you

54:42 have to test it for each one of the elements; in this case, I assume no

54:46 parallelism in the testing. So you go through it element by element, so the

54:50 test is what really accounts for the setting of this mask register. And then you go

55:04 on: so you set that to zero or one, depending upon whether you should do the

55:08 instruction or not. But then, in fact, you kind of blast ahead:

55:13 you still do it for every element. And then you try to make sure that,

55:18 if you divide by zero, other parts of the software do not throw an

55:23 exception and blow things up in your face. But given that you know that the

55:30 divide by zero should be ignored, things can still work under this mask model of dealing

55:36 with conditionals. And then you only store the results for which you should keep the outcome,

55:46 but you kind of go ahead and compute. So basically, you don't store

55:51 the results for which you should not have done the execution. Um, so this

55:59 is kind of the model for how mask mode works. Now, then, we'll talk

56:06 about the compressed mode and then show the outcome. And then I'll happy

56:10 take questions of the two versions. the mass mode is trying thio gonna

56:21 a little bit more efficient, potentially essentially all in carrying out operations for

56:32 the operations are desired. So in , what to do in this

56:43 is that you, um, based the values off B In this

56:54 you kind of know which elements off you should load. And you could

57:05 do the same for a um I'm . So this plan now I see

57:15 a mistake on the side. So first one should state Lobi, not

57:18 day left hand is correct. Lobi V one, not load a

57:24 So that's what a load a B . And you find out which elements

57:30 B are non zero, and thus a fraction f off the video elements

57:38 gets loaded into the registers. Then do you do? You do kind

57:47 account of how many elements or be on zero and then, you

57:56 from the position within the director, which elements are non zero or

58:06 And then you construct the new vector addresses on Lee to the elements off

58:13 that a non zero. So that then. I used to what's known

58:20 load based on this, um, vector off non zero values on B

58:30 and then, you know, the corresponding values of A. So now, eventually,

58:43 and this is very explicit about what you're doing here, now you go and load only

58:47 the elements of A for which you should operate. That's part of this, um,

58:54 vector, initially of length the full vector length, and the corresponding elements of B. And

59:00 now you only divide those elements of A and B that satisfy the condition,

59:07 all on the elements in this compressed collection of data. And then you only

59:15 store back those elements of A that were updated based on A divided by B.
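
A corresponding scalar sketch of the compressed (index-vector) scheme, again illustrative rather than actual vector hardware: one pass over B builds the index vector of non-zero positions, and a second pass gathers, divides, and stores only those elements.

```c
#include <stddef.h>

#define MAXN 1024  /* sketch assumption: n never exceeds this */

/* Compressed-mode sketch for: if (b[i] != 0) a[i] = a[i]/b[i].
   Pass 1 loads all of B and records the non-zero positions;
   pass 2 touches only the fraction f of elements that need work. */
void compressed_divide(double *a, const double *b, size_t n) {
    size_t idx[MAXN];  /* index vector of non-zero positions in B */
    size_t m = 0;

    for (size_t i = 0; i < n; i++)   /* pass 1: build the index vector  */
        if (b[i] != 0.0)
            idx[m++] = i;

    for (size_t j = 0; j < m; j++) { /* pass 2: gather, divide, scatter */
        size_t i = idx[j];
        a[i] = a[i] / b[i];          /* no mask or safe divisor needed  */
    }
}
```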

59:25 Even so, you do have to load the entire vector B to figure out which

59:30 ones are zero. But you then do not load as many elements of A,

59:37 and you don't store as many elements of the updated vector, and you don't do the

59:43 divide by zero and have to take care to make sure things don't go wrong.

59:49 So if I do this little exercise with the particular assumptions I made in terms

59:53 of overheads and cycle counts, etcetera, you get the new expression, and

60:01 on the next slide I'm just plugging in particular values to try to figure out when

60:08 one or the other approach gives you the better performance. So this is,

60:20 um, now using elements per cycle, and that means in this case it takes

60:30 a few cycles to do each one of the iterations, the statements,

60:40 because of the conditional and dealing with it in masked mode and

60:47 compressed mode. So in this case, in the two columns, um, there is

60:56 a masked mode and a compressed-mode index. The higher value means something is more

61:02 efficient for your performance, because you get relatively more use out of every cycle

61:10 that you spent. So in this case, one can see, you know, when

61:18 the fraction for which the condition is true is, um, for, I

61:26 would say, in this case, small, that is, when

61:35 you need to carry out the statements for relatively few elements per, uh,

61:46 vector length, then the compressed mode wins. Whereas when the extra exercise to

61:55 figure out which elements to load and store is sufficiently costly, so,

62:04 um, when the number of iterations in the loop you shouldn't carry out becomes relatively small,

62:12 then using the masked mode is beneficial. So it comes down to these things, and,

62:18 depending upon what the relative times and overheads are. And so, what I used on

62:29 this line here, in terms of load and store overheads: loads and stores at

62:34 12 cycles is very low, and that kind of means, in practice,

62:40 that these things are in caches. If you have to go

62:45 to memory, those numbers are much higher, pretty much an order of

62:50 magnitude higher. So then it shifts the balance quite a bit and pushes it

62:57 much further, so that, um, compressed mode is only effective when most of

63:08 the iterations of the loop should not be carried out.
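
The crossover just described can be illustrated with a toy cost model. All cycle counts below are hypothetical placeholders, not the numbers from the slide; the point is only the shape of the trade-off: masked mode pays for every element, while compressed mode pays a scan cost over all of B plus work proportional to the active fraction f.

```c
/* Toy cost model (hypothetical cycle counts, for illustration only). */

/* Masked mode: every element is loaded, operated on, and conditionally
   stored, regardless of the fraction f of elements that actually matter. */
double masked_cycles(double n, double c_mem, double c_op) {
    return n * (c_mem + c_op);
}

/* Compressed mode: scan all of B to build the index vector, then do
   memory traffic and arithmetic only for the active fraction f. */
double compressed_cycles(double n, double f, double c_mem, double c_op,
                         double c_scan) {
    return n * c_scan + f * n * (c_mem + c_op);
}
```

Setting the two expressions equal shows compressed mode is cheaper when f < 1 - c_scan/(c_mem + c_op); the costlier the scan and indexed accesses are relative to the straight-line work, the lower that threshold drops, matching the observation that compressed mode pays off only when most iterations should not be carried out.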

63:17 And let me show this slide in terms of what's known as the gather construct or scatter construct.

63:24 I'll go quickly through this one, and then I'll stop and take questions.

63:29 Um, so this tries to illustrate simply what this indirect addressing is doing.

63:37 In the gather, one uses another array to point to the elements that one wants

63:43 to retrieve. So in this case, K is an array of addresses pointing to

63:52 elements of X that you want; then you retrieve those elements and store them in

63:58 the array Y. And in the example I just talked about in terms of

64:04 compressed and masked mode, that means Y happens to be, in this case,

64:08 in the register file. And the scatter is the opposite. That's, again, you

64:14 have it, um, the indirect addressing array that is used to point to where you

64:24 want the values of X to be stored in the result array Y. So in this

64:31 case, you have the indirect store, um, right, having the vector of

64:42 indices in terms of addresses, and Y is the collected part. So this is known

64:48 as gather/scatter type constructs, and, um, those are things some computers,

65:01 in fact, have had and some didn't: they have hardware support for this kind

65:10 of indirect addressing and try to make it as efficient as possible.
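
In plain C, the two indirect-addressing constructs on the slide look like this (a minimal sketch; hardware gather/scatter instructions do the same thing a vector at a time):

```c
#include <stddef.h>

/* Gather: y[j] = x[k[j]] — the index array k points at which elements
   of x to retrieve into the packed result y. */
void gather(double *y, const double *x, const size_t *k, size_t m) {
    for (size_t j = 0; j < m; j++)
        y[j] = x[k[j]];
}

/* Scatter: y[k[j]] = x[j] — the index array k points at where each
   packed element of x should be stored in y. */
void scatter(double *y, const double *x, const size_t *k, size_t m) {
    for (size_t j = 0; j < m; j++)
        y[k[j]] = x[j];
}
```

A compressed-mode loop is exactly a gather (load the active elements), arithmetic on the packed data, and then a scatter (store them back).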

65:19 Um, and so I'll stop and hold for questions about the vectorization or dealing with

65:27 these gather/scatter type constructs. The gather/scatter is kind of expensive, though.

65:41 And anyone familiar with sparse matrix computations, where things are not nicely organized as sequential data,

65:50 will normally make extensive use of gather/scatter. And it also means, compared to one of

66:11 the earlier assignments, right, so this kind of shows up

66:17 when you need to sort of gather and scatter: um, the memory system performance

66:24 costs more cycles, and it becomes noticeable. No questions? Okay, so

66:39 There are a few minutes left, so I'll talk about, sort of, high-level things about this notion of

66:46 heterogeneous nodes. And then I'll talk more explicitly about dealing with programming these

66:57 types of architectures. So, SIMD, maybe, just to wrap up again:

67:08 the notion is the same, the one instruction and multiple, um, addresses

67:17 being used for data flowing in, but one instruction stream, and, quote

67:25 unquote, streaming, um, data through it, if you like that construction. All

67:37 right, so most of what we'll do in the class is a particular form

67:42 of heterogeneous node architectures, in this case where we have some form of accelerator,

67:53 an attached, to some degree special-purpose, device that is integrated into the node,

68:09 and typically this is the way things are being integrated now. So GPUs, the

68:19 way you're going to use them in this class, and probably the way anyone who

68:23 has used them, if you already have, has, is in the form of an accelerator

68:29 or processor that is connected to the node on an I/O bus known as

68:38 the PCI Express bus, typically, these days, and which is separate

68:46 from the memory bus to which the memory is attached or accessible to the CPU.

68:56 Ah, so there are many things to this. So the first thing to be

69:09 aware of in this type of scenario: there are things I will mention briefly, hopefully,

69:14 before the class is over, that, um, for GPUs in your laptop, the

69:23 CPU has an integrated GPU. A server doesn't have it normally, and attached devices,

69:28 that they might have; but typically, for many processors for desktops, there is a GPU on

69:34 the same piece of silicon as the CPU. But we will focus, in

69:41 this class and for your assignments, on programming the attached devices. So the

69:49 first thing to become aware of is that the attached devices, like accelerators and GPUs

69:57 and FPGAs, uh, tend to have their own memory. And so

70:06 there are two or multiple memory spaces, and sometimes you have several GPUs attached to

70:13 a given CPU on this PCI Express bus. You can host a few,

70:21 uh, GPUs or other devices, typically, but depending on what the use

70:30 or target market is for the CPU, it may also have more than one

70:35 type of PCI Express bus, so that you can also get peripherals attached to it,

70:42 so. But the point is, programming these types of nodes, you have two memory

70:48 spaces to worry about. It is also the case that these attached processors have their own

70:55 instruction set. So now you have two memory spaces to worry about, and you have

71:00 two instruction sets to worry about in terms of programming these devices. And I wanted to

71:08 also mention that this PCI Express bus, or the I/O bus: it's a

71:13 standardized bus that is commonly used for attached devices. And here is just, today,

71:22 uh, the PCI Express version that Stampede2 uses, and I believe

71:29 that's also indicated for Bridges, which we will use for the next, the GPU programming exercise;

71:36 Stampede2 doesn't have GPUs, so we use Bridges for the GPU assignment. So these are

71:43 the data rates. And just to, um, remind you a little bit,

71:49 uh, the CPU memory bandwidth is typically, you know, 100-plus, maybe up

71:55 to 200, gigabytes per second per socket, whereas the GPU, on the other

72:03 hand, has even higher bandwidth to its own local memory. So compared to these two

72:11 data paths, between the CPU and its memory and the GPU and its memory, the PCIe bus

72:18 is very weak.
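
A back-of-the-envelope calculation makes the imbalance concrete. The bandwidth figures in the note below are ballpark values in the spirit of the lecture (roughly 150 GB/s per CPU socket, several hundred GB/s for GPU local memory, on the order of 16 GB/s for a PCIe 3.0 x16 link), not specifications of any particular product.

```c
/* Time in seconds to move a payload at a sustained bandwidth. */
double transfer_seconds(double gigabytes, double gb_per_second) {
    return gigabytes / gb_per_second;
}
```

Moving 1 GB takes about 1/150 s from CPU memory but about 1/16 s over the PCIe link, roughly an order of magnitude longer; that is why minimizing host-device traffic dominates GPU programming.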

72:26 And here are also, I guess, a few things to keep in mind when it comes to whether it is an advantage or not to

72:30 use a GPU. And for this case, I just took two of the

72:34 examples of the processors I have talked about in the processor lecture, as well as

72:40 two GPUs I mentioned in, again, the processor architecture lecture as well.

72:49 So the things to keep in mind: the different degrees of parallelism,

72:56 uh, in terms of CPUs versus GPUs, and, um, also the

73:10 ops per cycle, as well as the, yes, the performance per watt.

73:20 So the point of this slide, which I will leave as my last slide, and I will

73:24 continue from it next time, is a few things. So first, if one looks at the

73:32 cores or threads, it looks like CPUs are considerably weaker, um, or less

73:40 powerful than the GPUs. That is kind of a bit of a fallacy, because

73:50 the CPUs today have these vector units associated with them. So in one

74:03 instruction, for each thread, if you like, they can do a lot

74:11 more operations than what the GPUs do. GPUs, as SIMD engines, have very

74:20 simple threads, so to speak, that can only, each thread can only do

74:25 two ops per cycle, whereas the CPUs can do eight, 16,

74:36 or sometimes 32, even, ops per cycle per thread. So that actually changes the balance

74:45 to not be quite as dramatic as the core or thread count may seem to indicate.
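
The core-count fallacy can be made concrete with a small aggregate-throughput calculation. The per-thread ops-per-cycle figures in the note below follow the lecture's point (a simple GPU thread doing about 2 ops per cycle versus a CPU thread with wide vector units doing 16 or 32); the thread counts and clock rates are made-up round numbers for illustration only.

```c
/* Peak arithmetic throughput in ops/second: threads x ops/cycle x clock. */
double peak_ops_per_second(double threads, double ops_per_cycle,
                           double clock_hz) {
    return threads * ops_per_cycle * clock_hz;
}
```

For example, 5000 simple GPU threads at 2 ops/cycle and 1 GHz give 1e13 ops/s, while 48 CPU threads at 32 ops/cycle and 2.5 GHz give 3.84e12 ops/s: a few-times gap, not the hundred-times gap the raw thread counts alone would suggest.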

74:51 So, in the end, these GPUs that are listed here are the

74:57 ones that are designed for scientific and engineering applications, which means they have quite good

75:08 capabilities for double-precision floating point, which is not typical. There is normally a factor,

75:16 um, for CPUs, of two between single- and double-precision performance,

75:24 and that's true for the GPUs that are listed as well. But for

75:28 most GPUs, the double-precision performance may be a factor of 10 or more lower

75:37 than the single-precision performance. And I guess the take-home message, in particular

75:45 for this slide, is that the bottom line in terms of applications is: yes, if

75:52 one ends up, um, using GPUs, the peak performance for the

76:00 GPU in itself is maybe a factor of a few or so higher, whether you're single

76:04 or double precision, depending on which type of GPU you have. But in order to

76:09 get that, you need to have a much higher degree of parallelism than you

76:21 need for the CPUs, so it's by no means an easy thing to realize

76:28 this potential difference, because the CPUs, I would say, have

76:35 the MIMD capability. So there are many things you can do on them to get

76:41 good performance where the MIMD mode is the more appropriate one, and SIMD too; but for

76:49 GPUs, SIMD is pretty much the only way to close the gap and get close to the

76:55 peak performance. And with that, I will stop; I am already over time. I will

77:01 continue to talk about the programming next time. And I guess, finally, as I

77:07 mentioned last week, uh, we will send out the midterm. That's the one

77:13 take-home exam that is basically given to you. Um, I was going to say,

77:23 the strategy was, you will find most of the answers to the problems on the

77:29 midterm either in the slides or through the links posted on the site. So it's a way to

77:37 make sure that you go through the site. There's no programming

77:41 tasks in the midterm. So with that, I can stop sharing the screen. Any further

77:55 questions? Comments? For the block matrix-matrix multiplication problem, the second one:

78:07 I've noticed that we haven't been using, I mean, the assumption is that we

78:12 don't provide the AVX-512 flag to the compiler, right? Um, so,

78:18 as far as I know, we haven't been using the vector units, because it needs

78:22 that flag in the compiler. Um, hey, I, um, so I don't

78:32 remember the exact text we had for the assignment, uh, so did we rule out the

78:41 use of the vector instructions in the assignment? Do you remember, did we

78:48 explicitly say, don't use it? Uh huh. The main motive behind

78:54 the problem is that we use OpenMP constructs to get whatever performance you can get.

79:02 It's not that we discourage using AVX there, but again, the motive is for

79:09 you to understand OpenMP better. No, it was certainly fine to

79:18 use it. I, um, but as I said, the notion was essentially to,

79:24 you know, use OpenMP constructs and maybe find out the impact they are having.

79:30 Okay. Yeah, I just noticed that the MKL wasn't using it,

79:34 so I figured we shouldn't use it either. But, you know, anyone is

79:38 certainly welcome to experiment with it and see if the compiler manages to do the vectorization.

79:44 It may not.
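
For anyone who does want to experiment, here is a minimal sketch of asking the compiler for vectorization. The `#pragma omp simd` directive requests vectorization of the loop (it is ignored unless compiled with `-fopenmp` or `-fopenmp-simd` in GCC), and architecture flags such as `-mavx512f` (GCC) or `-xCORE-AVX512` (Intel) let the compiler emit AVX-512 instructions; the function itself is just an illustrative loop, not part of the assignment.

```c
#include <stddef.h>

/* A loop the compiler can vectorize; with GCC, one can try e.g.
     gcc -O2 -mavx512f -fopenmp-simd -fopt-info-vec file.c
   to see whether vectorization happened. */
void scale_add(float *y, const float *x, float a, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```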

79:50 Okay, so, for the vector instructions: if we did decide to experiment with them?

79:56 Yes, that's fine. Okay. And it's not that you're looking for how much performance you can

80:04 gain. It's more like: what are all the strategies you have tried, and

80:10 what, what are your insights from trying those strategies for trying to improve

80:14 those codes? Okay, that's a good point. And we should perhaps be more

80:20 clear if we use this assignment again. So, from my perspective, it

80:25 was mostly a question about, again, using shared and private variables and the

80:34 consequences, um, for memory and performance. So it was mostly focused on

80:43 managing memory as opposed to using the vector features. That was where my focus

80:53 was in writing this thing. And I think the first part was very much

80:59 about whether one ordering or the other was better, um, or at least my results pointed

81:06 to that. Right. Well, good. That was a good point.

81:12 Well, good job. Ah, good afternoon. Okay. Any other questions from

81:16 anyone else? Just a comment: a student reported yesterday that he was having

81:26 some issues when using the zero optimization level with the block matrix multiplication when he was trying to parallelize

81:35 the outermost loop level. For some reason it does some things that mess up the memory

81:44 allocations and gives him a segmentation fault. From what I can read from his messages, the

81:51 O1 optimization level seems to have done the trick. So in case anybody is getting something

81:57 similar, try using a different optimization level than O0. Yeah.

82:03 And I guess I should also add, a couple of you requested one more day

82:09 to hand in the assignment, and I had said yes. So we should say

82:14 yes to all of you. So you have until tomorrow, and we'll send an email

82:18 as well to make sure everyone knows this, that you have one more day.

82:24 The instructions said we were supposed to be using the O1 optimization level for the

82:30 baseline, right? If it said that, then that's fine. I don't remember

82:35 exactly the text. Well, yeah, for you, in case you're using O1,

82:39 that's why I said okay. I was just wondering why he was using O0,

82:44 but okay, great. Another question that, um, I'm not sure will have

82:52 a straight answer, but, um, I've had similar results, um, whether

82:58 I run a script during an interactive session versus just handing it off as a batch

83:03 job. Um, what do you expect? Close to identical results, whether you're

83:09 doing it in an interactive session versus a batch job, because the dev and the

83:16 SKX nodes are identical from a hardware perspective, right? Yeah. They

83:22 are two different ways of submitting jobs. As long as you're requesting the same resources

83:29 and keeping all the runtime parameters the same, you may not get exactly the same

83:35 values, but the values should not deviate much between the two things. So the

83:40 overhead in having, so not even if we're in the VNC session as

83:46 opposed to an idev session in the, the standard terminal? Um, the overhead of

83:52 providing, ah, a graphical interface with, um, with CentOS, that shouldn't have an

83:59 appreciable effect on, on our results? It should be very minimal. I don't think, with

84:06 VNC, it should be a significant issue. Okay. Yeah,

84:13 with the batch job, let me just say, sometimes I find myself waiting longer.

84:21 Um, and that's a question: I'm sure the cluster folks do not let people run

84:26 too many jobs at a time. Um, for sure there's just a lot of

84:33 permutations on this assignment. Yes, and it's this way for the next one, too.

84:42 Yeah. I wish I had scripted better for the earlier assignments, but, you know, I

84:52 understand. Hopefully the work gives you some insights in different ways.

84:57 Mhm. That is the purpose of the exercise: the insights gained. Okay, if

85:20 there are no more questions, then I guess that's it, and end of the session.
