© Distribution of this video is restricted by its owner
00:00 Okay, so today is more OpenACC, a directive-based

00:11 way of programming attached processors and, for this class, GPUs. But

00:21 in principle it is not restricted just to GPUs, even though I don't know if it

00:25 has been used for anything else at this time. So here is a bit

00:34 of an outline of what I'm going to talk about today. The first few slides

00:38 are basically a recap of the last lecture. So that's the

00:47 difference in the level of parallelism, the difference in memory sizes, which is

00:53 important, the bandwidth to and from memory, and the fact that there is a

01:00 thin pipe, relative to other data paths, between the CPU and the GPU, the

01:07 PCI Express bus, and that there are two address spaces and two instruction sets that need to

01:14 be dealt with in order to get working code. And here was the point

01:19 that things start and end on the host. And that means that both data and

01:28 code need to move from the host to the attached device and then hopefully execute

01:35 relatively independently, and when the results are available, they get copied back.

01:42 Now, given that the memory on the device tends to be significantly smaller than the

01:47 host memory, sometimes that means several rounds of interaction between the host and the

01:57 device in order to accomplish the entire computation. It's also the case that

02:02 they can work sort of asynchronously, so things can be handed over to the

02:10 device for certain segments while the CPU works on other parts

02:15 of the code. Then they started . Talk about this. Now,

02:21 the example again. I will use this example to illustrate some of the

02:29 features of OpenACC. And as I mentioned... Yeah. Sorry.

02:37 You say host and device to be consistent in principle, but in practice we're using

02:42 CPU and GPU, respectively. Right, in this class. Yes. And as far

02:50 as I know, OpenACC has not been used for anything else.

02:56 I mean the a little bit out date, but not much open

03:03 That now also supports that accelerators has used for other devices as well.

03:11 GPS and FPs. Okay, Thank you. And I'm trying to

03:20 use the notion of device, which seems to now be used to some degree both in the Open

03:28 MP community and the OpenACC community. So, uh, and I

03:38 also mentioned these compiler flags that are, I believe, unique to the PGI

03:47 OpenACC compiler, and presumably it's the only OpenACC compiler.

03:53 Well, that's not true. Cray has an OpenACC compiler

03:59 too, but that may differ somewhat from the PGI compiler. But the start of

04:07 OpenACC was kind of a joint development, as I said in the last lecture, in a

04:12 consortium between Cray, NVIDIA, and PGI. So in this lecture I'll talk

04:21 a little bit about using these different flags for the target device or accelerator,

04:30 mostly on GPUs in terms of examples. And I think I

04:37 showed this one last time as well: the sequential code for doing this Jacobi

04:43 iteration, where, um, those of you who have had an introductory numerical analysis course

04:52 know about Jacobi as one of the simplest iterative algorithms. And

04:59 it kind of works in a synchronized fashion, if you like. That is,

05:06 I mentioned last time that it uses two versions of the array: the A version,

05:16 which is the blue dots, and it generates the new array, and the Anew value

05:22 is the red dot, which is the average of its four neighbors.
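The update rule just described can be sketched in plain C as follows. This is a minimal sketch, not the lecture's slide code: it assumes the grid is stored row-major in flat arrays, and the function names `jacobi_sweep` and `jacobi_copy_back` are illustrative.

```c
#include <assert.h>

/* One sequential Jacobi sweep on an n x m grid stored row-major in flat
   arrays (an assumed layout). Each interior entry of anew becomes the
   average of its four neighbors in a. Returns the largest |anew - a|. */
double jacobi_sweep(int n, int m, const double *a, double *anew)
{
    assert(n >= 3 && m >= 3);           /* need at least one interior point */
    double err = 0.0;
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            int k = j * m + i;          /* flat index of grid point (j, i) */
            anew[k] = 0.25 * (a[k - 1] + a[k + 1] + a[k - m] + a[k + m]);
            double d = anew[k] - a[k];
            if (d < 0.0) d = -d;        /* absolute value of the update */
            if (d > err) err = d;       /* track the maximum update */
        }
    }
    return err;
}

/* "Swap" step: the new interior values become the current ones. */
void jacobi_copy_back(int n, int m, double *a, const double *anew)
{
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            a[j * m + i] = anew[j * m + i];
}
```

A driver would call `jacobi_sweep` and then `jacobi_copy_back` inside a while loop until the returned value drops below a tolerance.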

05:28 So you need to compute all the red dots, all the Anew values of the

05:33 points, first, and then swap. And so the first two loops go

05:41 over the 2D grid and compute all the red dots,

05:48 and then the next two loops swap and make the new

05:56 values the blue ones for the next iteration in the while loop, and then

06:04 output the maximum update magnitude, which is the error in this

06:13 case. One uses basically the maximum-magnitude update value as a convergence

06:25 measure. It's a very common, very simple technique. It's not a very

06:35 efficient solver, and people don't use it very much for doing iterative solves.

06:42 But that's a different topic. So I think this

06:50 is where, uh, and I may have shown this as well, one

06:58 used the option for the compiler to take care of everything. So it

07:04 specifies that the target was a GPU, Tesla, as it says. For those of you who don't

07:11 know, Tesla here has nothing to do with the famous Tesla of electromagnetics; it is

07:18 the name of the product line from NVIDIA. And then it also has the managed option

07:27 that tells the compiler that the programmer wanted the compiler to pretty much take care

07:35 of everything. And what is used then in creating the code for the

07:47 accelerator is the acc parallel pragma, to identify the region of

07:58 the code targeted for the accelerator, through the pragma acc parallel part, and then loop says the programmer

08:08 wanted this loop nest to be parallelized, and I'll

08:15 talk a little bit more about that in slides to come. So there are two

08:22 regions and two pragmas. And what's on the left-hand side is from

08:30 using the info flag, with the compiler telling what it did with respect to the

08:39 code. And I don't remember exactly how I talked about it, but let's

08:44 go through it again here. It basically shows what the compiler did: it

08:52 parallelized the outer loop using the notion of gangs, and I'll talk a little

09:00 bit more about that in slides to come. So there are

09:06 these three notions of parallelism for GPUs that are kind of unique to

09:13 GPUs, not just NVIDIA's; it is the same for, I think, all of the vendors

09:20 of GPUs. There is the gang level of parallelism, there is what's known as

09:31 the worker level of parallelism, I'll talk about that, and then the vector level

09:36 and the SIMD feature. So in this case it used gang-level parallelism for

09:42 the outer loop in this construct, and it used vectorization for the inner

09:49 loop. And that's kind of a common thing for compiler writers to do: to vectorize the inner

09:57 loops and try to use gang, or higher-level, parallelism for the outer

10:05 loops. And it generated the reduction clause also, which was necessary for the

10:12 first loop nest in order to guarantee correctness, similar to what we talked about

10:18 in terms of OpenMP, making sure that everything gets executed correctly. And

10:25 similarly for the second loop nest, it did the same strategy: gangs for the outer and

10:32 vector for the inner. And then it also took care of the data movement.

10:40 Remember, things start and end on the host. So in this case it uses

10:46 A, so it needs to come from somewhere, and it comes from the

10:51 host. So there is a copyin for A. Then also, Anew

10:59 is something that is actually created on the device. So what

11:13 was used is copyout. As I think I talked about last time,

11:19 it allocates memory for the array, and it makes sure that the whole section

11:25 that is the outcome of the region, where Anew is assigned

11:33 values, is copied back. And the error variable is something that the host sets. So it

11:40 also then makes sure that it is copied from the host to the device and

11:47 then copied back from the device to the host. So they kind of stay in

11:53 sync in terms of what the value of the error variable is. Analogously,

11:59 the same thing happens at the other region, except that it has Anew as

12:08 input on the right-hand side of the assignment, so it is copied from the host to the device.

12:17 And then the result is the array A, which is returned to the

12:29 host. There are a number of other things that one can notice. Maybe...

12:42 Let me see if I want to do that now. All right, so does

12:47 anyone have any questions or comments on what goes on in this

12:55 version of the compiled code? I'll talk about other versions in coming slides.
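The version discussed above, with two `parallel loop` regions and a `max` reduction on the error, might look roughly like the sketch below. It is an illustration rather than the slide code: the flat row-major layout and the name `jacobi_acc_parallel` are assumptions, and because no data clauses are given it relies on compiler-managed (unified) memory, for example the PGI managed target option mentioned in the lecture. Without an OpenACC compiler the pragmas are ignored and the code runs sequentially.

```c
#include <assert.h>

/* Jacobi iteration with two accelerator regions per while-loop iteration.
   No data clauses: data movement is left to the compiler and runtime
   (assumes managed memory; otherwise explicit data clauses are needed). */
int jacobi_acc_parallel(int n, int m, double *a, double *anew,
                        double tol, int iter_max)
{
    assert(n >= 3 && m >= 3);
    double err = tol + 1.0;
    int iter = 0;
    while (err > tol && iter < iter_max) {
        err = 0.0;
        /* Region 1: compute Anew and reduce the maximum update. */
        #pragma acc parallel loop reduction(max:err)
        for (int j = 1; j < n - 1; j++) {
            #pragma acc loop reduction(max:err)
            for (int i = 1; i < m - 1; i++) {
                int k = j * m + i;
                anew[k] = 0.25 * (a[k - 1] + a[k + 1] + a[k - m] + a[k + m]);
                double d = anew[k] - a[k];
                if (d < 0.0) d = -d;
                if (d > err) err = d;
            }
        }
        /* Region 2: the new values become the current ones. */
        #pragma acc parallel loop
        for (int j = 1; j < n - 1; j++) {
            #pragma acc loop
            for (int i = 1; i < m - 1; i++)
                a[j * m + i] = anew[j * m + i];
        }
        iter++;
    }
    return iter;   /* number of iterations performed */
}
```

The reduction clause is what keeps the shared `err` correct when many gangs and vector lanes update it concurrently.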

13:04 Could you say a little bit about how the second, uh, loop,

13:09 or I guess the second pragma, doesn't copy in A? Good

13:16 question. Yes. So it doesn't copy in

13:26 A. So the compiler was clever enough to figure out about A, but probably not

13:35 about Anew, that A is used both in the first and the second parallel

13:42 region. One can try to hypothesize as to why the compiler chose to

13:52 be conservative about Anew. I assume it knows what A is for. It

14:01 didn't need it, actually, because A gets overwritten in the second loop. So as long

14:08 as it has allocated memory for it, it doesn't need to get it back,

14:18 if that helps. But it didn't understand that Anew is used in the

14:28 second parallel region. So it is the same as in OpenMP, when we talked

14:33 about how things end at the end of the parallel region. When it came to

14:38 OpenMP, and just shared memory, if things were not a

14:45 shared variable and were allocated inside the parallel region, they get deallocated. And if

14:50 you wanted something to be preserved for another parallel region, or just for sequential code,

14:59 in terms of OpenMP you needed to copy it out. So now it's a

15:04 separate piece of memory, GPU memory in this case, or device memory, so

15:10 it needed to write it to the host so that it did not get lost.

15:22 Right, and copyout, I should say, yes, one should remember that copyout,

15:29 as I was saying, includes an allocation, even though it doesn't initialize the array. So it

15:35 needs to allocate memory for Anew, but it doesn't need to be initialized, because it is

15:42 only written to. That's why it's copyout: it's sufficient to allocate and

15:49 return values to the host. It was the same in terms of how

15:58 Anew was handled in the first parallel region, right? It was not

16:05 initialized; it was allocated, and then was assigned values in the region.

16:10 It didn't need to be initialized. So, and here is what happened:

16:22 it did make good use of, in this case, the GPU. So they got

16:28 a very good speedup. And here is a little bit more about

16:38 this notion that the compiler uses, a notion of unified memory.

16:45 What it means is that there is a supporting mechanism to kind of pretend that it is just one

16:55 address space, or one memory. So there is a common address space, even though the

17:04 access properties of the different parts of the address space are quite different.

17:10 So the compiler needs to keep track of what the properties are of those different parts.

17:17 But it in itself looks at, or has access to, both the host memory and the device

17:27 memory, and tries to figure out what the best thing is to do in terms of

17:33 allocating memory and transferring data between the host memory and the device memory. And it

17:44 did reasonably well. I'll comment more on why I just say reasonably;

17:52 obviously, they got a very good speedup, so maybe I shouldn't,

17:57 but I will make it clear why I say reasonable as we go.

18:05 So sometimes it's beneficial to try to manage,

18:17 um, the address spaces oneself. And that's what I'm going to talk about

18:24 next, because the compiler is not always capable of doing a good job in managing

18:35 two address spaces with different properties. So first now is to try to see

18:45 what happens if we omit the managed attribute for this, um,

18:58 flag. Now the compiler is not given the task of managing the two pieces

19:07 of memory, so it doesn't optimize it. I should say it still

19:17 generates code that is correct. So it's perhaps not correct to say that

19:27 it doesn't manage the two memory spaces, because it does, but it treats them as truly separate

19:33 spaces. So here's what happened in this case. On the

19:45 right-hand side you see what happened. Um, it's

19:53 not correct; it should not say data clauses in that part, because that's not the case.

19:59 So this is an error in the slide. I'm sorry. Copying the slide

20:03 without updating it, you know, having used the slide a few

20:07 times. So with managed memory, they did a very good job. But if you

20:12 do not have managed memory, with the compiler explicitly told not to manage

20:20 memory, then things can go kind of badly. So next I'm going to talk

20:29 a little bit about what actually happens in this case and what the differences

20:34 are, and then proceed to using the data construct to get back to having good performance.

20:45 So here is now kind of a look at the code, now compiled without

20:56 that managed, or what they call attribute, option. On the left-hand side I

21:03 have what the compiler reported, what it actually did in this

21:10 case. So what did it do? Well, two things. It certainly made sure

21:15 that the code was correct by generating the reduction clause. And the other thing

21:21 it did: it took these loops and did the same thing as in

21:27 the managed case, that is, it used gang parallelism for the outer loop

21:33 and vectorized things in the inner loop, and the same thing with the next parallel

21:43 region and its two loops. It did exactly the same thing. So the

21:49 parallelization ended up being exactly the same whether it was managed

21:55 or not. So that's not the reason for the huge drop in performance.

22:06 It's hard to perhaps memorize, but data management is the reason for things being

22:15 so different. So here is what we have in terms of what the compiler

22:23 is doing now. So it did a copyout, that is, it allocated memory for

22:32 Anew and then copied it back to the host, which was also the case in

22:43 the managed case. And this is similar to what was done in the

22:59 managed case. It is not evident from the compiler output

23:13 how they're handled differently, but they are. So it looks like it's more or

23:17 less the same, but this is what's happening and the reason why there's a

23:23 lot of traffic. So this is as was the case in the

23:28 managed case, but I kind of doubt the actual output on the

23:36 slide. Sorry. So I have to

23:44 rerun the code and make sure that the output is actually consistent with the

23:54 flag settings. But in this case, what happens is that there is

24:02 a lot. So first, as it says, there is the problem that,

24:07 from one of these parallel regions to the other, Anew moves back

24:17 and forth, that is, from device to host and from host back to the device,

24:25 which is not really necessary. It is necessary because of the syntax for how

24:37 the parallel regions are handled, but it is not really what we wish would happen,

24:43 and otherwise the compiler does the proper job. So I

24:56 put a little question mark here. Does anyone have an idea of any other problem?

25:17 That's okay. I'll come to the next problem. So here is

25:24 a little bit of how one can try to help the compiler by using,

25:31 um, clauses to tell it what it should be doing in terms of copying things

25:39 to and from the device and the host. And there is also array shaping, which

25:49 is sometimes a good thing to use to help the compiler figure out how to manage

25:56 memory, and it is used in the examples to come. So here is now

26:07 explicitly managed data traffic, so to speak, now putting in clauses

26:15 trying to help the compiler. It says I want a copyin for the A

26:20 array. And in this case, I don't know exactly why the person who

26:26 did the examples used copy for Anew. It could have been perfectly fine

26:30 just to use a create clause instead of a copy for Anew,

26:36 because it doesn't need to have input values from the host.
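Explicit data clauses on each parallel region, as described above, could be written along the following lines. This is a hedged sketch under assumptions: flat row-major arrays with shapes given as `a[0:n*m]`, and an illustrative function name `jacobi_iteration_clauses`. It uses `copy` for `anew` as the slide reportedly did; `copy` rather than `copyout` also keeps the untouched boundary entries intact in this flat layout.

```c
#include <assert.h>

/* One Jacobi iteration with explicit data clauses on each region.
   The arrays still cross the bus at the start and end of every region. */
double jacobi_iteration_clauses(int n, int m, double *a, double *anew)
{
    assert(n >= 3 && m >= 3);
    double err = 0.0;
    /* a is only read here, so copyin suffices. */
    #pragma acc parallel loop reduction(max:err) copyin(a[0:n*m]) copy(anew[0:n*m])
    for (int j = 1; j < n - 1; j++) {
        #pragma acc loop reduction(max:err)
        for (int i = 1; i < m - 1; i++) {
            int k = j * m + i;
            anew[k] = 0.25 * (a[k - 1] + a[k + 1] + a[k - m] + a[k + m]);
            double d = anew[k] - a[k];
            if (d < 0.0) d = -d;
            if (d > err) err = d;
        }
    }
    /* anew is only read here; a is updated and returned to the host. */
    #pragma acc parallel loop copyin(anew[0:n*m]) copy(a[0:n*m])
    for (int j = 1; j < n - 1; j++) {
        #pragma acc loop
        for (int i = 1; i < m - 1; i++)
            a[j * m + i] = anew[j * m + i];
    }
    return err;
}
```

Because the clauses sit on the individual regions, both arrays are still transferred in every iteration of an enclosing while loop.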

26:44 So now, in this case, they had this, and

26:50 it turns out, as you can find, as I did here, that

26:55 it may differ a little bit, but pretty much everything is the same as the

27:00 one without explicit copyin and copyout clauses. So this didn't

27:09 help. So the other problem that I was trying to allude to is,

27:22 um, what it says on this slide. So, in addition to the

27:32 traffic between the device and the host between the two parallel regions, this

27:43 happens for each iteration of the while loop. So there is, both inside the while

27:54 loop, excess traffic on this PCIe bus, which is the weakest part, as well

28:02 as between iterations. And this was really identified by the developers

28:10 of OpenACC, as well as OpenMP: that we need tools to

28:16 manage data residing on the device in situations like

28:29 this. So there is a data construct that can be used to specify the scope

28:44 of variables or arrays that are resident on the device. And that

28:57 I will now show in the next slide. So here is how it can be

29:01 used, with the clauses for what one wants to happen to the various arrays. In

29:13 this case, one doesn't want A to be copied back to the host for every

29:22 iteration, and the same thing with Anew, and you also want it to

29:27 be preserved between the two parallel regions. So in this case, we need to

29:37 both allocate memory for A and initialize it from the host.
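A data construct wrapped around the while loop, as being described here, can be sketched as follows. The layout, the function name `jacobi_data_region`, and the `present` clauses on the inner regions are assumptions made for this illustration. `copy` is used for `a` (allocated, initialized from the host, returned at the end) and `create` for `anew` (device-only scratch space).

```c
#include <assert.h>

/* Jacobi iteration with both arrays kept resident on the device for the
   whole while loop. Only the scalar err is exchanged in each iteration. */
int jacobi_data_region(int n, int m, double *a, double *anew,
                       double tol, int iter_max)
{
    assert(n >= 3 && m >= 3);
    double err = tol + 1.0;
    int iter = 0;
    #pragma acc data copy(a[0:n*m]) create(anew[0:n*m])
    {
        while (err > tol && iter < iter_max) {
            err = 0.0;
            #pragma acc parallel loop reduction(max:err) present(a[0:n*m], anew[0:n*m])
            for (int j = 1; j < n - 1; j++) {
                #pragma acc loop reduction(max:err)
                for (int i = 1; i < m - 1; i++) {
                    int k = j * m + i;
                    anew[k] = 0.25 * (a[k - 1] + a[k + 1] + a[k - m] + a[k + m]);
                    double d = anew[k] - a[k];
                    if (d < 0.0) d = -d;
                    if (d > err) err = d;
                }
            }
            #pragma acc parallel loop present(a[0:n*m], anew[0:n*m])
            for (int j = 1; j < n - 1; j++) {
                #pragma acc loop
                for (int i = 1; i < m - 1; i++)
                    a[j * m + i] = anew[j * m + i];
            }
            iter++;
        }
    }   /* a is copied back to the host here, once */
    return iter;
}
```

Compared with clauses on the individual regions, the array transfers now happen once before the first iteration and once after the last.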

29:44 So that's why it's a copy clause, because it also returns values to the host

29:53 when things are said and done. And in this case the create clause was used

29:59 for Anew, because the host doesn't need to know about the Anew

30:06 array. It just needs to know what the final outcome is when things are said and done.

30:11 So Anew is entirely local to the device in this case, and A

30:18 is then copied only once from the host at the start of the iteration and back

30:27 at the end of the iteration. And now, I guess, as shown on this

30:36 slide, pay attention to what the compiler output is in this case. Same thing:

30:43 it put in the reduction clause and has figured out how to parallelize things in the very same

30:48 way as before. But now the copying is done only once, outside the

31:01 kernels, so to speak. It's not done with the kernels but done once

31:08 only. That now results in things behaving well. So in this case, in

31:17 fact, the manually managed data copying gets a little bit better performance than the

31:32 compiler-managed version, and the small difference also tells that, for what I showed for the

31:39 compiler-managed data version, I doubt that

31:48 the arrays are copied as much as it kind of looks like from the

31:55 output. The compiler, it seems, must also have basically done it once.

32:00 Otherwise, there would be a bigger difference. So, yeah. I

32:07 think this is a good stopping point for questions, so that's while you're thinking about

32:18 potential questions. Now again, the fact that there are these two memory spaces with

32:27 seriously non-uniform memory access, and that accelerators are typically connected to the host

32:38 via a PCIe bus that has a lot lower performance than

32:46 either of the two memory buses, is the problem. And managing and reducing the

32:53 amount of copying is one of the key things one needs to worry about in order to

33:01 get reasonably useful and efficient GPU code. Otherwise, you may not only be

33:10 disappointed; things may actually slow down. All right, so this is

33:24 more or less one than kind of that, and it was a tiny

33:28 in this case. So I think this is kind of a benchmark slide

33:36 for a few, um, actual application codes, you know, a very small variety:

33:46 well, fluid dynamics codes and some physics codes. But, you

33:51 know, it just shows that the compiler-managed, unified memory versions

34:01 most of the time are close to the manually managed ones. But sometimes it

34:10 doesn't do such a great job. So I guess, many times,

34:17 in terms of approach to the programming, it may be good to first use

34:24 the managed compiler option to get the code running, and then one can

34:31 start to try to figure out if one can make it do better by starting to

34:36 explicitly manage copying between the device and host. Um, Open

34:52 ACC, if I remember correctly, didn't initially have a parallel construct; it actually had

35:04 this construct, that is, the kernels construct, which in many ways is similar to the

35:12 parallel construct. But it is a construct that gives more freedom to the compiler to

35:21 figure out what to do. So, as I think I said on one slide last

35:27 lecture, the argument between the OpenMP community and the

35:35 OpenACC community is that the OpenACC guys say

35:41 compilers are very good at this point in time, so give it to the compiler to

35:48 figure out how to parallelize the code, whereas the OpenMP community is

35:54 a little bit more conservative, in the sense of being prescriptive, as opposed to

36:01 OpenACC, which claims to be more descriptive. The parallel construct is more in the

36:08 OpenMP flavor, and the kernels construct is more in the OpenACC,

36:14 let-the-compiler-do-things flavor. So in this case, what is shown on this slide

36:21 is that the kernels construct is used for the entire set of loop nests, both

36:30 of them. There is no explicit directive for each one of the two loop nests saying

36:35 that one wants it to be parallelized. It just says: here is a piece of

36:40 code, figure it out. And now here is the outcome. And,

36:52 yeah, so this is coming from a little bit different tutorial by someone.
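A single kernels region covering both loop nests, in the spirit of the slide described above, might look like the sketch below. The function name `jacobi_step_kernels`, the flat layout, and the data clauses are assumptions for illustration. The `restrict` qualifiers are an addition not taken from the lecture: they tell the compiler that `a` and `anew` do not alias, which a kernels region may need before it is willing to parallelize loops over plain pointers.

```c
#include <assert.h>

/* One Jacobi step (sweep plus copy-back) inside a single kernels region.
   The compiler decides how, and whether, to parallelize each loop nest. */
void jacobi_step_kernels(int n, int m,
                         double *restrict a, double *restrict anew)
{
    assert(n >= 3 && m >= 3);
    #pragma acc kernels copy(a[0:n*m]) create(anew[0:n*m])
    {
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++) {
                int k = j * m + i;
                anew[k] = 0.25 * (a[k - 1] + a[k + 1] + a[k - m] + a[k + m]);
            }
        }
        for (int j = 1; j < n - 1; j++) {
            for (int i = 1; i < m - 1; i++)
                a[j * m + i] = anew[j * m + i];
        }
    }
}
```

The error reduction is left out here to keep the sketch short; with kernels, the compiler is expected to detect such reductions on its own when it can.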

37:01 In this case, it's not speedup that is plotted but time, so

37:06 little bars are good, as opposed to bad. And it shows kind of the

37:11 difference for the CPU with the number of cores being used, and the code

37:17 did not parallelize all that well. Yeah, it sped up for two and four-ish cores,

37:24 but then six and more cores didn't really do much. And again, whether

37:30 they tried to maximize the performance for cores, who knows? Because this was

37:39 done by NVIDIA, and their interest is in GPUs, not in CPUs.

37:46 But we can see in this case that the kernels construct did a little bit better

37:51 than the parallel construct. And neither of them did well, which was the case

37:58 that we just talked about, when we didn't use the data construct to

38:06 make sure arrays were kept on the device. So in this case, neither the

38:11 kernels nor the parallel construct figured out to keep things on the device between iterations,

38:18 so there is a lot of copying. But take a look: nevertheless, the

38:25 kernels construct did a little bit better than the parallel construct. So here

38:31 is the compiler output telling what it did in the two codes.

38:38 I'll just try to highlight the differences between the pieces. So one of the differences

38:47 is in the inner loop: for the parallel construct the compiler was more conservative and

38:57 did not try to do the higher-level parallelism in terms of gangs. It

39:03 only vectorized the inner loop. Whereas for the kernels construct the compiler figured out

39:10 that it can also use the high-level parallelism in combination with vectorization for the inner

39:18 loop. And that was true for both of the two loop nests.

39:28 So, yes, this just shows that there are little blue bars at

39:35 the bottom of this, um, chart, and it's hard to tell the difference

39:39 in performance on the compute part. And the part above

39:46 is for the data copying that is done, and the gray is whatever else is

39:52 in the code. There is a difference in what happens with the

39:58 data copying between the kernels and the parallel construct. So in this case, for

40:07 the kernels construct, there was a bit less data copying going on, as

40:16 you can see in this case: it avoided the

40:28 copyin for Anew that was done with the parallel construct. So it

40:36 kind of knew that it was already on the device. It was conservative and still let

40:44 the host know what the outcome was. But otherwise it is the same.

40:52 So look at the bars: the kernels construct did a little bit less

40:56 copying than the other, but it's significant. And even with these

41:05 types of overheads, it still made the GPU code in this case superior in

41:12 performance to the CPU code. So, okay, that was just

41:26 to show. I don't have an explicit slide for using the data construct to preserve arrays

41:39 between iterations and to see the difference between kernels and parallel. But both constructs then do

41:48 a very good job, and again I hope you will have the time to try it

41:54 yourself. So I may update the slides and show the potential difference. And this is a lot

42:08 of text, but this, yes, more or less says what the difference

42:14 is: again, with kernels the compiler has more freedom than with the

42:22 parallel construct. It also means that sometimes it may actually fail, and it's not

42:30 an advantage if it's a fairly complex code. Compilers, as I have said, have to

42:38 guarantee correctness of the code, so they are conservative in terms of optimization and, in case

42:46 of doubt, nothing is done. And, in my mind, the parallel construct is

42:54 more prescriptive. So with parallel, the programmer takes responsibility and

43:01 says: go ahead and parallelize this, with the risk that if the programmer is

43:07 wrong, the code may not be correct. And this is more or less what

43:14 I just said in terms of these constructs. So again, just starting on

43:19 textbook. Put this text on the on this a little bit more allowed

43:26 manage things. We don't use it in the assignment that you will get.

43:32 It's just the very simplest thing, to get you to use the directive-based programming approach for

43:40 accelerators, I will stop in a more slides. Um, Thio suggest

43:51 do the demo of OpenACC code using the GPUs on the

44:00 Bridges cluster. But there are, um, sometimes when you do explicit management

44:08 of data, there are ways to make sure that the host can have

44:19 execution threads both on the host and the device at the same time, and

44:28 let the host do certain things that it is better at than the GPUs. In

44:33 particular, if there's really not much use of SIMD constructs or instructions, then it

44:40 may be as well to have the host do it, while leaving

44:46 other pieces of the code to streaming-type processors. But then you

44:54 also need to make sure for certain variables or arrays that they are in sync.
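Keeping an array in sync between host and device inside a data region is what the OpenACC `update` directive is for. The sketch below is an illustration under assumptions: the function name `mixed_host_device_work` and the particular host step are made up, and the host-side step simply stands in for work that suits the CPU better.

```c
#include <assert.h>

/* Device work, a host-side fix-up, then more device work, all inside one
   data region. The update directives keep the two copies of x in sync. */
void mixed_host_device_work(int n, double *x)
{
    assert(n >= 1);
    #pragma acc data copy(x[0:n])
    {
        /* Device: double every element. */
        #pragma acc parallel loop present(x[0:n])
        for (int i = 0; i < n; i++)
            x[i] *= 2.0;

        /* Bring the device values to the host before the host reads them. */
        #pragma acc update host(x[0:n])

        /* Host: a step assumed here to be a better fit for the CPU. */
        x[0] = 0.0;

        /* Push the changed element back to the device copy. */
        #pragma acc update device(x[0:1])

        /* Device: continue with the synchronized data. */
        #pragma acc parallel loop present(x[0:n])
        for (int i = 0; i < n; i++)
            x[i] += 1.0;
    }
}
```

Updating only the section that changed, here a single element, keeps the extra bus traffic small.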

45:00 And then there are explicit ways of requesting that they stay in sync, by either copying things

45:08 from the host to the device or copying things from the device to the

45:16 host. And this is just an example that I will let you look at,

45:25 because I want to leave time for the demo. Um, this

45:32 is in order to add more flexibility than just the given data construct for the

45:45 set of parallel regions. You can do it for kind of relatively arbitrary pieces

45:52 of code by using enter data and exit data statements, and the only requirement is that

46:00 these are matching pairs, so there always need to be matching exit data

46:07 and enter data statements. But there is a degree of freedom in where to place

46:17 the two directives. And here are kind of a couple of examples of where

46:26 you can place them, and here is a little bit more: that they

46:30 can even be in different functions, as long as the code execution path means that

46:40 the execution path encounters equal numbers of enter data and exit data.
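Matching enter data and exit data directives placed in different functions, as described above, can be sketched like this. The three function names are hypothetical; the point is only that the execution path meets one `enter data` and one matching `exit data` for the array.

```c
#include <assert.h>

/* Start the device lifetime of x: allocate and copy host values in. */
void device_open(double *x, int n)
{
    #pragma acc enter data copyin(x[0:n])
}

/* Work on data that is already resident on the device. */
void device_scale(double *x, int n, double s)
{
    assert(n >= 1);
    #pragma acc parallel loop present(x[0:n])
    for (int i = 0; i < n; i++)
        x[i] *= s;
}

/* End the device lifetime of x: copy results back and free device memory. */
void device_close(double *x, int n)
{
    #pragma acc exit data copyout(x[0:n])
}
```

A caller would run `device_open`, any number of `device_scale` calls, and then `device_close`; the array stays on the device in between.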

46:50 Let me see. I will talk briefly about this, very quickly, and then hand

46:56 over. So, yeah, here are a couple of other things that can be

47:03 used to kind of manage things and get better performance. One is,

47:13 when it comes to loops, collapsing loops, and the other one is what's now known as

47:18 tiling loops. And it has to do with trying to understand the architecture of the

47:26 memory system, and that goes back to what I talked about way back, about memory systems and

47:33 how the main memory, in terms of DRAM, is organized,

47:41 that there is structure and that it is by no means random access memory.

47:46 It is highly structured. So here is simply what collapse does. There's nothing magic.

47:56 It just tells the compiler, and it basically will collapse, in this case,

48:05 the two for loops into a single loop that has loop bounds that are the

48:11 product of the bounds of the two other loops. And here

48:19 is how one can use it in the Jacobi example that we have talked about

48:25 and used so much already, by collapsing the two loops in the two

48:31 parallel regions. And in this case, collapsing the loops didn't help

48:38 very much, but it bought a bit. And the other thing you

48:44 can do, and I'm not sure you knew about this tile construct, but basically it

48:50 considers iterations in the two loops together and treats them as tiles, so you basically do a few

49:01 iterations in each of the two loops. And then, um, you

49:10 basically partition each iteration space into blocks. So you get basically a loop nest that has, in

49:18 that case, four loops. On this slide you get two steps in each

49:23 of the two loops, and then you step through both of them

49:27 in the inner loops. In fact, it generates a loop nest with four loops.

49:33 And here is just another case, and here is again how one can use

49:38 it in terms of this Jacoby kind of that case using 32 by

49:44 Azaz against was recent beneficial and during cold baiting code and try to optimize

49:53 with respect to the memory system. here has worked some different stylings that

50:01 use four-by-fours up to 32-by-32s. And sometimes there was an

50:07 improvement, and sometimes there was a slowdown. Uh, I think,

50:14 so I guess in this case, with proper tiling, the best case was

50:21 or 10% above collapsing loops by using the tile primitive. And I think that's a

50:29 stopping point, I think. Very quickly: you can also,

50:37 um, decide to try to manage how many, um, gangs are being

50:49 used, and how many workers in each of the gangs, and the vector length

50:56 in terms of the vectorization. So the gangs, you remember, get mapped

51:03 to these streaming multiprocessors. Each one of these has, um,

51:15 a number of cores in it, and then the worker

51:24 level allows each worker to use several of the CUDA cores.

51:36 So again, it reflects the kind of architecture of the hardware, in terms of the CUDA

51:42 cores being grouped into SMs, and SMs being grouped into the graphics processing clusters.
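Requesting the gang and vector levels explicitly, along with how many gangs and what vector length to use, can be sketched as below. The function name `scale_rows` and the values 64 and 128 are illustrative assumptions, not tuned numbers from the lecture; a `worker` level with `num_workers` could be added in the same way.

```c
#include <assert.h>

/* Scale an n x m row-major array, mapping rows to gangs and the
   elements within a row to vector lanes. */
void scale_rows(int n, int m, double alpha, double *x)
{
    assert(n > 0 && m > 0);
    /* Gangs are distributed over the streaming multiprocessors. */
    #pragma acc parallel loop gang num_gangs(64) vector_length(128) copy(x[0:n*m])
    for (int j = 0; j < n; j++) {
        /* Vector lanes handle consecutive elements of one row. */
        #pragma acc loop vector
        for (int i = 0; i < m; i++)
            x[j * m + i] *= alpha;
    }
}
```

Whether explicit sizes beat the compiler's own choice depends on the code and the device, so it is worth timing both.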

51:52 I think that is using the NVIDIA nomenclature. And this is just explicitly

51:58 telling what type of parallelization one wants for the loops, and this is

52:04 then how many of them one wants of each one, and this is how it

52:09 could be used in the Jacobi code. And here, in this case, trying to do

52:15 explicit management didn't really help; the compiler did a better job.

52:20 And with that I will love past over Thio suggest there's a few more

52:28 in the slide decks that talk a bit more explicitly about the energy for

52:35 . Andi, if there's time Left on it, otherwise they leave it

52:39 you to look at it. And that, leave it to so

52:47 get start sharing my screen. so it's nice screen visible.

53:03 Okay. Awesome. Uh, so Z may have, uh,

53:11 And as I mentioned earlier, the Stampede2 nodes that we have access to

53:16 do not have a GPU. So in this case we will be using

53:20 nodes on the Bridges cluster, and this time we will be using the same

53:26 example that we basically saw in the slides just now. And interestingly enough, there are

53:34 a few examples that may not concur with the results that we saw in the slides,

53:38 so just keep an eye out. Well, let's get started.

53:45 So, since we've been using Stampede until now, here's a quick reminder of how you

53:50 can connect to the Bridges cluster. You can do it the same way as you do

53:56 at TACC. And on the Bridges cluster, if you want to get

54:03 access to the GPU nodes in the interactive mode, this is the

54:08 command that you will use. N is the number of nodes

54:13 that you want to get access to. Since we want access to GPUs,

54:18 you have to provide this flag as well, and in this flag

54:23 you provide the type of the GPU nodes that you want access

54:28 to. In the case of Bridges, we have two types of GPU

54:32 nodes: one that contains the P100s, which are relatively the newer ones, and

54:38 the other nodes contain K80s, if I remember correctly, which are the slightly older

54:45 GPUs from NVIDIA. And then the second parameter here is the number

54:50 of GPUs that you want to access. In this case we will be just accessing

54:55 one GPU. And then this is the time that you want to be allocated

55:00 for. As you can see, I've already run that command and I'm

55:05 on the GPU node. You can confirm that by seeing the prompt;

55:09 it should change from the login node. So, first thing before you start working with

55:16 OpenACC codes, you need to make sure you have a couple of

55:21 modules loaded. In this case, it will

55:31 be one of the PGI compilers, and we also need the CUDA runtime.

55:37 Since we're using the NVIDIA GPUs, make sure you have those two modules loaded.

55:44 This is the module for the PGI C compiler that is

55:48 used to compile the OpenACC code. Now this PGI module also provides

55:54 some useful commands, some of which you can see here.

55:59 There is pgcc, which is the PGI compiler for C, and there is

56:04 pgc++, which is the companion for C++ code. What

56:10 we're interested in right now is this one, pgaccelinfo. If

56:18 you run that, you can get information about the

56:23 GPU that's available on this particular node. You can see all

56:28 sorts of specifications. This is the compute capability, as NVIDIA

56:36 likes to call it: 6.0. You can also see the

56:41 ... You can also see if this GPU supports managed memory.

56:48 The one that's most important is this one, the

56:53 PGI default target. So we will need

56:57 this flag to tell the PGI compiler that we want to use

57:03 this tesla:cc60 target on the GPU when compiling

57:09 codes. In this case, cc60 stands for the compute capability.

57:14 this point which was just here on ground? Original Hendrick Benign Strip.

57:23 So, the first example that I will start with is, again, the

57:29 Jacobi code, and we will start with using the pragma acc kernels.

57:36 So when we compile and run this program, for both compiling and running we

57:43 will see one problem each with just pragma acc kernels, and we will, step

57:50 by step, work on the code as we go through. Now, here

57:56 I have just put pragma acc kernels around our main compute

58:03 block. When you want to compile the code, you can simply use the

58:08 PGI compiler, like pgcc. You also need to provide the flag

58:12 -acc, which tells it that you're using OpenACC. There's another

58:19 flag, which is -fast, which is similar to the optimization levels that we've seen

58:25 with the Intel compilers, -O1, -O2, -O3. In this case,

58:29 very likely, it's closest to the -O3 optimization level.

58:38 We also need to tell it that you're targeting that particular compute capability.

58:46 So that's cc60. Then there's another flag, -Minfo,

58:54 and with that we provide accel as the parameter so that we can get details about

59:01 what the compiler did in terms of the code for our accelerator. So accel

59:07 stands for the accelerator here. Then you can provide the name of the source

59:15 file. So in this case, that's the Jacobi source file. Mm, and just

59:25 okay, thank you. And then -o and our executable. Now

59:36 when we compile that, as I said, we'll see one issue when we compile

59:42 the code. And as you can see in this section here, the compiler, for

59:49 some reason, thought that our compute loop contains a dependency. And that's why it did not

59:58 perform any kind of parallelization for that loop. However, we

60:04 obviously know that there is no dependency, since we're reading A but not writing

60:10 it, even though these indices, i minus one, j minus one, make

60:15 the compiler think that there's some sort of dependency across the iterations, and so in this case

60:23 it did not perform any kind of parallelization on that

60:28 loop. So that's the first problem we notice. However, it did

60:34 generate an executable, so let's go ahead and run it. When we do

60:38 that, we'll see that there is another problem here: the code at run time

60:49 had some trouble trying to access the array buffers

60:57 that we defined in our code, and it gave an illegal address error

61:05 during the copy from device to host that was being performed, and in

61:09 this case asynchronously, as well as when it tried to free the memory on the

61:16 host. So let's first make sure we remove our first issue, that

61:24 is, making sure that this code does not have any trouble. But let's

61:31 start with removing this memory issue so that we can at least run the code. So the

61:37 simplest way you can do that is by providing the managed parameter

61:48 to the target flag. Right. When we do that, you still have this

62:00 dependence message from the compiler. But now we do not have

62:08 the memory access issue. So we solved one problem. Now we need

62:13 to solve the second problem: we need to make sure that our loop gets parallelized,

62:18 since we already know that there is no dependency in it. So what we can

62:22 do is, rather than using this pragma acc kernels directive, we can

62:29 use the pragma acc parallel loop directive. And as we just saw in the

62:33 slides, the difference between the kernels and parallel directives is that the kernels directive pretty much leaves it

62:39 to the compiler to decide what to parallelize and what not. But in this

62:43 case we know that there is no dependency. So we can explicitly tell the compiler

62:48 that we need to parallelize these particular loops. Yeah, so we can do

62:57 that. We can compile this code, and let's see what happens if we just remove

63:13 managed again. Mhm. Okay.

63:50 Oh, right. So as we just saw, in the previous case

63:57 at run time we got the memory access error. This time, even the compiler

64:05 tells us that this is not going to work; you need to do something

64:10 about the data management strategy. And when we provide the managed memory

64:19 parameter, the code was compiled and that particular loop was parallelized by

64:28 the compiler. Yeah, and we can already see that there is

64:43 a speedup between the kernels directive and the parallel loop directive. So

64:52 is there any question up to now? We solved two problems, mainly:

64:57 we made sure that our loop gets parallelized using the parallel loop construct,

65:04 and we made sure there's no issue with memory accesses by, at least for now,

65:10 using the managed memory. Any questions up to that? Okay,

65:26 then. We can already see that we're getting a good speed

65:31 up. But still, we're using this managed memory paradigm, so

65:40 we need to do something about that and have a good data management strategy.
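
The step from the kernels directive to the parallel loop directive described above can be sketched roughly as follows. This is only an illustration, not the code shown in the lecture: the grid size N, the function name jacobi_step, and the collapse(2) clause are assumptions. It relies on managed memory, as at this stage of the lecture, so it has no data clauses; a compile line along the lines of the flags discussed above would be something like pgcc -acc -fast -ta=tesla:cc60,managed -Minfo=accel.

```c
#define N 8  /* hypothetical grid size, chosen only for illustration */

/* One Jacobi step: every interior point of Anew becomes the average of its
 * four neighbours in A, then the new values are copied back into A.
 * The first loop only reads A and only writes Anew, so its iterations are
 * independent; "parallel loop" states that explicitly instead of leaving the
 * decision to the compiler as "kernels" does. No data clauses are given:
 * this sketch assumes managed memory. */
void jacobi_step(double A[N][N], double Anew[N][N])
{
    #pragma acc parallel loop collapse(2)
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            Anew[i][j] = 0.25 * (A[i][j - 1] + A[i][j + 1] +
                                 A[i - 1][j] + A[i + 1][j]);
        }
    }

    /* Second loop: copy the updated interior back into A. */
    #pragma acc parallel loop collapse(2)
    for (int i = 1; i < N - 1; i++) {
        for (int j = 1; j < N - 1; j++) {
            A[i][j] = Anew[i][j];
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the same loops serially on the host, which is a convenient way to check results.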

65:45 So in the next one, we can try to use this

65:51 pragma acc data construct and make clear what we need to do with the data.

65:57 So, as you can see, for A we're using the copy clause,

66:03 which amounts to allocating the memory on the device,

66:09 copying that buffer from the host memory to the device

66:17 memory, and then copying the output from device memory back to the

66:24 host. And then copyin does one step less; it's comparable to

66:31 the firstprivate clause of OpenMP: it allocates the memory on the device and

66:35 copies the data from the host. That was a

66:43 comment, but for that, you should have been able to use create.

66:49 That's right. Right, right. So we can. Okay, maybe

66:52 it's coming. Well, it's coming, but that's right. We

66:57 can use create. I just... Yeah, that was in your

67:02 slides, but yes, we can also use create for that, since we do not

67:05 need the initial values, which reduces the amount of traffic. Correct?

67:13 But yes. So now, at least for

67:21 this example, I'll keep it. Yes, we can do that. Uh huh. Yes.
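
The data clauses discussed above can be sketched as follows. This is an illustration under assumptions, not the lecture's code: N, the function name jacobi_iterate, and the iters parameter are hypothetical. It uses copy for A (allocate, copy in, copy back out) and create for Anew (allocate only), which is the variant suggested in the exchange above rather than the copyin the example kept.

```c
#define N 8  /* hypothetical grid size, chosen only for illustration */

/* Runs `iters` Jacobi steps inside one structured data region, so A and
 * Anew stay on the device for the whole loop instead of moving every step. */
void jacobi_iterate(double A[N][N], double Anew[N][N], int iters)
{
    /* copy:   allocate A on the device, copy it in at the start of the
     *         region, and copy it back to the host at the end.
     * create: allocate Anew on the device only; its initial host values
     *         are not needed, so nothing is transferred for it. */
    #pragma acc data copy(A[0:N][0:N]) create(Anew[0:N][0:N])
    {
        for (int it = 0; it < iters; it++) {
            #pragma acc parallel loop collapse(2)
            for (int i = 1; i < N - 1; i++) {
                for (int j = 1; j < N - 1; j++) {
                    Anew[i][j] = 0.25 * (A[i][j - 1] + A[i][j + 1] +
                                         A[i - 1][j] + A[i + 1][j]);
                }
            }

            #pragma acc parallel loop collapse(2)
            for (int i = 1; i < N - 1; i++) {
                for (int j = 1; j < N - 1; j++) {
                    A[i][j] = Anew[i][j];
                }
            }
        }
    }
}
```

Because Anew is only created on the device, its host copy is not updated by this function when it runs on an accelerator; only A carries the result back.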

67:31 And notice that now we will not provide the managed memory option. And the reason

67:39 is the OpenACC standard: it says that if you define your

67:45 own data management strategy using these pragma acc data constructs, then you should not provide the

67:52 managed memory option. If you do, what the compiler is going to do

67:56 is disregard all the data management that you may have applied in

68:00 your code and just use the managed memory for managing the data

68:09 movement. So make sure, whenever you apply your own data management, you don't add the managed

68:15 flag. Thank you. Wait, stepping back to this example: the copyin that we're

68:26 doing is redundant because we didn't need to copy any memory, right? We only

68:33 needed to allocate. Yes, we just... Yeah, so we can

68:37 replace this copyin by create. Okay. Um, but then we're like

68:44 making it better one step at a time, right? Well, at least

68:49 in this case, it's not going to get better. Spoiler alert. But let's

68:56 see. Let me just compile it. Yes, as you can see,

69:09 it generated a copy for A and a copyin for Anew, and it also parallelized

69:14 all the loops that we wanted it to. And when we run it, it

69:25 does not perform as well as it previously did with managed memory. However, if you

69:34 compare these two compilation outputs, and if you count

69:40 the number of data movements, like copyin and so on, that's exactly

69:47 the same for these two. So there's one copyin here and one copyout,

69:51 and one copyin here. So two copyins and one copyout.

69:56 And here it's the same: there's one copyin here, one copyin here, and

70:00 one copyout. So that's, again, two copyins and one copyout.

70:04 And these are redundant, since these are only executed if the data is not

70:09 present on the device, whereas it is present here. The main

70:16 point is, in many cases it could happen that with managed

70:23 memory, since it's managed by the CUDA runtime, the CUDA runtime can

70:26 actually do a better job at moving data between the device and the host.

70:31 And the reason is, when you use managed memory, the CUDA runtime decides when

70:38 a block of data is needed on the device, and it moves the data at

70:43 that particular instant when it's actually needed, or not at all. It can also apply some

70:49 optimizations in doing so, and so it's not necessarily true that

70:56 you will always get better results with your own data constructs. That's the only reason

71:02 I tried to keep that particular example. Is there any question? Okay,

71:14 if not, let's move to the next one. So here is what we just saw in

71:21 one of the last slides: that we can also use these pragma

71:28 acc enter data and exit data directives. What that mainly does is, if you have a

71:35 lot of code blocks, or your code is modularized, you don't want to put

71:41 data management clauses across all the blocks or across all the modules. What

71:46 you can simply do is define a section where your data enters the

71:54 device, generally speaking, and then for that particular enter data directive, you also

72:01 need to have an exit data directive. And in between that enter and exit section,

72:08 you can keep telling the compiler that, hey, these buffers

72:15 are already present in the device memory, using the present clause, so that

72:21 the compiler, or the runtime, does not have to worry about the data

72:25 being present in the device memory at this point in the execution. So this

72:31 is a very simple example, where we have pretty much everything in just one

72:38 function. A good example would be something like this. So if you

72:42 have a function that performs initialization, another that does the actual computations, another

72:51 function that does that second loop, which does the swapping of values of A and Anew, and

72:57 at the end, another function that deallocates the memory. So in this

73:01 case you can put an enter data directive in the initialization function, and the driver

73:08 function, like main, would call these other functions, and in those

73:14 functions you can use these present clauses, telling the runtime and the compiler:

73:19 don't worry about the data being present or not, it's going to be already

73:23 there, since we used this enter data construct. But it is

73:29 important to know that for every enter data directive, you need to have an exit

73:35 data directive; otherwise you get an error from the compiler or at run time. Um,

73:47 the compilation would be exactly what we had. Let's just go ahead and run this,

73:52 and you can see it's still slightly better than what we had

74:00 before. Quite a little bit. Are there any questions on

74:12 this? If not, then the last example. I have a question about the

74:20 present clause. Sorry, say that again? You said the present clause was there

74:27 to tell... Um, well, I don't know if I would call it a hint

74:31 to the compiler. Um, but it's letting it know that we don't need

74:37 to copy. Like it says, it's already present there. Yes. You're pretty much telling

74:42 it: don't worry about the data being present or not. We are guaranteeing that

74:49 these buffers will be present in device memory at this particular point in

74:53 the execution. Does that make sense? Yeah. Um, thanks.
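
The modularized pattern described above (enter data in an initialization function, present in the compute functions, exit data at the end) can be sketched as follows. This is a minimal illustration under assumptions: the function names field_init, field_scale and field_finish and the one-dimensional buffer are hypothetical, not taken from the lecture's Jacobi code.

```c
#include <stdlib.h>

/* Allocates and fills a buffer on the host, then starts its device
 * lifetime with an unstructured "enter data" directive. */
double *field_init(int n)
{
    double *u = malloc((size_t)n * sizeof *u);
    if (u == NULL) return NULL;
    for (int i = 0; i < n; i++) u[i] = (double)i;
    #pragma acc enter data copyin(u[0:n])
    return u;
}

/* Compute function in another "module": present() states that u is
 * already in device memory, so no transfer is requested here. */
void field_scale(double *u, int n, double s)
{
    #pragma acc parallel loop present(u[0:n])
    for (int i = 0; i < n; i++) u[i] *= s;
}

/* Ends the device lifetime that field_init started and copies the
 * results back to the host. The caller still frees the host buffer. */
void field_finish(double *u, int n)
{
    #pragma acc exit data copyout(u[0:n])
    (void)u; (void)n;  /* silence unused warnings when pragmas are ignored */
}
```

A driver would call field_init, then any number of compute functions such as field_scale, then field_finish, and finally free the host buffer.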

75:08 Here is a question for all of you. So this is a simple matrix

75:15 multiplication program, and these are the three loops that you may have become familiar

75:19 with by now, after your assignments. So the question here is... there are two questions.

75:28 First, do you think that using these loop constructs will give us a speedup?

75:35 And the second question is, will you get a correct result or not?

75:57 Anyone? Okay, let me just run this and see what happens. So this

76:17 will compile without an issue. But what's going to happen is... this was the

76:22 execution time. However, it failed: one of the elements, or there may

76:28 be many elements, was expected to have a certain value, but in the result

76:35 that was computed they had some other value, because there was something wrong with

76:39 our code. The one thing wrong with our code is that C[i][j] was

76:43 being accessed by multiple threads, since you tried to parallelize your innermost loop,

76:52 which caused a race condition for this particular element across

76:59 threads, and which may have resulted in an incorrect assignment of the values to that

77:08 element. And as we have seen, threads do not synchronize unless explicitly

77:16 told to. So the simplest solution to this particular problem is

77:23 something which we also saw in the case of OpenMP: using the reduction

77:29 clause. So we pass plus as the operator to be applied, and

77:35 if you remember, the variable you pass to the reduction clause goes in as

77:41 private. And so if you compile and run this code, we will see

77:48 that not only did your code finish in less time, but it

77:53 also passed. So the main motive for this example was that, even though you

77:59 may have applied all the parallelization strategies you know of and it is running

78:04 fast, you should also make sure that you're getting a correct result out of

78:09 your code. So don't just get complacent when you get your code running in a lower

78:15 time; you should also make sure you're getting correct results. That

78:24 was pretty much it. Do you mind if I ask you a

78:30 question? So I understand why it produced incorrect results, right? Because there was

78:40 a race condition in the innermost loop on C at i, j. Um,

78:45 what I don't understand is why there was a speedup on the last one.

78:51 Well, the main speedup is because, I believe, and someone may be able to

78:57 explain it in a much better way, there is a term called cache

79:02 thrashing, or just thrashing, I would say: multiple threads are trying to access a

79:10 single element. And when that happens, the data element may be accessed by

79:20 one thread. However, before that particular thread gets to update it, some other thread

79:26 updates that particular element. And then this thread has to update its own

79:31 copy. And so, in this process, there are a bunch of memory reads going

79:38 between the caches and the main memory. So I don't know if

79:42 I'm doing a good job explaining it; Dr. Johnson may have

79:45 a better explanation. No, that totally makes sense. It didn't occur to me

79:51 that with the race condition there's a possibility of thrashing. But

79:58 if only one thread accesses that at a time, then it would make much more

80:01 sense that there was no thrashing, right, which is what we do by introducing

80:07 the reduction. Very, very good. Okay, that makes sense. Thank

80:13 you. If there are no questions... Okay? Yeah. Yeah.
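
The reduction fix for the matrix multiplication example discussed above can be sketched as follows. This is an illustration under assumptions: the size N, the function name matmul, the scalar named sum, and the data clauses are hypothetical and may differ from the lecture's code. The point is that the innermost loop accumulates into a private partial sum per thread instead of having many threads update C[i][j] directly.

```c
#define N 4  /* hypothetical matrix size, chosen only for illustration */

/* C = A * B. The i and j loops are independent and run in parallel.
 * The k loop accumulates into one value, so it is marked as a "+"
 * reduction on a scalar: each thread gets a private partial sum and the
 * partial sums are combined at the end, which removes the race. */
void matmul(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma acc parallel loop collapse(2) \
        copyin(A[0:N][0:N], B[0:N][0:N]) copyout(C[0:N][0:N])
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            #pragma acc loop reduction(+:sum)
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}
```

Checking the result against a value that is known in advance, for example multiplying by a scaled identity matrix, is a cheap way to catch the kind of silent error described above.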

80:32 I guess that at the high level many things are not that different between OpenACC and

80:40 OpenMP in terms of how to write the code, and a lot of it

80:44 has to do with managing data, or memory, as I've been trying to

80:54 emphasize throughout this course so far, and that's why it's important to be

81:02 quite knowledgeable about the architectures in order to get decent performance

81:09 and try to figure out how to help compilers and other tools manage it, or express more

81:18 of the intent. Okay, yeah. With this,

81:36 I'll stop talking at this point and take some questions.
