00:00 Yeah. Okay, so, last time I had just gotten started talking

00:09 about heterogeneous computing on heterogeneous nodes. So I'll pick up where I left off

00:21 last lecture, and let's see how this goes. So here is a little bit of a

00:29 layout of the time for today. The first couple of points here are just a

00:36 recap of the last few comments of that lecture, and then I'll talk a little

00:41 more about the programming aspects of heterogeneous nodes, in particular about something called OpenA

00:49 CC that some of you may be familiar with. It's kind of in the same spirit as

00:54 OpenMP, and I'll try to point out a little bit the differences and commonalities, and

01:04 why we use OpenACC for anything that is, uh, heterogeneous.

01:13 And for this class, I would say it is, uh, the

01:19 case, as you have seen it so far in terms of Stampede, for

01:25 instance, that there is some attached processor, maybe a GPU or an FPGA or some other

01:32 device. Uh, but for this class, uh, it will be GPUs.

01:40 So here's the point I tried to make towards the end of last lecture: this

01:45 is kind of the node architecture at a high level, if you like, that

01:53 will be the basis for the next assignment. And that's typical in terms of

02:01 what you may find in lots of nodes these days when you use GPUs

02:08 or potentially even FPGAs or some other device. So, as I pointed out,

02:16 the main difference, or differences, are that there tend to be, now, two

02:24 memory spaces and two instruction sets. So we'll come back to that in

02:33 the lecture, and then the next slide is examples, very quickly, in case you

02:41 haven't, um, come across them except using them through some web interface.

02:48 But here's what a kind of GPU module may look like, and they

02:58 attach or connect to this bus, the PCI Express bus, and

03:06 you can see in this picture the, sort of, golden type pins at the

03:10 edge of the card. That's what plugs into this bus, and you

03:15 can also kind of see in the left-hand corner here that GPUs tend

03:20 to be power hungry, and it takes a lot of cooling. So a lot

03:26 of what you see is, in fact, the fans. And in addition to

03:32 the GPUs, um, we're going to use the Bridges computer this semester,

03:41 but we also have them at the Science Institute. Jovic, and I think

03:48 punch, has GPUs on it. Many of the cloud providers, Amazon

03:57 and Azure from Microsoft, many of them do have GPUs in some

04:02 of their nodes, so they are commonly accessible, in many shapes and forms.

04:10 Then, um, we have FPGAs sort of beginning to show up and

04:17 become a little bit more common. Tools for programming FPGAs have been improving

04:24 quite a bit over the years, and they offer some benefits as kind of a

04:30 compromise between a fully custom piece of silicon and a standard CPU. And you can

04:40 get them on PCI Express cards, or you can do what Microsoft did, and

04:47 Alibaba, the other one, the big Chinese cloud and, uh, Internet company:

04:55 they also use FPGAs for their search engines and some of their other functions,

05:01 and in terms of Microsoft, also through their cloud service. We will

05:08 not use FPGAs in this course, uh, but we should be aware:

05:13 that's another element in terms of heterogeneous devices that are now rather readily available,

05:19 through cloud services. And then the last example I have in terms of the variety

05:25 of accelerators is, so those of you who are interested in machine learning: Google

05:33 designed their own TPU, a tensor processing unit, that is then used to support things

05:42 like TensorFlow. And these days, they don't sell these units, but they do offer

05:48 access to them in terms of their cloud platform, and it has the benefits Google

05:54 claimed over GPUs. That's why they did it, so that's in a custom piece

05:59 of silicon, um, and they're on their third generation, to date.

06:04 But it means, basically, programming heterogeneous systems. And then I want to just, uh,

06:13 briefly point out that in terms of embedded computing, this has been the norm for

06:18 a long time. In fact, in that case you don't use connectivity over

06:24 an I/O bus. Everything is on the same piece of silicon, and that is

06:28 a big difference in terms of both performance aspects and programming aspects, which

06:35 we will not have time to review in this course. But here is one example from Texas

06:40 Instruments that, um, for a while also did chips for mobile phones,

06:47 but they stayed, they're more now into, uh, digital signal processors. But

06:55 there's usually, you know, one kind of CPU core. There's the low-power design

06:59 that was produced by Arm, which you may have seen a lot of headlines about in recent

07:04 years, also in terms of high performance computing, and that NVIDIA is in the process

07:11 of buying. But it shows that they have done a number of different functional units

07:17 that need to be programmed. There's also one from Qualcomm in terms of their mobile

07:23 phone chip designs that has a number of processing engines on the same piece

07:28 of silicon. And here is another one that's from Intel, an older one

07:33 from before they stopped doing mobile, and, as you can see towards the

07:37 lower right-hand corner in the picture, it says GPUs are included together with a CPU

07:44 core set, of course. And by the way, that may not be something you

07:51 necessarily think of, but Intel is in fact the largest producer of GPUs.

07:58 But as of yet, they do not have a kind of discrete component; their

08:06 units are all integrated on a piece of silicon. But at about this time they

08:13 claim that they will release their first, well, now, stand-alone GPU, to be bought

08:19 separately and to be connected over an I/O bus. So they decided to seriously step

08:30 up the competition with NVIDIA and AMD in terms of having discrete GPUs. And here

08:39 is another one, by AMD, one of their integrated, um, GPUs on the

08:46 same piece of silicon. These were just examples, so to say that,

08:50 uh, GPUs also exist as integrated on the same piece of silicon, but that's not

08:59 something I will cover in this course: how to deal with those and how to

09:03 program them. Any questions on this in general, kind of, of where you find

09:12 accelerators of some flavor, mostly GPUs? [Student] In terms of communicating the work to be

09:26 done by the accelerator, what are the implications of having an attached versus an integrated one?

09:35 [Instructor] It's a huge difference. Uh, I'll come back to it, but very quickly at

09:42 this point in the lecture. So the ones that are integrated, AMD used to call

09:50 them APUs, application processing units, with the graphics processing and CPUs and

09:59 other accelerators on the same piece of silicon. And so the biggest difference, I

10:09 would say, and I will come back to that, is that when it's integrated on

10:13 the same piece of silicon, the different devices tend to have access,

10:25 equal access even, to the same memory, which is not true when it comes

10:31 to the attached processors or accelerators. Uh, and it also means that the data

10:40 paths between CPUs and accelerators are shared. So even though the

10:51 instruction sets are different for the different devices, it's a lot more homogeneous in terms

10:59 of the kind of silicon infrastructure that is being used by the different computational units,

11:06 as opposed to when you have attached devices. So it affects how the programming is

11:13 done and the tools being used, as well as the performance, and I'll try to point

11:22 that out as we go here in the next few slides. So this is,

11:30 I think, the last slide I showed last time. It just tries to point

11:36 out that in terms of parallelism, or the number of threads that can be supported, I

11:50 guess between CPUs and GPUs there is a huge difference. So typically about two orders of

11:57 magnitude, um, difference. And, for that, the important

12:06 part, I would say, is this, and I'll come back to it,

12:09 too: it is that to get full, or high, utilization of the streaming processors that GPUs

12:19 use, and that will be the focus for the rest of the lecture, you really need

12:24 to have your application capable of exploiting SIMD, uh, instructions. And, as you can

12:38 see, if you look at the last column here, basically there is not a

12:43 huge difference in terms of peak performance between CPUs and GPUs.

12:53 Yes, a factor of five is by no means nothing, but it's

12:56 not orders of magnitude. And whether you actually get this factor of five or not

13:02 , it's highly dependent on whether you can actually get vectorization, or SIMD, to work for

13:09 your application. And also, in this case, the GPUs that I put on

13:16 the slides are the ones that are designed specifically, I would say, to compete

13:24 with server CPUs. So there are other GPUs that may be more

13:33 focused on supporting machine learning, and in that case, they may still have

13:40 the single precision capability that is shown on this slide, but typically their double precision

13:46 performance is way lower. So that's something to keep in mind about the

13:54 nature of the devices and what it takes to get good utilization of them in

14:00 terms of the nature of the application and the code being generated. So here

14:09 we are coming, a little bit, I think, to help answer the question that was

14:13 asked. But I think this little picture on the right of the slide

14:19 tries to illustrate some of it, and the text kind of makes it kind of

14:27 concrete. So the attached GPUs, and it's true for the integrated GPUs too, they are

14:40 not complete processors, so they all need a host. A CPU, so to

14:46 speak, is totally standalone. It doesn't need anything else: it has everything.

14:52 It has all the instruction decoding as well as memory. It does everything needed to

14:59 execute code. GPUs have much more limited capability, um, in terms of flexibility in

15:13 dealing with code. So that's why they basically need the CPU.

15:18 So that's the one thing that, uh, it's important to keep in mind. Then,

15:27 as when I talked in, in the last lecture, talked a little bit

15:32 about GPUs, and I talked about cores, and so far most of it has been focused

15:38 on CPUs and their cores, and as it said in the previous slides, you

15:44 maybe have up to a few tens of cores, of course, on the piece of silicon

15:50 that is the CPU, whereas in terms of cores, when it comes to

15:56 GPUs, they tend to be in the thousands. So again, the level of

16:03 parallelism you have, and that you can exploit, is, you know,

16:10 up to two orders of magnitude higher. And one of the

16:17 advantages of GPUs has been that they have been, typically, ever since they, uh

16:29 , first started to appear as cards you could put on an I/O

16:33 bus, they had 5 to 10 times higher memory bandwidth than what, specifically, CPUs had.

16:44 So one of the big advantages for GPUs has been how much memory bandwidth you

16:50 get. On the other hand, coming back to the question that

16:55 was just asked: if you have an integrated GPU, it uses the same memory.

17:01 That means, yeah, it doesn't have the advantage of significantly higher memory bandwidth.

17:07 So it just has, its bandwidth is the same as that for the CPU.

17:17 The other difference that is important to keep in mind is that GPU memory

17:25 tends to be a lot smaller than the memory on a CPU. So today, the kind

17:36 of high-end GPUs may have up to 32 gigabytes. Um, whereas,

17:41 on the other hand, CPUs may have terabytes of

17:43 memory. It's not the typical case, but there is nothing that prevents you from configuring a

17:51 node with terabytes of memory. And some of the nodes on, let's say, supercomputers,

17:59 some of the richest nodes, they have terabytes of memory. So, and then the thing

18:08 that connects these is this I/O bus. And last time I talked a little bit

18:12 about the PCI Express bus and showed that it is kind of a very thin pipe

18:19 compared to the memory buses, even for the CPU, as well as, even more

18:26 so, compared to the GPU's. So here is kind of the model of

18:36 how heterogeneous nodes, in terms of the attached processors, work. So basically,

18:48 things start and end on the CPU, and in order to get anything done,

18:53 one has to... the application code, and some initial data,

19:01 it starts out on the CPU, in the CPU memory, and it needs

19:05 to be moved over to the GPU; that's what we call, for

19:12 this class, the device. Then you have to move the code over, and then it can

19:19 start the execution, and then the two typically can proceed asynchronously, the kernel on the

19:26 device and whatever it is the CPU may want to do. And at some

19:31 point, the results are supposed to come back to the CPU. Now, in general,

19:40 since the GPU memory is significantly smaller than the CPU memory, it is often

19:48 not possible to move all of the application data over to the GPU before execution starts

19:57 , but it actually has to be done in phases, where things get moved over and

20:03 maybe come back to the CPU. So there's maybe a fair amount of interaction between

20:10 the CPU and GPU during the execution of the kernel, in order to be able

20:16 to process the entire data set. And this is just reemphasizing, uh, how it kind

20:26 of works. Things start on the left-hand side and get moved over to the

20:30 GPU, possibly in phases, and the PCI Express bus may have a

20:38 severe performance impact. In fact, it depends on how much computation you can do in

20:44 the GPU per transfer of data between the CPU and the GPU. And one has to watch

20:52 out, when you read the literature, whether people are actually telling you the full story in terms

21:00 of reduction in compute time, or speedup: whether it just looks at the GPU execution

21:09 by itself, or whether they actually include the bus transfers on the I/O bus to

21:15 get the total-time speedup. So one needs to be careful: if

21:22 there is not much computation, code that even in itself may speed up

21:28 , uh, by a large factor may totally get killed, sort of, by the

21:35 slow transfers between the host and the device. Ah, any questions on that general picture, or on understanding how the structure works and the trade-offs?
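
(To make that flow concrete, here is a minimal sketch of the offload pattern in OpenACC, which is introduced later in this lecture. The function and array names are made up for illustration; the data clauses mark the explicit transfer phases over the I/O bus.)

/* Sketch: data moves host -> device, the kernel runs on the device,
   and only the result moves back. If the computation per byte moved
   is this small, the bus transfers can dominate the total time. */
void scale(const double *a, double *b, int n, double s)
{
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop   /* executes on the GPU */
        for (int i = 0; i < n; i++)
            b[i] = s * a[i];
    }
    /* here b is back in CPU memory */
}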

21:48 [Student] So the arrow that points to the right says "offload". That's the,

21:56 ah, the instructions, or, I guess the way

22:03 you were phrasing it earlier, the code that's

22:07 transferred to the GPU, right? And, of course, the other

22:12 one, the one that just transfers data, is, uh, the data on which those instructions will

22:20 operate? Right? Um, so are they using the same PCIe

22:25 lanes to do that? [Instructor] Yes. So it's, um, it uses the

22:36 same PCI Express lanes. There is no difference in terms of lanes being used for

22:45 code and lanes being used for data. But it's a good question. So

22:54 this PCI Express bus is maybe often 16 lanes wide, and you use

23:02 all 16 lanes, both for code and data. Sometimes there may be two, uh,

23:09 but the common case is, depending on what the device is that's being attached, four

23:17 or 16 lanes wide. So here's a little bit on programming, because again, each

23:27 device, the GPU or FPGA, and for that matter the TPU, has

23:33 its own instruction set. So, as I said early on, there are different instruction sets

23:41 , as many of you will know. And maybe some of you have

23:46 used CUDA for programming NVIDIA GPUs. CUDA is, however, proprietary,

23:54 uh, so it doesn't work on competing GPUs. Um, that's why I have

24:02 stayed away from using it in this course. OpenCL is an open standard

24:09 that, uh, is supported, in principle at least, by several vendors. It

24:18 was initially driven by Apple and AMD, and had quite a few vendors,

24:26 uh, buying into the thing, including Intel and NVIDIA. But in the case

24:34 of NVIDIA, they focus on CUDA, and OpenCL is a bit of a

24:39 stepchild. I would say Intel has been a little bit more forthcoming in terms

24:47 of trying to support the construction of good compilers for OpenCL, and

24:54 AMD has also been, um, a good supporter of OpenCL, but it

24:59 hasn't had the financial resources of NVIDIA or Intel, so some of it hasn't happened.

25:05 So OpenCL has kind of been... it has improved over the years, but, um

25:12 , it's still a little bit of an issue to use it. And that's why,

25:17 for this class, I decided not to use it, in part because of the availability of tools

25:22 for it. OpenACC, that I will focus on for the rest of

25:27 this class, is something we'll use, and I'll give more background on why in

25:33 the next few slides. The other thing that one needs to pay attention to,

25:39 as that was the focus of the last lecture, is the ability of compilers to generate vector

25:49 code, or SIMD code. And that is in particular critical for GPUs because it'

25:54 s the basis for getting good performance, um, on GPUs. And OpenCL was again

26:04 designed to support generating code for GPUs in a good way. But, as

26:10 I said, its compilers have not had the level of sophistication of the compilers

26:20 for CUDA or OpenACC. And then we talked, you know,

26:28 for CPUs, about OpenMP as one of the main programming paradigms for them.

26:35 So, um, a little bit about OpenACC and, and OpenMP: how

26:46 this came about, and then the differences, a little bit, and then

26:51 I'll talk about OpenACC. Part of the purpose is that it would be, uh,

26:57 what you use for the next assignment. Look, so, a little bit of what I'd

27:08 say on where each one of these came from: OpenMP was started by the,

27:20 uh, whole community, a big, broader community. And it from the start was an

27:30 open standard, uh, with both academics and companies supporting the idea of OpenMP as

27:39 a way of programming multicore chips. And, as I said when I started to

27:48 talk about OpenMP, it was to try to create a simplified way, or layer, on top of,

27:55 say, POSIX threads, for instance, to make it a little bit more tractable to deal with

28:03 multithreaded systems. So it was very much focused on, again, how to make it

28:12 easy to use and inherit the properties, uh huh, of those types of multicore

28:21 chips. And that meant it was kind of designed as, like, say, a

28:27 prescriptive system: you tell, a lot, what you want the system or the compiler to

28:36 do. And now, over time, in addition to the many cores on, on

28:46 CPUs, also GPUs made it in, er, and accelerators, which are becoming quite common,

28:59 and, um, OpenACC was started by a few vendors, with NVIDIA

29:09 being one of them and Cray being another. And those were the two, I

29:18 guess, main companies that drove it. But they kept it proprietary,

29:24 because at the time Cray's computers, kind of the high end of high-end systems, and the ones

29:31 they had, had also started to use GPUs. And that was at the time when

29:37 they, they didn't have much competition in terms of GPUs used for engineering and scientific

29:45 computing. As I said, Intel is still the number one producer of GPUs,

29:50 but they're all integrated on a piece of silicon in terms of laptop or desktop

29:58 CPUs. And AMD, that has been a significant GPU manufacturer as well,

30:07 they focused on the gaming market more than NVIDIA. I would say both did well

30:12 in that market, and they were more or less, I would say, equals

30:17 there. But AMD did not use that base in terms of trying to build something for

30:23 the data center or scientific and engineering computing. So, so basically, OpenACC was

30:29 a separate, proprietary effort for a number of years, and I think after five or

30:35 six years, they, like many others, realized that proprietary is not necessarily a good

30:41 idea for widespread acceptance. So at this point, OpenACC is also

30:49 an open standard. But OpenACC, as I said, started with NVIDIA

30:55 being a strong driver. So that meant they tried to figure out how to make

31:00 programming, uh, the heterogeneous nodes with GPUs being, ah, a

31:10 little bit easier than using CUDA and programming the GPUs directly. So at the

31:18 very highest level, the idea, what's the same? They're using directives and trying to put

31:24 their layers on top of it, to make the programming of accelerators, or GPUs

31:32 , somewhat easier. But in the end, the starting point had a

31:38 difference from the OpenMP standard, that is, a different notion of threads, the

31:47 capabilities of the cores being significantly different: the GPU cores, I would

31:55 say, at that time were, compared to CPU cores, exceedingly simple

32:02 . So OpenACC started with massive parallelism and very simple threads in terms of the

32:11 capabilities of the cores, whereas OpenMP started at the other end.

32:17 But as it says on the slide, starting about five years back with OpenMP

32:26 4.0, OpenMP then started to also figure out how to extend the capabilities of Open

32:33 MP to deal with accelerators. So in the OpenMP, I

32:44 guess there's now a 5.0 standard, it has many of the features, uh, that Open

32:50 ACC has, and vice versa: OpenACC has many of

32:55 the features that OpenMP has. There is still a difference in the models, so

33:06 to speak, in that OpenMP tends to be more prescriptive. And the

33:13 idea, at least the argument from the, uh, OpenACC folks, is that

33:18 their approach is descriptive and leaves more room for compilers to figure out how to

33:26 generate good code. And I will try to show some examples of what capabilities the

33:33 OpenACC compilers have later in the lecture. So here's a little bit more

33:41 on the history. The two efforts were at points independent, um, and the

33:49 idea was in the community that both have their merits, and at some point,

33:58 uh, these two efforts should be kind of merged or integrated into one, single

34:07 , approach, if you like, with one set of compilers being capable of, or having the

34:12 best of both worlds. That didn't quite happen. Um, the OpenMP community, as

34:27 I said, was focused on multicore for a long time, until version 4

34:35 of OpenMP. And, as you know, Intel being the dominating player in

34:40 terms of CPUs, so they were kind of highly focused on

34:48 making OpenMP a good tool for use, um, of the cores on their CPUs.

34:55 And then they started to branch out, because today, Intel is also interested in

35:02 accelerated systems. So they, as some of you may know, they even have

35:09 chips that have integrated FPGAs on the same piece of silicon, or in

35:14 the same package, I should say, rather, as the CPUs, and they're about to

35:20 release stand-alone GPUs. Meanwhile, OpenACC: on top of the

35:32 two hardware manufacturers that are in the OpenACC consortium, there

35:36 was also a compiler software company, PGI, or the Portland Group, which was one of the

35:47 independent compiler companies. They built compilers for CPUs and

35:53 heterogeneous systems. Then, a few years back, they were acquired by NVIDIA. So PG

36:01 I, now owned by NVIDIA, is, I would say, really highly focused on making

36:09 sure that code compiled using their compiler runs really well on NVIDIA

36:18 GPUs. It means that it also has to work reasonably on CPUs, because the

36:26 attached versions of NVIDIA GPUs need a host, and most such

36:35 systems use Intel CPUs. So they generate code for CPUs as well. But

36:43 in the end, um, NVIDIA started out, and remains, focused on selling GPUs

36:51 , so that's where their focus is. So this kind of merge did not

36:57 happen. So, unfortunately, I would characterize things as a bit of a mess

37:09 in terms of programming attached processors today. So, um, AMD,

37:25 for us as end users, has had a pretty strong comeback from a couple of

37:37 serious downturns in their business, and is now competitive both in terms of CPUs and GPUs

37:45 , and across the entire range of CPUs and GPUs. Um, so, and they have sort

37:55 of not abandoned OpenCL, but they are now pushing an open source effort known

38:05 as the Radeon Open Compute, uh, initiative, and they have some buy-

38:15 in in terms of that, and are making progress in making things usable on both Intel and AMD

38:23 CPUs, as well as, um, their own GPUs. And the, uh,

38:34 the reason is clearly that they also, they want to make sure that code written

38:41 for NVIDIA GPUs is reasonably easy to port to their GPUs, just for

38:46 marketing reasons. Intel has, you know, been pushing the more recent versions of

38:55 OpenMP, to the point of being, in the end, able to generate code also for

39:04 , um, GPUs, uh, and recently they have started another initiative that I

39:12 may talk a little bit about later, but it's not out there yet in more

39:18 than an early version: something they call oneAPI, which is supposed to be

39:24 basically based on OpenMP, and the idea is to be able to use

39:33 kind of the same source code to target CPUs, GPUs, FPGAs, and

39:40 other accelerators. And as I said, uh, NVIDIA, they're kind of on

39:47 OpenACC that generates CUDA. IBM, they don't really have

39:56 too much of a stake in these wars; they have kind of been predominantly building

40:04 compilers for OpenMP. But because their high-end systems, as you have seen,

40:13 have been focused on using NVIDIA, uh, GPUs, so they kind of

40:21 do CUDA code, or map code onto, uh, the GPUs. And it also means that

40:31 sites like Pittsburgh that don't have IBM hardware, they don't have

40:39 necessarily good compilers, OpenMP compilers, that could support NVIDIA or other GPUs.

40:50 And then Cray, they now have also their own compilers, and their customers have

40:59 been using, uh, NVIDIA until now. Now some of their customers are actually starting

41:06 to use AMD GPUs, so we'll see how that kind of war plays out.

41:13 And then finally, GCC: they have so far basically been focused on Open

41:21 MP, the open standard, which has a little bit broader backing than

41:25 OpenACC, also open. So it's unfortunate, but in the end it all

41:34 depends on what platform you're on, what compilers are available for that particular platform

41:43 . In our case, we use Bridges, which has NVIDIA GPUs

41:51 and Intel CPUs. And so there, the best combination for Intel CPUs

42:02 and NVIDIA GPUs is the OpenACC compilers. So any questions on

42:14 that? So that's the reason why we ended up... not just, and why I

42:20 can continue to talk about OpenMP, too, but switch and introduce OpenA

42:27 CC. [Student] So it might be a bit off topic, but would you say that

42:34 the job market for compiler engineers is alive and well? [Instructor] I hope that is the

42:43 case, and I, I'm quite sure it is. Um, now, on your

42:52 question, there's a bit of history. Software, and in particular software in the

43:03 form of tools, and I will count compilers as a tool for the programmer,

43:10 has been a difficult business. As I said, PGI was an independent compiler company.

43:18 They were acquired by NVIDIA a few years ago; I don't remember exactly when. One

43:26 of the earlier compiler companies was Kuck and Associates, and they were an

43:36 independent compiler company that did quite well. Uh, and when Intel went

43:43 to multicore chips, they realized that the programming of these is a lot more

43:51 complicated, and "we need better tools". And then they started to, because they

43:57 had the money, build up a, uh, software suite, or software effort, and they

44:05 acquired this company, Kuck and Associates. So today, I don't know too many,

44:12 I don't know of practically any, independent compiler companies. So Intel has an effort, IBM has

44:21 efforts, NVIDIA has efforts, and GCC has efforts. And the complexity of, uh,

44:29 modern systems is increasing, and to the extent they have the money, they are

44:35 spending a lot on building very good tools, including better compilers. So, so, you know

44:47 , from my perspective, I hope some of you get into that business, because to

44:54 make, uh, the productivity of generating good code higher would benefit the entire community

45:09 . So, on to talk more specifically about OpenACC. It's like

45:16 OpenMP: the same idea, directive-based. So, and here is

45:23 basically the additional complication, as I said: compared to the original OpenMP, things are

45:32 quite different, not only in terms of the target, um, core capabilities being quite

45:44 different, uh, but there are two instruction sets and two memory spaces. So

45:52 even though, in principle at least, there are two kinds of devices, the CPU and the GPU

45:59 , one needs to generate code for two different things and figure out how to manage

46:06 these two memory spaces. So, right, uh, both when they

46:15 extended OpenMP, 4.5 and, uh, 5.0 now, there are then directives

46:22 to also tell compilers, like OpenACC does, what is supposed to be generated as

46:30 code for an attached device versus code for the host device. So in concert

46:39 with the compiler-generated code, the runtime system then, uh, manages

46:47 what gets executed where, and the data, that is, and code transfers. So

46:57 here is just a bit of the structure. This is taken from an IBM

47:02 presentation: they have the piece of code that is targeted for their POWER-type CPUs

47:11 , and then the tool generates intermediate code, and then they optimize that for their own

47:17 CPUs. And then, based on the directives that tell what's supposed to be, in

47:22 this case, something for NVIDIA GPUs, that part of the code gets handled

47:31 by, basically, NVIDIA tools, compilers, to optimize it for, uh, NVIDIA GPUs and then generate

47:41 code, a code for their GPUs. So it's kind of an integrated system, but it

47:47 kind of makes use of two different compilation processes and code generators in order to

47:54 eventually come up with pieces of code that get linked together. And then the runtime

47:59 system knows what's supposed to be executed where. And this is what, yes, okay

48:08 . So here's a little bit of what the claim is of OpenACC, and just

48:13 in case, as you may know, and I just mentioned it: POWER is

48:18 the name that IBM uses for their high-end processors. Sunway, if I

48:26 remember right, is one of the Chinese ones. x86: both Intel and AMD

48:35 support the x86 instruction set; it's not identical instructions, but the

48:41 core is the same. Um, now, I think this slide, as far

48:50 as I can tell, is only partially true, because once PGI

48:59 , that was again the compiler company behind OpenACC, was acquired by

49:03 NVIDIA, I think they stopped evolving code generation and optimization for AMD GPUs, and

49:14 then they entirely focused on NVIDIA. Um, the structure is very much the

49:22 same, uh, that again, directives in the form of pragmas. And the

49:29 only difference at the highest level is it says "acc" instead of "omp", so just telling,

49:37 in this case, that an OpenACC compiler recognizes this is something that it

49:43 should treat as a directive, figuring out how to generate proper code for

49:51 an attached device. And otherwise the directives, uh, point to something in the

50:02 vocabulary that some of you may be used to if you have done GPU programming.

50:17 And it's a bit unfortunate, again, to me, that the terminology is different, but it is what it is, and I guess you should learn it, so I will use it.

50:22 So in terms of the terminology, OpenACC talks about workers, vectors, and gangs, and it'

50:34 s shown on top. I don't, uh, remember exactly

50:42 how much of this is also carried over to the OpenMP 5.0 version, but

50:49 I think it might be. The structure of these things is really a reflection of the structure

50:55 of GPUs. So that's the same reason why I showed, uh, the structure

51:03 way back in, I guess, the lecture where I talked about GPUs, but also now

51:08 in this case. So GPUs are put together as a replication of units, in a hierarchical

51:20 way, I would say. At the top level there are these, I think graphics

51:28 processing clusters is what NVIDIA, that this picture comes from, calls them. And

51:34 in this case, for the current generation, I think there are six of these GPCs

51:41 on a single piece of silicon. Inside each one of these graphics processing clusters are the

51:51 streaming multiprocessors. They have memory access. And in the current version, uh

51:59 , all the streaming multiprocessors in each one of, or actually several

52:09 of them, I should say, uh, also in each, I don't

52:19 , uh, I don't remember exactly, ah, there's a bunch of them in

52:26 each of these GPCs. And then inside each of these

52:30 , uh, streaming multiprocessors, are the so-called CUDA cores, as they stand out. A

52:37 CUDA core is kind of the processing unit. In principle, it is the corresponding thing

52:48 to, you know, an x86 core in an Intel or AMD CPU, but as I

52:57 pointed out, they started out, and maybe they still are, these cores, much, much simpler

53:08 than an x86 core. But that also means that the footprint in silicon is

53:15 a lot smaller, and that's why you can get so many of them, thousands

53:19 of them, on a single piece of silicon. Yeah, so in the management of

53:29 parallelism, one needs to be aware of this structure; it is actually reflected in the

53:37 programming. So a worker is something that is assigned to a single core, a CUDA core

53:50 , when it comes to NVIDIA. And this is a similar thing: when it

53:55 comes to, uh, NVIDIA GPUs, there is basically a single thread per core.

54:02 There's no multithreading in CUDA cores, of course: a single thread per core. Now, a

54:10 number of these cores, um, sit in each one of these streaming multi

54:20 processors. And typically there are 32 of these cores in each one of these streaming

54:26 multiprocessors, and this is where the vectorization and SIMD features come into

54:36 the scene. If they do anything at the same time, they all do the

54:43 same instruction on different data. So the SIMD feature happens among these CUDA cores

54:54 in one of these streaming multiprocessors. So that's what the notion of the vector

55:00 is there, in principle. So you can think of the, the worker as a

55:12 thread in the OpenMP vocabulary, the way we have thought about this before, and the

55:18 vector is kind of a similar notion, though it's not quite the same thing:

55:26 um, the vector feature in OpenMP exists within a single thread. In

55:36 this case, it means you get kind of multiple threads pulled together, sort of

55:41 one thread per CUDA core, as a vector. And then there is the notion of a gang

55:52 , and, well, a member of the gang is mapped to a

56:00 streaming multiprocessor. So these concepts are critical to keep in mind for how the

56:10 parallelism is supposed to work, and the point of this slide is to realize that unless you

56:18 can use the vector feature, you lose a lot of the processing power:

56:25 instead of using 32, 32 threads that is, of course, you may only be able

56:32 to use a single one. All right, any questions on that?
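
(To make the mapping concrete, here is a small sketch of my own, not from the slides, of how these levels can be spelled out in OpenACC. The array sizes and the vector length of 32, matching the CUDA cores of one streaming multiprocessor, are illustrative choices, not tuned values.)

#define N 2048
#define M 2048

/* gang   -> mapped to a streaming multiprocessor
   vector -> lanes executing the same instruction in lockstep */
void add(const float a[N][M], const float b[N][M], float c[N][M])
{
    #pragma acc parallel loop gang vector_length(32)
    for (int i = 0; i < N; i++) {
        #pragma acc loop vector
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];  /* 32 lanes, same instruction */
    }
}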

56:42 We'll see examples of this coming up. So, all right. So this construct is pretty

56:56 much identical to the OpenMP parallel directive. It does the same thing,

57:04 but now we call them gangs instead of threads. But each member of the gang will

57:12 execute identical code. So it's redundant execution. So here is the classic example with

57:21 the for loop. That's what it looks like. And I think we had the same

57:25 type of example when I talked about OpenMP. That's probably not

57:30 the situation you want to be in. So, like with OpenMP,

57:35 you want some work sharing, so that you can get more parallelism out of your

57:43 code, and basically divide up the, yeah, the work in the for loop.

57:49 In that case, instead of "for", OpenACC uses the name "loop". So

57:58 that's the way that you get work sharing in a for loop. So at this point it's

58:06 kind of pretty much just the same as in OpenMP.
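
(A side-by-side sketch, my own rather than the slide's, of the same work-shared loop in both notations. Without the "for" or "loop" clause, each thread or gang would redundantly run the whole loop.)

void saxpy_omp(int n, float s, const float *x, float *y)
{
    /* OpenMP on the host: 'parallel' makes threads, 'for' shares iterations */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}

void saxpy_acc(int n, float s, const float *x, float *y)
{
    /* OpenACC: 'parallel' makes gangs, 'loop' shares the iterations */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}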

58:15 All right. So it also, again, like OpenMP: immediately following the directive

58:22 , uh, there's work sharing, so to speak. So this is kind of

58:28 what happens from now on. You can, uh, make more use of workload

58:39 sharing, so to speak, within a region. Uh huh. It takes a,

58:45 you know, certain type of code in order for the compiler, of course

58:49 , to actually be able to support that in a good way. The other thing to

58:57 point out here is that, uh, one has to be a bit careful and basically use

59:03 the parallel loop construct separately for each loop that is meant to be parallelized. And here,

59:12 now, we'll talk about a few of these clauses that can be used together with

59:17 parallel constructs. You can manage both, uh, things in terms of data sharing clauses: you

59:28 can manage the number of members of gangs, the number of workers, the

59:37 vector length. You can specify the device type, and so on. And also, in terms

59:45 of the data management, you recognize many things from OpenMP in terms of copy

59:51 in, copy out, and also private, firstprivate, etcetera. So I'll say

59:58 a little bit about this, but not all that much, um, in part

60:05 because they are very similar to OpenMP. Um, I don't want to go

60:14 too much beyond what we did for OpenMP. In OpenACC there is also, and

60:24 I think the parallel construct is a little bit of OpenACC's effort to be

60:31 , um, similar to OpenMP, but they started out with having one more

60:37 construct, another construct, that they call the kernels construct. I will not talk

60:42 about it too much today, but more next time. But that scheme gives a lot

60:48 more freedom to the compiler to restructure code and try to optimize things for the device

60:56 . So, at least, more freedom to the compiler. This is in the descriptive spirit

61:03 of OpenACC, versus the prescriptive spirit of OpenMP.
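
(A small sketch of my own contrasting the two: with "parallel loop" the programmer asserts the loop is safe to parallelize; with "kernels" the compiler analyzes the region and decides what to parallelize and how.)

void two_styles(int n, float *restrict a, float *restrict b)
{
    #pragma acc parallel loop        /* prescriptive: programmer asserts    */
    for (int i = 0; i < n; i++)      /* the iterations are independent      */
        a[i] = 2.0f * b[i];

    #pragma acc kernels              /* descriptive: compiler is free to    */
    {                                /* decide how, or whether, to          */
        for (int i = 0; i < n; i++)  /* parallelize what is in the region   */
            b[i] = a[i] + 1.0f;
    }
}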

61:10 So here is a little bit on the policies: um, that is, um, in terms of the

61:22 parallel construct, it is similar to, again, the OpenMP one: things remain fixed.

61:29 Once things are set up for, for an instance of a parallel region, the

61:35 number of gangs and workers and vector lengths for the region doesn't dynamically change.

61:43 Um, whereas in the kernels construct, things are a lot more flexible and more

61:51 can happen. So, um, now let's talk about one clause, and

62:02 then I'll talk about, um, an example. The only clause I'll

62:09 mention is the reduction clause, because I use it in the example I'll show you

62:14 . So that's also supported in OpenACC. And that means that the

62:18 compiler, like in OpenMP, generates the code to make sure that things happen correctly

62:25 in terms of the reduction. And that's pretty much the take-home message from this

62:31 slide. Um, and these reduction operators are the common ones, supported in

62:38 OpenMP and most programming models.
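
(A minimal sketch of the clause, my own example: the reduction tells the compiler to combine the partial results from all gangs, workers, and vector lanes into one value safely.)

#include <math.h>

float max_abs(int n, const float *v)
{
    float m = 0.0f;
    /* each gang/worker keeps a private partial max; the compiler
       generates the code that merges them into the single result */
    #pragma acc parallel loop reduction(max:m)
    for (int i = 0; i < n; i++)
        m = fmaxf(m, fabsf(v[i]));
    return m;
}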

62:47 So now, a little bit of getting to the example. And I will do the example showing a little bit of, uh

62:54 , a couple of the compiler flags and then what the consequences are of using them,

62:59 maybe, for a simple example. So the OpenACC compiler has this -fast flag that basically

63:10 encourages the compiler to do whatever it can do to try to optimize the code.

63:16 So that's, I think, what will be used in the examples I show

63:22 . There's another flag that allows you to get information about what the compiler has

63:29 done to the code. And, uh, there are different, you know

63:36 , options for the flag. And, I mean, it will tell

63:42 you all the changes that it did, or certain optimizations, or things just focused on the

63:48 , um, accelerator. Mm. Then, as I mentioned, because

64:00 code needs to be generated for both host and accelerator, you can use an

64:07 OpenACC compiler, like the one we'll use, to generate code for the

64:14 host only. And then you use the target, uh, the -ta

64:21 flag, with the multicore attribute for the target flag. Or you can use,

64:30 uh, the attribute tesla, and use that for the GPU, because Tesla is

64:38 one of the product lines for NVIDIA GPUs. And on the same, uh, I said

64:43 that, um, as part of the attributes for the flag, you can then add

64:50 another attribute, "managed", and that tells the compiler that it should kind of manage the memory

64:58 for you. And I'll show you how that works in the next example.
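
(To make this concrete, here is roughly what those invocations look like with the PGI compiler, now part of the NVIDIA HPC SDK, which this description matches; the exact spellings can differ by compiler version, so treat these as illustrative.)

pgcc -fast -Minfo=accel -ta=multicore jacobi.c -o jacobi_cpu

pgcc -fast -Minfo=accel -ta=tesla:managed jacobi.c -o jacobi_gpu

The first line targets the multicore host only; the second generates NVIDIA GPU code and, with the managed attribute, lets the compiler and runtime handle the CPU-GPU data movement. -Minfo=accel is the flag that prints what the compiler did.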

65:09 Maybe I'll do a little bit of this example and then take some questions, if

65:15 somebody wants to ask a question. So this, again: matrix-matrix

65:27 multiplication... matrix-matrix multiplication and Jacobi's method, uh, as an

65:32 iterative solver, are canonical examples that are used by compiler people and HPC

65:40 people very, very often. So you've seen this before, and you'll

65:45 see it again, I'm sure, before the course is over. So, what we have here

65:55 is the solver applied to the Laplace equation, basically a relaxation scheme where, in

66:05 this case, you use the blue points, basically the average of the values at

66:10 the blue points, to get the value at the red point in the center of them.

66:15 Uh, this is just on a square grid. So now, this is then the sequential code, uh

66:29 , for this Jacobi iteration type scheme, what it is: it's two loops, in

66:40 the upper half of this slide, going through all the points in a

66:47 kind of 2D traversal, uh, of grid point values, in the, call

66:57 it, x and y, or i and j, directions. So the statement in the loop

67:02 is just doing the averaging for each one of the points. And then you need

67:08 to figure out what, um, the error is, and eventually you want

67:14 things to converge. So you compute the average of the points to get the new

67:23 point, that is Anew, and then you figure out what the error is. And

67:30 then, from the first two loops, you figure out what the maximum error is,

67:35 uh, anywhere across the grid. And then, um, what... once you have done

67:47 that... because your Jacobi iteration kind of works in a very structured, or synchronous,

67:56 way: you have to use all the old points, um, before you update

68:03 and use any new points. So that's what you see. It doesn't quite show

68:09 in the equation at the top, but a Jacobi iteration basically evaluates all red points

68:16 before you go to the next iteration and make them blue. So once you are done

68:25 computing all the red points, the points updated, then you basically make them

68:33 blue. And then, as long as the error is not sufficiently small, or you

68:43 haven't gotten tired of iterating and reached the maximum iteration count, you keep looping. So that's

68:48 the outer while loop. So this is the way the sequential code works.
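
(For reference, here is a sketch of that sequential code, following the canonical Jacobi/Laplace example; the names A, Anew, err, tol, and iter_max are the usual ones in that example, reconstructed here rather than copied from the slide.)

while (err > tol && iter < iter_max) {
    err = 0.0;
    /* first pair of loops: average the four neighbors, track the max error */
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
    /* second pair of loops: only after all new values exist do they
       replace the old ones (the red points "become blue") */
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            A[j][i] = Anew[j][i];
    iter++;
}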

68:57 And now, trying to use the GPU for doing this business: as you remember, execution

69:04 kind of starts and ends on the CPU. So you have to move

69:10 code and data to the GPU, and the idea in this case is you do

69:15 all the computations on the GPU. When it's all said and done, then the

69:20 result will be moved back to the CPU memory. So now, using OpenA

69:31 CC to try to get this job done... mm hmm. In this case,

69:37 there's a parallelization of, um, the outer loop of each of the two loop nests: the

69:46 first loop nest updates, so to speak, all the red points. So,

69:51 and then the next one is making the red points blue points. And in

69:56 this case, to guarantee correctness... there was a similar

70:04 example for OpenMP, and it's basically the same thing: you have a bunch

70:10 of independent iterations, ah, doing workload sharing for the outer for loop.

70:18 And in order to make sure that the reduction happens correctly, you can use the

70:25 reduction clause and have the compiler generate the proper instructions to make sure that the

70:32 global max error, yeah, is properly computed.
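
(A sketch of the OpenACC version just described, again following the canonical example rather than the exact slide: one parallel loop with a max reduction for the averaging, and a second parallel loop for the copy-back.)

while (err > tol && iter < iter_max) {
    err = 0.0;
    /* work-share the outer loop across gangs; combine the
       per-iteration errors safely with the reduction clause */
    #pragma acc parallel loop reduction(max:err)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
    /* a separate parallel loop construct for the second loop nest */
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            A[j][i] = Anew[j][i];
    iter++;
}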

70:44 So, in this code, now we're supposed to compile it, right? So the first effort here is,

70:50 yes, to generate code for the CPU. In this case, it shows, uh,

71:01 the -fast flag and also the, uh, info flag, so the compiler is free to do

71:07 whatever it can do to optimize the code the most for the CPU, and to

71:11 print some information about what happened in terms of acceleration. But it

71:19 also says what can be accelerated: things were, uh, told to be

71:26 generated for the CPU, and it basically says that, you know, there are

71:32 things that are unknown in terms of the GPU side. Yeah, so this is

71:37 not particularly interesting, but it just shows what you can get. I will show

71:43 a couple of more slides and then take questions. So here's now one example.

71:50 So yes, um, the generated code was executed, in this case,

71:58 on, uh, an Intel Xeon, and this is the model and all that

72:03 . So it got about three times speedup on the 10-core CPU,

72:10 which is not too impressive as far as I'm concerned. But it did, at

72:16 least, manage to get some parallelism on some of the cores. And

72:23 so again, there's the caveat: one always has to be a bit careful.

72:29 Again, um, NVIDIA's compilers, one would think, uh, we have to assume

72:36 they do their best for their own devices. And it is interesting to

72:42 read the literature, the papers: when things are published by Intel people using

72:48 NVIDIA GPUs, versus NVIDIA people using Intel CPUs, how the claims differ

72:55 in terms of, uh, how the different kinds of devices perform. Um, I'll,

73:02 well, talk a little bit about the one where I decided to instead generate code for

73:10 , uh, the NVIDIA GPUs and let the compiler pretty much figure everything out. So here

73:19 , in this case, if you look at the left column on the slide, it tells you what

73:25 the compiler did. In this case it used these, uh... so the gangs: it basically generated

73:33 code for independent streaming multiprocessors. Remember, a member of a gang is

73:41 assigned to each streaming multiprocessor, so it tried to use, uh,

73:48 several, or all, of the streaming units. And the other thing, uh, it did was

73:55 manage the data traffic. And so, again, things start and finish on the

74:03 CPU. So it allocates memory on the GPU for the variables, or arrays, that

74:11 you need. And it also initializes them, so for A here the values are copied

74:18 from the CPU memory to the GPU, hence the copyin. And I'll talk a

74:23 little bit more about these, uh, clauses that, um, follow.

74:30 Then it also, uh, returns the values of A to the CPU. And

74:41 then, uh, on the second loop, yeah... and then, I guess it

74:47 doesn't say so explicitly, but it does allocate memory for Anew. But Anew

74:54 is just used in the computations that are now on the GPU, so there

75:01 is no need to transfer Anew between the CPU and the GPU.

75:14 So now, also, what happened? So, in this case, of course, uh, by

75:24 using OpenACC and letting the compiler take care of everything, you got 37 times

75:32 speedup compared to the single core, and a little bit more than 10,

75:38 12 times speedup compared to, um, the ten-core CPU. Now, of course

75:54 . Um, one can certainly be pleased in terms of the speedup, in my opinion

76:03 . Um, but I also want to encourage you to look a little bit, a

76:07 step beyond it. And that's why I put the remarks at the bottom of

76:13 the slide, which show that, relatively speaking, the efficiency, or the fraction of peak,

76:24 that they managed to get on the CPU is actually higher than the fraction of peak

76:29 they got on their own device. You understand? So, there, one has to be a little

76:34 cautious. Then, I think this is, uh, time to take questions before moving on.

76:42 Now, um, I know this one is not easy to remember, but

76:46 maybe this slide makes it easy to ask questions. Uh, I'll leave

76:52 this one up for a little bit while I continue, uh... So, I

77:09 guess I'm almost done, but I will, maybe, show a couple of more slides

77:17 , just as an intro, I guess, to the next lecture. So it's a bit

77:26 , I would say... you get mixed results. Many times the compiler does a very good

77:31 job in terms of taking care of everything. But, as always, you as

77:40 the programmer, or the one that knows the application and the data, may be able to

77:45 do a better job than the compiler that has to infer everything from the code

77:54 . So the next variation of this code I'm going to show you is if you

78:00 , as a user, try to manage the data transfers, in particular, yourself.

78:12 But before that, I will talk about the notion of unified memory. So this

78:20 is, uh, the idea of the unified memory, and I think it's

78:25 shown on this slide. Um, as I said, there are physically separate memories, and there

78:34 are, you know, three different data paths between them. Um,

78:41 the CPU memory has the memory bus to the CPU, and the GPU memory has

78:47 also a memory bus to the GPU. And between the two devices there is

78:54 the PCI Express bus. So it's exceedingly NUMA, if you like. It

79:02 is non-uniform memory access because of, uh, the highly different capabilities of these

79:10 paths involved in moving data. The unified memory notion is that you can treat it

79:19 kind of as one address space. Just like in NUMA there is one address space for

79:26 the shared memory in the node, but it is by no means uniform in access time

79:34 to all of it. So this is the notion of unified memory, which exists also in Open

79:41 MP 4.5 or later. So, um, so this is now what the compiler

79:52 used when you told it to manage memory for you: it uses this notion of unified memory.
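
(As a short sketch of what that buys you, assuming the managed-memory build shown earlier: with -ta=tesla:managed, a plain host allocation is usable on both sides, and the runtime migrates the data over the bus on demand, so no explicit data clauses are needed.)

#include <stdlib.h>

/* plain host allocation; no acc data clauses below */
double *a = malloc(n * sizeof(double));

#pragma acc parallel loop   /* runtime pages a[] over to the GPU as touched */
for (int i = 0; i < n; i++)
    a[i] = 0.0;
/* reading a[] on the CPU afterwards migrates it back transparently */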

79:58 And I guess at that point my time is up, so

80:04 we'll continue with this example next time, and I'll take some questions if you have

80:10 any. Okay, let me stop that screen share and see if there are questions.

80:39 So, so far, I have, I guess, mostly talked about... there's lots of

80:42 similarities, but the underlying hardware structure is, uh, visible, and becomes increasingly more

80:54 visible the more you try to optimize. Okay, there are no questions. I will

81:13 start with the first one in the region, the...
