00:00 Yeah. Okay, so, last time I had just gotten started talking

00:09 about heterogeneous computing on heterogeneous nodes. So I'll pick up where I left off

00:21 last lecture, and let's see how this goes. So here is a little bit of a

00:29 layout of the time for today. The first couple of points here are just a

00:36 recap of the last few comments of that lecture, and then I'll talk a little

00:41 more about the programming aspects of heterogeneous nodes, in particular about something called OpenA

00:49 CC that some of you may be familiar with. It's kind of in the same spirit as

00:54 OpenMP, and I'll try to point out a little bit the differences and commonalities, and

01:04 why we use OpenACC for anything that is, uh, heterogeneous.

01:13 And for this class, I would say it is, uh, the

01:19 case, as you have seen it so far in terms of Stampede, for

01:25 instance, that there is some attached processor, maybe a GPU or an FPGA or some other

01:32 device. Uh, but for this class, uh, it will be GPUs.

01:40 So here's the point I tried to make towards the end of last lecture: this

01:45 is kind of the node architecture at a high level, if you like, that

01:53 will be the basis for the next assignment. And that's typical in terms of

02:01 what you may find in lots of nodes these days when you use GPUs

02:08 or potentially even FPGAs or some other device. So, as I pointed out,

02:16 the main difference, or differences, are that there tend to be, now, two

02:24 memory spaces and two instruction sets. So we'll come back to that in

02:33 the lecture, and then the next slide is examples, very quickly, in case you

02:41 haven't, um, come across them except using them through some web interface.

02:48 But here's what a kind of GPU module may look like, and they

02:58 attach or connect to this bus, the PCI Express bus, and

03:06 you can see in this picture the, sort of, golden type pins at the

03:10 edge of the card. That's what plugs into this bus, and you

03:15 can also kind of see in the left-hand corner here that GPUs tend

03:20 to be power hungry, and it takes a lot of cooling. So a lot

03:26 of what you see is, in fact, the fans. And in addition to

03:32 the GPUs, um, we're going to use the Bridges computer this semester,

03:41 but we also have them at the Science Institute. Jovic, and I think

03:48 punch, has GPUs on it. Many of the cloud providers, Amazon

03:57 and Azure from Microsoft, many of them do have GPUs in some

04:02 of their nodes, so they are commonly accessible, in many shapes and forms.

04:10 Then, um, we have FPGAs sort of beginning to show up and

04:17 become a little bit more common. Tools for programming FPGAs have been improving

04:24 quite a bit over the years, and they offer some benefits as kind of a

04:30 compromise between a fully custom piece of silicon and a standard CPU. And you can

04:40 get them on PCI Express cards, or you can do what Microsoft did, and

04:47 Alibaba, the other one, the big Chinese cloud and, uh, Internet company:

04:55 they also use FPGAs for their search engines and some of their other functions,

05:01 and in terms of Microsoft, also through their cloud service. We will

05:08 not use FPGAs in this course, uh, but we should be aware:

05:13 that's another element in terms of heterogeneous devices that are now rather readily available,

05:19 through cloud services. And then the last example I have in terms of the variety

05:25 of accelerators is, so those of you who are interested in machine learning: Google

05:33 designed their own TPU, a tensor processing unit, that is then used to support things

05:42 like TensorFlow. And these days, they don't sell these units, but they do offer

05:48 access to them in terms of their cloud platform, and it has the benefits Google

05:54 claimed over GPUs. That's why they did it, so that's in a custom piece

05:59 of silicon, um, and they're on their third generation, to date.

06:04 But it means, basically, programming heterogeneous systems. And then I want to just, uh,

06:13 briefly point out that in terms of embedded computing, this has been the norm for

06:18 a long time. In fact, in that case you don't use connectivity over

06:24 an I/O bus. Everything is on the same piece of silicon, and that is

06:28 a big difference in terms of both performance aspects and programming aspects, which

06:35 we will not have time to review in this course. But here is one example from Texas

06:40 Instruments that, um, for a while also did chips for mobile phones,

06:47 but they stayed, they're more now into, uh, digital signal processors. But

06:55 there's usually, you know, one kind of CPU core. There's the low-power design

06:59 that was produced by Arm, which you may have seen a lot of headlines about in recent

07:04 years, also in terms of high performance computing, and that NVIDIA is in the process

07:11 of buying. But it shows that they have done a number of different functional units

07:17 that need to be programmed. There's also one from Qualcomm in terms of their mobile

07:23 phone chip designs that has a number of processing engines on the same piece

07:28 of silicon. And here is another one that's from Intel, an older one

07:33 from before they stopped doing mobile, and, as you can see towards the

07:37 lower right-hand corner in the picture, it says GPUs are included together with a CPU

07:44 core set, of course. And by the way, that may not be something you

07:51 necessarily think of, but Intel is in fact the largest producer of GPUs.

07:58 But as of yet, they do not have a kind of discrete component; their

08:06 units are all integrated on a piece of silicon. But at about this time they

08:13 claim that they will release their first, well, now, stand-alone GPU, to be bought

08:19 separately and to be connected over an I/O bus. So they decided to seriously step

08:30 up the competition with NVIDIA and AMD in terms of having discrete GPUs. And here

08:39 is another one, by AMD, one of their integrated, um, GPUs on the

08:46 same piece of silicon. These were just examples, so to say that,

08:50 uh, GPUs also exist as integrated on the same piece of silicon, but that's not

08:59 something I will cover in this course: how to deal with those and how to

09:03 program them. Any questions on this in general, kind of, of where you find

09:12 accelerators of some flavor, mostly GPUs? [Student] In terms of communicating the work to be

09:26 done by the accelerator, what are the implications of having an attached versus an integrated one?

09:35 [Instructor] It's a huge difference. Uh, I'll come back to it, but very quickly at

09:42 this point in the lecture. So the ones that are integrated, AMD used to call

09:50 them APUs, application processing units, with the graphics processing and CPUs and

09:59 other accelerators on the same piece of silicon. And so the biggest difference, I

10:09 would say, and I will come back to that, is that when it's integrated on

10:13 the same piece of silicon, the different devices tend to have access,

10:25 equal access even, to the same memory, which is not true when it comes

10:31 to the attached processors or accelerators. Uh, and it also means that the data

10:40 paths between CPUs and accelerators are shared. So even though the

10:51 instruction sets are different for the different devices, it's a lot more homogeneous in terms

10:59 of the kind of silicon infrastructure that is being used by the different computational units,

11:06 as opposed to when you have attached devices. So it affects how the programming is

11:13 done and the tools being used, as well as the performance, and I'll try to point

11:22 that out as we go here in the next few slides. So this is,

11:30 I think, the last slide I showed last time. It just tries to point

11:36 out that in terms of parallelism, or the number of threads that can be supported, I

11:50 guess between CPUs and GPUs there is a huge difference. So typically about two orders of

11:57 magnitude, um, difference. And, for that, the important

12:06 part, I would say, is this, and I'll come back to it,

12:09 too: it is that to get full, or high, utilization of the streaming processors that GPUs

12:19 use, and that will be the focus for the rest of the lecture, you really need

12:24 to have your application capable of exploiting SIMD, uh, instructions. And, as you can

12:38 see, if you look at the last column here, basically there is not a

12:43 huge difference in terms of peak performance between CPUs and GPUs.

12:53 Yes, a factor of five is by no means nothing, but it's

12:56 not orders of magnitude. And whether you actually get this factor of five or not

13:02 , it's highly dependent on whether you can actually get vectorization, or SIMD, to work for

13:09 your application. And also, in this case, the GPUs that I put on

13:16 the slides are the ones that are designed specifically, I would say, to compete

13:24 with server CPUs. So there are other GPUs that may be more

13:33 focused on supporting machine learning, and in that case, they may still have

13:40 the single precision capability that is shown on this slide, but typically their double precision

13:46 performance is way lower. So that's something to keep in mind about the

13:54 nature of the devices and what it takes to get good utilization of them in

14:00 terms of the nature of the application and the code being generated. So here

14:09 we are coming, a little bit, I think, to help answer the question that was

14:13 asked. But I think this little picture on the right of the slide

14:19 tries to illustrate some of it, and the text kind of makes it kind of

14:27 concrete. So the attached GPUs, and it's true for the integrated GPUs too, they are

14:40 not complete processors, so they all need a host. A CPU, so to

14:46 speak, is totally standalone. It doesn't need anything else: it has everything.

14:52 It has all the instruction decoding as well as memory. It does everything needed to

14:59 execute code. GPUs have much more limited capability, um, in terms of flexibility in

15:13 dealing with code. So that's why they basically need the CPU.

15:18 So that's the one thing that, uh, it's important to keep in mind. Then,

15:27 as when I talked in, in the last lecture, talked a little bit

15:32 about GPUs, and I talked about cores, and so far most of it has been focused

15:38 on CPUs and their cores, and as it said in the previous slides, you

15:44 maybe have up to a few tens of cores, of course, on the piece of silicon

15:50 that is the CPU, whereas in terms of cores, when it comes to

15:56 GPUs, they tend to be in the thousands. So again, the level of

16:03 parallelism you have, and that you can exploit, is, you know,

16:10 up to two orders of magnitude higher. And one of the

16:17 advantages of GPUs has been that they have been, typically, ever since they, uh

16:29 , first started to appear as cards you could put on an I/O

16:33 bus, they had 5 to 10 times higher memory bandwidth than what, specifically, CPUs had.

16:44 So one of the big advantages for GPUs has been how much memory bandwidth you

16:50 get. On the other hand, coming back to the question that

16:55 was just asked: if you have an integrated GPU, it uses the same memory.

17:01 That means, yeah, it doesn't have the advantage of significantly higher memory bandwidth.

17:07 So it just has, its bandwidth is the same as that for the CPU.

17:17 The other difference that is important to keep in mind is that GPU memory

17:25 tends to be a lot smaller than the memory on a CPU. So today, the kind

17:36 of high-end GPUs may have up to 32 gigabytes. Um, whereas,

17:41 on the other hand, CPUs may have terabytes of

17:43 memory. It's not the typical case, but there is nothing that prevents you from configuring a

17:51 node with terabytes of memory. And some of the nodes on, let's say, supercomputers,

17:59 some of the richest nodes, they have terabytes of memory. So, and then the thing

18:08 that connects these is this I/O bus. And last time I talked a little bit

18:12 about the PCI Express bus and showed that it is kind of a very thin pipe

18:19 compared to the memory buses, even for the CPU, as well as, even more

18:26 so, compared to the GPU's. So here is kind of the model of

18:36 how heterogeneous nodes, in terms of the attached processors, work. So basically,

18:48 things start and end on the CPU, and in order to get anything done,

18:53 one has to... the application code, and some initial data,

19:01 it starts out on the CPU, in the CPU memory, and it needs

19:05 to be moved over to the GPU; that's what we call, for

19:12 this class, the device. Then you have to move the code over, and then it can

19:19 start the execution, and then the two typically can proceed asynchronously, the kernel on the

19:26 device and whatever it is the CPU may want to do. And at some

19:31 point, the results are supposed to come back to the CPU. Now, in general,

19:40 since the GPU memory is significantly smaller than the CPU memory, it is often

19:48 not possible to move all of the application data over to the GPU before execution starts

19:57 , but it actually has to be done in phases, where things get moved over and

20:03 maybe come back to the CPU. So there's maybe a fair amount of interaction between

20:10 the CPU and GPU during the execution of the kernel, in order to be able

20:16 to process the entire data set. And this is just reemphasizing, uh, how it kind

20:26 of works. Things start on the left-hand side and get moved over to the

20:30 GPU, possibly in phases, and the PCI Express bus may have a

20:38 severe performance impact. In fact, it depends on how much computation you can do in

20:44 the GPU per transfer of data between the CPU and the GPU. And one has to watch

20:52 out, when you read the literature, whether people are actually telling you the full story in terms

21:00 of reduction in compute time, or speedup: whether it just looks at the GPU execution

21:09 by itself, or whether they actually include the bus transfers on the I/O bus to

21:15 get the total-time speedup. So one needs to be careful: if

21:22 there is not much computation, code that even in itself may speed up

21:28 , uh, by a large factor may totally get killed, sort of, by the

21:35 slow transfers between the host and the device. Ah, any questions on that general picture, or on understanding how the structure works and the trade-offs?
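
(To make that flow concrete, here is a minimal sketch of the offload pattern in OpenACC, which is introduced later in this lecture. The function and array names are made up for illustration; the data clauses mark the explicit transfer phases over the I/O bus.)

/* Sketch: data moves host -> device, the kernel runs on the device,
   and only the result moves back. If the computation per byte moved
   is this small, the bus transfers can dominate the total time. */
void scale(const double *a, double *b, int n, double s)
{
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop   /* executes on the GPU */
        for (int i = 0; i < n; i++)
            b[i] = s * a[i];
    }
    /* here b is back in CPU memory */
}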

21:48 [Student] So the arrow that points to the right says "offload". That's the,

21:56 ah, the instructions, or, I guess the way

22:03 you were phrasing it earlier, the code that's

22:07 transferred to the GPU, right? And, of course, the other

22:12 one, the one that just transfers data, is, uh, the data on which those instructions will

22:20 operate? Right? Um, so are they using the same PCIe

22:25 lanes to do that? [Instructor] Yes. So it's, um, it uses the

22:36 same PCI Express lanes. There is no difference in terms of lanes being used for

22:45 code and lanes being used for data. But it's a good question. So

22:54 this PCI Express bus is maybe often 16 lanes wide, and you use

23:02 all 16 lanes, both for code and data. Sometimes there may be two, uh,

23:09 but the common case is, depending on what the device is that's being attached, four

23:17 or 16 lanes wide. So here's a little bit on programming, because again, each

23:27 device, the GPU or FPGA, and for that matter the TPU, has

23:33 its own instruction set. So, as I said early on, there are different instruction sets

23:41 , as many of you will know. And maybe some of you have

23:46 used CUDA for programming NVIDIA GPUs. CUDA is, however, proprietary,

23:54 uh, so it doesn't work on competing GPUs. Um, that's why I have

24:02 stayed away from using it in this course. OpenCL is an open standard

24:09 that, uh, is supported, in principle at least, by several vendors. It

24:18 was initially driven by Apple and AMD, and had quite a few vendors,

24:26 uh, buying into the thing, including Intel and NVIDIA. But in the case

24:34 of NVIDIA, they focus on CUDA, and OpenCL is a bit of a

24:39 stepchild. I would say Intel has been a little bit more forthcoming in terms

24:47 of trying to support the construction of good compilers for OpenCL, and

24:54 AMD has also been, um, a good supporter of OpenCL, but it

24:59 hasn't had the financial resources of NVIDIA or Intel, so some of it hasn't happened.

25:05 So OpenCL has kind of been... it has improved over the years, but, um

25:12 , it's still a little bit of an issue to use it. And that's why,

25:17 for this class, I decided not to use it, in part because of the availability of tools

25:22 for it. OpenACC, that I will focus on for the rest of

25:27 this class, is something we'll use, and I'll give more background on why in

25:33 the next few slides. The other thing that one needs to pay attention to,

25:39 as that was the focus of the last lecture, is the ability of compilers to generate vector

25:49 code, or SIMD code. And that is in particular critical for GPUs because it'

25:54 s the basis for getting good performance, um, on GPUs. And OpenCL was again

26:04 designed to support generating code for GPUs in a good way. But, as

26:10 I said, its compilers have not had the level of sophistication of the compilers

26:20 for CUDA or OpenACC. And then we talked, you know,

26:28 for CPUs, about OpenMP as one of the main programming paradigms for them.

26:35 So, um, a little bit about OpenACC and, and OpenMP: how

26:46 this came about, and then the differences, a little bit, and then

26:51 I'll talk about OpenACC. Part of the purpose is that it would be, uh,

26:57 what you use for the next assignment. Look, so, a little bit of what I'd

27:08 say on where each one of these came from: OpenMP was started by the,

27:20 uh, whole community, a big, broader community. And it from the start was an

27:30 open standard, uh, with both academics and companies supporting the idea of OpenMP as

27:39 a way of programming multicore chips. And, as I said when I started to

27:48 talk about OpenMP, it was to try to create a simplified way, or layer, on top of,

27:55 say, POSIX threads, for instance, to make it a little bit more tractable to deal with

28:03 multithreaded systems. So it was very much focused on, again, how to make it

28:12 easy to use and inherit the properties, uh huh, of those types of multicore

28:21 chips. And that meant it was kind of designed as, like, say, a

28:27 prescriptive system: you tell, a lot, what you want the system or the compiler to

28:36 do. And now, over time, in addition to the many cores on, on

28:46 CPUs, also GPUs made it in, er, and accelerators, which are becoming quite common,

28:59 and, um, OpenACC was started by a few vendors, with NVIDIA

29:09 being one of them and Cray being another. And those were the two, I

29:18 guess, main companies that drove it. But they kept it proprietary,

29:24 because at the time Cray's computers, kind of the high end of high-end systems, and the ones

29:31 they had, had also started to use GPUs. And that was at the time when

29:37 they, they didn't have much competition in terms of GPUs used for engineering and scientific

29:45 computing. As I said, Intel is still the number one producer of GPUs,

29:50 but they're all integrated on a piece of silicon in terms of laptop or desktop

29:58 CPUs. And AMD, that has been a significant GPU manufacturer as well,

30:07 they focused on the gaming market more than NVIDIA. I would say both did well

30:12 in that market, and they were more or less, I would say, equals

30:17 there. But AMD did not use that base in terms of trying to build something for

30:23 the data center or scientific and engineering computing. So, so basically, OpenACC was

30:29 a separate, proprietary effort for a number of years, and I think after five or

30:35 six years, they, like many others, realized that proprietary is not necessarily a good

30:41 idea for widespread acceptance. So at this point, OpenACC is also

30:49 an open standard. But OpenACC, as I said, started with NVIDIA

30:55 being a strong driver. So that meant they tried to figure out how to make

31:00 programming, uh, the heterogeneous nodes with GPUs being, ah, a

31:10 little bit easier than using CUDA and programming the GPUs directly. So at the

31:18 very highest level, the idea, what's the same? They're using directives and trying to put

31:24 their layers on top of it, to make the programming of accelerators, or GPUs

31:32 , somewhat easier. But in the end, the starting point had a

31:38 difference from the OpenMP standard, that is, a different notion of threads, the

31:47 capabilities of the cores being significantly different: the GPU cores, I would

31:55 say, at that time were, compared to CPU cores, exceedingly simple

32:02 . So OpenACC started with massive parallelism and very simple threads in terms of the

32:11 capabilities of the cores, whereas OpenMP started at the other end.

32:17 But as it says on the slide, starting about five years back with OpenMP

32:26 4.0, OpenMP then started to also figure out how to extend the capabilities of Open

32:33 MP to deal with accelerators. So in the OpenMP, I

32:44 guess there's now a 5.0 standard, it has many of the features, uh, that Open

32:50 ACC has, and vice versa: OpenACC has many of

32:55 the features that OpenMP has. There is still a difference in the models, so

33:06 to speak, in that OpenMP tends to be more prescriptive. And the

33:13 idea, at least the argument from the, uh, OpenACC folks, is that

33:18 their approach is descriptive and leaves more room for compilers to figure out how to

33:26 generate good code. And I will try to show some examples of what capabilities the

33:33 OpenACC compilers have later in the lecture. So here's a little bit more

33:41 on the history. The two efforts were at points independent, um, and the

33:49 idea was in the community that both have their merits, and at some point,

33:58 uh, these two efforts should be kind of merged or integrated into one, single

34:07 , approach, if you like, with one set of compilers being capable of, or having the

34:12 best of both worlds. That didn't quite happen. Um, the OpenMP community, as

34:27 I said, was focused on multicore for a long time, until version 4

34:35 of OpenMP. And, as you know, Intel being the dominating player in

34:40 terms of CPUs, so they were kind of highly focused on

34:48 making OpenMP a good tool for use, um, of the cores on their CPUs.

34:55 And then they started to branch out, because today, Intel is also interested in

35:02 accelerated systems. So they, as some of you may know, they even have

35:09 chips that have integrated FPGAs on the same piece of silicon, or in

35:14 the same package, I should say, rather, as the CPUs, and they're about to

35:20 release stand-alone GPUs. Meanwhile, OpenACC: on top of the

35:32 two hardware manufacturers that are in the OpenACC consortium, there

35:36 was also a compiler software company, PGI, or the Portland Group, which was one of the

35:47 independent compiler companies. They built compilers for CPUs and

35:53 heterogeneous systems. Then, a few years back, they were acquired by NVIDIA. So PG

36:01 I, now owned by NVIDIA, is, I would say, really highly focused on making

36:09 sure that code compiled using their compiler runs really well on NVIDIA

36:18 GPUs. It means that it also has to work reasonably on CPUs, because the

36:26 attached versions of NVIDIA GPUs need a host, and most such

36:35 systems use Intel CPUs. So they generate code for CPUs as well. But

36:43 in the end, um, NVIDIA started out, and remains, focused on selling GPUs

36:51 , so that's where their focus is. So this kind of merge did not

36:57 happen. So, unfortunately, I would characterize things as a bit of a mess

37:09 in terms of programming attached processors today. So, um, AMD,

37:25 for us as end users, has had a pretty strong comeback from a couple of

37:37 serious downturns in their business, and is now competitive both in terms of CPUs and GPUs

37:45 , and across the entire range of CPUs and GPUs. Um, so, and they have sort

37:55 of not abandoned OpenCL, but they are now pushing an open source effort known

38:05 as the Radeon Open Compute, uh, initiative, and they have some buy-

38:15 in in terms of that, and are making progress in making things usable on both Intel and AMD

38:23 CPUs, as well as, um, their own GPUs. And the, uh,

38:34 the reason is clearly that they also, they want to make sure that code written

38:41 for NVIDIA GPUs is reasonably easy to port to their GPUs, just for

38:46 marketing reasons. Intel has, you know, been pushing the more recent versions of

38:55 OpenMP, to the point of being, in the end, able to generate code also for

39:04 , um, GPUs, uh, and recently they have started another initiative that I

39:12 may talk a little bit about later, but it's not out there yet in more

39:18 than an early version: something they call oneAPI, which is supposed to be

39:24 basically based on OpenMP, and the idea is to be able to use

39:33 kind of the same source code to target CPUs, GPUs, FPGAs, and

39:40 other accelerators. And as I said, uh, NVIDIA, they're kind of on

39:47 OpenACC that generates CUDA. IBM, they don't really have

39:56 too much of a stake in these wars; they have kind of been predominantly building

40:04 compilers for OpenMP. But because their high-end systems, as you have seen,

40:13 have been focused on using NVIDIA, uh, GPUs, so they kind of

40:21 do CUDA code, or map code onto, uh, the GPUs. And it also means that

40:31 sites like Pittsburgh that don't have IBM hardware, they don't have

40:39 necessarily good compilers, OpenMP compilers, that could support NVIDIA or other GPUs.

40:50 And then Cray, they now have also their own compilers, and their customers have

40:59 been using, uh, NVIDIA until now. Now some of their customers are actually starting

41:06 to use AMD GPUs, so we'll see how that kind of war plays out.

41:13 And then finally, GCC: they have so far basically been focused on Open

41:21 MP, the open standard, which has a little bit broader backing than

41:25 OpenACC, also open. So it's unfortunate, but in the end it all

41:34 depends on what platform you're on, what compilers are available for that particular platform

41:43 . In our case, we use Bridges, which has NVIDIA GPUs

41:51 and Intel CPUs. And so there, the best combination for Intel CPUs

42:02 and NVIDIA GPUs is the OpenACC compilers. So any questions on

42:14 that? So that's the reason why we ended up... not just, and why I

42:20 can continue to talk about OpenMP, too, but switch and introduce OpenA

42:27 CC. [Student] So it might be a bit off topic, but would you say that

42:34 the job market for compiler engineers is alive and well? [Instructor] I hope that is the

42:43 case, and I, I'm quite sure it is. Um, now, on your

42:52 question, there's a bit of history. Software, and in particular software in the

43:03 form of tools, and I will count compilers as a tool for the programmer,

43:10 has been a difficult business. As I said, PGI was an independent compiler company.

43:18 They were acquired by NVIDIA a few years ago; I don't remember exactly when. One

43:26 of the earlier compiler companies was Kuck and Associates, and they were an

43:36 independent compiler company that did quite well. Uh, and when Intel went

43:43 to multicore chips, they realized that the programming of these is a lot more

43:51 complicated, and "we need better tools". And then they started to, because they

43:57 had the money, build up a, uh, software suite, or software effort, and they

44:05 acquired this company, Kuck and Associates. So today, I don't know too many,

44:12 I don't know of practically any, independent compiler companies. So Intel has an effort, IBM has

44:21 efforts, NVIDIA has efforts, and GCC has efforts. And the complexity of, uh,

44:29 modern systems is increasing, and to the extent they have the money, they are

44:35 spending a lot on building very good tools, including better compilers. So, so, you know

44:47 , from my perspective, I hope some of you get into that business, because to

44:54 make, uh, the productivity of generating good code higher would benefit the entire community

45:09 . So, on to talk more specifically about OpenACC. It's like

45:16 OpenMP: the same idea, directive-based. So, and here is

45:23 basically the additional complication, as I said: compared to the original OpenMP, things are

45:32 quite different, not only in terms of the target, um, core capabilities being quite

45:44 different, uh, but there are two instruction sets and two memory spaces. So

45:52 even though, in principle at least, there are two kinds of devices, the CPU and the GPU

45:59 , one needs to generate code for two different things and figure out how to manage

46:06 these two memory spaces. So, right, uh, both when they

46:15 extended OpenMP, 4.5 and, uh, 5.0 now, there are then directives

46:22 to also tell compilers, like OpenACC does, what is supposed to be generated as

46:30 code for an attached device versus code for the host device. So in concert

46:39 with the compiler-generated code, the runtime system then, uh, manages

46:47 what gets executed where, and the data, that is, and code transfers. So

46:57 here is just a bit of the structure. This is taken from an IBM

47:02 presentation: they have the piece of code that is targeted for their POWER-type CPUs

47:11 , and then the tool generates intermediate code, and then they optimize that for their own

47:17 CPUs. And then, based on the directives that tell what's supposed to be, in

47:22 this case, something for NVIDIA GPUs, that part of the code gets handled

47:31 by, basically, NVIDIA tools, compilers, to optimize it for, uh, NVIDIA GPUs and then generate

47:41 code, a code for their GPUs. So it's kind of an integrated system, but it

47:47 kind of makes use of two different compilation processes and code generators in order to

47:54 eventually come up with pieces of code that get linked together. And then the runtime

47:59 system knows what's supposed to be executed where. And this is what, yes, okay

48:08 . So here's a little bit of what the claim is of OpenACC, and just

48:13 in case, as you may know, and I just mentioned it: POWER is

48:18 the name that IBM uses for their high-end processors. Sunway, if I

48:26 remember right, is one of the Chinese ones. x86: both Intel and AMD

48:35 support the x86 instruction set; it's not identical instructions, but the

48:41 core is the same. Um, now, I think this slide, as far

48:50 as I can tell, is only partially true, because once PGI

48:59 , that was again the compiler company behind OpenACC, was acquired by

49:03 NVIDIA, I think they stopped evolving code generation and optimization for AMD GPUs, and

49:14 then they entirely focused on NVIDIA. Um, the structure is very much the

49:22 same, uh, that again, directives in the form of pragmas. And the

49:29 only difference at the highest level is it says "acc" instead of "omp", so just telling,

49:37 in this case, that an OpenACC compiler recognizes this is something that it

49:43 should treat as a directive, figuring out how to generate proper code for

49:51 an attached device. And otherwise the directives, uh, point to something in the

50:02 vocabulary that some of you may be used to if you have done GPU programming.

50:17 And it's a bit unfortunate, again, to me, that the terminology is different, but it is what it is, and I guess you should learn it, so I will use it.

50:22 So in terms of the terminology, OpenACC talks about workers, vectors, and gangs, and it'

50:34 s shown on top. I don't, uh, remember exactly

50:42 how much of this is also carried over to the OpenMP 5.0 version, but

50:49 I think it might be. The structure of these things is really a reflection of the structure

50:55 of GPUs. So that's the same reason why I showed, uh, the structure

51:03 way back in, I guess, the lecture where I talked about GPUs, but also now

51:08 in this case. So GPUs are put together as a replication of units, in a hierarchical

51:20 way, I would say. At the top level there are these, I think graphics

51:28 processing clusters is what NVIDIA, that this picture comes from, calls them. And

51:34 in this case, for the current generation, I think there are six of these GPCs

51:41 on a single piece of silicon. Inside each one of these graphics processing clusters are the

51:51 streaming multiprocessors. They have memory access. And in the current version, uh

51:59 , all the streaming multiprocessors in each one of, or actually several

52:09 of them, I should say, uh, also in each, I don't

52:19 , uh, I don't remember exactly, ah, there's a bunch of them in

52:26 each of these GPCs. And then inside each of these

52:30 , uh, streaming multiprocessors, are the so-called CUDA cores, as they stand out. A

52:37 CUDA core is kind of the processing unit. In principle, it is the corresponding thing

52:48 to, you know, an x86 core in an Intel or AMD CPU, but as I

52:57 pointed out, they started out, and maybe they still are, these cores, much, much simpler

53:08 than an x86 core. But that also means that the footprint in silicon is

53:15 a lot smaller, and that's why you can get so many of them, thousands

53:19 of them, on a single piece of silicon. Yeah, so in the management of

53:29 parallelism, one needs to be aware of this structure; it is actually reflected in the

53:37 programming. So a worker is something that is assigned to a single core, a CUDA core

53:50 , when it comes to NVIDIA. And this is a similar thing: when it

53:55 comes to, uh, NVIDIA GPUs, there is basically a single thread per core.

54:02 There's no multithreading in CUDA cores, of course: a single thread per core. Now, a

54:10 number of these cores, um, sit in each one of these streaming multi

54:20 processors. And typically there are 32 of these cores in each one of these streaming

54:26 multiprocessors, and this is where the vectorization and SIMD features come into

54:36 the scene. If they do anything at the same time, they all do the

54:43 same instruction on different data. So the SIMD feature happens among these CUDA cores

54:54 in one of these streaming multiprocessors. So that's what the notion of the vector

55:00 is there, in principle. So you can think of the, the worker as a

55:12 thread in the OpenMP vocabulary, the way we have thought about this before, and the

55:18 vector is kind of a similar notion, though it's not quite the same thing:

55:26 um, the vector feature in OpenMP exists within a single thread. In

55:36 this case, it means you get kind of multiple threads pulled together, sort of

55:41 one thread per CUDA core, as a vector. And then there is the notion of a gang

55:52 , and, well, a member of the gang is mapped to a

56:00 streaming multiprocessor. So these concepts are critical to keep in mind for how the

56:10 parallelism is supposed to work, and the point of this slide is to realize that unless you

56:18 can use the vector feature, you lose a lot of the processing power:

56:25 instead of using 32, 32 threads that is, of course, you may only be able

56:32 to use a single one. All right, any questions on that?
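
(To make the mapping concrete, here is a small sketch of my own, not from the slides, of how these levels can be spelled out in OpenACC. The array sizes and the vector length of 32, matching the CUDA cores of one streaming multiprocessor, are illustrative choices, not tuned values.)

#define N 2048
#define M 2048

/* gang   -> mapped to a streaming multiprocessor
   vector -> lanes executing the same instruction in lockstep */
void add(const float a[N][M], const float b[N][M], float c[N][M])
{
    #pragma acc parallel loop gang vector_length(32)
    for (int i = 0; i < N; i++) {
        #pragma acc loop vector
        for (int j = 0; j < M; j++)
            c[i][j] = a[i][j] + b[i][j];  /* 32 lanes, same instruction */
    }
}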

56:42 We'll see examples of this coming up. So, all right. So this construct is pretty

56:56 much identical to the OpenMP parallel directive. It does the same thing,

57:04 but now we call them gangs instead of threads. But each member of the gang will

57:12 execute identical code. So it's redundant execution. So here is the classic example with

57:21 the for loop. That's what it looks like. And I think we had the same

57:25 type of example when I talked about OpenMP. That's probably not

57:30 the situation you want to be in. So, like with OpenMP,

57:35 you want some work sharing, so that you can get more parallelism out of your

57:43 code, and basically divide up the, yeah, the work in the for loop.

57:49 In that case, instead of "for", OpenACC uses the name "loop". So

57:58 that's the way that you get work sharing in a for loop. So at this point it's

58:06 kind of pretty much just the same as in OpenMP.
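
(A side-by-side sketch, my own rather than the slide's, of the same work-shared loop in both notations. Without the "for" or "loop" clause, each thread or gang would redundantly run the whole loop.)

void saxpy_omp(int n, float s, const float *x, float *y)
{
    /* OpenMP on the host: 'parallel' makes threads, 'for' shares iterations */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}

void saxpy_acc(int n, float s, const float *x, float *y)
{
    /* OpenACC: 'parallel' makes gangs, 'loop' shares the iterations */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        y[i] = s * x[i] + y[i];
}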

58:15 All right. So it also, again, like OpenMP: immediately following the directive

58:22 , uh, there's work sharing, so to speak. So this is kind of

58:28 what happens from now on. You can, uh, make more use of workload

58:39 sharing, so to speak, within a region. Uh huh. It takes a,

58:45 you know, certain type of code in order for the compiler, of course

58:49 , to actually be able to support that in a good way. The other thing to

58:57 point out here is that, uh, one has to be a bit careful and basically use

59:03 the parallel loop construct separately for each loop that is meant to be parallelized. And here,

59:12 now, we'll talk about a few of these clauses that can be used together with

59:17 parallel constructs. You can manage both, uh, things in terms of data sharing clauses: you

59:28 can manage the number of members of gangs, the number of workers, the

59:37 vector length. You can specify the device type, and so on. And also, in terms

59:45 of the data management, you recognize many things from OpenMP in terms of copy

59:51 in, copy out, and also private, firstprivate, etcetera. So I'll say

59:58 a little bit about this, but not all that much, um, in part

60:05 because they are very similar to OpenMP. Um, I don't want to go

60:14 too much beyond what we did for OpenMP. In OpenACC there is also, and

60:24 I think the parallel construct is a little bit of OpenACC's effort to be

60:31 , um, similar to OpenMP, but they started out with having one more

60:37 construct, another construct, that they call the kernels construct. I will not talk

60:42 about it too much today, but more next time. But that scheme gives a lot

60:48 more freedom to the compiler to restructure code and try to optimize things for the device

60:56 . So, at least, more freedom to the compiler. This is in the descriptive spirit

61:03 of OpenACC, versus the prescriptive spirit of OpenMP.
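
(A small sketch of my own contrasting the two: with "parallel loop" the programmer asserts the loop is safe to parallelize; with "kernels" the compiler analyzes the region and decides what to parallelize and how.)

void two_styles(int n, float *restrict a, float *restrict b)
{
    #pragma acc parallel loop        /* prescriptive: programmer asserts    */
    for (int i = 0; i < n; i++)      /* the iterations are independent      */
        a[i] = 2.0f * b[i];

    #pragma acc kernels              /* descriptive: compiler is free to    */
    {                                /* decide how, or whether, to          */
        for (int i = 0; i < n; i++)  /* parallelize what is in the region   */
            b[i] = a[i] + 1.0f;
    }
}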

61:10 So here is a little bit on the policies: um, that is, um, in terms of the

61:22 parallel construct, it is similar to, again, the OpenMP one: things remain fixed.

61:29 Once things are set up for, for an instance of a parallel region, the

61:35 number of gangs and workers and vector lengths for the region doesn't dynamically change.

61:43 Um, whereas in the kernels construct, things are a lot more flexible and more

61:51 can happen. So, um, now let's talk about one clause, and

62:02 then I'll talk about, um, an example. The only clause I'll

62:09 mention is the reduction clause, because I use it in the example I'll show you

62:14 . So that's also supported in OpenACC. And that means that the

62:18 compiler, like in OpenMP, generates the code to make sure that things happen correctly

62:25 in terms of the reduction. And that's pretty much the take-home message from this

62:31 slide. Um, and these reduction operators are the common ones, supported in

62:38 OpenMP and most programming models.
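
(A minimal sketch of the clause, my own example: the reduction tells the compiler to combine the partial results from all gangs, workers, and vector lanes into one value safely.)

#include <math.h>

float max_abs(int n, const float *v)
{
    float m = 0.0f;
    /* each gang/worker keeps a private partial max; the compiler
       generates the code that merges them into the single result */
    #pragma acc parallel loop reduction(max:m)
    for (int i = 0; i < n; i++)
        m = fmaxf(m, fabsf(v[i]));
    return m;
}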

62:47 So now, a little bit of getting to the example. And I will do the example showing a little bit of, uh

62:54 , a couple of the compiler flags and then what the consequences are of using them,

62:59 maybe, for a simple example. So the OpenACC compiler has this -fast flag that basically

63:10 encourages the compiler to do whatever it can do to try to optimize the code.

63:16 So that's, I think, what will be used in the examples I show

63:22 . There's another flag that allows you to get information about what the compiler has

63:29 done to the code. And, uh, there are different, you know

63:36 , options for the flag. And, I mean, it will tell

63:42 you all the changes that it did, or certain optimizations, or things just focused on the

63:48 , um, accelerator. Mm. Then, as I mentioned, because

64:00 code needs to be generated for both host and accelerator, you can use an

64:07 OpenACC compiler, like the one we'll use, to generate code for the

64:14 host only. And then you use the target, uh, the -ta

64:21 flag, with the multicore attribute for the target flag. Or you can use,

64:30 uh, the attribute tesla, and use that for the GPU, because Tesla is

64:38 one of the product lines for NVIDIA GPUs. And on the same, uh, I said

64:43 that, um, as part of the attributes for the flag, you can then add

64:50 another attribute, "managed", and that tells the compiler that it should kind of manage the memory

64:58 for you. And I'll show you how that works in the next example.
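
(To make this concrete, here is roughly what those invocations look like with the PGI compiler, now part of the NVIDIA HPC SDK, which this description matches; the exact spellings can differ by compiler version, so treat these as illustrative.)

pgcc -fast -Minfo=accel -ta=multicore jacobi.c -o jacobi_cpu

pgcc -fast -Minfo=accel -ta=tesla:managed jacobi.c -o jacobi_gpu

The first line targets the multicore host only; the second generates NVIDIA GPU code and, with the managed attribute, lets the compiler and runtime handle the CPU-GPU data movement. -Minfo=accel is the flag that prints what the compiler did.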

65:09 Maybe I'll do a little bit of this example and then take some questions, if

65:15 somebody wants to ask a question. So this, again: matrix-matrix

65:27 multiplication... matrix-matrix multiplication and Jacobi's method, uh, as an

65:32 iterative solver, are canonical examples that are used by compiler people and HPC

65:40 people very, very often. So you've seen this before, and you'll

65:45 see it again, I'm sure, before the course is over. So, what we have here

65:55 is the solver applied to the Laplace equation, basically a relaxation scheme where, in

66:05 this case, you use the blue points, basically the average of the values at

66:10 the blue points, to get the value at the red point in the center of them.

66:15 Uh, this is just on a square grid. So now, this is then the sequential code, uh

66:29 , for this Jacobi iteration type scheme, what it is: it's two loops, in

66:40 the upper half of this slide, going through all the points in a

66:47 kind of 2D traversal, uh, of grid point values, in the, call

66:57 it, x and y, or i and j, directions. So the statement in the loop

67:02 is just doing the averaging for each one of the points. And then you need

67:08 to figure out what, um, the error is, and eventually you want

67:14 things to converge. So you compute the average of the points to get the new

67:23 point, that is Anew, and then you figure out what the error is. And

67:30 then, from the first two loops, you figure out what the maximum error is,

67:35 uh, anywhere across the grid. And then, um, what... once you have done

67:47 that... because your Jacobi iteration kind of works in a very structured, or synchronous,

67:56 way: you have to use all the old points, um, before you update

68:03 and use any new points. So that's what you see. It doesn't quite show

68:09 in the equation at the top, but a Jacobi iteration basically evaluates all red points

68:16 before you go to the next iteration and make them blue. So once you are done

68:25 computing all the red points, the points updated, then you basically make them

68:33 blue. And then, as long as the error is not sufficiently small, or you

68:43 haven't gotten tired of iterating and reached the maximum iteration count, you keep looping. So that's

68:48 the outer while loop. So this is the way the sequential code works.
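
(For reference, here is a sketch of that sequential code, following the canonical Jacobi/Laplace example; the names A, Anew, err, tol, and iter_max are the usual ones in that example, reconstructed here rather than copied from the slide.)

while (err > tol && iter < iter_max) {
    err = 0.0;
    /* first pair of loops: average the four neighbors, track the max error */
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
    /* second pair of loops: only after all new values exist do they
       replace the old ones (the red points "become blue") */
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            A[j][i] = Anew[j][i];
    iter++;
}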

68:57 And now, trying to use the GPU for doing this business: as you remember, execution

69:04 kind of starts and ends on the CPU. So you have to move

69:10 code and data to the GPU, and the idea in this case is you do

69:15 all the computations on the GPU. When it's all said and done, then the

69:20 result will be moved back to the CPU memory. So now, using OpenA

69:31 CC to try to get this job done... mm hmm. In this case,

69:37 there's a parallelization of, um, the outer loop of each of the two loop nests: the

69:46 first loop nest updates, so to speak, all the red points. So,

69:51 and then the next one is making the red points blue points. And in

69:56 this case, to guarantee correctness... there was a similar

70:04 example for OpenMP, and it's basically the same thing: you have a bunch

70:10 of independent iterations, ah, doing workload sharing for the outer for loop.

70:18 And in order to make sure that the reduction happens correctly, you can use the

70:25 reduction clause and have the compiler generate the proper instructions to make sure that the

70:32 global max error, yeah, is properly computed.
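
(A sketch of the OpenACC version just described, again following the canonical example rather than the exact slide: one parallel loop with a max reduction for the averaging, and a second parallel loop for the copy-back.)

while (err > tol && iter < iter_max) {
    err = 0.0;
    /* work-share the outer loop across gangs; combine the
       per-iteration errors safely with the reduction clause */
    #pragma acc parallel loop reduction(max:err)
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            Anew[j][i] = 0.25 * (A[j][i+1] + A[j][i-1]
                               + A[j-1][i] + A[j+1][i]);
            err = fmax(err, fabs(Anew[j][i] - A[j][i]));
        }
    }
    /* a separate parallel loop construct for the second loop nest */
    #pragma acc parallel loop
    for (int j = 1; j < n - 1; j++)
        for (int i = 1; i < m - 1; i++)
            A[j][i] = Anew[j][i];
    iter++;
}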

70:44 So, in this code, now we're supposed to compile it, right? So the first effort here is,

70:50 yes, to generate code for the CPU. In this case, it shows, uh,

71:01 the -fast flag and also the, uh, info flag, so the compiler is free to do

71:07 whatever it can do to optimize the code the most for the CPU, and to

71:11 print some information about what happened in terms of acceleration. But it

71:19 also says what can be accelerated: things were, uh, told to be

71:26 generated for the CPU, and it basically says that, you know, there are

71:32 things that are unknown in terms of the GPU side. Yeah, so this is

71:37 not particularly interesting, but it just shows what you can get. I will show

71:43 a couple of more slides and then take questions. So here's now one example.

71:50 So yes, um, the generated code was executed, in this case,

71:58 on, uh, an Intel Xeon, and this is the model and all that

72:03 . So it got about three times speedup on the 10-core CPU,

72:10 which is not too impressive as far as I'm concerned. But it did, at

72:16 least, manage to get some parallelism on some of the cores. And

72:23 so again, there's the caveat: one always has to be a bit careful.

72:29 Again, um, NVIDIA's compilers, one would think, uh, we have to assume

72:36 they do their best for their own devices. And it is interesting to

72:42 read the literature, the papers: when things are published by Intel people using

72:48 NVIDIA GPUs, versus NVIDIA people using Intel CPUs, how the claims differ

72:55 in terms of, uh, how the different kinds of devices perform. Um, I'll,

73:02 well, talk a little bit about the one where I decided to instead generate code for

73:10 , uh, the NVIDIA GPUs and let the compiler pretty much figure everything out. So here

73:19 , in this case, if you look at the left column on the slide, it tells you what

73:25 the compiler did. In this case it used these, uh... so the gangs: it basically generated

73:33 code for independent streaming multiprocessors. Remember, a member of a gang is

73:41 assigned to each streaming multiprocessor, so it tried to use, uh,

73:48 several, or all, of the streaming units. And the other thing, uh, it did was

73:55 manage the data traffic. And so, again, things start and finish on the

74:03 CPU. So it allocates memory on the GPU for the variables, or arrays, that

74:11 you need. And it also initializes them, so for A here the values are copied

74:18 from the CPU memory to the GPU, hence the copyin. And I'll talk a

74:23 little bit more about these, uh, clauses that, um, follow.

74:30 Then it also, uh, returns the values of A to the CPU. And

74:41 then, uh, on the second loop, yeah... and then, I guess it

74:47 doesn't say so explicitly, but it does allocate memory for Anew. But Anew

74:54 is just used in the computations that are now on the GPU, so there

75:01 is no need to transfer Anew between the CPU and the GPU.

75:14 So now, also, what happened? So, in this case, of course, uh, by

75:24 using OpenACC and letting the compiler take care of everything, you got 37 times

75:32 speedup compared to the single core, and a little bit more than 10,

75:38 12 times speedup compared to, um, the ten-core CPU. Now, of course

75:54 . Um, one can certainly be pleased in terms of the speedup, in my opinion

76:03 . Um, but I also want to encourage you to look a little bit, a

76:07 step beyond it. And that's why I put the remarks at the bottom of

76:13 the slide, which show that, relatively speaking, the efficiency, or the fraction of peak,

76:24 that they managed to get on the CPU is actually higher than the fraction of peak

76:29 they got on their own device. You understand? So, there, one has to be a little

76:34 cautious. Then, I think this is, uh, time to take questions before moving on.

76:42 Now, um, I know this one is not easy to remember, but

76:46 maybe this slide makes it easy to ask questions. Uh, I'll leave

76:52 this one up for a little bit while I continue, uh... So, I

77:09 guess I'm almost done, but I will, maybe, show a couple of more slides

77:17 , just as an intro, I guess, to the next lecture. So it's a bit

77:26 , I would say... you get mixed results. Many times the compiler does a very good

77:31 job in terms of taking care of everything. But, as always, you as

77:40 the programmer, or the one that knows the application and the data, may be able to

77:45 do a better job than the compiler that has to infer everything from the code

77:54 . So the next variation of this code I'm going to show you is if you

78:00 , as a user, try to manage the data transfers, in particular, yourself.

78:12 But before that, I will talk about the notion of unified memory. So this

78:20 is, uh, the idea of the unified memory, and I think it's

78:25 shown on this slide. Um, as I said, there are physically separate memories, and there

78:34 are, you know, three different data paths between them. Um,

78:41 the CPU memory has the memory bus to the CPU, and the GPU memory has

78:47 also a memory bus to the GPU. And between the two devices there is

78:54 the PCI Express bus. So it's exceedingly NUMA, if you like. It

79:02 is non-uniform memory access because of, uh, the highly different capabilities of these

79:10 paths involved in moving data. The unified memory notion is that you can treat it

79:19 kind of as one address space. Just like in NUMA there is one address space for

79:26 the shared memory in the node, but it is by no means uniform in access time

79:34 to all of it. So this is the notion of unified memory, which exists also in Open

79:41 MP 4.5 or later. So, um, so this is now what the compiler

79:52 used when you told it to manage memory for you: it uses this notion of unified memory.
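
(As a short sketch of what that buys you, assuming the managed-memory build shown earlier: with -ta=tesla:managed, a plain host allocation is usable on both sides, and the runtime migrates the data over the bus on demand, so no explicit data clauses are needed.)

#include <stdlib.h>

/* plain host allocation; no acc data clauses below */
double *a = malloc(n * sizeof(double));

#pragma acc parallel loop   /* runtime pages a[] over to the GPU as touched */
for (int i = 0; i < n; i++)
    a[i] = 0.0;
/* reading a[] on the CPU afterwards migrates it back transparently */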

79:58 And I guess at that point my time is up, so

80:04 we'll continue with this example next time, and I'll take some questions if you have

80:10 any. Okay, let me stop that screen share and see if there are questions.

80:39 So, so far, I have, I guess, mostly talked about... there's lots of

80:42 similarities, but the underlying hardware structure is, uh, visible, and becomes increasingly more

80:54 visible the more you try to optimize. Okay, there are no questions. I will

81:13 start with the first one in the region, the...
