COSC6365 Intro to HPC Fall 2021 - Lecture02_SLURM_Modules_Timing

00:00	Seems to be working. Maybe still a minute left.

00:53	Yeah, not as many today, Josh. I will say to

01:06	everyone that you can just try to join through the Zoom link

01:11	on Blackboard, but we don't see many. Yep.

01:33	Yeah, okay, I think we have four o'clock, so I will start

01:49	the lecture. So today, as I said last time, we will talk mostly about

01:59	shared environments and how to use them. The way it's going to happen is

02:08	that I will talk to some slides, giving the big picture in a way, and then

02:17	Josh will do a demo related to how to use, in particular, the resource manager that is

02:29	on pretty much all shared environments, and there is a particular one known as Slurm that

02:38	is very commonly used. So I will talk about those things, and then, as

02:45	I mentioned, Josh will do the demo, and then I may come back and talk to

02:51	a few more slides. So that's the nature of today. So now,

03:03	the terminology: I am going to spend a little bit of time

03:09	on it, because it's important for you to be familiar with it if you are not

03:14	already. And it's particularly important because some of the terminology used when it comes

03:22	to dealing with clusters and schedulers is somewhat ambiguous. So it's important to be

03:34	alerted to where the ambiguity is and where it's not, and I will try

03:40	throughout the course to use terms that are not ambiguous, but they may not be

03:45	the ones you are used to. Then I'll talk a

03:52	bit about this particular resource manager that is being used in Pittsburgh for their cluster,

04:02	is used at TACC, the Texas Advanced Computing Center, that you will be using,

04:09	as well as at the data science and research computing center here.

04:16	So it's very commonly used. All three sites that you may be using

04:21	use the same. So that simplifies things a little bit, right? And we will cover

04:27	the module command, and I believe Josh will also demonstrate that, and

04:31	timers. Timing is not that simple, it turns out. So I'll try to

04:39	in particular point out both what you should be doing in class and the pitfalls

04:47	. All right. So I think I mentioned last time there will be two clusters,

04:53	and the request for resources on them has not received approval yet, but hopefully in the

04:58	next couple of days I will get it. One of the systems is known as

05:03	Bridges-2, and it's located in Pittsburgh, at the Pittsburgh Supercomputing Center, and the other

05:10	one is known as Stampede2, and it is located in Austin, at the Texas

05:15	Advanced Computing Center, TACC. And here are the URLs that you should be using

05:24	to request your own account, as I mentioned last time. So that's kind of

05:31	the first step: request your account, and you should be able to do that. If

05:36	you already have accounts, there is no need to create a new account, and you

05:40	will not be able to. And then the account is entitled to

05:47	various allocations of resources. You may have, if you work with some research group, an

05:54	account on one of these two, or XSEDE. If you have an account

05:59	, then it also gets linked to the class account when that's established.

06:05	And here is just the required action, what you need

06:11	to do. If you already have an account, send us, or send

06:18	me, your account ID so we can link your account to the class

06:25	account. Otherwise, when your account is created, send us your user name, so we

06:31	can again add it to the class account. At the moment, there is no sharing, to

06:42	my knowledge, between the Pittsburgh site, which uses XSEDE, and TACC, where they

06:51	use kind of local accounts. If you have an XSEDE account, you

06:57	may not otherwise need to have different accounts at each site.

07:02	But they use different allocation mechanisms; that's why you need an account on both

07:08	sites. And so it is your choice how you want to name your accounts, but you should

07:20	again start the process of requesting. Usually it's very quick to go

07:25	through the process, so it shouldn't be a big deal.

07:32	Start early, as was pointed out, in case of hiccups. Most of you probably

07:41	have seen clusters or are very familiar with the notion of clusters. I just

07:45	put up kind of a picture of what clusters may look like.

07:50	In the upper left hand corner it's kind of more or less the homegrown

07:55	version, right? That is not uncommon among research groups still, and was certainly

08:03	very common in the early days. It may be more of a question now that

08:08	one buys something prepackaged, as you see on the bottom of the slide.

08:17	The previous picture is kind of the front side and the back side of

08:24	this. Commercial clusters look more like this, except for the upper

08:31	left hand corner. Again, that's kind of the homegrown version; the rest of

08:34	them are the more commercial grade clusters. There are different kinds of interconnects that are

08:42	being used. In the middle is a switch, into which the servers that make up

08:50	the cluster are hooked up, and in this case it may be typical RJ45

08:58	or copper cables that are being used, kind of in the middle. Otherwise, the other

09:02	one tends to be more fiber, and there are different ways you actually do that.

09:07	So there are ribbon cables also, which you see in the upper right hand corner.

09:12	And we will not talk too much about clusters today. But later on we will

09:15	talk about it when it comes to exercises for assignments on clusters and how to

09:21	program them. Now let's get into it, and you're welcome to ask questions, and

09:35	Josh, just try to help me monitor, you know, raised hands or chats, and

09:42	interrupt me, so I will respond to questions. So here is kind

09:50	of a little bit of how systems are put together, and it kind of starts in the

09:57	upper left hand corner with a piece of silicon,

10:07	a compute unit that is referred to as a processor or CPU, but that's

10:17	one of these ambiguous concepts, and I will come back to that later

10:23	today and throughout the course. Typically the piece of silicon in the left hand corner

10:30	may be referred to as a die; that's not ambiguous. It's very clear what

10:37	it is. It's a piece of silicon, and depending upon what the design is,

10:44	that piece of silicon has, you know, a few or even dozens of cores

10:50	today on it. I'll talk more about the notion of cores, I think, on

10:54	the next slide. Anyway, from one of these things one eventually builds a

11:03	system that is maybe popularly referred to as a PC or, in the context of

11:10	this class, mostly referred to as a server. That is a self-contained

11:16	unit. It has processors, it has memory, it has power supplies, it has network

11:23	interfaces, so you can connect them together in one way or another. So it's a self-

11:28	contained unit, and in terms of clusters it is typically referred to as a node

11:33	. But again, when you buy one, it may be referred to as a PC

11:38	or a server, and when it gets put in a cluster it tends to be referred to

11:43	as a node, and node is also a well defined concept and it's not ambiguous

11:51	. The processing chips, the packaged pieces of silicon that have cores on them,

12:00	then, when it comes to building or putting together a server, plug into

12:06	something known as the socket, and that's a concept that will be used throughout the course

12:14	. And it's not ambiguous, so you could say well defined. The number of

12:20	sockets in a server, or node when it comes to a cluster, is

12:28	most typically two. But there are also other designs that have more than two

12:36	sockets; that may be four, six, or, typically, a power of two. But when

12:46	it comes to both Bridges-2 and Stampede2, they are two-socket nodes.

12:54	Now, the mechanical way these servers are packaged: they are sometimes called a rack

13:08	unit, or packaged as a rack unit, or packaged in what is known as a

13:14	blade. And these so-called rack units are then put together directly in a

13:26	rack that may have quite a few tens of these rack units.

13:33	Another way of doing business, as I mentioned, is that these servers are packaged

13:43	up in the form of blades, and the blades are then put into a chassis,

13:50	and the chassis is then put into these racks. And when you're out there

13:56	trying to potentially buy one of these, you will most likely have to

14:03	pay a little bit more for the blade version than the rack unit version. And the

14:08	reason is that a rack unit is complete in a way in itself: it

14:17	has the power supply and it has fans for cooling and everything. So it's

14:24	self-contained. The blade isn't totally self-contained. In that case the chassis

14:32	has the power supplies and fans. And the way things work is that the

14:39	larger the fans you have, the more efficient they tend to be. So that

14:45	means blade servers and chassis tend to be a more energy efficient solution, and the vendors

14:53	then try to split the costs and the benefits of that by charging a

14:58	little bit more for the blade and chassis solution than for rack units, but in

15:02	the end, everybody's kind of supposed to win on it. So these are the

15:06	things: dies, of course; there's processors and CPUs, which is ambiguous;

15:15	sockets and nodes, not ambiguous; blades and rack units, unambiguous; chassis is not ambiguous; and cluster

15:22	. So the only one that is a bit iffy is processor

15:25	and CPU as a notion. Here's a little bit about the cluster.

15:32	Again, that is important to understand if you are not already familiar with it.

15:41	One aspect of these clusters is that not everything is the same in the cluster

15:46	in terms of the nodes, and neither is how they are used. So some of

15:55	these servers or nodes in the cluster serve as login nodes. That

16:01	is, and I'll talk more about that in a bit, those are the things to

16:06	which users connect, through the networks, up to the login nodes.

16:15	Then there are other dedicated nodes for the admins to kind of manage the cluster.

16:23	And those are things that users do not, you know, get access to. They are

16:28	for making sure things run smoothly. And the management tools or daemons

16:38	then don't need to compete with user jobs for resources; they're dedicated. Then there are the

16:46	things that you're supposed to use, that is, the compute nodes. And when I say

16:54	'supposed to use': the login nodes, yes, you're using them, because that's

16:57	the way you connect to the cluster. But that's kind of the limit of what

17:02	you're going to do with the login nodes. Anything else is supposed to happen on the

17:09	other nodes, typically labeled compute nodes, and the compute nodes themselves may be kind of partitioned

17:21	in a way, or grouped, into some nodes that are similarly configured; they

17:32	also may be configured for different usage, and hopefully we'll talk about that a

17:39	bit more. One, as it says on this slide, is interactive nodes. So those are

17:44	things where you can again work interactively, as you're most likely used to on

17:50	your laptop and other things. And then there are other nodes to which one submits jobs,

17:56	and they get run at some future time, most likely not immediately. And

18:04	I mentioned that there are different groups. In particular, in Pittsburgh, Bridges-2

18:10	has what they call regular memory nodes, with a kind of reasonable amount of memory, and then

18:17	there are different shades of that: those with more memory,

18:24	quite large memory or extra large memory, and nodes that have GPUs or other accelerators

18:31	on them. So these are different ways of grouping the compute nodes, and mostly you

18:38	will be using, on Bridges-2, what's known as the regular memory nodes. And

18:44	in terms of putting big clusters together, there are other nodes dedicated to managing

18:51	access to the file systems, or I/O nodes. So here is a little bit

19:03	with respect to Bridges-2. Bridges-2 is a fairly new system that

19:09	was put into production, I think, this spring, if I remember correctly, or

19:19	last year. The nodes in this case are two-CPU nodes;

19:28	that means two sockets, so two of these packaged pieces of

19:33	silicon put onto the server board form a node. And in this case

19:44	the CPUs are produced by AMD; it's one of the silicon processor

19:58	vendors. Most people know about Intel; AMD competes with Intel,

20:04	and it's kind of a strong competitor to Intel these days.

20:11	These particular EPYC CPUs that are used in Bridges-2

20:18	have 64 cores in each, I should say, socket. It's my problem,

20:27	you know: there are 64 cores on each socket, and

20:33	that's a typo I'll fix. Now, Bridges-2 then has 488

20:39	nodes that have 256 gigabytes of memory. So this is in particular for the

20:44	regular memory nodes. And then they have the large memory nodes that have 512

20:52	gigabytes of memory, and then they actually have additional nodes, and you can find out more

20:58	information about the configuration of the different nodes on the Bridges-2 cluster by

21:09	going to the URL at the bottom of the slide.

21:16	Now, for the GPU nodes that you likely will be using, unless anything changes,

21:24	for the assignments with programming GPUs, you see the configuration here. In that

21:32	case they actually use Intel processors instead of AMD processors, and they have a very recent

21:44	version of the GPUs. There are other nodes too that may be of interest in case

21:52	you are pursuing research; they also have more AI-focused

22:01	GPUs on some other nodes. Then when it comes to using more than a single

22:11	node, the interconnection network is important, and one aspect is the particular kind of protocol

22:19	used for the communication between nodes, and they use what is called the InfiniBand protocol.

22:28	Another part that is not mentioned on this line is the particular topology of the network.

22:35	And then there are other aspects of this particular interconnect. Mellanox

22:44	is a company that has been dominating the InfiniBand devices. InfiniBand is the protocol,

22:56	and HDR says something about the capability of this; in particular it stands

23:03	for high data rate of this InfiniBand interconnection technology. And then the other

23:14	system, at TACC in Austin, has Skylake Xeon nodes; those are a

23:23	few years old by now, but not that old. So probably three to four years

23:29	old. Typically nodes get replaced when they get four to five years old.

23:37	So they are probably getting close to the point of being retired, but they are also dual-socket

23:44	nodes, and in this case each processor or CPU has

23:54	cores, and slightly less memory per node, in this case 192 GB. Now, in your assignments

24:07	you will run some scripts or make calls to functions that will provide you with

24:17	detailed information on the particular processor that is installed in the cluster, and Josh will

24:25	demo that later on in today's lecture. Okay. So yes,

24:36	moving on to the resource manager that you will submit jobs to, which

24:43	is known as Slurm, for Simple Linux Utility for Resource Management. And there's a few

24:51	slides here about, I guess, how Slurm kind of operates, and this is kind of a

25:02	cartoon that shows it best. As I mentioned, you will remotely connect to login nodes.

25:11	Typically clusters, and in particular clusters serving many users, have a few login nodes, not

25:17	just a single one, so they may have three, four, five, half a

25:21	dozen or so login nodes to make sure that users have easy access to

25:29	the login nodes. Now, once there, it is then necessary to submit jobs to

25:39	Slurm, the resource manager, and in many cases, in fact,

25:46	the resource manager is shared among clusters. So in

25:55	your job submission script you tell it where your job wants to go, and the resource

26:00	manager then manages queues for the clusters. And again, both TACC and

26:08	Pittsburgh use the same resource manager, as does the research computing center here

26:15	at the university. And eventually you get output, the result from the job execution. And a

26:28	reminder that tends to have to be given to students many times: never run jobs on

26:34	the login nodes. This is just a little bit cartoonish, and I'm not going to

26:43	spend much time on it, but anyone interested should go and look

26:51	a little bit more into it. You need to familiarize yourself with Slurm

26:56	for sure. But it's a pretty complex piece of software that handles both

27:05	priorities, job queues, accounting, and all kinds of different aspects. So

27:14	it's by no means a simple piece of software. It is open source,

27:21	and that's part of the reason why many of these, certainly academic, computing centers use

27:25	it, but other centers also use it. It's quite a robust piece of software, as

27:32	it says here. It is also designed well, in the sense that it's

27:40	scalable. So here it gives you some numbers. I am not going to

27:44	go into it, only to help with the idea of how Slurm is put together and how it

27:49	works; you will just use it. But, as it says

27:54	here, it can manage up to more than 100,000 nodes and 100,000 jobs

28:02	per hour. So it has pretty good throughput, but it also means it

28:06	requires resources of its own to run, to not compete with user jobs for its resources

28:16	. Okay, so the next few slides here cover a few aspects of Slurm,

28:28	with some screenshots, and then we'll hand over to Josh for the demo. So this is kind

28:34	of just a heads-up from me, for you to then pay attention in the

28:43	demo. So yes, I do the slides, so you kind of more or

28:48	less get to see the same thing twice: once from me, pointing things out, and

28:55	then Josh doing it again in more detail. You can hopefully follow what he is

29:00	doing. So, as I just said, it's a resource management tool, and it

29:09	handles jobs that are submitted, and it has queues, and there are typically several different queues

29:19	for any given cluster. So you may have queues for, in a sense, small

29:28	jobs, and you may have a different queue for large jobs, and then you have priorities

29:35	set and use policies set, and the resource manager knows about the policies, and the system admins

29:42	are the ones that then implement the various policies that have been decided for the

29:49	use of the cluster for the different queues. And here's a little bit of the architecture:

29:56	basically there are a bunch of daemons. Some of them run centrally and

30:02	have the global view of what happens in the cluster, and then there are daemons running

30:07	on each one of the compute nodes that communicate with the central daemon. And then

30:14	there is a whole set of commands, and I'll talk about some of them, but that's where

30:18	you also need to go and familiarize yourself with Slurm to figure out which

30:28	commands are particularly useful. We will not use all that many commands in

30:32	class, but in other uses outside the class, when you do projects, there may be

30:40	other commands that we don't cover in the initial part of the course, or where

30:46	you may need to go and find out from the Slurm documentation. There's also a lot of detail

30:52	in terms of the parameter lists for the different commands that is

30:59	important, and for that you need to make sure that you have a way of

31:05	looking up and understanding exactly what the parameter means for the different commands.

31:17	Here is a little bit now with respect to coming back to the vocabulary or the

31:22	culture, and what it means in the context of Slurm. So part

31:36	of it is not particularly intuitive, I would say. And part of it is

31:43	because of the evolution of processors. Slurm, and most other resource managers,

31:53	were designed or started out when there was only kind of one core per

32:04	processor chip. Nowadays, for a processor chip, or

32:16	piece of processor silicon, as we said, there are many cores, or a large

32:24	number of cores, on each piece of silicon, and each core may also be

32:31	able to execute more than one instruction stream, or at least manage more than one instruction

32:38	stream, known as threads. So when these resource managers were designed,

32:50	it was very simple: there was kind of one core per socket, to

32:55	be very precise. Now there may be many execution streams managed on each core,

33:04	and then there are many cores in the same processor or socket, to be precise

33:13	. So the resource managers allow you to control where things run and also how many

33:23	of each kind. So there is this notion of sockets, well

33:29	defined. There are typically two to eight. I shouldn't say typical; typical is

33:36	two, but there may be between two and eight, rarely more than eight, of

33:42	these sockets in a node. So when you go to Slurm you request nodes,

33:51	then you may also request sockets and you may request cores and you may request threads,

33:57	and what this slide is just saying is that, depending upon what the actual physical hardware is,

34:06	the notion of CPU is different in terms of Slurm: it may mean a

34:12	core, it may mean a thread. CPU never means the entire piece of silicon

34:21	that plugs into the socket, you know, from my earlier slide. And when

34:26	people talk about processors, say, for instance, they may refer to the kind of

34:31	package you buy from Intel or AMD or some other processor vendor. So that's where things

34:39	are ambiguous, and one needs to be careful to make sure to understand how it's

34:43	used in the various contexts that you're in. And then on the right hand side

34:51	of this slide it kind of shows how nodes may be grouped in the cluster

35:00	into different partitions. So the partitions coexist, run simultaneously, and there's a job queue

35:08	for each one of these partitions. And this is again managed by the resource manager.
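
The node, socket, core and thread vocabulary above can be turned into a small counting exercise. The sketch below is only an illustration: the class and function names are made up for this example, the two sockets with 64 cores each follow the Bridges-2 regular memory description in the lecture, and the two threads per core is an assumption that depends on how a given system is configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeTopology:
    """Hypothetical description of one node (server) in a cluster."""
    sockets: int
    cores_per_socket: int
    threads_per_core: int

    @property
    def cores(self) -> int:
        # Physical cores in the node: sockets times cores per socket.
        return self.sockets * self.cores_per_socket

    @property
    def hardware_threads(self) -> int:
        # Execution streams the node can manage: cores times threads per core.
        return self.cores * self.threads_per_core


def scheduler_cpus(node: NodeTopology, cpu_means_thread: bool) -> int:
    """Number of 'CPUs' on the node, depending on whether the scheduler
    counts a CPU as a hardware thread or as a core."""
    return node.hardware_threads if cpu_means_thread else node.cores


# Two sockets and 64 cores per socket as described for the regular memory
# nodes; threads_per_core=2 is an assumed configuration, not a given.
rm_node = NodeTopology(sockets=2, cores_per_socket=64, threads_per_core=2)
print("cores per node:", rm_node.cores)
print("CPUs if CPU means thread:", scheduler_cpus(rm_node, cpu_means_thread=True))
print("CPUs if CPU means core:", scheduler_cpus(rm_node, cpu_means_thread=False))
```

The same hardware gives two different "CPU" counts, which is the ambiguity the slide warns about.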

35:16	I think this is pretty much what we talked about. And so

35:30	this is a little bit of examples of some simple commands that you will use,

35:37	and I guess Josh will come back to them, I'm quite sure, when he does

35:42	his demo. But one thing: if one uses the kind of short form,

35:48	that is a single character, in this case N for instance, upper and lower case N have

35:54	different meanings. So the upper case N is then the number of nodes, that is,

36:02	as defined, the servers, if you go back to one of my

36:08	early pictures, whereas lower case n, that's the number of tasks

36:16	to create for the job. And then you can control, as it says here, how many

36:21	of these tasks you want per socket or per node. So

36:28	there are many ways of controlling things, and this is important for performance. So in

36:36	the assignments later you will be asked to play a little bit with

36:43	these. It's also important, beyond the class, that you understand

36:51	that you have options for controlling how execution streams are allocated in the cluster.
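
As a rough sketch of the upper case N versus lower case n distinction, the helper below assembles an srun command line as a list of arguments. The helper name and the example values are hypothetical; the options used (-N for nodes, -n for tasks, --ntasks-per-node and --ntasks-per-socket) are standard Slurm options, but the Slurm documentation remains the reference for their precise meaning.

```python
import shlex


def srun_command(program, nodes, ntasks, ntasks_per_node=None, ntasks_per_socket=None):
    """Build an srun argument list (hypothetical helper for illustration).

    nodes  -> -N : number of nodes (servers) requested
    ntasks -> -n : number of tasks to create for the job
    """
    if nodes < 1 or ntasks < 1:
        raise ValueError("nodes and ntasks must be positive")
    args = ["srun", "-N", str(nodes), "-n", str(ntasks)]
    if ntasks_per_node is not None:
        args.append(f"--ntasks-per-node={ntasks_per_node}")
    if ntasks_per_socket is not None:
        args.append(f"--ntasks-per-socket={ntasks_per_socket}")
    args.append(program)
    return args


# Example: 8 tasks spread over 2 nodes, 2 tasks on each socket.
# "./a.out" is a placeholder program name.
cmd = srun_command("./a.out", nodes=2, ntasks=8, ntasks_per_socket=2)
print(shlex.join(cmd))
```

Building the command as a list keeps each option separate, so the same list could be passed to subprocess.run on a login node without any shell quoting issues.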

37:04	So I've got a picture of it here, and depending on how sophisticated the resource

37:13	manager is, it may select different sets of nodes. So you may ask for

37:24	10 or 100 nodes, and in the most straightforward way the resource manager will just give you

37:37	the number of nodes you asked for, irrespective of where they happen to be physically

37:44	in the cluster. That may matter for the performance you get, because there

37:56	may be more or less of the network involved in communicating between the nodes that run the job,

38:05	and the network is in almost all cases shared between all the jobs that run

38:16	in the cluster. So that means, depending upon how much, how many

38:27	packets, how much intensity or traffic your application generates between the nodes that are

38:33	being used for the job, you may get variable performance when you run at different

38:40	times, because the traffic from other jobs may vary, and you may get

38:48	a nice result with respect to the network sometimes, and sometimes

38:54	not. It's also the case, in the second bullet, that nodes are shared.

39:03	So that means in the same node there may be 100 or more cores, and there

39:15	may be different jobs using different cores. Cores may not be shared, but certainly different

39:25	cores in the same node may run different jobs. So now some of the resources

39:37	in the server, that is, in a node in the cluster, are shared among the jobs

39:49	. So if you don't have exclusive access to the node, that means your

39:57	performance may depend on other jobs running on the node at the same time. So

40:06	that's why, when you time things, you need to make

40:10	sure to get exclusive use, because not only may you get not

40:19	a great performance, or not as good, but it's also not so clear how representative

40:28	the time you get is, because it may be to some degree significantly dependent on other

40:36	jobs that you have no control over. And I already said that not all nodes

40:45	are configured the same. Now, in the case of Bridges, for the regular memory nodes

40:53	that you will use, they are the same. On Stampede all the nodes have the

40:56	same configuration, but in general one needs to be conscious of that.
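
Because run times can vary with what else is running on the node and on the network, a single measurement says little. The following is a minimal sketch, not the course's timing method: it repeats a measurement with Python's time.perf_counter and reports the spread. The work function is a stand-in for whatever is being timed.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Run func several times and return the elapsed wall-clock time of each run."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the spread; a large gap between min and max hints at interference."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def work():
    # Placeholder workload standing in for the code being timed.
    return sum(i * i for i in range(100_000))


samples = time_repeated(work, repeats=5)
stats = summarize(samples)
print(f"min {stats['min']:.6f} s, median {stats['median']:.6f} s, max {stats['max']:.6f} s")
```

Reporting the minimum together with the median and maximum makes it visible when other jobs, or clock rate changes, have disturbed some of the runs.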

41:06	The last bullet on the slide: it's not intended to be

41:12	the case, but it happens that even if the pieces of silicon, the packages

41:22	you buy from Intel or AMD, have the same spec and are supposed to run at

41:28	the same frequency, for instance, it may turn out that for some

41:36	odd reason some of them may not run at the particular clock frequency that they are

41:42	expected to run at. Another reason is that these days cooling is an issue

41:52	for chips, and all of them nowadays have firmware and temperature sensors built in, and

42:04	the firmware controls the clock speed of the CPU. So unless you have locked the clock

42:10	rate that the processor runs at, depending on the temperature or heat situation in a particular

42:19	location of the cluster where the node is, all nodes may not run at the rate

42:26	you expect them to. And here is just very briefly,

42:38	I'm not going to go into the details at this point; I will come back to

42:41	it at some later lecture, but it's just stressing that there are, you know,

42:48	commands that allow you to decide how you want your jobs to run. And

43:01	not today, I think, but in some later lectures Josh probably will get more into

43:05	the commands and figuring out how to choose and control the choice of

43:13	where things run. There is an additional twist on these things

43:20	that we'll talk about, not today, much later, which is that the operating system has its own

43:29	ideas about where things should run. And it may change its mind during the execution

43:36	of your code, so that's also an issue for something to be repeatable, and sometimes you

43:46	may have a better idea than the operating system of what goes on. And for that you

43:52	may force the operating system to provide what you want by what is known as

43:57	binding, allocation to particular resources. In terms of job

44:07	submissions and the big picture stuff: when you want to submit the job, you

44:15	have to tell the resource manager, as I pointed out, you know, how many nodes

44:21	you want, and cores, and a bunch of these other things, but you also have

44:28	to tell it what you expect the run time to be. So the resource manager

44:32	can get an idea of kind of the footprint, in some sense, of the job:

44:37	how many nodes and cores, and how long are you going

44:40	to use them? And you also need to tell how much memory you need,

44:45	so it can do a decent job figuring out how to allocate the job. Now,

44:55	with the policies being implemented, most resource managers basically terminate jobs when the

45:06	requested time is up, and if your job is still running, of course, that's

45:12	not great, in particular if you don't do some form of checkpointing

45:16	in the job, because that means the whole thing may be wasted. So of course

45:22	one way to avoid that is to make sure that you have a good margin,

45:29	so there is no way that you will run out of time

45:36	and the job will be terminated at that time. Now of course that means the

45:44	job manager then needs to make sure that there is enough of a window of

45:49	time to run your job, and it may take a while to allocate that larger window of

45:57	time. So it may have the effect that your job sits in a queue for

46:03	a long time. Some other sites may let you overrun for a

46:09	bit before killing your job. It's always a good practice to try

46:16	figuring out how long your job is going to run, in particular before you do any

46:25	form of what is typically called production runs, so you get through it reasonably.
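
The job footprint described above (nodes, tasks, memory, and a run time with some margin) can be sketched as the header of a batch script. Everything specific here is an assumption for illustration: the helper names, the 25% margin, and the example values are made up, and the partition name RM is just the regular memory partition mentioned in the lecture. The #SBATCH options themselves (--partition, --nodes, --ntasks, --mem, --time) are standard Slurm options.

```python
import math


def slurm_time_limit(estimated_seconds, margin=0.25):
    """Turn a run time estimate into a Slurm time string, padded by a margin.

    Returns HH:MM:SS, or D-HH:MM:SS when the limit is a day or longer.
    """
    total = math.ceil(estimated_seconds * (1.0 + margin))
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    if days:
        return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def batch_header(nodes, ntasks, estimated_seconds, mem_gb, partition):
    """Assemble the #SBATCH lines describing the footprint of a job."""
    return "\n".join([
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks={ntasks}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={slurm_time_limit(estimated_seconds)}",
    ])


# Example values only: a job estimated at one hour on one node.
header = batch_header(nodes=1, ntasks=16, estimated_seconds=3600, mem_gb=8, partition="RM")
print(header)
```

A larger margin lowers the risk of the job being terminated, at the price of a longer wait in the queue, which is the trade-off described above.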

46:34	. And so those are the two . And I think there's a bunch

46:42	these commands and I think uh huh may want to use the Southern Council

46:49	, the batch command, uh the command probably not, but in other

46:57	you may be interested in more ask too much. How much are

47:02	to gravitation? I get these Um So for me it's useful and

47:09	are other things are trying to do of jobs or data broadcast. S

47:19	simply stands for Slurm, so everything tends to be prefixed by the S. And the

47:25	command you will be interested in, and the command for sure, and sinfo, I

47:32	think. So Josh will actually demo them, and this is just text to support it if you want to

47:36	go back. But again, the best way is to go to the Slurm documentation

47:43	and find out the precise meaning of the commands. And I think

47:54	Josh will actually demo these things and will show it. So there is one that

47:58	tells you processor clock frequency and a bunch of different things. And

48:07	there are now two screenshots; I will be very quick with those, and so Josh

48:14	will go through them again in the form of a demo. So, a little bit with

48:21	respect to this: if you look at the screenshot, this is on the partition that says

48:27	RM, and again that is the regular memory partition on the Bridges cluster. It says

48:33	whether the partition is up and if there is any particular time limit

48:41	, and then one gets a little bit of a listing of

48:49	the status of the different nodes and how they are allocated. So in this case,

48:54	even though it may seem that everything is up, it is

49:00	only that the partition is up. But a particular node, if you

49:05	look at the first one here, like 323, happens not to respond, and there

49:13	are other examples here, and Josh then will talk more to this

49:22	. And this happens to be, I guess, the extreme memory partition here,

49:29	and in other ways you can see the status of jobs, so whether they are

49:36	running. In this case none of them are, and they are all sort of pending or waiting

49:39	in a queue, and it also tells you, you know, what the potential priority is for the

49:49	particular jobs in this case. And I want to point out that

50:05	you get the job numbers, and those are obviously useful in case you

50:08	want to get more information or kill the job.
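
To make the job listing concrete, here is a small sketch that counts job states in text laid out like the default squeue listing. The sample text is invented for this example (the job numbers, names and users are not from the screenshot), and the parsing assumes job names without spaces. In the state column, PD means pending and R means running.

```python
from collections import Counter

# Invented sample in the layout of the default squeue listing.
SAMPLE_SQUEUE_OUTPUT = """\
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   1001        EM    job_a    user1 PD       0:00      1 (Priority)
   1002        EM    job_b    user2 PD       0:00      1 (Resources)
   1003        RM    job_c    user1  R      12:34      2 r[001-002]
"""


def parse_squeue(text):
    """Split a squeue-style listing into one dictionary per job."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    jobs = []
    for line in lines[1:]:
        # Split on whitespace, keeping the last column in one piece.
        fields = line.split(None, len(header) - 1)
        jobs.append(dict(zip(header, fields)))
    return jobs


jobs = parse_squeue(SAMPLE_SQUEUE_OUTPUT)
state_counts = Counter(job["ST"] for job in jobs)
pending_ids = [job["JOBID"] for job in jobs if job["ST"] == "PD"]
print("jobs per state:", dict(state_counts))
print("pending job numbers:", pending_ids)
```

The job numbers collected this way are the ones that would be passed to other commands when asking for more information about a job or cancelling it.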

50:20	srun is simply submit and run the job. Again, you will see a demo, I think, of

50:28	doing interactive jobs as well as doing batch submissions. And this is a case

50:37	of submitting a job, and the status of the job is being shown.

50:47	In this case, on the first line you see the srun command, and then the number

50:52	of cores that was requested for this particular job, and the partition the

50:58	job is supposed to run in, and then the pointer to the code. Again, you will

51:06	see a demo. I want to make sure Josh has enough time to

51:12	demo. So that's why I also go through things quickly. And in this case

51:19	I will probably skip this. Let

51:26	me see, I'll basically let Josh talk to these slides more, and

51:33	this is the batch application; it tells here the number of nodes and such. So

51:38	again, you can go back to the slides, but also to the recording of the demo.

51:42	So you should have more than one way of reminding yourself of these commands, but also

51:48	go to the Slurm documentation for getting detailed documentation on how Slurm works.

52:01	Let's see if there's anything else I want to cover before you go.

52:08	I can go back to them if you want to at some

52:13	stage. But why don't I... I think you can start with the demo.

52:18	Yeah, I think that's the right thing to do, so you have enough time.

52:23	um uh huh Okay, if I getting my thing Yes, yes,

52:32	need to stop share mine first, fine, I don't think so.

52:37	me get over. Okay, All , okay, hopefully everyone can see

52:45	screen now. Uh again get this uh for example how to get access

52:54	the clusters and again we have to of the clusters that will be available

53:00	us once you get the allocation of , once one is the original uh

53:05	to cluster and the other is the to cluster and you'll need ssh client

53:14	get uh to get to those uh Windows are using the simple tool

53:20	Footy on Mac, I believe everyone access to ssh client on their

53:26	They can use that. Mhm. for getting onto bridges, I know

53:31	a little bit smaller here so they are all that you need to follow

53:36	your user name at bridges to dot dot e d u and that should

53:44	you into one of the logging roads bridges to and on stampede to

53:49	Uh Again username at stampede to dot dot utexas dot eu So that's you

53:59	what you need to get access to both clusters. I will give the demo

54:04	on the Bridges-2 cluster because we still have one project running there and I can show

54:08	you some of the things, but when you do that it will simply, okay, start

54:18	representative. Yes. So when you log in, just put in your password and that will

54:24	get you into the login node, and one way you can make sure that you

54:29	are on a login node is by just looking at the console. So I usually use the

54:34	login message here. And then on this node you can run pretty much all

54:39	the Linux commands that you would like. lscpu would actually give you

54:45	details about the processor that's available on the nodes, and many other details, like

54:51	how much the L1 cache and things are. We can also use

54:57	a few other commands that give you information about how much memory is available

55:02	on this login node: what's the total memory, how much is free, how much

55:08	is available and so on. Please, can you go back to the previous, the

55:17	lscpu, yes. I think again it uses the notion of socket.

55:24	So, that's an important thing. Again, with those things, for

55:31	the moment you may not be familiar; you don't need to know it in

55:34	the beginning of the course, but it's important. And in this case it also tells

55:41	you the threads per core, and that is an important parameter too that we'll talk

55:51	about later. So I'm just trying to point out, since you have it, this notion

55:55	of cores. In this case, both the typical AMD and Intel processor

56:06	cores, in this case to be specific, are capable of managing two threads concurrently

56:17	on a core, but it's a system configuration, so it's not always the case that

56:24	the system admins have configured the system to use two threads per core. So sometimes that's

56:32	something you want to check, because it has an impact on performance. What else?

56:38	I don't see anything else that I want to point out. This is in terms

56:45	of being, yeah, I think about the, but yeah, okay, from this:

56:55	it has 64 cores. Right. And, well, it also has, towards the

57:07	bottom half, it also tells you about cache sizes, and for understanding performance that's

57:15	an important part. So sometimes in your term, when you're asked to try to

57:24	understand how codes run, you may want to remind yourself what the cache

57:30	hierarchy looks like on the processors. Okay. I think that's it.

57:39	And also you can see, right, that's how much

57:44	main memory there is on the node, right, in this case 256

57:48	gigabytes, about. So that's why

57:57	you end up with this kind of strange 263. That's because the 263 is

58:03	base 10 and, I'd say, 256 is based on 1024. Anything

58:16	you think is a thing to take from these for the future? I

58:23	guess for now it's only important to understand memory and then how the core

58:29	and socket organization is on this chip. So that's one important factor to keep in mind, and

58:35	I'll get back to it when I get to the interactive commands. And one more thing,

58:40	I guess, that I just noticed on this thing that may be confusing at first, because it

58:48	says 64 cores per socket and it has two sockets per node. So that's two

59:00	times 64. That's 128. But on the top it says 256, and

59:07	in that case it's because it has enabled two threads per core. So that's another

59:15	factor of two. So to try to understand these numbers, one needs to

59:20	think about cores and threads and sockets: there are 128 physical cores and then

59:35	two threads per core, and that's why it counts 256,

59:39	double the 128. Again, useful information to understand things in the

59:47	future. So I'm just pointing out that this command gives you a lot of information.
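
The counting described above (sockets, cores per socket, threads per core) can be written down as a small sketch. The numbers are the ones read off the lscpu output in the demo; the function names are made up for illustration and are not part of any tool.

```python
def physical_cores(sockets: int, cores_per_socket: int) -> int:
    """Physical cores on a node: sockets times cores per socket."""
    return sockets * cores_per_socket


def logical_cpus(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """CPUs the operating system reports: physical cores times threads per core."""
    return physical_cores(sockets, cores_per_socket) * threads_per_core


if __name__ == "__main__":
    # Values quoted in the demo: 2 sockets, 64 cores per socket, 2 threads per core.
    print(physical_cores(2, 64))   # physical cores on the node
    print(logical_cpus(2, 64, 2))  # CPU count shown at the top of lscpu
```

With the demo's values this gives 128 physical cores and 256 logical CPUs. If the administrators had left threads per core at 1, both numbers would be 128.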

59:52	Okay, okay, go ahead. Now, yeah, again,

60:00	one more command that you can use, which is available on every Linux machine,

60:05	is one where you can get information about the system, the version that has been

60:11	installed, and also the kernel. Now notice that right now we

60:19	are on the login node, and we realize that we need to get

60:23	to a compute node somehow and do our work there. Now there's this,

60:31	I think it was a chat question from someone: is it just the same password to log

60:41	into the server? I'm not quite sure whether that's what we used for the portal

60:49	account. That's a good question. As far as it's related to the Pittsburgh portal, I

60:56	think it's true. But, so I think in terms of

61:04	the external sites, TACC and Pittsburgh, you should answer to be precise.

61:13	Correct, I believe it's the same password as you use on the

61:17	portal, but slightly different for Stampede2, a different way of doing the ssh. I think

61:26	there's two passwords for Stampede2, and you can have, what about the passwords,

61:31	the same thing, but I don't recall the steps exactly. I can find out.

61:37	Yeah, right, like you're saying, basically there's one lookup place that is managed

61:46	by the site with Stampede2. It is in two places, but there's nothing that prevents

61:51	you from using the same password. I believe that's okay. Yeah.

61:57	And one thing I forgot to mention: we're on Bridges-2

62:02	right now, but when you use Stampede2 you need to set up multi-factor

62:08	authentication. That's the requirement when you log in on Stampede2, so

62:14	apart from your password, like it asked here, it will also ask you for a

62:19	token code, and it's going to require you to download an app that

62:25	gives you that token code each time you log in. So there's

62:30	two-factor authentication to log into Stampede2. And then they have the

62:37	guide on their website, so you can follow the steps there to get

62:42	access to that. That's one piece of information. Okay. Yeah, so as I was

62:50	saying, so now there's three ways. Okay: can you run only one

62:59	ssh client for the job? ssh for a job, I don't know what

63:06	you mean. But yes, you can have multiple ssh clients open for any

63:11	of the sites; it's not necessarily that you have only one client, and you can have multiple

63:17	ssh connections to the same place. Hope that answers it. Yes.

63:28	Right. So yeah, the next thing I was going to get to is how

63:32	to run your job, and so, as I was saying, there's three ways you can do

63:36	it. So first is simply using the srun command, and I have an executable for a

63:44	simple hello world program here. And when you do srun you can just

63:51	do -n, which in this case provides the number of tasks, or the number of times,

63:58	so to say, for your program to be executed. This is srun and then simply

64:04	the executable file name. Now there's other options with srun as well.

64:10	You can just do srun --help to get details on those.

64:14	I'm not going to get into them now. It should be simple.
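
As a rough illustration of the command shape just described (srun, the -n task count, then the executable), here is a small sketch that only assembles the command line as text; it does not contact Slurm. The executable name ./hello_world and the function name are assumptions for illustration.

```python
import shlex
from typing import List


def srun_command(ntasks: int, executable: str) -> List[str]:
    """Build the argument list for `srun -n <ntasks> <executable>`."""
    if ntasks < 1:
        raise ValueError("the number of tasks must be at least 1")
    return ["srun", "-n", str(ntasks), executable]


if __name__ == "__main__":
    # Hypothetical executable name; prints the line one would type at the prompt.
    print(shlex.join(srun_command(4, "./hello_world")))
```

On a cluster the resulting line would be typed at the shell prompt; with -n 4 the program is started four times, as described in the demo.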

64:18	Now with srun, when you run your program it will give you a

64:24	message, something like this: that the job is now queued and waiting for resources,

64:29	and after a while it will get access to the resources and then run your program

64:33	interactively. So the output of your program, you will be seeing it

64:38	on the console. And this is good when you know that

64:43	my program will be done in a few seconds, I just need to check whether

64:47	it works or not. And what srun does is it gives

64:52	your program access to some of the resources that you request based on these flags

64:56	and then runs your program on those resources. But as you will see,

65:01	if you are running multiple jobs, you don't want to sit there and wait for

65:06	hours, you don't want to wait on the console, so that if you

65:11	want, you can do other things as well. So srun is not the best

65:15	way to go for it. So the second way you can do it is by

65:21	using an interactive shell. What that does is it gives you access

65:28	to a compute node for a period of time that you specify using the

65:33	flags for time as well. And you can do that by using the command

65:37	interact. And here you can provide it with the number of nodes

65:44	you want with the capital N. I'm requesting just one node at a time, and

65:50	then you also provide the number of tasks. So -n and the long form of that

65:56	are both equivalent. There's a question about the host name for Bridges-2.

66:03	No, it's not UH. It's Bridges-2,

66:11	it's bridges2.psc.edu, so that's Pittsburgh Supercomputing

66:18	Center. Yeah, bridges2.psc.edu, the

66:22	one here. Okay. Yeah. So you give the number of nodes and

66:30	then you also provide the number of tasks. So on these compute nodes, we

66:35	have 128 physical cores, so you need to provide a number that is either less

66:41	than or equal to the number of cores. If you go more than that,

66:46	it will give you an error, that that's an incorrect configuration that you're

66:50	asking for. And when you do -n 128, it is going to give

66:58	you access to one full compute node. Let me switch to another slide.
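
The rule just stated (ask for at most as many tasks as there are physical cores on a node) can be sketched as a simple check before building the request. This assumes the -N and -n flags behave as described in the demo for the Bridges-2 interact command; the constant and function name are illustrative only, and nothing here talks to the scheduler.

```python
CORES_PER_NODE = 128  # physical cores per compute node, as quoted in the demo


def interact_command(nodes: int, ntasks: int) -> str:
    """Return an `interact` command line, rejecting over-sized requests."""
    if nodes < 1 or ntasks < 1:
        raise ValueError("nodes and ntasks must both be at least 1")
    if ntasks > CORES_PER_NODE:
        # Mirrors the error the demo mentions for an incorrect configuration.
        raise ValueError(
            f"requested {ntasks} tasks, but a node has {CORES_PER_NODE} cores"
        )
    return f"interact -N {nodes} -n {ntasks}"


if __name__ == "__main__":
    print(interact_command(1, 128))  # one full compute node
```

Asking for 128 tasks on one node corresponds to the full-node request shown in the demo; 129 would be rejected by this check before it ever reached the cluster.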

67:07	I did this already. So the command is here, and now the difference is you

67:13	were initially on the login node, where you requested access to the resources, and once you

67:19	get the allocation, it will tell you which node is ready for your job,

67:23	and then it will give you access to the console on that particular

67:29	node. So now see the difference: that was the login node. Now

67:33	you can see you're on a compute node here. On a compute node

67:37	you can just again run all the commands interactively, and this is useful for

67:43	you when you know that you have multiple tests to run and you don't want to

67:46	do srun every time; you just get access to one compute node for a

67:50	while, do all your tests and be done with it. In that case

67:54	interact is very useful. So when you're performing multiple tests, you

68:00	can just get access to a compute node interactively and do your tests. But

68:06	again, the problem there is that you still wait on the console to get access

68:11	to the resources, and then again, you still need to wait on the console to

68:15	get output for your program. So the third way of running your job is by

68:21	submitting a batch job to the queue on Slurm. And that you can do by writing

68:28	a batch script, and it doesn't necessarily need to be named as batch; no matter how

68:33	it's named. For the sake of simplicity, a very minimal batch file would look

68:39	like this, where you provide all the parameters for your job: the number

68:47	of nodes, one node, the name of the partition, and the time that

68:54	you think your job will take, the number of tasks, and then the command that will run

68:59	your program. And what that's going to do, when you submit

69:05	that job using the sbatch command, is it's going to simply queue that job and tell

69:11	you that your job is submitted. Now it's in the queue, and then you can

69:15	check the status of your job by using the squeue command, and here you can

69:23	pass your username to filter out only your jobs, because if you do simply squeue,

69:28	there's going to be a long queue of jobs, because there's

69:32	many jobs running on the cluster. So by using squeue and your username

69:36	you can see all the jobs of yours that are currently running or pending.

69:42	And what that's going to do is, once the job finishes, and when

69:49	it is finished, the output of your job will come out into a file

69:55	which will by default be named after the job number of the job that you

70:03	submitted. So that is the output file of that job there.

70:09	And then when you try to read the file, you will

70:16	see the output in your file. And the benefit of

70:22	sbatch is you can submit as many batch jobs as you want and then

70:26	forget about them for a while and do whatever you want to do on the

70:30	console, trying to test or anything, then come back and check if they have

70:35	finished, and then you'll have the output in these output files.
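
To make the batch workflow above concrete, here is a small sketch that writes out the text of a minimal batch script, extracts the job number from the message sbatch prints on submission, and derives the default output file name. The partition name RM, the executable ./hello_world and all function names are assumptions for illustration; the script on the slide may differ in detail, and nothing is actually submitted.

```python
import re


def minimal_batch_script(partition: str, nodes: int, ntasks: int,
                         time_limit: str, command: str) -> str:
    """Return the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -N {nodes}",       # number of nodes
        f"#SBATCH -p {partition}",   # partition name (assumed value passed in)
        f"#SBATCH -t {time_limit}",  # time limit, HH:MM:SS
        f"#SBATCH -n {ntasks}",      # number of tasks
        "",
        command,                     # the command that runs the program
        "",
    ]
    return "\n".join(lines)


def job_id_from_sbatch(output: str) -> int:
    """Extract the job number from sbatch's 'Submitted batch job N' message."""
    match = re.search(r"Submitted batch job (\d+)", output)
    if match is None:
        raise ValueError("no job number found in sbatch output")
    return int(match.group(1))


def default_output_file(job_id: int) -> str:
    """Slurm's default output file name for a batch job."""
    return f"slurm-{job_id}.out"


if __name__ == "__main__":
    print(minimal_batch_script("RM", 1, 128, "00:05:00", "srun ./hello_world"))
    job_id = job_id_from_sbatch("Submitted batch job 123456")  # made-up job number
    print(default_output_file(job_id))
```

The generated text would be saved to a file and handed to sbatch; afterwards squeue with your username shows the job, and the output lands in the file named after the job number.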

70:42	So those are the three ways you'll mainly be running your jobs for most of

70:47	the assignments. So, any questions?

70:55	Running the job with srun on a compute node? You can

71:05	do both on a compute node: you can run both with srun and without

71:09	srun. So the difference will be, if you just simply do srun hello

71:15	world, it's going to take the value of the number of tasks and will run

71:19	your program that many times. If you want to run it a smaller number

71:24	of times, so if you had access to more, but you now want to run

71:27	for 64, you can use the -n flag with 64 and

71:33	your program will run only that many times. Yeah. And the other thing is,

71:42	the srun command, again, you can run from both login and compute nodes. The

71:49	interact command gives you access to the compute node interactively, so you can only

71:54	run it from the login nodes. And the sbatch command, you can run it

71:59	from either the login node or the compute node as well. It's best to run

72:04	it from the login node, because you don't want to hold up one compute node while

72:07	you're submitting batch jobs. So if you know that you have a bunch of jobs

72:12	that you need to do, it's best to be on the login node and then

72:16	submit all your jobs using sbatch. Yeah. Yeah. I just want to

72:22	stress what you just said. You can run it from the login node,

72:27	not run it on the login node. Yes, yes, yes.

72:33	Don't run your job on the login node. People are trying to get access to

72:37	compute nodes via the login node; you're going to mess up everything else on the

72:42	cluster. It's a shared environment. Be careful of that. Always run your jobs

72:46	on the compute nodes one way or another, either using srun, interact, or sbatch

72:52	. Okay, so those are the three ways you can run your jobs. Now

72:59	a little bit more information on Slurm. One is the sinfo command, which

73:05	you can use to get details about the different partitions of the nodes that are

73:10	on the cluster. Again, as the columns say, you can see what the partition

73:16	is, whether it's up or not, and what nodes are in that partition.

73:22	There is a question: are you okay if you use srun or sbatch?

73:26	I'm not sure what you mean by that. It's okay, if you can unmute,

73:34	that would be good, or if you type it out. What do you

73:38	mean? So the question is basically

73:48	that it wouldn't be run on the login node if I use these commands. That's

73:54	the question, right? Yeah, yeah.

73:59	So if you do srun or sbatch from the login node, that's

74:02	going to ask for some resources, and that will be the compute nodes, and then run your

74:06	jobs there. So in that sense you will not make the login node busy

74:13	in any way. Yeah. But yeah, on the login node,

74:20	never do dot slash and then your program, because that's going to run your program on

74:24	the login node. Whenever you want to run your program, do srun or sbatch,

74:29	one of those. Yeah, the other command that might be useful is

74:37	the one that I just showed you, squeue, and I would just use more with

74:41	that, so that it doesn't go all the way through the list. With this

74:45	you can see all the jobs that are running on the cluster currently, and

74:51	again you can filter that by passing your username. Yeah. So there's

74:57	one interact job that I'm on, on the compute node right now, so that is

75:01	the job it's showing right now. What else? Yes.

75:07	The other command that will be useful for you, and this is very important, is the scancel

75:12	command, because many times it will happen that you will see your job either

75:18	pending for a long, long time, or you may have made a mistake and

75:22	your program is stuck on something, it's just not finishing up. And you

75:26	see that your program says it's running, and it's been like 10 hours now and it's

75:31	still running, so there's probably something going on there. So using this

75:36	scancel command and providing the job ID, you can cancel your job,

75:42	and hopefully that will cancel my interact job. There you go. So one

75:49	comment on that; this is part of the things I will stress throughout this course. You

75:56	need to set yourself some expectations of how long things should take, and then

76:04	if it's way off, again, kill the job and go back and think what might be

76:09	wrong. So it's very important to have some way of estimating how long things should take

76:17	on the kind of nodes or set of nodes that you're requesting. And of course it

76:24	is also wasteful. You know, we have a limited amount of time that we

76:29	get on it. But these clusters are expensive. So

76:35	trying to be careful and not wasting resources is a good thing, even if we don't

76:41	pay for it explicitly. But that doesn't mean we should be wasteful.
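
The advice above (have an expectation of the run time, and kill the job if it is way off) could be sketched as a simple rule of thumb. The tolerance factor of 3 and the function names are assumptions chosen for illustration, not something stated in the lecture; the sketch only produces the scancel command text and does not cancel anything.

```python
def should_cancel(elapsed_minutes: float, expected_minutes: float,
                  tolerance: float = 3.0) -> bool:
    """True when a job has run far longer than expected (assumed rule of thumb)."""
    if expected_minutes <= 0:
        raise ValueError("expected_minutes must be positive")
    return elapsed_minutes > tolerance * expected_minutes


def scancel_command(job_id: int) -> str:
    """Command line that cancels the given job."""
    return f"scancel {job_id}"


if __name__ == "__main__":
    # A job expected to take about 10 minutes that has been running for 10 hours.
    if should_cancel(elapsed_minutes=600, expected_minutes=10):
        print(scancel_command(123456))  # made-up job number
```

The point is less the exact factor than having an estimate at all, so that a ten-hour run of a ten-minute program gets noticed and stopped.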

76:48	And that's more important when you will be submitting your batch jobs:

76:53	you provide the time you will be using these resources in the batch script, and

76:58	that reminds me that for your batch job, whether it has finished or not, Slurm will

77:07	cancel it once the time limit is reached. So it's not going to check whether you

77:11	have finished the execution of your program or not. So you need to be a little

77:15	careful. So always give a little bit of buffer when you submit your jobs.

77:19	But as Dr. Johnson said, don't give too much time, because that's going to cause

77:24	two problems. One, it might take a long time to get your jobs picked

77:29	up, because Slurm will see that you've asked for this thing for an hour and say

77:34	I don't have time to put in a job that takes an hour right now, so it

77:37	will just put everything in the queue and put it in a pending state. So

77:42	yeah, so just keep that in mind when you put in those jobs.

77:45	Yeah, I guess one good aspect of how Slurm works, and I think that's true

77:52	here as well: so the bad part of overestimating is the queue time.

77:57	But it doesn't charge us for what you request. It charges for what you

78:03	use, which is not true, say, when people work in the cloud, where

78:09	they may charge for what you request and not what you use.
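
The two points above (add a small buffer to the time limit, and expect to be charged for the time used rather than the time requested) can be sketched as follows. The 20 percent buffer, the core-minute unit and the function names are assumptions for illustration; the actual charging units on each site may differ.

```python
def time_limit_with_buffer(estimated_minutes: int, buffer_percent: int = 20) -> str:
    """Return an HH:MM:SS time limit: the estimate plus a buffer, rounded up."""
    if estimated_minutes < 1 or buffer_percent < 0:
        raise ValueError("estimate must be >= 1 minute and buffer >= 0 percent")
    # Integer ceiling of estimated_minutes * (100 + buffer_percent) / 100.
    total = (estimated_minutes * (100 + buffer_percent) + 99) // 100
    hours, minutes = divmod(total, 60)
    return f"{hours:02d}:{minutes:02d}:00"


def core_minutes_charged(cores: int, used_minutes: int, limit_minutes: int) -> int:
    """Charge for time actually used; a job never runs past its time limit."""
    return cores * min(used_minutes, limit_minutes)


if __name__ == "__main__":
    print(time_limit_with_buffer(50))       # estimate of 50 minutes plus 20 percent
    print(core_minutes_charged(128, 2, 5))  # asked for 5 minutes, finished in 2
    print(core_minutes_charged(128, 10, 5)) # needed 10 minutes, cancelled at 5
```

Under these assumptions a job that finishes early costs only what it used, while a job that needs more than its limit is cut off at the limit, which is why a modest buffer is the safer choice.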

78:15	So at least it doesn't charge us for the full thing. So it just charges

78:18	for as long as you use. Okay. And yes, one more

78:29	piece of information that you will be needing is the package manager that's available on

78:37	these clusters; I think the command is the same on both Stampede2 and Bridges-2.

78:44	And before I get started with that: on Bridges-2, you have the interact command to

78:49	get interactive access to some nodes. On Stampede2, it's a little bit different:

78:55	the command is called idev. The parameters are exactly the same. It's just two different

79:00	names for the same command on those clusters, so keep that in mind when you use

79:04	Stampede2. You need to use idev rather than interact.

79:10	Okay. So for the package manager, there is the command called module, and when

79:17	you use module avail it's going to show you all the modules that are available,

79:23	or all the packages that are available on these clusters. So you can see we

79:28	have the CUDA packages, MPI, and there's others. This is a compiler that

79:36	you would need to load to compile your code, and there are several other

79:41	packages that we'll go through when we use them. But this is the

79:45	way you can check what packages are available there. Yeah. You need

79:53	to use the command to make sure you get the packages loaded that

79:58	you need. Yeah, I'm getting to that. And so yeah,

80:04	with module avail, you see all the packages that are available. Someone is saying they have

80:11	problems logging in; it says access denied. I'm not sure what's going on, but

80:16	we can take that discussion offline; class has almost ended. So a few more

80:22	and then that's it. So module list: it tells you the

80:30	packages that you have loaded currently. As you see, I haven't loaded really any

80:35	packages right now; it's only one of the packages that Bridges-2 loads by default for you.

80:41	But if you, let's say, want to load the gcc compiler that was over here,

80:49	you can use the command module load, and it's always a good idea to also give

80:53	the version numbers for the packages that you want to use, because in many

80:57	cases there are multiple versions of the package. For example, there are

81:02	here at least two or three different versions of the package. And when you do module

81:07	load and give the package name, and after that run the command module list,

81:13	it will show you that, yes, you have loaded that package, and now in your

81:17	environment you can use that package for whatever purposes. And yes, I think that's

81:27	pretty much it to get you started with the basics. Any questions on that? And

81:35	Dr. Johnson, in case I missed anything. I don't know, cancel after the time limit is

81:41	reached? Oh no, so it's kind of the opposite. So the question is,

81:46	will the job be cancelled after the time limit is reached? So if it's actually finished,

81:52	the job will finish when the execution of the program is finished. Let's say

81:59	if you give five minutes for your program in your script and it tries to run

82:04	beyond the time limit. Okay, let's say if you give five minutes and

82:08	your program finishes in two minutes. Then the batch job will end right then, and

82:14	you're not going to be charged for the extra three minutes that you said you

82:17	might be using. But yes, the other thing is, if you give less time than

82:22	it would take for your program to finish, then Slurm will automatically cancel your job, and

82:27	whatever, so to say, the status of your program was at that moment, that's the output

82:32	you will be getting in the output file. For this other question: I don't think

82:44	a VPN is required. You can just use any simple ssh client

82:49	to get to the clusters. There's no need for this. So yes,

83:08	in principle time is up. I don't know if you have any comments on timing that

83:14	you want to bring up. Or should we take that next time?

83:19	Yeah, we can do that next time. Okay. It's probably the best,

83:23	since time is up. Any other questions? So I guess Tuesday next week we'll

83:35	do it face to face again. And that question regarding the

83:51	course, or is it with regards to today's content? Yeah.

84:07	Correct. So at least this, classes face to face, is just for the

84:16	first two weeks at the moment according to the university. So that means it will be

84:22	face to face on Tuesday next week, and if there is no change in university policy,

84:30	then there will be face-to-face classes after next week, and for every class I'll

84:41	try to make sure that there's also access online, and things will be recorded. So

84:48	you will have an option of how you can take the class. And you

85:03	in principle are allowed to log into the clusters. So there's two steps. One is you need

85:10	to have an account. So you can get an account set up, and for that

85:16	you don't need an allocation. I think that, if I remember correctly, that's true

85:23	for both of the clusters we'll be using. Now, once you have an account,

85:29	it doesn't mean you can use the cluster. In order to use the

85:35	cluster you would need to have resources allocated to you. And it so happens that the

85:47	account from the class last fall is still open, but it will close in a few

85:53	weeks, so that would go away. So you might be able to log in

86:00	and use that allocation, but I need to tie you to that allocation. So I'm

86:08	kind of the allocation manager on both sites. So I need to enable the use

86:13	of that allocation with the respective user ID. That's why I need, or Josh needs,

86:20	to have your user ID on the sites, so we can link your user ID

86:26	to the resources we get for the class. So once you have an account,

86:33	once the allocation is approved, and once, as the last step, you have been linked to the

86:39	resources, then you can use that. So if you have an account, then

86:49	there is no class resource at all at the moment on Stampede2, but there

86:54	is an old class allocation, so if you send me

86:59	your user ID I can link you to the old account and you can try it out,

87:05	but it will end the 31st of the month. So until then you can use

87:12	the old allocation; there are still cycles left in it. So that will

87:16	work, but that's why we need a new allocation, to sustain it for the semester.

87:28	Yeah, so as soon as we get the allocation we will notify you; that's

87:35	it. But if you already have an account then it just takes seconds for us

87:39	to link your account to the allocation, and then you will be notified that now you

87:44	can run on these clusters. So that's why we encourage you to go and get

87:51	your own account and tell us what your account ID is. So whenever the allocation

87:57	is approved we can link you. And for those who have sent it, I still haven't

88:13	checked, but I will check them out and then I will link you so

88:18	you can try things out. And we'll link you again when another allocation is

88:29	approved. One more thing, not regarding the allocation: everyone should have received a link for

88:35	a discussion site, and that's something we'll be using for discussion, so make sure you get

88:41	into that. It's just a discussion forum where you can have questions

88:48	available to everyone, so it's easy to have discussions on that.

88:53	You should have got a link for joining the class.

89:11	Any other questions? Okay, thank you so much. And I will

89:23	be in the classroom. So I guess we'll see. Yes, we'll

89:28	be there on Tuesday next week. Okay, thank you so much.
