© Distribution of this video is restricted by its owner
00:00 Seems to be working. Maybe still a minute left.

00:53 Yeah, as many of today, Josh. Yeah, I will say to

01:06 everyone that, if you can, just try to join through the Zoom link

01:11 on Blackboard, but we don't see many of me, yep.

01:33 Yeah, okay, I think we have four o'clock, so I will start the

01:49 lecture. So today, as I said last time, we will talk mostly about

01:59 shared environments and how to use them. The way it's going to happen is

02:08 that I will talk to some slides, giving the big picture in a way, and then

02:17 Josh will do a demo related to how to use, in particular, the resource manager that is

02:29 on pretty much all shared environments, and there is a particular one, known as Slurm, that

02:38 is very commonly used. So I will talk about those things, and then,

02:45 as I mentioned, Josh will do the demo, and then I may come back and talk to

02:51 a few more slides. So that's the nature of today. So now,

03:03 the terminology is what I am going to spend a little bit of time

03:09 on, because it's important for you to be familiar with it, if you are not

03:14 already. And it's particularly important because some of the terminology being used when it comes to

03:22 dealing with clusters and schedulers is somewhat ambiguous. So it's important to be

03:34 alerted to where the ambiguity is and where it's not, and I will try

03:40 throughout the course to use terms that are not ambiguous, but they may not be

03:45 the ones you are kind of used to. Then I'll talk a little

03:52 bit about this particular resource manager that is being used in Pittsburgh for their cluster,

04:02 is used at TACC, the Texas Advanced Computing Center, that you will be using,

04:09 as well as at the data science for Research Computing center at your

04:16 So it's very commonly used. All three sites that you may be using

04:21 use the same one, so that simplifies things a little bit, right? And we

04:27 their model command and I believe the . S will also demonstrate that meticulous

04:31 timers. Timers are not that simple, it turns out. So I'll try

04:39 in particular to point out both what you will be doing in class and the pitfalls

04:47 . All right. So I think I mentioned last time there will be two class

04:53 accounts, and the request for resources has not received approval yet, but hopefully in the next

04:58 couple of days I will get it. One of the systems is known as

05:03 Bridges-2, and it's located in Pittsburgh, at the Pittsburgh Supercomputing Center, and the other

05:10 one is known as Stampede2, and it is in Austin, at the Texas

05:15 Advanced Computing Center, TACC. And here are the URLs that you should be using

05:24 to request your own account, as I mentioned last time. So that's kind of

05:31 the first step: request your account, and you should be able to do that. If

05:36 you already have accounts, there is no need to create a new account, and you

05:40 will not be able to. And then the account is entitled to

05:47 various allocations of resources. You may have, if you work with some research group, an

05:54 account on one of these two or on XSEDE. If you have an account

05:59 already, then it also gets linked to the class account when that's established.

06:05 And here is just the required action, what you need to

06:11 do. If you already have an account, send us, or send

06:18 me, your account ID so we can link your account to the class

06:25 account. Otherwise, when your account is created, send us your username so we

06:31 can again add it to the class account. At the moment, there is no sharing, to

06:42 my knowledge, between the Pittsburgh site, which uses XSEDE, and TACC, where they

06:51 use kind of local accounts. If you have an XSEDE account, you

06:57 may not otherwise need to have different accounts at each site.

07:02 But they use different allocation mechanisms, so that's why we need an account on both

07:08 sites. And so it's your choice how you want to name your accounts, but you should

07:20 again start the process of requesting. Usually it's very quick to go

07:25 through the process, so it shouldn't be a big deal.

07:32 Start early, as pointed out, in case of hiccups. Most of you probably

07:41 may have seen clusters or are very familiar with the notion of clusters. I just, if

07:45 it's not the case, this is kind of a picture of what clusters may look like.

07:50 In the upper left-hand corner it's kind of more or less the homegrown

07:55 version, right? Now it's uncommon among research groups, but it was certainly

08:03 very common in the early days. It may be more of a question whether

08:08 one buys something prepackaged, which you see on the bottom of the slide.

08:17 The previous pictures were kind of the front side, and this is the back side of

08:24 this. Even for commercial clusters, it looks more like this, except for the upper

08:31 left-hand corner. Again, that's kind of the homegrown version, and the rest of

08:34 them are the more commercial-grade clusters. There are different kinds of interconnection that are

08:42 being used. In the middle is a switch, into which the servers that make up

08:50 the cluster are hooked up, and in this case it may be typical RJ45

08:58 or copper cables that are being used, kind of in the middle. Otherwise, the

09:02 other one tends to be more fiber, and there are different ways you actually do that.

09:07 So there are the ribbon cables also, which you see in the upper right-hand corner.

09:12 And we will not talk too much about clusters today, but later on we will

09:15 talk about it when it comes to exercises or assignments on clusters and how to

09:21 program them. Now, let's get into it, and you're welcome to ask questions, and

09:35 Josh will try to help me monitor, you know, raised hands or chats, and

09:42 interrupt me, so I will respond to questions. So here is

09:50 kind of a little bit of how systems are put together, and it kind of starts in the

09:57 upper left-hand corner with a piece of silicon, which gets packaged into

10:07 a compute unit that is referred to as a processor or CPU, but that's

10:17 one of these ambiguous concepts, and I will come back to that later

10:23 today and throughout the course. Typically, the piece of silicon in the upper left-hand corner

10:30 may be referred to as a die; that's not ambiguous. It's very clear what

10:37 it is: it's a piece of silicon. And depending upon what the design is,

10:44 that piece of silicon has, you know, a few or even dozens of cores

10:50 today on it. I'll talk more about the notion of cores, I think, on

10:54 the next slide. Anyway, one packages these things, and eventually one builds a system

11:03 that is also maybe popularly referred to as a PC or, in our context, in

11:10 this class, mostly referred to as a server. That is a self-contained

11:16 unit. It has processors, it has memory, it has power supplies, it has network interfaces

11:23 so you can connect them together in one way or another. So it's a self-

11:28 contained unit, and in terms of clusters it is typically referred to as a

11:33 node. But again, when you buy it, it may be referred to as a PC

11:38 or a server, and when it gets put in a cluster it tends to be referred to

11:43 as a node. And node is also a well-defined concept, and it's not ambiguous.

11:51 The processor chips, the packaged pieces of silicon that have cores on them,

12:00 then, when it comes to building or putting together a server, plug into

12:06 something known as a socket, and that's a concept that will be used throughout the

12:14 course. And it's not ambiguous, so you could say it's well defined. The number of

12:20 sockets in a server, or node when it comes to a cluster, is

12:28 most typically two, but there are also other designs that have more than two

12:36 sockets; that may be four or six or, typically, a power of two. But when

12:46 it comes to both Bridges-2 and Stampede2, they are two-socket nodes.

12:54 Now, the mechanical way these servers are packaged: they are sometimes called a rack

13:08 unit, or packaged as a rack unit, or packaged up in what is known as a

13:14 blade. And these so-called rack units, they are then put together directly in a

13:26 rack that may have quite a few tens of these rack units.

13:33 Another way of doing business, as I mentioned, is that these servers are packaged

13:43 up in the form of blades, and blades are then put into a chassis,

13:50 and the chassis is then put into these racks. And when you're out there

13:56 trying to potentially buy one of these, you will most likely have to

14:03 pay a little bit more for this version than for the rack unit version. And the

14:08 reason is that the rack unit is complete in a way in itself: it

14:17 has the power supply and it has fans for cooling and everything. So it's

14:24 self-contained. The blade isn't totally self-contained. So in that case the chassis

14:32 has the power supplies and fans. And the way things work is that the

14:39 larger the fans you have, the more efficient they tend to be. So that

14:45 means blade servers and chassis tend to be a more energy-efficient solution, and the vendors

14:53 then try to split the costs and the benefits of that by charging a

14:58 little bit more for the blade and chassis solution than for rack units, but in

15:02 the end, everybody's kind of supposed to win on it. So these are the

15:06 things: of course, there's processor and CPU, which are ambiguous,

15:15 and nodes are not ambiguous; blades and rack units are unambiguous; chassis is not ambiguous, and neither is cluster.

15:22 So the only one that is a bit iffy is processor

15:25 or CPU as a notion. Here's a little bit about the cluster.

15:32 Again, that is important to understand if you aren't already familiar with it.

15:41 One aspect of these clusters is that not everything is the same in the cluster

15:46 in terms of the nodes, and neither is how they are used. So some of

15:55 these servers, or nodes, in the cluster serve as login nodes.

16:01 I'll talk more about that in a bit, but those are the things to

16:06 which users connect, so through the network, up to the login nodes.

16:15 Then there are other dedicated nodes for admins to kind of manage the cluster.

16:23 And those are things that users do not get access to; they are

16:28 for making sure things run. And the management tools or daemons

16:38 then don't need to compete with user jobs for resources; they're dedicated. Then there are the

16:46 things that you are supposed to use, that is, the compute nodes. And when I say

16:54 "supposed to use": the login nodes, yes, you're using them because that's

16:57 the way you connect to the cluster, but that's kind of the limit of what

17:02 you're going to do with the login nodes. Anything else is supposed to happen on

17:09 other nodes, typically labeled compute nodes, and the compute nodes themselves may be kind of partitioned

17:21 in a way, or grouped, into some nodes that are similarly configured, or they

17:32 also may be configured for different usage, and hopefully we'll talk about that a little

17:39 bit more. One group, as it says on this slide, is interactive nodes. So those are

17:44 things where you can again work interactively, as you're kind of most likely used to on

17:50 your laptop and other things. And then there are other nodes to which one submits jobs,

17:56 and they get run at some future time, most likely not immediately. And

18:04 I mentioned that there are different groups. In particular, in Pittsburgh, the Bridges-2 system

18:10 has what they call regular memory nodes, with kind of a reasonable amount of memory, and

18:17 there are different shades of those with more memory,

18:24 quite large memory or extra large memory, and nodes that have GPUs or other hardware

18:31 on them. So these are different ways of grouping the compute nodes, and mostly you

18:38 will be using, on Bridges-2, what's known as the regular memory nodes. And

18:44 in terms of putting together big clusters, there are other nodes dedicated to managing

18:51 the storage systems, or I/O nodes. So here is a little bit of the

19:03 specs for the Bridges-2 system. Bridges-2 is a fairly new system that

19:09 was put into production, I think, this spring, if I remember correctly, or

19:19 last year. And, as it says, in this case it's a two-CPU node;

19:28 that means two sockets. In this case, two of these packaged pieces of

19:33 silicon put on to the server board form a node, and in this case

19:44 these CPUs are produced by AMD. It's one of the silicon processor

19:58 vendors. Most people know about Intel; AMD competes with Intel,

20:04 and it's kind of a strong competitor these days to Intel. But

20:11 these particular CPUs that are used in Bridges-2

20:18 have 64 cores in each, I should say, socket. It's my problem.

20:27 You know, there are 64 cores on each socket;

20:33 that's a typo, I'll fix it. Now, Bridges-2 then has 488

20:39 nodes that have 256 gigabytes of memory. So this is in particular what they call

20:44 regular memory nodes. And then they have the large memory nodes that have 512 gigabytes

20:52 of memory, and then they actually have additional nodes, and you can find out more

20:58 information about the configuration of the different nodes on the Bridges-2 cluster by

21:09 going to the URL at the bottom of the

21:16 slide. Now, for the GPU nodes that you likely will be using, unless anything changes, for

21:24 the assignments with programming GPUs, you see the configuration here. In that case

21:32 they actually use Intel processors instead of AMD processors, and they have a very recent version

21:44 of the GPUs. There are other things too that may be of interest in case

21:52 you are pursuing research: they have some also more AI-focused hardware

22:01 on some other nodes. Then when it comes to using more than a single

22:11 node, the interconnection network is important, and one aspect is the particular kind of protocol

22:19 used for the communication between nodes, and they use what's called the InfiniBand protocol.

22:28 Another part that is not mentioned on the slide is the particular topology of the network,

22:35 and then there are other attributes of this particular interconnect. Mellanox

22:44 is a company that has been dominating InfiniBand devices for the InfiniBand protocol,

22:56 and HDR is a label on the capability of this particular interconnect; it

23:03 stands for high data rate of this InfiniBand interconnection technology. And then the

23:14 system at TACC, the one in Austin, has Skylake Xeon nodes. They are a

23:23 few years old by now, but not that old, so probably three to four years

23:29 old. Typically, nodes get replaced when they get four to five years old.

23:37 So it's probably getting close to the point of being retired, but they are also dual-socket

23:44 nodes, and in this case each processor or CPU has

23:54 cores, and slightly less memory per node, in this case 192 GB. Now, in your assignments

24:07 you will run some scripts or make calls to functions that will provide you with

24:17 detailed information on the particular processor that is installed in the cluster, and Josh will

24:25 demo that later on in today's lecture. Okay. So yes,

24:36 just moving on to the resource manager that you will submit jobs to: it

24:43 is known as Slurm, for Simple Linux Utility for Resource Management. And there are a few

24:51 slides here about, I guess, how it kind of operates, and this is kind of what the

25:02 cartoon shows best. As I mentioned, you will remotely connect to the login nodes.

25:11 Typically, clusters, in particular clusters serving many users, tend to have a few login nodes,

25:17 not just a single one, so they may have three, four, five, half a

25:21 dozen or so login nodes to make sure that users have easy access to

25:29 the login nodes. Now, once there, it is then necessary to submit jobs to

25:39 Slurm, the resource manager, and the resource manager then, in many cases, in fact

25:46 often, is shared among clusters. So it knows, when you submit

25:55 your jobs, and the submission script tells it, where your job wants to go, and the resource

26:00 manager then manages the queues for the clusters. And again, both TACC and

26:08 Pittsburgh use the same resource manager, as does the center here,

26:15 and eventually you get the output results from job execution. And a

26:28 reminder that I tend to have to give students many times: never run jobs on

26:34 the login nodes. This is just a little bit cartoonish, and I'm not going to

26:43 spend much time on it, but if one is interested, one should go and look up

26:51 a little bit more about Slurm. You need to familiarize yourself with Slurm,

26:56 for sure. But it's a pretty complex piece of software that deals with both

27:05 priorities, job queues, accounting, and all kinds of different aspects. So

27:14 it's by no means a simple piece of software. It is open source,

27:21 and that's part of the reason why many of these, certainly academic, computer centers use

27:25 it, but also other centers use it. It's quite a robust piece of software,

27:32 as it says here. It is also designed well, in the sense that it's

27:40 scalable. So, as it says here, to give you some numbers; I'm not going to

27:44 go into it, only to help with the idea of how Slurm is put together and how it

27:49 works; you will just use it. But it, as it says

27:54 here, can manage more than 100,000 nodes and 100,000 jobs

28:02 per hour. So it has pretty good throughput, but it also means it

28:06 requires resources of its own to run and not compete with user jobs for its

28:16 resources. Okay, so the next few slides here show a few aspects of Slurm and

28:28 some screenshots, and then we'll hand over to Josh for the demo. So this is kind

28:34 of just alerting you to things to pay attention to in the

28:43 demo. So yes, Josh will do the demo, so you kind of more or less

28:48 get to see the same thing twice: from me, pointing things out, and

28:55 then Josh doing it again in more detail, so you can hopefully follow what he is

29:00 doing. So, as I just said, it's a resource management tool, and it

29:09 handles jobs as they are submitted, and it has queues, and there are typically several different queues

29:19 for any given cluster. So you may have queues for, in a sense, small

29:28 jobs, and you may have a different queue for large jobs, and then you have

29:35 usage policies set, and Slurm knows about the policies, and the systems

29:42 admins are the ones that then implement the various policies that have been decided for the

29:49 use of the cluster for the different queues. And here's a little bit of the structure:

29:56 basically there are a bunch of daemons. One of them runs centrally and

30:02 has the global view of what happens in the cluster, and then there are daemons running

30:07 on each one of the compute nodes that communicate with the central daemon. And

30:14 there is a whole set of commands, and I will talk about some of them, but that's where

30:18 you also need to go and familiarize yourself with Slurm to figure out what

30:28 commands are particularly useful. We will not use all that many commands in

30:32 class, but in other uses outside the class, when you do projects, there may be

30:40 other commands that we don't cover in the initial part of the course, so

30:46 you may need to go and find them in Slurm. There's also a lot to learn

30:52 about in terms of the parameter list for the different commands that is

30:59 important, and for that you need to make sure that you have a way of

31:05 looking up and understanding exactly what the parameters mean for the different commands.

31:17 Here is a little bit now, coming back to the vocabulary or the

31:22 terminology, of what it means in the context of Slurm. So part

31:36 of it is not particularly intuitive, I would say. And part of it is

31:43 because of the evolution of processors. Slurm, and most other resource managers,

31:53 were designed, or started out, when there was only kind of one core per

32:04 processor chip. So nowadays, for a chip, or

32:16 piece of processor silicon as we saw, there are many different cores, or a large

32:24 number of cores, on each piece of silicon, and each core may also be

32:31 able to execute more than one instruction stream, or at least manage more than one instruction

32:38 stream, known as threads. So when these resource managers were designed,

32:50 it was very simple: there was kind of one core per socket, to

32:55 be very precise. Now there may be many execution streams managed on each core,

33:04 and then there are many cores on the same processor or socket to be

33:13 managed. So the resource managers allow you to control where things run and also how many

33:23 of each kind. So there is this notion of sockets, which is well

33:29 defined. There are typically two to, I shouldn't say typical; typical is

33:36 two, but there may be between two and eight, rarely more than eight, of

33:42 these sockets in a node. So when you go to Slurm, you request nodes,

33:51 then you may also request sockets, and you may request cores, and you may request threads.

33:57 And what this slide is just saying is that, depending upon what the actual physical hardware is,

34:06 the notion of CPU is different. In terms of Slurm, it may mean a

34:12 core, it may mean a thread. CPU never means the entire piece of silicon

34:21 that plugs into the socket, you know, on my earlier slide, and when

34:26 people talk about processors, say, for instance, they may refer to the kind of

34:31 package you buy from Intel or AMD or some other processor vendor. So that's where things

34:39 are ambiguous, and one needs to be careful to make sure to understand how they are

34:43 used in the various contexts that you're in. And then the right-hand side

34:51 of this slide kind of shows how nodes may be grouped in the cluster

35:00 into different partitions, so that partitions coexist and run simultaneously, and there is a job queue

35:08 for each one of these partitions. This is again managed by the resource manager.

35:16 I think this is pretty much what I talked about. And so

35:30 this is a little bit of examples of some simple commands that you will use,

35:37 and I guess Josh will come back to them, I'm quite sure, when he does

35:42 his demo. But one of the things, if one uses the kind of short form

35:48 that is a single character in this case: for instance, upper and lower case N have

35:54 different meanings. So the upper-case N is then the number of nodes,

36:02 that is, by definition, the servers, if you go back to one of my

36:08 early pictures, whereas lower-case n is the number of tasks

36:16 that is requested for the job. And then you can control, as it says here, how many

36:21 of these tasks you want per socket or per node. So

36:28 there are many ways of controlling things, and this is important for performance. So you will,

36:36 you know, in assignments later, be asked to play a little bit with

36:43 these. It's also important beyond the class that you understand that this may be important and

36:51 that you have options for controlling how execution streams are allocated in the cluster.
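
The difference between upper-case `-N` (nodes) and lower-case `-n` (tasks) can be sketched by generating a small batch script. This is a minimal, hypothetical Python sketch, not the script used in class: the helper name `build_batch_script`, the program `./my_program`, the partition name `RM`, and the chosen counts are assumptions for illustration. `-N`, `-n`, `--ntasks-per-socket`, `-p`, and `-t` are standard Slurm options, but the values a given site accepts should be checked in its documentation.

```python
def build_batch_script(nodes, ntasks, ntasks_per_socket=None,
                       partition=None, time_limit=None,
                       command="./my_program"):
    """Return the text of a Slurm batch script as a string.

    nodes   -> '-N': number of nodes (servers) requested
    ntasks  -> '-n': total number of tasks requested for the job
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH -N {nodes}",
        f"#SBATCH -n {ntasks}",
    ]
    if ntasks_per_socket is not None:
        # Limits how many tasks are placed on each socket.
        lines.append(f"#SBATCH --ntasks-per-socket={ntasks_per_socket}")
    if partition is not None:
        lines.append(f"#SBATCH -p {partition}")
    if time_limit is not None:
        lines.append(f"#SBATCH -t {time_limit}")
    lines.append("")
    # srun launches the tasks inside the allocation.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


# Hypothetical request: 2 nodes, 8 tasks, 2 tasks per socket.
script = build_batch_script(nodes=2, ntasks=8, ntasks_per_socket=2,
                            partition="RM", time_limit="00:10:00")
print(script)
```

On two-socket nodes, 2 nodes with 2 tasks per socket leave room for exactly the 8 requested tasks. Writing the returned text to a file and passing that file to `sbatch` would be the usual way to submit it.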

37:04 So I've got a picture of it, and depending on how sophisticated the resource manager

37:13 is, it may select different sets of nodes. So you may ask for

37:24 10 or 100 nodes, and in the straightforward way the resource manager will just give you

37:37 the number of nodes you asked for, regardless of where they happen to be physically located

37:44 in the cluster. That may matter for the performance you get, because there may

37:56 be more or less of the network involved in communicating between the nodes that run the job,

38:05 and the network is in almost all cases shared between all the jobs that run

38:16 in the cluster. So that means, depending upon how much, how many

38:27 packets, how much intensity or traffic your application generates between the nodes that are

38:33 being used for the job, you can get variable performance when you run at different

38:40 times, because the traffic generated by other jobs may vary, and you may get,

38:48 you know, jobs that are nice with respect to the network at some times, and at others

38:54 not. So it's also the case, in the second bullet, that nodes are shared.

39:03 So that means in the same node there may be 100 or more cores, and there

39:15 may be different jobs using different cores. Cores may not be shared, but certainly different

39:25 sets of cores in the same node may run different jobs. So now some of the resources

39:37 in the server, that is, in a node in the cluster, are shared among the

39:49 jobs. So if you don't have exclusive access to the node, that means your

39:57 performance may depend on other jobs running on the node at the same time.
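
One way to see this kind of run-to-run variability is to time the same work several times and look at the spread rather than a single number. The following is a minimal Python sketch under stated assumptions: `workload` is a hypothetical stand-in for the code being measured, and the helper names `time_repeated` and `summarize` are made up for illustration. It uses only `time.perf_counter` and `statistics.median` from the standard library.

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Run func several times; return each run's wall-clock time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the fastest, median, and slowest run and the relative spread."""
    fastest = min(samples)
    slowest = max(samples)
    # Relative spread: how much slower the slowest run was than the fastest.
    spread = (slowest - fastest) / fastest if fastest > 0 else 0.0
    return {
        "min": fastest,
        "median": statistics.median(samples),
        "max": slowest,
        "relative_spread": spread,
    }


def workload():
    # Hypothetical stand-in for the code being measured.
    return sum(i * i for i in range(200_000))


samples = time_repeated(workload, repeats=5)
summary = summarize(samples)
print(summary)
```

A large relative spread on a shared node is a hint that other jobs, or the network, are influencing the measurement; on an exclusively allocated node the spread would be expected to shrink, though not necessarily to zero.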

40:06 That's why, for performance measurements, when you time things, you need to make

40:10 sure that you get exclusive use, because not only may you perhaps not get

40:19 great performance, or such good performance, but it's also not so clear how representative the

40:28 time you got is, because it may be to some degree significantly dependent on other

40:36 jobs that you have no control over. And I already said that not all nodes

40:45 are always configured the same. Now, in the case of Bridges-2, for the regular memory nodes

40:53 that you will use, they are the same. On Stampede2 all the nodes have the

40:56 same configuration, but in general one needs to be conscious of that.

41:06 The last bullet on the slide is a bit, it's not intended to be

41:12 the case, but it happens that even though the pieces of silicon, the packages

41:22 you buy from Intel or AMD, have the same specs and are supposed to run at

41:28 the same frequency, for instance, it may turn out that, for some

41:36 odd reason, some of them may not run at that particular clock frequency that they are

41:42 expected to run at. Another reason is that these days cooling is an issue

41:52 for chips, and all of them nowadays have firmware and temperature sensors built in, and

42:04 the firmware controls the clock speed of the CPU. So unless you have locked the clock

42:10 rate that the processor runs at, depending on the temperature or heat situation in a particular

42:19 location of the cluster where the node is, all nodes may not run at the frequency

42:26 you expect them to. And here, just very briefly;

42:38 I'm not going to go into the details at this point. I will come back to

42:41 it at some later lecture, but it's just stressing that there are, you know,

42:48 commands that allow you to decide how you want your jobs to run. And

43:01 not today, but I think in some later lectures, Josh probably will give more examples

43:05 of the commands and figuring out how to choose and control the choice of

43:13 where things run on the resources selected. There is an additional twist on these things

43:20 that we'll talk about, not today but much later, which is that the operating system has its own

43:29 ideas of where things should run, and it may change its mind during the execution

43:36 of your code. So that also keeps things from being repeatable, and sometimes one

43:46 may have a better idea than the operating system of what goes on. And for that one

43:52 may force the operating system to provide what you want by what is known as

43:57 binding allocation to the particular resources. Yes, in terms of

44:07 job submissions and the big-picture stuff: when you want to submit a job, you

44:15 have to tell the resource manager, as pointed out, you know, how many nodes

44:21 you want, and cores, and a bunch of these other things, but you also have

44:28 to tell it what you expect the run time to be. So the resource manager

44:32 can get an idea of the kind of footprint, in some sense, of the job:

44:37 how many nodes, the number of cores, and how long you are going

44:40 to use them. And you also need to tell it how much memory you need,

44:45 so it can do some decent job of figuring out how to allocate the job. Now, the

44:55 policy being implemented by most sites is that the resource manager basically terminates jobs when the

45:06 requested time is up, and if your job is still running, of course, that's

45:12 not great, in particular if you don't do some form of checkpointing

45:16 in the job, because that means the whole thing may be wasted. So of course

45:22 one way to avoid that is to make sure that you have a good margin,

45:29 so there is no way in your mind that you will run out of time

45:36 and the job will be terminated at that time. Now of course that means

45:44 then the job manager needs to make sure that there is enough of a window of

45:49 time to run your job, and it may take a while to allocate that larger window of

45:57 time. So it may have the effect that your job sits in a queue for

46:03 a long time. Some other sites may let you overrun for a little

46:09 bit before killing your job. It's always a good practice to try

46:16 figuring out how long your job is going to run, in particular before you do some

46:25 form of what's typically called production runs, so you get through it reasonably.
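
The trade-off described here, asking for enough time to avoid termination without requesting a needlessly large window, can be sketched as a small calculation. This Python sketch is illustrative only: the helper names `slurm_time_limit` and `node_hours` and the default safety factor of 1.5 are assumptions, not a recommendation from the lecture. The output uses the `hours:minutes:seconds` and `days-hours:minutes:seconds` forms that Slurm accepts for a time limit.

```python
import math


def slurm_time_limit(measured_seconds, safety_factor=1.5):
    """Turn a measured run time into a time-limit string with some margin."""
    total = math.ceil(measured_seconds * safety_factor)
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    if days:
        return f"{days}-{hours:02d}:{minutes:02d}:{seconds:02d}"
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def node_hours(nodes, time_limit_seconds):
    """Rough footprint of a request: number of nodes times requested hours."""
    return nodes * time_limit_seconds / 3600


# A test run took 20 minutes; request 1.5 times that.
print(slurm_time_limit(1200))            # 00:30:00
# A long run with no extra margin, to show the days form.
print(slurm_time_limit(100000, 1.0))     # 1-03:46:40
# Footprint of 4 nodes for 30 minutes.
print(node_hours(4, 1800))               # 2.0
```

The larger the requested footprint, the longer the scheduler may need to find a window for it, which is why timing a short test run first tends to pay off before production runs.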

46:34 And so those are the two things. And I think there's a bunch of

46:42 these commands, and I think you may want to use the scancel

46:49 command, the sbatch command; another command probably not, but in other

46:57 you may be interested in more ask too much. How much are

47:02 to gravitation? I get these Um So for me it's useful and

47:09 there are other things, for doing control of jobs or data broadcast. The S

47:19 is simply for Slurm, so everything tends to be prefixed by the S. And there are the

47:25 commands you will be interested in, for sure, and sinfo. I

47:32 think Josh will actually demo them, and this is just text to support it if you want to

47:36 go back. But again, the best way is to go to the Slurm documentation

47:43 and find out the precise meaning of the commands. And I think

47:54 Josh will actually demo these things and will show it. So there is a command that

47:58 tells you processor clock frequency and a bunch of different things. And

48:07 there are now two screenshots; I will be very quick with those, and Josh

48:14 will go through them again in the form of a demo. So a little bit

48:21 of the aspects: if you look at the screenshot, this is on the partition that says

48:27 RM, and again that is the regular memory partition on the Bridges-2 cluster. It says

48:33 whether the partition is up, and if there is any particular time limit, at the time this

48:41 was done, and then one gets a little bit of a listing of

48:49 the status of the different nodes and whether they are allocated. So in this case,

48:54 it may all seem confusing that something is up and something is down,

49:00 but it's the partition that is up. But the particular node, if you

49:05 look at the first one here, like 3 23, happens not to respond, and there

49:13 are other examples here, and I'm sure Josh then will talk more to this

49:22 slide. And this happens to be, I guess, the extreme memory partition here.

49:29 And also, in other ways, you can get the status of jobs, so whether they are

49:36 running. In this case none of them is, and they are all sort of pending or waiting

49:39 in a queue, and it also tells you what the potential priority is for the

49:49 particular jobs in this case. And what I want to point out is that

50:05 you get the job numbers, and those are obviously good in terms of, you know, in case

50:08 you want to, you can get information or kill the job. And

50:20 srun is simply to submit and run the job. Again, you will see a demo, I think, of

50:28 doing interactive jobs as well as doing batch submissions. And this is a case

50:37 of submitting a job, and the status of the job is being shown.

50:47 In this case, on the first line you see the srun command, and then the number

50:52 of cores that were requested for this particular job, and the partition the

50:58 job is supposed to run in, and then the pointer to the code. You will

51:06 see a demo. I want to make sure Josh has enough time to do the demo.

51:12 So that's why I also go through things quickly. And in this case

51:19 I will probably skip this. Let

51:26 me see, I'll let Josh basically talk to these slides more, and

51:33 this is the sbatch version here, with the number of nodes and such. So

51:38 again, you can go back to the slides, but it's recorded also in the demo, so

51:42 you should have more than one way of reminding yourself of these commands, but

51:48 go to the Slurm documentation for getting detailed documentation of how Slurm works. I'll stop

52:01 here. Let's see if there's anything else I want to cover before I let you go.

52:08 I can go back to them if you want to at some

52:13 stage, but why don't I, I think we can start with the demo.

52:18 Yeah, I think that's the right thing to do, so you have enough

52:23 time. Okay, if I can get my thing. Yes, yes, I

52:32 need to stop sharing mine first. Fine, I don't think so. Let

52:37 me get over. Okay, all right, okay, hopefully everyone can see my

52:45 screen now. Again, I will go through, for example, how to get access to

52:54 the clusters. And again, we have two of the clusters that will be available to

53:00 us once you get the allocation: one is the Bridges-2

53:05 cluster and the other is the Stampede2 cluster, and you'll need an SSH client to

53:14 get to those. On Windows, you can use the simple tool

53:20 PuTTY. On Mac, I believe everyone has access to an SSH client in their terminal.

53:26 They can use that. For getting onto Bridges-2, I know it's

53:31 a little bit smaller here, so the URL that you need to follow is

53:36 your username at bridges2 dot dot edu, and that should get

53:44 you into one of the login nodes on Bridges-2, and on Stampede2,

53:49 again, username at stampede2 dot dot utexas dot edu. So that's

53:59 what you need to get access to the clusters. I will give the demo on

54:04 the Bridges-2 cluster, because I still have one project running and I can show you

54:08 some of the things, but when you that it will simply Okay, start

54:18 representative. Yes. So when you log in, just put in your password and that will

54:24 get you into the login node, and the way you can make sure that you are

54:29 on a login node is by just looking at the console. So you usually see a

54:34 login message here, and then on this node you can run pretty much all

54:39 the Linux commands that you know, like lscpu, which would actually give you information

54:45 about the processor that's available on the nodes and many other details, like

54:51 how much the L1 cache is, and things like that. We can also run

54:57 a few other commands that give you information about how much memory is available

55:02 on this login node: what the total memory is, how much is free, how much

55:08 is available, and so on. Please, can you go back to the previous one, the

55:17 lscpu? Yes, I think again it uses the notion of socket.

55:24 So that's an important thing for the course. Again, there are things there that

55:31 you may not be familiar with. You don't need to know them at the

55:34 beginning of the course, but it's important. And in this case it also

55:41 tells you the threads per core, and that is an important parameter too that we'll talk

55:51 about later. So I'm just trying to point it out, since you have it here: this notion of

55:55 cores. In this case, both the typical AMD and Intel processor

56:06 cores, to be specific, are capable of managing two threads concurrently

56:17 per core. But it's a system configuration, so it's not always the case that

56:24 the system admins have configured the system to use two threads per core. So sometimes that's

56:32 something you want to check, because it has an impact on performance. What else?

56:38 I don't see anything else that I want to point out in terms

56:45 of that. Okay, from this,

56:55 it has 64 cores. Right. And, towards the lower

57:07 half, it also tells you about cache sizes, and for understanding performance

57:15 that's an important part. So sometimes during the term, when you're asked to try

57:24 to understand how codes run, you may want to remind yourself what the cache

57:30 hierarchy looks like on the processors. Okay. I think that might

57:39 be it. And also you can see there how much

57:44 main memory there is on the node; in this case, 256

57:48 gigabytes, roughly. Not that they report it that way. So that's why you

57:57 end up with this kind of strange number, 263. That's because the 263

58:03 is base 10 and the 256 is based on 1024. Anything else

58:16 you think, for the future, is a good thing to take from these? I

58:23 think for now it's only important to understand the memory and then how the processor is

58:29 organized on this chip. So that's one important factor to keep in mind, and

58:35 I'll get back to it when I get to the interactive commands. And one more

58:40 thing, I guess, that I just noticed on this thing, that may be confusing at first: it

58:48 says 64 cores per socket and it has two sockets per node. So that's two

59:00 times 64, that's 128. But on the top it says 256, and in

59:07 that case it's because it has enabled two threads per core. So that's the

59:15 factor of two. So to try to understand these numbers, one needs to

59:20 think about cores and threads and sockets: there are 128 physical cores and then

59:35 two threads per core, and that's why it counts 256,

59:39 double the 128. Again, useful information to understand things in the

59:47 future. So I'm just pointing out that this command gives you a lot of information.
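
The arithmetic behind those lscpu numbers can be written out as a small Python sketch. The values are the ones read off the screen in the demo (2 sockets, 64 cores per socket, 2 threads per core); the function name `logical_cpus` is a hypothetical helper, not something lscpu provides.

```python
def logical_cpus(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """CPU count as the operating system reports it: sockets x cores x threads."""
    return sockets * cores_per_socket * threads_per_core


# Values read from the lscpu output in the demo.
physical_cores = logical_cpus(sockets=2, cores_per_socket=64, threads_per_core=1)
hardware_threads = logical_cpus(sockets=2, cores_per_socket=64, threads_per_core=2)

print(physical_cores)    # 128 physical cores per node
print(hardware_threads)  # 256, the CPU count shown at the top of lscpu
```

If the administrators had configured one thread per core, the same node would report 128 CPUs, which is why the threads-per-core line is worth checking.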

59:52 Okay, go ahead. Now, again,

60:00 there's one more command that you can use, which is available on every Linux system,

60:05 with which you can get information about the operating system, the version that has been

60:11 installed, and so on. Now notice that until now we

60:19 were on the login node. To run our programs, we need to get

60:23 to a compute node somehow. Now there's this,

60:31 I think it was a chat question from someone: is it the same password to log

60:41 into the server? I'm not quite sure whether it's the one that we use for the portal.

60:49 That's a good question. As far as it's related to the Pittsburgh portal, I

60:56 think it's true. But, so, I think in terms of

61:04 the external sites, TACC and Pittsburgh, you should answer, to be precise.

61:13 For Bridges-2, I believe it's the same password as you use on the

61:17 portal, but it's slightly different for Stampede2, with a different way of doing the SSH. I

61:26 think there are two passwords for Stampede2, and you can have both passwords be

61:31 the same thing, but I don't recall the steps exactly. I'll let you know when I find out.

61:37 Yeah, right, like you're saying, basically for Bridges-2 there's one place that is managed

61:46 by the site. With Stampede2 it's in two places, but there's nothing that prevents

61:51 you from using the same password. I believe that's okay. Yeah.

61:57 And one thing I forgot to mention, since we're on the Bridges-2 cluster

62:02 right now: when you use Stampede2 you need to set up multi-factor

62:08 authentication. That's a requirement when you log in on Stampede2.

62:14 Apart from your password, like it asked for here, it will also ask you for a

62:19 token code, and it's going to require you to download an app

62:25 that gives you that token code every time you log in. So there's

62:30 two-factor authentication to log in to Stampede2. And they have a

62:37 guide on their website, so if you follow the steps there it's not difficult to get

62:42 access. That's one more piece of information. Okay. So as I was

62:50 saying, there are three ways. Okay, a question: "Can you run only one

62:59 SSH client for a job?" SSH for a job, I don't know what

63:06 you mean. But yes, you can have multiple SSH clients open for any of

63:11 the sites; it's not necessary that you have only one client, and you can have multiple

63:17 SSH connections to the same place. I hope that answers it. Yes.

63:28 Right. So the next thing I was going to get to is how

63:32 to run your job. As I said, there are three ways you can

63:36 do it. The first is simply using the srun command. I have an executable for a

63:44 simple hello world program here. When you do srun, you can just

63:51 give, for this case, the flag that provides the number of tasks, or the number of times,

63:58 so to say, for your program to be executed. So this is srun and then simply

64:04 the executable file name. Now there are other options with srun as well.

64:10 You can just do srun --help to get the details of those; there

64:14 are so many that we can't get into them now. It should be simple.
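
A minimal sketch of the srun invocation just described, written as a Python helper that builds the command line without running it. It assumes the task-count flag is `-n` (the short form of srun's `--ntasks`); the executable name `./hello_world` and the helper name `srun_command` are placeholders.

```python
import shlex


def srun_command(executable: str, ntasks: int) -> list[str]:
    """Build an srun command that launches `executable` as `ntasks` tasks."""
    if ntasks < 1:
        raise ValueError("ntasks must be at least 1")
    # -n is assumed here to be the flag shown in the demo (srun's --ntasks).
    return ["srun", "-n", str(ntasks), executable]


print(shlex.join(srun_command("./hello_world", 4)))
# srun -n 4 ./hello_world
```

On a cluster the same list could be handed to `subprocess.run`, but typing the printed line at the login-node prompt is the equivalent of what the demo does.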

64:18 So with srun, when you run your program, it will give you a

64:24 message, something like this, that your job is now queued and waiting for

64:29 resources, and after a while it will get access to the resources and then run your program

64:33 interactively. So the output of your program, you will be seeing it

64:38 on the console. And this is good when you know that

64:43 "my program will be done in a few seconds; I just need to check whether

64:47 it works or not." What srun does is give

64:52 your program access to the resources that you request based on these flags

64:56 and then run your program on those resources. But as you will

65:01 see, if you are running multiple tests, you don't want to sit there and

65:06 wait on the console; you don't want to wait at that point, because

65:11 you want to do other things as well. So srun is not the best

65:15 way to go for that. The second way you can do it is

65:21 using an interactive shell. What that does is give you access

65:28 to a compute node for a period of time that you specify using

65:33 flags for time as well. And you can do that by using the

65:37 command interact. Here you can provide it with the number of nodes

65:44 you want, with capital N; I'm requesting just one node at a time. And

65:50 then you also provide the number of tasks. The short flag and the long form of that flag are

65:56 both equivalent. There's a question about the host name for Bridges-2.

66:03 No, it's not UH. All right, it was bridges...

66:11 bridges2.psc.edu, so that's Pittsburgh Supercomputing

66:18 Center. Yeah, it's bridges2 at PSC, so

66:22 bridges2.psc.edu here. Okay. Yeah. So you give a number of nodes

66:30 and then you also provide the number of tasks. On these compute nodes you

66:35 have 128 physical cores, so you need to provide a number that is either less

66:41 than or equal to the number of cores. If you go above that, it

66:46 will give you an error, because that's an incorrect configuration that you're

66:50 asking for. When you do -n 128, it is going to give

66:58 you access to one full compute node. Let me switch to another window, where

67:07 I did this already. So this is the command here. Now the difference is, we

67:13 were initially on the login node, where we requested access to the resources. Once you

67:19 get the allocation, it will tell you which node is ready for your

67:23 job, and then it will give you access to the console on that particular

67:29 node. So now see the difference: before, we were on a login node. Now

67:33 you can see you're on a compute node here. On a compute node

67:37 you can again run all the commands interactively. This is useful

67:43 when you know that you have several tests to run and you don't want

67:46 to do srun every time; you just get access to one compute node for a

67:50 while, do all your tests, and are done with it. In that case

67:54 interact is very useful. So when you're performing multiple tests, you

68:00 can just get access to a compute node interactively and do your tests. But

68:06 again, the problem there is that you still wait on the console to get access

68:11 to the resources, and then again you still need to wait on the console

68:15 to get the output of your program. So the third way of running your job is

68:21 submitting a batch job to the queue on Slurm. That you can do by writing

68:28 a batch script, and it doesn't necessarily have to be named "batch"; it doesn't matter how

68:33 it's named. For the sake of simplicity, a very minimal batch file would look something

68:39 like this, where you provide all the parameters for your job: the number

68:47 of nodes (one node), the name of the partition, the time that

68:54 you think your job will take, the number of tasks, and then the command that will run

68:59 your program. What that's going to do, when you submit

69:05 that job using the sbatch command, is simply queue that job and tell

69:11 you that your job is submitted. So it's in the queue, and then you can

69:15 check the status of your job by using the squeue command. Here you

69:23 pass your username to filter out only your jobs, because if you do a simple squeue,

69:28 there's going to be a long queue of jobs, because there are so

69:32 many jobs running on the cluster. By using squeue with your username

69:36 you can see all your jobs that are currently running or pending.

69:42 And what that's going to do is, once the job finishes, and when

69:49 it is finished, the output of the job will come out into a file

69:55 which will by default be named after the job number. The job that I

70:03 submitted has this ID, so that is the name of the output file of that job.

70:09 And then when you try to read the file, you will

70:16 see the output of your program. And the benefit of

70:22 sbatch is that you can submit as many batch jobs as you want and then

70:26 forget about them for a while and do whatever you want to do on the

70:30 console, running tests or anything, then come back and check whether they have

70:35 finished, and then you'll have the output in these output files.
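
The minimal batch file described above can be sketched as follows. This Python helper only generates the script text; the `#SBATCH` directives use Slurm's short options for nodes (`-N`), partition (`-p`), time limit (`-t`) and tasks (`-n`). The partition name `RM`, the five-minute limit, the command `srun ./hello_world`, and the job number 12345 are illustrative assumptions, not values taken from the demo. The output file name follows Slurm's default pattern of `slurm-<jobid>.out`.

```python
def batch_script(partition: str, time_limit: str, ntasks: int,
                 command: str, nodes: int = 1) -> str:
    """Return the text of a minimal Slurm batch script."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH -N {nodes}",        # number of nodes
        f"#SBATCH -p {partition}",    # partition (queue) name
        f"#SBATCH -t {time_limit}",   # time limit, HH:MM:SS
        f"#SBATCH -n {ntasks}",       # number of tasks
        "",
        command,                      # the command that runs your program
    ]
    return "\n".join(lines) + "\n"


def default_output_file(job_id: int) -> str:
    """Name of the file where Slurm writes a batch job's output by default."""
    return f"slurm-{job_id}.out"


# Placeholder values; substitute the partition and limits for your allocation.
print(batch_script("RM", "00:05:00", 4, "srun ./hello_world"))
print(default_output_file(12345))
```

Saved to a file (the name is arbitrary), the script would be submitted with `sbatch <file>` and monitored with `squeue -u <username>`, matching the steps in the demo.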

70:42 So those are the three ways you'll mainly be running your jobs for most of

70:47 the assignments, and you'll probably be using all three of them. So, any questions?

70:55 About running the job when you're on a compute node: you can

71:05 do both on a compute node; you can run both with srun and without

71:09 srun. The difference will be, if you just simply do srun ./hello_world,

71:15 it's going to take the default value of the number of tasks and will run

71:19 your program that many times. If you want srun to run it a smaller number of

71:24 times, say you had access to 128 but you now want to run

71:27 for 64, you can pass the -n flag with 64 and

71:33 your program will run for those 64. Yeah. And the other thing is,

71:42 the srun command, again, you can run from both login and compute nodes. The

71:49 interact command gives you access to the compute node interactively, so you can only

71:54 run it from the login nodes. And the sbatch command, you can run it from

71:59 either the login node or the compute node as well. It's best to run

72:04 it from the login node, because you don't want to hold up one compute node while

72:07 you're submitting batch jobs. So if you know that you have a bunch of jobs

72:12 that you need to submit, it's best to be on the login node and then

72:16 submit all your jobs using sbatch. Yeah. I just want to

72:22 stress what you just said. You can run it from the login node, but do

72:27 not run it on the login node. Yes, yes, yes.

72:33 Don't run your job on the login node. People are trying to get access to

72:37 compute nodes via the login node, and you're going to mess up everything else on the

72:42 cluster. It's a shared environment, so be careful of that. Always run your jobs

72:46 on the compute nodes, one way or another, using either srun, interact, or sbatch.

72:52 Okay, so those are the three ways you can run your jobs.

72:59 A little bit more information on Slurm: one is the sinfo command,

73:05 which you can use to get details about the different partitions of the nodes that are available

73:10 on the cluster. Again, in the columns here, you can see what the partition

73:16 name is, whether it's up or not, and what nodes are in that partition.

73:22 There is a question: "Are you okay if you use srun or sbatch?"

73:26 I'm not sure what you mean by that. If you can unmute yourself,

73:34 that would be good, or you can type it out. What do you

73:38 mean? Okay, I figured out what the question was. The question is basically

73:48 that it wouldn't be run on the login node if I use these commands,

73:54 a question of where your program is run, right? Yeah, yeah.

73:59 So if you do srun or sbatch from the login node, that's

74:02 going to ask for some resources, which will be the compute nodes, and then run your

74:06 jobs there. So in that sense you will not make the login node busy

74:13 in any way. Yeah. But on the login node,

74:20 never do dot slash and then your program name, because that's going to run your program on

74:24 the login node. Whenever you want to run your program, use srun or sbatch,

74:29 one of those. Yeah, the other command that might be useful is

74:37 the one that I just showed you, squeue. I will just use it without

74:41 the filter, but I won't go all the way through the list. This way

74:45 you can see all the jobs that are running on the cluster currently, and

74:51 again you can filter that by passing your username. Yeah. So I have

74:57 one interact job that I'm on, on the compute node right now, so that is

75:01 the job that it's showing right now. What else? Yes.

75:07 Another command that will be useful for you, and this is very important, is the scancel

75:12 command, because many times it will happen that you will see your job either

75:18 pending for a long, long time, or you may have made a mistake and

75:22 your program is stuck at some point and is just not finishing up. You

75:26 see that your program says it's running, but it's been like 10 hours now and it's

75:31 still running, so there's probably something wrong going on there. So using this scancel

75:36 command and providing the job ID, you can cancel your job,

75:42 and hopefully that will cancel my interact job. There you go. So, just

75:49 a comment on that; this is part of the things I will stress throughout this course. You

75:56 need to set yourself some expectations for how long things should take, and then

76:04 if it's way off, kill the job and go back and think about what might be

76:09 wrong. So it's very important to have some way of estimating how long things should take

76:17 on the kind of nodes, or set of nodes, that you're requesting. And of course it

76:24 is also wasteful otherwise. You know, we have a limited amount of time that we

76:29 get on it. These clusters are expensive, so

76:35 trying to be careful and not wasting resources is a good thing. We don't

76:41 pay for it explicitly, but that doesn't mean we should be wasteful.

76:48 And that's more important when you will be submitting your batch jobs, where

76:53 you provide the time for which you will be using these resources in the batch script. And

76:58 that reminds me that for your batch job, whether it has finished or not, Slurm will

77:07 cancel it once the time limit is reached. So it's not going to check whether you

77:11 have finished the execution of your program or not. So you need to be

77:15 careful. Always give a little bit of buffer when you submit your jobs.

77:19 But, as Dr. Johnson said, don't give too much time, because that's going to cause

77:24 two problems. One, it might take a long time to get your jobs picked

77:29 up, because Slurm will see that you've requested this thing for an hour and say,

77:34 "I don't have time to put in a job that takes an hour right now," so it

77:37 will just put everything in the queue and put it in a pending state.

77:42 So just keep that in mind when you submit those jobs.

77:45 Yeah, I guess one good aspect of how Slurm works, and I think that's true for

77:52 this program as well: the bad part of overestimating is the queue time.

77:57 But it doesn't charge us for what you request. It charges for what you

78:03 use, which is not true, say, when people work in the cloud, where

78:09 you may get charged for what you request and not what you use.
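
The time-limit behaviour described here can be captured in a toy Python model. This is a sketch of the rule as stated in the lecture (a job is cancelled at its limit, and usage rather than the request is charged), not of Slurm's actual accounting, which works in site-specific units. `COMPLETED` and `TIMEOUT` are real Slurm job states; the function and class names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobOutcome:
    state: str               # "COMPLETED" or "TIMEOUT"
    charged_minutes: float   # time actually held on the resources


def simulate_job(requested_minutes: float, runtime_minutes: float) -> JobOutcome:
    """Toy model: the job ends when the program ends or the limit is hit."""
    if runtime_minutes <= requested_minutes:
        # Program finished within the limit: only the time used is charged.
        return JobOutcome("COMPLETED", runtime_minutes)
    # Program needed more than requested: cancelled when the limit is reached.
    return JobOutcome("TIMEOUT", requested_minutes)


print(simulate_job(requested_minutes=60, runtime_minutes=20))
print(simulate_job(requested_minutes=60, runtime_minutes=90))
```

The model leaves out the other cost of over-requesting mentioned above: a longer requested time can mean a longer wait in the pending state before the job starts.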

78:15 So at least it doesn't charge us for the full thing. It just charges

78:18 for as much as you use. Okay. And yes, one more

78:29 piece of information that you will be needing is the package manager that's available on these

78:37 clusters. I think the command is the same on both Stampede2 and Bridges-2.

78:44 And before I get started with that: on Bridges-2, you have the interact command to

78:49 get interactive access to some nodes. On Stampede2 it's a little bit different:

78:55 the command is called idev. The parameters are exactly the same. It's just two different

79:00 names for the same command on those clusters. So keep that in mind: when you use

79:04 Stampede2, you need to use idev rather than interact.

79:10 Okay. The so-called package manager: this is the command called module, and

79:17 if you use module avail, it's going to show you all the modules that are available,

79:23 so all the packages that are available on these clusters. You can see we

79:28 have the CUDA packages, MPI, and others; this one is a compiler that

79:36 you would need to load to compile your code. There are several other

79:41 packages that we will go through when we use them. But this is the

79:45 way you can check what packages are available there. You need

79:53 to use the command to make sure you get the packages loaded that

79:58 you need. Yeah, I'm getting to that. And so, with

80:04 module avail, you see all the packages that are available. Someone is saying they have

80:11 problems logging in; it says access denied. I'm not sure what's going on, but

80:16 we can take that discussion offline. Class has almost ended, so just a few more things

80:22 and then that's it. So, module list: it tells you the

80:30 packages that you have loaded currently. As you see, I haven't really loaded any

80:35 packages right now; there's just one of the default packages that Bridges-2 loads for you.

80:41 But if you, let's say, want to load the GCC compiler that was over here,

80:49 you can use the command module load. It's always a good idea to also give

80:53 the version number of the package that you want to use, because in many

80:57 cases there are multiple versions of the same package. For example, there are

81:02 here at least two or three different versions of the package. And when you do module

81:07 load and give the package name and version, and then again run the command module

81:13 list, it will show you that yes, you have loaded that package, and now in your

81:17 session you can use that package for whatever purposes.
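
To make the advice about version numbers concrete, here is a small Python sketch that builds a `module load` line and refuses to do so without a version. The `name/version` form is the usual syntax of the module command; the version `10.2.0` is a made-up placeholder, so check `module avail` for what is actually installed. The helper name is hypothetical.

```python
def module_load_command(package: str, version: str) -> str:
    """Build a `module load` line that pins an explicit package version."""
    if not version:
        raise ValueError("give an explicit version; several may be installed")
    return f"module load {package}/{version}"


# "10.2.0" is a placeholder version, not one confirmed for either cluster.
print(module_load_command("gcc", "10.2.0"))
# module load gcc/10.2.0
```

The printed line is meant to be typed in the shell on the cluster, followed by `module list` to confirm that the package was loaded.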

81:27 And yes, I think that's pretty much it to get you started with the basics. Any questions on that? And

81:35 Dr. Johnson, in case I missed anything... "Does the job only cancel after the time limit is

81:41 reached?" Oh no, it's kind of the opposite. So the question is,

81:46 will the job be cancelled only after the time limit is reached, even if it has actually finished?

81:52 The job will finish when the execution of the program is finished, unless,

81:59 say you give it five minutes in your script, it tries to run

82:04 beyond the time limit. Okay? So let's say you give five minutes and

82:08 your program finishes in two minutes. Then the batch job will end right then, and

82:14 that means you're not going to be charged for the extra three minutes that you said you

82:17 might be using. But yes, the other thing is, if you give less time

82:22 than it would take for your program to finish, then Slurm will automatically cancel your job, and

82:27 whatever the state of your program was at that moment, that's the output

82:32 you will be getting in the output file for this job. I don't think

82:44 a VPN is required. You can just use any simple SSH client

82:49 to get to the clusters. There's no reason for that. So yes, in

83:08 principle time is up. I don't know if you have any comments on timing that

83:14 you want to bring up. We could do that and take it next time.

83:19 Yes, we can do that next time. Okay. It's probably the best thing

83:23 since time is up. Any other questions? "So I guess Tuesday next week we

83:35 do it face to face again, or what?" And that's a question regarding the

83:51 course, not with regard to today's content. Yeah,

84:07 correct. So at least, not having classes face to face is just for the

84:16 first two weeks at the moment, according to the university. So that means it will be

84:22 face to face on Tuesday next week. If there is no change in university policy,

84:30 then there will be face-to-face classes after next week, and for every class I'll

84:41 try to make sure that there's also online access and that things will be recorded. So

84:48 you will have an option of how you can take the class. And you in

85:03 principle are allowed to log in to the clusters. There are two steps. One is that you need

85:10 to have an account. So you get an account set up, and for that

85:16 you don't need an allocation. I think, if I remember correctly, that's true

85:23 for both of the clusters we will be using. Now, once you have an account,

85:29 it doesn't mean you can use the cluster. In order to use the

85:35 cluster you would need to have resources allocated to you. It so happens that the

85:47 account from the class last fall is still open, but it will close soon,

85:53 so that will go away. So you might be able to log

86:00 in and use that allocation, but I need to tie you to that allocation. I'm

86:08 kind of the allocation manager on both sites, so I need to enable the use

86:13 of that allocation for the respective users. That's why I, or Josh, need

86:20 to have your user ID on the sites, so we can link your user ID

86:26 to the resources we get for the class. So once you have an account,

86:33 once the allocation is approved, and once, as the last step, you have been linked to the

86:39 resources, then you can use them. So if you have an account, then...

86:49 there is no class resource at all at the moment on Stampede2. There

86:54 is an old class allocation, so if you send me

86:59 your user ID I can link you to the old account and you can try it out,

87:05 but it will end on the 31st of the month. So until then you can use

87:12 the old allocation; there are still cycles left in it. So that will

87:16 work, but that's why we need a new allocation to sustain it for the term.

87:28 Yeah, so as soon as we get the allocation we will notify you; that's

87:35 it. But if you already have an account then it just takes seconds for

87:39 us to link your account to the allocation, and then you will be notified that now you

87:44 can run on these clusters. So that's why we encourage you to go and create

87:51 your own account and tell us what the account ID is. So whenever the allocation

87:57 is approved we can link you. Let's say some of you have already sent it; I still haven't

88:13 checked, but I will check them and then I will link you so

88:18 you can try things out. And we'll link you again when the new allocation is approved.

88:29 One more thing, not regarding the allocation: everyone should have received a link for

88:35 the forum, and that's something we will be using for discussion, so make sure you get

88:41 into that. It's just a discussion forum where you can have questions

88:48 available to everyone, so it's easy to have discussions on that.

88:53 You should have got a link for joining the class.

89:11 Any other questions? Okay, thank you so much. And I will

89:23 be in the classroom. So I guess we'll see. Yes, we'll

89:28 be there on Tuesday next week. Okay, thank you so much.

89:35
