© Distribution of this video is restricted by its owner
00:00 Okay, so today we'll talk about one more toolkit, and this is TAU, the Tuning and Analysis Utilities. But I guess before I do that, let me say, in terms of assignment one, which was just returned: I think everybody did quite well.

00:53 As for general comments, where there were deductions or points taken off, it was mostly this: try to make sure that when you look at performance, in some sense, it is relative to what the platform can do. Whether a time is good or bad, just looking at the times by themselves doesn't really tell you whether it's a good use of the platform or not. That's a theme that will come back in every assignment: you're supposed to try to reason about how well the code actually did relative to what it could have been doing.

01:39 For that purpose, PAPI is one tool, and today TAU is another. TAU is really a tool set, as its name says. I'll briefly go over some of its capabilities; it's a very rich tool, and then Joshua will demo some of the simpler usage that should be sufficient for most of the assignments in the course. So I'll talk a little bit about the capabilities, a little bit about how the toolkit is actually put together, its architecture, and then talk about how one can use it to instrument the code in its many ways. Then I'll talk a little bit about how it does the measurements and what types of measurements can be collected. Specifically, I'll talk about profiling and tracing: profiling is what you will be using, and you might be able to use tracing too, which is a little bit more extensive. Then finally I'll talk a little bit about the analysis part, trying to make sense of the data collected through instrumentation.

03:09 You have seen this pyramid before; we are focused on the memory system, and I put in the little graph, shown in lecture three, of the evolution of compute capabilities versus main memory capabilities. There is a lot of divergence, so a lot of the focus, in order to get performance, is on how well the memory hierarchy is used. PAPI, which we talked about last time, is focused on the processor level, so that is registers, caches, and local memory. It has also gained increased capabilities over the years, covering network interfaces and more, though there are parts we will not talk about in this course. PAPI targets specific parts of this sort of system and memory hierarchy. TAU, on the other hand, is a comprehensive tool set, and in fact it becomes comprehensive by basically acting as an interface to many other tools, including PAPI.

04:46 Here is a little bit of what makes it a comprehensive tool. You can do all kinds of tests or measurements; it can be used for parallel and sequential programs, and currently in this course it's only used for sequential code to keep things simple at first, but it's perfectly viable both for shared memory programs as well as for clusters. It has a pretty sophisticated system for managing the output and the results of measurements, as well as for analyzing the results, and for many of the analysis parts it actually uses other public open-source tools that are out there. As it says at the bottom, it is developed by the University of Oregon; they have a performance research lab there, and on the TAU website you can find lots of information, including presentations and reports, as well as downloads if you want to use it.

05:58 Coming back to what I have said a few times: try to understand the performance. For this course, that is not just the time you ended up with for running a piece of code; you are actually trying to understand the resource utilization, and TAU can be used not only to figure out how well the code is doing, but also for focusing on where to improve it. So the typical first thing you do is figure out the time, and then, in order to understand whether the time is good or not, you really need to have an expectation of what the platform is capable of. Then one can look at instruction counts and cache misses and the other things that PAPI does, which I talked about last lecture. One can also look at I/O; that tends to be disk. We don't worry too much about disk in this course; it's something that certainly can be measured, and it's essential for many codes, but we don't deal with it in this class and focus more on the non-I/O part of the execution.

07:26 So these are sort of the standard questions one should ask oneself when it comes to understanding code performance, and it's typically three steps. First, there's the instrumentation: making sure that the kind of data you are interested in, in order to judge the code, is being collected from the execution of the code. There are many different ways of instrumenting the code that TAU can help with, and I'll talk about that a little bit. There is source code instrumentation, macros, wrappers for library routines, and even, as a kind of more sophisticated way I would say, dynamic instrumentation by rewriting the binaries. We will not do that in this course; we will stay with instrumentation at the source code level. Then there is the actual measurement, basically running the code for a number of cases that are of interest. For that, one can do profiling and tracing, which I will talk about. A profile gives you a kind of summary information, but it tells you, depending upon how you do the measurement, perhaps where to focus. Tracing also has sequence information, in terms of call sequences: what happens when. The different ways of doing this are direct instrumentation, which inserts probes focused on particular events, and indirect instrumentation, which does not put particular probes in the code but uses counters of various flavors to infer where things may be of interest.

09:25 The data can be collected in many different ways: per process, or for basic blocks in the code, and that can produce the amounts of data I also mentioned when I talked about PAPI. If one is not careful, one can end up with lots of data and then struggle to make sense of it. So it definitely benefits from having some form of analysis tools, and visualization, I will say, is often very useful, because with a good visualization one can very quickly spot where there may be anomalies or outliers, which is much harder to do from a table. That's why the assignments also ask you to graph things, in addition to providing a table that gives the more exact information, so that you can spot things in the graphs. And, of course, for large data sets one may need more sophisticated tools like data mining or statistical analysis. I think it was also mentioned on a previous slide: there is this notion of exclusive and inclusive measurements. This slide tries to explain what those two terms mean: exclusive focuses on particular parts, whereas the inclusive measurements cover a kind of whole, without singling out particular sections. And it really illustrates, with an example, how that might make a difference and what information you get.

11:02 So this is just to illustrate these three phases of trying to do performance debugging or assessment of your code: the instrumentation part, the measurement part, and the analysis part. Under each column, as you see on the slide, it lists the ways that were already listed on the previous slide: source code instrumentation, using libraries or linking in the wrapper libraries, or static or dynamic instrumentation of your code. You can also work with executables and do this rewriting of the binary at run time, and we'll talk more about these.

11:55 This is just showing a little bit about the various phases. When it comes to measurement, it can be event based, in the sense that you can do it on the whole program or on a routine basis. I'll also present, when it comes to parallel programs, how it can be combined with the communication calls, and how you can deal with heterogeneous-type architectures where you have accelerators like GPUs or FPGAs or other attached devices. I'll talk about the profiling options in more detail, to make sense of the flat and call path profiles, the names that it shows on the slide, as well as about the tracing. I'll talk a little bit about the trace formats; a number of the trace-analysis tools are other software packages that TAU makes use of for doing some of the analysis.

12:57 TAU, as I said, is a comprehensive toolset, so this is kind of a busy diagram that tries to put everything together. On the left-hand side are some of the basic things that we talked about today: the top left is most of the instrumentation and measurement, and at the bottom are some of the simpler analysis parts. In the center, and you can see it a little bit, is something I would say actually should have been in the left-hand part, but I just followed the people that put the diagram together: it has to do with the instrumentation, the PDT, which stands for the Program Database Toolkit. Under it, in the middle section, are the analysis tools: there is the database part, PerfDMF, and then there is ParaProf, which supports the profile browsing. We will not talk about all of it, neither today nor later; we are trying to just give an introduction to the tool, as opposed to making you expert users.

14:20 Here the slide is pulled out a little bit, so you may be able to see what it says in terms of the instrumentation: the source code instrumentation and then the library wrappers, and I'll talk about those on the next several slides. Then there is the measurement part that allows you to define events: like in PAPI, you can use preset events, or you can define your own events, at various resolutions. Then there is the profiling and tracing that I will talk about, and at the bottom of the slide it shows the kinds of data sources for profiling and tracing, in terms of hardware counters, system counters, and various internal timers. And this is, again, some of the analysis tools.

15:16 So, as I said, I will talk about the code instrumentation. Starting out, here are the different ways of doing the instrumentation. The first way to do source code instrumentation is to use the Program Database Toolkit, and I'll talk about that on the next slide. Your assignment will use compiler-generated instrumentation, but you can also do it manually. Then I'll illustrate some of the other ways of instrumenting codes, but we will not necessarily be using them in the course.

16:06 So here is how using the Program Database Toolkit works. The principle is illustrated on this slide: it basically takes the source code of the application and does an analysis of the code. Then, based on the instrumentation that you specify, it eventually generates an instrumented code that you can recompile, now with the desired instrumentation inserted into the object code by the compilers. Here a little bit of the process is shown: after it parses the code, it does what typical compilers do and generates an intermediate language representation of the program, which then gets analyzed; what they call the tau_instrumentor then processes the output of the analysis together with the type of instrumentation you wanted, and generates an instrumented source code that is then compiled and can be executed.

17:31 The other approach is compiler instrumentation. To produce that, you basically put a TAU prefix before the particular compiler you are going to use: tau, underscore, and then the compiler name. That means the TAU C or FORTRAN compiler script, which you then use, will automatically instrument the code according to a predefined set that the TAU folks decided was a useful instrumentation of the codes. It's not quite as flexible and rich, but on the other hand it doesn't require you to go in and specify the particular things that you want to have or measure; and, as I said, Joshua will demo later how to use the compiler to instrument code.

18:29 And this is just showing that you have a bunch of options to control how much information you want in the output. So again, what I'm doing is just giving you a high-level overview of what the options are, as opposed to giving you a tutorial, just alerting you to the things that TAU can do; we will only use a very few options in the assignment. If you end up in an area of work where you use these tools in particular, you may find going in and looking at the TAU and PAPI documentation in detail useful.

19:11 Wrapper instrumentation is done through a few different things: you can do preprocessor-based instrumentation, you can wrap library routines that exist, or you can link in specific instrumented routines in place of the library routines. In terms of, as it says here, the preprocessing: it is fairly simple and uses a preprocessor that works on the source code and then does insertions; of course it is limited in what it can do, depending upon what the preprocessor is capable of. The wrapper libraries are perhaps particularly useful when there are calls or routines that you link in for which you don't have the source code; it's restricted, more or less just a fixed binary, and you still want to collect information. Some of that information can be collected by using wrapper routines, and since it's a little bit harder to use the wrapper routine mechanism, we're not going to give you an exercise to write your own wrapper routines. Here is just an illustration of how the wrapper approach works: you have, again, the source code, you have the program parsed, and from then on the instrumentation wraps things around to generate an instrumented code, which I think we will use.

21:01 I think Joshua will show, preliminarily, some of the instrumentation options. The next slide is about what to do for shared memory and MPI, or GPU, or OpenCL code; that won't be used in the first assignment where we are going to use TAU, but you may want to use TAU in future assignments, and this just illustrates the various options.

21:40 Here are the options for binary instrumentation that, again, we will not use, but I just wanted to highlight them. This doesn't really touch the source code at all: it works on your binary, even during run time, which is what is called DynInst, for dynamic instrumentation at run time. There are three different options there that we won't use in this course, even though code developers do use this kind of instrumentation; it's something that has been used in various projects.

22:16 So here I will stop and maybe ask for a minute if there are questions. As I said, this is a very high-level overview, and the idea is just to alert you to the capabilities of TAU more than going into details. Joshua will talk a little bit more in detail about the specific usage that is intended for the assignment.

22:48 So these are the two, I guess conceptually different, ways of instrumenting the code to generate measurements. One is to insert probes, and you can do that through using the PDT and the toolkit, or the compilers, which insert probes in the code that define the particular code segments that you want to measure. Or one can do it indirectly, for example by using the hardware performance counters, which may not directly act on specific code segments. So this is just an illustration of the indirect performance measurements; it's what PAPI does, for instance, because it uses the performance counters and counts various instructions and cache behaviors. You can then restrict that information to be used for code blocks and threads, but it is sort of indirect: it doesn't say anything for particular segments of code, except if you scope it to a thread or a single code block.

24:11 So, yes, you can define events in TAU like in PAPI. And again, there are exclusive and inclusive measurements, where inclusive is like if you do a start-timer call, then you run a routine, and then you have an end-timer call: it has everything in between, without doing it for specific statements separately. That's the illustration, and these measures tend to be monotonically increasing, like a timer that keeps incrementing, or a counter counting how many times a particular instruction is being used.

24:59 It's also, I think, a good point to say why you want to use tools like PAPI or TAU. If you try to get detailed performance or run-time information about a non-trivial piece of code by inserting timers, making some runs, and then making more timer calls to narrow things down, it becomes quite messy, error prone, and time consuming, with many different runs. By using tools like TAU and PAPI, you can collect a lot of this information in single runs, without the headache of doing it all manually.

25:49 And I think the concept is that interval events measure what happens between, basically, a start probe and an end probe, while atomic events may be triggered by particular statements or actions in the program; you can define them in various contexts, at the routine, loop, or statement level.

26:20 So, a little bit about using TAU, and then I think Joshua will come on to talk about the profiling and tracing. The typical way of doing performance debugging or optimization, I would recommend, is first to do profiling, which I will talk about next, to basically find out where, or what, the most time-consuming parts of the code are. That could be done at the block level, the statement level, or the routine level. Typically, it may be easiest to do it at the routine level first, figure out which routines are the most time consuming, and then narrow things down. And this is more or less what this slide says: first you collect data, and then you have to make sense of it.

27:22 So here is now an attempt to illustrate the difference between profiling and tracing, if that's not clear. Basically, what the profiling does is collect aggregate or summary information for the parts you decide are of interest. It has no sequence information in it, so there is no particular time dependence. Whereas in the tracing you do have the timeline, so you see when various events happen in the execution of the code. I'll come back and talk more about the profiling and the tracing in the next several slides, but this is just to introduce the two concepts: one has aggregated information on attributes attached to events, and tracing also has the timing between events.

28:31 So this is one thing that is important to keep in mind about why you may not always want to start with doing trace collection: the amount of data being generated can simply be overwhelming, and that's why profiling at various levels may be a good starting point. You can start with some limited profiling: one can look for some named events in the code, or you can do what is known as a flat profile. In itself that may not be very intuitive, but it is, again, the summary information for whatever code segments you decide are of interest. Then you can do things in more detail at a loop level, or you can do things based on the call graph and figure out how the various routines are being called and how much time is spent in those routines, following the call path.

29:57 So, the profiling part. As mentioned, this is aggregate information, and you can choose which particular aspects of the code you want to be the basis for the profiling, like loops, basic blocks, function calls, threads, whatever you want, and then the metric that gets associated with that, whether you want time or various forms of counts. Maybe it's bytes transferred: if it's a code that is likely to be limited by memory accesses, you may want to focus on the data motion, and then there are data-motion metrics like bytes. If you are compute bound, you may want to look at instruction counters for integer or floating point. Or you may start at some higher level and think about function calls or routines, just how many times each gets executed. If you have a code like what we have had so far, like matrix multiplication, there is data independence, and you can figure out the counts for function calls very easily from the code structure. But in most practical codes, the pathway through the code is data dependent. In that case, how many times functions are called depends on the data set you're running on, and you may need to run the code on many different data sets to figure out how it behaves.

31:52 So the flat profile is, again, basic summary information for the parts of the code that you point the instrumentation at, whether routines or threads or whatever it is; it's a kind of high-level summary. The call path profile, in contrast, follows the path of the calls, and I'll show some examples next.

32:26 I think this is also fun: I borrowed this slide from a course quite a few years ago, and I guess among the scientific computing folks there is still a lot of FORTRAN code out there. As you can see on this slide, what was used to instrument the code is the TAU FORTRAN compiler script: in that case, tau_f90.sh instruments the code using the wrapped FORTRAN compiler, and then it was just chosen, in this case, to run it. And remember from the Slurm lecture how to specify the number of cores or processors you want to use, etc.; that is in the job submission statement. But again, Joshua will give a concrete example later.

33:24 Here, again, is what the flat profile may look like for a particular code; in this case it was decided that time was the property of the code that one was interested in, and it is done on a per-routine basis. There's some routine, CHEPRD, and it tells us how many seconds were spent in that code, and the next one was this CCHEB one, and so on, going down the list. So there's a bunch of other function calls, and it tells, in order, how much time was spent in those calls. It doesn't tell you how many times every routine was used, whether it was used once or many times, so it doesn't give you the distribution of time over the different calls to the same routine, depending on when it was called in the execution; it's just an aggregate. So in this case, if I were trying to improve the code, it would be natural to start looking at the routine that took the most time. But it doesn't follow that, just because it took the most time, it's an inefficient piece of code; for that, one needs additional insight.

34:56 And this is now going to the loop level, doing the same thing using the TAU FORTRAN compiler. In this case the loops are shown, and they may look like this: there is a loop for multiplying the matrices that is, by far, dominating time-wise, and then there's a bunch of other loops. Clearly, as one goes down this profile that is being generated, at some point you probably don't care whether a loop is very efficient or not, because even if it's inefficient, it takes so little time that trying to optimize it is not going to change the run time much.

35:43 You can also use multiple counters, and Joshua will demo this, so we'll skip ahead. But just to show this case: it uses PAPI performance counters to collect some information about instructions and, in this case, cache misses, for the loops. And this is an instruction count, so you can see what's happening in the loops; in this case, because of the way it was set up, it is the instruction count rather than the time it took. So it's a complementary measure that you get with these multiple counters, and the next slides show some call path profiles to try to make sense of those.

36:38 So in this one, as you can look and see, the same routine appears more than once, depending on where it's called in the call path; but I'll show some simpler examples in the next few slides. Here's a very simple example to illustrate how the call path may look: in this case it shows the sequence of routines being called and how much time was spent in the various routines, so this one is more comprehensible. And this is a more serious, not to say a lot smaller, call path tree. When you view it with TAU, it doesn't just show all the potential branches; it also allows you to follow and see the time, like on the previous graph, spent in the various parts of the call tree.

37:41 The next few slides just try to illustrate the difference in the information you get between the exclusive and the inclusive profile. So you have this profile, and now this is the exclusive version, and here's an inclusive one. They obviously look very different, even if that may be hard to see from just the slides, so on the next slide I pulled out the parts that are of more interest, to try to show where there may be a significant difference between doing exclusive and inclusive measurements. So, for instance, for one of these routines here, the numbers are actually the same: the light blue row shows that there's not really any difference between the inclusive and the exclusive, so there was nothing else interesting going on inside that routine. On the other hand, if you take the dark blue, or almost black, one, there you can see that the exclusive time is very small but the inclusive time is very large, relatively speaking. So in that case, the exclusive measurement doesn't really point out much about what is happening in that call tree. These are things to be aware of with the two; an inclusive measurement is not always all that helpful, but again, it can be a first step, and it is more moderate in terms of data volume.

39:43 Could you explain inclusive and exclusive with respect to a routine? As we did here, each routine was measured in the profiling, right? So given a routine, how will you differentiate between exclusive and inclusive? Well, you can see the exclusive measurement by looking at this graph, this little insert that you have on the left side. In that graph, the inclusive part measures the time, in this case, for everything between the start and the finish of this full function. The exclusive measurement for the function call only measures the statement A = A + 1; everything else that goes on in this full function, including the call to the subroutine, is in the inclusive part. So these individual statements would count as exclusive, and then all the others will come under inclusive, is that right? Well, if you want an exclusive measurement for those statements, yes; then that will be measured, and everything else will not be included in that timing.

41:28 Okay, so we can alter which ones are to be chosen as exclusive and inclusive? Well, in general, TAU reports inclusive and exclusive values for all the routines in your code. You don't have to select which it reports; it reports both of them. You can see it on this slide, maybe: this is a listing of the exclusive and inclusive time, right?

42:08 As you can see, there is exclusive and inclusive time for all the routines; each name in the first column is one of the routines in the code. So if you take, for example, just the first one, .TAU application: that's most likely the main call in the source code, and that particular section, if you look, takes only two point something seconds. But the inclusive number also includes all the other subroutine calls that you can see under it; including everything that it calls, that subroutine takes 54 seconds. So that would be the summation of all the time that is going on inside it? Exactly: including everything inside it, it's taking a total of 54, but that particular routine itself is only active for two seconds. Does that make sense? Yeah, but then they don't add up; shouldn't 52.8 be a subroutine of .TAU application? They don't add up because the next one is 33, is it? And what is the unit of this, microseconds? No, these are seconds. But you cannot add up all the inclusive values; if you add up all the exclusive ones, then you will end up with the 54.92 that's on the slide.

43:40 Okay, thank you. I had a question as well. Yeah, go ahead. Say you have a routine that is heavy on recursion: are discrepancies introduced due to the time it takes to allocate frames on the call stack, or things of that sort, overhead that's not captured by TAU? Yeah, I'm not too sure about the recursion; I believe it reports each recursive call as one subroutine call itself, but I'm not sure about that, I'll have to check. Okay. Yeah, as far as recursion goes, I'm not entirely sure about that.

44:37 So, coming back to this slide and what was just said in the answer: again, .TAU application has, you know, 54.9 inclusive seconds, but in itself the overhead at that level is like two seconds. The first routine that it calls took 52. So again, if you add the exclusive time for .TAU application to the inclusive time for the routine it calls, then you get to almost the 54.9. That is because, as you can see, the inclusive time of what's being called first inside .TAU application, plus the exclusive time for .TAU application itself, gets to more or less the total for the whole thing.

45:42 This slide shows the number of calls being made; in this case, main is only called once. But as you go down the sequence of calls, some other routines are called a very large number of times, so you get at least that information. You can see that, towards the middle of the listing, the count sort of pops up a little bit, and then it goes down again in the calling sequence. And this one routine appears at two different lines, called a different number of times in the two places: the first call site shows 180,000 times, and the second one shows 90,000 times. So you get a fair amount of information about how things are being used. That's what I meant: the profile that I showed before only had the total, aggregate time, but not how many times each routine was called.

47:05 Any more questions related to this? Otherwise, we'll talk a little bit about the tracing part. Tracing, as was just said, adds the sequence of calls, not just the aggregate time: how much time each one of the calls did take. So you have timestamps for each invocation of each call, and the duration of the running time.

47:39 Here is an illustration: when you have parallel programs, one needs to have basically a synchronized clock. In this case it shows two processes, A and B, that are going to exchange information, process A sending stuff to process B, and process B then having a receive. We'll talk about how these things work when we talk about MPI and message-passing programming, but this is just to show how things need to be synchronized to get a global timeline. So you basically map all the send and receive events between the different processes onto a common timeline and see when they happened, and so the cause and effect.

48:33 And this is what, I don't think Joshua is showing it today, maybe a future time: you can get something that is fairly easy to look at, and in this case it is the visualization tool that TAU uses. And here is what it may look like for a much more complicated sequence: in the middle, in a little window, you see a very short time interval that covers about eight milliseconds, from 15.592, and the color-coded bars are the various routines being called; the lines try to illustrate some of the messages in this trace between the different processes, in this case four different processes.

49:31 , um, so it gets, , quite complicated. And it is

49:39 it says that tracing has the temporal and the special aspect, but it

49:46 which the profiling does now. But problem is, tracing produces very large

49:56 sets that usually not the first thing do. But it may be something

50:01 need to do to find out for instance, in particularly parallel

50:08 why there is lots of either time various processes and because it, um

50:17 waiting for information from somebody else. those things may be very hard to

50:23 out just based on profiling. So need thio both this time sequence for

50:29 different processes to be able to infer causes perhaps inefficiencies. I think this

50:41 be a good point for Josh to start talking about how to actually use TAU,

50:49 in particular for the assignment this time. And then, if there is time left,

50:57 I'll show some more slides about the analysis part, what you can do, so we'll

51:08 see. But I'll let, uh, yes? You have a question?

51:12 Yeah, so you talked about idle time, right? So

51:18 can we see the idle time in our code, in profiling or tracing? So, good

51:32 question. I do not know if you actually get the idle time explicitly.

51:45 Um, the only way, except, for instance, if you first look at

51:52 the single-thread performance, the only thing I guess you can use for the idle

51:57 time is to use PAPI, and you find out the stalls; that shows, uh,

52:02 the stall time for the individual cores. In that case, um, when it

52:11 comes to using OpenMP and other forms, with idle times due to dependences between threads,

52:22 whether you can get that without tracing, maybe not. So, yes. Can you, Josh?

52:29 Yeah, I'm not sure about OpenMP either, but yes, for just

52:34 plain serial programs, as you said, um, there are events that correspond to the

52:41 number of cycles your program was stalled, whether it was waiting for any resource

52:47 or waiting for a resource like just a memory access. So there are events for that,

52:53 but because there is no particular routine attached, um, so if you're trying to do

52:58 it by statement, or by block, or by thread, uh, it tells you the

53:05 time for a thread, for instance, but not the breakdown of where the time in

53:09 that thread went. And there's no particular routine that is charged with idle time.

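As an aside, the stall events Josh mentions can be listed with PAPI's standard `papi_avail` utility; the sketch below is guarded so it also runs on a machine without PAPI, and the specific preset names it greps for (such as `PAPI_RES_STL`) vary by CPU.

```shell
# List stall-related PAPI preset events (e.g. PAPI_RES_STL, resource stalls).
# Guarded so this sketch runs even on a machine without PAPI installed.
if command -v papi_avail >/dev/null 2>&1; then
  papi_avail | grep -i stl || true   # preset names vary by CPU
  found="papi_avail present"
else
  found="papi_avail not installed here"
fi
echo "$found"
```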
53:18 So, as a consequence, that's where I think you would need the trace to

53:23 figure out when it happened. And, so, if we have something like a

53:31 process waiting for memory to come to that process, so some data to

53:37 be transferred, then we cannot see that waiting time directly, right?

53:43 For both the processes? Yeah, right. You definitely need the trace to figure out

53:52 which processes are waiting for what other process to communicate something. Okay.

53:59 Thank you. Yes, I'm afraid we can't get quite that detailed information

54:04 on that side. I wish there was a simple way, but I don't see

54:13 one. Yeah. Yes. Good question. So, so, thank

54:22 you. So, yeah. Yep. Okay. Is my screen visible now? I

54:36 can see your screen. Okay. Perfect. So, uh, this will be a short

54:41 demo about how you will be using TAU for the upcoming assignment, or the one that's

54:47 already out. Uh, so before giving you the demo:

54:52 first, why will we be using TAU? So, if you remember, from the previous lecture,

54:58 when we used PAPI, we had to insert PAPI calls inside our source code.

55:05 And let's say you have a huge code, multiple thousands of lines of,

55:13 uh, code, with complex structures. Then inserting those calls kind of becomes cumbersome. It's

55:20 not impossible to do, but it gets really cumbersome, uh, and then to get

55:26 all the performance metrics for your code. But in this case, TAU comes in

55:32 very handy. And it, uh, does all of the insertion of probes,

55:38 as we saw in the slides, into your source code. So you don't have to

55:41 make any changes to your code. Uh, TAU does that for you. Okay,

55:48 so that's what, that's the reason why we will be using TAU: as an interface to

55:53 PAPI. It's not a substitute for PAPI; it's an interface for using PAPI in

55:57 a much simpler way. Eh, so here on the left you can see the code

56:02 that you have for this assignment. As you see, it has two multiplication functions:

56:09 one is the classic multiplication, and one the loop-interchange multiplication. We'll be using just

56:15 the classic one for this example, for the demo. As you can see, we

56:20 have not added any PAPI calls to it; it does not have the papi.h header or

56:24 any of the high-level PAPI calls, and the source code of the functions is completely

56:31 clean. Now, when you want to start using TAU, the first step you would

56:36 have to do, and I opened it here as well, is, as you can see

56:41 here: the first step is to load the module for TAU, which you can do using the

56:48 command module load, and just provide the module name. Remember, we are on the cluster,

56:53 so make sure you don't do any of this on a login node. Then, just to make

56:59 sure: was the module loaded correctly or not? So here you can see

57:04 the TAU module was loaded. Now, when TAU is installed on any of

57:10 these systems, or even if you go ahead and install it on your local machine,

57:15 you will configure it using all the packages that you have, and when you

57:20 configure it, TAU generates makefiles that it will use to compile your code,

57:27 and they carry all the parameters it will need while compiling your instrumented code,

57:33 so to speak. Those makefiles are located in the directory pointed to by a TAU

57:42 environment variable; just echo it to see the path. So if you, uh,

57:48 list that directory, you will see there will be two makefiles in that

57:53 directory, for our case. Since we're using just the serial code for this example,

58:02 uh, we will be using this, uh, makefile, the one without

58:07 MPI in its name. The other makefile, you see, it has

58:11 MPI in its name, so that one would require you to have MPI calls in your, in

58:17 the source code, which we do not have right now. So that's why we won't

58:20 be using that second one. Now, if you want to use it, you

58:25 will need to set up an, uh, environment variable called TAU_MAKEFILE,

58:35 and then you will just set it to, uh, the path to that particular

58:43 makefile, which you can do by just exporting this environment variable and giving the path

58:49 of the full makefile. So when you do that, let's just go ahead and

58:54 make sure that it was set correctly; so just try to print it out.

59:01 As you can see, it, uh, uh huh, it took it correctly.

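Collected in one place, the setup just demonstrated looks roughly like the session below. The exact makefile filename (`Makefile.tau-papi-pdt` here) is an assumption; list the makefile directory on your system to see the real names.

```shell
# One-time setup per SSH session, on a compute node (not a login node).
module load tau 2>/dev/null || true            # no-op where 'module' is absent
# Point TAU at the serial (non-MPI) makefile it generated at install time;
# the filename below is a guess -- check the directory for the actual one.
export TAU_MAKEFILE="$TAU/Makefile.tau-papi-pdt"
echo "$TAU_MAKEFILE"                           # verify it was set correctly
```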
59:08 So these first two steps you just have to do once, every time you

59:14 log into a new SSH session, and after that you don't have to worry about it.

59:17 So loading the module and setting that makefile variable, that you just have to do

59:22 once. Now, the next step is a couple of compiler options that you can

59:30 set. We'll get to the metrics part later on. But first, we will

59:35 just set a couple of compiler options. So if you want your, uh, compiler

59:40 to give out a more verbose output while it's compiling the code, you can set

59:45 this option. We will not do that right now, it's going to make the

59:49 output a little messy, but you can do it. It's not going to make a

59:53 whole lot of difference. Now, as I said, we'll be using TAU as,

60:00 uh, TAU is an interface to the PAPI library. So how do you tell TAU

60:06 what performance metrics you want to measure? That you can do using an

60:11 environment variable called TAU metrics, that's provided by TAU, and setting it to

60:20 whichever performance metric you want to measure. So let's say you want to,

60:28 uh, measure the number of single-precision operations in your code. What we'll

60:37 do, what we'll do is we'll just set the value of that particular environment variable

60:43 to this preset event from PAPI, PAPI_SP_OPS. That stands for single-precision operations.

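That step, as a one-line sketch; `PAPI_SP_OPS` is the PAPI preset for single-precision floating-point operations:

```shell
# Select the metric TAU should collect; changing this never requires a rebuild.
export TAU_METRICS=PAPI_SP_OPS   # PAPI preset: single-precision FP operations
echo "$TAU_METRICS"              # print it back to confirm
```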
60:51 We set that. Let's make sure it went in correctly. Right. So now

60:59 the TAU metrics environment variable points to that particular, uh, event. Now here,

61:08 I already did some testing, so ignore those numbers for now, all those numbers

61:13 for now. But as you can see, we do have a matmul.c,

61:17 uh, source code that's exactly the same as what we have here on the

61:21 left. Eh, so, when you normally compile your code, what you do is,

61:28 what you just do: gcc matmul.c, and you give the output file,

61:34 uh, name. When you want to compile your code, uh, with TAU, so

61:40 that it gets instrumented, you will just replace gcc, or icc,

61:47 in case you were using the Intel compiler, with the TAU wrapper for C compilers,

61:53 and that, uh, that is tau_cc.sh. So you

61:59 replace your compiler name with that. Now, when you compile it using this wrapper,

62:06 as you can see, TAU does the instrumentation by itself and ultimately generates an

62:15 executable that has the instrumented source code. So it now has all the probes in

62:21 it, and all the necessary calls already in that executable, everything necessary to take the performance measurements.

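The compile step just described, sketched with a guard so it degrades gracefully on a machine without TAU on the PATH:

```shell
# Replace the compiler name with TAU's C wrapper; the rest of the command
# line is unchanged. tau_cc.sh inserts the probes automatically.
if command -v tau_cc.sh >/dev/null 2>&1; then
  tau_cc.sh matmul.c -o matmul
  msg="instrumented build done"
else
  msg="tau_cc.sh not on PATH; load the TAU module on the cluster first"
fi
echo "$msg"
```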
62:27 So, once you have compiled your source code, how do you

62:34 run it? So again, you can use the command from TAU that's called tau_exec,

62:41 uh, and then provide a tag saying that it's a serial program, because that's all

62:49 for now; we're dealing with just single-threaded programs. And then simply, uh,

62:53 give it, uh, give it the executable. When you run that, the

63:00 execution is going to finish normally, and you already see this profile file in there. But this

63:07 profile will be generated whenever you run your instrumented code. Now, when you want to

63:13 read this, um, this generated profile, you will use the command-line tool

63:20 provided by TAU, which you can run by using the command pprof.

63:27 So you can run pprof in the directory that contains this profile file. So when

63:36 you do that, you can see the profile that was generated for single-precision operations,

63:43 because we set TAU metrics to single-precision operations to be counted.

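The run-and-read steps, again guarded for machines without TAU:

```shell
# Run the instrumented binary under tau_exec (-T serial = single-threaded),
# which writes profile.0.0.0 here, then read it back with pprof.
if command -v tau_exec >/dev/null 2>&1; then
  tau_exec -T serial ./matmul
  pprof          # text report: calls, exclusive and inclusive counts
  ran="profiled"
else
  ran="tau_exec not on PATH here"
fi
echo "$ran"
```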
63:48 You can see the main entry, uh, the main function; the classic

63:56 matmul that was called in our function, in our source code, so this function;

64:00 and also the initialization of the matrices. So all four functions you can see on

64:07 the left. You can also see exclusive and inclusive counts for each of these.

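For reference, the operation count used below, one multiply plus one add per inner-loop iteration over n cubed iterations, can be checked with shell arithmetic:

```shell
# Expected flop count for an n x n matrix multiply: 2 * n^3.
n=1000
flops=$((2 * n * n * n))
echo "$flops"   # 2000000000 -- the two billion quoted in the demo
```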
64:13 Now, recall that for matrix multiplication, the number of operations is two times n

64:22 cubed. So, since we had n of 1000, the total number of operations will be two

64:28 billion operations. Ah, so, as you can see, if you go through the

64:35 inclusive counts, this classic matmul actually did those two billion operations. But because the

64:44 classic matmul was called from the main function, and the main function was ultimately part of,

64:52 uh, our application, these two billion operations were counted inclusively for all

65:00 three functions. Now here comes the difference between exclusive and inclusive counts. When you

65:07 see the exclusive counts, you will see these two billion operations were actually just counted

65:13 inside the classic matmul function, the one that actually did all the operations. So

65:20 that's the importance of checking exclusive counts as compared to inclusive counts, for any

65:28 performance metric. Uh, now, in the previous example we just,

65:35 , mentioned one metric in the style operate, uh, down metrics,

65:44 , environment variable. Now let's say have tow. You want thio measure

65:50 than one, uh, metrics. what you can do is you can

65:55 define a Colin, separated list off and just set it in the town

66:07 . Now, the best thing about is you just have to change this

66:11 variable. You don't need to re your cold. And when you have

66:16 that, you can simply just go and run your code again. And

66:30 that's going to do is it's going create profiles inside these two directories that

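A sketch of the multi-metric setting: TIME is TAU's wall-clock metric, and the colon-separated form is what the demo uses. Nothing needs to be recompiled, just rerun.

```shell
# Request two metrics at once; TAU writes one profile directory per metric.
export TAU_METRICS=TIME:PAPI_SP_OPS
echo "$TAU_METRICS"
```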
66:38 Now, in this case, one thing that you should

66:45 remember, always, is: if you are mentioning two or more events in just one variable,

66:54 uh, I just used one here for illustration, but anyway, if you mention

66:58 multiple events, make sure you always run the command papi_event_chooser and check the

67:09 compatibility of the events that you are counting at, uh, the same time. If

67:19 you don't do that, if you end up using, let's say, PAPI_L1_DCM,

67:23 and that's the level-one data cache misses, together with the

67:28 single-precision operations: these two events are not compatible with each other. You can

67:32 check that by using papi_event_chooser. If you happen to use incompatible events, TAU

67:38 will not report any values for any of these events, and you will just go

67:42 on thinking: what's going on? So if you're using multiple events, make sure you

67:46 use events that are compatible with each other.

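The compatibility check just described, guarded for machines without PAPI's utilities:

```shell
# papi_event_chooser takes the events you already picked and lists the
# ones that can still be counted alongside them on this CPU.
if command -v papi_event_chooser >/dev/null 2>&1; then
  papi_event_chooser PRESET PAPI_SP_OPS
  chk="checked"
else
  chk="papi_event_chooser not installed here"
fi
echo "$chk"
```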
67:54 And that's mainly what you will be using for the assignments, uh, at least for the

67:58 second, upcoming assignment. Any questions on this? Uh, the codes that you have

68:07 on Blackboard, they already have these steps, and I've added a bunch of comments

68:12 as well, so you can go ahead and read through them; they pretty much follow

68:16 the same steps that I just showed you. So, if there are no questions,

68:29 I will stop here. Okay? Thanks. Mhm. 10 minutes left. Do

68:50 you see the screen I want? Right. All right. So these

69:03 correspond to, yes, the demo. So there are

69:12 just corresponding slides in the deck. This shows a little bit how you can

69:20 then use some of the other tools that we're not planning on using. But it

69:25 shows this -T option, if you want to use the binary rewriter, and other things.

69:33 And there is also, kind of, the list of environment variables, a short list.

69:39 But again, the best thing is to go to the TAU website if

69:44 you want to do something that's not covered, but also to see what the defaults are.

69:52 Just a couple of slides, again, to show what the program analysis allows you to

69:56 do. And in terms of the, uh, ParaProf profiler, uh, I

70:04 will just very quickly show you examples of what kind of graphs can be used

70:10 to illustrate, to present, the data that can be collected. So here is one

70:19 you already saw a little bit: on a per-thread basis, where things are kind of stacked

70:24 with respect to the time the various routines take, and the color coding is a

70:34 particular routine. And in this case, it's a parallel code. So, for

70:37 each thread, you see each routine's proportion of time in that thread.

70:51 Um, there is, and there is just a little bit on the options one

70:56 can choose for how things are being presented. And here is a little bit different

71:05 ways, where are we, in this case, you can show them relative to

71:14 each other in a clearer way, perhaps, so that for each routine, on each thread

71:19 for which you get the profiling, you can see the relative difference in how much a particular

71:27 routine is used in the different threads. Um, so on. And here is

71:37 a different way, in terms of both the inclusive and exclusive times and the calls.

71:45 And this is an OpenMP code, so a parallel code, but now we have kept

71:50 it simple for the first exercise, with just a single thread. But we will soon

71:56 get to do OpenMP examples, so at that time it may be useful to

72:03 go back and look at this particular slide. And this, this is an example

72:11 again of the call tree, where you can pick a particular part of the call tree and

72:17 see there how the time is being spent in that call path. Then there are all

72:24 kinds of fancy ways of doing three-dimensional bar graphs, in this particular case for,

72:31 um, different routines and different threads, and you can choose, uh, how to

72:38 represent things. So this is one; there is another one where the triangle is used

72:44 in terms of representing things; and there is also the scatter plot, trying to

72:50 correlate, uh, routines. So, um, we may try to do a three-dimensional

73:00 bar graph at some point, maybe, you know, when we talk about OpenMP, but not

73:06 at this moment. So it's just to show you a number of these different plots.

73:13 And this is just an example of some of the software that TAU uses

73:19 to do the graphical representation of the profile and the trace, the trace representation. And this

73:27 is trace analysis, and, as I said, they get somewhat complicated, but

73:32 you can follow over time how different routines are used by the different processes, how

73:39 they change behavior over time with respect to what is being used; again, routines are

73:44 color coded. So I think that's pretty much what I had in this. This was

73:51 a highlight of the routines and features being used; we were not exhaustive, um, of

73:54 course. But if you go back and look at the total, comprehensive

74:01 set of capabilities that this software encompasses, you will find these routines and

74:11 different ways of illustrating, with the different tools. So, as I said, this

74:15 is just to highlight what you can do, more than trying to teach you exactly how

74:19 to use each one of the features. But I encourage you to explore if you

74:24 do more complex things outside what we're doing in this course. Uh huh. This is

74:32 just a reminder of some of the caveats in doing performance measurements: that things are sampled,

74:39 so that, among other things, it means you don't get the full detail. It also

74:44 means that you may not get exactly the same data every time you do a new run,

74:53 because there are some statistical effects in there. Um, and there are other

75:03 things I'll talk more about; I plan to talk about compilers, for instance.

75:11 So the statistics, you know, and that was part of the awareness I

75:18 think I wanted to create in assignment one, that compiler optimization can

75:27 change, also, instruction counts, because some things may be optimized out. So you

75:34 know, you may not get the same instruction counts, depending upon the level of optimization that

75:42 is being used by the compilers. It's also the case that the same compiler optimization

75:50 level doesn't give the same results for, um, the same code on different platforms.

75:56 So if you run it on an Intel x86, or you run it on an

76:01 AMD x86, you may not get the exact same data, you know,

76:08 for the same data set at the same compiler optimization level. Um,

76:20 that's, I think, what these slides point to: awareness. And on the next slide,

76:28 there is just a list of the references used for the lecture slides. So with

76:36 that, that was the slides for today. So, questions, either on the

76:42 demo, or how to use TAU, or tools in general? We'll try to answer as

76:48 best we know. We haven't used it much more than for the class, but we

76:52 have used it in some other collaborative projects, and then, uh, new things keep coming

76:58 to the tool all the time. They use tracing a lot to try to find

77:05 dependencies and where time was lost. So, when the code will run on hyperthreads,

77:16 okay, the operating system sees hyperthreaded cores as separate cores whenever we check

77:24 with, like, /proc/cpuinfo or any of those, um, tools for that.

77:30 So that wasn't really a problem for assignment one, because we were using one

77:35 node and one core. Um, but if we were using, uh, two

77:39 or three, um, and we were trying the theoretical max, how would we know

77:45 that it used three physical cores instead of, you know, fewer physical cores, if it

77:50 were using three hyperthreads on, um, the physical cores?

78:02 Right. This is a good question. So there's one way to know exactly what

78:14 you get, and I will talk about that in some future class, and that

78:20 is to lock threads to cores, to prevent the operating system from moving things around.

78:30 So if you don't fix threads to cores, a thread may not stay on the

78:37 same core through the entire execution; the operating system may decide to move it.

78:45 So, that's why, I do not know, with these tools, if you can get a

78:53 time trace of what core a thread has been on through the duration of the execution.

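One concrete pinning mechanism on Linux is `taskset`, which restricts a process to a fixed set of cores so the OS cannot migrate it; this is a sketch of the idea, not something the course tools require.

```shell
# Pin a command to core 0; with the thread fixed, per-core measurements
# are attributable to known hardware.
if command -v taskset >/dev/null 2>&1; then
  taskset -c 0 sh -c 'echo running pinned to core 0' || true
  pin="taskset present"
else
  pin="taskset not available here"
fi
echo "$pin"
```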
79:09 Right. So there is, um, that problem there. And,

79:23 unless, I think, you lock threads, you cannot, um, know where a thread was

79:32 allocated. Intel has, at least for some of its processors, a way to decide how you want

79:51 threads to be allocated to cores and hyperthreads. And I'll cover that

80:06 when I talk about OpenMP. So it doesn't quite answer your question, how

80:11 do you know where it was allocated, unless you lock it. Okay.

80:19 Yeah. It's the only mechanism that we know of to be able to guarantee the

80:24 hardware being used. Yes, and it's available to the

80:30 user. Okay. So, in an environment where the servers have 24 cores per socket

80:38 and 48 per node, right, so if we were to ask for a lot

80:43 of them, where they'd just be waiting, does that, does the operating

80:50 system throw an error, or do we get some sort of error message, or what

80:55 will be there, right? What I remember, sorry, what I remember off the

81:06 top of my head is that it is not defined whether things get kind of queued

81:17 up when you don't have enough physical resources, or if you do get

81:23 an error. So if you basically ask for more threads than the hardware can support,

81:37 then, um, what happens, I believe, is that it's undefined in the standard.

81:45 Um, but I will look it up when I talk about it, to give

81:50 you a better, more precise answer. But that's what I remember. Okay. Thank

81:57 you, Dr. Johnson. So, by default, what's the allocation that, I

82:10 guess, it always does? Yeah, it's kind of round robin, so to

82:20 speak, between sockets. And so it does socket zero, if it's a two-socket node,

82:28 socket zero, then the next thread, it runs on socket one, and then

82:33 back to socket zero, and back to socket one. And when it goes back to the same socket, it

82:38 takes the next core, the physical core, so the second thread on that socket goes

82:47 to the second core. And then, when it has filled up all the cores,

82:53 then it goes, if hyperthreading is enabled, then it starts to do the

82:58 same thing at the hyperthreading level. Okay, so it uses the physical cores first, then

83:05 it starts hyperthreading. Right? That's the default mechanism that I think most systems

83:12 use. Okay. So, but then, yes. So that's basically

83:21 what the option is that they call spread. And, mhm, but there's also the

83:28 option, it has different names, but one is compact, in which case it puts

83:35 the threads, as long as possible, on the same core, on the same socket, before

83:41 it takes on the next socket. I mean, so that means, for instance,

83:51 if you have relatively small data sets, and there's a lot of data

83:58 sharing between the threads on the same data set, it may be advantageous to have

84:04 them on the same socket if the data fits, because then they don't

84:08 need to go to the other socket. On the other hand, if you're,

84:15 say, restricted by memory bandwidth, you may want to spread things out among the

84:20 sockets, so you get as much memory bandwidth as you can. But you can

84:28 control that in OpenMP, as far as I remember, using attributes for how you want the threads allocated.

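The spread-versus-compact choice discussed here maps onto the standard OpenMP 4.0 affinity variables (Intel's `KMP_AFFINITY` is a vendor-specific alternative):

```shell
# OpenMP affinity controls: OMP_PLACES defines the binding units and
# OMP_PROC_BIND the policy ('spread' across sockets, 'close' to pack).
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
echo "$OMP_PROC_BIND across $OMP_PLACES"
```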
84:35 Yeah, I was just wondering, because, um,

84:43 we ran something with four threads, and I wouldn't know if it was hyperthreaded

84:48 or not, and whether or not the number on the paper was correct or not. But I

84:53 guess if, as you said, it, uh, exhausts the physical cores before it

84:58 starts hyperthreading, then we should be good to go. And, yes, very good

85:04 question. So, uh, and very thoughtful. So, yes, so, good

85:10 to know. So, in terms of the assignment itself, as long as

85:17 you explain how you figured out the basis, that's all fine. Because, you know,

85:22 so if you say: well, if I have two threads, I will use the

85:28 maximum for two cores, and assume that's what's happening, and for three threads, three cores,

85:33 and assume that's what was happening, as long as you tell us that that was the basis

85:37 for the judgment. Or if, you know, somebody computes the total for the

85:44 entire processor, or socket, or chip, as long as it's clear what the basis was

85:51 that you have, it's all OK as answers in this course. In reality, if you

85:56 really want to be detailed, then to do it actually correctly, you need to understand how

86:00 many cores were actually used. Okay, thank you. Thank you, Dr. Johnson.

86:09 Well, yeah. Any more questions? If not, I will try to stop

86:24 the recording here. I'll end there.
