00:05 Mhm — just trying to get through the setup. So, okay, so today we'll talk about these issues — basically getting to talking about tools for understanding the performance of codes. That's really the focus of today, and in particular one tool that is listed at the bottom of this slide, popularly known as PAPI, for Performance Application Programming Interface. First I will give some higher-level, you know, concepts, and then we'll talk about PAPI, and Joshua will give a demo of some of the PAPI commands that will be used for the next assignment.

01:16 So, for the class of today, the focus is really single-thread performance. Not just single thread — the tools we'll talk about apply to single-thread, node, and cluster performance, so it's by no means the case that the tools only focus on one thing. But the focus of today's class is, let's say, maybe not the simplest situation in actual practice, but the simplest in complexity: dealing with a single thread, before graduating to dealing with more complex situations with multiple cores.

02:03 One point I'll come back to a few times: when you are trying to understand performance, as opposed to debugging for correctness — I guess it applies to that too — one needs to be very careful in selecting data sets. You don't necessarily want very large data sets, because that generates a lot of data that is perhaps very hard to penetrate and figure out. One needs to be fairly deliberate in the choice of the data set one is using for trying to assess performance.

02:40 Then, of course, depending on your objectives, you select proper tools — we'll try to point you to some in this class. In one way or another the code needs to be instrumented, so you actually get the information that you hoped for in order to get insights into the performance. Then you run things, and then it's often a fair amount of work to try to analyze what actually goes on in the code, to figure out where the deficiencies might be that lead to poor performance, and to try to make some changes to the code that hopefully will improve the performance. And then one basically kind of iterates a number of times until one is happy with the performance of the code.

03:39 So the premise is that there are three things you need to know. The first one is: I would like to know the application, and what it actually requires — the characteristics, the basic properties, of the code, or I should say of the application rather than the code; I'll try to be more precise as we go. So in this case the application is the problem you're trying to solve, and you have selected some algorithms for solving that problem. And once you have selected algorithms, as well as know what you're trying to find out, then you can get basically an assessment of what the workload requirements are: what the arithmetic and logic needs are, in terms of how many floating-point or integer ops and how many memory references you potentially need, or at least how much data you have. And we talked, a couple of lectures ago, about the arithmetic intensity measure, which then gives an idea of the balance between the computation and the data access requirements.

05:03 It also follows from the problem and the algorithms how much parallelism you have — maybe there are so many sequential dependencies that you can't really make use of a lot of parallelism. So these are the kinds of high-level concepts that come, again, from the application.
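As a reminder, the arithmetic-intensity idea can be written out as a simple ratio — the dot-product numbers below are just an illustrative example, not from the slides:

```latex
I \;=\; \frac{W}{Q} \;=\; \frac{\text{floating-point operations}}{\text{bytes moved to and from memory}}
% Example: a dot product of two length-n double vectors does about 2n flops
% while streaming about 16n bytes, so I \approx 2n / 16n = 1/8 flops per byte.
```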

05:30 Sort of bridging a little bit to the hardware: on the platform you're using, the weakest link, I would say, typically for most applications, is the memory system. So try to understand what the memory requirements are. Perhaps you have a sufficient amount of data that it can't just fit in a node; then, based on your data sets, you can decide that you need whatever number of nodes is required just to fit the data. Another case: if it turns out that your application is mostly memory-bandwidth limited, maybe you want to base the number of nodes you choose not on how much data you have, but on the memory bandwidth that, ideally, gets you the execution times you hope for. And then, what kind of memory requirements — that depends a fair bit, again, on your algorithms. But this all comes back to starting with the application, and the algorithm should give you the sense of your requirements. And when it comes to parallel computation, there is another aspect, in terms of how you distribute the data, which typically adds overheads or bounds on performance. Later on we will, of course, talk about the parallel aspects.

07:16 So now, for assignment one, we're simply just dealing with the single thread to start with.

Student: Dr. Johnson? Would you be able to give an example where fewer processors with more threads would be better?

Professor: Um, okay. So let's try to tease this question apart a little bit. At kind of a high level, there's this thing that was mentioned, I guess in the last lecture, in terms of hyper-threading, or simultaneous multithreading. That means that you have a few execution threads sharing the same piece of hardware. And when things are constrained by your memory system, it might be useful to use multiple threads to share a single core, because while some threads are waiting for things from memory, other threads get to use the functional units that would otherwise be idle. If you have a code that tends to be fairly heavy on compute and logic operations, then using multiple threads to share the same thing that is the critical resource may, in fact, degrade performance. So in that case — at the detailed core level — it depends on the nature of the application whether you want to have multiple threads sharing a core or not.

09:13 So that's, for instance, the difference in, I guess, the kind of philosophy between the two centers we're using — Pittsburgh versus Texas: one has in their case enabled multiple threads per physical core, whereas the other, at least until recently, chose not to do that.

09:40 Now, as for what level of parallelism to choose — that means, how many cores do you want? You know, a core today, if we talk floating-point numbers, can do a few tens of gigaflops per second. And if you have something that needs more, and you don't want to wait, you choose the number of cores based on the performance you'd like to get, based on the number of functional units you have. So for most things today — and pretty much everything, whether you do something of the neural-network type or the science and engineering computations — most of the problems are large enough that, in order to get reasonable execution times, you choose, I'd say, hundreds or even thousands, or in perhaps more extreme cases millions, of threads.

11:03 Now, if you have something that is memory-bandwidth limited: a memory channel today, with a memory module put on it, can probably do somewhere between 20 and 30 gigabytes per second in terms of its data rate capabilities. So if your data set is terabytes — even the biggest nodes, what they call fat nodes, may have one terabyte of memory — and you try to just access that data through a single memory channel, and assume, for simplicity, 25 gigabytes per second, then it takes 40 seconds just to read the data once. Maybe that's acceptable; maybe not.
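Written out, that back-of-the-envelope calculation is simply volume over bandwidth:

```latex
t \;=\; \frac{\text{data volume}}{\text{memory bandwidth}}
  \;=\; \frac{1\,\text{TB}}{25\,\text{GB/s}}
  \;=\; \frac{1000\,\text{GB}}{25\,\text{GB/s}}
  \;=\; 40\,\text{s per full pass over the data.}
```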

11:57 So in that case you may choose the number of execution threads, and the parallelism, based on how many memory channels you want in order to get reasonable access rates to the memory. And again, many of the computational science applications — fluid mechanics, structural mechanics — and, if you work in terms of machine learning: to get things done, people use very large data sets, sometimes, in the case of image or voice recognition, involving billions of samples in the training part. And that's, you know, terabytes of data. So in that case, again, you need high memory bandwidth in order to get reasonable execution times. Did that help?

Student: Yeah.

Professor: I didn't give you a particular answer — the point is the computation and data-set requirements, and then how they map to the hardware.

Student: Thank you.

13:26 Okay. So then, at the very other end, is the system — the platform you're using — which I kind of briefly touched on in answering the question. You try to understand how many processors there are on a node — or sockets; I tend to use "processors" and "sockets" interchangeably. And, from the previous lecture: the most common thing is two-socket nodes, or two processors per node, but four is not uncommon. So that gives you one thing — and that's part of what the exercises in assignment one are for: to learn how you find out what your hardware environment is. It also tells you how many cores, and how many threads you can use per core. These are all questions that you should ask yourself in trying to build the model of what to expect, in terms of the problem you solve, given the platform you're going to use. So part of it is the processing side, and the other part is the memory system.

14:48 And then, if you have — typically, nodes with accelerators, so something with GPUs — it really depends on the platform. In terms of, say, a laptop processor, the GPU is most of the time integrated into the same piece of silicon, but when you look at clusters, or things built for more computationally intensive use, the GPUs are typically something that sits on the I/O bus. So then you need to figure out what the ability is to move data between your accelerator, or GPU, and the processing unit. This is the kind of system-architecture-level thing that will help you set expectations for what the execution time ideally could be, if things are very well used. So that's, again, the motivation for some of the tasks you're asked to do in assignment one: to learn how to find this out, so you can build your model for what to expect. This is just a summary — I won't go into it; it's pretty much the same as the similar slide in the processor lecture last time.

16:13 summer slide in the processor lecture last on then the third component. It's

16:23 the code because the cold is the . That basically is the mapping between

16:29 application and the hardware and, your right to CO, uh,

16:37 your problem and C or C plus or some other conventional programming languages.

16:45 they don't really, and just anything the actual characteristics of the platform you're

16:54 . So they are supposed to be representation what you want to get

17:00 So that means compilers and other parts the software stacked that is the bridge

17:08 your text inform off the source code the action hard work, and the

17:17 these days is quite complex. So your description of what you want to

17:23 done in to what actually is happening complex. And the performance tool talk

17:33 is trying to help you assess good did the code eventually up being

17:46 And it zed Here there is the to try to approach is is often

17:51 the first tried to figure out what the performance critical parts of the

17:55 And for that, when you're something profilers that we'll talk about next picture

18:01 , then once you have found the that maybe the ones that take the

18:08 time, so in that sense performance , you're trying to figure out more

18:13 about what goes on in those and today we'll talk about tools started

18:19 that part. And, um so one tooth. Three things to keep

18:30 mind, understand the application, understand hardware, and then try to understand

18:37 cold. And we'll talk later on some coming lecture about the compilation process

18:46 optimization that goes on and how one kind of helped compilers doing the things

18:52 hoping would do automatically. And this back to whenever the comment is that

18:59 need to be a conscientious when you to the performance optimization or debugging

19:07 Careful about the data set to choose too much and not too little.

19:13 , um, I said said. that's what I was trying to emphasize

19:17 with this three parts application system. Cold is Toe have a model in

19:24 red what to expect, And based that, we try to then attack

19:30 problem of optimizing the performance. And is just things that none of these

19:39 will solve the problem for you. definitely something run. Use the tools

19:44 gain the insights, but it's unlikely that these automatic tools will actually solve

19:53 problems. So in understanding off these parts that I mentioned and then proper

20:01 is what you should expect to be , get the good performing coach.

20:12 Now, the tools are typically not of the one-size-fits-all type; they tend to be focused on different parts of the system that you have. So, as I said on the first slide, PAPI is the focus today, and it's kind of focused on the processor-core part — even though it does have features that also address network counters. It started out as mostly a processor-focused tool, but it has then been expanded to cover additional things. And then there are other tools, for instance, addressing more of the memory hierarchy, and tools for parallel code. Later on, of course, we'll talk about the message-passing interface, MPI, that is used as a programming paradigm for clusters, and there are, again, performance tools that address those kinds of things for MPI. And then there is one more tool that we'll talk about in the next class, which is known as TAU. But today it will be PAPI.

21:46 Excuse me. So the very first thing one tends to do is just to collect timing information. Now, we talked about timing early on — I think the second lecture — but it's important to use the proper tool to collect execution times. And the general advice that I have is to use something that counts cycles, because that is accurate; anything that otherwise uses time may not give you enough insight, because it depends upon other things happening in the system. Counting cycles is just sort of more accurate. And then you can do that at various levels of granularity: you can do it, say, for segments of your code, or for the whole program.
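In code, that advice looks roughly like the following minimal sketch, which brackets a region with PAPI's cycle and microsecond timers (the loop here is just a stand-in for whatever code segment you want to measure):

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    /* PAPI_library_init returns the version number on success. */
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    long long c0 = PAPI_get_real_cyc();   /* cycle count before the region */
    long long u0 = PAPI_get_real_usec();  /* wall-clock time, for comparison */

    volatile double s = 0.0;              /* volatile so the loop is not optimized away */
    for (int i = 0; i < 10000000; i++)
        s += 0.5 * i;

    long long c1 = PAPI_get_real_cyc();
    long long u1 = PAPI_get_real_usec();

    printf("cycles: %lld  usec: %lld\n", c1 - c0, u1 - u0);
    return 0;
}
```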

22:50 Then, to understand the workload, you want to look at things that are related to the arithmetic or logic operations that the code is supposed to do. That's kind of targeting the functional units, and trying to understand how well those are eventually being used by the code. And to get insights into what actually happens, the main thing is to look more at what is going on in terms of the memory hierarchy — cache hits and misses — and we'll talk more about cache behaviors, and insights into those, in the coming lecture.

23:34 These are kind of high-level points. The other thing, when you then try to use tools, is to be aware that tools may in fact both add overhead and change the behavior of the code. That's what I kind of warned about in the debugging lecture: if you do printf, that adds work, and it potentially not only changes execution time, but it can also change code behavior. So there are different tools: some of them are not intrusive at all, and some of them are in fact quite intrusive. But depending on what you do, it's not necessarily the case that using a tool that ends up potentially changing the behavior of the code is a bad thing to do; maybe that's what's necessary to get the information. And there's a spectrum: some of it means inserting statements in the code and recompiling; in another case, you may link in some library that basically ties to the executable and then collects information on the binary. In the course we will mostly use open-source tools, because they are, well, easy: you don't need to sign a license or buy them from some vendor — because, in general, the vendor tools are not free. It also means that if you use the open-source tools, they should cover, if not all of the vendor platforms, then usually most of them. So in that sense it's a way of working that is kind of portable once you get familiar with the tools. On the other hand, the vendor tools may have specific features for their specific platforms that the open-source tools don't have. So there, as always, are benefits and drawbacks with each approach.

25:54 There are other aspects of tools to be aware of. Part of it is static instrumentation — that you do prior to execution: you, maybe manually, insert things in the code and recompile. So that's clearly static instrumentation. There are other approaches that allow you to dynamically instrument code, depending upon what you are after. And how this is done — it can either be done manually or automatically. So this will give you an idea: it's a pretty rich set of tools out there. We will only deal with a couple of them: for doing mostly processor-focused instrumentation, PAPI; and then, for the profiling part — well, yeah, that's TAU. But this is just to make you aware that there are lots of tools out there.

26:55 This is kind of a cartoonish thing; I'm just trying to point out another distinction. Many of the tools are basically based on sampling, because too much detailed information is sometimes too overwhelming: it generates a lot of data. And if you collect lots of data, that also means the tool itself perturbs, or changes, the program behavior, because those things tend to need to be written out to disk — it can be gigabytes of data — to capture what goes on. But if you take the statistical, sampling-type approach, it also means it is statistics: you may not have, perhaps, all the details you want. The other direction is direct measurement. If you do direct measurement, it targets exactly what you want, and it can give you very detailed information that the sampling tools may not give you. But, as I said, it is often changing the executable and can cause overheads. So this is, again, a lesson in the trade-offs of what the tools do: when you try to interpret the end outcome of using the tools, you need to be aware of how the tool affected your code.

28:36 And these are, you know, event-trigger tools that can measure things, as it points out there, either exclusively or inclusively, depending upon how things are being done. And then there are kind of atomic events — but this is, again, largely just to characterize the tools in general. For the class, we'll talk exactly about what PAPI does in this regard.

29:25 Then again — as I mentioned already a few times — one needs to be aware of how the tools potentially affect the execution, of the overhead they introduce, and of that, in a sense, corrupting the measurements. And as we pointed out in the debugging lecture, the resolution of the timers and counters is important, and we'll come back to this concept of granularity when we talk about PAPI — what the granularity of the measurement is. And typically the process is to start with some profiling and try to find out what the time-consuming parts of the code are, and to focus on those. And then you sort of try to drill down — unless you have a good hunch about where the problem might be, in which case you can sort of dive into a more detailed exploration of where things actually happen, at the statement level of the code. So this is kind of just summarizing what I said: one, unfortunately, tends to end up using a few tools — or a variety of tools — as one learns more about the behavior of the code.

30:53 So now: any questions on this kind of general introduction to performance optimization? Otherwise I will sort of dive in to talk about the specific tools.

TA: I don't see any questions in the chat.

Professor: Okay, thank you.

31:19 So I will talk about PAPI — first, some general background to PAPI. For many users and communities, performance is a key issue, because, of course, many are wanting their codes to run faster and are trying to figure out what goes on. And it is the case that most processors today have what's known as hardware performance counters that focus on recording data from program execution. Early on there was a performance game, I would say: the vendors — Intel and AMD and others — that have built the processors didn't want users to get access to that information, because they viewed it as being very sensitive, for competitive reasons, to have people gain insight — detailed insight — into how the processors work. Eventually, they came around and realized that having the user community tell them what the potential issues were with their processors was actually helpful for designing the next-generation processor, and that they couldn't really do all of that sort of analysis themselves and gain all the insights. So they allowed the academic community to eventually build tools to access these counters that were built into the processors and collect useful information. So even end users, and programmers of applications, could learn how to better structure their codes to get performance. And the person who kind of led this — who pushed the hardest and got the processor vendors to agree to open up — was someone from UT, who was the original designer of and contributor to PAPI.

34:23 relevant and that it will be exposed and using Poppy and all the so

34:32 kind of the inter talk in a quick review of some of the

34:35 And then so, Joshua, your on them Oh, are some of

34:40 features that are particularly useful in understanding behaviors, respect to processors that you

34:49 . So by now it's basically supports most processors out there. That and

34:58 terms off processor vendors, that's a , D and info pretty much

35:08 The Exodus six market. And then is idea, um, that does

35:13 end server processing their terms of their X serious and more recently, process

35:22 design based on arm has also become common on gaining and adoption. So

35:30 also supported, and they also support pews from both in the end and

35:38 . So certainly what they're being using the course. There is support for

35:44 by these poppy tools, and what allows you to do is to collect

35:53 , not just about timings on, it also gives you kind of summary

36:01 , like instructions per cycle for cold of entire colds or per threads,

36:08 you can choose how you want and I'll give some examples and going

36:12 . And then you can also understand behaviors, branching behaviors and memory and

36:22 stalls. So a lot of the information for understanding what potentially is limiting

36:32 performance of your coat. And the few years they have also added

36:38 Since now, energy empire consumption is one of the critical aspects of is

36:44 interest. And as I mentioned you can get this information down Thio

36:52 sort of basic block level in your or process or perfect, and just

37:00 case the concept of Basic block is familiar to someone. It's the little

37:07 up. You know, the computer talk a lot about, and it's

37:13 the code segment for which there is which there is no entry points.

37:19 this simple slash ocean the piece of on the left and bunch of

37:26 and then on the right hand you kind of an illustration of the

37:30 blocks. Um, the main point just showing that what the basic block

37:35 something to which you cannot jump into when the exit basically after having ability

37:41 go somewhere else. So this is the concept. I'll come back to

37:46 when I talked about compilation but just familiars and give you What is
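A hypothetical snippet, just to make the concept concrete — each commented region below is a basic block: one entry at the top, one exit at the bottom, no branches into or out of its middle:

```c
int sum_to(int n)
{
    int s = 0;                        /* block 1: function entry, falls into the loop test */
    for (int i = 0; i < n; i++) {     /* block 2: the loop test, branching to body or exit */
        s += i;                       /* block 3: the loop body, jumps back to the test    */
    }
    return s;                         /* block 4: reached only once the test fails         */
}
```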

37:53 Now, PAPI works with these performance counters that exist in the processors, and they kind of classify the things you can record by events. And the events are then again classified as either native events — which are kind of directly what the performance counters tell you — or preset events, which may be groups of these native events, and which are perhaps more useful in understanding the code behavior. I think I have some slides on the preset events and the differences, and I think Joshua will also demo some of this. Now, events are then often collected into event sets, and we have some slides with some of these events.

38:59 Here is maybe the one thing to be aware of — and it's unfortunate; it may make it a bit cumbersome sometimes to use tools like PAPI, and it's not a problem with PAPI itself: it's the problem that there is often a limited number of counters that can be used to collect this information from the processor. The processor may count all kinds of things, but PAPI has to read some of the counters the vendors use to get the information and stick it into some other counter registers, and those tend to be limited. So that means sometimes one has to make several runs to collect all the data one is after, because in any given run you can only collect, let's say, five or ten pieces of information. That, of course, adds a little bit of complexity, because execution is statistical, so every run may not be an identical run. So that's something one has to be aware of: because of the statistical nature, and having to do multiple runs, the runs may not be exactly the same.

40:30 So there are, again, the preset events: these are collections of useful native events that the designers probably thought a lot of people would like — a collection of native events, or sometimes just a single native event by itself — so you don't necessarily have to create an event set yourself. Now, not all preset events are supported on every platform. In this case, for Stampede2, there are 59, according to the most recent info we have — which was generated yesterday — out of the, in fact, 108 possible preset events: at the moment, 59 are available. All those preset events are kind of events that are composed of, or derived from, native events. The native events are unique to every processor vendor: it's whatever AMD decides to support, or Intel decides to support, or IBM decides to support. There is no common agreement among the processor vendors that they all have to support something. Most of the native events are similar, but not all of them. And the slide also tells you — and you'll see the actual commands — how you find out what's available: papi_avail tells you what all the preset events are, and papi_native_avail tells you which the native events are. And again, Joshua will talk more about that in ten minutes.

42:53 Then one has to select what events one wants. For the code, you should decide what is primary: whether it's some efficiency measure — sort of a global thing — or more detailed information about cache behaviors, or something about the kind of memory mapping that comes in through the translation lookaside buffers, TLBs, which we will not talk much about in this class, but I'm happy to answer questions about them. Those of you who have taken computer architecture classes should be familiar with them: in modern architectures, at any given point, not all the memory is actually mapped directly, so you go to one of these lookaside buffers, or tables, when you need to remap or get access to some other part of memory. And there are all kinds of other details.

44:14 So now I'll flip to a few slides, because this is partly what Joshua will be covering, in fact. This is just the listing for the 59 events — there are three or four slides here. If you look at it — they are not in any particular order, and I don't know what the print order was — the things it deals with: there are level-one, level-two, level-three caches, and it counts misses. And you can separate both data and instruction caches, and, likewise, both hits and misses. And under the "Avail" column you can see which ones are available on Stampede2, in this case, and which ones are not. And in most cases you can see whether an event is, in fact, a native event or derived. If you look at the very first one — this is level-one data cache misses — it's not a derived event; it's actually a native event. On the other hand, if you drop down a few lines, to the level-two data cache misses, that's apparently something that is composed of a few native events. And you can see, if you look down past the middle: cycles for which floating-point units are idle. So you get information on when things are basically idle and waiting for something.

45:49 Then, other things that are important to you — and this was something I tried to point out in terms of streaming, often: in most processors, when prefetching is not successful, things end up being stalled, and PAPI lets you collect stall counts. And one such thing: stalling, and waiting, in this case, for memory accesses — and, in more detail, whether it stalled because of reads or writes. It also gives information on branching — whether branches were taken or not. And there is more, additional information, so I encourage you to take a look at these slides. And again, you just need to run, when you do the assignment, this papi_avail command, and that will show you what's on the slides.

46:47 what it's on the slides on. , um, I guess I should

46:55 this and then I will hand it to solution. Then I'll come back

46:58 some time waste after his demo. , but there's both to levels,

47:07 level interface and the level interface, high level that's except down fairly

47:12 Um, inside you may get what it easy to use. Where is

47:18 low level interface gives you much more than abilities to specified your sort of

47:25 events. That's that you may And I think I am.

47:30 um, okay, maybe I do couple more slides and then I left

47:38 takeover. So there's just some of high level, um interface that is

47:49 basically in this case there. You know, high level,

47:57 calls that you can use for And then there's three in this case

48:04 that gives you instructions per cycle or per cycle or floating point operations for

48:13 point instructions or floating point operations per . Um, and those are not

48:20 instructions per cycle and operations per Because remember when I talked about the

48:27 , many of them are the Allied or can do many operations in a

48:33 instructions or separating out instructions from operations an important desperate. And I think

48:42 is just a detail and I will that an interest of time or and

48:48 the questions that can come back to and I think there's a couple of

48:51 here, there's chose. You can this high level interface basically said,

48:56 the counter and then you have your and then you stop the counters and

49:01 you get Uh huh. What number transactions in this case in terms of

49:08 high level interferes gives you in this for the events is to define the

49:15 number instructions and the total number of cycles that WAAS occurring between the start

49:21 stop for the country's. For this cold on the low level interface has

49:29 lot more details on. Here's just example or what you can do then

49:37 family detail information, and I think will stop and maybe come back.

49:42 if so, Yasha will start your by commenting on the lower level.

49:47 example, I want you to have time, so that's why kind of

49:51 exhibit. I'll cover it in the . Okay, so then,

49:57 take over. So then there is a few slides that's basically is more

50:03 less screen screenshots off the devil That Joshua will do. And if there's

50:11 left, I will come back and about the concrete example. But then

50:15 hand it over to say yes. , so we just start my start

50:21 share my screen. Oh, I got it. Okay.

50:33 great. So is my screen showing now? Yes. Okay.

50:41 And just if we increase the fun now, my wondering. Yeah.

50:48 it visible enough? It's okay with . I hope the students can

50:56 Okay. Great. Right. as a doctor, Johnson mentioned puppy

51:02 a performance measurement toe. Now, tool provides you, uh, two

51:10 . One is command line based. second, it provides you an interface

51:16 a as a library itself. like any other library you have in

51:20 language, you can use poppy through function calls. Uh, now,

51:27 understand what Poppy does as way Professor talked through the slides that each

51:35 these processors nowadays that we have, have a certain set off hardware counters

51:42 can be configured to measure some performance based on which event we choose and

51:49 we wanted. Thio collect now to Poppy. First thing that you would

51:57 to do is to make sure you loaded the model on at least on

52:02 our any other cluster. Basically, make sure first you are on a

52:07 notes. So I'm going to compute here, and we already have the

52:13 model loaded. It will still, to show the command you can use

52:19 Lord puppy toe the latest latest version the, uh, a party modules

52:27 make sure we have it. So use model list and you can make

52:31 it is there now a Z There are two kinds off events that

52:39 CPUs and puppies are able to access those are the native events and the

52:45 events s on the left. You see that kind of description of

52:50 So native events are pretty much all events that are available on a certain

52:57 that includes some general events and all events that are architectures specific.

53:03 when you want Thio, check all native events for a particular CPU.

53:10 command that you can use would be native available and this is going to

53:15 a very long list, so I suggest you use it with a pipe

53:21 more command. When you do you will see that you can get

53:28 the information about the CPU that you using. And then I also tells

53:36 that this this particular CPU has 11 counters. So that means there are

53:43 physical counters that our president on the and you can Onley configure Aziz many

53:51 to collect during a single run off program that can that can fit.

53:57 to say in these numbers thes number hardware counters. Uh, so

54:04 So if you just press enter and going down, you can see all

54:08 kinds of native events that are That's going to be a very long

54:13 . I'll just show you a few these. It's now coming out

54:18 Next would be the preset events. Now, preset events are either derived from the native events, or are just groups of native events mapped to some general naming conventions for these metrics. What PAPI did was they took the most common events across all CPUs, and then they just created a mapping to some standard names for those events. So when you want to check all the preset events that are there, you can just use `papi_avail`. And this will be a much shorter list, so I'll just go to the top. This is the sort of output that you will get from `papi_avail`: again, all the information about your CPU, and all the preset events that are available on the CPU — or supported in general. You can also see which events are available, which ones are derived, and get more description about a particular event.

55:33 Now, `papi_avail` also has a few flags that you can run it along with. So if you run `papi_avail` with `-e`, and then, let's say, you provide one of the event names — let's take the level-one total cache misses one — then you can get more details about that particular preset event. So you see the name; you can also see that it's a derived event: it has two different native events that it's derived from. You can see its native events — the L1D replacements and the L2 requests — and these are combined together to derive this particular preset event.

56:27 Right. So those are the two kinds of events that are available on most of the processor architectures. Now, when you want to access these, you have two options: first is the PAPI high-level API, and second is the PAPI low-level API. The PAPI low-level API can access both the preset and the native events, so it gives you more control over what you want to collect and what you want to do in your code — it's a more detailed API. And if you just want to collect a few preset events, you can simply choose the high-level API, which can only access preset events — it cannot access native events — but it is much easier to use if you just want to do some minor performance measurements.

57:23 So now, before we move to the function calls, there are a few more commands that you would want to know about. First is the `papi_event_chooser`. So, as we saw, there is a limited number of hardware counters; that also means that during a certain run, you can only measure a small set of these events. And then there is another thing: quite a lot of events are not compatible with each other. So you would want to make sure that you choose events that are compatible with each other during a single execution of your code. If you use events that are not compatible with each other, you will most likely end up with nonsensical values for your performance measurements. So you should always make sure you're using events that are compatible with each other, and `papi_event_chooser` is the tool that allows you to check the compatibility of events. The way you would use it: you run `papi_event_chooser`, and the next argument is the kind of event you want to check compatibility against — let's say this particular event, which counts the number of single-precision operations. When you do that, it will give you a basic summary of the CPU, and also tell you which preset events are compatible with this particular event. So let's say you happen to be collecting single-precision operations for your code: you can also collect the total number of instructions, the total number of cycles, and so on, alongside that event — so you don't have to run your code unnecessarily multiple times.

59:50 There is another command that you can choose to use: `papi_command_line`. What that does is it allows you to check if a certain event is available on this particular CPU. So, to give an example: let's say there was an event in the native events list that you wanted to check, but you don't want to go through the whole list — as we saw, it's quite a long list. So you can choose to run `papi_command_line` to check if it's available on the CPU or not. For now, we'll just again use PAPI_L2_TCM, which is again a preset event. When you run it, it will give you a message that, yes, PAPI was able to add that event, and it ran a micro-benchmark — just a simple code built inside PAPI — and it was able to collect some performance measurements for it. So this is just to make sure whether this event is there or not — whether you can use it or not. And just to give you a sense of what happens if you use an event that's not available: the PAPI floating-point operations event is not available on this particular CPU, so when you try to add that, it will give you an error saying that this event does not exist and there was an error in adding it. So you can know beforehand which events you can use and which events you cannot.

61:26 So — are there any questions till now?

Professor: Okay — a comment on the commands you ran: some of the things one might use them for is to try to, again, understand the arithmetic workload, and the floating-point operations count is not something that one can get anymore. It used to be, but Intel turned it off in their processors. But I believe you can still get the number of floating-point instructions.

Joshua: Yes, you can get that. As well, there's another event that they've made available: the single-precision ops. But, yeah, I'm actually curious what's going on with those two events, because floating-point operations used to give single-precision ops as well, and it's pretty much the same thing, right? So we should probably check on that as well.

Professor: Yes — just to make clear: it's not that one cannot get any information on the arithmetic and logical workload, but maybe it's not as precise as you want. So you can get something at the instruction level, but not necessarily at the operation level.

Joshua: The operation level — right, right.

Professor: A common part of that is related to the architecture, because when you have these very long instruction words, it doesn't mean that the full width is always filled with operations — the compiler might pack operations, but it may only fill, say, a third of the fields there. So instruction counts may, in some ways, be misleading, and the operation count would require additional insights beyond just counting the instructions. And I think that's part of the reason why they kind of didn't want to tell us the exact operation count.

Joshua: That's an interesting point. Okay.

63:51 Joshua: So those were the main commands. There's one more command that you guys might want to know: `papi_mem_info`. If you use that, you can check the memory information about the processor — specifically the caches and the TLB. So you can see, for this processor: L1 is a split cache, so it has a data cache and an instruction cache, separate; L2 is a unified cache; and L3, again, is a unified cache. And the other details can also be seen here.

64:30 Right. So those were the commands. Now, moving on to the function calls — or, you can say, the library side of PAPI that you can use in C programs. This is somewhat how the PAPI calls look. Is there any question? Okay. So when you want to use PAPI in your code, the first thing you need to do is to include the `papi.h` header file. You can ignore these lines for now. But first, let's say you want to check how many counters are available on the system. There is the function that you can use — and it was in the slides as well — `PAPI_num_counters`. This function basically tells you how many counters are available on your particular CPU. Then you can print it out. One thing you should always remember: whenever you call any PAPI function, you should always compare its return value with the constant `PAPI_OK`. So the error condition should be like this: if the return value of any PAPI function is not equal to this constant, that means that function messed up somewhere, and it's not working properly. So you should go ahead and make sure you have loaded the PAPI module and have `papi.h` included. If it still doesn't work, then there's something wrong with PAPI. But anyway — just make sure you always check this.

66:32 Now, when you want to compile this code, the command to do that is by using `gcc`: you need to include the include path of PAPI, your source code name, the path to the `lib` directory of PAPI, then there is this `-lpapi` flag that you want to add, and then just the name of the executable you want. So when you do that and compile it, it will produce an executable like this. And then, when you run it, you'll get an output giving the number of counters that are available. So — any questions about that?
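A minimal version of that first program might look like this — a sketch of what the demo showed, not the exact demo code:

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int num = PAPI_num_counters();   /* also initializes the library if needed */
    if (num < PAPI_OK) {             /* PAPI errors come back as negative codes */
        fprintf(stderr, "PAPI_num_counters failed: %s\n", PAPI_strerror(num));
        return 1;
    }
    printf("%d hardware counters available\n", num);
    return 0;
}
```

Compiled along the lines shown in the demo, e.g. `gcc -I$PAPI_INC counters.c -L$PAPI_LIB -lpapi -o counters` — the environment-variable names here are illustrative and depend on the site's module setup.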

67:33 Okay, moving on to the example for the PAPI high-level API. Remember that the PAPI high-level API can only access preset events. So when you want to use just preset events to measure the performance of your code, you can just use the PAPI high-level API. Make sure you have `papi.h` included. What this code basically does is it just multiplies two matrices with each other, which were initialized at the beginning. Now, when you want to measure some preset events, you first need to create an array that contains the names — so to say — of the preset events that you want to measure. Then you also want an array that will receive the values of the counters. And in this case, you should use the `long long` data type for this particular array, because in many cases the counters can hold very large values. Once you have both of these things set up, the first function to call is PAPI_start_counters — you don't have to do any other setup for the PAPI high level. You start the counters, and tell it which events should be measured and the number of events that are to be measured — in this case it's just one. And again, make sure to check if anything is going wrong with the PAPI calls.

69:09 As soon as you've done that, you can just start doing the work that you're supposed to in your code. And when you're done with the work, you can just call PAPI_read_counters, with the parameter passed as the output variable that should contain the values. When you do that, two things are going to happen: first, the value of the counter will be read into this particular variable that you provide; and second, the counters will be reset. So if you read the counters again, it will most likely give you some weird values — it will not give you the values that you expect. So remember: reading the counters once resets the values inside those counters. As soon as you've read them, you can just print them out. And always remember, when you're done with your work, make sure you relinquish control of all the resources that you've got — make sure you call PAPI_stop_counters, which stops the PAPI runtime, or, to be specific, stops those counters. Again, compiling this code is similar to what we did for the previous example. And when you run the code — again, of the high level — it will tell you the total instructions that were executed for this particular code block. Any questions about that?
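Put together, the high-level flow just described looks roughly like this — a sketch, with the matrix size and event choice purely illustrative:

```c
#include <stdio.h>
#include <papi.h>

#define N 100
static double a[N][N], b[N][N], c[N][N];

int main(void)
{
    int events[1] = { PAPI_TOT_INS };   /* preset event: total instructions */
    long long values[1];                /* long long: these counts get very large */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = 1.0; b[i][j] = 2.0; c[i][j] = 0.0;
        }

    if (PAPI_start_counters(events, 1) != PAPI_OK) {
        fprintf(stderr, "PAPI_start_counters failed\n");
        return 1;
    }

    for (int i = 0; i < N; i++)         /* the measured work: matrix multiply */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    /* Reading copies the counts out AND resets the counters. */
    if (PAPI_read_counters(values, 1) != PAPI_OK) {
        fprintf(stderr, "PAPI_read_counters failed\n");
        return 1;
    }
    printf("total instructions: %lld\n", values[0]);

    PAPI_stop_counters(values, 1);      /* release the counters when done */
    return 0;
}
```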

71:01 Okay, so now moving on to the PAPI low level. Remember, the PAPI low level can access both the native events as well as the preset events, and it also gives you much more control over what you want to do with your measurements as well. There's not a lot of change when you're trying to simply read some events for your code. The first thing that is different from the high level is that you first need to make sure you initialize the PAPI low-level library. Then, here, you want to call this function, PAPI_create_eventset. What it does is it creates an empty event set to which you can add events later on. Once you've done that — so here, see that first I am adding a PAPI preset event, which is the PAPI total instructions, and then, to the same event set, I'm also adding a native event, which just measures the level-one cache misses. As soon as you've done that, you can call PAPI_start here — as compared to the high level, where you would call PAPI_start_counters, it's PAPI_start for the low level. Then you would perform your computations, and, again, in the end you call PAPI_read to read the counters. You can then print the values out, and just call PAPI_stop when you're done. And again, compiling is the same. So you can see the output that we got for those two events: the total number of instructions was close to the previous run, and the number of cache misses was 67.
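And the low-level flow, as a sketch — note the native-event name below is a placeholder; take a real one from `papi_native_avail` on your own machine:

```c
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int es = PAPI_NULL;
    long long values[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    if (PAPI_create_eventset(&es) != PAPI_OK ||
        PAPI_add_event(es, PAPI_TOT_INS) != PAPI_OK ||           /* preset event         */
        PAPI_add_named_event(es, "L1D:REPLACEMENT") != PAPI_OK)  /* native (placeholder) */
    {
        fprintf(stderr, "could not build the event set\n");
        return 1;
    }

    if (PAPI_start(es) != PAPI_OK)       /* low level uses PAPI_start, not _start_counters */
        return 1;

    volatile double s = 0.0;             /* placeholder computation */
    for (int i = 0; i < 1000000; i++)
        s += i * 0.5;

    if (PAPI_read(es, values) != PAPI_OK)
        return 1;
    printf("total instructions: %lld  L1 misses: %lld\n", values[0], values[1]);

    PAPI_stop(es, values);               /* stop and collect the final values */
    return 0;
}
```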

73:03 So that was pretty much all about how you can do a simple usage of PAPI. Now, imagine that you have, let's say, 1,000 or 2,000 lines of code, and you are also using multithreading and all sorts of funny business. Adding these PAPI calls to your source code creates a large executable, obviously; it may also add more overhead, and it's not that easy to use. It's a little bit easier in the newer PAPI versions, but at least with the version that we have here, 5.7, it's not that user-friendly if you try to do it by hand. So, as Dr. Johnson said, in the next class we will see a tool called TAU — Tuning and Analysis Utilities. It allows you to keep your source code the way it is — that means just the computation, without all the PAPI calls — and it does these instrumentation operations automatically. So you don't have to worry about adding these functions manually in your source code. We'll see it in the next class, and that's pretty much it. So — any questions? Okay, so no questions.

74:36 Right about time — I'll stop sharing.

Professor: Okay. I will just point out — there are only a couple of minutes presumably left — so let me point out quickly that in the slide set for today there is a simple example of how to use PAPI. The simple example is just matrix multiply, which is commonly used for pretty much anything when you teach compilers or anything else. It's a very simple matrix multiply algorithm that hopefully everyone is familiar with: three nested loops doing the multiplication of the two matrices. On the left side is the standard form of the matmul, and on the right one, the only thing that was changed was the loop order — that's the only thing. And the point is to show that that can have an impact on performance, and how you can get PAPI to give insights — that's what I wanted to show with this. So the leftmost version is known as the inner-product form: basically, a row of matrix A times a column of matrix B. Whereas the other version is kind of a scaling of columns: you scale columns to compute C column-wise — that is, mathematically, what's happening.
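In C terms, the two variants look like this — a sketch, not the exact slide code; whether the fast version works on "rows" or "columns" depends on the language's array layout, but the idea is the same: only the loop order changes, and with it the memory access pattern of the innermost loop:

```c
#include <stddef.h>

/* ijk, the "inner product" form: element c[i][j] is a row of A dotted with a
 * column of B; the b[k][j] access strides through memory in row-major C. */
void matmul_ijk(size_t n, double a[n][n], double b[n][n], double c[n][n])
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* ikj, the reordered form: same arithmetic, but the innermost loop now walks
 * b[k][j] and c[i][j] contiguously, which caches far better in row-major C. */
void matmul_ikj(size_t n, double a[n][n], double b[n][n], double c[n][n])
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++)
            for (size_t j = 0; j < n; j++)
                c[i][j] += a[i][k] * b[k][j];
}
```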

76:52 I kept this slide in, and it just points out what happens and how PAPI can be used. On the right-hand side you see the particular events that you get information about from PAPI, using the particular PAPI command, and then there is the outcome for the matrix multiply using the inner products, and, for the other one, the reordered-loop version. And if you look at it, you can see that the time went from about 13 seconds, or something, to about three seconds — basically a factor of four improvement in performance — and the instruction count is about the same. Nothing changed in terms of the instructions being executed, but the time changed by a factor of four. And then you can look at what's shown below, in terms of instructions per cycle, or, if you look at the very bottom, in terms of the data cache behavior. Instructions per cycle went up from about, you know, 0.35 or something to about 1.7, and the total number of cycles went down, which should correspond to the change in time. And I guess on the next slide you can get the details in terms of the cache behavior. The point in this case is that the data cache request rate in fact went up, but the miss rate went down substantially — from about 0.3 per instruction to about 0.007. So this gives a little bit of insight into what happened: the cache behavior got a lot better. The number of accesses is about the same, because the number of operations didn't change — the multiply-add embedded in the innermost of the three nested loops, and the loop bounds, are the same; the same number of operations is being executed. But the memory behavior is considerably better. So this is kind of an example of the insight you can get from using PAPI. And I know my time is up, so I will stop with that, but I encourage you to look at some of the usage examples for PAPI that are in the slides. So now I'm open for questions, if you want.

79:30 Student: I had one quick question. Um, what exactly is the feature size of, like, the CPU — the feature sizes?

Professor: Let me make sure I understand what you said. What do you think of as features? Because to me...

Student: To me it's the cache — the data cache lines, and the other things.

Second student: I believe he is asking about the transistor feature size — the numbers, in terms of silicon.

Student: Yeah.

Professor: Okay, so I think the answer, if I remember correctly, is 14 nanometers, right?

Student: So is it just measuring, like, the smallest piece of silicon that is on the chip, or...?

Professor: Right. So the feature size is kind of the minimum feature size from a manufacturing point of view. It is sometimes referred to as the process node, given as the footprint of a transistor in terms of nanometers. It also tells you a little bit how wide the silicon wires are on the chip. So, in that sense, it tends to be the minimum width. There are other characteristics — stuff like spacing between layers, and that's a lot more detail — but this is basically the extent in the horizontal dimension, in both directions. With these features you can build sort of the smallest possible device, if you wanted to.

81:11 And it's related to — you may know it or not, but there's chip lithography, which is basically photographic technology used to expose silicon, and there's all kinds of trickery in how you actually make the pattern and the imprint on the piece of silicon. Depending upon what technologies are used, that limits the smallest feature size you can make in terms of the horizontal extent. And it is, so to speak, related to the wavelength of the light you use for the exposure. That's why the state of the art today is that one needs to use extreme ultraviolet — because the feature sizes are at the scale of one wavelength of the light that is used to shine the pattern onto the silicon. That's where it comes from.

Student: Cool. Thank you, Dr. Johnson.

82:19 Professor: Okay — you're welcome; that's a good question.

Student: So I had one more, if that's okay. Um, I believe it's the very last one on the homework. It says that for each of the benchmark functions, seek to develop a model of performance as a function of the data set size. Would you mind elaborating a little bit on that one?

Professor: Okay. Um, so the —

Student: I mean, the first thing I thought of was, for example, some of the benchmarks might have thresholds at different sizes. Like, for the stream one, I was noticing that it makes, like, an upside-down parabola after you reach a certain size that coincides with the cache. Is that the type of analysis that the question is sort of asking for, or...?

83:24 back to my first. I guess or three pieces. Whereas I know

83:29 application or your systems or your So I think it comes from matrix

83:36 Thing is that the number of arithmetic of the workload in the first place

83:42 to end cube since we use square . And the other part is that

83:51 you're supposed to use a single Fred , um so then it's what is

83:59 max capability off the floating point performance a single core. And,

84:07 then you can kind of have two models as well, if it is

84:15 um, basically single functional units or kind of what people can do scale

84:23 Or it can basically do one multiply for cycle. So then your model

84:31 be. You know, this is much works, and here's the capability

84:36 the hardware. So my expectation if it's a truly 100% efficient

84:41 this should take this amount of time the other kind of extreme. And

84:47 the you stampede to that, the of the skylink it can actually do

84:54 to double position floating point operations in single cycle on the single core.

85:01 then you get the totally different uh, number of operations per seconds

85:07 you can have a model for predicting time if it's 100% efficient. So

85:13 was the kind of model I had mind that you can then set the

85:18 . Or what time should it have and compared to it the time it

85:23 took? And then you can get idea. Was the cold really

85:29 Or was it yes, using a small fraction off the actual capability off
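Written down, the kind of model meant here is — with the peak rate being whatever you assume for the core, scalar multiply-add versus full SIMD giving the two extremes:

```latex
W \approx 2n^3 \quad\text{(flops for an } n \times n \text{ matrix multiply)}
\qquad
T_{\min} \;=\; \frac{2n^3}{f \cdot (\text{ops per cycle})}
\qquad
E \;=\; \frac{T_{\min}}{T_{\text{measured}}}
```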

85:36 As a norm — and this is why I mention it — I think I had one slide that showed that, when it comes to, like, matrix multiply, well-tuned code can get like 98, 99% of the peak performance. On the other hand, a typical application code, even a good one, may only give you 3%, and much code doesn't even do a tenth of that, in terms of good use of the hardware. So that was what was behind that question, and, yes, it wasn't really elaborated, but that was the intent: to get you to think about how well the platform is being used.

Student: Okay, so it's more along the lines of relating it to the theoretical peak?

Professor: Yes.

Student: Okay, great. I was going to do curve fitting to what you measure.

Professor: Yeah — for sure.

Student: All right. Thank you, Dr. Johnson.

Professor: You're welcome — that's a good approach too. Yes — so, a lot of the assignments are basically stated to, again, try to foster you to have a model of what good performance would turn out to be, in terms of what you observe when you're measuring. So you have an expectation of what would be good, and then you compare it to what you actually observed. Any more questions? If not, I'll stop recording.
