00:00 Okay, so a couple of points before we start. The first point is about the measurements. If you notice that the run time of your code is less than a millisecond, which, as the professor just mentioned, is the sampling resolution for RAPL, then try to wrap the computation part of your code in a loop and run it for multiple iterations, so that the total time is at least larger than ten or twenty milliseconds and you don't get

00:44 issues with your measurements. Otherwise, if the run time of the code is smaller than the sampling resolution of RAPL, your measurements may not make much sense when you try to analyze them. So wrap it in a loop and then take the average to get the power consumption for one iteration.
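A minimal sketch in C of this loop-and-average idea, assuming a placeholder kernel() for the computation under test and leaving the actual RAPL readout to whatever mechanism the assignment provides:

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the computation you actually want to measure. */
static double kernel(void) {
    double s = 0.0;
    for (int i = 1; i <= 1000; i++)
        s += 1.0 / i;
    return s;
}

int main(void) {
    const int iters = 100000;   /* chosen so the total time is well above ~1 ms */
    struct timespec t0, t1;
    double sink = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* read the RAPL energy counter here (start) */
    for (int i = 0; i < iters; i++)
        sink += kernel();
    /* read the RAPL energy counter here (end) */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("total %.6f s, per iteration %.9f s (sink=%g)\n",
           total_s, total_s / iters, sink);
    /* average power = (energy_end - energy_start) / total_s,
       energy per iteration = (energy_end - energy_start) / iters */
    return 0;
}
```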

01:09 The second point is about the problem with the N-body code. When you compile it using the C compiler wrapper and run it, in case you encounter a segmentation fault, try copying that instrumented code to your home directory; that usually fixes the problem with that code.

01:37 If you still find any trouble with it, just reach out to me and we can try to get it working. Any follow-up questions related to these points? Okay, then let's start to talk about today's lecture.

02:04 This is, I would say, for quite some time the last lecture on the hardware side of things; after this we continue moving toward the programming aspects, in terms of programming like OpenMP and MPI, and talk about software and algorithms. But the topic for today is to focus on power and try to understand what goes on and why things are the way they are in terms of the architecture of computers.

02:36 So the first thing is why power is an important aspect: it has become a design constraint for quite a while now. Prior to that, people kind of used Moore's law to get more compute, and things worked kind of wonderfully. It didn't work so well once this so-called Dennard scaling stopped working, and I'll talk a little bit about that today.

03:07 Then the other part is, of course, as was shown, I think, in the first or maybe the last lecture: the total cost of ownership for a computer system, server or PC or whatever you have. The cost of electricity, and if it's a data center also the cooling, is a cost that actually exceeds the cost of the system itself, and that is clearly a concern for any company. The other part is that the electricity is

not necessarily generated by so-called renewable resources; a lot of it is based on fossil fuels. So even though, depending on where you are on the globe, companies may have an option to explicitly buy clean energy, on average at least about two thirds of the energy comes from combustion, as opposed to wind and solar and thermal sources that are inherently clean. So it is a big concern for companies of the size of Google and Amazon, and they take various steps, as I mentioned, to get clean energy and to be carbon neutral before too long.

04:40 In a way that is just to explain the cost: cost is important, and the cost of electricity is increasing, but that doesn't explain the whole thing, so I'm trying to focus on the other parts, beyond the electricity bill, for the reasons why power matters.

04:50 And here is, I guess, another part of why energy is an important aspect: computing is an increasing portion of the total electricity consumption globally. What used to be, even only 10 years ago, a very small portion of the electric energy consumption has grown, and in some scenarios it is projected to reach about a third or 40 percent in about 10 years or so. It's not necessarily a bad thing, because computing tends to replace other, more energy-consuming activities, and in that sense it's not a bad thing. On the other hand, it's something one needs to pay attention to, to both reduce the cost and the environmental impact of computing and information technologies in general.

05:57 That said, I want to move on. This is just a reminder, a little curiosity: what just a PC may consume over its lifetime.

06:09 So, power is a limiting factor in terms of design. This is kind of an old graph; as you can see, at the bottom are the years, and it covers the period up to about 15 years ago. During the time covered by the graph you can see that the heat density of computing chips approached basically the same intensity you have in a nuclear reactor, and that probably gives you some idea that this type of growth, which was exhibited until about 15 years ago, is not sustainable.

06:55 This slide also relates the heat density on the processor chip to the hot plate you use for, you know, making tea or coffee or similar. The cooking plate is considerably less heat dense than computing chips. So they got to the point where one could not necessarily find cooling technology to actually prevent the chips from melting.

07:29 And there's kind of another graph: on the horizontal axis now, instead of years, is the clock frequency of the chips, whereas the previous slide had the years and the process technology being used, that is, smaller and smaller feature sizes over the years. The vertical axis is still the heat density, and various processor names are marked. The slide is something you may recognize: the 4004 was one of the first microprocessors, and as you go up to the right there are more and more recent processors, even though this graph also only covers up until about seven, eight, maybe ten years ago.

08:16 Part of the reason is obviously that chip designers want to get more and more use out of the silicon, and part of it was to increase the clock frequency, and increased clock frequencies contributed to the exponential improvement in performance that users and industry got used to and came to expect. That has actually been a critical basis for the entire industry: if the performance of a chip doesn't improve, what is the point of buying a new one?

09:03 It is also interesting that in the lower left corner you can see what the brain does. Your brain is, in terms of the computational rate of its neurons, incredibly slow compared to current chips, but it's also incredibly much more energy efficient than anything we build today.

09:26 Here is a little bit of the same picture again, now in terms of clock frequency over time instead of power consumption. As you can see, frequencies leveled off at about 2005, and again the reason for that was that the heat density was not sustainable, and there is another reason as well, which is this Dennard scaling that I will mention.

09:54 As I said in the beginning, you can see that the first few years here show the start of the multicore era. IBM was one of the first companies that came up with a multicore processor, then AMD came a couple of years later, and another couple of years later Intel also joined the club producing multicore chips. The point was essentially to try to preserve the exponential improvement in the performance of a chip when the clock rates could not go up. And the reason clock rates could not go up is the way the CMOS technology works. So this is the famous square law

that I would say anyone who does computing should be familiar with, or at least know of, because it's the fundamental rule for anything being done with CMOS in terms of computing.

11:02 The square law is this thing where P is the power consumed by any form of circuit-switching device, transistors in particular in this case, and it is proportional to the capacitance, to the square of the voltage, and to the frequency. That is known as the dynamic or switching power; the switching is the thing that actually changes state, so the actual computing is in this dynamic piece. Then there is a leakage power: transistors and capacitors are kind of leaky and lose charge, as I mentioned when we talked about the DRAM. And then there is the cooling part, which can be modeled as proportional to roughly the fourth power of the voltage. Well, that's a separate issue.
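In symbols, the relationship being described is roughly the following (written in the usual CMOS notation; the exact form of the leakage and cooling terms on the slide is not reproduced here):

$$P \;\approx\; \underbrace{C\,V^{2}\,f}_{\text{dynamic (switching)}} \;+\; \underbrace{V\,I_{\text{leak}}}_{\text{static (leakage)}}$$

where $C$ is the switched capacitance, $V$ the supply voltage, $f$ the clock frequency, and $I_{\text{leak}}$ the leakage current.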

12:07 Then there are a couple of diagrams on this slide. One is on the left-hand side: three curves, marked by three different kinds of symbols, for three clock frequencies, where the lowest one is the lowest clock frequency. What is on the vertical axis is the power dissipation for running Gaussian elimination, solving a dense system of equations, and it basically shows that the higher the clock frequency, the higher the power dissipation, as predicted by the square law at the top of the slide.

12:57 On the right-hand side, it shows the fact that the clock frequency has a relationship with the voltage as well. In that case there are a few different chips being illustrated, and how the clock frequency can be controlled via the voltage on the chips. But the point is, taking the physics altogether, in a way the power consumption is more than proportional to the clock frequency, because you also have to work with higher voltages, so in the end the power grows quite quickly. In principle, as on the left side, you can separate them to some degree; one can control the voltage and the clock frequency separately, but in general, as a design principle, they are tied.

These are just a couple of quotes, because Intel was kind of caught by surprise, and they show how serious it was. That is why we got multicore chips. In fact, Intel had to backtrack: they had everything ready to ship the product, and it turned out it was so hot that they couldn't really cool it, and it became infeasible. So they decided not to ship that chip and had to go back to the drawing board, and that's why they ended up coming a bit later than the other processor vendors with multicore chips.

14:36 So here is a little bit from one of the Intel fellows; a fellow is literally a position for very exceptional scientists within Intel, so it's kind of a very prestigious position, and this is one of them. What I want you to focus on in this particular slide is the right column, which says energy per instruction, EPI. Each row of the table is a different generation of, at that time, what were known as Intel processors by the names Pentium and Pentium Pro and so forth.

15:27 In the right column you can see that the energy per instruction grew from the first one listed, the 486, to almost 59 nanojoules per instruction for the later ones, roughly a fivefold growth over a few generations of Intel processors. At some point they realized this is not sustainable, and they had to go back to the drawing board and start to do designs differently in order to still get the exponential improvement in performance.

16:02 Here's another slide, about Intel processors again, more or less showing the same thing. I show this because it has one curve, the kind of orange-colored dashed line, that is initially declining as you move to the right and then, at the very far right of the diagram, trends upward. That curve shows the performance per watt, which has now become one of the guiding principles for judging the quality of designs. You can see that over a number of generations of CMOS technology the energy efficiency actually went down, and again it got to the point where it was not sustainable to do business that way.

17:06 Then a number of the features that were used to make programming easier, like out-of-order execution and a number of other features, ended up being taken out in order to improve performance per watt. That is what happened in terms of the multicore and many-core designs: Intel, in this particular case, went back to a much earlier core design with fewer features in it, in order to preserve the same kind of exponential improvement for code that has parallelism, by using more cores instead of trying to push the clock.

17:55 So this is kind of what happened: Moore's law gave a lot more transistors with each generation of CMOS technology, and that was used to introduce all kinds of nice features, from a programming point of view and also from reliability and other respects. But it did hurt the energy efficiency, and again, the same principle had to change. So I guess I can stop here, having talked a little bit about why these changes happened in terms of energy consumption and what the rules are today, as opposed to the first 30 or so years of Moore's law.

18:53 I had a question, or a clarification, to make sure that I understand. You said that, in the course of history, we were steadily following Moore's law up until we reached a wall, at which point we solved it by introducing multicore, which decreased the efficiency overall?

19:09 No, no, it didn't decrease the efficiency. All the way until about 2005, and we will talk about it in the next several slides, in addition to Moore's law there was something else that worked, this so-called Dennard scaling. That meant that, in addition to getting more transistors on the die, you could also increase the clock frequency, at constant power on the chip. When this Dennard scaling stopped working, that was no longer true, and then the chip power started to increase beyond what was sustainable. As before, when the computer architects added more features that consume power, the energy efficiency for doing, say, arithmetic in particular went down. I can come back to these questions once I have talked about the next several slides, and we can see if I have answered them at that point.

20:31 Okay, thank you, Doctor. Sure, you're welcome; it is a good question. So I'll come back to it a little bit when I talk about why

the DRAM is slow. So the thing is, again, all the computing is just moving and storing charge, and basically it is captured by this little simple RC circuit.

21:01 What Dennard scaling kind of implied is what is on the left-hand side with the text here. Basically, if you work out the electrical equations, the energy per switching event is proportional to the voltage squared and the capacitance. In fact, the transistors that are the switching agents in computing act as capacitances as well; they switch on and off, but it's basically charging the plates of the transistors that changes the state. So the switching energy is proportional to the voltage squared and the capacitance, and the power is then the consequence of how frequently you charge and discharge this capacitor. So the power becomes proportional to the frequency, and you have the dynamic power part that is capacitance times voltage squared times the frequency.

22:09 have the dynamic energy part that is both the square times, the

22:17 Now the typical thing that waas and sea more scaling technology in every generation

22:29 sizes got the butt down to about of 700.7 or the previous generation.

22:37 if you take it as a square of chip, that means you get

22:44 square. So that means of roughly factor tomb or transistors in the

22:49 area. So that's the double the of transistors. But that also means

22:56 a capacitor parked a little plate. also gets basically half area, so

23:03 means first, the power or energy transistor gets down by half. But

23:12 you double the number off transistors. then you're back and having the same

23:18 by, uh, making things smaller the same time as you double them

23:23 the same area. But when this North scaling worked, you could also

23:30 the clock frequency by about 40% for chip generation. So in that,

23:37 total, for any given ship we've got almost three times the performance

23:44 these scaling laws. And again, was used to introducing more and more
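As back-of-the-envelope arithmetic, the classic constant-field version of this argument looks as follows (the voltage step is implied by the constant-electric-field point made a couple of slides later, so treat the factors as approximate):

$$\begin{aligned}
&\text{feature size}\times 0.7 \;\Rightarrow\; \text{area per transistor}\times 0.7^{2}\approx 0.5 \;\Rightarrow\; 2\times\text{ transistors per unit area},\\
&C \to 0.7\,C,\quad V \to 0.7\,V,\quad f \to 1.4\,f \;\Rightarrow\; P_{\text{per transistor}} \propto C V^{2} f \to 0.7\cdot 0.7^{2}\cdot 1.4 \approx 0.5\times,\\
&\text{so power per unit area stays roughly constant while per-chip throughput grows by about } 2\times 1.4\approx 2.8\times.
\end{aligned}$$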

23:50 And again, that was used to introduce more and more architectural features on the chip at constant power. On the right-hand side of the slide, first, there is Moore's law. That, again, is not about performance, it's about transistor density, but that is the doubling every 18 to 24 months or so. That used to be the case; it is no longer quite true, because it has become significantly harder to do things. The fact that things are significantly harder is kind of reflected in the bottom-right diagram, which shows that the cost of a transistor used to go down significantly. That is why the chip cost to the end consumer ended up being pretty much constant over the years: the cost per transistor went down while the transistor count went up. But that hasn't quite been true anymore, so that's why chips sometimes now get more expensive, because the cost of a transistor is no longer dropping the way it used to. Just a curiosity, maybe, but a good thing to know and to expect.

25:07 So this is just a little graph to illustrate the same thing in terms of chip capability: on the left-hand side you got twice the number of transistors, and they ran faster, so you got almost three times the performance for each chip generation. And that worked as long as these scaling predictions held. The way the CMOS technology worked, the voltage was reduced so that the electric field between the plates of the capacitor stayed roughly constant. As long as you scaled things so the electric field remained constant, you could get this benefit of both increased clock rate and

constant power for the silicon area or chip.

26:10 The problem was what is known as the leakage power. As the gate oxide, the thing between the two plates of the capacitor, got thinner and thinner, the leakage power got higher and higher. If you look at the upper-left-hand graph, one of the two crossing lines, the one that is reasonably flat and not increasing very rapidly as you move to the right (which means reduced feature sizes, or time, on that graph), is the dynamic part: it did not grow all that much. But the leakage power increased significantly, so it became a limiting factor. That is why scaling the way it used to work stopped at about 2005, when leakage power became a problem; that is sort of the end of this so-called Dennard scaling, because of leakage power. As you can see, at that point the distance between the plates of the capacitor was no more than about five atomic layers, so it wasn't practical, even with better technology, to make it all that much thinner.

27:48 So the choice that appeared after Dennard scaling stopped working is this: either you basically double the capability, but then the power consumption goes up correspondingly (whereas prior to that you also gained in clock frequency at constant power, now raising it actually increases the power); or the other option is to keep constant power, and then the gain in capability is limited to a much more modest increase per generation instead of the factor of two to three. And as Bill already pointed out, reality isn't quite as good even at that. So that's basically what one can expect: only a modest amount more capability out of a piece of silicon per generation after the end of Dennard scaling. And that is because of cooling limits:

cooling technologies were really at their limit at about 2005, and not that much has happened since; things get hot. These big chips, in fact, today require liquid cooling of some flavor or another; air cooling is no longer good enough. So raising the power dissipation isn't really an option, and one is pretty much limited by constant power. And, as was mentioned previously, clock rates stopped going up for chips after about 2005, when this technology scaling stopped. Moore's law has still kind of kept working, but Moore's law is only about transistor density, not about performance, and even that is just now getting harder.

29:58 Here is kind of a little summary of the slides showing how things have evolved. The cooling capability limited the power dissipation, which reached its peak and has been pretty much flat in terms of power density. The power density, together with the lack of Dennard scaling, then forced the clock frequencies to pretty much level off, and that in itself meant that single-core performance pretty much also leveled off at around the same time. But again, Moore's law is kind of still working, so you can still get a lot more transistors on the chip, and the way those transistors are used, as the bottom curve shows, is to get more and more cores on the die. This is basically the way for the industry to keep giving people exponential improvement in capability and make it justified to, you know, buy new computers or new chips

31:22 the cooling capabilities. So what it out to be is you can't really

31:29 all the transistors on the chips to working and doing stuff at the same

31:34 , because then the chips would overheat you can call them. So now

31:41 there's this concept off dark silicon that just trying to signify that it's not

31:51 to actively use all of the transistors a chip at any given moment.

32:00 here is kind of another slide showing with technology and generations Ah, the

32:08 of chip that can effectively be actively in computing at any given time.

32:18 today technology are and the depending upon and others. But 14 is what

32:30 with introduces. They also use 10 trying to move to seven and other

32:38 , um, or foundries. They Nargis what's known as seven nanometer design

32:45 , but it's basically shows you that much today on about the quarter of

32:52 transistors on a chip is coming, maximum that can be used in a

32:56 time. Um, so um, a bit this has a consequences that

33:08 was trying to say here that part the reason why there are the size

33:19 cash is on chips have gone up particular the last level cash is that

33:25 are largely idle most of the so that is a helpful thing in

33:31 to keep the power consumption down. , of course, it also helps

33:36 improving the performance. So going forward one chemist certainly expect that,

33:45 cash is will keep growing as a count keeps increasing on the tips.

33:52 course, there needs to be a of one. Also sees the number

33:55 course growing, but course are relatively compared to the total ship area.

34:02 a large more than half of the area today tends to be cash.

34:10 This is just another pointer: anyone interested in this topic, I encourage you to go and look at something called the International Technology Roadmap for Semiconductors. They issue new roadmaps, I think nowadays about every other year, and even back more than five years ago they made the point that energy and power consumption is the key constraint moving forward in designing chips. And I think this is a good stopping point to come back and see if there are questions.

34:53 So the one thing I wanted you to remember is that in addition to Moore's law, which I'm sure all of you have heard about, there is also Dennard scaling, which you may not have heard about, but that was a big contributor to the exponential growth in performance per chip generation up until about 15 years ago, and it no longer works. Therefore the rules of designing chips have changed quite a bit, and that's why you see increased caches. And, of course, there is a balance with increased core counts: exponential improvement in performance is now only possible for applications that have parallelism, because single-core, single-thread performance is not growing the way it used to.

35:58 Now, I guess to give a bit more background information: where does the energy go? Of course, there is simply the CMOS technology, which is a charge-transfer technology, just moving electrons between transistors and storing charge for some time.

36:09 So here's a slide, and this is something I think everyone should be aware of, not only for understanding the technology and the rules for how things might change going forward and what computer architects are trying to do, but it also tells you something that is relevant for applications with respect to energy consumption and power. On the upper-left-hand side there is a table, and next to it a graph that illustrates the table, and what it shows is the energy for certain types of operations.

37:15 type. So if you have and kids, uh, and Versus is

37:22 and they it, it takes about the energy. 16 bit floating point

37:30 a little bit more expensive, but more complicated because in the floating point

37:38 have Manti PSAs and you have exponents you need toe line the operations in

37:46 to be able thio, add or things so it does become more

37:54 Then, uh, so it's a about the factor of what Uh huh

38:02 30 to be done in 5 32 . Rolling point out. That's a

38:05 of 10 ish. More or Um, multiplication is definitely more expensive

38:11 the ad, as you can because it's the 32 bits mouth

38:20 but fire for yeah, 3.5 times something. The multiplication is kind of

38:27 squared type operation compared to add, you know they expect the multiplication czar

38:33 expensive than additions, whether it's integral point now the big point and then

38:42 back to my stressing off memory and last lecture or two is that accessing

38:56 even S Thomas has said here, means cash practically is more expensive than

39:04 operations. It's even more expensive than 32 bit floating point multiplication. And

39:13 you need to go to the that means you need to go off

39:19 . It's more than two orders of more expensive. So that's why having

39:30 cold or algorithms being, um, using cash is effectively is incredibly important

39:40 terms of total energy consumption. yes, could you remind us is

39:48 Yes, could you remind us, is SRAM used for all the levels of cache or just the lower ones?

39:52 All levels of cache, yes. Well, there are usually three cache levels, and the chips that start to add a level four or beyond may not use SRAM; they may use embedded DRAM. As I mentioned, it depends on whether they need the fourth level, or levels beyond three, to retain information or not. SRAM is used for caches so that the data is retained, and for speed; that is the basis for why SRAM is used for caches.

40:38 Another question, and this might be out of the scope of what we're discussing. Typically, when we discuss, say, data structures and algorithms, we assume flat memory, right, which is a common pitfall. That being said, can we design data structures that are aware of the different levels of cache, and maybe even the type of RAM that's being used, to account for that, as opposed to, you know, just saying that a binary search tree is log n?

41:10 That is a good question. I'm not sure I

have a good answer. There are two elements to it. The first is the size: if the sizes of the data structures are relatively small, then it doesn't matter too much, because then hopefully they fit in cache. But the other part is the traversal scheme, how you access the data structure. That is very important even if the design has lots of cache, because a bad ordering, in terms of how you step through the structure, leads to excessive accesses to off-chip memory. So I would say it is to a large fraction about how you access the data structure: it's not just the static structure, it's the traversal that is important.
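A small illustration of the traversal point (an added example, not one of the slides): summing the same row-major 2D array row by row streams through memory and mostly hits in cache, while summing it column by column strides across memory and mostly misses, even though the arithmetic is identical.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 4096

/* Same data, same arithmetic; only the traversal order differs.
 * The array is stored row-major, so row order walks memory
 * contiguously, while column order jumps N*sizeof(double) bytes
 * between consecutive accesses. */
static double sum_row_order(const double *a) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[(size_t)i * N + j];
    return s;
}

static double sum_col_order(const double *a) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[(size_t)i * N + j];
    return s;
}

int main(void) {
    double *a = malloc((size_t)N * N * sizeof *a);
    if (!a) return 1;
    for (size_t k = 0; k < (size_t)N * N; k++)
        a[k] = 1.0;
    /* Time (or wrap in the measurement loop from the start of class)
       each of these separately to see the difference. */
    printf("%f %f\n", sum_row_order(a), sum_col_order(a));
    free(a);
    return 0;
}
```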

42:31 Unfortunately, the standard programming languages don't have control mechanisms for deciding how data gets allocated. This comes up every now and then in discussions about programming language design, but of course it goes against making source code highly portable, because the code then becomes dependent on specifics of the target platform.

43:10 The lower-left-hand-corner graph is trying to illustrate a little bit of the energy consumption for accessing data, and it basically points to the fact that the distance, the wire energy, is what matters: the SRAM cell itself consumes about the same amount of energy regardless of where on the chip it is, but the distance from where the functional units are is the culprit. That is where a lot of the energy goes, so the wire energy is a very important aspect. So again, writing cache-efficient algorithms, getting as much use as possible out of the level-one cache and out of the data once you have it in registers and so on, is an important aspect. We'll talk a little bit more about that: compilers can help figure out how to allocate things or how to order the sequence of accesses, and I want to talk about how to structure and organize the source code so that it is amenable to, or takes less transformation for, a compiler to do the right thing.

44:58 So I think that's what I wanted to say there: a little bit about what both computer architects and programmers can do, and then some very high-level software comments; the rest will be future lectures, as well as the algorithm part. And I'll talk a little bit more about power management, which is related to RAPL and the current assignment that you have.

45:35 So now, on to architectures. Again, the big data center operators as well as chip vendors like Intel and AMD, but also the labs, in this case Lawrence Berkeley, one of the Department of Energy's national labs, spend, like the Internet companies, a lot of money on computing infrastructure. And of course they want to spend their money on the science part and not on infrastructure. So they investigated their codes; a lot of their codes are physics and mechanics codes, but increasingly also chemistry codes.

46:28 What they did, as it says on this slide, was a few years back. At that time the x86 instruction set used by the computers of interest had a few hundred instructions, and today it's over 500 instructions, but they discovered that their codes would do just fine with 80 of them. Of course, support for the other 200-plus instructions requires area on the silicon and requires power. So their basic point is that the way to energy efficiency is to avoid waste.

47:20 That was, in fact, the principle used by one of the most famous computer architects in HPC, Seymour Cray, in the design of his computers: don't introduce anything you don't need. And this is the slide

that I think I may have shown in the first lecture, in regard to waste. On the purplish curve, depending on the degree of specialization and tailoring of the design to the application, you can get up to three orders of magnitude more operations per square millimeter out of your silicon by being application conscious in that way. And in terms of energy efficiency, operations per watt, there is actually one more order of magnitude: you can in fact get four orders of magnitude, or 10,000 times, more operations per watt out of the silicon than with a general-purpose design.

48:29 So there is lots of potential in specializing and not using a general-purpose type of design, and that's exactly what has been happening. It is happening at the moment; those of you who are interested in AI in particular know that AI is driving a lot of innovation in computer architecture, or chip design, at the hardware level, to support machine learning or deep learning networks. Part of it, coming back to what I showed a couple of slides back, is that if you can reduce the data sizes, there are some things with reduced precision that save a lot of energy as well as increasing performance. So that's what goes on in terms of chip designs for AI. Doctor? Yes.

So I assume there's a very rigorous pro and con cost analysis of the degree of specialization that we'd like to do, right? So what is the role of AI here? Is it to identify what we should and what we shouldn't specialize? Or what kind of patterns are we trying to capture when, for example, you know, Apple introduces a machine learning engine into their hardware as well?

50:05 So, let's see if I can give a reasonable answer, in terms of the computer architecture or the silicon design. First, I think the machine learning community has found, for a number of applications where it has been successful, such as image understanding and classification, and also speech recognition and translation, which have been quite successful applications for deep learning, that for many parts of both training and the inference part you don't need very high precision. For a lot of things it's good enough to use eight- or four-bit integers, and people have even played around with one or two bits.

51:25 So, to support those algorithms, and the applications that use them, that has influenced, you know, Intel and AMD and the others. Now it is mostly GPUs being used, because the machine learning algorithms are highly structured: many of them are based on convolutions that are not data

dependent. So you can use kind of streaming processors, with GPUs being a good vehicle for doing machine learning. That's why companies like NVIDIA, which was dominating the GPUs for these types of processors, introduced hardware, dedicated circuits, that support 8- and 16-bit precision, and I think they even now support four-bit precision. That allowed them to get a lot more application performance out of the same piece of silicon for the same power, and that has kind of moved the industry again and has been a good business.

53:05 Intel also, not having discrete GPUs, has, for chips for basically desktops and laptops, GPUs integrated on the die. In fact, in terms of the total number of GPUs, Intel is the largest GPU vendor, even though it doesn't sell them separately; it all comes on one piece of silicon where the GPU is embedded. But now, towards the end of this year, they will start to compete with NVIDIA and AMD in terms of discrete GPUs. What I was saying is that, because they were late in that, they also started to introduce reduced-precision, special circuits that are kind of focused on this. So I don't know if that answers the question. Yeah, I think so. Okay, thanks for asking; it's

good to help me elaborate on it.

54:27 So, this is just the same theme of specialization. Where power and energy have been a prime concern, which is mostly the mobile market, specialization has always been there; they don't use the standard CPU to do everything. There are GPUs, there are signal processors, there are encryption engines and media engines, so there is a diversity of specialized pieces on the same piece of silicon.

55:06 Now, to the sort of bigger picture of the concern for energy consumption. As I said, the big users, the Internet companies like Google, Facebook, Microsoft and Amazon: Google were the first, or were a bit earlier, in terms of these things, while Microsoft was more focused on software for PCs until they started to do cloud computing. But Google, which ran large data centers, the first thing they did was to design their own servers, because they didn't think the servers they could buy from IBM, HP, Dell, or whichever company, were energy and power conscious enough. So they started to design their own servers, and a few years later Facebook got to the same point.

56:05 They also started to design their own servers, and Facebook came and started a consortium, known as Open Compute, where they publish designs that are more energy efficient than what the standard platforms used to be. Of course, Google and Facebook are sufficiently large customers that they don't need to go to a custom shop to get their stuff built; they just tell HP, Dell and IBM what they want, and they get it. Good for them. But it's just to point out that these companies, which run very large data centers with the giant electricity bills that I mentioned before, took things into their own hands, and they started by doing the designs at the server level. Okay. And then, after having done that for a few

years, they got more ambitious, because the silicon wasn't really up to the power and energy standards that they wanted. So Google, as some of you know, being interested in machine learning, you know of their Tensor Processing Unit that they went off to design, and they're now on the third generation of their TPUs. They don't sell them as such; they use them in their data centers. Part of their motivation for doing it, in concrete terms, as I remember it, was that they would have needed to more than double the size of their data centers if they had continued with business as usual. So it's a substantial element in their strategy, and I'm sure they will continue to invest in it.

58:10 Microsoft wasn't quite as ambitious, in the sense that they did not quite design their own chips, but they use field-programmable gate arrays to implement support for their search engine and also for some other image and analysis services that they support on their cloud. So again, standard processors didn't measure up in terms of performance and energy and power, and they took the situation into their own hands.

58:48 And this was just another example that I mentioned before, the vision processing unit that also ended up being used for machine learning: Movidius was an independent company bought by Intel, and they put this into these Compute Sticks that you can now buy for about 100 bucks, with a USB attachment. That added yet another design, aimed at, you know, smaller and more constrained computing. To make it brief, these are examples to show that the big players with sufficient resources said that business

as usual doesn't work, and a lot of the driving part is the power and energy consumption of the standardized designs. And I should also answer: there was another part of the question that comes back to me now. Because the design tools have improved, the volume of chips that you need to sell in order to make some of these more application-specific designs pay off is smaller than it used to be, and with two or three orders of magnitude in performance and energy gain, the business case is kind of there for doing some more specialized designs. That's why there is now a sort of proliferation of chip designs that are feasible. Whereas from about 1980 or 1990 until maybe ten years ago a sort of convergence happened, because, given the cost of doing business, you needed high volume, and that's why, you know, Intel and others introduced more and more features on their chips, to be able to support a bigger and bigger market and get more revenue to support the chip design.

61:10 At the time when Dennard scaling worked, that worked. But once Dennard scaling stopped working, the economic rules changed, as well as the design tools improving. And this is the other thing I talked about in terms of the memory part today: there are also energy savings from tighter integration of memory and the processing part.

61:42 Now, a few comments about software. We'll talk much more about software issues in the remaining part of the course, but this is, I guess, the preamble to the next few lectures, more so than the rest of this one. It shows the huge spread in performance, in this case measured as the time for doing the task, so it doesn't reflect the energy directly, but the energy is closely related to the computing time; we will talk about that, I guess, at the end of the lecture.

62:26 It basically shows that Python is a good thing in terms of programmer productivity; I think most people agree on that, but not everybody agrees when it comes to efficiency in terms of performance and energy. When it comes to productivity, though, some advocates think it's not that bad. But as you can see here, there is kind of four orders of magnitude and more between fairly plain Python code and highly optimized code for this particular example, which is something that benefits from a high degree of optimization, and it's not too hard to do the optimization for it. So how you write your program, and in what language, is in itself a big factor. If one is interested in efficiency, one needs to think twice or three times about not moving beyond Python. But we'll talk about the optimization techniques, starting with kind of vanilla code and then addressing how to improve it, in coming lectures.

63:53 Now, a little bit more about the power part. First I'll show a couple of slides just to make everybody aware of the big picture, and then talk a bit about measurement, but also about how you can control the power and energy consumption from a hardware and software point of view, not an algorithmic one; that we will talk about later.

64:16 So here is kind of a typical setup in a data center, not necessarily your own personal computer. This picture is from a Facebook presentation, and in this case it's serious power coming into these data centers; this is a fairly big one, 30 megawatts, and that's comparable to several thousand households, tens of thousands of households, a small town. That's what a big data center amounts to in terms of power.

64:55 At the various stages in this power distribution, getting in from the power grid down to the servers at the bottom, there are several points where the power gets, quote unquote, transformed from the high voltage of the power grid down to, sort of, the voltages the chips need, and at each of these stages there tend to be both control points and certainly measurement points for the power and energy consumption. Here we're kind of dropping down to once you get onto the circuit board and then onto the chip; I'll talk more about that, this is just to give you the picture. RAPL, as you know by now, or as was mentioned, is at the chip level, so we don't quite have access to measurements at the other levels. Unfortunately, system administrators don't tend to make those accessible, for, I guess, security reasons, but maybe someday they will figure out a nice interface where one can also get interesting power numbers at both the server level and the board level and relate them to application-level power consumption. So here is a little bit of a summary of the devices and their levels of precision.

66:35 At the rack level there are what are called power distribution units. They take something that is, say, 400 volts or up, or maybe sometimes more, down to something that the individual servers handle; the exact voltages differ, also in the US. And you can see the sampling rate is not all that high, a few seconds, and the accuracy is not great in terms of absolute watts. On the other hand, it is quite a lot of power going through a rack; a rack may be on the order of up to 100 kilowatts today, so that is not an unreasonable relative accuracy. Then, of course, you have watt meters and other instruments.

67:38 As I mentioned, at the board level there is something called the Intelligent Platform Management Interface, IPMI, that is in every server, and if you have your own server you should be able to get access to it if you're interested, for information about fan power and board power and a bunch of other things that are outside the chip. Then there are things like power strips; some of them, the better ones, are able to also report currents and voltages and power values.

68:22 At the chip level, there is RAPL, which you are using for your current assignment. And as I mentioned, the sampling rate is about one, sorry, not one hertz, one millisecond, that is, 1000 samples per second.

68:46 It was actually Intel that was ahead of AMD in introducing it, and as Josh mentioned last time when he talked briefly about it, RAPL was meant to manage power consumption in the data center. The L is for limit: it basically implements a policy of a running average power limit. It has a window of time over which it determines the average power consumption, and then it has heuristics to figure out whether you have room to dissipate more power, or it does forecasting of what is going to happen in the next window, since there is inertia in the system. So it is a fairly sophisticated control, but it can also be used numerically to get some idea of what the power consumption is; it gives a little bit of help in getting insights. And by now AMD also has the same kind of features for their processors, and there are corresponding things for GPUs as well.

70:01 This slide is, I think, pretty much what I said: the update frequency. One point is also the resolution in terms of power; there has at times been confusion, and it may still be, but most sources say the resolution is about 15 microjoules in terms of the discretization that happens in RAPL. It is unfortunately very hard to find exactly where this is well documented, so it's a bit of an estimate, but other sources don't claim very different numbers. So it's a good thing to keep in mind, both the time resolution and the quantization in terms of energy resolution. Then, it can only monitor a limited set of domains.

71:13 One of them is what's known as the package domain, which is the whole CPU package, kind of the sensible thing: the thing that you plug into the socket on the circuit board, the whole die. Then there are a couple of different other power domains. One is PP0, which is all of the cores, and then there is the rest of what is on the die, which Intel calls the uncore. But it depends on the intended use of the chip whether that is the feature you get or whether it is used for something else. So it's basically three things you can get: you can get the package, you can get all the cores together, not individual cores, and then you can get one more thing, either the memory or the uncore. The next slide just tries to point out the same: package is the whole thing, PP0 is just the cores and their associated parts, and the third option is one or the other; you cannot choose, it comes determined by the chip you're actually using.
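For a concrete picture of what these domains look like from software (an added illustration, not from the slides): on Linux, one common way to read the RAPL counters is the powercap sysfs interface; which domains appear, and whether you may read them without root, depends on the CPU and kernel, so treat this as a sketch.

```c
#include <stdio.h>
#include <unistd.h>

/* Minimal sketch: read the package-domain RAPL energy counter twice via the
 * Linux powercap interface and print the average power over the interval.
 * Paths and permissions vary by system, and the counter wraps around,
 * which this sketch does not handle. */
static unsigned long long read_uj(const char *path) {
    unsigned long long e = 0;
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%llu", &e); fclose(f); }
    return e;
}

int main(void) {
    /* package 0; subdomains such as cores or dram appear as intel-rapl:0:0, ... */
    const char *pkg = "/sys/class/powercap/intel-rapl:0/energy_uj";

    unsigned long long e0 = read_uj(pkg);
    sleep(1);                       /* or run the kernel you want to measure */
    unsigned long long e1 = read_uj(pkg);

    printf("average package power over ~1 s: %.3f W\n", (e1 - e0) / 1e6);
    return 0;
}
```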

72:44 Now, how do you control this? Any questions on that? Okay, so next is how you actually control and manage the power, and we'll cover that until the time is up.

73:02 So, way back, Google again, as I said, were the first to start to design their own servers; they were not happy with what they could buy. Part of it was that they were pushing for what they call energy-proportional computing. The left-ish graph here shows the typical workload pattern at Google: there are high peaks, and it is by no means uniform. So in that case, having energy-proportional computing could save them a lot of energy. The upper-right graph shows that the power consumption is not very well related to the load: even if the work went down to nothing, you would still have more than half of the power being burned.

74:00 So what do you do to control it? There are two concepts that you want to be familiar with. One is clock gating and power gating, for kind of a first-level, or coarse, level of control, and I'll talk a little bit more about that. Then there is a bit more refined control that is known as DVFS for short, dynamic voltage and frequency scaling. And in order to make some order in the chaos, there is actually a standard for doing this; it is known, for short, as ACPI. I won't try to cover those in the last few minutes here.

74:45 So the thing is that, in the good old days, all voltage conversion was done off chip, and there was just a single feed for all the pieces on the chip. However, 15 to 20 years ago people figured out how to also do part of it on the chip, and that enabled separate power domains on the chip. Today, as a minimum, each core has its own power domain, and then there tend to be several domains for what is, quote unquote, the uncore that chips also have. Those are things that can be individually controlled in terms of voltage and on and off, and that is being used.

75:33 In fact, that's Bill's picture, just to illustrate it: in order to do that, the chips have separate processors on the die that actually do the power management of the chip. That's why, for instance, you can very well get varying performance when you do benchmarking, because there is an independent controller that manages clock frequencies and power to the chip in order to keep the chip safe, and it may also be controlled by whatever power limits system administrators have set for maximum power dissipation at the rack level. I may not get to it in this lecture, but at the end of this slide deck there is a case study from Facebook that you may find interesting, on how they actually manage the power consumption down to the chip level. This is just a picture of the effect

of trying to manage power consumption, from an old IBM processor: on one side, things not using any form of power management, compared to the dark regions on the other. I will skip this slide at the moment, as we are running out of time, but I encourage you to look at it; it's fairly self-explanatory, showing both the things that are constant, independent of how many cores are on, and how the power increases with the number of cores that are on on the particular chip, for a couple of different benchmarks, some of them compute intensive and some memory intensive.

77:29 But let me talk about this one, and it comes back to the square law again. The point is that you can potentially reduce the clock frequency without correspondingly reducing the performance, so you can gain energy without losing too much performance by using dynamic voltage and frequency scaling. Here is just an example for a particular set of benchmarks; it's an old processor by now, but it shows a little bit, in terms of the percentage gains and losses in performance and energy, and I encourage you to take a look at it. But here's a little bit of what actually happens under the hood on, for example, an Intel

Skylake processor. It shows kind of two different scenarios of how things are controlled by firmware in these processors. The first is the red dotted line that goes upwards toward the right; it basically shows the power dissipation as a function of the clock frequency of the chip, which you can control, well, the firmware controls it; there is a little bit of trickery one can use to control it, and it can be done, but typically it is done by this control processor on the die.

79:03 Then there are other curves: the dashed line that goes down says that, for an application that is CPU limited, the faster you run it, the less time it takes, and in that case you may actually save energy even though you burn more power, because the time gets reduced more than the power goes up. That is known as the race-to-halt strategy, and that's the way many of these controllers work: they try to maximize clock frequency to reduce time and minimize energy.

79:47 On the other hand, if the application is memory or bandwidth limited, then even if you raise the clock frequency on the cores you don't necessarily reduce the execution time, and that means you burn a lot more power without gaining a reduction in time, and then the energy consumption goes up. So what this algorithm that is implemented in the Skylake processors does is try to find the optimum point, based on sampling of various registers in the processor during run time, and find the sweet spot in terms of the clock frequency of the CPU.
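As a software-side view of the same knobs (an added illustration, not from the slides): on Linux, the cpufreq sysfs interface exposes the governor and the current and maximum frequencies per core; the paths below are the standard cpufreq ones, but what is available, and whether writes are allowed, depends on the driver and on privileges.

```c
#include <stdio.h>

/* Print a few of the standard Linux cpufreq sysfs entries for CPU 0.
 * Writing scaling_max_freq or scaling_governor (as root) is one way
 * software can constrain DVFS; here we only read. */
static void show(const char *path) {
    char buf[128];
    FILE *f = fopen(path, "r");
    if (!f) { printf("%s: not available\n", path); return; }
    if (fgets(buf, sizeof buf, f))
        printf("%-60s %s", path, buf);
    fclose(f);
}

int main(void) {
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor");
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq");  /* in kHz */
    show("/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq");  /* in kHz */
    return 0;
}
```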

80:33 I guess at this point my time is up, so I'll cover a few more slides at the beginning of the next lecture. But I will stop, since my time is up, and ask if there are any questions on this part. Okay, so if not, next time I will spend a little bit of time talking about how these controls work in terms of frequency control; we're probably talking 10 to 15 minutes at the beginning of the next lecture, and then I'll start talking about OpenMP.
