© Distribution of this video is restricted by its owner
00:00 Mhm. Yes, it is time. There are not many that have checked in yet.

00:24 Let's start anyway at this time.

00:31 So I'll continue to talk about memory today. The last time I was

00:37 talking about caches and a little bit about how memory is integrated into a server, and

00:49 today I'll focus more on main memory itself, the properties that are important to understand

00:57 fully. Mm, so that's it; you can see this as

01:08 part one, and don't forget the part two: part one is about, basically,

01:12 main memory in itself, and then, if I get to it, there is more about

01:19 power management of memory systems. So the plan is to talk about the design,

01:26 then go through the various stages, and try to summarize, as I've done

01:32 before. And this is part two, as I said, if I get to it:

01:39 how to control power consumption. So now, the memory. This is just a figure

01:46 showing a little bit of the difference, I think, in some sense, the

01:53 difference between memory chips and processor chips. If you go back and look at what was

02:00 said about processor chips, they tend to be in the range of 500-700

02:08 square millimeters, and in this case here they are considerably smaller, I think.

02:20 So this figure shows, on the horizontal axis, three different

02:26 memory manufacturers, and then for each one of them the successive generations of chips,

02:38 and what this chart tends to show is that they are considerably smaller.

02:44 As you can see from the left axis, they're about 40-60 sq mm,

02:51 so they are more than 10 times smaller than your typical processor chip. And that

02:57 in itself has certain consequences in terms of how you can actually work with these chips.

03:05 The good news is that memory tends to be very cheap. Part of the reason

03:11 for that is that the chips are so small, and over the years the manufacturing process has

03:16 been driven, usually, considerably harder than it has for processor chips. So that helps, I think,

03:24 very much, and you can also see it in terms of the number of bits, the density, of these chips,

03:29 which is pretty impressive. So if you look at the most recent generation chips, they

03:37 are about 100 megabits per square millimeter. So that means you get a typical chip

03:44 of, you know, even up to several gigabits on a single piece of silicon that is 40

03:52 to 50 square millimeters in size. Anyway, this was just to give a big picture

04:00 , a little bit, of the DRAM chips that are used for making main

04:08 memory in pretty much all systems, and how different they are, I guess I should say

04:16 . So, the memory chips have since many years back

04:23 been manufactured in the same technology as the processor chips, that is, the CMOS technology

04:33 , but with two different types of memory cell designs, fundamentally different designs that also have

04:44 very different properties. So last time I talked mostly about caches, and caches are built

04:50 out of what is known as SRAM, or static random access memory. Whereas main

04:58 memory, the thing that is packaged as DIMMs, for instance, and sold separately from

05:04 the circuit board, or in this high-bandwidth memory, they tend to

05:13 be built out of what is known as dynamic random access memory, DRAM. And the difference between

05:21 two kind of a sketch of here the design is like. So the

05:27 memory using the RAM chips there are basically to be cheap and for that

05:37 means also very small in size you lots of bits of human piece of

05:43 and so they are effectively just want capacitor and transistor that is basically an

05:51 on switch and the capacity is the that stores The charge that defines whether

05:57 system is in a zero or 1 the bit that's a zero or a

06:04 one. Now, the thing that is used for caches is the SRAM, and those are

06:11 typically done as six-transistor cells, as you can see on the right hand side.

06:15 So that means they are considerably larger in terms of the silicon area required to store a

06:23 bit, compared to the DRAM. The difference is that they are able to retain

06:30 state, which the DRAMs are kind of not very good at. The DRAM cells kind

06:37 of leak, so it doesn't take all that long before you lose the state, and

06:45 I'll come back to that. But it means that dynamic random access memory also needs

06:51 to have what's known as a refresh, to restore charges in the memory cells. Otherwise

06:58 you get errors, because the cells don't stay charged. Now, the design of the

07:07 actual memories: they are organized in this way, organized as a matrix, and

07:16 in each one of these little cross points there are bits stored. And so,

07:28 for the operations, which I'll come to in a little bit: when you want a

07:34 particular bit, or a collection of bits, you have to give both a row address and a

07:41 column address. That's very simple; it's just like you get a matrix element out of a

07:46 matrix when you have a row and a column index. So there's nothing unique about that, but

07:53 it also means that if you take a micro-photograph of a memory chip, and remember if you look at

07:58 what it looks like in the upper right hand corner, they are incredibly regular

08:05 , as can also be seen on this figure. So they

08:13 are incredibly dense, and they're much denser, in terms of transistors per unit area, than

08:20 you'll find in the processor designs, because they are so well structured and can be compacted

08:30 very highly. Here's just another figure that tells you a little bit more; it's

08:34 a bit old, I guess by now four years

08:42 ago, and density has increased since, but again it's the picture of a memory chip, and this

08:48 is just kind of a little bit more illustration of the same. And then, if you

08:55 go and look at the footprint, in terms of the actual physical size in terms of

09:01 square millimeters, of the SRAM cells, they are much larger than what they are in the DRAM

09:08 cells. Again, that's why the SRAM is expensive relative to the DRAM:

09:18 it requires a larger amount of silicon per bit. If I remember, I guess roughly at least 10 times the size, if not more.

09:27 So now I was going to talk about how this actually works, and that is

09:35 in part a consequence of this kind of matrix-type design, to get to how the memory

09:44 works. So here are just pictures; the one on the left

09:50 hand side, it shows a chip again, and there's a row address and a

09:56 column address. But, right, one of the properties of memory chips is that row and

10:08 column addresses share, basically, the same lines. So you can only give one

10:16 address at a time: either the row address or the column address. And the

10:21 reason for actually sharing the wires and pins is, again, that the memory chips are

10:29 small, so you don't have much real estate to actually provide all the signals you need,

10:37 both addressing and data and clocks and power and all of it. So it has

10:44 been the case for a long time that the pins for row and column addresses are

10:50 shared. And does that have consequences; is that why the DRAM is much slower than the

10:57 SRAM? I'm coming to that; no, it's not part of it. I'll get to

11:01 why it is, and it's a good question, in a few slides. I'll try to

11:05 explain why it is, but it's inherent in the matrix design. That's where it comes

11:11 from, not so much from the fact that they share pins. So then,

11:21 in part because of energy also, one has constraints in terms of how one

11:35 manages, in fact, the power, and I'll come to that. So there

11:44 is, in fact, and I'll explain it in the next few slides, a protocol that

11:49 one goes through, some of these steps. That is in the right hand column of

11:53 this slide, where it says activate, read, write, pre-charge and refresh. And

12:00 on the next few slides I'll try to make sense out of why these things are

12:06 the way they are. But it means that it's a bit of a process to

12:12 either write or read the memory, because it takes several steps either to read or write

12:18 the memory. It's not just sending an address there, even if it were given as

12:24 separate row and column addresses. There's more to the story of how

12:28 the DRAMs are operated. That's not the case for SRAM. So it's unique to

12:34 the DRAM design, in order to keep the chips small and cheap.

12:44 So here is a little bit more detail again, where this capacitor stores a bit and

12:51 the transistor acts as a switch that basically allows you to either read or write the

12:59 cell. The problem with this DRAM cell is that when you read it: suppose that

13:05 the capacitor being charged represents the one, which is the most common convention, though it

13:09 can be the opposite. There is a charge, and the transistor is the path that reveals the

13:15 state. And when you want to capture that state, you in fact lose the charge

13:22 through the transistor. So whenever you read something, you're kind of, as I

13:31 said, destroying the state, and that is not what you wanted to happen.

13:36 So that means that after you kind of read the value, you need to restore it

13:42 to make sure that the next time you want to access it, it is still in

13:46 the state that you had before. So that's part of the reason why there's a

13:51 bit of a process: a read, in fact, requires a

13:57 write-back type operation every time. Now, if you remember this matrix picture,

14:09 I think I'll come back to that on the next slide; I can't say for

14:13 sure. So maybe I'll talk about it on that slide. So basically, what

14:20 happens then is that, for instance, in order to be able to read a row,

14:31 one needs to do what's known as activate. And that's again a function of the

14:37 fact of how you manage power on the chip: basically, only one row is kind of

14:43 enabled at a time in terms of reading rows. And that means, when you

14:50 go from reading one row to another row, there is a process again of closing up,

14:56 or restoring, things that you may have destroyed in the read process, before you can open

15:04 or activate another row. Now, rows tend to be quite long; for instance, a

15:19 row may contain many words. So it's not just 32 or 64 bits.

15:26 So, in order to select some segment, some number of bits, in a row, one must

15:36 also then give column addresses. So there is again the sharing of the pins: you

15:43 first give a row address, and that allows you then to activate a particular

15:50 row that contains the data you want to read. And then you provide column

15:58 addresses. But, as I mentioned, there are many data items to pick in the row in

16:05 the DRAM. So you have the option of selecting basically different columns, such that

16:16 the collection of them contains the bits for a word that you want,

16:22 a single or double precision word, say. So, once you have activated a

16:28 row, you can read collections of columns in sequence, without going through the process of activating

16:40 the row again. So I'll come back to that, and if you just look,

16:47 the process is usually: activate the row, and then, if you have good access locality in

16:58 memory, that means hopefully you will read columns within the same row, in order

17:07 not to incur the penalty in time of going through the closure of

17:14 a given row and activation of a new row. So this is just a simple

17:19 example showing that in this case there are three different columns that one wants to read out

17:26 of the same row. So that's: you activate row zero, and then you ask

17:31 for, say, column zero, and read column one, and read column three

17:36 . But then the next read goes to a different row, in this case just the

17:40 next row, row one, but it could have been any row, and the

17:45 process would be the same. So there's nothing unique about just jumping into the adjacent

17:50 row. But then, before you can do that, one needs to do what's known

17:56 as the pre-charge. That is basically restoring whatever the row was that you were

18:01 working with. And then, after that's done, you can activate the new row that

18:07 you want to read. And then the process repeats, and so on. And when a

18:15 row is read, it kind of goes into a row buffer, and I'll come back to

18:21 that in later slides. So, in fact, what happens is: you copy the row into

18:26 the row buffer, and then you get the various columns out of this row buffer.
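The activate / read / pre-charge sequence just described can be put into a small sketch. This is a toy model under assumed numbers: the cycle costs and the single-bank, one-row-buffer controller below are illustrative placeholders, not real DDR timings.

```python
# Toy model of one DRAM bank: one row can be open in the row buffer at a
# time; reading another row costs a pre-charge (close) plus an activate.
# The cycle costs below are made-up placeholders, not real DDR timings.
T_ACT, T_READ, T_PRE = 5, 4, 5

def access_cost(requests):
    """Count cycles for a sequence of (row, col) reads on a single bank."""
    open_row = None
    cycles = 0
    for row, _col in requests:
        if open_row != row:              # row-buffer miss
            if open_row is not None:
                cycles += T_PRE          # close (restore) the old row
            cycles += T_ACT              # activate: copy new row to row buffer
            open_row = row
        cycles += T_READ                 # column read out of the row buffer
    return cycles

# The example from the slide: three columns of row 0, then a read in row 1.
print(access_cost([(0, 0), (0, 1), (0, 3), (1, 0)]))  # 31 cycles
```

Staying within one row pays the activate only once; every row switch adds a pre-charge plus an activate on top of the column read.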

18:36 So this is just a listing of what I kind of just said: what happens

18:41 when you want to read in a DRAM. You activate the row, and then you

18:46 go through the read or write process that you want to do, and then you need

18:50 to close it, and then move on to the next one. So I'll stop

18:58 for questions after another couple of slides, and see if there are any. Given that there

19:05 is a bit of a process, a sequence of several steps, in order to read or

19:10 write a data item to the DRAM, one has this notion of cycle time, which

19:21 is not to be confused with the clock cycles of the memory. So cycle time,

19:31 when it comes to DRAM, is used to describe the minimum

19:40 time between making successive requests to the memory, because each request or access is

19:55 associated with several steps. So the cycle time for the DRAM is longer than

20:04 the access time, and I'll come back to that again with another graphical illustration.

20:13 So this was kind of an illustration of what the access time and cycle time are, and the

20:20 relationship between them. It's kind of a silly example, but just to try to illustrate

20:27 the fact that there is a process, and that's why the time between successive requests is longer

20:36 than just retrieving a particular item from the memory. And now I'm getting into

20:46 a little bit of the detail, but if any one of you at some point in your

20:52 life needs to buy or configure servers and PCs, these items are in

21:00 fact very important, and this refers to the different steps that are necessary when you

21:09 read or write. All right, in the DRAM there is this column access strobe,

21:23 just this tCAS time: that is the number of cycles that it takes after you

21:32 send, or give, the column address to the memory, to the DRAM, before you can

21:41 actually get the data for that column address. And part of these things comes from

21:53 the fact that, it turns out, the memory bus, or channel, operates at a higher clock

22:04 frequency than the DRAM itself. That's in part coming, to halfway

22:14 or fully answering the question why DRAM is slow; I will try to answer

22:18 it more in the following slides. Why, also, the clock rate inside

22:26 the memory is so much lower than the clock rate on the memory bus,

22:32 which in turn is actually slower than the clock rate of the processor, typically.

22:39 So the DRAM memory is characterized by four timing parameters,

22:49 or numbers. One is, again, the number of cycles it takes after you give

22:55 the column address before you can actually read the values out. Then there's also the

23:04 number of cycles that it takes after you give the row address before you can give

23:12 the column address, and that's the so-called row-to-column delay, or RCD. Then

23:22 there is the step where you close up the row; that also takes some

23:28 time, and that's characterized by this time to pre-charge, RP. And then there is yet another

23:39 time, which is not related to the cycle time for the memory: that is the time

23:46 you kind of need to stay in a row, after you issue a row address, before

23:53 you can issue a pre-charge, closing up the row to move to the next

23:59 row. So again, when you look at specs for DRAM chips, it

24:06 tells you a little bit about what the bus rate tends to be, or is,

24:11 and I'll show you some examples in the next slide or two, but it's also specified

24:17 in terms of these various delays associated with the different steps in using the DRAM.
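As a quick sketch of how such cycle counts translate into wall-clock time (the CL=10 at an 800 MHz bus below is an assumed, typical-looking example, not a figure from a specific datasheet):

```python
# Convert a delay given in memory-bus clock cycles into nanoseconds.
def cycles_to_ns(cycles, bus_mhz):
    return cycles * 1000.0 / bus_mhz  # one cycle lasts 1000/MHz nanoseconds

# Hypothetical example: a CAS latency of 10 cycles on an 800 MHz bus.
print(cycles_to_ns(10, 800))  # 12.5 ns
```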

24:29 Here was kind of a little bit of a graphical illustration of how the different times are defined.

24:36 So the RCD was the time after you give the row address before you can

24:42 give the column address. So when you want to do something at all, you first issue

24:48 the row address, and that takes some time, and then you want to issue the

24:53 column address, and the tRCD tells you how many cycles it takes, in

25:00 terms of memory bus cycles, before you can go from one event to

25:06 the other. And once you are ready to issue a column address, that in turn

25:12 takes some time, the tCAS. Then you can issue several column addresses

25:22 without waiting for the completion, so you can kind of pipeline column addresses like

25:31 that. And then we have this minimum active time for a row: that is the RAS

25:41 , the row active time, the row address strobe. Then there is, on the right hand side

25:47 , the pre-charge, when you're done with all of this. Yeah, so the

25:53 minimum active time after you issue the row address tells you the earliest time you

25:59 can issue a pre-charge, and then the pre-charge itself takes some time, and then you're kind of

26:06 done with the whole process of activating a row and closing it up. And now I'll

26:16 talk about some numbers, I guess, for the chips here, and then

26:22 I'll stop and see if there are questions. But before that: I've used this notion

26:30 of DDR, and it will show up in the next few slides, and in case I

26:35 didn't define it in some earlier lecture, just for your reference again: DDR

26:40 stands for double data rate, and what it means is that one can get data both

26:49 when the clock signal is rising and when it's falling. So then, in

26:56 sort of my little picture here, it shows a bunch of clock cycles, and

27:01 the double arrow shows the length of a clock cycle. So in that case you

27:05 can get either read or write two values per clock cycle through this double data

27:13 rate design. Now, just in preparation for what I think is on the next

27:20 slide: for the external clock there's a ratio, typically, that is fixed:

27:28 the internal clock for the memory is four times slower than the bus clock.
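In numbers, the clock relationships just mentioned look like this; 200 MHz is the low-end internal rate quoted on the next slide:

```python
# The fixed ratios described above: the I/O bus clock is 4x the internal
# memory-array clock, and DDR transfers data on both clock edges.
internal_mhz = 200                  # internal array clock (low-end example)
bus_mhz = internal_mhz * 4          # bus runs four times faster
transfers_per_s = bus_mhz * 2       # double data rate: two per bus cycle
print(bus_mhz, transfers_per_s)     # 800 MHz bus, 1600 MT/s
```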

27:39 So here is now, I would say, the state of the art: specs

27:47 for DDR. The DDR4 designs are now the most recent designs that are

27:54 being used in servers. At the end of this year there will be a new

28:01 server generation coming up that will use the next generation, DDR5; that may be

28:08 used in some other scenarios, but DDR4 is still what is typical. So here are

28:17 the DDR4 memories and the numbers in terms of data rates; some of the

28:27 numbers change and some of the numbers don't change much. I'll try to point that

28:34 out. We have talked earlier about how processors and memory compare; memory tends to be

28:40 more of a follower because it's slow, and I'll come to the reason why the clock is

28:47 so much lower inside the memory in a slide or two. But if you look

28:57 in the first column on this slide: so, I guess, the first column is

29:01 DDR4 and then there's a four-digit number after: 1600 on the

29:08 top and then 3200 at the bottom, and that's related to, okay, the memory

29:17 channel bus speed. If you look at the second column, that gives

29:29 you the actual clock rate for the DDR memory chip itself, so to speak

29:40 , or the memory array of bits on the memory chip. So that goes,

29:50 in the lower range of performance for DDR4 memory, from 200 MHz, and the

29:57 top of the line is 400 MHz. Well, when I talked about

30:05 the processor designs, they tend to operate in the 2.5 to 4 gigahertz clock

30:13 range. So they're about 10 times higher clock rate than the rate at which the

30:21 bit arrays in DDR memory operate. And that's the fundamental reason, I would say, why

30:32 memory is a lot slower. Now, one may wonder why the clock rate is so low

30:37 in the DRAM, to be clear. And I'll come to that. As I mentioned, now, in order to

30:43 try to mitigate a little bit the problem with such low clock rates in

30:52 the DRAM memory, one plays some tricks, such that, as you can see here, the

31:01 I/O bus, or memory channel, clock rate is, if you look at this column,

31:09 consistently four times higher than what it is inside the memory. And I'll talk a little

31:17 bit about how these things can be done, in which way and how the

31:25 DRAM memory is designed in order to be able to also deliver things on the

31:31 bus at the rate the bus is operating, despite the fact that it's four times faster

31:38 than the memory cells themselves operate. Then, when we look at this data rate

31:46 , the transfer rate: that is nowadays usually rated as transfers per second, and MT/s

31:59 just means million transfers per second, related to the megahertz clock rates. But this means

32:06 every pin or wire can then deliver bits at this rate, which on the top

32:17 line is 1600 mega-transfers per second, and this is per wire, so to speak, in the

32:28 memory channel. As you can see, there is a factor of two: for each row the

32:32 data rate goes from 800 to 1600. So there are two transfers per clock cycle, and that's

32:37 true throughout this column. So that is the double data rate feature. And then

32:44 we have the column named here, which is then a factor of eight

32:54 higher than the transfer rate. So this kind of indicates that it's kind of an

33:03 x8, times-eight, chip, because now it's related more to what's happening on the bus. And if

33:10 you do then use these x8 chips and put eight of them together,

33:18 then you actually get the rates in terms of bytes per second on the memory

33:25 channel. That in this case is 12.8 gigabytes per second for the lower speed grade;

33:31 for the higher speed grade, DDR4-3200, it is 25.6 gigabytes per second.
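The peak-bandwidth arithmetic behind those two figures can be checked directly; the 8-byte width comes from putting eight x8 chips side by side:

```python
# Peak channel bandwidth: transfers per second times the 8-byte (64-bit)
# channel width built from eight x8 chips.
def channel_gb_per_s(mega_transfers, width_bytes=8):
    return mega_transfers * width_bytes / 1000.0  # MT/s * bytes -> GB/s

print(channel_gb_per_s(1600))  # DDR4-1600: 12.8 GB/s
print(channel_gb_per_s(3200))  # DDR4-3200: 25.6 GB/s
```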

33:44 So, moving on to this notion of all the delays: I said there was the

33:51 tRCD and the tRP. And the CAS, or column access strobe, I

33:57 suppose, I guess I ended up with it listed as CL in this column here.

34:02 But these are the specs for the chips here that tell how many clock cycles,

34:07 in terms of the memory bus clock, it takes as a delay between issuing the column address

34:16 and before you can get something out. Similarly, if you give the row address, the

34:21 number of cycles it takes before you can issue a column address, and then the

34:29 pre-charge time. So you can see that the faster the internal clock,

34:35 if you go down the table here, the higher the rate of the

34:42 clock for the memory cells, the more delay cycles it takes before you can actually

34:50 do something. So even though the clock rates are higher as you go towards the bottom

34:58 here, so the higher-rate speed grade chips, the delay between the different steps

35:04 in the process of reading or writing to the chip goes up. So in

35:10 the end, the column here to the right then tells you the actual physical time:

35:18 for the lower grade chips, yes, it is 12.5 nanoseconds,

35:27 and if you look at the fastest, it's not much different; it's actually

35:35 about the same. So it's kind of counterintuitive, and that's something one needs to consider

35:45 when deciding whether one wants to spend the money to buy the fastest speed grade memory

35:52 chips. That has benefits, but if, for instance, you don't need to

36:02 switch between different rows all the time, if you can stay in the same row,

36:08 you don't need to pay the time for pre-charge and so on. So your

36:12 memory access pattern also gives you an idea of whether it pays off, because of

36:22 the latency involved: the latency does not necessarily go down. In fact it

36:27 goes up, and potentially, in this case, it stays the same. So, latency-wise,

36:34 it doesn't buy you anything; bandwidth-wise, it does buy you something. And that

36:43 is one very important aspect in how you configure your memory.
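The latency point can be put in numbers. The CL values below are assumed, illustrative cycle counts, chosen only to show the pattern of delay cycles scaling with the clock, so the time stays flat:

```python
# Delay cycles grow roughly with the bus clock, so the delay measured in
# nanoseconds stays about flat across speed grades. CL values assumed.
grades = {"DDR4-1600": (800, 10), "DDR4-3200": (1600, 20)}  # (bus MHz, CL)
for name, (bus_mhz, cl) in grades.items():
    print(name, cl * 1000.0 / bus_mhz, "ns")  # both come out to 12.5 ns
```

The faster grade doubles the transfer rate, but the first-access delay in time is unchanged: bandwidth improves, latency does not.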

36:52 I'll stop here for a second, and then I'll try to explain a little bit why things are

36:58 kind of staying pretty much the same, and not only in terms of the latency relative

37:05 to the clock rate that you have inside the memory chip; it's also, if you

37:12 go through it, and I think I have it on a future slide, that over time

37:17 , through the different generations of DDR, the latency hasn't changed much. So

37:24 I'll see if there are questions at this point, and otherwise I'll again try to

37:30 explain more why these things are so slow, based on physics. Okay. So, a

37:46 little bit more on the physics then. I guess there's a little bit more of what

37:50 was said: latency, measured in nanoseconds, what it is. And the graph down

37:56 at the bottom left shows pretty much, I'd say, actually the speed rates, but the

38:06 different bullets here, as you can see in the table above, I guess I should say,

38:12 are the different generations of DDR memory, and the rightmost column here shows how

38:18 it has decreased. For the first one I don't have a year, but it is a while

38:23 ago, in terms of the single data rate design. The double data rate design is

38:30 probably about 20 years old, the first one. But as you can see, the latency

38:36 numbers have not really changed, and that's essentially inherent in the way the physics works

38:46 for CMOS memories. So it's not that memory designers are ignorant or don't

38:53 want to make higher performance chips; it is fundamental. So as long as one

39:01 stays with the CMOS technology for memory chips, the latency is not going

39:11 to change, and that's something very important to try to remember as a kind of

39:21 rule of thumb: don't hope too much that the internals of the DRAM memory will

39:29 improve in speed. Tricks have been used, if I can say so; there

39:37 have been ways to try to mitigate that fact by increasing the capability to deliver bits

39:47 to the memory bus. So one tries to make up for the slow clocks with

39:56 the actual architecture of the design of the chips, which I will get to, I think,

40:03 next. Yes, so I think we will talk a little bit about how one makes

40:10 up for, or tries to bridge, the differences as best one can. But the bridging is in

40:17 terms of bandwidth, not in terms of latency; one can't trick physics.

40:30 So, yeah, here is kind of a picture of how the DRAM chips are

40:39 put together. In one of the first few slides today I showed this kind of

40:47 fundamental idea of memory being organized as a matrix, with rows and columns, and at each

40:57 cross point in between rows and columns there is a bit, sometimes more than one,

41:04 being stored, but fundamentally it's this kind of row and column organization. So here is a

41:11 little bit of, you know, a square to denote one of these arrays

41:21 of memory bits. Now, in fact, for this case it is the DDR3,

41:30 and part of it is also true for the DDR4; I'll talk about, give examples of

41:35 that today; the DDR4 is a little bit more complex, and the principle is maybe more easily

41:40 explained on this DDR3. So that's why I still have these slides for the

41:45 DDR3 memory, to illustrate how things are built up. So, in fact, inside the chip

41:52 there are eight such arrays of memory cells, and those are known as banks, not to be

42:05 confused with ranks; that has to do with sets of memory chips that together

42:14 make the bus width fit. These banks are internal in the DRAM design, and

42:23 sometimes this is also kind of known as pages, and it incurs penalties when you

42:32 move from one bank to another. Now, as I just said, you need to give

42:40 row and column addresses, and that is inherent for each one of the banks.

42:47 But then you also need to select the bank. So the standard for the DDR

42:55 3 memory requires that there are eight banks inside the chip; memory designers don't have

43:03 an option. So eight banks require three bits to select which particular bank I want

43:10 to talk to, and then we have the row addresses and the column addresses.
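The address-bit bookkeeping is simple to verify; the 16K-row and 128-column figures come from the chip spec discussed just below:

```python
# Address bits needed: 8 banks -> 3 bank-select bits; with 16K rows and
# 128 columns per bank, 14 row bits and 7 column bits.
from math import log2

banks, rows, cols = 8, 16 * 1024, 128
bank_bits, row_bits, col_bits = (int(log2(n)) for n in (banks, rows, cols))
print(bank_bits, row_bits, col_bits)  # 3 14 7
```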

43:19 Now, if one looks at the spec, the thing that is exemplified on this drawing

43:25 , it's known as a one gigabit, times eight (x8), chip. Yeah, so that means

43:34 eight bits wide delivery; so on every clock, so to speak, it puts out eight

43:44 bits. And then the organization: for every collection of eight bits, there is

43:52 at least some structure to it. So, yeah, the chip, as I was

43:59 saying, I think: so this has the banks I already mentioned, and so on, and there

44:04 is the times eight. So if one looks at the detail here, one of the

44:09 things that comes out is the data width of the chip; the x8 pretty

44:14 much specifies the width of the output. Yeah. The 128, that is, in

44:26 fact, the number of columns; there is a line that follows that, because another way of

44:34 writing it, at 128, tells you how many columns there are. So that means there are seven bits

44:39 you find at the bottom here to pick out which column you want to read

44:43 . And also, when I look at the bank description, you find the number of

44:49 columns, and the description there is the number of rows times the number of columns, and

44:56 then what is associated with each of the row and column intersections. So it's kind of a third

45:02 dimension, and at each intersection there are 64 bits. Now, the

45:15 64 bits part is kind of not totally obvious. So, I think it says it,

45:21 right: this is known as, kind of, the burst mode of eight. So

45:31 that means for every request you in fact get 64 bits, and not just eight.

45:41 So that's kind of the row buffer: a column of the row buffer, in

45:48 terms of the readout, actually gets you a 64-bit value instead of eight bits. And

45:58 that's how one then kind of can make up for the discrepancy between the data rate

46:10 of the memory channel and the internal clock rate. As I said, there was a factor

46:16 of four in the internal clock rate between the bus and the internals, and then

46:22 there was the double data rate, so there are two transfers per bus cycle you deliver

46:30 . So that means in fact there is a factor of eight difference in terms of

46:39 what gets put out, and that's why the burst mode of eight is matching that: a factor

46:45 of four in the clock rate and a factor of two for the double data rate. Okay.
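That factor-of-eight match is just the product of the two ratios above:

```python
# Burst length = (bus clock / internal clock) * (transfers per bus cycle).
clock_ratio = 4                 # bus runs 4x faster than the internals
ddr_factor = 2                  # double data rate: both clock edges
burst_length = clock_ratio * ddr_factor
print(burst_length)             # 8: each internal fetch feeds 8 bus transfers
```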

46:52 All right. So, let's see if we have something. Yeah, so these are

46:57 the numbers of rows and columns, simply. And so, if you work out the numbers

47:06 here, 16K rows times 128 columns times 64 bits per row and column intersection

47:13 , then you get 128 megabits, and then there are eight of those banks, so,

47:21 all considered: okay, a one gigabit chip. So, you see,

47:35 some memories are fairly complex entities in their own right.
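The capacity arithmetic just worked out, written down:

```python
# Chip capacity: per bank, 16K rows x 128 columns x 64 bits per
# intersection; eight banks per chip.
rows, cols, bits_per_point, banks = 16 * 1024, 128, 64, 8
bank_mbit = rows * cols * bits_per_point // (1024 * 1024)
chip_gbit = bank_mbit * banks // 1024
print(bank_mbit, chip_gbit)  # 128 Mbit per bank, 1 Gbit per chip
```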

47:46 Let's see what else we have. Right. Yeah. So the other thing is also, then, as I point

47:54 out in number two on this slide, that many of the processors work with

48:04 64-byte cache lines, and each DIMM width is typically 64 bits, or eight bytes

48:19 . So also this burst mode of eight kind of matches the ability to serve cache

48:27 lines. So that's the connection between burst mode and cache lines.
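The match between burst mode and cache lines is one multiplication:

```python
# A 64-bit (8-byte) DIMM data path times a burst of 8 transfers moves
# exactly one 64-byte cache line per request.
dimm_width_bytes = 8
burst_length = 8
print(dimm_width_bytes * burst_length)  # 64 bytes: one cache line
```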

48:45 Any questions on this? So, as I said, I showed you something about

48:52 DDR4, and then I'll give you a little bit more of examples here about

48:58 DDR4 memory and problems in using it. So, again, one is trying

49:14 to figure out, again, how to increase the ability to deliver things faster to the memory

49:24 bus, despite the fact that the clock rate internally is low, and that's part of

49:30 what one is trying to do with the DDR4 design. That is,

49:35 it tends to have groups of banks, and the access time depends on whether you

49:44 are working in a single group, or group different accesses, or access things in

49:52 different groups. So the timing behavior of the DDR4 is more complex than

50:01 it is for the DDR3, but it does improve the potential peak: then, what you

50:09 can get for DDR4 memory as the peak, there

50:21 is, on the memory bus, according to the spec, twice what it is for DDR3 when

50:28 fully utilized. But it doesn't mean that the internals are working any faster

50:35 , just that the complexity of the design, as I will call it, of the DRAM

50:41 chip itself has become more involved, more complex. So, it shows

50:51 a little bit, I guess: this has four groups of banks; the addressing is kind

50:56 of similar; still, things are shared on the bus. But then, just as I

51:06 said, it's more complicated. And that's why, also, you can't just take a

51:13 server that was designed for DDR4 memory and try to plug in a

51:18 chip for, say, DDR5 memory. It doesn't work. So then I

51:27 have a slide here that leads to the problem: how do you get the peak

51:34 performance, or not, out of the DDR memory. So, because of the internal design as

51:44 well as the pin limitation, there are potentially serious performance issues. It's just inherent in

51:54 the DRAM. So it's by no means uniform random access memory; the notion of

52:02 RAM, when one talks about main memory, is the DRAM, but it's by no means uniform

52:08 access. So, yes, basically, to say it: on this particular example there is

52:22 a memory chip there, DDR3-1333, at 666 MHz, so the numbers work out.

52:33 And it was on a previous slide that that corresponds to 10.66

52:39 gigabytes per second. And that's if the access pattern to the DRAM is the

52:47 most favorable. If you're only working in a few banks, it drops by a factor of

52:56 two. And if you just work in a single bank, then it goes down to one

53:02 eighth of the peak performance. So, if you happen to be unlucky and the successive data

53:15 that you want happens to be residing in the same bank, your performance out of

53:23 main memory goes down by a factor of eight for the DDR3 memory, and

53:28 potentially more for DDR4. So it's a big difference in terms of the ability of

53:41 the memory to deliver its peak performance; the access pattern has a huge impact on the

53:50 delivered performance. And this is just for a single DRAM chip, I think.

53:58 And then, in a multicore, it's easy to see that, when accesses to the memory come from many

54:03 cores, you may also have additional conflicts, with successive requests not being spread

54:15 out, but colliding in single banks or bank groups. So here is kind of a
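The degradation figures for that DDR3-1333 example can be checked with the same bandwidth arithmetic as before; the factor-of-two and factor-of-eight drops are the ones quoted from the slide:

```python
# Delivered bandwidth under different access patterns for DDR3-1333.
peak = 1333 * 8 / 1000.0        # 1333 MT/s on an 8-byte channel -> GB/s
print(round(peak, 2))           # ~10.66 GB/s, favorable pattern
print(round(peak / 2, 2))       # ~5.33 GB/s, fewer banks in play
print(round(peak / 8, 2))       # ~1.33 GB/s, everything in one bank
```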

54:25 simple example in a server-type setting, in which case, in the best

54:32 scenario, basically, things are just interleaved, successive accesses across the memory channels. So you use

54:42 all the memory channels. And in the worst case, it turns out that the data that

54:48 you want is on a single memory channel, and for that memory channel it happens to be

54:55 in a single bank. So in that case the ratio between the best scenario and

55:00 the worst scenario is: the best scenario is 341 gigabytes per second

55:12 , and then it degrades severely depending on how the access pattern is relative to the

55:19 main memory design. So in that case it can be a large factor of

55:24 performance degradation, just as a function of where and how you access the memory, for the

55:32 data that you are having for your application. So, any questions on this before I

55:45 talk about them why the corporate is , so I have a question uh

55:56 we use the single bank honored he channeling our band with radios. Strike

56:08 hopefully you're data. I lay out respect the memory and the way you

56:18 it uh will be such that you end up in this worst case

56:27 So, what is typically being done is this: if you have, for instance, matrices or

56:37 other multidimensional arrays, they are first flattened into a one-dimensional array. All multidimensional arrays

56:47 are, by default, in row-major or column-major order, depending on

56:53 whether it is C or Fortran. So it is flattened into a 1-D array, and this

57:00 1-D array is then typically laid out such that you get the best performance if

57:08 it is accessed with stride one. So that

57:15 means it is laid out across memory channels and across banks within the DRAM chips

57:27 behind each memory channel. So one tries to do it so that if you

57:32 have a stride-one access, you get the peak memory performance. But coming back to

57:41 the caveat: that means, if it was flattened row-wise, then as you go along a row, the

57:47 next element in the row is also in a place where you get fast access.

57:53 But when you go down a column, the next element in the column is the

58:03 length of a row away. So the first element in the second row

58:11 is the row length away from the first element in the first row. That

58:17 means it may end up in a place where it is in the same bank

58:24 as the first element of the first

58:29 row. So it could be that all the column elements are in the same

58:36 bank, so that when you do column access instead of row access, you get the very slow single-bank performance instead

58:45 of the best-case scenario across banks and channels. So problems tend

58:53 to arise if people have code optimized for Fortran, written in

59:04 column-major order, and that was part of, I guess, the second assignment, I believe.

59:08 Um, if one were to convert such

59:18 code straight to C, which has the opposite flattening, then the column-major ordering

59:33 of the innermost-loop accesses is no longer favorable, and you get miserable performance.

59:44 And that is fairly easy to try out on your own with a textbook example:

59:51 regardless of the programming language used, try to swap

59:59 the order of the two innermost loops, and you usually will see a big difference

60:03 in performance. So the bandwidth is limited according to the data access pattern? It is

60:13 not just that if you use a single channel your bandwidth is automatically

60:19 limited; rather, depending on the access pattern of your accesses to the RAM, your bandwidth

60:25 may be limited. Right, correct? Yes, correct. Thank you.

60:34 So, from the programmer's perspective: to get good performance, one would

60:43 want to be conscious both of how your arrays are, by default, laid out in memory and

60:56 of what the access pattern in the code to those arrays is.

61:04 Unfortunately, there is not much control over the layout, the kind of flattening of

61:13 multidimensional arrays; standard programming languages have, by default, the built-in notion that memory is random

61:23 access. So the language does not take the structure of the memory architecture into account: it

61:32 assumes things are random access, and that is the model on which the design is made.

61:37 Unfortunately, that is not true in reality, and that is why, as a programmer

61:43 trying to optimize performance, one needs to understand also how the memory system itself is designed.
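The layout-versus-access-pattern argument above can be sketched in code. The mapping below (64-byte blocks rotated across 4 channels, then 8 banks per channel) is a hypothetical interleaving chosen for illustration, not the scheme of any real memory controller:

```python
# Hypothetical interleaved address mapping: 64-byte blocks are rotated
# across 4 channels, then across 8 banks per channel.
BLOCK, CHANNELS, BANKS = 64, 4, 8

def channel_bank(byte_addr):
    """Map a byte address to (channel, bank) under the toy interleaving."""
    block = byte_addr // BLOCK
    return block % CHANNELS, (block // CHANNELS) % BANKS

# Row-major matrix of 8-byte elements with 256 columns:
# walking a ROW touches consecutive addresses, so blocks rotate channels.
ncols, elem = 256, 8
row_walk = [channel_bank(j * elem) for j in range(32)]
print(len({cb[0] for cb in row_walk}))  # prints 4: all channels used

# Walking a COLUMN jumps ncols*elem = 2048 bytes per element, which is
# exactly BLOCK*CHANNELS*BANKS here: every access lands in one bank.
col_walk = [channel_bank(i * ncols * elem) for i in range(32)]
print(len(set(col_walk)))  # prints 1: a single (channel, bank), worst case
```

Swapping the two innermost loops of a matrix traversal, as suggested above, is exactly what turns the column-walk pattern into the row-walk pattern.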

61:55 Now, a few comments on clock rates. I mentioned that there is physics behind

62:01 why it is hard to get clock rates for memory to come up to be comparable to what they are

62:10 for processors. So, yes, here is the gap I talked about.

62:16 I think I showed this slide before, so you can see

62:23 also the technology being used for building memory. So one has, effectively,

62:31 an RC circuit, and the feature sizes used by state-of-the-art technology

62:42 today: typical DRAM memories are in the 10 to 50 nanometer

62:52 range. The feature sizes for DRAM are not the smallest available; basically

62:59 it is a cost issue whether you use state of the art or one generation

63:05 behind. But the point is how it

63:11 scales. As we hopefully remember from some

63:24 physics or electrical engineering course at some time, we can all imagine that

63:32 the thinner a wire is, the higher its resistance is. I think the

63:39 usual analogy is a straw, which most of us have seen:

63:44 pushing things through a really tiny straw is much

63:50 harder than pushing things through a nice fat straw. So the resistance goes up

63:58 the smaller the feature sizes are on the chip. And something is also happening to

64:06 the capacitors: the plate area scales down, which is a good thing, but also the

64:13 vertical distance between the plates of the capacitor scales down. So that changes the amount

64:23 of charge that one needs to move in order to charge or discharge the

64:31 capacitor. So in the end it comes down to this RC constant that defines how

64:39 things behave, and if you work out the scaling: the cross section of the wire

64:45 shrinks with the square of the feature size, which means the resistance

64:51 goes up with the square of the scaling factor, while the capacitance of the wire

65:02 roughly improves with the scaling factor. But in the end, the

65:10 RC product actually gets worse with the scaling factor. Now, if the wire

65:23 gets shorter, then its resistance also gets

65:32 smaller, because the wire is shorter. But when it comes to memory, the

65:36 lengths of the wires tend to stay the same, because you need to get things out

65:44 of the chip, so you have to drive signals across the chip. So the

65:51 wire length remains the same, and even though wires get thinner and

66:00 transistors get smaller and the charges get smaller, in the end it roughly

66:08 cancels out. So it is through a lot of tinkering with the physics that one has actually

66:16 managed to retain the clock rates on the memory chips. In principle, one should

66:23 expect them to potentially have gotten worse, in the sense that it would be necessary

66:31 to use lower clock rates for state-of-the-art technology, but one has managed to maintain

66:37 them. Now, I said that one has to run things across the chip to

66:47 get the signals out; that is not quite true. Let me quickly flip back

66:51 to one of the early slides, if I can find

66:56 the right way to do it. So here you can see there

67:04 are basically segments. So inside these chips there are segments; one does not fully run wires across the chip without

67:10 signal restoration. But one still needs to run them quite some distance. So, in practice, between signal restorations the wires remain at the same length.
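The scaling argument above can be written out explicitly. With a scaling factor $s > 1$ (all feature dimensions shrink by $1/s$) and, as argued above, a wire length $L$ that stays fixed:

```latex
% Wire resistance: the cross-sectional area shrinks as 1/s^2
R = \rho \frac{L}{A}, \qquad A \to \frac{A}{s^2}
\;\Rightarrow\; R \to s^2 R .

% Capacitance: the plate area shrinks as 1/s^2, but the plate
% separation d also shrinks as 1/s
C = \varepsilon \frac{A_{\text{plate}}}{d}, \qquad
A_{\text{plate}} \to \frac{A_{\text{plate}}}{s^2},\quad
d \to \frac{d}{s}
\;\Rightarrow\; C \to \frac{C}{s} .

% RC delay at fixed wire length therefore grows with scaling:
RC \;\to\; s^2 R \cdot \frac{C}{s} \;=\; s\,(RC).
```

So at fixed wire length the wire delay grows linearly with the scaling factor, which is why shrinking the feature size does not by itself buy the memory chip a higher clock rate.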

67:19 But that is not true when you look at processor designs, because processor designs

67:25 are not as dense, and they have more freedom in also spending power. So in

67:31 that case one can actually benefit from the smaller feature sizes, except, as I mentioned,

67:38 that leakage is the problem, so one cannot even increase the clock rates on processor chips

67:44 anymore. So even there, things have kind of landed in a space where clock rates don't

67:49 change much; they are not increasing much anymore. And the discrepancy in

67:56 clock rates comes from, basically, the fundamental physics and the desire to have very dense designs

68:05 for memory, and not quite as dense designs, more than a

68:12 factor of 10 less dense, for processor chips. So in that case one has been able

68:22 to run things at higher clock rates with the smaller feature sizes. So this

68:29 was trying to explain why the clock rate has remained low for memory

68:41 chips, and that there is this problem that is not easily solved by the architecture of

68:52 DRAMs. So I hope that somewhat answers the question of why

69:03 the DRAMs are so much slower, and why one tries to make up for it

69:12 by having this burst mode inside the DRAM, and multiple banks, in order to be

69:18 able to output, or deliver, more in a certain sense. The DRAM chip

69:29 is internally parallel, delivering on its internal buses more bits than the external bus,

69:39 which is only 64 bits wide; internally, for example, eight times as many bits are fetched. But

69:51 the other thing to be aware of is that the latency of memory chips has remained pretty

69:57 much constant for over a decade, almost two decades, and it is not likely to

70:03 change going forward, even though one may increase the parallelism

70:12 in the DDR memory to deliver more on the memory bus. All right.
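The remark that the internal buses carry eight times as many bits corresponds to DDR3's 8n prefetch. A worked example with standard DDR3-1333 figures (a sketch connecting the slow internal array clock to the external transfer rate):

```python
# DDR3-1333 clocking, illustrating the 8n prefetch mentioned above.
core_clock_mhz = 166.67      # internal DRAM array clock
prefetch = 8                 # DDR3 fetches 8 bus-widths per array access
bus_width_bits = 64          # external DIMM data bus

transfers_per_sec = core_clock_mhz * 1e6 * prefetch      # ~1333 MT/s
bandwidth_gb_s = transfers_per_sec * bus_width_bits / 8 / 1e9

print(f"{transfers_per_sec/1e6:.0f} MT/s -> {bandwidth_gb_s:.2f} GB/s")
```

The array clock is what the row-access latency is tied to, which is one way to see why latency has stayed flat while bandwidth kept growing.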

70:25 Now, a little bit on another way of trying to alleviate the difference in speed between

70:34 main memory and the processor chip, given the speed of caches on

70:43 the processor itself. So one thing is to try to bring memory

70:50 closer than being, say, on the board, and basically to use DRAM

71:01 inside the chip itself, and that has been done in some recent processor

71:09 designs. So that is known as eDRAM, for embedded DRAM. So in

71:15 that case one uses the DRAM cell design, and not the SRAM

71:24 cell design, for the so-called embedded DRAM. It has been used, for instance, by

71:33 IBM, which started to use it for their level-

71:37 3, and sometimes what they also call level-4, caches. And Intel has also

71:43 started to use this embedded DRAM. It is one-transistor cells for storage, but it has different speed and

71:58 data-retention properties, again because it is a totally different cell design. So it is

72:04 not behaving or operating like the SRAM cache cells; it is operating in a different

72:12 mode because of the design. But it is again a CMOS technology, used in

72:19 order to try to get a little bit of speed and energy efficiency, and again it is

72:27 one transistor, so you can get more bits per unit area with eDRAM

72:32 than you can get with SRAM. So that is kind of the trade-off

72:36 that designers weigh when deciding to use embedded DRAM on some of the chips. So here is

72:45 kind of an example of where it is put: you can see it there on

72:49 the die. And the other thing is to try to get another memory

72:58 cell design, but one that is still external to the chip. And this was known

73:03 as 3D XPoint; it became a product about two or three years ago.

73:12 It is a totally new way of building memory cells, and in terms of speed

73:20 and cost it fits between DRAM and flash memory. It is just something to be

73:26 aware of. It does not change the overall picture; it sits in between

73:34 DRAM and persistent storage, disk or some flavor of flash. And here

73:42 are just summary characteristics in terms of latency for the different types of memory technologies

73:51 that are being used; the colored lines are for the different memory technologies, and

73:58 the chart shows a little bit where things fit in terms of latency.

74:06 I will stop there and try to summarize, and I

74:16 probably won't do part two today. In just a few minutes, then, I will

74:25 do my own summary, just reminding you of what was partially covered in the

74:34 lecture last time: these DRAM chips are then put together

74:42 into modules known as DIMMs. The chips used are of different widths, and that

74:50 enables configuring different amounts of memory, not only

74:59 by using different numbers of bits per DRAM chip; the chip width also helps

75:07 in configuring memory. And then things get worse, both energy-wise and time-wise, as signals

75:16 go out on a circuit board, through the socket onto the circuit-board

75:22 DIMM slot. So embedded systems tend to use memory chips directly soldered onto the circuit

75:31 board instead of using DIMMs and DIMM slots.
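The point about chip widths can be made concrete. A standard (non-ECC) DIMM rank presents a 64-bit data bus, so the chip width determines how many DRAM chips are ganged together per rank; x4, x8, and x16 are the common part widths (this arithmetic is standard background, added here as illustration rather than taken from the slides):

```python
# How DRAM chip width determines the number of chips per 64-bit rank.
RANK_WIDTH_BITS = 64  # data bus width of a standard (non-ECC) DIMM rank

def chips_per_rank(chip_width_bits):
    """Number of DRAM chips ganged together to fill one rank."""
    assert RANK_WIDTH_BITS % chip_width_bits == 0
    return RANK_WIDTH_BITS // chip_width_bits

for width in (4, 8, 16):  # common x4, x8, x16 parts
    print(f"x{width} chips: {chips_per_rank(width)} per rank")
```

At a fixed per-chip capacity, a rank of sixteen x4 parts holds twice as much memory as a rank of eight x8 parts, which is one of the configuration knobs mentioned above.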

75:43 But again, to increase performance, both bandwidth and latency, one has in recent years, in

75:53 the last few, started to stack memory chip dies and then integrate them in

76:05 the same package as the processor chip, by using what is known as a silicon

76:11 interposer, which has a lot more wires than what you can do from the

76:18 socket to the board. So you get more channels to the memory, and you can

76:24 also operate them at a good speed. And then there were just some simple

76:32 comparisons of the performance difference, both in terms of speed and energy efficiency,

76:47 and if one looks at the bottom rows here, one can see

76:53 that the stacked memory, the high-bandwidth memory (HBM), is about 10 times as energy-efficient

77:01 as DDR4, and it supports considerably higher data rates. So there

77:11 are some choices today, but HBM memory is clearly also more expensive; still, it is a

77:20 candidate, and it has been used in some of the GPUs

77:23 in particular, though not all GPUs, because of cost. So you

77:30 can get GPUs with either HBM memory or conventional graphics DRAM. Let's see what's next.

77:38 Yes. The main point, as we also discussed in the question, is

77:45 to be aware that main memory, despite its name, RAM, is by

77:54 no means uniform access; there can be a huge performance difference. So it is worth

78:06 keeping in mind when we try to understand performance: if it is not good, the reason might be

78:15 that it is just an unfortunate access pattern relative to the way the compiler decided to flatten the

78:25 arrays. And, right, this is the summary. So, okay, no time this

78:33 time for part two; maybe time for questions. I will come back to part

78:39 two in a future lecture. It will not be the next lecture;

78:44 then, probably, we will be talking about OpenMP. And part two, here, about

78:51 managing power: there are possibilities for the user to actually manage power, but in

79:01 most cases it mostly helps explain somewhat the performance data that you collect when doing benchmarking

79:11 or timing experiments, because of the control of clock rates that happens under

79:22 the hood. So it might be useful to most of you as a way

79:30 of trying to understand what might have caused the difference between different runs and in the execution

79:39 time. It is also, I think, quite interesting to see how the big players,

79:45 like Facebook and others, do it in terms of how they actually control power in their data

79:51 centers, including all the way down to the chip. But I will stop here and

79:57 take questions.
