00:06 I won't say what happened there. All right, so a little bit more about

00:12 the interconnection network. So last time I talked a little bit about networks that

00:23 are used in particular in clusters designed for high performance, in which the network is an

00:32 integral and very important part and represents a nontrivial cost of the system.

00:40 I was trying to give you some cost models and some sense for how

00:53 cost is related to the topology and to how networks are being put

00:59 together, as well as a little bit about their characteristics in terms of

01:05 capabilities, in particular diameter, latency, and

01:15 bisection width, with respect to congestion, in some sense, in the network.

01:23 And we ended up, I guess, at the end of the lecture talking about fat trees, which

01:30 have become, I will say, the dominating interconnect for large clusters designed for high

01:36 performance. All right, so to step back a little bit, I'll talk about

01:46 crossbars, which have been used and still are. Then we'll talk about the idea

01:54 of so-called multistage networks. In fact, fat trees are mostly built as

02:00 multistage networks, in the sense that the computers are at the leaves of the tree,

02:06 and all the internal nodes of the tree are in fact stages of switches

02:13 just used for switching. Then one caveat: some of the networks

02:21 have what's known as combining features, so they can do reductions in the network,

02:28 for instance, as well as replicating messages for broadcasting. All right, so I'll talk

02:36 about the crossbar first and then the multistage networks, and then towards the end talk a

02:41 little bit about routing in these networks. All right, the crossbar

02:48 looks like this. The nice part about the crossbar is that a

02:55 crossbar like this one is what's known as non-blocking. That means that

03:01 any one of the nodes on the left-hand side can send messages to any one of

03:07 the nodes on the top, and there's no contention. So there

03:15 is a pathway between any pair of nodes without causing blocking of other messages.

03:22 So that's one of the benefits of the crossbar. But it is,

03:30 also, relatively expensive to implement. That's why its usage has been limited,

03:43 and it depends on the technology and the size of the network, and for

03:49 most sizes, whether it's useful and affordable or not. And I showed an example,

03:54 I think, last lecture, of a fully interconnected network that is effectively the

04:00 equivalent of a crossbar that was, or is, used in some systems today, notably by Cray.

04:12 Anyway, so here is kind of

04:18 one very famous example. The computer is by now obsolete, but it

04:24 created a big revelation when it was built: the so-called Earth Simulator, which

04:29 embarrassed the US by being the most powerful computer at the time by a good

04:35 margin. And it caused the U.S. to start several programs to get back

04:44 on the number one throne, so to speak, in high performance computing.

04:50 This computer, when it was built, required its own power station. It was

04:53 very power hungry. Well, that's not so much the point of this particular slide, which

04:59 shows that this system effectively implemented a crossbar between 640 processing nodes. And as you

05:11 can see in the middle of the slide, implementing this required

05:17 almost 3,000 kilometers worth of cabling to get this connectivity.

05:26 And as you can see on top of the slide, a building was built specifically

05:31 for the computer, and again the power station was built next to it to supply

05:37 the power for the computer. Here's a little bit of how it may look

05:41 in terms of layout. There is one floor for the computer,

05:46 and then there are two floors underneath housing the infrastructure for power and cooling.

05:54 One floor alone, well, not full height, is just used for the cabling

05:59 of the computer. Anyway, I think that's one machine that anyone in this field

06:06 at least has heard of at some point. Now here is another

06:12 example showing that crossbars are still being used to create computers. Again, one of the companies

06:20 synonymous with supercomputers since the founding of the company by Seymour Cray, that legendary computer

06:28 architect. They design their own crossbars, which are kind of distributed

06:37 or hierarchical crossbars, which they try to illustrate on this slide.

06:42 So then they build networks using this switch, which is a 64-port

06:51 switch, and this picture shows a little how it's put together.

06:56 It is built of what's known as tiles, two-port tiles, organized in

07:05 kind of a mesh, of four rows and eight columns of these switching tiles

07:12 that are 16-by-8 crossbars, and they are full crossbars. And in fact, in

07:19 this particular switch chip, it's not just this one crossbar,

07:26 but, as it says on the right-hand side of this slide, there are in

07:30 fact five crossbars in one piece of silicon. And they did that in order to

07:38 separate different types of traffic. So the way the communication protocol is

07:48 implemented, they use separate channels for setting up communication, or routes,

07:57 in the system, and these do not interfere with the actual data transfers;

08:04 there is a separate data channel in this case. So this chip was

08:10 designed a while before it was put in service about two years ago, so it was somewhat dated

08:15 from the start, and I suspect there will be something new coming up in a year or

08:19 two. But anyway, at that point, each of the links, or if you

08:26 like, each port on this particular switch, had a capability of 200 gigabits

08:34 per second, and the latency to get through the switch was, on

08:39 average, 350 nanoseconds. And I'll say a little bit more later on about

08:46 how they use the switch to put together a network to interconnect a bit larger systems than

08:54 64 nodes. So, any questions on the crossbar or this particular chip?

09:12 A question? So, I'm not sure that this will be answerable,

09:17 but at work we do a lot of, um, software-

09:24 defined infrastructure, which is just becoming popular for businesses nowadays. Do

09:30 any of the virtualization aspects translate, or is it all lost? I'm talking about,

09:38 like, the ability to tune certain, um, parameters and

09:43 metrics. Is all of that lost as soon as we start virtualizing any of

09:49 our resources, or would you have to have, like, a software-

09:54 defined allocator to really understand what's going on underneath the hood in these virtual environments?

10:02 I do believe that is correct. Okay, so, I don't have, you

10:13 know, a crisp answer for you about that in these networks. So if you look

10:20 at sort of wide area or local area networks, or your typical setting for when you

10:27 use Ethernet and so on, then one has virtualized them and built a software-defined

10:40 network layer on top of the actual physical infrastructure. And then there is also the

10:45 quality of service aspect, where you give priorities to different types of traffic.

10:53 So a form of virtualization of the network I'm not aware of having been used in

11:03 cluster settings. But the quality of service aspect is there. So you can have

11:09 different types of traffic in the network. And that's in fact supported in the Cray

11:14 switch. And it's also supported in the switchgear that is made by the other dominating

11:23 vendor in terms of networks for high performance computing, that is, Mellanox. They

11:31 offer quality of service, or prioritization of traffic, in the network.

11:41 Okay, that makes sense. And I suppose that in any sort of HPC setting

11:46 you wouldn't be interested in virtualization anyway because of the overhead that's associated

11:53 with the virtualization itself, right? Um, so, one can

12:04 argue both for and against, because, you see, most of the large scale systems,

12:13 they are large scale because they tend to favor, in some sense, extreme applications that use

12:23 either the entire system or large fractions of the system. Otherwise, one could as

12:30 well have built, if you like, smaller, cheaper systems than these extreme scale

12:35 systems. On the other hand, if you look at things like, uh, the

12:42 cloud service providers that also have very large systems, virtualization may actually help in terms

12:51 of walling off different applications from each other. So in that case, having a

12:57 software-defined network on top of the actual physical infrastructure may, in fact, help

13:06 in terms of the security aspect, and interference from a security perspective between different jobs running on

13:13 the system at the same time. But in terms of systems that push extreme

13:23 performance, it probably isn't helpful to use virtualization. It may not be worth the

13:31 effort. Okay, that makes sense. I was thinking more in terms of

13:36 using software to, um, enforce that a certain part of memory has to stay

13:43 local, you know, to avoid thrashing or long-distance communication, long distance in quotes,

13:50 in the academic sense. Yes, so that's more in terms of the resource

13:56 management and trying to figure out how to allocate the processes,

14:01 say MPI processes, across the nodes of the system, and trying to place them so

14:09 that it does not cause too much congestion or high latency in the network. And I'll

14:19 come back to the way this is handled by the computer companies. So one way is using

14:25 the fat trees: again, if they're fully provisioned, then effectively there's

14:35 enough bandwidth to think of them as direct paths between each of the leaf nodes without

14:42 causing contention in the network. And in the case of using what I mentioned,

14:55 I believe, last time, and I know I have some slides on it

14:59 today, I guess: the Dragonfly network that this company is using in their systems.

15:05 It is kind of an alternative to the fat tree. They use basically full

15:13 crossbars, so in principle there shouldn't be too much contention. But it depends

15:21 on the routing, which we'll talk a little bit about towards the end of today's

15:27 lecture. So, the resource manager tries, if it has

15:33 some smarts, to minimize the traffic in the network by suitable placement,

15:46 but what it does is also affected by how the routing is done. Okay,

15:51 thank you, Dr. Johnson. That was a good question.

15:55 Again, since there's nothing here but the network, and the principle of

16:04 these networks is a bit different, it is also quite interesting to what extent

16:10 companies decided that they need to design and build their

16:19 own switchgear rather than relying on commodity networking parts for these high

16:29 performance systems. That's what I was going to ask. It seems like this

16:35 network interconnect would be a lot more specialized than things we've seen in the past in

16:42 this course, or the kind of networking that has come

16:54 and gone in popularity too. So I don't know. But I would say for the last many

17:01 years it has been just a very small part of the

17:10 market, and very much just for the high performance systems; in some sense

17:18 not necessarily extreme scale, but systems where applications are such that no matter how you

17:25 try to design your algorithm, there will be a lot of interaction between the

17:31 nodes. So that's why I mentioned the MPPs, where they wanted to attach

17:39 the network not on the I/O bus but basically at the same level as

17:46 main memory. So the network was kind of a first class citizen, in the

17:50 sense that main memory is, and the network attaches to the processor at

17:58 the same level as the level three cache does. But commodity networking

18:06 is cheaper, and it tends to win out over proprietary stuff after

18:14 a few years, but then it kind of catches up. Oh, and

18:21 Cray has designed, being again the company on the high end of the spectrum

18:26 in terms of performance, they have pretty much always had their own network technology.

18:36 They sold it off. Not this one, the Slingshot, but, I believe,

18:46 two generations back: they sold their network technology to Intel, which was

18:52 trying to integrate it into its own network efforts. Intel does also make

19:03 network gear, in addition to doing CPUs, but it has kind of

19:08 struggled a bit to keep up, even, uh, recently, with the high volume

19:18 parts. But so, as I mentioned, Ethernet is higher volume and cheaper, and it's

19:30 difficult to compete against it at modest, if not low, cost. And

19:38 even the specialized gear tends to follow the other protocols with a few years' delay in terms of

19:45 the adoption of even higher data rates. But the protocol itself has higher performance,

19:53 or lower overhead, than if one strictly follows the IEEE Ethernet standard. So again, when it comes to

19:59 the high end, Ethernet hasn't been dominating. Um, and

20:08 if you go sort of a step down from the fairly high-end, performance-

20:15 oriented systems, then you'll find Ethernet. But not for applications like fluid

20:21 dynamics and other applications that require a lot of inter-processor communication. There InfiniBand is still

20:30 dominating, if you can afford it and need the performance. But it should also be

20:37 said, and I'll mention more on that, that because of the

20:49 prevalence of Ethernet in all kinds of situations, even the vendors of switchgear

20:59 for HPC systems, whose initial focus, I would

21:09 say, was on getting the best possible performance out of their switchgear, have chosen to

21:16 also figure out how to support the Ethernet protocol on their switchgear. And,

21:23 their underlying designs being very focused on performance, they can even get

21:32 better performance on the Ethernet protocol than the vanilla switches that you

21:38 may get from, you know... not trying to put Cisco down, or any

21:44 of the other Ethernet switch vendors, but their focus is not necessarily on

21:50 minimum latency, but probably more in terms of throughput and being competitive in terms of

21:56 cost. So the company I mentioned, Mellanox: they are basically focused on

22:07 InfiniBand as a company, but in order to reach and get a bigger market,

22:12 they also support Ethernet on their switchgear, and do so competitively in terms of cost

22:20 with pure Ethernet vendors. And it turns out that Cray, for this switch,

22:27 kind of did the same: they do support Ethernet on it but have their own protocol,

22:37 so to speak, for the performance sensitive applications. But they can

22:43 also talk to standard Ethernet devices, and they figured out how to translate the standard Ethernet

22:50 protocol into their own, which has a much lower overhead than standard Ethernet.

23:01 Okay, any more questions or comments? All right, so let's talk

23:18 about the multistage networks. As I said already, the way the fat trees are

23:23 built is, in fact, as multistage networks. But there are a few other ones that are

23:29 still around. And, as a bit, I guess, of our evolution of how we

23:37 tried to build high performance systems, I wanted you to at least be familiar with

23:43 the terms. So before that, I'm gonna start talking about shuffles and perfect shuffles,

23:50 and talk about the shuffle exchange network. And this is an example of what a

23:54 shuffle is. Anyone that plays cards knows what

24:01 the card shuffle is, and this is exactly what happens here. So this particular

24:05 example is with, uh, eight cards: you split the deck, as a typical card player

24:13 does, and then interleave the cards. And that's simply how

24:17 this shuffle happens. So there are networks that, in fact, implement that, and I will

24:22 show you in the subsequent slides how they're used. The shuffle interconnect is

24:29 used in many multistage networks. And, first thing,

24:34 it's very simple how you compute the destination address from the source address if you want

24:40 to do, um, this kind of shuffle permutation in the network, if

24:45 that is what you need. And sometimes you need to do that. You may think this

24:50 sounds like an odd thing to do. But in fact, that's exactly what happens

24:56 when you change from row-major to column-major ordering. In fact, the

25:09 type of permutation that is required to do that can, in fact, be

25:15 described as a shuffle or an unshuffle. So it's, in fact, a primitive

25:22 that can be very useful in many kinds of computations and networks.
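
As a quick sketch of the shuffle as a computation (an illustration added to these notes; the 2**n element count and zero-based indexing are assumptions, not something stated on the slide), the perfect shuffle is just a cyclic left rotation of the n address bits:

    def perfect_shuffle(i, n):
        # Cyclic left rotation of the n-bit address: element i of 2**n
        # elements ends up at position rotate_left(i, 1).
        return ((i << 1) | (i >> (n - 1))) & ((1 << n) - 1)

    # Eight "cards": split 0..3 and 4..7, then interleave the two halves.
    print([perfect_shuffle(i, 3) for i in range(8)])
    # -> [0, 2, 4, 6, 1, 3, 5, 7]   (new position of card i)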

25:30 So you'd suppose this type of communication pattern needs to do well for many applications, and

25:37 here is an example now of, right, a 16-port omega

25:44 network. And if you look at how the things are drawn here,

25:52 you will see that, if you look at the first stage,

26:01 the bottom half of the nodes are interleaved when you enter the

26:09 first column of blue boxes. So this shuffle, on the address

26:16 bits in this case, is used in each stage to build this omega network.
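
A minimal sketch of how addresses move through such a network: this is the classic destination-tag scheme for an omega network of 2**n ports, a standard construction rather than anything specific to this slide.

    def omega_route(src, dst, n):
        # One shuffle plus one 2x2 exchange per stage; the exchange sets
        # the low bit to the next destination bit (most significant first),
        # so after n stages the address has been rewritten into dst.
        mask = (1 << n) - 1
        node, path = src, [src]
        for stage in range(n):
            node = ((node << 1) | (node >> (n - 1))) & mask  # perfect shuffle
            bit = (dst >> (n - 1 - stage)) & 1               # next dest bit
            node = (node & ~1) | bit                         # switch setting
            path.append(node)
        return path

    print(omega_route(0b0110, 0b0011, 4))   # [6, 12, 8, 1, 3], ends at 3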

26:27 So, um, yes. What I wanted to say was that sometimes

26:37 the way these networks were used was that you had processors on one side,

26:44 say, on the left side, and what was on the right side was memory modules.

26:49 And this was kind of known as the dance hall approach to computer architecture:

26:56 you know, boys on one side and girls on the other side; in this

26:59 case, CPUs on one side and memory modules on the other. And these networks

27:09 have been used. So IBM built something they called the RP3, the

27:16 Research Parallel Processor Prototype, and it never made it into a product. It did serve

27:22 as a learning vehicle, but what it resulted in, in some sense, was what's known

27:28 as the SP2. That was the scalable parallel processor. That was the product,

27:35 and IBM started building this type of network,

27:41 which is illustrated for a few different sizes of total systems. Um, so this is

27:49 what's kind of known as the baseline network, and it looks

27:55 like the same as the omega network. It is not, in terms of the interconnect,

27:58 but the structure is kind of similar. And, um, the red arrows at

28:06 the top show in which direction the shuffle goes. So if you go left

28:11 to right, some interconnection stages in this network are rather an unshuffle than a

28:18 shuffle, and you can see that there are stages of different widths, with different numbers of nodes

28:22 involved in the different stages, in building this network. This is something that has

28:28 been used. And here is another one that is similar to the previous one, except

28:35 one of the stages uses what's known as a butterfly interconnect. So, anyone know

28:42 about butterflies, except the ones flying around, in terms of either algorithms or networks?

28:49 No takers? If anyone knows about fast Fourier transforms:

29:11 in the fast Fourier transform, the data access pattern is, in fact, exactly a

29:16 butterfly. So it's very common, and it's actually a very good form of interconnection

29:23 network that I will talk a bit more about. However, this particular

29:29 network happens to use one of these, a butterfly interconnection, for one of its stages

29:34 in building a multistage network. Here is a fully configured butterfly network, and

29:41 this is exactly one way of drawing the data interactions in computing the

29:49 fast Fourier transform. So if you were to do that, this would be

29:55 kind of a perfect interconnection network, because all of these data interactions would be directly supported

30:02 by links in the network.
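
The butterfly pattern is easy to state in code: in stage s of a radix-2 FFT on 2**n points, element i interacts with the index that differs from i only in bit s. This is a generic FFT fact, used here only to illustrate the wiring:

    def butterfly_partner(i, s):
        # The partner of i in stage s differs from i in exactly bit s.
        return i ^ (1 << s)

    n = 3
    for s in range(n):
        pairs = sorted({tuple(sorted((i, butterfly_partner(i, s))))
                        for i in range(1 << n)})
        print("stage", s, ":", pairs)
    # stage 0 pairs neighbors (0,1)(2,3)...; stage 2 pairs (0,4)(1,5)...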

30:13 So this network was used in a machine by a company called BBN. Bolt, Beranek and Newman, if

30:19 I remember, is what the acronym stands for. And they became perhaps most

30:30 famous not for this Butterfly machine, but because they were the ones that, in fact, participated

30:38 in the initial DARPA project on the ARPANET, the predecessor to the Internet, establishing

30:47 connectivity between the East Coast and the West Coast. And it was based on the

30:52 switchgear that they built, this company BBN.

30:57 They're not in the switch business anymore, but they are still around as, among other

31:02 things, a defense contractor. And here is a little bit of how

31:09 you can lay this kind of butterfly network out, and I'm not going to go

31:15 too much into the details, just plugging in the numbers that I talked

31:18 a little bit about last time. But it turns out it's a very good

31:23 trade-off in terms of diameter and bisection width, and the cost of

31:29 building the network. So it's not only good for doing an FFT-like

31:35 computation, which is kind of a divide and conquer style algorithm, but it's also good in general

31:44 for building networks. And I think the next slide shows a little bit of a

31:50 different drawing than I had in the stylized version on the previous graph, but

31:59 it's essentially the same thing laid out differently, and it shows how they scale.

32:05 In fact, I think it was Bill Dally again, that I mentioned before in

32:11 terms of, I guess, doing the MPP research at the time.

32:20 He also came up with things I'll talk about later in terms of routing,

32:25 when he was a graduate student at Caltech. So anyway, this is how

32:29 things can be laid out in terms of the butterfly network. And I think on

32:35 the next slide, here it comes, I think, on this

32:41 point we come to them. So, um, it was Bill Dally who, after

32:53 pushing for this butterfly network instead of fat trees as an option,

33:03 came up with this notion of what they call the Dragonfly network.

33:09 He did that, I believe, while consulting for the Cray computer company.

33:17 And here is how Cray, in fact, put together their Dragonfly network

33:27 for their computer systems today, using the 64-port switch that I mentioned early on in the

33:33 lecture. So on the left-hand side of the slide, it shows how they

33:41 kind of built the network in, in some sense, three layers. The lowest layer

33:50 is, I think, closest to the computers; there's just connectivity directly to the switch.

33:59 The way they use this 64-port switch is that they primarily, or typically, use

34:07 16 of those 64 ports for connecting compute nodes. So that means they have

34:15 another 48 ports left that they can use to interconnect switches. And on the right-

34:22 hand side, what is shown is what they call a group, which, on

34:30 the left-hand side, consists of 32 switches, and those switches are fully connected.

34:40 So of the 48 ports that are not used to connect up compute nodes,

34:48 they use 31 for a switch to connect to the other

34:57 switches in the same group. So there are enough communication channels to connect directly

35:05 to the 31 other switches. And then that leaves 17 ports to connect between what they

35:19 call groups, these groups of 32 switches. So if you work out all of

35:27 the math of this calculation while building up the network,

35:33 one ends up with up to 544 groups that can be connected up, and

35:41 more than a quarter of a million processors in all that you can connect into

35:45 a single system, with a maximum of three hops in the network between switches. So that's one version,

35:55 and since a lot of the switches are directly connected, there are not many links

36:02 that are being shared and can cause congestion.
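
The port arithmetic of the left-hand configuration can be checked in a few lines (the numbers are the ones quoted above; note that 544 counts the global links per group, which lets a group reach up to 544 other groups):

    ports = 64
    down = 16                      # ports to compute nodes
    group_size = 32                # fully connected group
    intra = group_size - 1         # 31 links to the other switches
    globl = ports - down - intra   # 17 links leaving the group

    global_links = group_size * globl          # 32 * 17 = 544
    groups = global_links + 1                  # this group plus 544 others
    endpoints = groups * group_size * down     # 545 * 32 * 16

    print(globl, global_links, groups, endpoints)
    # -> 17 544 545 279040, i.e. more than a quarter of a million nodes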

36:09 And on the right-hand side, it's a little bit less ambitious. So in that case, what is

36:15 proposed, instead of using a single link between, say, switches in a group, is that you

36:23 can use, or pair up, sets of two links. That means instead of having links

36:32 to 31 others, as on the left, you connect to 15 of the other

36:42 switches, so you get basically 16 switches in a group. And then, on

36:53 the right-hand side, they also use two uplinks to connect to a

37:01 particular group when connecting two different groups. But it's still a substantial number of

37:11 switches, or compute nodes, that you can accommodate in a single system with three hops. So this

37:18 is what, in fact... this type of network is what's being used in the

37:26 three so-called exascale systems that so far have been ordered in the United States.

37:35 So this company has, in one sense, won all of the extreme scale computer

37:44 procurements that have been made in the last two years in the US.

37:53 So let's see what I had next. Another network.

37:58 Any questions on this? This is a little bit related to earlier questions, in this case about

38:04 contention. And I think I have a slide towards the very end of

38:09 today's lecture coming back to this particular network and a little bit of how they

38:14 deal with routing in these Dragonfly networks. Okay, so here's another network that

38:29 has, um, also been used in systems with good results. It's a Clos-

38:36 type network, and it was originally designed for building phone switches. So this is

38:43 something that came about that way, and potentially also for larger phone networks, but

38:50 definitely for building phone switches. And it is a recursive network. So here

38:58 it shows kind of one recursion step; you can build these networks recursively.

39:04 This was also, I guess, kind of a bit of an animation of how the

39:08 recursive network works, basically implementing the functionality of a full crossbar. So let's

39:18 see if I can get through this bit by bit. Um, so first there is a

39:23 split of it into two parts, and then you continue with, sort of,

39:29 half-size switches in the middle, and then you have these shuffle connections on each

39:34 side. And then you keep doing this a few times, and then you get

39:42 eventually down to where they use two-by-two switches, if you do

39:47 the recursion all the way down, as was done on the previous slide,

39:53 right? So now you have four-by-fours, and depending

39:57 upon what the optimum size in terms of cost for a switch might be (we talked

40:04 about that last time), depending upon the technology and the bandwidth, the number of ports you may be

40:11 able to get on the switch is different, and they have

40:16 tended to grow a little bit over time. This kind of animation just shows what

40:21 happens when you keep doing the recursion, and we can do it all the

40:25 way to a very limited number of ports, um, in the switch.

40:32 It also has some properties that I don't show on this slide: it's such

40:38 that you have an ability also to route around congested areas. So there's more

40:46 than a single path between source and destination. And that's also the case for the,

40:53 um, butterfly network that I just showed you, and the fat trees, so you can

41:01 route around congested areas in the network. I think that holds; there's more in the

41:15 slides. You can look through the slides if you're interested, but the point is that it

41:20 has these features allowing for routing around congestion. Other networks, like the butterfly

41:33 and some of the other ones that I showed you, are in fact blocking networks.

41:38 But these are non-blocking. That's part of the, uh, benefit of

41:43 Clos networks, as well as the Dragonfly and the fat tree. Okay, let me

41:50 see. Okay, so the Clos network, then:

41:56 it was, or has been, used, and may still be used, I'm not

42:02 sure, but there was this company called Myricom that was started by Chuck Seitz,

42:14 in fact a member of the Caltech group. Last lecture I showed you one of the

42:20 first parallel computing systems, which was kind of the start of the current generation of

42:32 parallel computing systems. There were several rounds; I would call it the third round that

42:37 actually lasted, um, and that was in the eighties. And last time there

42:44 was this Cosmic Cube that was a bit of a landmark design, done

42:50 by Geoffrey Fox and Chuck Seitz. And Chuck Seitz then, together with

42:59 one of his students, developed a routing protocol that is known as wormhole

43:05 routing, which we'll talk about in a bit. And, um, then Chuck Seitz

43:12 decided to go off and start a company, Myricom, to do low latency, high performance switches.

43:22 And then he also decided, for that company, that they were going

43:29 to use the Clos network topology for their switches. So this is just a

43:36 picture of how they did it. And one of the things that is also important,

43:42 which I have kind of skipped over so far, is how you modularize the network

43:49 so you can build the large scale networks out of standard components and not have to

43:56 kind of rewire things depending on the system that you intend to build. And

44:01 the Clos network has the nice property of being partitionable into standard modules

44:08 that can then be combined into arbitrarily large networks, or switches in this case.

44:18 And this is just a bunch of pictures of what they did. So I think eventually they

44:23 did this. I think the largest one, and I'm not sure if they built

44:28 the largest one. This company, as I said, had their

44:34 own Myrinet protocol that was again designed for very low latency, high performance. But

44:44 then, like I mentioned about Mellanox and Cray, they also, after a

44:48 few years, supported Ethernet on top of their own hardware infrastructure and low level

44:56 protocol, to get better performance out of Ethernet than your typical

45:05 commodity Ethernet switch. And they had different, what they called, spine cards

45:11 and switch cards to build up large scale systems, and there is a little bit shown

45:20 on this picture of what they did. Chuck Seitz sold the

45:25 company a few years ago, so they're not as visible now.

45:31 I think the company that bought them still produces switches, but

45:37 high performance Ethernet switches, I will say, using their own

45:44 hardware underneath. Um, so now I'll compare a little bit a few different networks.

45:56 I'm not talking about routing yet; I'll, uh, stop talking about topologies at

46:01 this point and see if there are any questions on what I've talked about.

46:10 So, just to try to summarize what it's worth remembering:

46:18 the basic constructs for doing multistage networks. The shuffle connectivity and the butterfly

46:31 connectivity, uh, are worth trying to remember, because those are used in building all

46:38 kinds of networks with slightly different properties. The Clos, or Benes, network that was designed

46:46 way back for phone systems or phone switches has proven, again, to have very nice properties

46:54 and is still in currency. It may still be used by the

47:01 company that bought Myricom. But today, in terms of building large clusters,

47:07 it's the fat trees and the Dragonfly that are the current state of the art for

47:12 how to do things. Okay, so the next several slides are

47:26 trying to make a comparison, a little bit like what I showed in terms of

47:32 the cost of networks, starting with a full crossbar and then, in this

47:39 case, going from a full crossbar to multistage networks built from switches with different numbers of input and

47:45 output ports, getting a more or less equivalent interconnection network. So I'm trying to show

47:54 a little bit, in this case using the number of switches and the

47:59 number of links as the cost measure, not necessarily wire volume or area, as I talked

48:05 about in the Thompson grid model. Um, so I think the next

48:12 slide shows a little bit of what happens, just counting switches and the

48:20 number of links, and doing this particular comparison with 4096 inputs and outputs. The

48:29 crossbar is at the top. So there you get the cost as simply the number of

48:37 crosspoints, one for each input and output pair, so basically 4096 squared. And then there are

48:45 two unique links per node, one each for the input and output ports. So

48:51 if you then start to go through the steps that I did in terms of going

48:56 from the crossbar to building up the multistage network using two-by-two switches,

49:02 um, and building the networks in stages, the relative cost of the multistage network

49:14 is, um, in some sense many times lower. So in this case

49:22 the full crossbar is at the top, and since the number of ports is large, that

49:28 means the network built out of two-by-two switches is, um, nowhere near

49:38 as expensive. You can also see the number of links gets reduced in doing these steps,

49:43 and then you can also see that four-by-four switches kind of result in the

49:48 same sort of benefits in terms of cost. But then, as you increase the

49:54 number of ports, the benefits become less, because each switch becomes closer to the full

50:00 crossbar itself.
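
The counting exercise is easy to reproduce; here is one version under the same cost measure (switch crosspoints and links only; the exact link accounting on the slide may differ from this sketch):

    import math

    def crossbar_cost(n):
        # One crosspoint per input/output pair, one link per port.
        return {"crosspoints": n * n, "links": 2 * n}

    def multistage_cost(n, k):
        # Omega/baseline-style network from k x k switches:
        # log_k(n) stages of n/k switches, with n links between
        # consecutive stages plus the n input and n output links.
        stages = round(math.log(n, k))
        switches = stages * (n // k)
        return {"switches": switches,
                "crosspoints": switches * k * k,
                "links": n * (stages + 1)}

    n = 4096
    print(crossbar_cost(n))              # 16.7M crosspoints
    for k in (2, 4, 8, 16):
        print(k, multistage_cost(n, k))  # k=2: 98304 crosspoints -- far cheaper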

50:11 And one can do some graphs showing, you know, what the trade-offs are, but I will

50:15 not dwell on this slide. I think the other part was the non-blocking

50:21 feature that I mentioned. That is, depending upon what multistage network you

50:29 end up with: again, the Clos networks as well as fat trees and Dragonfly

50:36 networks have multiple paths between each source and destination pair, so that can

50:44 help alleviate contention in the network. The other thing that is typical in this context

50:52 is what's illustrated with the Cray switch I talked about: they use it

51:02 to connect, uh, 16 compute nodes per switch in their typical configuration. So this

51:10 is sometimes called bristling, where you hook a number of compute nodes into the same

51:16 switch ports. So the interconnection network doesn't have as many leaf nodes as there

51:22 are computers; that's simply this bristling factor. And in terms of Cray, it

51:31 was 16 compute nodes hooking into a single switch, but the number of outgoing links

51:44 in their switch was 48. So there's plenty, relatively speaking, of capacity to

51:53 get data past the leaves. Now, this slide: I'm not going to go

51:59 through it, but it's just trying to summarize the things I've talked about in terms

52:03 of their bisection width, and it points out the potential area costs in building the networks as

52:12 well. Uh, yeah, it shows the typical characteristics in

52:18 one table. And I have noticed I did not put in the Dragonfly,

52:25 the way Cray does it, but it's to give you a little bit of a

52:33 sense of the networks I've talked about. This comes back

52:33 Sense off the networks. I've talked that. Uh huh. This comes

52:41 to me a little bit. This and as you mentioned a few times

52:48 some of the resource managers than take have attempted, uh, to take

52:56 network topology into account, since in of moving data, um, things

53:05 good at the level one caches and I get worse deliver to get

53:08 A level three that's significantly produced when got two main memory and then it

53:15 even one more significant reduction. When end up having to move, they

53:22 across saying to connection network. So or resource managers or even the runtime

53:36 supporting, for instance. And depending upon what it is and

53:42 yet sometimes information then from the resource and tried to have figure out good

53:49 . Of course, many their interaction of data dependent and you one cannot

53:58 know at compile time or a data . And yet they have placement

54:07 What the best allocation is respect to . That may not be so

54:12 But again, the dragon fly, well as the factories, are try

54:19 alleviate and provides a lot more uniformity terms of data accesses than some of

54:28 earlier networks, which is part of reason why they have taking a

54:34 As I mentioned, the factory network not been used for over 30 years

54:39 building systems of all scales and the Flying Network has been used for on

54:46 . I think mostly by Craig. anyone is free to construct their own

54:51 according to the Dragonflies principle, but factories are something being predominantly

55:01 I do not know what the current SLURM does in terms of trying to use

55:06 topology awareness; SLURM grew to some degree out of this so-called PBS resource

55:16 manager. So that was just some more information. This has been

55:26 an issue that people have struggled with, but they have also tried to remove the

55:30 problem by trying to build networks that are less sensitive to the placement.

55:41 So next, a few minutes about routing, unless there are questions. Okay, let's start

56:04 talking about routing. So there are many different ways of thinking about routing.

56:13 Uh, so one is deterministic routing. Basically, that tends to be

56:20 the case in a lot of networks, including many Ethernet networks, where you have

56:29 set the routing tables in each switch, which tell, depending on what the destination is,

56:35 what outgoing port should be used to forward messages. The other one is the

56:43 opposite: it's randomized routing, and I'll talk about that in the next few

56:49 slides. And of course, there is adaptive routing that tries to,

56:57 um, adapt and choose routing paths according to the latency or the load

57:06 or congestion. And there are other things that I did not put on here. I guess

57:14 I also kind of left out what's known as minimal routing, which tries to route

57:19 along the shortest distance, uh, but that is not necessarily optimal. Then there is the so-

57:27 called store-and-forward, which I will talk about, the virtual cut-through, and the wormhole routing

57:32 that I mentioned, which was something that was either invented, or certainly popularized, by Chuck

57:43 Seitz and, um, Bill Dally. And then there's something called dimension-order routing that I'm

57:51 not going to talk much more about either, except I want to say: if

57:55 you have something like, say, a mesh network, where you have very well-

58:02 defined dimensions, in terms of the dimensions of the mesh, dimension-order routing

58:09 tries to route along the dimensions in a fixed order: you know, first X, then

58:13 Y, then Z, or whichever dimension order one chooses. But it does them one

58:21 at a time: it routes fully along each dimension before switching to routing in a different dimension. In principle one could

58:28 go back and forth between them, but not in dimension-order routing, right?
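
A sketch of dimension-order (XY) routing on a mesh, with node coordinates as tuples; the mesh shape and naming are just assumptions for the illustration:

    def dimension_order_route(src, dst):
        # Correct one coordinate completely, in fixed dimension order,
        # before touching the next; never return to an earlier dimension.
        cur, path = list(src), [tuple(src)]
        for d in range(len(src)):
            step = 1 if dst[d] > cur[d] else -1
            while cur[d] != dst[d]:
                cur[d] += step
                path.append(tuple(cur))
        return path

    print(dimension_order_route((0, 0), (2, 3)))
    # [(0,0), (1,0), (2,0), (2,1), (2,2), (2,3)] -- all of X, then all of Y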

58:35 And then, of course, there are all the nasty issues of deadlock and live

58:38 lock, and I won't talk much about them either. Except hopefully the idea

58:44 is clear that in deadlock, messages get stuck, uh,

58:53 waiting for each other, like we talked about in terms of, I guess,

58:59 um, sends and receives in MPI: if you're not careful,

59:06 one can cause deadlock, where every message waits for some other message to move. Livelock

59:15 is the opposite, um, in the sense that messages get injected into the network,

59:24 but some never manage to arrive and basically circulate in the network forever.

59:33 And those two aspects, deadlock and livelock, are

59:39 a reason why there has been great caution about using adaptive routing in these networks,

59:55 since proving that adaptive routing is deadlock and livelock free is not necessarily easy.

60:07 But adaptive routing is, in fact, used by Cray in their networks, and

60:18 I'll maybe come back to that at the end of the lecture. Any questions on

60:26 this kind of labeling of different types of routing, before I start talking a little bit

60:31 about each? Okay, so here is a little bit... anyone that has taken a

60:42 networking class knows about TCP and IP and message structure, and

60:55 IP headers, IPv4 and IPv6, et cetera. So the

61:06 main point for me in bringing up this slide is to talk

61:13 a little bit about what you may not have seen before, which is flits

61:20 and phits. So there is a lot of attention paid, again, to performance in protocols for these

61:36 networks for clusters and MPPs. Um, so the header sizes are

61:47 a big issue, so one tries to have small headers, because many times, and particularly if you

61:56 think about synchronizing processes across nodes, the payload is very small, so overheads are

62:10 a serious issue. And when I talked about the Blue Gene machine, in fact,

62:17 they built a dedicated network for synchronizing processes, so they didn't have to

62:23 deal with the normal message passing in the data networks. Also, in terms of

62:30 the switch that Cray built, that was why there are, in fact, five different crossbars

62:40 in one chip, dealing with different aspects of the communication, in order for, for instance,

62:47 synchronization and control messages to not be delayed by, or interfere with, the data. So

63:02 now, flits are commonly used, and there are flit-oriented routing protocols; and what a flit

63:15 is, is kind of the smallest unit of flow control. The flit is then made

63:23 up of a number of phits, and the phit tends to exactly match the

63:33 underlying physical interconnection structure. And that's what I said on the slide: many

63:44 times, um, one uses a very limited number of, um, physical wires,

63:53 if that's what one uses; you can also use optics, and with optics

63:57 it's a little bit different. But at the node level, in the case of

64:05 node-to-switch connections, copper tends to be the cheapest alternative, and it's

64:13 good enough in terms of performance, so one can use copper cables. So you're

64:19 kind of down to physical wires. So a phit may very often be, say,

64:25 four bits, but you don't do flow control on every four bits; you take

64:32 a bunch of them as a train going over the set of wires, and that is

64:37 the flit, on which you do the flow control.
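
In code, the relationship is just two levels of segmentation; the sizes below (16-byte flits, 4-bit phits) are made-up illustrative values, not what any particular switch uses:

    def segment(packet, flit_bytes=16, phit_bits=4):
        # A packet is split into flits (the unit of flow control); each
        # flit crosses the link as a train of phits (the unit the
        # physical wires carry per cycle, e.g. 4 bits on 4 wires).
        flits = [packet[i:i + flit_bytes]
                 for i in range(0, len(packet), flit_bytes)]
        return len(flits), (flit_bytes * 8) // phit_bits

    print(segment(bytes(100)))   # (7, 32): 7 flits, 32 phits per flit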

64:49 All right, and then, depending upon what you do, as I said, the headers, like in IP, may

64:54 have priority, quality of service, and all kinds of

65:01 other attributes encoded into them. Um, now, randomized routing:

65:14 what exactly is it? So, Valiant, that is a well-known computer

65:24 scientist, a Turing Award winner. Among other things,

65:31 what he came up with was this kind of randomized routing. And he has a kind

65:37 of joke about it. It's kind of really a joke, as far as I

65:41 know, and he is a Brit. So the idea is

65:47 that if the post office sent mail this way, when you want to send a letter from yourself

65:53 to somebody else, the post office wouldn't really try to get it directly to the

65:58 recipient, but would actually send it to an arbitrary random place first and then send it on to

66:04 where it was supposed to go. And this is, in fact, a load

66:11 balancing technique. So instead of trying to find the best path from source

66:23 to destination, messages are routed to a random intermediate destination and from there to the

66:33 final destination, and that minimizes the risk of hot spots in the network.
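
Valiant's two-phase idea fits in a few lines; the ring topology here is only a stand-in so the sketch is self-contained:

    import random

    def ring_path(a, b, n):
        # Shortest path on an n-node ring, clockwise or counterclockwise.
        step = 1 if (b - a) % n <= n // 2 else -1
        path = [a]
        while path[-1] != b:
            path.append((path[-1] + step) % n)
        return path

    def valiant_route(src, dst, n):
        # Phase 1: to a random intermediate node. Phase 2: to the real
        # destination. The randomization is what spreads the load.
        via = random.randrange(n)
        return ring_path(src, via, n) + ring_path(via, dst, n)[1:]

    print(valiant_route(0, 3, 16))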

66:43 So this is what Valiant routing, or randomized routing, tends to be. So,

66:51 again, as I mentioned a little bit in terms of the placement

66:57 of partitions among nodes, in some networks a randomized assignment of partitions to nodes

67:07 may in fact be beneficial, instead of trying to, you know, find

67:12 minimum distances. Because minimum distance alone is not telling you what performance

67:24 you're going to get: there may be some links that are shared between the

67:30 routes, and then you get congestion. So it's not so trivial to figure out

67:37 how to do the optimum placement. Randomized routing was also used in the

67:45 Connection Machine, and I mentioned a few times that it was the first machine that used

67:49 good fat trees. And it's also the company I used to work for before coming here.

67:54 And we used randomized routing in this fat tree. So as you

68:02 can see again from the stylized picture on the right, each of the leaf

68:09 nodes at the bottom had, in this case, two options for uplinks into the tree,

68:20 and then at the first level of the tree's internal nodes, each one of them

68:30 also had two uplinks. And, uh, one level above that,

68:39 each node has both four uplinks and four down links. But the lowest layers, in this

68:43 case, had two uplinks and four down links. So it was a version

68:48 of the fat tree that is not the full fat tree. But the way the routing

68:53 was used in this tree is to randomly select one of the uplinks when one sends messages up,

69:01 so that balances the load. And then, once you get to the

69:06 lowest common ancestor in the tree needed to reach the proper leaf node,

69:11 it's a dedicated path from that turnaround point down the tree to the leaf node.

69:17 So randomized routing on the way up, but deterministic routing on the way down.
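
The up/down split can be sketched like this, for an idealized fat tree whose leaves are numbered 0..arity**levels - 1 (the real machine had the mixed uplink counts described above, which this sketch ignores):

    import random

    def fat_tree_route(src, dst, arity=2):
        # Climb until src and dst fall in the same subtree (the lowest
        # common ancestor), picking an uplink at random at every level;
        # the way down is then the unique deterministic path.
        up = 0
        while src // arity**up != dst // arity**up:
            up += 1
        uplinks = [random.randrange(arity) for _ in range(up)]
        return {"hops_up": up, "random_uplinks": uplinks, "hops_down": up}

    print(fat_tree_route(3, 12))   # e.g. {'hops_up': 4, ...}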

69:26 The other machine that I'm aware of that used some form of randomization was

69:35 the HEP, the Heterogeneous Element Processor, that was designed by the

69:42 very well-known computer architect Burton Smith. Um, and he used one of these kinds

69:51 of dance hall approaches to building the parallel machine, so he had an interconnection network between processors

70:00 on one side and memory modules on the other side. So what he did was,

70:04 he randomized the allocation of data to the memory modules in order to try to minimize

70:11 the chance of hot spots in the network or for the memory modules themselves. And then

70:19 one machine that kind of used randomization was also this Fluent Machine. It was designed

70:27 by one of my students when I was at Yale University, and then it was

70:31 actually built by some people, but it never was a commercial success.

70:37 But one of the interesting parts of it, which was adopted by others in terms of

70:44 programming models, I would say, was that this machine had parallel prefix as

70:49 a basic instruction. And that was picked up, uh, in a programming language by a

70:57 fellow called Guy Blelloch at Carnegie Mellon, who adopted this idea and showed the power

71:03 of prefix operations as basic instructions. Uh, and, all right, so that's what I

71:11 wanted to say about randomized routing. Um, store-and-forward routing

71:18 is the typical way things are done in many networks. It's not used

71:24 in these high performance networks for clusters, but certainly in local and wide area networks

71:32 it is very common, and I'll try to illustrate it on the next slide, I

71:37 believe. And then there is kind of an improvement on the store-and-

71:41 forward, known as virtual cut-through. And, um, the difference

71:48 is, uh, that in virtual cut-through, you don't necessarily store the message

71:57 at each hop. So I will show that on the next two slides,

72:02 which try to give a graphic illustration of the store-and-forward routing with

72:07 some text on it. So store-and-forward routing simply means that the source

72:14 first decides, you know, where to send the message, based on

72:18 some routing algorithm and knowledge of the network. Either there are some

72:27 routing tables, or it has some knowledge of the network. And if it's some kind of

72:32 adaptive routing, then it may also know something about the traffic, possibly,

72:40 but it depends on whether it has just a local view or a global view of what goes

72:44 on in the network. But the point is, essentially, that it sends the packet to

72:50 the next node, and when that node gets the packet, it looks at it.

72:58 The first thing that happens, in fact, is that it gets put into a

73:04 buffer or memory, and at some point, depending on what the policy for routing

73:11 is in the switch, it takes a look at the header and figures out what

73:15 priority the packet has and where it's supposed to go. And then it puts the

73:21 message in an output buffer, the queue for the output port where it wants it

73:27 to go. But the point is that at each stage in the routing, the

73:33 packets get stored in memory, retrieved from memory, and inspected. So

73:40 every packet ends up enduring a round trip to memory. The virtual cut-through tries

73:49 to be a little bit smarter: as soon as the switch gets the header, it

73:55 inspects the header, and if it turns out that it knows where the message

74:04 should go, and, in terms of buffering, the output buffer is free in the

74:08 router, it does the allocation into the output buffer and sends the header on its,

74:16 uh, merry way, and effectively builds a kind of circuit for the trailing part of the message.

74:28 So maybe there are flits that follow, and as long as there's nothing

74:36 that stops the header, all the rest of the packet gets forwarded immediately and does

74:43 not endure a round trip to memory. However, if the header gets stuck,

74:50 as it shows a little bit lower, or in the middle, I guess, of the

74:55 rows here, then what happens at that point is that the whole message gets

75:01 stored where the header can no longer move, and it doesn't release that buffer.

75:11 But along the path it has been using before, up to the router where things get stuck,

75:20 it gets a faster forwarding through each of the routing switches, as long as the header

75:25 doesn't get stuck. But when it gets stuck, at that point it gets

75:30 stored, and then you have to redo the process to try to eventually get to

75:35 the destination. And then, wormhole routing is, um, as I was saying, an improvement

75:47 on the virtual cut-through, in the sense that the message, with

75:57 all its flits, gets stopped in place if the header gets stuck.

76:06 So the message, with all its different flits, is in fact spread out across the routers

76:17 in the network, so none of the pieces endures a round trip to memory in any

76:24 one, or any, of the switches on the way to the destination.

76:31 So what happens in this case is, um... so the good part, I

76:36 guess, is that it doesn't endure a round trip to memory. The potentially bad

76:41 part is that buffers in the routers along the way are still occupied if the

76:50 header gets stuck. However, if you have reasonably good congestion control,

76:58 it turns out this wormhole routing has been shown to be

77:04 quite effective in these computer networks that are used, again, internally in clusters and

77:12 MPPs. So wormhole routing has kind of been the norm for quite a few

77:18 years in high performance networks.
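
The latency difference is easy to quantify with the textbook model; the packet, header, and bandwidth numbers below are arbitrary illustrative choices:

    def store_and_forward(hops, packet_bits, bw):
        # Each hop receives the whole packet before forwarding it.
        return hops * packet_bits / bw

    def cut_through(hops, packet_bits, header_bits, bw):
        # Only the header pays the per-hop cost; the body streams behind
        # it (the unblocked case for virtual cut-through and wormhole).
        return hops * header_bits / bw + packet_bits / bw

    hops, pkt, hdr, bw = 5, 8000, 320, 100e9   # 1 KB packet, 40 B header
    print(store_and_forward(hops, pkt, bw))    # 4.0e-07 s
    print(cut_through(hops, pkt, hdr, bw))     # 9.6e-08 s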

77:27 I think this is kind of my last slide for today, and it tends to be a

77:34 little biased in flavor towards this Cray network, but that was more or less based

77:44 on the relative ease of finding some data for it. So what I wanted

77:53 to say with this slide is essentially the things that they did: in addition to designing

77:59 their own switch for low latency, they, in fact,

78:07 do not use the InfiniBand protocol, nor a pure open standard

78:13 like Ethernet; rather, they took the Ethernet protocol and made their own,

78:19 what they call high performance computing, or HPC, Ethernet, that has, ah, less of a

78:28 header overhead; as far as I remember, a 40-byte header instead of a 64-byte one.

78:33 And so that's one of the things they re-engineered in terms of the protocol.

78:39 So they re-engineered the whole protocol relative to Ethernet. Then they use adaptive

78:46 routing in their network. And again, they used the Dragonfly network, where there are

78:51 lots of redundant, or optional, pathways for routing messages, and there is no more than

78:59 a distance of three hops between endpoints in the network. Um, and this slide

79:07 just shows the effectiveness of their adaptive routing, as well as of separating the traffic between control

79:14 and data movement. So, in the bottom graph, it basically

79:22 shows the gain from using their particular congestion control and adaptive routing protocol, and pretty much

79:28 everything wins: all the tasks, regardless of whether it

79:36 was a synchronization, or, um, a many-to-one, that is, kind

79:44 of a gather type operation, right, or an all-to-all kind of

79:50 operation. Um, everything ended up completing sooner than it did without this

80:00 particular scheme for adaptive routing and congestion control. So with that, I hope I'm giving you

80:09 a flavor of the networks, and of the fact that there is a lot of attention paid

80:15 today to how to design topologies, but also to the routing, I

80:22 think, and the congestion control that is built into the switchgear that they design.

80:38 So, time is up, and I'll take questions. So, ah, one more

80:52 thing. So if you run jobs at some point in your future on

81:01 these clusters, hopefully you would appreciate that the placement of the different processes that you

81:12 use for MPI makes a difference in the kind of best case scenario of what you

81:18 observe, and the fact that, in general, the network is a shared resource. So that means

81:26 you can get impacted by other jobs running on other nodes in the system, because even

81:31 then, sections of the network may be shared, and you don't necessarily get the same

81:40 performance. Um, and that comes back to the question of whether you can kind

81:44 of wall off, if you want, a piece of the network that cannot be impacted by other traffic in

81:50 the network. So you may see some variability when you do benchmarking or measure

82:02 performance in these systems that comes from the network itself, unless you manage, or are

82:13 lucky enough, to get your own network partition, to be allocated a part of the network that is basically

82:22 not having anything in common with other parts of the system where other jobs are run.

82:43 Okay, thank you for today, and let me stop sharing my screen.

82:56 So then I guess I'll stop the recording as well.
