00:00 All right. So today will be a little bit about three topics. It's more of a survey; it will probably give you some ideas about important aspects, in terms of HPC, of certain types of computations that I'll just put a label on, since they are all parts of linear algebra. So let's get on with it.

00:37 So these are the three pieces I want to get to. The reason for the first one is that it's something I know some of you are working on in another form — not necessarily the decomposition itself, but dealing with matrices. And the point about this one is that, unlike matrix multiplication, for which the work of the computation is evenly distributed across the index space — that means the two axes, if you like, rows and columns, for matrix-vector type operations, or, as I mentioned, for matrix multiplication there are really three coordinates to think about; that's what I mean by the index space — when it comes to LU, the computation is not evenly distributed. For that reason there is an issue of how you do data allocation across clusters. So that's the point I will try to bring across for that topic.

01:51 Then I was asked to say a little bit about matrix transposition, if I could say something about it, so I want to say a few things about that. And the last thing is something that also behaves very differently compared to dense matrix operations, and there are many uses of it; it is important not only in science and engineering but also, as it turns out, in machine learning, so it is something of interest to many students. So that's the plan: I'll try to make some points that hopefully you can use, maybe as a guide for whatever you want to do in and outside this particular class. All right: Gaussian elimination.

02:38 Hopefully you remember it, but in case you don't: I think it was probably in high school, the method we learned to solve systems of equations by eliminating variables. You multiply one of the equations — in this case pictured as a matrix — and eliminate variables from the other equations by multiplying with the proper multiplication factor, so that when you do the subtraction the variable disappears from the equation, making the system smaller. So in this case, using the first equation as what's known as the pivot row, multiplying it with a suitable factor and then subtracting the result from the second equation, the coefficient in front of the variable x1 disappears; you do the same for the rest, and then you keep doing it step by step. In that case, after this first step, you can focus on the little three by three subsystem towards the lower right hand corner, and this is kind of an illustration of the subsequent steps. So this is what hopefully you did learn about how Gaussian elimination works.

04:05 So this is known also as a factorization method, and sometimes it also carries the name of LU decomposition. Take the matrix of equation coefficients — that would be what's on this slide, to the left of the equal signs, not including the x's — so that's the set of coefficients of the matrix. After the steps of Gaussian elimination, as we can see in step three, the matrix is basically upper triangular, and that's the U in the LU notion of the LU decomposition. And it turns out the L is the multiplication factors you used to force it into an upper triangular matrix. So that's that, and I'm sure I will alternately say LU or Gaussian elimination without thinking too much of

the distinction.

05:08 So there are a couple of things I'll try to point out and talk a little bit about; the first issue relates to this slide. It kind of illustrates the elimination, where, as with this little picture towards the lower right hand corner, there is this kind of square matrix, and after a number of steps of the Gaussian elimination one is done with the white area above the diagonal towards the left, as well as the similar white area towards the top of the square. And then at each step here, one basically has to figure out what the multipliers should be. In some cases you can just use the equation corresponding to the kind of purplish-reddish row in the middle as the pivot row, and then find, for each one of the equations below, what the multiplier should be, and use that with the pivot row to eliminate things in the blue area in successive steps. After that step you have a smaller matrix left; that would be the green matrix, the trailing matrix.

06:57 Now, the problem with that, the thing that tends to lead to poor performance, is that finding the multiplication factors, computing those, and then using a multiplication factor to scale the pivot row and subtract it from each one of the other rows, is something that produces what are basically BLAS-1 vector operations. That in itself is not necessarily bad, but it generally makes poor use of the memory hierarchy. So the fact that these are rank-one or scalar operations, and updating the green part of the matrix that way, is something one wants to avoid.

08:00 The scheme for avoiding that is to try to cast things so that the LU decomposition, in fact, mostly uses matrix-matrix multiplication as a sub-function. And the way to do that is: you don't need to, for each row of the reddish area, do what ends up being an outer product updating the green area. You don't have to update the green area until you need it to find new pivot rows and the new set of multipliers. So one can delay the update of the green area until it is needed. This is the basic strategy to get to something that can use the memory hierarchy much better. In this case, basically, in these blue and reddish areas one may treat things one row and column at a time, but you don't do updates to the green area until you have finished figuring out what all the multipliers are for each one of the rows in the reddish area. And then, it turns out, when you update the green area, it is the light blue rectangle times the kind of reddish rectangle on top, and now that is basically a matrix-matrix multiplication being used to update the green area. So this is the trick used in all the library routines, or by anyone who wants to write an efficient matrix factorization or Gaussian elimination code.
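
As a concrete illustration of the delayed update — my own sketch, not the code from the slides — here is a blocked, right-looking LU factorization in Python/NumPy, without pivoting (so it assumes the matrix is, say, diagonally dominant). The panel, corresponding to the blue and reddish areas, is factored one column at a time, and the trailing green area is updated only once per panel, with a matrix-matrix multiply.

```python
import numpy as np

def blocked_lu(A, nb=32):
    """In-place LU without pivoting (assumes it is numerically safe, e.g. a
    diagonally dominant matrix). L sits below the diagonal (unit diagonal
    implied), U on and above it."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        b = min(nb, n - k)
        # Panel factorization: unblocked, one pivot column at a time (BLAS-1/2 work).
        for j in range(k, k + b):
            A[j+1:, j] /= A[j, j]                                     # multipliers
            A[j+1:, j+1:k+b] -= np.outer(A[j+1:, j], A[j, j+1:k+b])   # update inside the panel only
        if k + b < n:
            # Finish the U block to the right of the panel ...
            L11 = np.tril(A[k:k+b, k:k+b], -1) + np.eye(b)
            A[k:k+b, k+b:] = np.linalg.solve(L11, A[k:k+b, k+b:])
            # ... then the delayed, rank-b update of the trailing ("green") matrix:
            # one matrix-matrix multiplication, which is where most of the flops live.
            A[k+b:, k+b:] -= A[k+b:, k:k+b] @ A[k:k+b, k+b:]
    return A

rng = np.random.default_rng(0)
A = rng.random((200, 200)) + 200 * np.eye(200)   # diagonally dominant
LU = blocked_lu(A)
L = np.tril(LU, -1) + np.eye(200)
U = np.triu(LU)
assert np.allclose(L @ U, A)                     # A = L U recovered
```

How large to make the block size nb is exactly the tuning knob discussed next.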

10:23 I think I have a couple of slides on that, and then I'll stop for a second and see if you have questions. This is kind of an old slide, and the point is essentially to see what happens if one does what was proposed on the previous slide: the blue line is what you get using this kind of trick, and it is compared to the performance of a well designed matrix multiplication that uses the memory hierarchy well. You can see, for most of the machines — that is what's on the horizontal axis — the LU pretty much behaves, or performs, as well as the matrix-matrix multiplication. So that's kind of the take-home message of this slide.

11:12 Then one can play around with figuring out how much to delay the update of the green area: should you do four, or whatever number of rows and columns, in those light blue and reddish panels before you update the green one? This slide shows, for one of the machines we had some data from, that going from rank-one, or one column at a time, to doing, say, 32 rows and columns at a time, we got a speedup on the order of a factor of 3 to 4 for this particular computer. And this is another one, from the HPC Challenge website, where the matrix multiply and the Gaussian elimination columns are color coded in matching colors. Again, you can see that the efficiency of, in this case, the Gaussian elimination is comparable to, or not too much less than, the matrix multiply efficiency.

12:30 So that's what I wanted to say about one way to cast, or try to cast, Gaussian elimination so it uses the memory hierarchy well, namely by delaying the updates of the green area. For those of you who are interested, there are different versions of this strategy, and if you look in the literature you will find something called left-looking or right-looking algorithms, depending upon to what extent the updates of the green area are delayed.

13:16 Okay, so now I'm switching to why it's kind of hard to do things on a distributed system and how to get efficient use of the memory in that setting. The problem here is a reason to talk about something we didn't talk about before. When it comes to problems that use regular arrays — two-dimensional, or it doesn't need to be just two-dimensional operations, but essentially things with a very nice, simple structure — here is an illustration of different ways that people use, and there are standard allocation schemes being used, for instance in MPI, for how arrays may be allocated across the nodes.

14:23 The first cases here deal with things being blocked, thinking of the whole collection of nodes as just a pool, regardless of how they're physically interconnected. The upper left hand version shows the block layout, with the blue-bordered rectangle denoting what's assigned to one processor; I think you talked about that in terms of matrix-vector and maybe matrix-matrix multiplication in the previous lectures. Now, if you think about the cartoonish example of Gaussian elimination in the lower right hand corner, the problem when it comes to Gaussian elimination is that after a number of steps, processor zero — the blue one in layout number one — will have nothing to do. So basically you lose processing power, or cannot make full use of it, because the computations are not uniform across the index space in this case.

15:37 So then another thing one can do to make things better, so processor zero gets to be part of the business for a longer time, is what is known as a cyclic layout. Basically, instead of taking a block of columns and assigning them, you go, in this picture, round robin across the columns with respect to the number of processors you have. After you have used up all four processors on the first four columns, you come back and the next column gets assigned to processor zero. So in that case processor zero has things to do until very near the end, when there's not enough left to do. But then the problem with that — and I think I tried to say something about this — so the problem with the second, cyclic layout: yes, it keeps processor zero busy for the majority of the time, like the other ones, but you cannot quite use the trick of delaying operations over multiple columns at the same time this way. And that's quite significant. I clipped in something I showed a long time ago in some early lecture on the difference between the scalar approach, doing things one row and column at a time, and doing things with matrix multiplication — it's a huge difference in performance, so that's not what one wants.

17:27 So the next thing one realizes, to be able to do block operations and still do this, is the block cyclic layout. Now one gets the benefit of having processor zero being part of the business for a long time while still using the block type operations. Then one can try to do things a little bit better for load balance as well and go to a two-dimensional layout — not just the plain blocked two-D one. If you remember, when we talked about matrix-vector and matrix-matrix multiplication, we showed that the two-D layout resulted in less communication, which is the weakest part of clusters. So it turns out that somehow the winner is this thing where one does the block cyclic layout in two dimensions. That is something you will see a lot in many of the dense linear algebra type parallel implementations.
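
To make the three one-dimensional layouts concrete, here is a small sketch of my own (not from the slides) of how columns get assigned to processes under the block, cyclic, and block-cyclic schemes; the owner_* helpers are hypothetical names that just compute which process owns a given column.

```python
def owner_block(j, n, p):
    """Block layout: the first ~n/p columns go to process 0, the next to process 1, ..."""
    return j // ((n + p - 1) // p)

def owner_cyclic(j, p):
    """Cyclic layout: columns are dealt out round robin, one at a time."""
    return j % p

def owner_block_cyclic(j, p, nb):
    """Block-cyclic layout: columns are dealt out round robin in blocks of nb,
    so process 0 stays busy late into the elimination and block operations still work."""
    return (j // nb) % p

n, p, nb = 16, 4, 2
print([owner_block(j, n, p) for j in range(n)])          # [0,0,0,0, 1,1,1,1, 2,2,2,2, 3,3,3,3]
print([owner_cyclic(j, p) for j in range(n)])            # [0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3]
print([owner_block_cyclic(j, p, nb) for j in range(n)])  # [0,0,1,1, 2,2,3,3, 0,0,1,1, 2,2,3,3]
```

The same idea applied independently along rows and columns gives the two-dimensional block-cyclic layout mentioned above.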

18:53 I think the next few slides just illustrate a little bit how this works out in terms of the LU decomposition, and again, I want to cover a few topics today, so I'm going pretty fast, but I will stop if somebody has questions, after one or two — well, actually several — more slides. So I'll still stay with this LU. Here is just another slide showing the different steps, and you can think of these little squares in this set of slides as being the template showing where processes are assigned to the matrix elements; each of the squares represents an allocation of data across all the processors. And this is, again, a little summary of the block cyclic layout. Now, one thing that is often missed when using this layout is what I'm going to talk about in the next few slides, and then I'll stop and take questions if there are any.

20:17 So here is another illustration, where I think of it as a 16 by 16 matrix assigned not block cyclically but simply in a block order to a four by four processor array — so no block cyclic allocation. In this case the color coded things — the first ones in yellow — mean I'm thinking of the processing as being done using two by two blocks, so two sets of two columns and rows, to be able to do some higher-rank operation. Now, in order not to lose load balance, what you can in fact do is the following: instead of doing the next set of rows and columns in place, you move on to do the next set of relevant rows and columns in the next processor row and column. And this is perfectly legal depending upon, in particular, whether you have what's known as a diagonally dominant matrix, where you don't need to worry about pivoting — the order is still stable. There are also ways to build on this if you need to do partial pivoting, as it's known, to preserve accuracy in Gaussian elimination.

21:55 So then you just keep doing this — you keep going, and it wraps around until you come back to where you started the cycle. In the end you get something that is equally load balanced. But for other parts of the computations — because usually you don't do one thing like a little LU in isolation; it's part of some bigger code, so to speak, that does many different operations — maybe the block cyclic layout is not what you want; maybe the standard block allocation would be preferred. So this just shows that you have an alternative way to get load balance for situations when you deal with triangular matrices. It does have certain consequences, because the pivoting order was chosen differently than just going through things in row and column order wrapped around, so it causes a permutation with respect to how the data is allocated among the nodes. One needs to be conscientious of the permutations that result, and I won't go into it here.

23:25 In effect, doing this block cyclic elimination order on sort of block-allocated matrices is the way we built this library on the Connection Machine, at the company I worked for before coming here. And honestly, we were not paying attention when we did it, so we were a little confused about what the results looked like, because we didn't pay attention to the fact that what happens is that things get changed from basically row or column major order in this process. One can easily track where things are, and if one works it out and rectifies the permutation, one can figure out what the permutation is — which I have left out of these slides, but you can look at it. I don't want to get too much into it, because I wanted to cover a few more topics. So this is just one of the things that makes you think about alternative ways of doing business to take care of the load balance across nodes. The bottom line of the slide was that one can use a block cyclic elimination order on data allocated in just the standard block order, or you can use the block cyclic allocation and then do a sequential elimination with respect to the processors.

25:11 And this is an illustration of how it actually worked on this machine that the company built. What I tried to point out is the scalability. If one looks at the little numbers in the left hand corner, it is, you know, megaflops per node — this was many years ago, so it wasn't gigaflops at the time. But you can see that going from one node to a thousand nodes in the machine, the execution rate per node does not drop that much, so the scalability was pretty much close to perfect over a range from one to a thousand nodes. And, with a little bit of personal pride, I have to admit that this very system was the number one system on the very first Top 500 list ever published. So it was the most powerful machine at the time, as per the Top 500 list.

26:16 I think that's it about the LU decomposition: how to make efficient use of the memory hierarchy for LU by using delayed updates, and then how to get load balance, either by doing the so-called block cyclic allocation, or instead by choosing the elimination order, if you are free to choose it, or otherwise doing the processing in a block cyclic order with respect to the processing nodes while keeping a block assignment of the matrix rows and columns. So, any questions on that?

27:00 assignment off, Um, favorite roles columns. So any question on that

27:11 quickly X position just to give you things to think about if you work

27:18 these areas, okay, So matrix And in fact, a lot of

27:30 has got into matrix transpose er now go into depth about that either.

27:35 just, uh, give if you as of how things can be done

27:43 . This is clearly the very naive off what The Matrix transpose, in

27:50 , is right. That's what elements to be exchanged and yes, it

27:57 , but it usually doesn't perform very as the points out here. There

28:02 get cash misses a lot on you even make very poor use off cash

28:16 . So suppose things are in the major order, then? Yes,

28:22 they read the cash line from you got a few elements in the

28:27 role. But then so if you look at the first short error

28:37 the two elements of their then when goes to the second row of the

28:43 , that may be very far away the memory. And so that causes

28:50 the year on page fault Typically if that's the case, to get

28:55 second element on in principle in this , you we're only used, at

29:03 for the second roll. It would use one element in that cash

29:09 because then you're going to when you to the right. If it's

29:14 Major ordering the first cash line you may have that element about when you

29:22 to the third role in this, you're not using anything in the cash

29:29 contained the first element in the second , so and it's not likely to

29:36 good and most architectures. So the kind of first type of fix is

29:44 then try to do things in in block order. So in this

29:54 you can unload and this little square , they then make use off.

30:04 , what the data elements that you in a single cash line stretch azi

30:11 as things and fits in the cash , and it's kind of illustrated a

30:17 bit on this side. Then you of do that respect to the cash

30:25 . So going from so in a one cash, then you also do

30:32 at the register level. See, the level one cash to read things

30:37 the register file that you can basically and right back to the level one

30:44 without going to a level one cash than once. And it's a little

30:50 of deceit. And on what? this illustration, because it's kind of

30:56 for the upper left hand corner that kind of a diagonal block. But

31:01 you kind of move down The then the two whites blocks here,

31:07 one to the right on the corner lock in Amman below other ones that

31:13 going to be swapped in the So they both need to kind of

31:17 fit in cash. Eso one is kind of be in reality, worked

31:24 terrorize off these blocks and make sure fifth in the cash and into the

31:29 thing. Possibly respect, um, with several levels of cash. And

31:35 also we respect to You're on page , maybe sensitive exactly how you choose

31:43 white box or the pairs of white . In order. Thio minimize either

31:50 Iran page false or cash, Mrs . So and then the other kind
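
Here is a minimal sketch of the tiling idea (mine, not the slide's code; in Python it only illustrates the loop structure, since the cache effect matters in a compiled language). The naive version walks one of the two arrays with a large stride, while the blocked version swaps tile pairs small enough that both tiles can sit in cache at once.

```python
import numpy as np

def transpose_naive(A, B):
    """B = A^T, element by element; one of the two arrays is accessed with a big stride."""
    n, m = A.shape
    for i in range(n):
        for j in range(m):
            B[j, i] = A[i, j]

def transpose_blocked(A, B, nb=32):
    """B = A^T tile by tile, so the source tile and the destination tile it maps to
    can both stay resident in cache (and ideally in few DRAM pages)."""
    n, m = A.shape
    for i0 in range(0, n, nb):
        for j0 in range(0, m, nb):
            # A real implementation would add another level of blocking into registers here.
            B[j0:j0+nb, i0:i0+nb] = A[i0:i0+nb, j0:j0+nb].T

A = np.arange(48, dtype=float).reshape(6, 8)
B1 = np.empty((8, 6)); transpose_naive(A, B1)
B2 = np.empty((8, 6)); transpose_blocked(A, B2, nb=4)
assert np.array_equal(B1, A.T) and np.array_equal(B2, A.T)
```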

31:58 things that it's commonly used also, is kind of easy conceptually, is

32:05 recursive type of make X transposition. I said, You know, back

32:13 I talked about major response apply early . It was the idea that these

32:20 algorithm what kind of perfectly match or the perfect job for cash hierarchy by

32:29 the recursive partitioning into smaller, smaller That then corresponds thio the different levels

32:36 cash. Unfortunately, when I came metrics multiplying practice for, um kind

32:44 more detailed engineering, it's just of things were implemented. It didn't kind

32:51 pan out. So for respect, matrix multiplies the block. Detective algorithms

32:57 always tended to beat or has so always beat the recursive algorithms. And

33:04 why most algorithms or library implementations for efficient metrics multiply. Do not use

33:12 recursive style off matrix multiplying. But there was about my six transpose or

33:20 student of mine. All right. ago, you played around with this

33:26 hey implemented them both the naive version recursive version. And in this

33:34 three different blocked versions with different block for different magic sizes or on the

33:42 axis is, um, the log the matrix sizes and on the vertical

33:53 is the number off cycles per So that means lo is good on

34:02 is not good, because that means cycles per element. So as we

34:07 see, the naive version didn't do at all. And unfortunately, the

34:14 one. As people discover whether the month supply Yeah, it,

34:21 wasn't all bad. Eso I did than a naive one. But

34:28 the block versions on these two processors we looked at did better.

34:39 um, let's see, I think had a couple of other slides about

34:43 metrics. Transpose So, yes. I think any questions, though,

34:54 this, that's this single processor matrix algorithm. Yeah, I have one

35:05 Actually, there is one more I'll talk about next, and then I'll talk about the parallel version. So, quite a few years ago, this fellow Eklundh was doing image processing, and image processing people use a lot of FFTs. At the time, computers were just not all that powerful; they didn't have many levels of cache — mostly register files and maybe some on-board memory, then main memory, and the main memories were not large, so you had to deal with disks. So it was a fairly simple memory hierarchy. And, like today with the programming languages C and Fortran, data is stored either in row or column major order. So he was concerned with how to minimize the accesses to external storage, which, as we know, is expensive from a performance point of view, and it's also expensive from an energy point of view. So it was about trying to come up with a scheme to minimize the round trips to external storage, given whatever storage he had, that was reasonably fast and energy efficient. This algorithm ended up being well known, and it, as one would say, stood the test of time. And so that's why I included it, as a good idea, I think, to consider.

37:06 What the algorithm does, as illustrated for this little four by four matrix, is work on pairs of rows at a time. What's required for this simplest version of the algorithm is reading a pair of rows — the hope is that they fit in memory — doing some swap operations on that pair of rows, and then writing them back out to the external storage. Then you go through all the rows of the matrix pairwise. In this case there will just be two steps, or two iterations, if you like. In step one — that is the left column of the figure — the first thing is to read a pair of rows and do the swap operations on the first two rows; the swapping done in these first two rows is shown. Then the same is applied to the next pair of rows in this first step, basically stepping through eventually all the rows of the matrix: when the second pair of rows is in the fast memory, the same swap is done on them. Now, clearly, the job is not done yet. So then there is what they call the second step — shown on the right for this four by four matrix — and that ends up being the final thing that needed to be done. In the second step we still work on pairs of rows, but now, instead of pairing rows one and two, the pairing is one and three, and two and four. So what's on top in the right column is the pairing of one and three, and then below, the pairing of two and four, showing what the swaps are. And if you look at it now, in fact, the transpose is done.
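
Here is a small in-memory sketch of that pairing pattern as I understand it (my own code, with the whole matrix in memory; the point of the original scheme is that each pass only needs one pair of rows — or a block of rows — in fast memory at a time).

```python
import numpy as np

def pairwise_transpose(A):
    """Transpose an n-by-n array, n a power of two, in log2(n) passes over pairs of rows.
    In the pass with block size b, row r is paired with row r+b, and b-wide segments
    in the upper row are exchanged with the matching segments of the lower row."""
    A = A.copy()
    n = A.shape[0]
    b = 1
    while b < n:
        for r in range(0, n, 2 * b):
            for rr in range(r, r + b):
                lo = rr + b                      # the partner row in this pass
                for c in range(0, n, 2 * b):
                    # Swap the top-right sub-block with the bottom-left one.
                    tmp = A[rr, c+b:c+2*b].copy()
                    A[rr, c+b:c+2*b] = A[lo, c:c+b]
                    A[lo, c:c+b] = tmp
        b *= 2
    return A

A = np.arange(16).reshape(4, 4)
assert np.array_equal(pairwise_transpose(A), A.T)
# Pass 1 pairs rows (0,1) and (2,3); pass 2 pairs rows (0,2) and (1,3),
# matching the 4x4 example described above.
```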

39:25 And then in the paper — it's a short paper, only a couple of pages — he works everything out, but he worked it out in the context of the radix-2 FFT and of needing to access external storage. So it's natural there to assume that every size is two to some power; that is natural for what they call a radix-2 FFT. He notes as well that, instead of just doing one pair of rows, if you have a little bit more memory you can do blocks of two to the power j rows, and it still works with powers of two, because again that was useful with the matrix sizes themselves being a power of two in the number of rows and columns. So then he also worked out what number of passes over the external memory was necessary to accomplish the transform. So this is a very classic algorithm that is still popular. And, I think, in particular, given how things are stored in memory — row or column major order — it is still not a bad thing to think about. And in fact, if you also worry about DRAM page faults, where blocks of rows, depending on how big the matrix is, may be sitting in the same DRAM page, you may want to work on things that happen to share DRAM pages and jump between them so as to avoid constant DRAM page faults. I was trying to find — I'm sure some people have tried to work it out with respect to cache lines and cache sizes as well as the DRAM page faults — but in the time I had to put something together for today I didn't find good references to point to. So this is a pointer, at least, for those of you who are dealing with data that you may need to reorganize: go and look at these things, because, as illustrated with this slide, you can have quite significant performance differences depending on how you do them. So that's what I had for, more or less, sequential or single-node matrix transposes, and then I have a couple of slides for the distributed transpose. Any questions on the single-node transposition algorithms?

42:30 Okay. So these are two slides for distributed versions, and this one is for how to do it when the collection of processors is configured as a two-dimensional array. In this case the collection of processors was configured as an array with four rows and six columns, and it was done on a square matrix — in this case a 12 by 12 matrix. But it is applicable to much larger matrices if you think of the 12 by 12 as blocks as opposed to single elements. Now, if the processor array had been square, things would have been even simpler than on this particular slide. I chose this one, which is not square, to illustrate a little bit of the messiness, or complexity, of what happens when things have different shapes. The paper quoted at the bottom, the reference, also has illustrations of how the algorithm works out when there is no common divisor except one between the number of processors assigned to rows and to columns; in this case there is a greatest common divisor of two between the processor counts, but it gets even messier otherwise.

44:37 So what's happening? In this case, what is shaded is the part of the matrix elements, or blocks, assigned to processor zero. And when you transpose the matrix, what happens to them — where are they supposed to go? If you look at the numbers to the left of the squares and on top of the squares, they reflect the block cyclic allocation of the matrix elements. In this case, if you go in the row direction, because of the block cyclic allocation, the first element, or block, in a row gets assigned to one processor, the next one to the next processor, etcetera, and when you have gone through all the processors in the processor row, you come back to the first processor for its next block. That's why you see this cyclic ordering in the numbering: there is a step in the index by the number of processors in your row, and when you go down a column the index is cyclic with respect to the number of processors in the column, so it is incremented by four every time you get to the next block within a given processor.

46:31 After the transposition, you are still supposed to have the block cyclic allocation of, now, the transposed matrix, and I try to illustrate where things went by means of color coding. There seems to be one red thing missing — but that's just block 0,0, which stays where it is. Basically I color coded the elements of the first column of the matrix blocks, just to illustrate where those blocks went; block 0,0 doesn't go anywhere, it stays where it is. But if you go down the first column of the matrix, then block four in that column — in column zero — is not going to be in the first column after the transposition. So with a block cyclic allocation that element is not going straightforwardly to the corresponding processor; that's what you see in one of the arrows. And then, if you look at the next one, which is block number eight: it gets allocated to processor eight mod six, and that is where it ends up. So this is the way that things get scattered around, and the communication pattern is not all that trivial if you have this block cyclic allocation.

48:25 trivial If you have this block cyclic eso this is and kind of this

48:36 code that FORTRAN code that they wrote the time is kind of illustrated here

48:41 to do. But there's a lot in in the paper for anyone.

48:48 ? No. And I also put one that kind of for the binary

48:55 you. Because that's still even though in dimensional cubes are not used all

49:01 much anymore in terms of the interconnection , it also works on very nicely

49:07 related networks like the Butterfly Networks. at one point that is not the

49:18 in the other algorithm that I just about Is that for this kind of

49:26 dimensional networks and well, today is of more than one day too.

49:32 you have multiple passed between source and of the things, so you can

49:38 then figure out to use the most resource, which is they're commenting communication

49:47 , and not just to use a link off the links for a

49:54 But you can divide up the blocks is supposed to be communicated between a

50:03 of processors to send them along different . So you can try to end

50:09 using the full bandwidth interconnection network by the amount of data between the pair

50:20 as many chunks as matches the number parallel paths you have between the two

50:28 on. Then making have to be little bit careful in how your schedule

50:33 in order to avoid contention. But that wants to look at that there

50:39 also a couple of papers about that and that I think I was when

50:45 had about The Matrix transports, stop for a second and before I

50:54 a little bit of our sparse They won't be that much, but

50:59 to make some points that I think be useful to some of you and

51:05 questions on the matrix transports. It's quick highlights. And there are

51:18 . And also on the blackboard uploaded of the references that are the bottom

51:24 the various slides used so you can them and gold, and you will

51:31 a lot more. Okay, so I want a little bit about supports

51:38 Okay, so now I want to say a little bit about sparse matrices, and I have about 20 minutes left. So I won't be able to cover a whole lot more than the most important representation aspects, which many of you may not be familiar with but which are important. Those of you who do some unstructured science and engineering codes — finite element, finite volume methods — may be familiar with it, and some of you who have looked at machine learning papers may be familiar with it, but I know from some of my own students, when they started to read about it, that they were not familiar with sparse matrix representations. So hopefully it will be useful to some of you, and if there's any time left I'll talk a little bit about how one goes about using those representations. All right, so here's a little bit about some pointers to where you can find information and even library software. The Berkeley group is one good place to look, and there is a lot of work on linear algebra, including sparse matrix stuff.

53:07 Sparse matrices basically, in the end, come from some type of topological representation, and I talked about this when I talked about partitioning: you model things as a graph, from which we get the matrix that reflects it. And here's what one can do when you have graphs, which we did in terms of the incidence matrices before; then you can do things like, you know, the Laplacian matrix that we talked about before. And here I had some example from computer graphics — I'll skip that — it's interesting, you can look at the slides and see what one can do in terms of it.

53:57 So now, getting to the data structures a little bit. The first thing is to do things using the regular grids — I'll go very fast here — and then I'll talk more about more arbitrary or unstructured sparse matrices. There is one thing that is necessary, as you know, in finite difference methods, which is to approximate the derivatives in the differential equations, to discretize them in space and over time. And then, if you happen to use a regular finite difference grid, you get some of this structure, and we used it in a number of examples. In that case, of course, you do the matrix representation by linearizing the index space: if this is your x-y grid, when you do your matrix representation and your indices are mapped into one dimension, you get the vector of the variable values at all the grid points. So you take this two-dimensional index space into a one-dimensional space, and you get matrix representations of this flavor: you get diagonals of elements that are non-zero. And eventually, in this two-D case, we get a block tridiagonal structure of the matrix, where the size of the blocks depends on the size of the grid that you're using.
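
As a small sketch of that (my own, not from the slides): the 5-point stencil on an n-by-n grid, with the two-dimensional index linearized to a single index, gives exactly this kind of block tridiagonal matrix with blocks of size n.

```python
import numpy as np

def laplacian_2d(n):
    """Matrix of the 5-point stencil on an n-by-n grid; grid point (ix, iy) is
    mapped to the single index k = ix * n + iy (row-major linearization)."""
    N = n * n
    A = np.zeros((N, N))
    for ix in range(n):
        for iy in range(n):
            k = ix * n + iy
            A[k, k] = 4.0
            for dx, dy in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                jx, jy = ix + dx, iy + dy
                if 0 <= jx < n and 0 <= jy < n:
                    A[k, jx * n + jy] = -1.0
    return A

print((laplacian_2d(4) != 0).astype(int))
# The nonzero pattern is a few diagonals: an n-by-n block tridiagonal structure,
# with the block size set by the grid dimension n; everything else is zero.
```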

55:53 So now there are some quick examples that I will just flick through — also, if you're not familiar with it, to give you a bunch of these pictures. In this case you can see how the matrix may actually look for the triangular grid on top, and I'll come back to this with more illustrations. Here's another typical sparse matrix. And for those of you who are interested: in the reference slides, or the references towards the end of the slide deck, there are sparse matrix collections that people use — they were used when we talked about partitioning as well; people test their partitioning on them. So there are a couple of libraries where you can download matrices from, and here is one coming from one of those. The point is that anything that is color coded is something that is not zero, and the white spaces are zero elements.

57:00 So now I wanted to cover the storage schemes, at least, in this lecture. Here is what's known as the compressed row storage scheme. You can imagine from the examples that show the structure that the number of non-zeros in a row may be quite small. That reflects the connectivity: if you have a graph problem, you may have quite large graphs — millions or even billions of nodes — but usually the connectivity relationship in the graph for most nodes is quite limited. So the number of edges from a node may not at all scale with the number of nodes in the graph; the degree of the nodes may be quite limited. You may have a few, or maybe tens, of edges in the graph for each node, with some having fewer and some a little bit more. And, for those of you who do some climate-type problems and model things on the sphere, you will know that if you look at the globe, then close to the poles you get a pretty high fan-in to the poles, because all the different discretizations sort of have a common point at the poles, and that's why, when you make grids for the globe, you try to treat the pole areas differently from the rest of the sphere, for instance. Anyway, this is just to say that typically the number of non-zeros in a row is very limited, regardless of the size of the matrix. That's why one doesn't want to store the zeros. But then, for the matrix operations to work, you really need to know where the elements are if you leave out the zeros.

59:15 So here's this compressed row storage scheme, illustrated at the bottom for this matrix. What it does is store only the non-zero elements. So the first array here — the element values are just chosen so you can easily see where each comes from in the matrix. Then, since the number of non-zero elements in the different rows is not necessarily the same and they sit in one single one-dimensional array of non-zero elements, one thing to do is figure out where each new row starts. So there is a row pointer that tells where each row starts in this linear array structure, and that's what you see at the bottom. And then, for each one of the non-zero elements, you store the corresponding column index. So now you actually have a way of figuring out exactly the coordinates of each one of the non-zero elements, and you only store non-zero elements. The cost is basically one array of non-zero values — those may be floating point, or maybe integers, depending on the data types you use — and then you have a column indices array, an array of integers, for which you may need a bunch of bits depending upon the size of the matrix. But usually it's at most about two times the number of non-zero elements in terms of memory footprint, and often less, because there may be less space needed for the column indices, even though there are as many of them as there are non-zero elements. So this is one common storage scheme. And, of course, one can do it by column instead; then you just have column pointers instead of row pointers, and you store the row indices for each element.
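
A minimal sketch of compressed row storage (my own tiny example; the values are chosen so it's obvious where each one came from, in the spirit of the slide):

```python
import numpy as np

# A small sparse matrix; value (row+1)*10 + (col+1) makes the origin of each entry obvious.
A = np.array([[11,  0,  0, 14],
              [ 0, 22, 23,  0],
              [31,  0, 33,  0],
              [ 0,  0,  0, 44]], dtype=float)

values, col_idx, row_ptr = [], [], [0]
for row in A:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)       # only nonzero values are stored ...
            col_idx.append(j)      # ... each with its column index
    row_ptr.append(len(values))    # row pointer: where each row starts in the arrays

print(values)    # [11.0, 14.0, 22.0, 23.0, 31.0, 33.0, 44.0]
print(col_idx)   # [0, 3, 1, 2, 0, 2, 3]
print(row_ptr)   # [0, 2, 4, 6, 7]
```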

61:32 And this is another variation on that, where some algorithms treat the diagonal a little bit differently than the other elements, so it may be convenient to pull out the diagonal, and that's what's done in this one. There's no magic to it.

61:49 Another one: sometimes — again, depending on how you use the sparse matrices — you may in fact want to store things using the so-called coordinate format. Now it's a little more costly, because for each non-zero element you store both row and column index. On the other hand, it's fairly quick to get everything you need about each one of the elements: you don't have to do a search, or look through row or column pointers; you can directly address what you want.

62:31 Then there is a well known format, ELLPACK, which again came out of the science and engineering community — there is an elliptic solver package, which is what it stands for — and this particular storage scheme is also used occasionally in the machine learning community. What it does is still keep a matrix-like representation of the sparse matrix, but it compresses the rows, so it stores only non-zero elements; it then generates a rectangular array, and you need a second array that mimics the array of non-zero elements of the original matrix but instead holds the column indices for those. So the row index is implicit, and you have to make the column index for each non-zero element explicit. And then, it turns out, this was designed to be efficient for vector architectures. So in this case you do things by rows, and you get a vector length that corresponds to the length of a row, or the length of a column, depending on the orientation, but it allows for fairly regular and very efficient memory accesses, particularly when you have vector instructions.

64:27 talked about something, I mentioned the prefix operation. Um, on this

64:38 Gabriela on seeing you, the design of a programming language that uses the

64:45 and so primitive. And then they up with a particular efficient stories game

64:50 use in parallel prefix operations and processing agencies. Now, one thing that

65:02 , um, very well commonly used what's known as the skyline or profile

65:12 , in which case you store elements . I think of a role.

65:22 , so if you move left from to right in a row. When

65:26 hit the first non zero element in row, you obviously store it.

65:34 then you store all subsequent element in role. Um, even if its

65:41 . So that's what the skyline of things. And so this kind of

65:47 lines, you know, going from diag along upwards or left words trying

65:54 illustrate that in this case, there the number of elements of the sparseness

66:00 being stored. And in that it includes cereals. So it makes

66:07 more efficient because now you only need store where and it all starts and

66:13 kind of the length of the But then all of the say if

66:18 go through all of the column industries implicit, so they don't need to

66:22 explicitly storm. So that off the sufficiency, both in terms of storage

66:30 sometimes in terms of the processing, you can use vector operations even though

66:36 variable length factor operations Um Okay, , then I guess talking about the

66:47 vector and then I have one more . I wanted to try Thio point

66:54 . So here's a little bit all of stuff you want to do.

66:58 here is now how you would For instance, they come pressed role

67:04 . So the point of this is top. Is the three races exactly

67:09 compressed go formulation and then how, you're doing the Matrix factors? So

67:16 basically sports matrix times the dens vector now sends, um, if you

67:25 through enroll off the Matrix than the that are stored and the and ands

67:42 array. They are not in successive , so you need to figure out

67:48 elements off X you need for that matrix off the road. And that's

67:54 you end up using indirect addressing that kind of marked in red here or

68:00 gather operations. So these gather scattered that I talked about that way back

68:06 the past is very intrinsic and important terms off forest matrix operations. And

68:17 , it's not something that the memory tends to like, so those are

68:25 not very high performing and computer architect's used all kinds of triggers to try

68:33 figure out how to make gather operations goodness that can be, But this

68:42 to show that it becomes a little more complicated and unfortunately, kind off

68:48 slope compared to anything you do with May agencies on, um, there's

68:55 if you turn it around when you the column oriented one, you need

69:00 . They gather in this case for wine X may be fine, but

69:05 they also need a scatter operation in of storing the results back. So

69:11 intrinsic and this force metrics are these carried type operations or indirect adjusting on
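
A minimal sketch of the row-oriented sparse matrix-vector product on the CRS arrays from above (my own code); the x[col_idx[k]] access is the indirect addressing, i.e. the gather, being discussed.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x with A in compressed row storage."""
    n = len(row_ptr) - 1
    y = np.zeros(n)
    for i in range(n):
        s = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            s += values[k] * x[col_idx[k]]   # gather: x is read out of order, through an index array
        y[i] = s
    return y

# The small CRS example from before.
values  = np.array([11., 14., 22., 23., 31., 33., 44.])
col_idx = np.array([0, 3, 1, 2, 0, 2, 3])
row_ptr = np.array([0, 2, 4, 6, 7])
x = np.array([1., 2., 3., 4.])
print(spmv_csr(values, col_idx, row_ptr, x))   # [ 67. 113. 130. 176.]
```

In a column-oriented formulation the same loop needs a scatter into y instead, which is the other half of the point made above.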

69:20 And here is an illustration of the impact; anyone interested can take a look at it — it may be relevant for something you're doing in other classes or projects. And this is how you would use these notions with the segmented scan; again, on some machines, for instance on the Connection Machine, we had this scan primitive operation, and it was also something that was worked out in a thesis.

70:02 Now, the other point is something that is important — oh, this is something borrowed from a group at Berkeley. The point is that for sparse matrices it's hard, because of the structure, to get a whole lot of instruction-level or data-level parallelism, because you have the indirect addressing and things are not nicely laid out as a block. So things tend to be limited by memory accesses, and making good use of memory accesses is a higher priority, I would say, than trying to figure out how to use SIMD instructions or VLIW type instructions. It turns out that doing this — focusing on the memory accesses, or trying to block the sparse matrix — in fact also benefits how well you can use the instruction set of the processors. So the message is: just try to use block operations, and the next several slides try to illustrate that with a few examples. And this — I'll try to illustrate it more by pictures than by talking through this particular slide — makes those points.

71:34 So here's an example of the structure that may appear in a common structural analysis problem. You can see that it's definitely a sparse matrix, and you can also see a simple structure in this case. But now it turns out that each one of these little blue dots is not really a dot if you focus in on it; in fact it has a block structure, and in this case it looks like a densely filled spot. That is why, instead of trying to do element-wise sparse storage, you basically treat the blocks as the elements and treat each block as a dense matrix. So this is the trick. And it turns out that this is what you want to do even if these blocks are not dense, but not totally sparse either, right? So this shows a little bit of playing around with that; the factor of four here is called out as the winner among the block shapes they tried for this particular case. And here are more examples. Here's another example of looking at things, and in this case the blocks are not really dense. Here is a more careful look, and again you still see a lot of structure that you can take advantage of, but it turns out, as on this slide, that it's cheaper to fill in the zeros and treat these blocks as dense. It means you waste a little bit of storage in terms of the number of elements you need to store, but on the other hand you don't need to store either the row or the column index for each element, because now you know it's a block. So you save on index storage, and you save on addressing, because it just increments by one in the row or the column direction. So there are lots of gains in how the code actually works by doing this blocking, even though it means storing a bunch of zero elements.
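
A sketch of the blocked idea (my own, a block compressed row storage with 2-by-2 blocks and hypothetical helper names): blocks are stored dense, zeros included, with one column index per block, and the inner update is a small, regular dense operation.

```python
import numpy as np

def spmv_bcsr(blocks, block_col, block_row_ptr, x, r=2, c=2):
    """y = A @ x with A stored as dense r-by-c blocks (block compressed row storage).
    Explicit zeros inside a block are stored, but only one column index per block is kept."""
    nbrows = len(block_row_ptr) - 1
    y = np.zeros(nbrows * r)
    for bi in range(nbrows):
        for k in range(block_row_ptr[bi], block_row_ptr[bi + 1]):
            bj = block_col[k]
            y[bi*r:(bi+1)*r] += blocks[k] @ x[bj*c:(bj+1)*c]   # small dense block times a short slice
    return y

# A 4x4 matrix seen as 2x2 blocks; the two all-zero blocks are simply not stored.
A = np.array([[11., 12.,  0.,  0.],
              [ 0., 22.,  0.,  0.],
              [ 0.,  0., 33.,  0.],
              [ 0.,  0., 43., 44.]])
blocks        = np.array([A[0:2, 0:2], A[2:4, 2:4]])   # stored dense, zeros and all
block_col     = [0, 1]
block_row_ptr = [0, 1, 2]
x = np.array([1., 2., 3., 4.])
assert np.allclose(spmv_bcsr(blocks, block_col, block_row_ptr, x), A @ x)
```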

74:05 And that's that. So this shows a little more of it. Then I wanted to say one more thing, even though I'm pressed for time, because the next point matters, and then I'll take questions. Here's what I wanted to point out as well: the structure of these sparse matrices is not inherent. Well, to first degree it's inherently given by the graph, but the mapping between the graph and the matrix has a lot of degrees of freedom, and that impacts how the sparse matrix looks. So here are a bunch of orderings: some use the so-called minimum degree ordering, other orderings, etcetera, and there is not enough time to talk about them, but I'll show you the pictures of the consequences of playing around with this, the ordering of the nodes in the graph. First, this is a little bit about what the natural ordering gives: somehow people labeled the graph one way — that's on the top half of the figure — and it shows the number of elements that ended up in the factorization; in this case, on the top, that's about 300,000 or so. And if you use the other ordering, the minimum degree ordering, it looks more complicated, but in fact, in that case, it dropped from about 300,000 to about 200,000 non-zero elements in the factored matrix in this particular case. The rest of these I am skipping, just to show you pictures. Here is one from some design at SLAC, the Stanford Linear Accelerator — I guess they designed some observation chamber of the accelerator — and this represents something that came out of that. Again, with a different ordering, this becomes a lot nicer to look at. There are more examples in this deck, and I'm over time.

76:16 So I will stop on that and just try to point out that, when you work with sparse matrices, I would say the first order of business is to figure out, since you have labels on the graphs, how to find a good scheme for labeling the graph so that, in later stages, you are able to use some form of block-structured sparse matrix, and then to represent that block-structured sparse matrix with one of the schemes I have talked about. And that's, I guess, my spiel on sparse matrices, to think about for anyone who works with them in any way. And with that, I will stop the lecture and take questions on anything related to today's lecture or the class.

77:09 Sorry it was a high speed run, mostly making pointers to things that hopefully are useful in other parts of your work or studies. Okay, let's see — there are a couple of chats. Let's see. Okay: for those of you who may have checked too late, the presentation date was finalized to the 15th, in the morning, from nine to 12, and the lecture has been — um, but I will stop there.
