00:02 Good afternoon, everyone. We're very pleased to have Janice today to share with us what has been very helpful mentoring to many of our other PhD students and other groups thinking about data and what it means. Sometimes one plus one is three, and I think that's probably one of the indirect benefits we'll have today. So, Janice, please take it away.

00:26 Right. So first of all, thank you all for the invitation. I'm very pleased to be here to get on my soapbox. Like I said, we will have the slides available after the class, and we are recording, so you can refer back. I've tried to pack a lot of material into this. We should be able to get through all of it, but if you have questions, or you want to look back at anything, we'll make sure that you have that opportunity.

01:04 So, let's see — what are we going to talk about? This could be useful: could at least the PhD students here just tell us a little bit, maybe just two or three phrases, related to their area of research?

01:21 Oh yeah — so first, a word about me. I am here at the university; I am in the computer science department, but I also teach at the Honors College in the Data and Society program, and I am also an instructor at the D.S.I., the P.D.S., in their micro-credential programs.

01:44 Broadly, I'm interested in the sort of machine learning, deep learning kind of problems, with an emphasis on the data side. So it's not so much about the methodologies; it's about the steps that you have to take to make sure that you get usable data that you can then work with. And that comes from my background: my PhD was in math. You know, I'm a recovering mathematician-statistician. So that's basically the flavor of what I'm going to talk about today, and why it fits.

02:37 And before I go through it, I'm just going to ask each PhD student to tell us their area of research; that might also be useful context. So I'm just going to call out the names. Just tell us the area of research — not a lot of details.

02:52 Okay, so we'll start. — Hi, my area of research is high-performance computing. I'm trying to optimize and speed up machine learning algorithms using OpenMP and MPI. Thank you.

03:13 Hello. My research area is computer vision — more specifically, detection: facial emotion recognition. I'm trying to improve one of the facial expression algorithms in machine learning. Thank you very much.

03:27 Hello. My research was previously based on data analysis of Google Scholar data. Thank you very much.

03:41 Robin? — Hi, I'm doing my research in the area of machine learning for super-resolution. Our goal is to introduce new methods and improve their interpretability by evaluating neural models of super-resolution. Thank you very much.

03:51 Hello, Professor. My research is in 3D modeling and visualization. Thank you.

04:11 My research area is geospatial analysis: epidemiological analysis and data fed into epidemiological models. Thank you.

04:19 Thank you. — My research area is in computer graphics; more specifically, augmented reality, virtual reality, and also medical robotics. Thank you.

04:37 My research area is in textual summarization of lecture videos, and my first step is to extract keywords. Thank you.

04:42 I think we ran out of people here, and now it's all you.

04:53 All right. That's quite a broad range of subjects, as befits a top computer science department. But hopefully, with what I'm going to talk about, you will see that it could be applicable to all of the things that you're doing. And one of the reasons why is: at some point, whatever it is, whatever kind of research it is that we're trying to do, we're going to have to evaluate it. And at some point we're going to have to think about some kind of statistical way of talking about what the results are, and whether, you know, they show an effect — whether they're really showing something happened or didn't happen, better or worse than the baseline.

05:40 So that is partly what I want to focus on. This is what this lecture is going to be focused on, and it's supposed to go a little bit deeper: give some intuition and some discussion about why is it that we use the statistical methods that we use, and what kind of misconceptions there are — common misconceptions that are easy to fall into — so that you can avoid them.

06:15 So the slide that I have here — the reason I have it is because I like this quote. I mean, obviously it shows that computer science, even from the early days, was keenly aware of the fact that you have to think about how you frame the problem, that you have to think about your data. If you just run an evaluation algorithm blindly — if you put in wrong figures, the right answers have no chance of coming out. Right? Garbage in, garbage out.

06:50 So there are basically four questions that always keep in mind when I'm working

06:57 something and these are listed right here these are sort of the things that

07:02 want to go over because a lot times we don't necessarily pay close attention

07:11 these, the answers to these questions this is not just students, this

07:17 I mean you could find published papers good scores and you would be surprised

07:25 by the fact that they didn't pay attention to these four questions. So

07:31 , this is what what we're talking . So you're going to look at

07:35 kind, there's going to be some of statistical summary. I mean it's

07:39 to have like the mean reported or going to have some kind of,

07:44 know, the media and some kind parameter estimation from your distribution. Does

07:49 say what you think it says? we on the same page with the

07:57 ? The next question is it could a whole bunch of tables that talk

08:02 the statistics of the data that you the evaluation or even maybe your research

08:08 , if your problem is amenable to , does that give you the full

08:14 of what you're trying to understand when collected this data or are things

08:21 Then the third question is, there's to be some statistical tests, there's

08:27 to be someone talking about p values accepting the alternate hypothesis or power.

08:36 like that. Um, there it's very common to misunderstand what is

08:45 described there. If you don't pay attention. And then finally, even

08:51 all those things are right, you always ask yourself if the question itself

08:57 you're asking makes sense and we have example in the end that I think

09:03 make this a little bit clear. these are the four types of things

09:09 I want to try and talk about and hopefully um it will make a

09:15 bit more sense once we go through details. Okay, so part

09:23 Okay, so part one: let's talk about statistical summaries. Here's what I mean when I say, think about averages. Let's say that you want to figure out what the average weight is. You compute a number, right? There are many ways to do that. You could look at the mean, you could look at the median, you could look at the mode. What does it mean when you look at that number? What does it mean when you say, "I want to see if my weight is average"?

09:58 Let's all look at this multiple-choice question. Do we think that one of these answers is the right one? Do we think that all of these answers are the same one? Do they all produce the same number? And if they don't produce the same number, what does that mean? These answers are clearly very different ways of thinking about what it means for something to be average. But at the same time, these are all equally valid ways, because each might be asking about something different. It depends on what you want to do with the answer. It depends on what you're trying to get out of the data. Whenever you're producing a summary, you're necessarily focusing on some aspects, and you're de-emphasizing other aspects. So you have to think about what it is that you're measuring.

11:01 So these things emphasize different aspects of your data. If you're looking at something like the mode of your distribution, then you're saying: what's the most popular value? That's different from saying: I want to understand how the data splits. When you talk about the median of a set of data, what that says is: it splits your values, your observations, in half. Half of the observations are below the median and the other half are above; it gives you a threshold in the middle. That is not the same as "what is the most popular choice?" — it is not necessarily the same as the mode. So this gives you a central position; it answers a different question.

12:06 Okay. And obviously, you have to be able to order your data to even answer that question; if your data is categorical, you don't have any sense of the median. And these two only agree if you have a symmetric unimodal distribution. So the median and the mode are not always the same. That means that the answer to the question "what's the most popular value?" is not the same as the answer to the question "what observation splits the data in half?"

12:42 And then you have the interquartile range. That's a measure of dispersion. Again, it says: what band of values splits the data in half? So you have two values — one is the 25th percentile, the other is the 75th — and half the values are inside the band, half are outside. That's a different way of thinking about spread for your data. It's not the standard deviation. The interquartile range is, again, a 50-50 split: 50% of the observations are inside, the other half are outside. That gives you a different measure of what it means to think about how spread out your data is — which is not the same as the standard deviation you use when you're doing your statistics. Right?

13:46 So depending on the kind of question that you're asking, the answers that you get will be different. The mean is yet another way of thinking about it. It says: let's look at the total sum, the total value, and then let's turn this into a standardized measure by dividing by the number of observations. So this is a comparison measure. It says: how does the total sum compare to the count? And it's an easy thing to compute, so people use it all the time — but that doesn't mean that it is always the most appropriate measure. If the distribution is not symmetric, the mean is not the same as the median, so the mean doesn't tell you where half your observations split. You could have a skewed spread for your distribution.

14:54 Think about something like income. There are some people that have very high worth, very high incomes — in the billions. Most of the people do not. If you ask for the median income — half the people have less than that and half the people have more — that is going to be much, much less than if you ask for the mean income: let's sum up the income of all the people, divide by the number of people, and see what the mean value is.
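A minimal sketch of that point, with made-up income-like numbers: the mean, median, mode, and interquartile range of the same skewed sample answer different questions.

```python
import numpy as np
from scipy import stats

# Hypothetical incomes in $1000s: mostly modest, one enormous value.
incomes = np.array([20, 25, 25, 30, 35, 40, 45, 60, 80, 5000])

print("mean  :", np.mean(incomes))    # total / count -- pulled up to 536 by one person
print("median:", np.median(incomes))  # 37.5 -- splits the observations in half
print("mode  :", stats.mode(incomes, keepdims=False).mode)  # 25 -- most popular value
q1, q3 = np.percentile(incomes, [25, 75])
print("IQR   :", q3 - q1)             # width of the band holding the middle 50%
```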

15:40 Is that clear so far, what I'm talking about? Does it make sense? Right. So here's another way in which the mean and the median are different, and why it's not always enough to just look at one of them — you might want to look at both.

16:04 The mean is the flip side of the median. The mean is the point of view of the house, if you're thinking of gambling. The mean tells you how much profit the house — the casino, the lottery — is going to make per gambler. It looks at the total profit that they make and it divides it by the number of people that play. It is not telling you how much profit each gambler makes on average. It is not saying how many people are going to profit or lose. You could have 99% of the people win and one person lose, and still have the casino make money, because that person lost so much more — or vice versa. So it's not about the how many; it is about the how much.

17:13 And that's the opposite of the median. The median is the point of view of the gambler: how many gamblers make a profit? If the median is greater than zero, more than half of the gamblers make a profit; if the median is less than zero, less than half of the gamblers make a profit. But this doesn't say how much — we could be talking about cents or we could be talking about millions. If you care about the how many, you need to look at the median; if you care about the how much, you need to look at the mean. They're different things.

17:59 So that's what I mean about looking at the summaries of your data: it's not enough to just look at one or two. A lot of times you have to look at all these different measures, all these different summaries, to understand exactly how the distribution of your data is shaped. These are different questions, and you should have different ways of answering them in order to get a full picture of your data. So does that make sense so far? Am I going too fast? All right — it makes sense.

18:44 So let's move on then. Now you can go back and think about what answers each one of these questions — and how would you get a fuller picture? So let's move on to the second part. We're trying to get a fuller picture by looking at all of these statistical measures, right? But the point is that sometimes you literally have to look at the picture, and this is what I mean by that.

19:23 I don't know if any of you have heard of the Datasaurus. I know one of you has, because one of you has heard me talk about this before. But it's a really interesting example that you can find online — you can click on the link and play with it, the code is available, you can run it yourself. And this is what I mean: let's look at this image on the right. What does this image do? Well, on the left part of it, it has a scatter plot — basically a whole bunch of different data sets, and you can see how the picture changes as it plots each of them. And on the right, it computes the mean value for the x variable, the mean value for the y variable, the standard deviation for the x variable and the standard deviation for the y variable, and it also computes the correlation coefficient. These are all things that we know — that we think we understand.

20:45 And if I told you that I have a data set, and it has this mean for the x, 54.26, this mean for the y, 47.83, a correlation coefficient of negative 0.06, you might think that you have a pretty good picture of what that data set looks like. And what I'm here to tell you is that all these data sets, up to this precision — two significant digits — have the exact same statistics. Every single one of those data sets, every single one of those scatter plots, produces the same statistics when you look at that level of accuracy.

21:33 So what does that say? Should we think of all these data sets the same way? Should we use the same kind of methods to try and model them, to try and analyze them? They are very different. They have very different patterns; they should be looked at in very different ways. They're not at all the same. But if we just look at the summaries, if we just look at those numbers, we're going to miss that — we're going to miss the picture.

22:18 And there are other examples; it doesn't have to be as extreme as this. The point is that even if you look at the box plot, there's information missing. It might not tell you everything that you want when you're trying to understand how to analyze the data. So it's never a waste of time to do some exploratory data analysis — to look at the plots, to look at your data in different ways, not just in terms of statistics but also in terms of graphs.
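You can check the Datasaurus claim yourself. A sketch, assuming you have downloaded the published DatasaurusDozen.tsv file (the dataset/x/y column names are those of the released file; only the file's location here is an assumption):

```python
import pandas as pd

df = pd.read_csv("DatasaurusDozen.tsv", sep="\t")  # columns: dataset, x, y

summary = df.groupby("dataset").agg(
    x_mean=("x", "mean"), y_mean=("y", "mean"),
    x_sd=("x", "std"), y_sd=("y", "std"),
)
# Within-dataset correlation between x and y.
summary["corr"] = df.groupby("dataset").apply(lambda g: g["x"].corr(g["y"]))
print(summary.round(2))  # every row: ~54.26, 47.83, 16.76, 26.93, -0.06
```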

23:06 So — anyone surprised by the fact that this happens? Is this something that seems like, "oh yeah, we knew about that"? I recently saw a paper where they tested something like this by giving students data like this, just to check whether people were actually going through the data carefully or not. I saw it on Twitter recently. It was the paper where the students were asked to verify whether a certain correlation existed, and the ones that were told to verify it did not notice that there was a gorilla in the scatter plot — while the ones that were not told found the gorilla. Right? If you see a gorilla shape in your data set, you would immediately start being suspicious about whether it's real data or not, about whether your data collection was working properly or not. So that's what we're talking about.

24:17 But what I want every one of you to keep in mind is that you can never spend too much time looking at your data. I mean, you can't just do that — you eventually have to do other things with it — but never think that it's a waste of time to look at your data and think about it carefully.

24:41 Okay, so this has a name, by the way, if you want a fancy name for it. It goes back to a paper by F. J. Anscombe from almost 50 years ago, and it started with Anscombe's Quartet, which is these four data sets here that he showed. And those show, again, what we were talking about: they all have the same statistics. But if you are looking at the data set on the top left, maybe a linear regression makes sense. If you're looking at the data set on the top right, it's clearly quadratic — you shouldn't be fitting a line. And when you're looking at the bottom left here: again, a strong linear trend, but a strong outlier. So how you deal with the data set is going to change, even though every data set has the exact same statistics.
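A quick way to see the quartet's numbers for yourself — a sketch assuming seaborn is installed (its load_dataset call fetches the quartet from seaborn's online data repository):

```python
import seaborn as sns
from scipy.stats import linregress

df = sns.load_dataset("anscombe")  # columns: dataset ('I'..'IV'), x, y
for name, group in df.groupby("dataset"):
    fit = linregress(group["x"], group["y"])
    print(name, round(group["y"].mean(), 2), round(fit.slope, 2), round(fit.rvalue, 2))
# All four datasets print ~7.50, 0.50, 0.82 -- yet only dataset I is plausibly linear.
```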

25:53 but maybe this is useful for all us. Sure. Um Is there

25:59 unity universally accepted definition of an Because otherwise I'm tempted to basically look

26:06 the data if some point is away what I'm expecting. I'm gonna delete

26:10 saying it's an outlier. What's wrong that? So we'll touch maybe a

26:17 bit on that. The next the question is about statistical testing. And

26:23 outliers are the way you define an . Right? It can mean two

26:33 things. So one thing is that you have, if you understand that

26:43 have control over how you collect your , right. Outliers are basically things

26:52 don't follow the protocol for data So if you had like humans coding

27:01 data said, right, Some humans mistakes, right? There's typos there's

27:09 sleep whatever those could be considered right? Because the method that produced

27:15 right, is different. So this the kind of thing that you catch

27:20 having some kind of sort of audit , some sort of quality assurance,

27:26 control, whether you're collecting the Right? So that's how you deal

27:32 that kind of thing. But there another definition if you will of outlier

27:39 comes like if you don't have access that, like you're presented with the

27:43 set, right? And you're looking it. Mhm. And there the

27:48 way to decide if something is an or not is by having a model

27:54 your head, in your like you how the data should look like,

28:00 ? You have to have a distribution place, you have to have an

28:04 for what you think, right? two different types of data that live

28:11 this data, right? You say for example, in this example that

28:16 have here in the bottom left you could say I trust my model

28:25 than I trust the data. I that there's a linear trend because

28:32 right? Not because I'm looking at data here, but because the process

28:36 I'm studying should have a linear right? So then if I see

28:41 that don't fit that linear trend in very obvious way, right? The

28:46 trend can give me a statistical right? The ones that failed a

28:52 test, they're outliers somehow they were by mistake. Somehow they come from

28:58 different process and I ignore them. ? But that is very much dependent

29:05 you having a suggestion about what the is like a theoretical understand prediction

29:15 Right? So you trust your model you reject that right? If you

29:22 have any strong reason. Two trust model, right? Then it's

29:31 much harder to reject data. Then what you do is you

29:38 well maybe I trust my data and need to think about different models.

29:46 I thought about the model was is and I need to think that.

29:52 most of the time the safest thing do if you see a situation like

29:57 is to seek to get more data . That's not always an option.

30:04 if you get more data, you sort of get a better stronger right

30:13 if you will about whether you should your model, right? Does it

30:19 to produce more outliers at the same or you know, do they go

30:26 ? Right. Like the new points in line with the data that you
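Here is a minimal sketch of that model-first view of outliers, on synthetic data — the linear process, the planted bad point, and the 3-sigma cutoff are all illustrative assumptions, not a universal definition:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)  # the process we claim to trust: linear
y[10] += 12.0                                  # one planted "protocol violation"

slope, intercept = np.polyfit(x, y, 1)         # fit the assumed linear model
residuals = y - (slope * x + intercept)
outliers = np.abs(residuals) > 3 * residuals.std()
print(np.where(outliers)[0])                   # flags index 10
```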

30:33 Does that make sense? — It does. It makes me even curious about another question, which is: if I trust my model, and if I know the model, why should I even bother collecting data? — That's a great question, because in traditional statistics, the reason you collect data is not so that you can see which model is better. It's so that you can estimate the coefficients of your model. So if you say there's a linear relationship — I know that distance is related to speed by a linear relationship — you don't collect data to establish that; you collect data to establish the coefficient. What's the slope for that relationship? What's that parameter? That's what traditional statistics does: it says, how can I estimate these numbers, these parameters? That's parametric statistics. If I know that this is the model — if I hypothesize that this is the model — it makes a lot of sense.

31:56 Then how do I know which relationship? How do I select the relationship between, let's say, distance and speed, if I don't know — is it linear or not? So if you're doing, let's say, machine learning, you say: okay, scikit-learn has a total of 15 different models with 20 different parameters each, so I need to set up 300 runs and I will see which gives me the highest accuracy. That's kind of how a lot of people do it — and that's not how statistics works.
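For contrast, a sketch of the parameter-estimation framing just described, with synthetic numbers (the 2.5 coefficient and the noise level are made up): the linear model is assumed, and the data's only job is to pin down the coefficient.

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
speed = rng.uniform(10, 30, size=40)                # hypothetical speeds
distance = 2.5 * speed + rng.normal(0, 3, size=40)  # true coefficient: 2.5

fit = linregress(speed, distance)
print(f"estimated coefficient: {fit.slope:.2f} +/- {fit.stderr:.2f}")
# We did not ask "which of 15 models wins?" -- we estimated a parameter.
```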

32:39 So let me move on. I think that if I move on to my third part, it will become a little bit clearer how to think about answering that. Is that okay? — Of course. Thank you. — Great questions; I'm glad that we're discussing this. So that's the third question: does the statistical test say what you think it says? Like we said, most of the time, a statistical test doesn't tell you whether the model is right — it tells you what the coefficients for the model are. So this is where we get to hypothesis testing.

33:29 Let me ask before we even get started: people have seen hypothesis testing? I'm assuming people have heard of, you know, selecting the alpha and beta values and looking at power. Is there anyone — we're not judging — but is there anyone that has not heard of, like, null and alternate hypotheses and significance and alpha values? Okay, good. So you all have an understanding, at least in theory, of what we mean when we say statistical testing.

34:17 So let me try and walk you through it a little bit, because I think this is one of the most misunderstood concepts in statistics — and it is not an easy one, and it is not just me saying that: I have a whole bunch of references, if you're interested, to show you why, and how this is leading to reproducibility problems.

34:44 So the thing to remember is that this whole thing started with a statistician called Fisher, about 100 years ago, and Fisher's attitude was that of a detective. Fisher was saying: the purpose of statistical testing is so that we can doubt the model. It is not so that we can choose the model. The purpose of statistical testing is so that we can say: do I believe that this model is a good choice? And most of the time, the answer would be no. His thinking was geared towards rejecting. In the null hypothesis, the default position is that someone is innocent — they've done nothing wrong. If you're trying to estimate a difference in means, or some kind of slope: the difference, the number, is zero. That's your null hypothesis. And then Fisher says: let's set up the rejection of that position. The alternative is not innocent — they did something wrong. The alternative is that the difference in population means is not zero. The alternative is their guilt. So our model is the baseline assumption, and we're trying to find reasons to doubt it — we're trying to find reasons to say maybe that's not a good model.

36:29 So the way that this works is: we produce a p-value. Anyone here who doesn't know how to compute a p-value, or how to get one from a program? Right — your favorite software will produce the p-value for your experiment. Again, here's what the p-value measures: the p-value is the probability that you would observe the results of your experiment if your model were true. So: what percentage of innocent people behave in a suspicious way? That's what the p-value is.

37:26 Let's try a thought experiment. Let's say you have a coin. You flip it 100 times; one experiment is 100 flips, another experiment is another 100 flips. What proportion of those experiments would show a specific degree of bias — would produce a certain number of tails or more? More extreme, right? We're looking at the false alarms. So if we say: how many times would we get 60 tails or more? The p-value is saying that about one out of 20 times that you run the 100-flip experiment, you would get 60 tails or more. That's what the p-value does; that's what the p-value measures. It says: if the coin is fair — if your model is correct — then you would see this result about 5% of the time.

39:08 , right? The sigma's it says this experiment, 55 tails or

39:16 That's one standard deviation away, 60 more. That's two standard deviations,

39:25 or more. That's three standard deviations so on. Right? So significance

39:33 to the number of standard deviations, ? We can always translate In two

39:40 deviations that the spread of the right? So the smaller the P

39:48 , they hire the significance, the sigma's And so the more you can

39:54 the model, but the more you say that's not I'm not trusting

40:00 right? That's that's it's risky to that this is the case. It

40:07 only happen 5% of the time. what are the odds, right?

40:15 , you start with the model, assume it's true, Low p

40:20 high significance means you have reason for , right? And here's the

40:28 Here's why significance is important because different and different people. I mean,

40:37 no uniform way to decide what's good . Alright? If you're looking at

40:43 polls, you know, they talk the margin of error, that's one

40:49 . And if something is above the of error, they say, you

40:54 , this is a tight race. say these people are, you

40:57 this is a clear winner in the . Well, one sigma is not

41:06 enough. If you're talking about something physics in physics, You need 5

41:15 to establish that this model should be to establish that you found a different

41:22 , right? So huge gap in people think about which is why the

41:30 that we think about reporting significance level be reflecting that, right? You

41:40 basically compute the p value, then the significance level that value has.

41:47 then you can say this is the level, right? This is the

41:52 of trust. This is the level doubt, right? 1.96 or 2.5

41:59 whatever the level is, right, that and say that's why I'm doubting

42:04 . Right? That is the responsible to talk about it. If you

42:10 it's less than 5%, right? that's the two sigma standard.

42:16 okay, but that's not really Like who made this the uniformly,

42:26 know, the universal standard for So, that's what significance is

42:34 Right? That's the way that fisher this whole process of Well, how

42:39 I know if this model makes sense my data, right? He

42:44 let's think if if the model were , would we get this data set

42:52 ? That's what fisher says. He , Let's look at this example,

43:01 ? 4% for provide. Right? means that there were 96 people out

43:09 who are innocent and are not acting and there's going to be four people

43:15 are innocent but are acting suspiciously. ? That's like the tail event.

43:20 ? That's the 60 taels or more we flip a fair coin.

43:28 This is what the P value It's the likelihood that you would get

43:35 observed evidence in your experiment if you that experiment and the model was

43:47 Right? So can anyone see why can be problematic? What's wrong with

43:56 thinking here's the what's wrong Shouldn't we in the business of looking at guilty

44:19 ? Why confuse just the probabilities of people that are innocent? We should

44:24 finding the guilty people. We should thinking about the red row numbers,

44:30 the green row numbers saying okay, doubting this because not many innocent people

44:41 that way is not the same as data set suggests that this person is

44:51 ? So Janice Are you saying we to look at the population mean population

44:56 and distribution. So what I'm saying that we need to think of this

45:03 a binary classifier problem. This is hypothesis testing, right is not simply

45:13 give me a probably give me a level because really what it is.

45:20 a binary classifier and a binary classifier a confusion matrix. There are two

45:29 types of mistakes you could be What you really want to do is

45:33 want to know if your model is positive, right? That's what we're

45:42 to understand. Is this model truly model that underlines the data. And

45:49 we think about it this way, that means that our experiments are statistical

45:56 testing is really an algorithm for There is a binary outcome. This

46:05 the right model. This is not right model. And there is a

46:10 truth, right? And what we to do is we need to structure

46:16 hypothesis tests to understand what their performance . In terms of this confusion

46:24 Right? How many times do they us the true positive? Not just

46:29 many times do they avoid the false ? Because that's what fisher does.

46:38 ? If you only think about significance you're only thinking about the false

46:45 You're not thinking about the two So how do they go through thinking

46:51 positive? I'm glad you asked because my next slides. So we have

46:58 think like lawyers, Okay, what trying to do is there's two cases

47:05 . One suppose one supposes that the is innocent, the other lawyer says

47:11 person is guilty, right? That's binary classifier. The p value is

47:19 the false positive or the false alarm . Right? This is what we

47:25 . This is by definition right? want this percentage to be small.

47:32 we also need to think about the positive rate. And if you look

47:39 this, you will find out that is called so many different things.

47:44 called rico. It's called sensitivity. called the hit rate. It's called

47:49 power of the test. And the it's called so many things is because

47:54 is very, very important and people many, many fields, not just

48:01 decision theory. I mean they've come with it again and again.

48:06 And there is no standardized way of about. But really this is the

48:11 important ratio that we have to think . If we have a statistical

48:17 it should have specific performance in terms its true positive rate. Right?

48:26 it should be as close to 100% possible. Right? We don't want

48:33 get false negatives either. Right. that make sense? So, here's

48:45 we think about the performance of a test classify. This works for any

48:52 classification algorithm, not just statistical Right? You have one axis where

48:59 plug the p value, right? you have another axis where you plot

49:05 power And the perfect classifier would have for the P value. Right?

49:11 smallest p value that you can. ? That's the highest significance,

49:17 The smaller the p the more the And it would have 100% for

49:24 Right? Remember power is a true rate And we want that to be

49:30 . Right? So it should be to 100%. So it should be

49:33 number here. And the problem is actual classifiers doesn't matter whether it's a

49:43 test or if it's a face recognition or if it's a, you

49:49 as long as you have a binary , you're going to be away from

49:54 point, You will have two types errors. There's what's called the Type

49:59 error and the Type two error. , how many people have heard of

50:04 one and Type two errors? I this stuff. Okay, so if

50:16 haven't, this is what they are . It's just a fancy way of

50:21 about. But false positives and false . Type one error is How far

50:31 From 0% are you in your P ? How far off from one

50:37 100% arguing your power. But so is a fundamental limitation that we have

50:49 . Any kind of classifier that we is always going to have To balance

50:55 these two. Yeah. And so going back, right, let's recap

51:02 this is important to get franked. we think in terms of fisher and

51:09 , right? We set up a hypothesis. If we're thinking like Nayman

51:16 Pearson, who were the statisticians that fisher and sort of built up this

51:24 statistical thinking, then we have a a main hypothesis, right, Which

51:32 that someone is not guilty, but beyond reasonable doubt, Right? What

51:40 say is you cannot be 100% sure always going to be a minimum level

51:48 ? You have to show that someone guilty above reasonable doubt, right?

51:56 it's below you're going to keep with presumption of innocence. Right?

52:03 So it's not about whether in truth is an effect, it's about whether

52:12 test is capable of detecting it. not about whether this. In

52:19 this is the wrong model. It's whether your test is equipped to reject

52:27 model. Those are two different It's not about whether someone is

52:33 It's about whether we can prove it reasonable doubt. Right? Does the

52:39 show above a certain threshold? And a key difference, Right? And

52:47 goes to the alternative hypothesis as It says, what's the other lawyers

52:56 ? The other lawyer is saying it above the reasonable doubt, it is

53:03 the minimum level, it is a effect. Right? And so these

53:10 the two competing positions that we have we run a statistical test, one

53:19 says this is the right model up a margin and the other says this

53:27 not the right model. Up to March, Right? And so

53:36 the p value, which can computer value, but this time it's not

53:42 if the evidence is true. I'm if the model is true, that's

53:46 probability of the evidence, right? is the minimum threshold now comes into

53:54 , right? If the model is right? The knoll is true up

54:01 a certain error, right? Then would see this kind of evidence.

54:12 like we said, there's two different of error. There's type one error

54:17 we think like that and there's type error, right? This has nothing

54:22 do with the significance itself. This there's wrongful convictions. We're you have

54:31 who is convicted because the evidence was , but it was actually wrong,

54:40 ? And you have guilty people who acquitted. It's not that they didn't

54:47 it. It's just that the evidence was not enough in either case.

54:54 test that we set up right, a mistake and that's we're trying to

55:01 both. Right? This is what classifier should do. We're trying to

55:07 how a hypothesis test right performs in of these two numbers. Because both

55:14 are important. So the thing to is that the p value right,

55:24 determines whether you get Whether you choose one or the other. Right?

55:31 so if we set it to stricter , right? Like fisher did The

55:38 is zero right? It's not about beyond a reasonable doubt is anyone who's

55:46 gets convicted but 100% power. You're going to miss anyone. But I

55:53 you're also going to convict everyone. ? That's not good. The other

56:00 is if you have a threshold that's high, right? If you

56:05 you know what I need to be 100%. If there is any

56:12 I'm not going to say they're guilty then no one will be a

56:18 no one will be convicted. You never be able to reject the

56:22 right? One extreme. No matter model, you're always going to

56:28 oh yeah, the other extreme, matter the model, you're always just

56:33 to say, hmm, I don't it. It's wrong. Right?

56:37 there's got to be something in right? And the way you think

56:42 that is if you plot your number mistakes and the power right? If

56:47 plot the p value and the just like we did before, you're

56:54 to get a curve, there's going be This point here where the performance

57:02 your classifier is bad because it basically all the type one errors and none

57:08 the top type two. And that's your threshold is too low. And

57:13 another friend where yours classifier does all type two errors and none of the

57:20 work because your threshold is too And as you're very threshold in

57:26 you're moving along a curve and that's the R. O. C curve

57:31 for receiver operating characteristic. It goes to Radar and World War Two but

57:36 not important. So this is what is what the situation is. When

57:42 have a classifier, when you have statistical tests a hypothesis test it's going

57:50 its performance is going to be determined the way you set up the

57:57 And so if you want to be is what we were saying earlier you

58:05 more points. You need more You're always going to do better if

58:11 have more points you never you can have too many points. Okay.

58:17 . Of course. You're limited by budget. You're limited by all kinds

58:20 things in what you have. But have to understand that the amount of

58:25 that you have puts a fundamental limit how well your hypothesis tests can

58:34 Right roo seekers become closer and closer the ideal left top left corner.

58:44 more data points you have right, how power and sample size are

58:52 So another way to think about that that this is this third ratio that

58:59 didn't talk about right? The positive value or the precision of your

59:06 this is what change is what moves R. O. C curve closer

59:11 closer to the perfect classifier. So when you think of your hypothesis

59:19 as how does it perform in terms this confusion matrix, this is how

59:25 things relate. So how does in practical terms? That's not.

59:32 can read about this experiment but let's go skip to the um to how

59:41 should work when you're doing a hypothesis . So you start by fixing two

59:51 alpha and beta al phase. The of the threshold that you want for

59:59 one errors right? It's how many conventions in there locked run are you

60:06 to tolerate? How many times are willing to say that's the right model

60:13 it wasn't. And then you have , Which is the long term probability

60:18 the type two error, right, yet acquitted, which is how many

60:25 right are you willing to except the model? Did they say the

60:34 One of them is except the wrong . The other is reject the right

60:39 . But so Usually people use this 5% for alpha and 20% for

60:48 In physics the alpha becomes more like it becomes three million and so

60:57 . But the point is that you these two values right? And once

61:02 fix these two values there is a nothing else that you can do right

61:08 is you're going to have to live the possibility that you made one or

61:13 other mistake and these are the Right? And so this is your

61:21 . You start with an alpha and beta. Do you compute the p

61:27 the p value does not tell you about significance. Now, what the

61:32 value is. Does is it tells if your test is powerful enough to

61:39 that judgment or not, If your value is not, then you don't

61:47 power. You cannot make the So you accept one of the

61:53 But that's just by default. That's the presumption of innocence. Like you

61:57 shouldn't have. That's not a good . That's a sham trial. If

62:05 power is enough, right? If power is enough, then you look

62:12 the P value and you compare it Alpha, your threshold and if it's

62:19 right, then yes, you have reason to doubt the all the main

62:27 . So you accept the alternative. it's not, if it's above the

62:33 , then it means that the opposite . You had enough reason to prove

62:42 main hypothesis. And so that's a model. So, this part

62:51 the power calculation is often not properly , but people do this kind of

63:00 . And then they talk about accepting hypothesis or the other. And then

63:04 talk about the significance of their No. When you're doing hypothesis testing

63:11 you're trying to accept one or the , you're trying to prove if the

63:15 is right or wrong. What you to say is I have enough power

63:20 make that call or my experiment didn't enough power, I should have collected

63:27 data. That's the way to go . Does that make sense? Does

63:40 help with the question that you Yeah. Thank you. Sure.

63:51 we got like what five minutes I won't keep you long. This

63:56 last thing that I wanted to talk right? So if you want to

64:00 more, if you want to read about this, there's a big literature

64:05 the subject. But so Now we to part four. This is sometimes

64:16 most surprising thing for some people. let's say this comes from a paper

64:23 in 2003. Let's say that we to study S. A.

64:28 Scores over time. S. T. S. R. Like

64:30 is for those of you who are from the US. They're the undergraduate

64:35 take them. Uh And it's a test and the score um The higher

64:45 score the better and it helps you the highest score you have, the

64:53 it is to get into college. They looked at the S.A.T. scores

64:58 1992 and again, 10 years later 2002. And they had they built

65:06 table where they said, here's the who got a plus in high school

65:13 there G. P. A. those people scored an average. And

65:21 this is mean right? We talked the difference between media and mean and

65:26 the beginning but this is the mean on the S. A.

65:30 Over the people who had an Plus G. P. A.

65:34 it was 619. And then Different of people 10 years later those who

65:41 a plus now score 607 on the . A. T. Right?

65:49 that's the way to read this. that clear? So if we think

65:55 if we look at this there's a like what happens with the S.

65:59 . T. Scores over time. ? And if we look at any

66:07 grade right? If we look at one of these grades you're going to

66:13 there's a drop, it was from 19 to 607. From 5 75

66:19 5 65. 2% draw right across grades. If you pick a grade

66:27 average dropped by 1 - 2%. that clear? I'm not doing

66:36 You can verify that right. You the numbers. This is I'm not

66:41 here. So on the average students worse on the S. A.

66:47 . 10 years later can we say ? I mean strange is not that

66:54 but it's his list. Well let's at this top the bottom row.

67:03 . The bottom row is let's look all the students without separate them into

67:13 . Okay. Just all these The average score among all the students

67:21 501. 10 years later the average is 5 16. That's an increase

67:30 3%. So now it seems like students are scoring higher on average.

67:49 have a paradox. That is a . It's called Simpson's paradox. Well

67:59 just happened. What does it tell ? What can we say? What

68:08 says is that you have to be ? Are you asking the right

68:16 Okay. This is asking the question ? When I look at any given

68:27 right, does perform as decrease and answer is yes. This says when

68:37 look at any given students, does decrease? And the answer is

68:46 it increases. Right? So the grade performance drops, the average student

68:55 increases. Do you care about the or about the students? It's two

69:01 questions. Why does it matter? is there some great inflation happening

69:10 Is that the reason there it is ? If you look at the grade

69:18 it says that the performance drops the to read this is that the grades

69:26 assigned more leniently right. That's what means for a great. When the

69:33 the S. A. T. for a particular grade, that means

69:37 people will lower S. A. . Scores get the same G.

69:41 . A. Right? So what's is if you had a great

69:46 if you're great incurs change all the get slightly better grades and all the

69:53 do better on the S. T. But now the high scores

69:58 one letter grade will be classified among low scores in the next higher levels

70:05 great. Right. And so the . A. T. Average per

70:09 grade drops. Right. That's what inflation means. So the orange rolls

70:15 that there is great inflation and the Rose shows that students are scoring better

70:23 getting better at the S. T. So you may not be

70:30 the question you think you're answering if restrict yourself to the Orange Rose and

70:41 the point that I want to That's the last point that I wanted

70:45 make. So if the question was we're trying to understand if there's great

70:51 happening, then the orange one is area to focus on. Right?

70:56 . Although then you would want to about how you structure your model,

71:04 ? Because then the letter grade becomes outcome, not the exposure.

71:12 So, but yes, that's basically of them is studying the question of

71:20 inflation. The other of them is the question of student performance. Same

71:25 said, not at all the same . Right. And this is the

71:32 thing that I want to leave you . This is the Choluteca bridge.
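The paradox is easy to reproduce. A sketch with made-up numbers (not the paper's data): every group's mean falls, yet the overall mean rises, purely because the mix of group sizes shifts.

```python
# (group mean, group size) for a "high grade" and a "low grade" group
year_1992 = [(620.0, 20), (470.0, 80)]
year_2002 = [(610.0, 60), (465.0, 40)]  # both group means went DOWN

def overall(groups):
    total = sum(mean * n for mean, n in groups)
    return total / sum(n for _, n in groups)

print(overall(year_1992))  # 500.0
print(overall(year_2002))  # 552.0 -- the overall mean went UP
```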

71:37 And this is the last thing that I want to leave you with. This is the Choluteca bridge. It was built very modern — it could withstand a category four hurricane; the standards were very exacting. It was progress. And then the hurricane hit, and it literally changed the course of the river and washed away the approaches to the bridge. So: was that the right question to ask? Did you want a bridge that could withstand a category four hurricane, or did you want a system — did you want a highway that could withstand the hurricane? It's far better to ask the right question but have an approximate answer. Maybe you could spend less money on building a bridge that wouldn't be quite so exacting, but spend some of that money on the approaches — rather than having an exact answer to the wrong question.

72:49 Okay. And that's why all of these questions that we put in place are meant to give you a mindset that says: am I asking the right question? Am I falling for Simpson's paradox? Am I missing the full picture? Am I understanding what it means to talk about the statistical summaries?

73:15 And that's the end of my talk. As per usual, the slides were manufactured by processes that process words — a word processor — so they may contain typos, mistakes, and omissions. If you find any, let me know. But other than that, I hope that it was clear, and I hope that when you review it, it will be even clearer and it will be helpful with your research.

73:46 Thank you again for the opportunity. — Thank you. We really appreciate you spending this time with us. I'm going to ask the students in attendance if they have any last question — maybe we have time for one question if they want to. And if not, here is my email; I'll be happy to follow up, because I know that we ran over. Yeah. All right. I'm gonna stop recording now.
