Statistics and Study Design
ASSH_CRC17_S08_01_405_425_Calfee
Video Transcription
So without further ado, I want to introduce you to Ryan Calfee, who will give the next two talks. All right, thanks. Okay, guys, we're gonna go strong here and finish, and we'll be a little ahead of time, which will be great when everybody wants to get out of here at the end. So, as you can tell, I definitely drew the short straw here on topics. We got CRPS, now we've got statistics and study design. For my talk today, I took out a few of the slides that are in your handout. They're there, you can read them. I've tried to condense this to make it a little less of a run-through of terminology for you. We're not gonna get into any advanced statistics. I just want to give you enough information that on a standardized test, the things they may ask, you're gonna be able to nail them, not miss those couple of points, and move on.
So, we have to go over study designs, because at least a couple of years ago on the self-assessment, this was extremely popular, kind of to my surprise. When we think about our little pyramid of studies, we always know we've got these randomized controlled trials at the top, but I'm gonna encourage you in life to pick the best study design for your question, not just the top of the pyramid. So, what is a randomized trial? The key thing, when we talk about study types, is to think about where the study starts, where we are in time, what's gonna happen because of the study, and what we're looking at at the end. In a randomized trial, we start at time zero, we take everybody, put them in treatment A or treatment B, and we look at their outcome later. Because of that design, it's very clean; you get exactly the comparison you want, because the only difference is that one group got intervention A and the other got intervention B. The bad part: it takes money, it takes a lot of time, and you have to follow these people forward in time. I want you to understand that the whole benefit of randomization, when a randomized controlled trial is done, is to balance the unmeasured confounders. That's what you're doing. By enrolling 100 people on both sides and randomizing, you're balancing the unmeasured confounders, and that's the big advantage of an RCT.
Now, a prospective cohort is gonna look very similar. We start at time zero, something happens to people, treatment A or treatment B, and you look at them at the end. The only difference is that this is an observational trial. The trial is not assigning people to treatment A or treatment B. It's just happening, and we're just following them to see what happens. So maybe surgeon A likes procedure A, and surgeon B likes procedure B, and we're just gonna document the outcomes. There's no sort of randomization. All we're doing is watching, observing, and reporting data. So you can imagine that a prospective cohort trial has the same disadvantages as a randomized trial in that it takes a long time and you have to follow people over time. Actually, the data may generalize better, say to your practice, because you don't usually have 10,000 things listed as inclusion and exclusion criteria. It's just sort of two broad groups. The problem is you may get some bias. Maybe somebody's population is a little younger. Maybe the people offered the other surgery are a little stronger or a little bit more something in one direction. So a little more bias, but this is a nice opportunity to compare practices when people don't want to be randomized.
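To make that randomization point concrete, here is a minimal sketch (a hypothetical Python simulation, not from the talk): randomly assigning a large cohort tends to balance an unmeasured confounder, here smoking status, across the two arms, which an observational cohort cannot guarantee.

```python
# Hypothetical illustration: randomization balances an unmeasured confounder.
import random

random.seed(0)
n = 200  # assumed number of enrolled patients

# Each patient carries an unmeasured confounder, e.g. smoking status (~30% prevalence).
patients = [{"smoker": random.random() < 0.30} for _ in range(n)]

# Randomly assign treatment A or B, independent of the confounder.
for p in patients:
    p["treatment"] = random.choice(["A", "B"])

def smoker_rate(arm):
    group = [p for p in patients if p["treatment"] == arm]
    return sum(p["smoker"] for p in group) / len(group)

# With enough patients, the smoking rates in the two arms come out nearly equal,
# even though nobody measured or stratified on smoking.
print(f"Smokers in arm A: {smoker_rate('A'):.2%}")
print(f"Smokers in arm B: {smoker_rate('B'):.2%}")
```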
Okay, now this is distinctly different from a case-control trial. In a case-control trial, we're starting at time zero over on the right. We've got our study people, and we're defining two groups. Group one, say, has a disease, and group two is the control, or group one had a certain treatment and group two didn't, and then you look back in time to see what kind of risk factors have happened. So say we have a case-control study and we want to look for risk factors for something rare, like a glomus tumor. We take 20 people with glomus tumors, because we don't have to have 1,000 of them, and we take a whole bunch of people with ganglion cysts because they're a good set of controls, and we look back in time to see something that may have led to one or the other. Maybe you find out that, hey, a lot of folks with glomus tumors have been diagnosed with neurofibromatosis in the past and the other group hasn't. So this is something where you look back in time. Case-control: you're grouping people based on a disease or not, and looking back in time. So here's the advantage. It's quick, because all the data's retrospective. A case-control trial is a great study design, as you can expect, for something that's a rare outcome. You don't want to wait for that rare person to develop this thing; you need to pick out the people that already have it. And it's great for something that may happen a long time after the first event, because you're not having to follow people prospectively. What's the downside? Well, all the information you get is either based on patient recall or on what's in the record. Those things are imperfect, but they may be available. So case-control: it's always retrospective, you're always looking backwards, and you're grouping people based on disease status.
Okay, cross-sectional study. This study doesn't go forwards in time or backwards in time. It's just one snapshot. Everybody that comes into the office: how many people have an ulnar nerve that snaps? That's cross-sectional. Everybody in this room, if we ultrasounded everybody: how many people have an asymptomatic rotator cuff tear? There's no follow-up, there's no looking back. It's just, who has it? So those epidemiology studies talking about the prevalence of a disease, that's a cross-sectional study. That's what you wanna do for that. The way you pick it out on the test: there's no follow-up, there's no looking backwards, one snapshot in time.
All right, away from study types. A little bit on power and a few things about statistics. Any time you read a study, any scientific study, you're testing a hypothesis. Technically, we are always testing the null hypothesis, so you're testing that there's no difference between the groups. And if you find a, quote, significant difference, you say, no, no, no, these groups are different, for whatever reason. Otherwise, if there is no difference, you can't say that the null hypothesis is true; you can just say that you could not disprove it. Okay, so where do we decide that line in the sand to say something is significantly different? It's the alpha level. That's what the person sets at the start of their experiment, and by routine, for us in hand surgery, orthopedics, plastics, it's almost always .05. So then we say if our p-value falls below that, hey, it's significant; if it doesn't, it's not significant, it was chance. What we're saying is that if you hit .05 exactly, there's only a 5% chance that the difference you're seeing is due to random error or chance alone. And usually, once you get below that threshold, we say, oh, yep, there's truly a difference between these groups. So that's alpha. That's your line in the sand for where something's significant or not. The p-value is just the value you get when you do your statistical test and figure out, hey, how likely could this difference have been due to chance alone? And again, usually, if it's smaller than .05, we all say, yes, this is significant. If not, eh, we didn't find a difference.
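As a rough illustration (hypothetical scores, not data from the talk), here is how a p-value from a simple two-group comparison gets checked against a pre-chosen alpha of .05, using SciPy:

```python
# Minimal sketch (assumed data): test the null hypothesis of "no difference"
# between two groups and compare the p-value to a pre-set alpha of 0.05.
from scipy import stats

alpha = 0.05  # the line in the sand, chosen before the experiment

# Hypothetical outcome scores for treatment A and treatment B.
group_a = [22, 25, 19, 30, 27, 24, 21, 26]
group_b = [31, 35, 28, 33, 30, 36, 29, 34]

t_stat, p_value = stats.ttest_ind(group_a, group_b)

if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis (significant difference)")
else:
    print(f"p = {p_value:.4f} >= {alpha}: cannot reject the null hypothesis")
```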
Okay, now you have to know about type one and type two error. Again, easy test questions. Type one error, I apologize for the memorization, I don't have a mnemonic here: it's just a false positive. You rejected the null hypothesis. You said, yes, treatment A gives a superior outcome to treatment B, but in fact the truth is probably they're the same; your study just falsely found a difference. Type one error, false positive. Type two error is: hey, there really is a difference in real life, but your study couldn't detect it. That's usually the problem with underpowered studies. So if you get a test question where they did a small study and couldn't find a difference, the answer is either that the study is underpowered, or something like they are at risk for a type two error, which is not detecting the difference statistically even though it really exists. You've got too few people in your study. Type two errors, underpowered studies.
Okay, confidence intervals. This is something good to know for life, for reading articles, in addition to the test. When people give you a value, say they estimate the chance of complications at 2%, and then give you a parenthesis with 95% CI and two numbers, that interval reflects the precision of the estimate. If you were to repeat that study 100 times, the mean value you estimate should fall within that range 95 times out of 100, okay? So it's a nice supplement to the p-value, which just tells you, ooh, significant or not. When you get the confidence intervals on certain estimates, it really lets you know, hey, is this a good estimate I can take home, really remember, and take to practice or not? There's a formula here; you'll never need to know it, I just put it on so you could see it. But here's the example I wanted to mention to you. Say the first thing I list here is a study that enrolled 20 people looking at some risk of a complication with a volar plate, and they said, oh, 2% of the people had a complication, but their confidence interval is zero to 50%. They've got so few people in the study, you have no idea what the real chance of that complication is. On the other hand, the next study you read the next month comes out and says, yep, 2% also, but they had 1,000 patients in their study and their confidence interval is zero to 5%. You can take that number back to clinic and say, you know what, based on the best available evidence, the chance of X complication is probably less than 5%, and you can feel good about it. So it really helps you interpret people's values when they publish things.
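A minimal sketch of that volar-plate idea, with hypothetical counts chosen to give roughly a 2% rate: the same point estimate carries a wide confidence interval with few patients and a narrow one with many. This uses an exact (Clopper-Pearson) interval via SciPy's beta distribution; the specific counts are assumptions for illustration.

```python
# Minimal sketch (hypothetical counts): the same ~2% complication rate gives a very
# different 95% confidence interval depending on how many patients were studied.
from scipy.stats import beta

def exact_ci(k, n, alpha=0.05):
    """Clopper-Pearson (exact) confidence interval for k events in n patients."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

for k, n in [(1, 50), (20, 1000)]:  # both are a 2% observed rate
    lo, hi = exact_ci(k, n)
    print(f"{k}/{n} complications = {k/n:.1%}, 95% CI {lo:.1%} to {hi:.1%}")
```

The small study leaves the plausible range wide open; the large study pins the estimate down tightly, which is exactly why the interval, not just the point estimate, is worth reading.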
Okay, two-by-two tables. These are like the favorite thing for epidemiologists. We're not gonna have to memorize calculations; I have not yet seen a calculation on a test in terms of having to come up with the number. But I want you guys to have the formulas in your handout and just understand how they work for a few things. Relative risk is always kind of a: did you get the disease or not, and were you in group A or group B? Like for instance here: smoker, non-smoker, did you get cancer or not? You calculate the risk of cancer in smokers versus non-smokers, and you can see the increased risk. Just so you know, odds ratios are calculated very similarly but slightly differently. If you have a test question about a case-control study, you're gonna produce an odds ratio. That just has to do with the fact that you're artificially picking how many controls and how many cases, so you're sort of setting the prevalence of the disease yourself. And then also realize, because this is an easy test question, that if you're looking at the odds of an event, as long as it's a rare event, the odds ratio and the relative risk end up about the same. On the slide I've noted that whether you call it rare at less than 5% or less than 10%, the odds ratio is very close to the relative risk if the event is unlikely. Okay.
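Here is a small sketch (made-up 2x2 counts, not from the slide) showing relative risk and odds ratio computed from the same table, and how they converge when the outcome is rare:

```python
# Minimal sketch (hypothetical 2x2 table):
#                 disease    no disease
# exposed            a            b
# unexposed          c            d

def relative_risk(a, b, c, d):
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# Common outcome: RR and OR diverge.
print(relative_risk(40, 60, 20, 80), odds_ratio(40, 60, 20, 80))   # RR 2.0, OR ~2.67

# Rare outcome (disease in well under 5% of each group): RR and OR nearly match.
print(relative_risk(4, 996, 2, 998), odds_ratio(4, 996, 2, 998))   # RR 2.0, OR ~2.00
```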
Okay, a little bit on types of data. There are lots of different types of data, and it matters: if you're doing research, it determines what test you use to analyze your data, and if you're taking a test, you're gonna have to pick the right kind of test for that data. So there's continuous data. Those are data that can fall anywhere along the spectrum: how old are you, what's your blood pressure? And then there's categorical data: how many times have you been pregnant, how many cars do you own? Something like that where there are just specific integers and nothing in between. Now when you look at continuous data, in a perfect world it's normally distributed, and you've got this normal curve. If you ever get asked how much of the population falls under the curve at one or two standard deviations, the numbers to remember, as you can see on the curve here, are that 68% of your population is within plus or minus one standard deviation of the mean and 95% is within two standard deviations, okay? Just some easy numbers that every once in a while show up on a test. Okay, now your data can be skewed. It's hard to remember back to those days of doing math, but if it's skewed to the left, like I show in A here, you've got a lot of low numbers. That's gonna drag your mean value down if you took all those values and calculated a mean. The median doesn't move as much. Remember, the mean is the average; the median is the number in the middle, with an equal number of observations above and below; and the mode is the most frequent observation. Just remember that skewed data pulls the mean further than it pulls the median, which is why when you have skewed data, we usually ask for what are called nonparametric statistics instead of parametric statistics that look at the mean values, okay?
When you have categorical data, like we talked about, just certain categories of things, it can be nominal, meaning there's no order. What ethnicity are you? What's your occupation? You're a doctor, you're a lawyer. There's no order to that. Or it can be dichotomous, male or female. Or it can be ordinal, where there is an order but it's not quite one, two, three, four. People like to publish satisfaction scales: very satisfied, less satisfied. Or pain: mild, moderate, severe. We don't know that going from mild to moderate pain is the same size step as going from moderate to severe pain; it may be different for different people. So you don't have a clean numeric change between the levels.
So when you look at how to analyze these things, a couple of things here. If you've got dichotomous data on both sides, so you can basically make a two-by-two table, you can do a chi-square test. If you've got two groups, male and female, and you want to look at a continuous measure like average test score, you do a t-test. Or if it's two continuous things, say age on one side and blood pressure on the other, you can't really define two groups, so you do a correlation coefficient. All right, and then again, just to emphasize this importance of classifying data. Two groups, just remember this one because it's an easy one for a test: two groups, say male and female surgeons, and you look at their average test score. Those are continuous means. Two groups comparing a mean score, do a t-test. Two groups, it's a t: two groups, t-test. If you have three or more groups, say instead of men and women we look at people on the East Coast, people in the Midwest, and people on the West Coast, and we wanna compare their average hand certification score, we've got three groups to compare on a mean value. As soon as you have more than two, you move to an ANOVA test. So if you add another group, you do an ANOVA test instead of the t-test. T-test, two groups; ANOVA, three or more. And then if you have, as I was just mentioning, ordinal data, those scales of pain or satisfaction, they're really not one, two, three, four; they're just ordered scales. You really wanna do chi-square testing and not take a mean. You can't take an average pain scale score when one was mild, two was moderate, and three was severe. You wanna do chi-square testing for that. And then finally, in papers, if you read about anything with regression modeling, basically they're telling you they've done statistical testing where you can put multiple factors in together and figure out which ones are still associated with or predict the outcome.
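A quick sketch of matching the test to the data, with entirely hypothetical numbers: chi-square for a 2x2 table of categorical counts, ANOVA for comparing a continuous mean across three groups (two groups would be a t-test, as above), and a correlation coefficient for two continuous variables, using SciPy:

```python
# Minimal sketch (hypothetical data): picking the test to match the data type.
from scipy import stats

# Two dichotomous variables (a 2x2 table of counts): chi-square test.
table = [[30, 10],   # e.g. group A: outcome yes / no
         [20, 25]]   #      group B: outcome yes / no
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Three or more groups compared on a continuous mean: ANOVA.
east, midwest, west = [82, 85, 88, 79], [75, 80, 77, 83], [90, 86, 84, 88]
f_stat, p_anova = stats.f_oneway(east, midwest, west)

# Two continuous variables with no natural groups: correlation coefficient.
age = [35, 42, 50, 58, 63, 71]
blood_pressure = [118, 121, 130, 135, 141, 150]
r, p_corr = stats.pearsonr(age, blood_pressure)

print(f"chi-square p={p_chi:.3f}, ANOVA p={p_anova:.3f}, correlation r={r:.2f} (p={p_corr:.3f})")
```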
Okay, and then very briefly on test performance: sensitivity and specificity. Sensitivity is the proportion of diseased people correctly identified: how many diseased people can you pick out with your test? Specificity is how many true negatives you're really picking out. I've got a little thing here again on that two-by-two table. The thing I want you to appreciate is that both sensitivity and specificity are calculated vertically. To get the sensitivity, you look at the people with the disease, the disease-positive column: it's just how many people had a positive test result out of how many people actually had the disease. For the specificity, it's how many people in the disease-negative group had a negative test, divided by how many people overall did not have the disease. So the vertical calculations don't get affected by the prevalence of the disease. This is purely a performance metric of whatever test you're using, okay? The sensitivity and specificity are tied directly to your test, not to how many people have the disease.
You wanna use sensitive tests for screening. So if they ask, a highly sensitive test, what's it good for? It's good for a screening test, okay? If you have a highly specific test, it's good as a confirmatory test. So if someone comes in to be screened for a disease, you want a really sensitive test to start with. You might get some false positives, but then you do your confirmatory test afterwards to really make sure. So sensitive test, good for screening; specific test, better for confirmation testing, okay?
And then you'll always read papers where they give you sensitivity and specificity and then wanna report positive and negative predictive values. That's because in clinic we wanna know: hey, the test has a 90% sensitivity, I ordered it, what are the chances now that that positive ANA means the patient actually has lupus, or something like that? These are calculated horizontally. You don't necessarily need to know how to calculate them, but it's horizontal, the same sort of math we've just been doing. In this case, though, the prevalence of the disease matters a lot, okay? So just to give you a couple of examples, and I've skipped a few of these that are in your handout: if we start with a theoretical two-by-two table with a disease prevalence of 75%, so 75% of our population has the disease, and our test has a sensitivity of 93% and a specificity of 88%, that's pretty good, then our positive predictive value in this group is 95%. That's fantastic. What if our test has the exact same sensitivity and specificity, but only half the people have the disease to start with? Now our positive predictive value is down to 88%. What if only 5% of the people have the disease? Same test, positive predictive value down to 29%. So just realize that positive and negative predictive values really hinge on the prevalence of the disease, okay?
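A minimal sketch using the 93% sensitivity and 88% specificity figures from the talk: hold the test fixed and recompute positive and negative predictive values at 75%, 50%, and 5% prevalence. This is a direct formula calculation, so the rounding may differ slightly from the slide's table.

```python
# Minimal sketch: PPV and NPV rise and fall with prevalence while sensitivity and
# specificity stay fixed (the 93% / 88% figures quoted in the talk).
sensitivity, specificity = 0.93, 0.88

def ppv(prevalence):
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def npv(prevalence):
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

for prev in (0.75, 0.50, 0.05):
    print(f"prevalence {prev:.0%}: PPV {ppv(prev):.0%}, NPV {npv(prev):.0%}")
# The same test looks excellent when 75% of patients have the disease and poor
# when only 5% do; the test's own performance never changed.
```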
And then finally, two words to give you here for patient-reported outcomes, because they're really everywhere. One is validity. Some of the tests have asked in the past, what is validity? It's the ability of an instrument to measure what it's supposed to be measuring, okay? Usually you establish validity by comparing a test against a reference standard of some sort. And then there's reliability: if you do the same test over and over, do you get the same answer? It doesn't mean it's the right answer, it just means it's the same answer. So if you look at our little scatter plots, the little targets we've got here: on the top left, if you look at all those shots, they're pretty scattered, and if you take the average, it's not in the middle. So the top left is unreliable and not valid either. On the top right, you can see shots all over the place, so it's not really reliable, but if you took the average, you're right in the middle; you're not biased one way or the other, so it's valid. The bottom left is reliable: it's really tight, you get the same answer every time, but it's not valid. Say that scale is off by two pounds. Everybody's two pounds heavier than they really are, but it gives you the same answer every time. And then finally, reliable and valid is all the shots close together in the middle. Okay. And I'm almost out of time, but my next talk is fast, so I'll show you the questions here. Whoops. I gave you the answer too fast. Sorry about that.
All right, very briefly. In 2015, for whatever reason, the self-assessment had a whole bunch of these questions, and then they hardly had any this year, so who knows how it'll be on the test. For this one, they said a retrospective study comparing satisfaction for people undergoing either ulnar nerve decompression versus in situ decompression is what kind of evidence? I didn't really cover levels of evidence in my talk, but level one is basically a randomized controlled trial. Level two is usually a prospective comparative trial. Anything retrospective but comparative is usually level three. Level four is case series, and what distinguishes a case series from, say, a cohort trial or a case-control is that there's no control group. It's just, I've got 15 patients with one treatment, here's how they did. And level five is usually expert opinion. Okay, here's one, also from 2015. A prospective outcome study comparing two treatment groups found no difference, but they really found no statistical difference because they had a small sample size. Why did they not find a difference? What error type? Two, you guys wanna say it, okay. Type two. And then here was another one that came the same year. Which of the following refers to the ability of an outcome measure to measure something the same way twice? As I just told you guys, reliability: same answer every time. And then this question was on the 2015 test. I don't think it's a good question, but this is what they wrote. A researcher is looking at people with carpal tunnel syndrome and wants to evaluate the treatment; what is the type of study? Well, they didn't tell you if there are two treatments or one treatment. If it's one treatment, it's a case series. They don't mention going retrospectively based on some disease, so it's not a case-control. Cohort study is what they said the correct answer was, but technically, to be a cohort trial, you should be looking at two groups. You're observing them, but you really should have two groups. And it's not cross-sectional, because they wanna look over time to evaluate the treatment rendered. Okay, that's it for that talk.
Video Summary
The video transcript is a presentation by Ryan Calfee on different study designs and statistical concepts. He begins by discussing study designs such as randomized trials, prospective cohorts, case-control trials, and cross-sectional studies, explaining their advantages and disadvantages. Calfee then moves on to discuss statistical concepts, including power, p-values, type 1 and type 2 errors, confidence intervals, and types of data. He also explains the importance of sensitivity, specificity, positive and negative predictive values, and validity and reliability of tests. Some test questions and answers are provided at the end of the presentation.
Keywords
study designs
statistical concepts
randomized trials
prospective cohorts
case-control trials
cross-sectional studies