2013 - What do these statistical tests mean? How to evaluate the clinical research studies you read. - Brent Graham MD MSc FRCSC
Video Transcription
this particular session because there's so few of us, but I really appreciate you coming, and we're in a suitably obscure far reach of the convention center, not quite as far away as general hand surgery, but they're next door, but we're in a pretty far-off area. My name is Brent Graham. I've been the research director at the organization for the past three years, and I don't have anything to disclose other than to say that I'm one of three deputy editors for methodology at the Journal of Bone and Joint Surgery, and that may have some ramifications for you, because this is supposed to be for readers, and that's what we're going to be talking about. But if you are an author and you submit your paper to the Journal of Bone and Joint Surgery, what happens is that the content editors look at it and decide whether or not they think it has merit as a publication, and if they do think it has merit as a publication that's likely to be published, then all of those manuscripts come to me or one of my two colleagues, one of whom is a rheumatologist, Jeff Katz, and one is an epidemiologist, Elena Losina, and one of the three of us will then go over it for the methodologic content and make suggestions to the authors and the deputy editor for that area so that the manuscript can be as good as it can be. So a lot of what I'm talking about today would be of special interest to you if you are an author, but it should also be of interest to you as a reader. However, I will say that if what you're reading comes from the Journal of Bone and Joint Surgery, much of this information hopefully will have been addressed for you, so that you can just take in the content of the article and not necessarily worry about the meaning of this statistic or that statistic. The Journal of Hand Surgery, I don't think, is there just yet; they don't have a strong methodologic analysis part in the review process. The other thing I'll say right at the outset is that there will be no math. There are no formulas. I'm not going to try to teach you anything, because even though I'm a fully trained epidemiologist, I would never even consider doing anything more complicated than calculating the mean or maybe doing a t-test if I could. Most of the time we're going to have a statistician involved. So we're not here to talk about how to do any tests. We're here just to talk about what the tests mean. So when you're a consumer of medical information in a journal, I think you should have a high level of skepticism no matter where you're reading it, even if it's JAMA or the New England Journal, even if it's a very high-profile general medical journal. I think you need to have a healthy dose of skepticism. You should be asking yourself some questions about the paper you're reading, particularly what the limitations of the study design are. We're going to talk a little bit about that today, because all study designs have limitations. I don't care if it's a National Cancer Institute multi-center randomized trial; there will be problems, and we're going to go over what some of those potential problems can be. I think people get very baffled by the idea of power, and I want to try to demystify that a little bit today, so we'll be talking about that. And then you should also be asking yourself what the potential sources of bias in the study are and how they impact the results that the authors are trying to get across. And part of that is knowing whether or not the appropriate tests have been used.
And like I said, that's actually a pretty easy thing to analyze for yourself as a reader. Okay, obviously the reason that this is important is so that you can make a decision whether the results, or the conclusions, of the study actually are meaningful, or maybe, let's get more basic, decide whether or not they're true. A basic dictum if you are an editor or even a reviewer when you're looking at a manuscript is to answer three questions. Is the information new? Is it true, which frequently is not the case? And if it is new and it is true, does it matter? So these are the things that we're thinking about when we look at manuscripts. And deciding whether or not it's true is actually, to me, one of the most fundamental things, because very frequently the authors of course believe that their conclusions are true, but I'm usually a lot more skeptical, a lot less convinced of that. So you want to know what the limitations of the research design are so that you can decide whether or not the conclusions are consistent with that design. Okay, so as I said at the outset, all research designs have limitations, all of them. I guarantee that the perfect study has not been devised yet, or reported for sure. For the lower levels of evidence, the case reports, the case series, the cross-sectional studies, the problem is that the sample itself may be biased, and that's probably apparent to you. I mean, we know that case series are relatively low levels of evidence. Why are they low levels of evidence? Many reasons, but it starts with the sample. The sample that is being evaluated by the authors may really look nothing like the population that it's supposed to represent. And when you have a small group of patients that you're reporting, of course, the probability that that is the case, that it's not representative, is very, very high. We go up the evidence scale a little bit to the case control study, which is usually considered a level three study. The problem there is in the control group. So a case control study starts with the outcomes that you have. You may have a certain outcome, a certain complication or a certain bad result, that you want to compare to a control group. So what you do is match for all of the factors that may be confounding. If you have those equally distributed in your group of cases and your group of controls, perhaps you can make some conclusions. Of course, the cases are going to stay the same, but in who you choose to compare with, the control side, there's a lot of room for bias. And as a result, in a case control study, the bias exists in the selection of the control group. And then with randomized trials, the problem there is the generalizability, because the patients that have been admitted to the study or that are being analyzed may look nothing like patients in general. Most of these studies are done in large tertiary medical centers, and their patient population may look nothing like the patient population coming to your suburban practice or even a rural practice or a smaller city. So we need to understand whether or not the results are actually generalizable.
I'm not really going to go into much of this because I really want to talk about some of the more arcane aspects of the statistical analysis, but I just want to alert you to the fact that just because it says it's a randomized controlled trial does not mean in any way, shape or form that the data are more reliable or more valid than a lower level of evidence study, until it's been proven by reading through it that the methods were actually satisfactory. Another important point that I think you need to understand, because this is a very, very common mistake made by authors, I see it all the time even at the meeting here today, is this whole idea of cause and effect. Most research designs, virtually all research designs you're going to see at this meeting today or that you're going to read about in the Journal of Hand Surgery, can only demonstrate an association. So if you have A over here and B over here and they seem to be associated, possible conclusions there are that A caused B, or in fact that B caused A, or even that C caused both A and B. So the fact that there's an association does not necessarily imply any type of causal relationship there. Cause and effect can only be inferred, and often not with very much confidence. There are a number of things that go into the whole cause and effect relationship, including biologic plausibility. So you may remember a few years ago that there was a lot of excitement over the fact that heavy coffee drinkers, people that consume a lot of coffee, seemed to be at greater risk for pancreatic cancer. The biologic linkage there between taking coffee and causing a cancer in your pancreas seems tenuous at best. There are lots of instances where seemingly random events are being associated with a certain medical outcome, but the biologic plausibility linking those two things is completely obscure. So in order to identify a cause and effect relationship, it's very important to at least have a concept of a biologic plausibility. Also, if there's a dose-effect relationship, that is very helpful too. So if you have an exposure of this amount, you should have a result of that amount, and if you have a bigger exposure, you should have a bigger result. So for people that believe that cashiers in grocery stores are more likely to get carpal tunnel syndrome, I guess the people that work 40 hours a week should have a greater prevalence of carpal tunnel syndrome than the people that work four hours a week, if we can put it in those terms. People that work a lot of overtime should have a lot more carpal tunnel syndrome. So unless you have that type of dose-effect phenomenon at play, that also weakens the whole cause and effect argument. And then of course a temporal relationship is important. So to take the example of the grocery store checker and carpal tunnel syndrome, they should have their symptoms when they're working, and then when they are on vacation for a few weeks, maybe they won't have their symptoms, and they go back to work and they get their symptoms again. These are all aspects of the cause and effect relationship that have to be in play before we can give that any credence. So again, this is the kind of thing that often comes out in clinical papers: a certain observation has been made, and then that's linked to a given outcome, and it's implied that there's a cause and effect relationship there, which of course is very difficult to demonstrate.
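A minimal simulation of that third possibility, using entirely made-up variables rather than anything from the talk: a hidden factor C drives both A and B, and A and B end up strongly correlated even though neither one causes the other.

```python
import numpy as np

# Toy illustration: C drives both A and B, so A and B look associated
# even though there is no causal link between them.
rng = np.random.default_rng(0)
n = 5000
c = rng.normal(size=n)                        # hidden confounder
a = 0.8 * c + rng.normal(scale=0.6, size=n)   # A depends on C only
b = 0.8 * c + rng.normal(scale=0.6, size=n)   # B depends on C only

print(f"correlation(A, B) = {np.corrcoef(a, b)[0, 1]:.2f}")
# Prints a strong correlation (around 0.6) despite no causal link between
# A and B; association alone cannot establish cause and effect.
```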
And for those of you that may still be around at the very end of the meeting this afternoon, I think it's 1:30 or something, there is a symposium on levels of evidence, and I've been given the responsibility of talking about level four evidence, and we'll go into this in a lot more detail. Okay, so really cause and effect can only be determined in a truly prospective study, where everybody that enters into the study is then either exposed to the factor of interest or not exposed to it, and then through time we find out what happened. That's the only way to clearly imply a cause and effect relationship, and sometimes that's not even possible. Okay, so let's get down now to some of the number stuff, about which there is sometimes a lot of consternation and confusion, and I might be overstating it, maybe there's no confusion at all. And because we're a small group here, at any moment if somebody would like to ask a question, feel free to do so. The first thing I want to talk about is this whole idea of type one error. We're all very familiar with the phrase P less than 0.05, or as we frequently see in the Journal of Hand Surgery, P less than 0.00000001, which is ludicrous, but we'll leave that for now. What does it mean? It means that if we observe that there's a difference between group A and group B, and we've done an appropriate statistical test, and it shows that that difference is statistically significant at the P less than 0.05 level, there is a 5% risk that that difference has occurred only by chance. So out of every 20 papers you would read like that, that would give a conclusion that said the difference was statistically significant, one of those would be wrong, because there's a 5% risk that that observed difference is actually just there spuriously, by chance, and actually there is no difference between the two. So we accept that being 95% sure that the difference is real is good enough. That's what we accept, and that's what's implicit in that statement. So even in the most seemingly methodologically rigorous studies there are, showing a difference between the two groups of interest at a P level of 0.05, there's still a 5% risk that that difference does not actually exist. But we accept that as a reasonable level of reality, an adequate level of validity. And I think most people here probably knew that before I said any of that. What's a little less clear, and looks completely unclear to the vast majority of authors that submit to the Journal of Bone and Joint Surgery, is this idea of type 2 error. This is distinct from the idea I just talked about. This is the probability that an observed difference that is not statistically significant has failed to reach significance because of inadequate or biased sampling. So again we have group A and group B, and we find that there's a bit of a difference between the two of them, but it is not statistically significant. Does that mean that they're the same? Not necessarily. In fact we should not conclude that they're the same unless the power is adequate. In other words, we should only conclude that the failure to show a difference between the two groups means that they're the same if the risk of that kind of error is no more than 20%. So for type 1 error we accept a 5% risk that our conclusion that the two things are different is wrong; we accept that 5% risk as adequate. For this type of analysis we accept up to 20%.
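A minimal simulation of both kinds of error, with made-up numbers rather than anything from the talk: when the two populations are truly identical, about 5% of t-tests come out "significant" by chance alone (type 1 error), and when a real difference exists but the samples are small, a large share of t-tests miss it (type 2 error).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
trials = 2000

# Type 1 error: both groups come from the SAME population, so every
# "significant" t-test is a false positive. Expect roughly 5% at alpha = 0.05.
false_pos = sum(
    stats.ttest_ind(rng.normal(50, 10, 30), rng.normal(50, 10, 30)).pvalue < 0.05
    for _ in range(trials)
)

# Type 2 error: the populations really DO differ (means 50 vs 55), but with
# only 30 patients per group many t-tests fail to detect it.
misses = sum(
    stats.ttest_ind(rng.normal(50, 10, 30), rng.normal(55, 10, 30)).pvalue >= 0.05
    for _ in range(trials)
)

print(f"type 1 error rate (no real difference): {false_pos / trials:.2f}")   # ~0.05
print(f"type 2 error rate (real difference missed): {misses / trials:.2f}")
# The second number comes out well above the 20% ceiling described in the
# talk, i.e., a comparison like this would be considered underpowered.
```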
So in other words, when we see that the two groups are not different and we conclude that in fact they're not different, that they are exactly the same or very, very similar, it may instead be that they only appear not to be different because of inadequate sampling or some other type of problem, and we accept a risk of that erroneous conclusion of up to 20%. Does that seem reasonably understandable? So I'm summarizing that here. If the probability of that type of error is greater than 20%, then a statistical analysis is not allowed. And again, I can't really speak for the Journal of Hand Surgery, but you will not be allowed to make an underpowered statistical comparison in the Journal of Bone and Joint Surgery. And the reason is that a failure to show a statistically significant difference may simply be because of inadequate sampling. So you have to have an adequate sample, an adequately powered sample, in order to make a statistical comparison and meet this threshold of 20% potential error. So when you're reading papers and you see that two things do not appear to have been statistically compared, it may be because the editors have not allowed the authors to make that comparison, simply because the size of the study is too small to allow that type of comparison to be made. When that's the situation, all that you can do as a reader is just look at the two groups and make a common sense decision for yourself as to whether or not they seem comparable, and in many instances they won't. One of my favorite issues around this is when you see the two groups being compared for demographic characteristics like the average age, the gender distribution and some of the other factors that may be at play there. And after that you'll see another column that says P is greater than 0.05, greater than 0.05, greater than 0.05, or P is equal to 0.98 or something. The idea is that because they haven't shown a difference between the two groups, then they must be the same. And that is an erroneous conclusion that, again, you're not allowed to make in a journal that has a strong methodologic background. The key is that the absence of a difference does not mean that the groups are the same. The absence of a difference does not mean that they're the same. It may mean that they're the same, but frequently it just means that we had inadequate sampling. So this is a point I just made. The demographic characteristics of the groups that are being compared should not be compared with statistics unless the power is adequate. So let me give you an example. Say one group has had an operative procedure and the other one has had an alternative operative procedure, and you want to show that the two groups are the same so you can compare the operative procedures. And the average age in the one group is 35 and in the other group is 47, let's say. And they show that P comparing the ages is greater than 0.05, so they assume that the group with the average age of 47 is equivalent to the group with the average age of 35. When in reality, if there are like 30 people in this group and 45 people in the other group, that may not be enough to actually show that the difference of 47 and 35 is statistically significant. If you added another 10 people to each group, the difference between the average age of 35 and the average age of 47 might actually be statistically significant.
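A quick sketch of that power argument, assuming, purely for illustration, a standard deviation of about 30 years for age (the talk does not give one):

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical numbers from the example: mean ages 35 vs 47, n = 30 vs 45.
# The 30-year standard deviation is an assumption for illustration only.
effect_size = (47 - 35) / 30            # Cohen's d of about 0.4
analysis = TTestIndPower()

power = analysis.power(effect_size=effect_size, nobs1=30, ratio=45 / 30,
                       alpha=0.05, alternative='two-sided')
print(f"power to detect the age difference: {power:.2f}")    # well below 0.80

n_needed = analysis.solve_power(effect_size=effect_size, power=0.80,
                                alpha=0.05, ratio=1.0)
print(f"patients needed per group for 80% power: {n_needed:.0f}")  # roughly 100
```

Under those assumptions the comparison is nowhere near the 80% power threshold, so a non-significant p-value for age says very little about whether the groups really are comparable.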
So because the power to compare the age, or to compare the gender distribution, or to compare some other comorbid condition in the two groups is inadequate, we cannot make that comparison, and it should not be allowed in any publication you read. So if it is like that, just ignore that third column showing the P values and make your own decision as to whether or not you think the groups that are being compared actually are comparable for the various factors. Okay, so what do we take away from all this? Because I've just told you that all the studies are biased. What we have to do is try to understand what the nature of that bias is and how it affects the conclusion. As I've mentioned, bias is introduced principally in the composition of the study sample, because people tend to use a sample of convenience. They use the patients that are in front of them. Those are the ones that they study. But there's no reason to think that that's representative of the population at large. I mean, where I work, I work in a tertiary care center in the middle of a gigantic 5 million person city, and the people that come to see me don't look anything like the patients that go to a plastic surgeon who may do some hand surgery out in the suburbs. They just do not look like the same people. And so the patients in a study done at my center and a study done at that center are not going to look the same. Whatever the source of bias is, it's really the responsibility of the authors to identify what that is and interpret that for you. They should know that their sample is biased, and they should say that, as a result of this potential bias in this way or that way, that has the following impact on our conclusion. I can tell you that they never do it, so the role of the editor then is to make them do that, but that happens to a variable extent. So then it gets down to you as readers to try to have some insight into how the biases that are inherent in the study have an impact on the conclusions, regardless of the conclusions that the authors may have made. Like I said, I think it's good to have a healthy sense of skepticism about these things. What are the potential sources of bias and how should they be addressed? Okay, well, as I've mentioned repeatedly, in the case series or the cohort study, you've got a whole bunch of things here. Referral bias: the people that are coming to you may not look anything like the patients being referred to your colleague across town. Selection bias: we may only do a certain type of treatment in a certain type of person. Spectrum bias: we may only see the easy cases, or we may only see the hard cases, and so as a result, our sample's biased in that regard. Something that works very, very well on the easy cases may not work so well on the hard cases, but if that's the sample that's reported, we may get an incorrect idea as to the effectiveness of a given intervention. Then there's this whole idea of the consecutive collection of cases. I don't know how many times I heard that said yesterday in the session, but basically, and maybe it's a little over the top for me to say so, I think it's a zany idea that somehow the cases collected consecutively are any less biased than the ones that are just thrown into a study, because all of these things like selection bias and spectrum bias go into these things, so unless every single person that walks through the door with a given condition is admitted to the study, it's not consecutive.
To say that it's a consecutive study, because I did the same thing on all of these patients that I saw, is a little irrelevant if there are a lot of patients along the way that you decided not to treat because they met some of your exclusion criteria. They were too old, they were too young, they couldn't speak English, they had a more advanced type of condition, or they had some type of comorbidity, it's not consecutive, and the idea that a consecutive collection of cases actually makes the sample less biased, I find a little bit dubious to begin with. Now, having said all that, that diminishes the quality of our reports, there's no question about that, but I think we have to get used to it because in 2012, that was about two-thirds of the studies that appeared in the Journal of Hand Surgery, so of the 190 or so articles that were published in the Journal of Hand Surgery, about two-thirds of them were level four or level five, so in other words, the case series, the cohort study, or even the case report, that was two-thirds of it. One-third of it was level three, level two, or level one, and about half of those were level three. So, we don't have a lot of opportunity to read very, very high quality studies, this is usually what it is, and so, I'm cautioning you to be careful about that. Okay, I've already mentioned that in case control studies, the problem is that the controls may be biased by any of the factors that you're trying to control for, and so, selecting these controls so that we know that the distribution of these potential confounders is the same in both groups, that's why we select a control group, and the usual standard for a case control study is to actually have multiple controls for each case. So, if the paper says that it's a case control study, and they have like 15 cases, and they have 15 controls, it's low quality, it's probably not even level three, it's probably below that. If they have 15 cases and 60 controls, four controls for each case, then obviously, if you've got a lot of controls for each case, the likelihood that there's bias there goes way down, because you've got a much bigger sampling of the control group, of the control population. So, I think most methodologists would consider two controls per case in a case control would be adequate, but I'll just warn you that that's actually very rarely adhered to. Okay, and as I mentioned, in a randomized trial, the idea, the whole idea is that by randomizing, except for the thing that you're interested in, usually treatment, the idea is to distribute anything that may confound our result equally between the two groups, and that goes not only for the confounders we know about, but for the ones we may not know about. If we're truly doing a good process of randomization, the distribution of all these factors that may trip us up and screw up our conclusions should be equally distributed. Whether or not that actually takes place depends on the size of the sample, because if you have randomized 20 people to this group and 20 people to that group, the probability that those confounders have been equally distributed is not as great as if you randomized 200 people in this group and 200 people in the other group. So, small randomized trials, of which we see quite a few at the Journal of Hand Surgery, you know, 24 people here, 24 people there, they may still be heavily biased. 
Even though the idea of randomization was to minimize that bias or to blunt that effect, it may still be a very significant problem if the size of the sample is very small. You might well ask, well, how do we know if the sample's been big enough? And I'm going to get onto that in a second. I think I'm going to say it here, actually. So, what the authors should have done if they're doing a randomized trial is estimate the sample size that's required. And in the best randomized trials, that's done in the planning stages. So, this whole idea of power should not be calculated afterwards. In other words, we shouldn't do the study and then see if it was adequately powered to do the study. What we should be doing is planning beforehand to estimate the number of patients we need in both groups to answer our question, and then going forward. So, a lot of times the sample size calculation is done to identify the minimum number of people to show a difference between the two groups. But that's just the minimum, and usually it's much better if they accrue a much larger sample than that. Because with these very small samples, simply bad randomization may happen. I'm sure, well, I'm not sure, but many of you may have been to Las Vegas, and sometimes things just don't seem to be going well, because in the short term, you may just be rolling snake eyes in a row, several times in a row. There is a probability that that can occur. It's not a very high probability, but it can occur. And the same thing can happen here. You can have bad randomization because of 20 people in each group, you may simply not distribute all of the important factors on both sides. Okay, now we get down to the tests. So, what are the appropriate tests? And I don't, like I said at the outset, I don't, if I'm doing a study myself, I have a statistician help me, even though I'm very familiar with these tests, I'm not the one that really selects them, but, you know, I often think, with this whole emphasis on technology at the meeting, I've been thinking quite a bit about that. So my teenage children, you know, they don't have a clue, really, I don't think, how technology works, but they know what to do with it. So they don't really know how the search engine optimization works, and why they are able to, with a couple of keystrokes, pull down a whole litany of YouTubes for me to look at at dinner every night. I mean, they don't really know how that works, but they know what it does. And I would urge you to take the same view here. I don't think you need to know anything about how the tests work, but you need to know what they're used for, and why they're used for that. That's really what I'm trying to get across. So, it's helpful to know the assumptions for a given test. So, for example, the t-test really can only be done if the thing that's being compared is a continuous variable. So it can't be categorical, it has to be, it has to be normally distributed. In other words, let's take a good example of the visual analog scale. Okay, we're all very familiar with that for pain. And the presumption there is that, if you have a 10 centimeter scale, where zero over here is, you know, no pain, and 10 over here is the most excruciating pain ever, which, of course, most patients haven't experienced anything that's like excruciating pain, and to reference my children, for one of them, that would be falling out of a tree, and the other one would be stubbing her toe, you know, as the worst pain ever. But that's the scale that they work on. 
So the presumption there is that all the intervals along that scale are the same distance apart, so that a pain rating of 8 is exactly twice as bad as a pain rating of 4, right? Or a pain rating of 9 is three times as bad as a pain rating of 3. Well, that is crazy. That's just not true. I think all of us have seen lots of patients where, you know, they have an 8 for everything, and other very hardy people who never get above 5. The scaling there is not the same. So whether or not we can actually use a t-test to then average a score in this group, average a score in that group, make a comparison and see whether or not they're different is very dubious to me. However, that's what's always done. And I guess if you have very, very large samples, then all of this noise around how people choose their pain or describe their pain numerically may be dampened out. So under those circumstances, it may be okay. But a truly continuous, normally distributed variable would be something like weight. If we weighed everybody in this room, or everybody at this meeting, we would get a distribution of weight, and the difference between 180 pounds and 181 pounds is just the same as the difference between 125 pounds and 126 pounds. It's a continuous variable that is normally distributed. And hardly anything that we measure, whether it's grip strength or the length of something or the weight of something, hardly anything meets those criteria. And yet we use these scales all the time. So I think you should be highly suspicious of the t-test as a measure of comparing the average of this variable and the average of that variable, because it must meet these criteria of being normally distributed and continuous. What can we do if that's not the case? Well, we can use the more conservative, non-parametric tests. You've seen these tests mentioned, and they all have archaic Eastern European sounding names associated with them. But the idea is that they account for the non-continuous nature of the variables. So as a result, it's a lot harder to show a statistically significant difference between the two groups with a non-parametric test than it is with a t-test. And so what may appear to be statistically significant between the two groups with a t-test may not pan out when we use these non-parametric tests. If you read any paper where the testing process has used these somewhat unfamiliar non-parametric tests, probably the quality of that statistical analysis is better than the use of a t-test, unless the variable that's being compared with the t-test meets these criteria. The other thing that's very interesting is that, I'm sure you all read papers where there are multiple, multiple comparisons, right? So they have their two groups, and they have their outcome, and they compare variable after variable. And in every instance, the presumption is that there's a 5% risk, like I said before, of a type 1 error. But when we show that something is statistically significantly different across all those comparisons, the risk of being wrong is not 5% any more; the more comparisons you do, the bigger the overall risk of a false positive actually gets. So this is the formula that's used for that: the risk of at least one false positive is 1 minus 0.95 raised to the number of comparisons. So if x is the number of comparisons that have been done, and x is 1, then 1 minus 0.95 is 0.05, right? That's what we accept. But if the number of comparisons goes up to 3, let's say, and this exponent here is 3, I'll just tell you that the risk comes out to about 0.14, not 0.05.
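A quick check of that arithmetic, along with the simple correction described a little further on (dividing 0.05 by the number of comparisons):

```python
# Family-wise risk of at least one false positive across x independent
# comparisons, each tested at alpha = 0.05, and the corresponding
# per-comparison threshold if you simply divide 0.05 by the number of tests.
alpha = 0.05
for x in (1, 3, 20, 50):
    family_wise_risk = 1 - (1 - alpha) ** x
    corrected_threshold = alpha / x
    print(f"{x:>2} comparisons: risk of at least one false positive = "
          f"{family_wise_risk:.2f}, corrected threshold = {corrected_threshold:.4f}")
# 1 -> 0.05, 3 -> about 0.14, 20 -> about 0.64, 50 -> about 0.92
# (with a corrected threshold of 0.001 for 50 comparisons)
```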
So the risk that one of those comparisons is incorrect is actually about 14%, not 5%. And if this gets to be 20 comparisons over here, then the risk that at least one of them is a false positive is more like 0.64, not 0.05. What I'm saying is that you pay a roughly 5% cost of making an erroneous judgment with each comparison, and that accumulates as we go forward. So the comparisons cannot each be treated as if they stood alone. If they have 20 tests, 20 p-values in their gigantic table showing all the comparisons they've made, the risk for each one is not 0.05; it's related to this formula here. So what you need to know is that unless they've made an adjustment for that and lowered the threshold at which we're going to say something is statistically significant to a much, much lower level, none of the comparisons may mean what you think they mean. It may mean that the true risk of a false positive is quite a bit higher than they've indicated. The fundamental point, and most statisticians just take this for granted, is that one thing you can do is take the p-value of 0.05 and simply divide it by the number of comparisons that you're going to use. Let's say you're going to make 50 comparisons in your paper. If you divide 0.05 by 50, what you end up with is 0.001, and that should be the threshold for statistical significance, not 0.05. Yes, sir? So his question was: if you see these tables comparing the demographic characteristics, and they've gone down the column with greater than 0.05 for 10 things, are the two groups then the same? First of all, even if the power to make those comparisons for age, for gender, for occupation, for blah blah blah, for all the other stuff, even if the power to make that comparison was satisfactory, which would hardly ever be the case, the overall risk of error is not 0.05. It's 1 minus 0.95 raised to the number of comparisons, so it's much, much higher. And this is a very, very fundamental point that next to no authors get. And because they don't get it, and many editors don't get it either, that finds its way into our literature, so you'll be the one that has to make the judgment about that. And my advice would be just ignore all that stuff. Okay. Now the other thing we always hear about is modeling, right? That's very, very hip. I mean, people are constantly modeling stuff, trying to predict a given outcome. So a good example would be you have 30 cases, or let's make it a bit better, let's say you have 70 cases and there are 10 bad outcomes, and you want to predict who got those 10 bad outcomes, and so you do a regression analysis to look at all of the factors that might have screwed up your result, the age, the gender, the weight, whether they were compensation cases, whether they had previous treatment, all those kinds of things, and what you're trying to do is predict what distinguished those 10 bad results from those 60 good results. Well, sadly, it's naive. To do regression modeling, where the regression model would look like the outcome over here is either good or bad, and over here the predictors would be all those things I just talked about that are predicting whether it's a good or bad result, there have to be 8 to 10 occurrences of the outcome of interest, in this case a bad result, for every predictor in the model.
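A tiny helper that captures that rule of thumb, often described as events per variable, using the 10-events-per-predictor figure quoted here:

```python
def max_predictors(n_events: int, events_per_variable: int = 10) -> int:
    """Rule-of-thumb cap on the number of predictors for a regression model,
    given the number of occurrences of the outcome of interest."""
    return n_events // events_per_variable

# 70 patients with 10 bad outcomes: at most 1 predictor in the model.
print(max_predictors(10))    # 1
# 5,000 patients with 500 bad outcomes: up to 50 predictors.
print(max_predictors(500))   # 50
```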
So for the example I just gave, where you may have 70 patients, 60 good results, 10 poor results, and you're trying to predict those 10 poor results, your model can only contain one predictor, because the outcome of interest was a poor result and there were only 10 of those. If you have 5,000 people in your study and you have 500 bad results, it's still 10%, but you can now model up to 50 predictors, and the reason for that is that you've got a large, very rich database to look for the variables of interest in both the good results and the poor results and make a stable model. If you try to make a model on your 10 poor results and your 60 good results, you'll get a model. I mean, any statistical package will spit out results, but they will not validate in the next sample. And so again, if you have editors at the journal that you're reading who understand this, these regression models will never make it into a publication that you're reading, but where they do, you should be asking yourself, was the sample large enough to actually allow this model to be validly constructed? And the rule of thumb is that for every predictor that's in the equation trying to predict the outcome, there have to be 10 occurrences of the outcome of interest, usually a bad result. If you have your regression model where the predictors are here on your left and they're trying to predict something on the right, for every predictor that's there, there have to be 10 occurrences of the outcome of interest to allow that one predictor variable to go in there. That's a statistical truism. I mean, people way smarter than me have figured that out, and I can't tell you how that was done, but I know that that is a fundamental rule of modeling, especially logistic regression modeling. What a logistic model does is give a probability of a particular outcome. It predicts the probability of something happening, and that's what most of the models that we see in our literature are trying to do. They're trying to predict the bad outcomes over here by identifying, you know, if you have this factor and this factor and this factor and this factor, the probability of a bad outcome is whatever. But the problem is that these predictors may not have anything to do with the poor outcome if they are based on a very small sample size. So in general, you need to have 10 of those bad outcomes for every predictor you want in your model. So if there are 3 predictors in the model, you need to have 30 bad outcomes. Not 30 outcomes, 30 bad outcomes, because that's the thing we're trying to predict. So if a paper says that things like smoking, diabetes and compensation are factors for bad outcomes, but the study size is only 50 cases and 8 bad outcomes, I don't buy it. How do you decide which of those predictors becomes the predictor that you think is the important predictor? So, in case it didn't make it into the microphone, what he was asking was, where you have a series of predictors, how do you choose the one, if you're only going to be allowed to have one predictor, how do you choose that one? Well, I think you have to choose the one that makes the most sense and that is the most important for your research question. It's not like just throwing a bunch of stuff at the wall and hoping that you see some type of relationship. And at the risk of making it even more complicated, the ones you mentioned, smoking, diabetes, compensation, well, those three things are often not independent, because lots of diabetics smoke and lots of people who are broken-down smoking diabetics are also on compensation.
So, the things are not even necessarily separate. I don't say that lightly, because I'm a diabetic who stopped smoking not so long ago, but I do have a job and I don't have any insurance claim. So, all I'm saying is that even before we get to the stage of deciding if the number of predictors is too great, they often pay zero attention to whether or not these predictors are even linked to one another. Because if they are correlated, I mean this is going a little bit too far, we can talk about this offline if you like, but if two predictors are actually correlated with one another, only one of them should be in the model. Because first of all, it doesn't make any sense to have them both in the model, right? They're both the same thing, they're heavily correlated. So, let's just say it's smoking and compensation status. If everybody who has a compensation claim smokes, and everybody that doesn't have a compensation claim doesn't smoke, then you really just have to have smoking or compensation claim in your model, because they're the same thing for purposes of the model. That is hardly ever understood. What happens if you put both smoking and compensation status in the model? They may cancel each other out, and neither of them may look predictive, when in fact one of them may be. The reasons for that have to do with the distribution of variability in the variable that's being predicted. And again, you need to talk to somebody who actually knows about the numbers, and that's not me, but that's a fundamental consideration. Yes, sir? Yeah. I think that's a very insightful question, so just to restate it, and tell me if I get it right. I think what he was saying was that if, let's say, we have two linked variables, two heavily correlated variables, and let's just say for example that they are smoking and compensation status, and we only include one, and we show that it does predict the outcome very well, what are we supposed to assume about the one that wasn't included? Well, I think what we're supposed to assume is that because there's a strong correlation between smoking and compensation status, if smoking predicts the outcome, compensation status probably will too, to the extent that smoking and compensation status are correlated. Or do we have to prove that correlation? Yes, but let's say, for example, a very common one is smoking and diabetes predicting something like heart disease. Both of them cause heart disease, I don't think there's any doubt about that. But the correlation between the two of them is sufficiently high that if you're just measuring one, you may be measuring both of them, although together they may have a super-added effect. Until you know a lot about the correlation between the two things, I agree with you that it's hard to know how much the one is contributing and how much the other is contributing, but most of these studies are not sufficiently large that you have the luxury of evaluating all of them. So the best you can do is try to evaluate the variables that matter the most for your research question, whatever it may be, and then go from there. And frankly, having the two variables that are correlated together actually works against the authors, because it actually decreases the chances that either one of them is going to be shown to be significant.
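A small, made-up demonstration of that cancelling-out effect: with two nearly duplicate predictors in a logistic model, the standard errors inflate, so neither may look significant even though one of them truly drives the outcome.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)                       # the predictor that really matters
x2 = x1 + rng.normal(scale=0.1, size=n)       # nearly a duplicate of x1
p = 1 / (1 + np.exp(-(1.0 * x1)))             # outcome depends on x1 only
y = rng.binomial(1, p)

both = sm.Logit(y, sm.add_constant(np.column_stack([x1, x2]))).fit(disp=0)
alone = sm.Logit(y, sm.add_constant(x1)).fit(disp=0)

print("p-values with both correlated predictors:", np.round(both.pvalues[1:], 3))
print("p-value with x1 alone:", round(alone.pvalues[1], 4))
# With both in the model, the shared variance typically leaves neither
# coefficient clearly significant; on its own, x1 is strongly significant.
```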
And so a lot of times we may see studies where the models show that none of the things that, on common sense, should predict the outcome actually do, but that's because of a methodologic error; it isn't really because the two things are not linked. I realize that that's kind of an arcane point, and again I want to emphasize that the editors of the journals you read should be scoping this out for you beforehand, so that what turns up in print you can rely on as being true, or at least having been vetted by somebody who may know more about it than you do, so you can just concentrate on what the result is. But I'm just letting you know that that does not happen much, and so I'm trying to alert you to some of these things. So one last point here, especially for those in the audience who may be orthopedic surgeons. I cannot think of an instance, or not many instances anyway, in which it's reasonable to consider bilateral cases as independent. In other words, I get very irritated by these studies that say there were 83 people and 97 limbs. Why is that a problem? Well, let's take an example that would be familiar to orthopedic surgeons, let's say total hip replacement, okay? A lot of patients that have a total hip replacement have osteoarthritis in both legs. So let's say we're trying to look at the result of total hip replacement; patients have their total hip, and they might walk a distance or do something, run up some stairs or lift something with their leg or something like that. But the result is not the same if you have one normal hip and one reconstructed hip, if you have one osteoarthritic hip that has not been treated yet and one total hip, or if you have bilateral total hips. Those are three very, very separate circumstances. So measuring the result of the right hip when the left hip is normal, or has osteoarthritis, or had osteoarthritis and now has been reconstructed with a total hip, are very, very different circumstances. That's a very graphic example of why bilaterality, right and left, are not independent; they're in the same person. So that hardly ever can be allowed. Even if we're just measuring the size of something in one limb and the size of something in the other limb, they are very likely to be correlated, so considering them as separate entities is not appropriate; my right hand and my left hand are closer to the same than my right hand and his right hand. And so that's a very, very subtle but extremely important distinction to be made. So what are the authors supposed to do with that stuff? If they want to analyze all the limbs that they've operated on as if they were independent cases, that's fine, but what they need to do is stratify their results for the cases that were just unilateral and the cases that were bilateral and see whether or not there's any difference that way, because there may well be. Or they can just analyze one side only. So in many randomized trials that are well planned, let's say one we're just about to start on, randomizing carpal tunnel patients to either immediate surgery or non-operative treatment, if both hands are affected, we're just going to focus on the most symptomatic hand, or maybe we'll just focus on the left hand in bilateral cases, or just the right hand.
You can make some kind of decision beforehand, before the study starts, for your analysis later on, but it will not be fair in most instances to say that the right hand and the left hand are independent of one another. I think that's all I had to say. So if there are any questions about any of that, I'm not really sure if that's what you were expecting, but hopefully that gives you some ideas as to how to better approach some of these papers, and if there are questions, I'm more than happy to answer them. Yes, sir? Two somewhat related questions. One, can you expand on the concept of a well-powered study versus an underpowered one? I mean, just what numbers should we be looking at? And secondly, frequently we'll see something that says there were 88 patients enrolled in the study, but only 50 were seen and followed. Right. Okay. And so can we simply throw out the 88 and just work on the 50? No. Right. So the question of power. What you're trying to decide is whether or not the absence of a difference between the two groups is because not enough patients were analyzed, or because there actually is no difference between the two. So it depends on what you're comparing. Let's say it's age, for instance, okay? If you have 10,000 people in the two groups, 10,000 in this group, 10,000 in that group, a difference in their age of, say, two years is going to look statistically significant. But it's not clinically relevant if the average age over here is 35 and the average age over here is 37. Right? Similarly, or sort of conversely, if you have 20 people in one group with an average age of 37, and 20 people in the other with an average age of 35, those averages might not even hold up at 37 and 35 if you had a larger group, because often the one group really is fundamentally different. So how many patients you need to satisfy the idea of power depends on the comparison that's being made. And I can tell you that the ways of calculating this are very, very basic, and anybody who is a clinical researcher who's even remotely serious about it would know how to do this. So they should be doing that beforehand, to be sure that the comparisons that they then do will have adequate power. So that's that one. Now as far as the patients that are enrolled and disappear, that's a very significant concern. So what can they do about that? The first thing they can do is look at the basic demographic characteristics of the people who did not return, let's say for follow-up, and see if they are fundamentally different in any way, shape, or form from those that were included. If there is a difference, you know, maybe all the old people didn't come back, or all the people that lived in a certain area didn't come back, or all the fat people didn't come back, or, you know, something like that. If that's the case, then the authors are obligated to try to wonder about what it would have meant if those patients were included. They can actually do it statistically. They can assume that all those people that didn't follow up had a bad result and analyze them as if they had a bad result. Or they can analyze them as if 60% of them had a bad result, or whatever. They can do what's known as a sensitivity analysis to decide whether those people that didn't follow up would have had an impact on the results, or if half of them had followed up and had a bad result, would that have changed the conclusion? And again, that's number crunching.
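A crude sketch of that kind of sensitivity analysis, with entirely hypothetical numbers rather than anything from the talk: compare the success rate as reported with what you would get under pessimistic assumptions about the patients lost to follow-up.

```python
# Hypothetical trial arm: 88 enrolled, 50 followed, 42 of the 50 did well.
enrolled, followed, successes = 88, 50, 42
lost = enrolled - followed

reported_rate = successes / followed                   # ignores the dropouts
worst_case_rate = successes / enrolled                 # every dropout counted as a failure
half_bad_rate = (successes + 0.5 * lost) / enrolled    # half of the dropouts did well

print(f"reported success rate: {reported_rate:.0%}")          # 84%
print(f"worst case (all lost fail): {worst_case_rate:.0%}")   # 48%
print(f"if half the lost did well: {half_bad_rate:.0%}")      # about 69%
# If the conclusion survives even the pessimistic assumptions, the dropout
# problem is less worrying; if it flips, the reported result is fragile.
```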
They never do it, but they always should, and I can guarantee you that a manuscript that comes to the Journal of Bone and Joint Surgery would not go forward with a large dropout unless the authors in some way addressed that issue so that the readers can understand what it may have meant. Sure, we showed our intervention was really good, but if 20% of those patients who didn't follow up had had a bad result, our difference at the end would not have been there, that kind of thing. So if you don't see that, you should be very suspicious, or very skeptical, or even dismissive of the results. Thank you. Yes, let me make it easy for you. Okay, so the question was about meta-analysis. The idea at the beginning of the meta-analysis movement was that we would take all these papers that report 20 people, 40 people, 35 people, we'd combine them all together, and then we would have the power of numbers and we'd be able to make better conclusions. But, as you pointed out, the studies are so variable that that has proven to be impractical. So nowadays, I think the only way that a meta-analysis should appear in our literature is if it combines randomized trials. Only randomized trials. Why would we do such a thing? Well, there may be a series of randomized trials that give conflicting results. They may not be generalizable for one reason or another. They may have been underpowered. If any of those conditions exist, underpowered, variable, conflicting results, or poor generalizability because the study was done at the Massachusetts General Hospital by a bunch of professors and may not generalize to some other place, then a meta-analysis may be very useful. Because combining studies that were done in fundamentally a similar way, using fundamentally the same outcomes, that has value. Nothing else has value. Anything that does not look like that is not a meta-analysis. It's more of a narrative literature review, subject to all the biases that the reviews you read in Hand Clinics and everything else are subject to. And there's more to it than that, because the doggedness with which the authors sought out all the papers and all of that kind of stuff plays into it, right? So if they only look at English papers, for instance, that's very poor. We should be getting papers from Europe and from China and all these other places. But usually it's just English only, because English speakers are the ones that produce most of the literature that gets published in North America. There are lots of issues about meta-analysis, but the fundamental one is that unless the methods are basically the same in each of the studies and the outcomes are basically the same in the studies, they cannot be combined, and that condition hardly ever exists. So I think, and this is being taped, so hopefully I don't get mugged on the way out to the airport, but I think that you can fundamentally dismiss the results of the vast majority of meta-analyses that you've read up until now, and a few come to mind. I won't mention them here, but I'll mention them afterwards. There are some that have been published in our literature that are severely flawed, and yet they are constantly referred to, and they're flawed because of the very point you brought up, that combining the data is just inappropriate and leads to conclusions that are essentially random. But they may match the preconceived ideas of the authors. That's a little strident, but that's my job.
Are there any other questions? Yes, sir. Normal distribution. Can you tell that intuitively, or do you need to test for it? That's an excellent question. I mean, for many things, it's intuitive, and I think most people would understand. So the example I gave, say, of weights, that would be understood, because I think we know that people come in all shapes and sizes, and height and weight are pretty much bell curves. If you're not sure, there are absolutely tests that can be done to prove whether or not the distribution is normal, absolutely. And you should consult your statistician for that; we won't go into that today. But if the variable does not meet the criteria of being normal or continuous, you can't use a t-test. And even if it looks like it might be normally distributed and continuous, the distribution of values you have may still not fit a normal distribution. So there are tests that can be done for that. And I'm not saying that those variables are not useful. It's just that you have to use different tests. You have to use non-parametric tests, which actually are so conservative that it makes it much harder to show a statistically significant difference between the two groups. Which, you know, our literature tends to be about showing that this is better than that. There aren't too many studies that are done to show that two interventions are the same, because usually we're doing one intervention that we believe in, and we want to prove that it's better than everybody else's. So using these very conservative non-parametric tests for the type of variable that is not continuous or normally distributed makes that demonstration that much harder, and that's often why these tests are not used. It's much easier to show a difference with a t-test. If the variable is not really amenable to a t-test, then we get spurious results. So there are tests that can be done to prove whether or not a t-test is the right thing. Yeah, there are lots of online resources that all cost money. I mean, they're all there for a commercial reason. I don't know of any apps yet, but I'm positive that there are some. Yeah, I think it's very glib for me to say, oh yeah, go talk to somebody who knows. But these people are really pretty specialized, and they do exist in most universities. So even if you're a private practitioner in a smaller town, there will be somebody at your local university, and I'm not talking about a medical statistician, right? The vast majority of statisticians are people like psychologists or people in education. My master's degree, which actually I completed when I was already old, had a very strong statistical component to it. And the person on my thesis committee who advised about all that was a psychologist, a PhD psychologist who worked in medical education at my university. So it does not have to be a physician, it does not have to be a medical statistician. It can be a psychologist or anybody who's involved in educational testing. But there are lots of online resources; I can't give you one off the top of my head, but if you just Google t-test, you're going to get a bazillion hits, and you can start finding your way there. If you give me your email address at the end of the session, anybody who'd like to do this, I can certainly send you the titles of some books that are so dog-eared from being thumbed through on my desk, even though I consider myself a fully qualified methodologist. I mean, I look through these books on a daily basis, just about.
I'd be happy to send you the titles of these books, and you can look at them. Or maybe, to make it easier, you should contact me via email rather than having me write something down now. My email address, for anybody that's interested, is brent.graham at uhn.ca, uhn as in University Health Network, dot ca, and I'm sure that's available on your app. So it's 8:01. I don't have to go anywhere, so if anybody would like to hang around and discuss some of these issues further, that's great, but I really thank you for coming for such an arcane and obscure subject, and I hope it's been helpful for you. Thanks.
Video Summary
In this video, the speaker, Brent Graham, discusses the importance of understanding the limitations and biases in research studies. He explains his roles as a research director and as a deputy editor for methodology at the Journal of Bone and Joint Surgery, where he reviews manuscripts and evaluates their methodological content. Graham emphasizes the need for readers to approach medical information with skepticism, regardless of the source, and to critically analyze the limitations of study design, potential sources of bias, and the appropriate use of statistical tests. He also touches on the concept of power in research studies, which reflects the probability of detecting a true difference between groups. Additionally, Graham highlights the issue of bilateral cases being treated as independent, stating that they should be analyzed on one side only or stratified in the results. He concludes by discussing the limitations of meta-analyses and the importance of ensuring that studies being combined have similar methods and outcomes. Overall, Graham provides insights into how readers can better understand and interpret the clinical research studies they read.
Keywords
limitations
biases
research studies
skepticism
study design
sources of bias
statistical tests
power
meta-analyses