## Aphasia treatment outcome measures – some are worse than others

One would hope that anyone who has been involved with aphasia treatment studies has at some point struggled with the question of what primary outcome measure to use. If they have not, I would argue that they have not been paying attention. It is no surprise, therefore, that our C-STAR group appears to have this conversation on a monthly basis, i.e. anytime someone is presenting a new analysis of predictors of aphasia recovery. It is easy to verbally phrase what we are looking to do, namely to find biographical, behavioral-cognitive and neurological predictors of response to aphasia treatment. But what should the outcome measure be, specifically? The simplest answer, I believe, is the raw difference (change) score between the pre-treatment measure and the post-treatment measure. The larger the change, the stronger the response to treatment. “Aha”, you say, “but there is an obvious potential problem with this score, which is especially acute if the outcome test has a maximum score!” Namely, not every patient will have the same potential of getting a high raw change score. Let’s say one patient scores 75 out of the maximum 100 points on the pre-treatment test. This person can improve by maximally 25 points. Another patient scores 40 on the pre-treatment test, and can therefore maximally improve by 60. If they both improve by 20 points, one of the two is practically cured (!), whereas the other still only scores 80 points. Is it fair to treat these two changes as equal?

So, an alternative is to use a weighted change score, by taking into account the patient’s potential for change in the outcome measure. The most straightforward way to do this is to first calculate the potential for change in each patient (maxscore – prescore) and then to divide the actual raw change by that number ((postscore – prescore) / (maxscore – prescore)). This is in fact what has been done in quite a few aphasia treatment studies, and that is all very well when assessing treatment success or comparing different interventions. In prediction studies, however, the objective is different. Remember? We are looking for predictors of response to treatment. One of the strongest predictors of aphasia recovery happens to be the initial or baseline severity of the patient. In chronic aphasia, typically, patients with mild-moderate severity appear to respond better to interventions than more severely impaired patients. This is a general effect that has been observed with different outcome measures, including simple raw scores, so it is likely something real. However, it is also important to understand that if the *weighted* change score is used as the outcome measure, the initial-severity effect is a mathematical certainty! Back to the imaginary experiment: Take a patient who scores 98 out of 100 points pre-treatment. Their maximum potential for positive change is 2 points. If this person improves by one point, their weighted change score is 0.5; they have improved by 50% according to this measure. Now we have a more severe patient, who scores 60 on the pre-treatment test, so that they have a potential maximum gain of 40 points. If they improve by 10 points, that seems pretty good, going from 60 to 70 points. However, the weighted change score for this patient will be 10/40 = 0.25; they have only improved by 25%. 
Surely, that is not fair to the success of the second patient’s recovery efforts, but a more serious problem that this thought experiment should make clear is that there is a practically unavoidable positive correlation between the pre-treatment score and the weighted change score: the higher the pre-treatment score (so, the less severe the patient), the higher the weighted change score, if everybody improves equally (or randomly) in absolute terms. If the second patient also improved by only a single point, their weighted change score would be 1/40 = 0.025. No wonder, then, that initial severity predicts treatment outcome with this weighted measure! If the initial severity is part of the calculation of the outcome measure, in whatever way, this will necessarily yield a relation between initial severity and that same outcome measure. Not really worth reporting, one would think, at least not without some serious side-notes or qualifications.
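To make the coupling concrete, here is a minimal Python sketch. It first reproduces the two imaginary patients above and then simulates a cohort in which absolute gains are drawn independently of the pre-treatment score. Everything about the simulation is an assumption made for illustration: the maximum score of 100, the score ranges, the cohort size and the helper names `weighted_change` and `pearson` are made up and do not come from any of our data.

```python
import random


def weighted_change(pre, post, max_score=100):
    """Raw change divided by the room left for improvement."""
    return (post - pre) / (max_score - pre)


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# The two imaginary patients from the text.
mild = weighted_change(98, 99)    # 1 point gained out of a possible 2
severe = weighted_change(60, 70)  # 10 points gained out of a possible 40

# Simulated cohort (made-up numbers): gains are drawn independently of the
# pre-treatment score, so severity has no real effect on improvement.
rng = random.Random(0)
pre = [rng.randint(40, 95) for _ in range(2000)]
gain = [rng.randint(0, 5) for _ in pre]  # never exceeds the room to improve
post = [p + g for p, g in zip(pre, gain)]

r_raw = pearson(pre, gain)
r_weighted = pearson(pre, [weighted_change(p, q) for p, q in zip(pre, post)])

print(f"mild: {mild:.2f}, severe: {severe:.2f}")
print(f"pre-score vs raw change:      r = {r_raw:+.2f}")
print(f"pre-score vs weighted change: r = {r_weighted:+.2f}")
```

Under these assumptions the raw change should show essentially no relation to the pre-treatment score, whereas the weighted change should correlate clearly and positively with it, purely because of the division by (maxscore – prescore).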

This problem has of course been noted in the stroke literature, typically discussed in the slightly different context of a phenomenon of ‘proportional recovery’. Proportional recovery is the observation that most post-stroke individuals will recover about 70% of the function they lost, i.e. about 70% of their maximum potential improvement, while a smaller subset of patients does not recover so much (see Lazar et al., 2010, and Bonkhoff et al., 2020 for the most recent critique). If this is the natural state of things, then intervention studies should essentially show improvement above and beyond this 70%, as the 70% is what most patients will show anyway, regardless of intervention. There is a lot of literature about this notion in the motor recovery domain, but it too has been argued to be largely due to a mathematical coupling, for example in an excellent and very accessible opinion piece by Hawe et al. (2019), in the journal *Stroke*. If the initial severity is mathematically included (in whatever way) in the outcome measure, it should be no surprise that seemingly non-random patterns arise in data, even if those data themselves are completely random.
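The last point can be illustrated with a small simulation. This is only a sketch of the general idea of mathematical coupling, not a reproduction of the analyses in Hawe et al. (2019); the maximum score of 100, the uniform score distribution and the sample size are assumptions chosen for illustration. Baseline and follow-up scores are generated completely independently, and the change score is then correlated with the initial impairment (maximum score minus baseline).

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


MAX_SCORE = 100  # assumed test ceiling
N = 2000         # assumed sample size

# Completely random data: follow-up scores are unrelated to baseline scores.
rng = random.Random(1)
baseline = [rng.uniform(0, MAX_SCORE) for _ in range(N)]
follow_up = [rng.uniform(0, MAX_SCORE) for _ in range(N)]

# Both derived measures contain the baseline score.
initial_impairment = [MAX_SCORE - b for b in baseline]
change = [f - b for f, b in zip(follow_up, baseline)]

r_scores = pearson(baseline, follow_up)            # the raw scores themselves
r_coupled = pearson(initial_impairment, change)    # the coupled measures

print(f"baseline vs follow-up:        r = {r_scores:+.2f}")
print(f"initial impairment vs change: r = {r_coupled:+.2f}")
```

Because the baseline score enters both the initial impairment and the change score, the second correlation comes out strongly positive even though the scores themselves are unrelated. When baseline and follow-up have equal variance, as assumed here, its expected value is about 0.71 (one over the square root of two).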

Back to aphasia though. We had journal club today, to discuss a paper that is highly relevant to the work going on in the C-STAR group: Osa García et al. (2020), “Predicting Early Post-stroke Aphasia Outcome From Initial Aphasia Severity”, published in *Frontiers in Neurology*. I read the paper this morning, in my socially distanced hammock, and was looking forward to Zooming into the virtual journal club, so it was a pity that only two students showed up! Mostly, this was a pity because the paper provided an excellent illustration of what can go wrong in measuring and comparing aphasia recovery measures, in two ways. Here we go: In this study, the authors set out to identify behavioral and neurological predictors of the severity of language impairment in the subacute stage post-stroke, so that is around 2 weeks post-stroke, based on immediately acute tests and measurements (obtained within 72 hours post-stroke). Anything that can provide patients a more accurate prognosis even in early stages after the stroke can be helpful, so this is in itself a worthy enterprise. To cut to the chase, among a few other predictors, by far the strongest predictor of the composite language score (CS) in the subacute phase turns out to be the CS that was obtained in the acute phase. Once again, severity predicts aphasia recovery, one is tempted to take home. Not so fast, though. What was the outcome measure in this study? Not the raw change score between acute and subacute stages. Not the weighted change score, taking into account the potential for change. No, it was the *absolute CS* in the subacute stage! So, what was predicted was not recovery per se, between the two test points, but rather how the patient was doing in the subacute stage, in the absolute sense. 
In itself, again, perhaps not a problem, but now ask yourself this question: How surprising is it, really, that the test score at the first time point is strongly correlated with the test score at the second time point (two weeks later)? What if none of the scores changed, and all patients had exactly the same scores in the acute and subacute phases? In that case, the correlation would be a perfect and completely irrelevant 1.0! Indeed, a slightly more interesting alternative way in which the same result could be obtained is if patients with higher initial scores (so, less severity) should show proportionally greater gains in the subacute stage. That is also how the authors interpret their findings – less severe patients show higher outcomes. However, let’s take a closer look at something else that is reported in the paper – the only instance where a change score is in fact used: introducing their main results, they write “Achieved [change in] CS positively correlated with the potential [change in] CS (r = 0.651, P = 0.002)” (p. 5). Read this again, carefully, and think about what this means … (the three dots are to give you time to read the previous sentence again). The raw/absolute change score between the initial and second test point was greater when the patient had a greater potential for change. Patients have a greater potential for change if they have *lower* initial scores, so if they are *more* severe. The positive correlation, therefore, means that actual recovery in the subacute stage is indeed predicted by severity, but in the *opposite* direction of what we usually find in studies of chronic aphasia, and also in the opposite direction of how the authors appear to interpret their own data. Apparently, if anything, patients who are *more severely impaired* immediately post-stroke show a *stronger spontaneous recovery* in the subacute stage (two weeks post stroke) in this study. 
That is actually a really interesting finding (it may have to do with acute versus chronic aphasia, and/or with spontaneous versus treatment-induced recovery), but not one that was picked up by the authors, unfortunately. I am well aware that it may seem petty to single out one study for critique in this fashion, but I do think it provides a very clear illustration of what can go wrong if we do not think carefully enough about mathematically necessary relations between predictors and outcome measures. So, let’s all keep doing that!
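The two readings can be checked on a toy example. The scores below are invented for illustration and are NOT data from Osa García et al. (2020); the maximum score of 100 and the variable names are likewise assumptions. The sketch shows that acute and subacute scores can correlate almost perfectly while the achieved change is largest in the most severe patients, and that a correlation with the potential for change is simply the correlation with the initial score with its sign flipped.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


MAX_SCORE = 100  # assumed ceiling of the composite score

# Hypothetical scores for six patients (invented, not from the paper).
acute = [20, 35, 50, 65, 80, 90]
subacute = [45, 55, 62, 72, 84, 92]

achieved = [s - a for a, s in zip(acute, subacute)]  # raw change
potential = [MAX_SCORE - a for a in acute]           # room for change

r_no_change = pearson(acute, acute)        # nobody changes at all
r_absolute = pearson(acute, subacute)      # absolute score at both time points
r_potential = pearson(achieved, potential) # the kind of correlation quoted above
r_initial = pearson(achieved, acute)       # same relation, seen from the initial score

print(f"acute vs identical subacute:   r = {r_no_change:+.3f}")
print(f"acute vs subacute:             r = {r_absolute:+.3f}")
print(f"achieved vs potential change:  r = {r_potential:+.3f}")
print(f"achieved change vs acute:      r = {r_initial:+.3f}")
```

In this made-up cohort the absolute scores at the two time points are very highly correlated, yet the patients with the lowest acute scores gain the most. Since the potential for change is just the maximum score minus the acute score, the last two correlations have the same size and opposite signs.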

I often feel it is easier to point out problems than it is to figure out solutions (guilty!). So, what is the best solution here? What outcome measure(s) should our group be using, in the quest for predictors of aphasia recovery and response to treatment? My two cents are that, if we do want to investigate to what extent baseline severity is a predictor for response to treatment, raw change scores still seem to be the most informative. Of course, raw change scores become particularly attractive if there is no natural ceiling to performance in the first place! The use of raw change avoids mathematical coupling between initial and change scores, but does leave us with the bias noted at the beginning of this blog. Therefore, in recognition of the fact that all of these measures are flawed to different extents, and for different reasons, I also believe it is best to provide multiple measures in our reports, including the weighted change score. If any single perspective is necessarily blurred, I think it is a good idea to at the very least provide multiple perspectives, in order to get the best idea of the true picture. My colleague Julius Fridriksson is quick and quite right to point out that this is an absolute no-no for clinical-trial grant applications, as it may tempt researchers to cherry-pick from multiple outcome measures. I can feel another blog bubbling up on that topic! Whatever we choose, however, what we *cannot* do is (1) make unqualified claims about initial absolute severity predicting absolute severity at later stages, or (2) make unqualified claims about initial severity predicting change scores if those scores themselves are directly weighted by the initial severity in the first place. In any case, better to have inflated methods and results sections, than to have artificially inflated outcome measures, wouldn’t you agree?

Bonkhoff AK, Hope T, Bzdok D, Guggisberg AG, Hawe RL, Dukelow SP, Rehme AK, Fink GR, Grefkes C, Bowman H (2020) Bringing proportional recovery into proportion: Bayesian modelling of post-stroke motor impairment. *Brain* 143(7):2189–2206. doi: 10.1093/brain/awaa146

Hawe RL, Scott SH, Dukelow SP (2019) Taking proportional out of stroke recovery. *Stroke* 50:204–211. doi: 10.1161/STROKEAHA.118.023006

Lazar RM, Minzer B, Antoniello D, Festa JR, Krakauer JW, Marshall RS (2010) Improvement in aphasia scores after stroke is well predicted by initial severity. *Stroke* 41:1485–1488. doi: 10.1161/STROKEAHA.109.577338

Osa García A, Brambati SM, Brisebois A, Désilets-Barnabé M, Houzé B, Bedetti C, Rochon E, Leonard C, Desautels A and Marcotte K (2020) Predicting early post-stroke aphasia outcome from initial aphasia severity. *Front. Neurol.* 11:120. doi: 10.3389/fneur.2020.00120