 Research
 Open Access
Who will benefit from antidepressants in the acute treatment of bipolar depression? A reanalysis of the STEP-BD study by Sachs et al. 2007, using Q-learning
 Fan Wu^{1},
 Eric B Laber^{1},
 Ilya A Lipkovich^{2} and
 Emanuel Severus^{3}
https://doi.org/10.1186/s40345-014-0018-5
© Wu et al.; licensee Springer. 2015
 Received: 30 June 2014
 Accepted: 30 December 2014
 Published: 3 April 2015
Abstract
Background
There is substantial uncertainty regarding the efficacy of antidepressants in the treatment of bipolar disorders.
Methods
Traditional randomized controlled trials and statistical methods are not designed to discover if, when, and to whom an intervention should be applied; thus, other methodological approaches are needed that allow for the practice of personalized, evidence-based medicine for patients with bipolar depression.
Results
Dynamic treatment regimes operationalize clinical decision-making as a sequence of decision rules, one per stage of clinical intervention, that map patient information to a recommended treatment. Using data from the acute depression randomized care (RAD) pathway of the Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) study, we estimate an optimal dynamic treatment regime via Q-learning.
Conclusions
The estimated optimal treatment regime presents some evidence that patients in the RAD pathway of STEP-BD who experienced a (hypo)manic episode before the depressive episode may do better to forgo adding an antidepressant to a mandatory mood stabilizer.
Keywords
 Bipolar disorders
 Q-learning
 Antidepressant
 Dynamic treatment regimes
Background
Bipolar disorders are a group of chronic, lifelong, recurrent psychiatric disorders characterized by episodic shifts in mood, energy, social and vocational functioning, and activity levels (Phillips and Kupfer 2013). Worldwide, bipolar disorders are a leading cause of disability (Vos et al. 2013) and are associated with a substantial economic burden on society (Kleine-Budde et al. 2014). Standard antidepressant medications have proven effective for acute and long-term treatment of unipolar depression (Bauer et al. 2013); however, supporting evidence for the inclusion of standard antidepressants in the acute and long-term treatment of bipolar depression is more limited and controversial (Grunze et al. 2010; Pacchiarotti et al. 2013). Furthermore, there is concern that antidepressants can induce abnormal mood elevation (Licht et al. 2008). We use data from the Systematic Treatment Enhancement Program for Bipolar Disorder (STEP-BD) (Sachs et al. 2003, 2007) to estimate an optimal dynamic treatment regime (DTR) (Chakraborty and Murphy 2014; Murphy 2003; Robins 2004; Schulte et al. 2014) for bipolar depression. A DTR is a sequence of decision rules, one per stage of intervention, that map up-to-date patient information to a recommended treatment; thus, an estimated optimal DTR can be used to generate hypotheses about how patient history and outcomes should dictate treatment selection. The estimated optimal DTR for bipolar depression constructed from the STEP-BD study suggests the hypothesis that standard antidepressants should not be used to supplement mood stabilizers for patients with a (hypo)manic episode immediately preceding the depressive episode.
A DTR aims to select if, when, what, and to whom treatment should be assigned and thereby fits into the paradigm of personalized medicine. Because DTRs select treatment according to the uniquely evolving health status of each patient, they are suited to managing chronic illnesses with patient response heterogeneity; thus, DTRs have tremendous potential for personalizing and improving treatment strategies for bipolar disorder (Leboyer and Kupfer 2010; Nierenberg et al. 2013). Optimal DTRs have been estimated for a wide range of chronic illnesses including major depressive disorder (Chakraborty et al. 2013; Chakraborty and Moodie 2013), attention deficit hyperactivity disorder (Laber et al. 2014; Lei et al. 2012; Nahum-Shani et al. 2012a), schizophrenia (Laber et al. 2014; Shortreed et al. 2011), HIV/AIDS (Moodie et al. 2007; Sterne et al. 2009), and cigarette addiction (Strecher et al. 2006). Estimation of an optimal DTR is typically done as a secondary, exploratory analysis and viewed as a method of generating hypotheses for follow-up confirmatory experiments (Murphy ??). This is the perspective we take here; nevertheless, we show that an estimated optimal DTR appears to perform markedly better than any fixed treatment strategy.
In the “STEP-BD study” section, we review the STEP-BD study. In the “Dynamic treatment regimes and Q-learning” section, we formalize DTRs and introduce the Q-learning estimation algorithm. In the “Analysis of STEP-BD” section, we present an analysis of STEP-BD.
Methods
The study on which our analyses are based was approved by the institutional review board at each study site and was overseen by a data and safety monitoring board (for more details, see http://www.ncbi.nlm.nih.gov/pubmed/17392295).
STEP-BD study
Acute depression randomized pathway
Percentages of different mood stabilizers used in RAD
Mood stabilizer  Percentage (%)
Aripiprazole  2.19 
Carbamazepine  6.28 
Clozapine  0.27 
Lithium  48.91 
Olanzapine  9.84 
Quetiapine  9.02 
Risperidone  7.10 
Valproate  41.53 
Ziprasidone  3.01 
Dynamic treatment regimes and Q-learning
The effective management of a chronic illness requires ongoing personalized treatment (Wagner et al. 2001). DTRs formalize clinical decision-making as a sequence of decision rules, one per treatment decision, that map patient information to a recommended treatment. An optimal DTR yields the optimal expected outcome when applied to assign treatment to a population of interest. One method for estimating an optimal DTR from observational or randomized study data is Q-learning (Murphy ??; Schulte et al. 2014). Q-learning is an approximate dynamic programming algorithm that can be viewed as an extension of regression to multistage decision problems (Nahum-Shani et al. 2012b). As our focus is the application of Q-learning to the RAD study within the RCP pathway, we focus on data from a two-stage randomized trial with a terminal continuous outcome; however, Q-learning applies in much more general settings (Goldberg and Kosorok 2012; Laber et al. 2014; Moodie et al. 2014; Schulte et al. 2014; Sutton and Barto 1998; Watkins and Dayan 1992).
Q-learning estimates an optimal regime using backward induction. For simplicity, we assume that the entire treatment period contains two stages with a distal outcome measured after completion of the second stage; treatment decisions are made at the beginning of each stage. Q-learning proceeds in two steps. In the first step, it estimates an optimal treatment rule for the second stage of treatment given patient-level data accumulated up to and immediately preceding this second treatment assignment. This information includes each patient’s baseline information, stage 1 treatment assignment, and intermediate (i.e., proximal) outcomes measured during the course of the first stage of treatment. These inputs to the second-stage rule are treated as “independent variables” with no attempt to infer what decision at stage 1 would be optimal for a given patient. This first step is achieved by regressing the distal outcome on patient information up to decision stage 2 and manipulating the resulting analytic expression to find, for each patient, which treatment at stage 2 optimizes the expected distal outcome.
In the second step, Q-learning looks for the treatment assignment at stage 1 that would result in the optimal distal outcome, assuming that the subsequent stage 2 treatment will be determined by the rule constructed in the first step. Such backward reasoning allows Q-learning to factor in future decisions when making treatment decisions at earlier stages. This can be contrasted with a myopic strategy that only looks at intermediate (proximal) outcomes of a current treatment assignment. For example, treatments at stage 1 may lead to temporary alleviation of symptoms and therefore appear beneficial for a patient; however, the long-term benefits may become questionable after the later (e.g., second) stage decisions are factored in.
Formal mathematical description of Q-learning
We now present a formal mathematical description of Q-learning. We assume that the data available to estimate a DTR are in the form of \(n\) independent, identically distributed trajectories \(\lbrace (X_{1i}, A_{1i}, X_{2i}, A_{2i}, Y_{i}) \rbrace _{i=1}^{n}\), one for each subject, where: \(X_{1} \in \mathbb {R}^{p_{1}}\) denotes baseline (pre-randomization) subject information; \(A_{1} \in \mathcal {A}_{1}\) denotes the first-stage treatment assignment; \(X_{2} \in \mathbb {R}^{p_{2}}\) denotes information collected during the course of the first-stage treatment, including information dictating first-stage responder status; \(A_{2} \in \mathcal {A}_{2}\) denotes the second-stage treatment assignment; and \(Y\in \mathbb {R}\) denotes a continuous outcome measured at the end of the study, coded so that lower values are better. To match the RAD study, we assume that responders are not re-randomized. In the RAD study, \(X_{1}\) contains a subject’s age, race, gender, marital status, annual household income, employment status, education level, nine different side effect measures, medical insurance type, as well as baseline measures of bipolar type, clinical status prior to the depressive episode, scale scores for mood elevation (SUMM), and scale scores for depression (SUMD); \(A_{1}\) denotes low-dose bupropion, low-dose paroxetine, or placebo; \(X_{2}\) contains responder status at the end of stage 1, as well as SUMM and SUMD at the end of stage 1; \(A_{2}\) denotes either high-dose bupropion or high-dose paroxetine; and \(Y\) is SUMD measured at the end of stage 2.
Throughout, \(H_{j}\) denotes the patient history available when the stage \(j\) decision is made (here taken to be \(H_{1} = X_{1}\) and \(H_{2} = (X_{1}, A_{1}, X_{2})\)), and \(\mathcal {F}_{j}(h_{j}) \subseteq \mathcal {A}_{j}\) denotes the set of treatments that are feasible for a patient with \(H_{j} = h_{j}\). Let \(\Pi = \lbrace \pi = (\pi _{1}, \pi _{2}) \,:\, \pi _{j}(h_{j})\in \mathcal {F}_{j}(h_{j}),\ \forall h_{j}\in \text {dom}\,H_{j}\rbrace \) denote the class of feasible DTRs (for a more formal discussion of feasibility, see Schulte et al. 2014). Because lower values of \(Y\) are better, an optimal DTR, say \(\pi ^{\text {opt}}\), satisfies \(\mathbb {E}^{\pi ^{\text {opt}}}Y \le \mathbb {E}^{\pi }Y\) for all \(\pi \in \Pi \), where \(\mathbb {E}^{\pi }\) denotes expectation under the restriction that \(A_{j}=\pi _{j}(H_{j})\). Define \(Q_{2}(h_{2}, a_{2}) = \mathbb {E}(Y \mid H_{2}=h_{2}, A_{2}=a_{2})\) and \(Q_{1}(h_{1}, a_{1}) = \mathbb {E}(\min _{a_{2}}Q_{2}(H_{2}, a_{2}) \mid H_{1}=h_{1}, A_{1}=a_{1})\). The function \(Q_{2}(h_{2},a_{2})\) measures the “quality” of assigning treatment \(a_{2}\) to a patient presenting at stage 2 with \(H_{2}=h_{2}\); the function \(Q_{1}(h_{1},a_{1})\) measures the quality of assigning treatment \(a_{1}\) to a patient presenting at stage 1 with \(H_{1}=h_{1}\), assuming optimal subsequent treatment. It follows from dynamic programming (Bellman 1957) that \(\pi _{j}^{\text {opt}}(h_{j}) = \arg \min _{a_{j}\in \mathcal {F}_{j}(h_{j})}Q_{j}(h_{j}, a_{j})\). In practice, dynamic programming cannot be applied because the true Q-functions are not known; instead, estimation of \(\pi ^{\text {opt}}\) must rely on the observed data. Q-learning is an approximate dynamic programming algorithm that mimics the dynamic programming solution by replacing the conditional expectations required by dynamic programming with regression models fit to the observed data. Let \(Q_{j}(h_{j},a_{j};\theta _{j})\) denote a postulated working model for \(Q_{j}(h_{j},a_{j})\) indexed by an unknown parameter \(\theta _{j}\).
In RAD, only patients who received placebo as their first-stage treatment and failed to respond were randomized at the second stage. Thus, we use only these subjects to estimate \(\theta _{2}\). Let \(R\) denote a subject’s first-stage responder status, so that \(R=1\) for responders and \(R=0\) for nonresponders. Define \(1_{u}\) to be equal to one if the condition \(u\) is true and zero otherwise. A version of the Q-learning algorithm that applies to data from RAD is:
Algorithm 1: Q-learning for RAD
(Q1) Compute \(\widehat {\theta }_{2} = \arg \min _{\theta _{2}} \sum _{i=1}^{n}\{ Y_{i} - Q_{2}(H_{2i}, A_{2i};\theta _{2})\}^{2}\, 1_{A_{1i} = \text {placebo}}(1-R_{i})\) and, subsequently, the estimator \(Q_{2}\left (h_{2}, a_{2};\widehat {\theta }_{2}\right)\) of \(Q_{2}(h_{2},a_{2})\).
(Q2) Define \(\widehat {Y}_{i} = 1_{A_{1i} = \text {placebo}}\left (1-R_{i}\right) \min _{a_{2}\in \mathcal {F}_{2}(H_{2i})} Q_{2}\left (H_{2i}, a_{2};\widehat {\theta }_{2}\right) + (1_{A_{1i} \neq \text {placebo}} + R_{i}1_{A_{1i}=\text {placebo}})\, Y_{i} \).
(Q3) Compute \(\widehat {\theta }_{1} = \arg \min _{\theta _{1}} \sum _{i=1}^{n}\{ \widehat {Y}_{i} - Q_{1}(H_{1i}, A_{1i};\theta _{1})\}^{2}\) and, subsequently, the estimator \(Q_{1}\left (h_{1}, a_{1}; \widehat {\theta }_{1}\right)\) of \(Q_{1}(h_{1},a_{1})\).
The Q-learning estimator of the optimal regime is \(\widehat {\pi }_{j}\left (h_{j}\right) = \arg \min _{a_{j}\in \mathcal {F}_{j}(h_{j})}Q_{j}\left (h_{j}, a_{j};\widehat {\theta }_{j}\right)\). To estimate \(\pi ^{\text {opt}}\) using data from the RAD study, we posit linear models for the Q-functions. For \(Q_{1}(h_{1},a_{1})\), we posit a model of the form \(Q_{1}(h_{1}, a_{1};\theta _{1}) = h_{10}^{\intercal }\beta _{10} + a_{11}h_{11}^{\intercal }\beta _{11} + a_{12}h_{12}^{\intercal }\beta _{12}\), where \(\theta _{1} = \left (\beta _{10}^{\intercal }, \beta _{11}^{\intercal }, \beta _{12}^{\intercal }\right)^{\intercal }\), \(h_{1k}\), \(k=0,1,2\), are known summary vectors of \(h_{1}\), and \(a_{1k}\), \(k=1,2\), are dummy variables coding two of the three possible treatments at the first stage. For \(Q_{2}(h_{2},a_{2})\), we posit a model of the form \(Q_{2}\left (h_{2}, a_{2};\theta _{2}\right) = h_{20}^{\intercal }\beta _{20} + a_{2}h_{21}^{\intercal }\beta _{21}\), where \(\theta _{2} = \left (\beta _{20}^{\intercal }, \beta _{21}^{\intercal }\right)^{\intercal }\), \(h_{2k}\), \(k=0,1\), are known summary vectors of \(h_{2}\), and \(a_{2}\) is a dummy variable coding one of the two possible treatments at the second stage.
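The three steps of Algorithm 1 with linear working models can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the authors' implementation: the data are synthetic, each history is reduced to a single hypothetical scalar (`x1` at baseline, `x2` after stage 1), arm `0` stands in for placebo, and the helper names (`ols`, `q_learning`, `simulate`) are invented here.

```python
import random

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [u - f * v for u, v in zip(A[r], A[c])]
    return [A[i][p] / A[i][i] for i in range(p)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def q2_features(x2, a2):
    # Main effects plus treatment-by-covariate interaction (a2 is a 0/1 dummy).
    return [1.0, x2, a2, a2 * x2]

def q1_features(x1, a1):
    # Two dummies code two of the three first-stage arms; arm 0 is the reference.
    d1, d2 = float(a1 == 1), float(a1 == 2)
    return [1.0, x1, d1, d1 * x1, d2, d2 * x1]

def q_learning(data):
    """data: list of dicts with keys x1, a1, r, x2, a2, y; lower y is better."""
    rerandomized = lambda d: d["a1"] == 0 and d["r"] == 0
    # (Q1) Fit the stage-2 model on placebo non-responders only.
    s2 = [d for d in data if rerandomized(d)]
    theta2 = ols([q2_features(d["x2"], d["a2"]) for d in s2], [d["y"] for d in s2])
    q2 = lambda x2, a2: dot(q2_features(x2, a2), theta2)
    # (Q2) Pseudo-outcome: predicted outcome under the best stage-2 arm for
    # re-randomized subjects, the observed outcome for everyone else.
    y_hat = [min(q2(d["x2"], a) for a in (0, 1)) if rerandomized(d) else d["y"]
             for d in data]
    # (Q3) Regress the pseudo-outcome on stage-1 history and treatment.
    theta1 = ols([q1_features(d["x1"], d["a1"]) for d in data], y_hat)
    q1 = lambda x1, a1: dot(q1_features(x1, a1), theta1)
    pi1 = lambda x1: min((0, 1, 2), key=lambda a: q1(x1, a))
    pi2 = lambda x2: min((0, 1), key=lambda a: q2(x2, a))
    return theta1, theta2, pi1, pi2

def simulate(n, seed=0):
    """Synthetic two-stage trial; every number here is made up for illustration."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        a1 = rng.choice([0, 1, 2])
        r = int(rng.random() < 0.4)
        noise = rng.gauss(0.0, 0.1)
        if a1 == 0 and r == 0:
            a2 = rng.choice([0, 1])
            y = 2.0 + x2 + a2 * (1.0 - 2.0 * x2) + noise
        else:
            a2 = None
            y = {0: 2.5, 1: 1.0, 2: 4.0}[a1] + noise
        data.append(dict(x1=x1, a1=a1, r=r, x2=x2, a2=a2, y=y))
    return data

theta1, theta2, pi1, pi2 = q_learning(simulate(600))
print(pi2(0.0), pi2(1.0), pi1(0.5))
```

In the simulated truth the stage-2 contrast is \(1 - 2x_2\), so the fitted rule should switch arms as `x2` crosses roughly 0.5; the stage-1 rule is driven entirely by the made-up arm means.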
Results
Analysis of STEP-BD
The Q-learning algorithm stated in the preceding section assumes (i) complete data and (ii) working models for the Q-functions. However, in RAD, as in most clinical trials, a nontrivial amount of covariate and outcome information is missing. Furthermore, there is no strong theory to suggest working models for the Q-functions, so we must use the data to assist in the choice of these models. We combine multiple imputation with stepwise variable selection to estimate the Q-functions and subsequently the optimal treatment regime.
Missing data
One approach to dealing with missing data is multiple imputation (MI) (Rubin 2004). MI creates multiple complete datasets and is thereby suited for conducting a series of exploratory and secondary analyses, including estimation of an optimal treatment regime (Shortreed et al. 2011). We use Bayesian MI, which “fills in” the missing values with draws from the posterior predictive distribution of the missing values given the observed data (for details and underlying assumptions, see Little and Rubin 2002; Van Buuren 2012). Implementation of Bayesian MI requires specification of a prior and a likelihood for the observed data. We specify the joint likelihood through the conditional distribution of each variable given all other variables (for discussion of this approach, see Raghunathan et al. 2001; Van Buuren et al. 2006; Van Buuren 2007). Thus, the likelihood is determined implicitly through a series of regression models, one for each variable that contains missing information. For continuous variables, we use predictive mean matching, and for binary variables, we use logistic regression models. To reduce variance, we use forward stepwise variable selection applied to the complete data to select predictors for each conditional model. Flat improper priors were used for all parameters. Imputations were carried out using the freely available and open-source mice package for R with the default settings (http://cran.r-project.org/web/packages/mice/index.html). Complete R code implementing our imputation model is provided in Additional file 1.
Using the procedure described above, we impute \(m\) complete datasets. For a given choice of \(h_{1k}\), \(k=0,1,2\), and \(h_{2k}\), \(k=0,1\), we can apply the Q-learning algorithm to each imputed dataset to obtain estimated Q-functions \(Q_{j}\left (h_{j}, a_{j};\widehat {\theta }_{j}^{(\ell)}\right),\,j=1,2,\,\ell =1,\ldots, m\). The final estimated optimal decision rule is obtained as the minimizer of the averaged imputed Q-functions, \(\widehat {\pi }_{j}\left (h_{j}\right) = \arg \min _{a_{j}\in \mathcal {F}_{j}\left (h_{j}\right)} m^{-1}\sum _{\ell =1}^{m}Q_{j}\left (h_{j}, a_{j};\widehat {\theta }_{j}^{(\ell)}\right)\).
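Under the linear working models posited earlier, averaging the imputed Q-functions amounts to averaging the fitted coefficients, because each working model is linear in \(\theta _{j}\). As a sketch for the second stage, writing \(\bar {\beta }_{2k}\) (a shorthand introduced here, not in the original notation) for the across-imputation mean and assuming the same summary vectors \(h_{20}, h_{21}\) are used in every imputed dataset:
$$ m^{-1}\sum_{\ell=1}^{m} Q_{2}\left(h_{2}, a_{2}; \widehat{\theta}_{2}^{(\ell)}\right) = h_{20}^{\intercal}\bar{\beta}_{20} + a_{2}\, h_{21}^{\intercal}\bar{\beta}_{21}, \qquad \bar{\beta}_{2k} = m^{-1}\sum_{\ell=1}^{m}\widehat{\beta}_{2k}^{(\ell)}. $$
With \(a_{2}\in \lbrace 0,1\rbrace \) and both treatments feasible, the term \(h_{20}^{\intercal}\bar{\beta}_{20}\) does not depend on \(a_{2}\), so the minimizer depends only on the sign of the averaged contrast:
$$ \widehat{\pi}_{2}(h_{2}) = 1_{h_{21}^{\intercal}\bar{\beta}_{21} < 0}. $$
The same argument applies to the first stage, with the two contrasts \(h_{11}^{\intercal}\bar{\beta}_{11}\) and \(h_{12}^{\intercal}\bar{\beta}_{12}\) compared with each other and with zero.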
Estimated optimal treatment regime and empirical results
Point estimates and confidence intervals for the coefficients indexing the second-stage Q-function
Variable  Abbreviation  Coefficient estimate  90% confidence interval
SUMM1  SUMM1  0.18  (−0.14, 0.90)
SUMD1  SUMD1  0.50  (0.48, 0.83)
Side 3  SIDE3  −0.41  (−2.65, 0.42)
Intercept  Int  2.21  (0.05, 2.00)
\(A_{2}\)×SUMM1  A2_SUMM1  0.77  (−0.16, 1.18)
\(A_{2}\)×Side 3  A2_SIDE3  1.82  (−1.05, 3.18)
\(A_{2}\)  A2  −1.18  (−1.98, 0.00)
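As a worked reading of these point estimates, the second-stage rule can be written out explicitly. This is a sketch only: it ignores the confidence intervals (those of all three treatment terms include or touch zero), and it assumes that SIDE3 is coded 1 when sedation is present and 0 otherwise, which the table does not state. The estimated contrast between the treatment coded \(a_{2}=1\) and the one coded \(a_{2}=0\) is
$$ \widehat{\Delta}_{2}(h_{2}) = Q_{2}\left(h_{2}, 1; \widehat{\theta}_{2}\right) - Q_{2}\left(h_{2}, 0; \widehat{\theta}_{2}\right) = -1.18 + 0.77\,\text{SUMM1} + 1.82\,\text{SIDE3}. $$
Because lower SUMD is better, the estimated rule selects \(a_{2}=1\) when \(\widehat{\Delta}_{2}(h_{2}) < 0\). Without sedation, this requires
$$ \text{SUMM1} < \frac{1.18}{0.77} \approx 1.53, $$
whereas with sedation the contrast is at least \(-1.18 + 1.82 = 0.64 > 0\) for any \(\text{SUMM1}\ge 0\), so the rule selects \(a_{2}=0\). The table does not say which of the two high-dose drugs is coded \(a_{2}=1\), so no drug names are attached to this reading.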
Point estimates and confidence intervals for the coefficients indexing the first-stage Q-function
Variable  Abbreviation  Coefficient estimate  90% confidence interval
Age  AGE  0.02  (−0.01, 0.04)
SUMM0  SUMM0  0.48  (0.35, 0.70)
SUMD0  SUMD0  0.20  (0.15, 0.36)
Prior episode 1  PRONSET1  −0.42  (−1.07, 0.42)
Prior episode 2  PRONSET2  −0.86  (−1.49, 0.05)
Intercept  Int  1.57  (−0.49, 2.50)
\(A_{11}\)×AGE  A11_AGE  0.01  (−0.04, 0.07)
\(A_{11}\)×PRONSET1  A11_PRONSET1  0.66  (−0.80, 2.29)
\(A_{11}\)×PRONSET2  A11_PRONSET2  1.13  (−0.55, 2.82)
\(A_{11}\)  A11  −1.55  (−4.07, 1.25)
\(A_{12}\)×AGE  A12_AGE  −0.03  (−0.08, 0.02)
\(A_{12}\)×PRONSET1  A12_PRONSET1  0.79  (−0.51, 1.92)
\(A_{12}\)×PRONSET2  A12_PRONSET2  1.62  (0.05, 2.90)
\(A_{12}\)  A12  0.73  (−1.48, 3.08)
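To make the link between this table and a decision rule concrete, the short Python sketch below evaluates the two treatment contrasts at the rounded point estimates and picks the arm with the smallest value. It is an illustration under assumptions that the table itself does not confirm: PRONSET1 and PRONSET2 are treated as 0/1 indicators, the arm with \(a_{11}=a_{12}=0\) is treated as a reference with contrast zero, and the function names are hypothetical. The confidence intervals, most of which include zero, are ignored, so the output should not be read as a clinical recommendation.

```python
# Treatment-contrast coefficients copied from the first-stage table (rounded
# point estimates only; terms not interacting with treatment cancel out).
CONTRASTS = {
    "a11": {"int": -1.55, "age": 0.01, "pronset1": 0.66, "pronset2": 1.13},
    "a12": {"int": 0.73, "age": -0.03, "pronset1": 0.79, "pronset2": 1.62},
}

def stage1_contrasts(age, pronset1, pronset2):
    """Estimated change in end-of-study SUMD relative to the reference arm."""
    return {
        arm: c["int"] + c["age"] * age
        + c["pronset1"] * pronset1 + c["pronset2"] * pronset2
        for arm, c in CONTRASTS.items()
    }

def recommend_stage1(age, pronset1, pronset2):
    """Arm with the lowest estimated SUMD; the reference arm has contrast 0."""
    scores = stage1_contrasts(age, pronset1, pronset2)
    scores["reference"] = 0.0  # assumption: a11 = a12 = 0 is the reference arm
    return min(scores, key=scores.get)

# Hypothetical patients: age 40 with neither indicator set, and age 50 with
# PRONSET2 set.
print(recommend_stage1(40, 0, 0))
print(recommend_stage1(50, 0, 1))
```

Main-effect terms such as SUMM0 and SUMD0 are omitted because they shift all arms equally and therefore do not affect the arg min.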
Point estimates and confidence intervals for the expected depression score SUMD at week 12 under static regimes (first-line treatment, second-line treatment) and the estimated DTR
Regime \((\pi _{1}, \pi _{2})\)  Estimated SUMD  90% confidence interval
Estimated DTR  2.13  (1.34, 2.86)
(Bupropion, high-dose Bupropion)  6.91  (6.27, 7.71)
(Paroxetine, high-dose Paroxetine)  8.25  (7.39, 9.07)
(placebo, high-dose Bupropion)  3.71  (3.38, 4.04)
(placebo, high-dose Paroxetine)  4.51  (4.10, 4.90)
Discussion
We estimated an optimal DTR for patients presenting with bipolar depression using data from the RAD pathway in the STEP-BD study. The estimated treatment regime suggests the hypothesis that patients with bipolar depression whose major depressive episode was immediately preceded by (hypo)mania may do better to forgo adjunctive antidepressant treatment with either paroxetine or bupropion, whereas the opposite is true for those who were in remission or experienced a mixed episode (manic episode with mixed features, according to DSM-V) before the current major depressive episode. This is a novel finding that has not been explored so far. At present, there is a consensus that antidepressants in the acute treatment of bipolar depression may be used when there is a history of previous positive response to antidepressants, while they should be avoided in patients with an acute bipolar depressive episode with two or more concomitant core manic symptoms in the presence of psychomotor agitation, in patients with a high number of previous episodes or with a history of rapid cycling, and during depressive episodes with mixed features (Pacchiarotti et al. 2013). Furthermore, the use of antidepressants is discouraged if there is a history of past mania, hypomania, or mixed episodes emerging during antidepressant treatment (Pacchiarotti et al. 2013). However, this consensus is based more on clinical wisdom than on strong external evidence. In our study, the scale scores measuring symptoms of depression and mania were available at baseline and at stage 1 for modeling the Q-functions but did not turn out to be helpful in building an optimal DTR.
So far, there are no reliable data on the differential efficacy of paroxetine and bupropion in younger or older adult patients with bipolar depression. In unipolar depression, a recent meta-analysis suggests that the efficacy of antidepressants in general may be reduced in trials involving patients aged 65 years or older (Tedeschini et al. 2011). Similarly, there have not been any reliable data suggesting that patients with higher scores on mood elevation scales do better on paroxetine than on bupropion, or vice versa (Pacchiarotti et al. 2013). What is well known, on the other hand, is that paroxetine 20 mg/day does not seem to be associated with an increased risk of switch into (hypo)mania in patients with bipolar depression, even in monotherapy (McElroy et al. 2010). The data for our analyses stem from a double-blind, randomized, placebo-controlled trial (Sachs et al. 2007). Consequently, we do not know whether, in clinical practice, not adding any medication or intervention to a mood stabilizer is of comparable benefit for those who do best on placebo in our analyses (Severus et al. 2012).
Conclusions
As mentioned in the introduction, estimation of an optimal DTR is typically done as a secondary, exploratory analysis and viewed as a method of generating hypotheses for follow-up confirmatory experiments. Such a confirmatory analysis is about to start, using data from patients with bipolar depression who were openly treated within the SCP pathway of STEP-BD with the same rating forms, in particular the clinical monitoring form for mood disorders.
Appendix
Variable selection

Using forward variable selection, compute:
$$ \widehat{\mathcal{M}}_{2} = \arg\min_{\mathcal{M}_{2}}\frac{1}{m} \sum_{\ell=1}^{m}\text{BIC}^{(\ell)}(\mathcal{M}_{2}); $$
Table 5
Candidate predictors for regression models in Q-learning
Variable  Description  Abbreviation  Type  Values (range or level)  Mean (SD) or frequency (%)
Age  Age at entry (years)  AGE  Numerical  18–77  40.59 (11.74)
Race  Race  RACE  Binary  White or Caucasian, non-White  90.4%, 9.6%
Gender  Gender  GENDER  Trinary  Male, female, transgender  43%, 56%, 1%
Marriage  Marital status  MARSTAT  Trinary  Never married, married, separated  35.6%, 33.8%, 30.6%
Household Income  Annual household income (×$1000)  HINCOME  Binary  <40, ≥40  58.5%, 41.5%
Employment  Employment status  EMPLOY  Binary  Employed, unemployed  46.9%, 53.1%
Education  Education level  EDUCATE  Binary  College or below, technical school or above  53%, 47%
Insurance  Indicator of medical insurance  MEDINS  Binary  Yes, no  72.8%, 27.2%
Bipolar Type  Bipolar type at entry  BITYPE  Binary  Type I, type II  70.4%, 29.6%
Prior Episode  Clinical episode immediately preceding current depressive episode  PRONSET  Trinary  Remission, (hypo)manic, mixed  45.9%, 33.2%, 20.9%
SUMD0  Scaled depression at entry  SUMD0  Numerical  0.75–18  7.47 (2.30)
SUMD1 ^{a}  Scaled depression at the end of stage 1  SUMD1  Numerical  0–14  4.49 (3.07)
SUMM0  Scaled mood elevation at entry  SUMM0  Numerical  0–7  1.19 (1.09)
SUMM1 ^{a}  Scaled mood elevation at the end of stage 1  SUMM1  Numerical  0–6.75  0.95 (1.30)
Treatment 1 ^{a}  Treatment received at stage 1  Trt1  Trinary  Bupropion, Paroxetine, placebo  23.3%, 25.5%, 51.2%
Side 1  Tremor  SIDE1  Binary  Yes, no  26.9%, 73.1%
Side 2  Dry mouth  SIDE2  Binary  Yes, no  21.1%, 78.9%
Side 3  Sedation  SIDE3  Binary  Yes, no  17.1%, 82.9%
Side 4  Constipation  SIDE4  Binary  Yes, no  5.7%, 94.3%
Side 5  Diarrhea  SIDE5  Binary  Yes, no  12%, 88%
Side 6  Headache  SIDE6  Binary  Yes, no  13.7%, 86.3%
Side 7  Poor memory  SIDE7  Binary  Yes, no  14.3%, 85.7%
Side 8  Sexual dysfunction  SIDE8  Binary  Yes, no  9.7%, 90.3%
Side 9  Increased appetite  SIDE9  Binary  Yes, no  12.6%, 87.4%

Using forward variable selection, compute:
$$ \widehat{\mathcal{M}}_{1} = \arg\min_{\mathcal{M}_{1}}\frac{1}{m} \sum_{\ell=1}^{m}\text{BIC}^{(\ell)}\left(\mathcal{M}_{1}, \widehat{\mathcal{M}}_{2}\right); $$

Let \(Q_{2}\left (h_{2}, a_{2};\widehat {\theta }_{2}\left (\widehat {\mathcal {M}}_{2}\right)\right)\) and \(Q_{1}\left (h_{1}, a_{1};\widehat {\theta }_{1}\left (\widehat {\mathcal {M}}_{1}, \widehat {\mathcal {M}}_{2}\right)\right)\) denote the second- and first-stage estimated Q-functions, respectively.
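The selection criterion above, forward selection that minimizes the BIC averaged over the \(m\) imputed datasets, can be sketched as a generic greedy search. The Python below is a hedged illustration rather than the authors' procedure: `forward_select` and `average_over_imputations` are hypothetical names, the stopping rule (stop when no single addition lowers the averaged score) is an assumption, and `toy_bic` is a made-up stand-in for the real per-dataset BIC, which would come from refitting the working model on each imputed dataset.

```python
def average_over_imputations(bic_fns):
    """Combine per-dataset BIC functions into one averaged criterion."""
    return lambda model: sum(f(model) for f in bic_fns) / len(bic_fns)

def forward_select(candidates, avg_bic):
    """Greedy forward selection: add the variable that lowers avg_bic the most,
    and stop as soon as no single addition improves the criterion."""
    selected = []
    best = avg_bic(selected)
    remaining = list(candidates)
    while remaining:
        score, var = min((avg_bic(selected + [v]), v) for v in remaining)
        if score >= best:
            break
        selected.append(var)
        remaining.remove(var)
        best = score
    return selected, best

def toy_bic(gains, penalty=2.0):
    """Made-up BIC surrogate: a fit term reduced by each useful variable plus a
    complexity penalty per variable. A real analysis would refit the model."""
    return lambda model: (100.0 - sum(gains.get(v, 0.0) for v in model)
                          + penalty * len(model))

# Two hypothetical imputed datasets that disagree slightly on variable value.
bic_fns = [
    toy_bic({"AGE": 5.0, "SUMD0": 9.0}),
    toy_bic({"AGE": 4.0, "SUMD0": 11.0, "RACE": 1.0}),
]
selected, best = forward_select(["AGE", "RACE", "SUMD0"],
                                average_over_imputations(bic_fns))
print(selected, best)
```

Averaging the criterion before selecting yields a single model shared by all imputed datasets, which is what allows the fitted Q-functions to be averaged across imputations afterwards.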
References
Bauer, M, Pfennig A, Severus E, Whybrow PC, Angst J, Möller HJ. World Federation of Societies of Biological Psychiatry (WFSBP) guidelines for biological treatment of unipolar depressive disorders, part 1: update 2013 on the acute and continuation treatment of unipolar depressive disorders. World J Biol Psychiatry. 2013; 14(5):334–85.
Bellman, RE. Dynamic programming. Princeton, NJ: Princeton University Press; 1957.
Chakraborty, B, Murphy SA. Dynamic treatment regimes. Annu Rev Stat Appl. 2014; 1:447–64.
Chakraborty, B, Laber EB, Zhao Y. Inference for optimal dynamic treatment regimes using an adaptive m-out-of-n bootstrap scheme. Biometrics. 2013; 69(3):714–23.
Chakraborty, B, Moodie EE. Statistical reinforcement learning. In: Statistical methods for dynamic treatment regimes. New York: Springer; 2013. p. 31–52.
Goldberg, Y, Kosorok MR. Q-learning with censored data. Ann Stat. 2012; 40(1):529.
Grunze, H, Vieta E, Goodwin GM, Bowden C, Licht RW, Möller HJ, et al. The World Federation of Societies of Biological Psychiatry (WFSBP) guidelines for the biological treatment of bipolar disorders: update 2010 on the treatment of acute bipolar depression. World J Biol Psychiatry. 2010; 11(2):81–109.
Kleine-Budde, K, Touil E, Moock J, Bramesfeld A, Kawohl W, Rössler W. Cost of illness for bipolar disorder: a systematic review of the economic burden. Bipolar Disord. 2014; 16(4):337–53.
Laber, EB, Lizotte DJ, Qian M, Pelham WE, Murphy SA. Dynamic treatment regimes: technical challenges and applications. Electron J Stat. 2014; 8(1):1225–72.
Laber, EB, Lizotte DJ, Ferguson B. Set-valued dynamic treatment regimes for competing outcomes. Biometrics. 2014; 70(1):53–61.
Laber, EB, Murphy SA. Adaptive confidence intervals for the test error in classification. J Am Stat Assoc. 2011; 106(495):940–5.
Lavori, PW, Dawson R. Dynamic treatment regimes: practical design considerations. Clin Trials. 2004; 1(1):9–20.
Leboyer, M, Kupfer DJ. Bipolar disorder: new perspectives in health care and prevention. J Clin Psychiatry. 2010; 71(12):1689–95.
Lei, H, Nahum-Shani I, Lynch K, Oslin D, Murphy S. A “smart” design for building individualized treatment sequences. Annu Rev Clin Psychol. 2012; 8:21–48.
Licht, R, Gijsman H, Nolen W, Angst J. Are antidepressants safe in the treatment of bipolar depression? A critical evaluation of their potential risk to induce switch into mania or cycle acceleration. Acta Psychiatr Scand. 2008; 118(5):337–46.
Little, RJA, Rubin DB. Statistical analysis with missing data. 2nd ed. Chichester: Wiley; 2002.
McElroy, SL, Weisler RH, Chang W, Olausson B, Paulsson B, Brecher M, et al. A double-blind, placebo-controlled study of quetiapine and paroxetine as monotherapy in adults with bipolar depression (EMBOLDEN II). J Clin Psychiatry. 2010; 71(2):163–74.
Moodie, EE, Dean N, Sun YR. Q-learning: flexible learning about useful utilities. Stat Biosci. 2014; 6(2):223–43.
Moodie, EE, Richardson TS, Stephens DA. Demystifying optimal dynamic treatment regimes. Biometrics. 2007; 63(2):447–55.
Murphy, SA. Optimal dynamic treatment regimes (with discussion). J R Stat Soc. 2003; 65(2):331–66.
Murphy, SA. An experimental design for the development of adaptive treatment strategies. Stat Med. 2005; 24(10):1455–81.
Murphy, SA. A generalization error for Q-learning. J Mach Learn Res: JMLR. 2005; 6:1073.
Nahum-Shani, I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, et al. Experimental design and primary data analysis methods for comparing adaptive interventions. Psychol Methods. 2012; 17(4):457.
Nahum-Shani, I, Qian M, Almirall D, Pelham WE, Gnagy B, Fabiano GA, et al. Q-learning: a data analysis method for constructing adaptive interventions. Psychol Methods. 2012; 17(4):478–94.
Nierenberg, AA, Friedman ES, Bowden CL, Sylvia LG, Thase ME, Ketter T, et al. Lithium treatment moderate-dose use study (LiTMUS) for bipolar disorder: a randomized comparative effectiveness trial of optimized personalized treatment with and without lithium. Am J Psychiatry. 2013; 170(1):102–10.
Pacchiarotti, I, Bond DJ, Baldessarini RJ, Nolen WA, Grunze H, Licht RW, et al. The International Society for Bipolar Disorders (ISBD) task force report on antidepressant use in bipolar disorders. Am J Psychiatry. 2013; 170(11):1249–62.
Phillips, ML, Kupfer DJ. Bipolar disorder diagnosis: challenges and future directions. The Lancet. 2013; 381(9878):1663–71.
Raghunathan, TE, Lepkowski JM, Van Hoewyk J, Solenberger P. A multivariate technique for multiply imputing missing values using a sequence of regression models. Surv Methodol. 2001; 27(1):85–96.
Robins, JM. Optimal structural nested models for optimal sequential decisions. In: Proceedings of the Second Seattle Symposium in Biostatistics. New York: Springer; 2004. p. 189–326.
Rubin, DB. Multiple imputation for nonresponse in surveys (Vol. 81). John Wiley & Sons; 2004.
Sachs, GS, Thase ME, Otto MW, Bauer M, Miklowitz D, Wisniewski SR, et al. Rationale, design, and methods of the systematic treatment enhancement program for bipolar disorder (STEP-BD). Biol Psychiatry. 2003; 53(11):1028–42.
Sachs, GS, Nierenberg AA, Calabrese JR, Marangell LB, Wisniewski SR, Gyulai L, et al. Effectiveness of adjunctive antidepressant treatment for bipolar depression. N Engl J Med. 2007; 356(17):1711–22.
Sachs, GS, Guille C, McMurrich SL. A clinical monitoring form for mood disorders. Bipolar Disord. 2002; 4(5):323–7.
Schwarz, G. Estimating the dimension of a model. Ann Stat. 1978; 6:461–4.
Schulte, PJ, Tsiatis AA, Laber EB, Davidian M. Q- and A-learning methods for estimating optimal dynamic treatment regimes. Stat Sci: Rev J Inst Math Stat. 2014; 29(4):640–61.
Severus, E, Seemüller F, Berger M, Dittmann S, Obermeier M, Pfennig A, et al. Mirroring everyday clinical practice in clinical trial design: a new concept to improve the external validity of randomized double-blind placebo-controlled trials in the pharmacological treatment of major depression. BMC Medicine. 2012; 10(1):67.
Shortreed, SM, Laber E, Lizotte DJ, Stroup TS, Pineau J, Murphy SA. Informing sequential clinical decision-making through reinforcement learning: an empirical study. Mach Learn. 2011; 84(1–2):109–36.
Sterne, JA, May M, Costagliola D, De Wolf F, Phillips AN, Harris R, et al. Timing of initiation of antiretroviral therapy in AIDS-free HIV-1-infected patients: a collaborative analysis of 18 HIV cohort studies. The Lancet. 2009; 373(9672):1352–63.
Strecher, VJ, Shiffman S, West R. Moderators and mediators of a web-based computer-tailored smoking cessation program among nicotine patch users. Nicotine Tobacco Res. 2006; 8(S. 1):95.
Sutton, RS, Barto AG. Reinforcement learning: an introduction. MIT Press; 1998.
Tedeschini, E, Levkovitz Y, Iovieno N, Ameral VE, Nelson JC, Papakostas GI. Efficacy of antidepressants for late-life depression: a meta-analysis and meta-regression of placebo-controlled randomized trials. J Clin Psychiatry. 2011; 72(12):1660–8.
Van Buuren, S, Brand JP, Groothuis-Oudshoorn C, Rubin DB. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006; 76(12):1049–64.
Van Buuren, S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007; 16(3):219–42.
Van Buuren, S. Flexible imputation of missing data. CRC Press; 2012.
Vos, T, Flaxman AD, Naghavi M, Lozano R, Michaud C, Ezzati M, et al. Years lived with disability (YLDs) for 1160 sequelae of 289 diseases and injuries 1990–2010: a systematic analysis for the Global Burden of Disease Study 2010. The Lancet. 2013; 380(9859):2163–96.
Watkins, CJ, Dayan P. Q-learning. Mach Learn. 1992; 8(3–4):279–92.
Wagner, EH, Austin BT, Davis C, Hindmarsh M, Schaefer J, Bonomi A. Improving chronic illness care: translating evidence into action. Health Aff. 2001; 20(6):64–78.
Zhang, B, Tsiatis AA, Laber EB, Davidian M. Robust estimation of optimal dynamic treatment regimes for sequential treatment decisions. Biometrika. 2013; 100(3):681–94.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.