Optimal Sample Size for Multiple Testing

We consider the choice of an optimal sample size for multiple-comparison problems. The motivating application is the choice of the number of microarray experiments to be carried out when learning about differential gene expression. However, the approach is valid in any application that involves multiple comparisons in a large number of hypothesis tests. We discuss two decision problems in the context of this setup: the sample size selection and the decision about the multiple comparisons. We adopt a decision-theoretic approach, using loss functions that combine the competing goals of discovering as many differentially expressed genes as possible while keeping the number of false discoveries manageable. For consistency, we use the same loss function for both decisions. The decision rule that emerges for the multiple-comparison problem takes the exact form of the rules proposed in the recent literature to control the posterior expected false-discovery rate. For the sample size selection, we combine the expected utility argument with an additional sensitivity analysis, reporting the conditional expected utilities and conditioning on assumed levels of the true differential expression. We recognize the resulting diagnostic as a form of statistical power, facilitating interpretation and communication. As a sampling model for observed gene expression intensities across genes and arrays, we use a variation of a hierarchical gamma/gamma model. But the discussion of the decision problem is independent of the chosen probability model. The approach is valid for any model that includes positive prior probabilities for the null hypotheses in the multiple comparisons and that allows for efficient marginal and posterior simulation, possibly by dependent Markov chain Monte Carlo simulation.


INTRODUCTION
We consider the problem of sample size selection for experiments involving massive multiple comparisons. Approaching the sample size question as a decision problem, we argue that a solution needs to address the choice of sample size as part of a larger decision problem, involving both the sample size decision before carrying out the experiment and the later decision about the multiple comparisons once the data have been collected. We consider loss functions that combine the competing goals of controlling false-positive decisions and controlling false-negative decisions. For various reasonable loss functions, we show that the form of the terminal decision is the same: Reject all comparisons with marginal posterior probability beyond a certain threshold. We prove a formal result about the slow rate of change of the expected utility over a range of practically relevant sample sizes; this suggests that the preposterior expected utility alone may not suffice for a decisive sample size recommendation. Motivated by this, we conclude by recommending appropriate summaries of sensitivity of the expected utility with respect to the parameters of interest. The discussion includes specific algorithms to evaluate the proposed diagnostics. With a view toward the motivating application, we propose a nonparametric probability model that allows the use of pilot data to learn about the relevant sampling distribution for the sample size decision. Formally, this amounts to using the posterior predictive distribution from the pilot data as the prior model used in the sample size calculation.
Our discussion is motivated by the specific problem of choosing the number of replications in microarray experiments. Gene expression microarrays are technologies for simultaneously quantifying the level of transcription of a large portion of the genes in an organism (Schena, Shalon, Davis, and Brown 1995; Duggan, Bittner, Chen, Meltzer, and Trent 1999). (For a recent review of microarray technology and related statistical methods see, e.g., Kohane, Kho, and Butte 2002.) The range of applications is broad. Here we focus on controlled experiments that aim to search or screen for genes whose expression is regulated by modifying the conditions of interest, either environmentally or genetically. A number of pressing biological questions can be addressed using this type of genomic screening. Because microarrays are costly, the design of the experiment and choice of sample size result in a difficult trade-off between the allocation of limited research resources and statistical learning. Our approach is applicable to the process of selecting the number of biological replicates (such as the number of mice to be assigned to each group), as well as the selection of the number of technological replicates (such as the number of separate aliquots of RNA to be extracted from two biological samples being compared).

Peter Müller is Professor, Department of Biostatistics, University of Texas M. D. Anderson Cancer Center, Houston, TX 77030 (E-mail: pm@odin.mdacc.tmc.edu). Giovanni Parmigiani is Associate Professor, Department of Oncology, Biostatistics and Pathology, Johns Hopkins University, Baltimore, MD 21205 (E-mail: gp@jhu.edu). Christian Robert is Professor, CEREMADE, Université Paris Dauphine, and CREST, INSEE, France (E-mail: xian@ceremade.dauphine.fr). Judith Rousseau is Associate Professor, Université René Descartes, Paris, and CREST, INSEE, France (E-mail: rousseau@ensae.fr). The order of the authors is strictly alphabetical.
In our theoretical discussion, we use the term "replicate" to refer to either type. Each situation requires a different interpretation of the array-to-array variability, as well as different priors or different pilot samples. Our illustration specifically concerns biological replicates. General issues of experimental design in microarray experiments, beyond the sample size selection considered in this article, have been discussed by Kerr and Churchill (2001) and Simon, Radmacher, and Dobbin (2002).
The general goal of genomic screening is to discover as many of the genes that are differentially expressed across the experimental conditions as possible, while keeping the number of false discoveries at a manageable level. The consequences of a false discovery are often similar across genes. The decision-making process when selecting a sample size can benefit from explicitly acknowledging these experimental goals by following a formal decision-theoretic approach. Decision-theoretic sample size selection, including the multistage nature central to our discussion, was formalized within a Bayesian framework as early as Raiffa and Schlaifer (1961). (See also Lindley 1997 or Adcock 1997 and the references therein for discussions of sample size determination.) Following this paradigm, we present a general decision-theoretic framework for the choice of sample size for genomic screening or for use in a similar selection problem. Central to our analysis is the concept of the false-discovery rate (FDR), introduced by Benjamini and Hochberg (1995). In controlled experiments, it is plausible to assume that genes can be divided into two groups, truly altered genes and truly unaltered genes. For a given approach to selecting a set of putatively altered genes, the FDR is the fraction of truly unaltered genes among the genes classified as differentially expressed. Common microarray software uses the FDR to guide gene selection (see, e.g., Tusher, Tibshirani, and Chu 2001). Applications of FDRs to microarray analysis have been discussed by Storey and Tibshirani (2003), and extensions have been given by Genovese and Wasserman (2003), who also introduced the definition of the posterior expected FDR as we use it here. We show that the decision-theoretic approach leads to a multiple-comparison decision of the form described by Genovese and Wasserman (2003). They focused on decision rules of the following kind.
Assume that for each comparison, some univariate summary statistic v i is available. This could be, for example, a p value or any other univariate statistic related to the comparison of interest. All comparisons with v i beyond a certain cutoff t are considered discoveries. Central to their approach is the use of an upper bound on the FDR to calibrate that cutoff t.
In the context of microarray experiments, important initial progress toward the evaluation of sample sizes was made by Pan, Lin, and Le (2002), who developed traditional power analyses for use in the context of microarray experiments. Their modeling is realistic in that it acknowledges heterogeneity in gene-specific noise and specifies a mixture model for regulated and unregulated genes. Further progress is needed, however. Pan et al. (2002) did not exploit heterogeneity in developing screening statistics, as is done by hierarchical models. This can potentially underestimate the power, especially in the critical range of experiments with very few replicates. Also, their power analysis considers a single effect size for all regulated genes. Finally, explicit consideration of properties of the entire selection, such as the FDR, is preferable in the context of multiple testing. Zien, Fluck, Zimmer, and Lengauer (2002) proposed an alternative approach to an informed sample size choice. They considered receiver operating characteristic-type curves and showed achievable combinations of false-negative and false-positive rates. Mukherjee et al. (2003) discussed sample size considerations for classification of microarray data. They assumed a parametric learning curve for the empirical error as a function of the sample size, and based their approach on estimating the parameters of that learning curve. Lee and Whitmore (2002) considered an ANOVA setup including, among other parameters, interaction effects for gene and biological condition. Hypothesis testing for these interactions formalizes the desired inference about differential expression. They assumed approximate normality for an estimator of these interaction effects and proceeded with a conventional power analysis. Bickel (2003) proposed a framework for inference on differential gene expression that includes a loss function consisting of a payoff for correct discoveries and a cost for false discoveries. (See Sec. 2.1 for a definition of these events.) The net desirability function defined by Bickel (2003) is equivalent to one of the loss functions introduced in Section 2.1.
As in many traditional sample size problems, the practical use of the proposed approach will be as a decision support tool. We do not expect investigators to blindly trust the proposed solution. Rather, we envision that an investigator may be operating under budget and resource constraints that allow for a narrow range of sample size choices. The proposed methods can guide the choice within that range by informing the investigator about the likely payoffs and decision summaries.
In Section 2 we outline the decision problem and our approach to the solution in a general form, without referring to a specific probability model. In Section 3 we develop an efficient simulation approach for evaluating the required sample size selection criteria. We define a Monte Carlo simulation method that allows us to evaluate the expected false-negative rate (FNR) and power across the sample sizes. We demonstrate that because of their preposterior nature, the required simulations are easier and less computationally intensive than posterior simulation in the underlying probability model. In Section 4 we introduce a specific probability model that we then use in Section 5 to show results in an example. In Section 6 we conclude with a final discussion.

THE DECISION PROBLEMS
To highlight the general nature of the proposed approach, we first proceed without reference to a specific probability model or comparison of interest. We let ω and y denote the model parameters and expression measurements, and let z i ∈ {0, 1} denote an indicator for the regulation of gene i. Regulation is broadly defined to include any of the typical questions of interest, such as differential expression across two conditions, time trends, sensitivity to at least one out of a panel of compounds, and so forth. We assume that the probability model includes indicators z i as parameters or easily imputed latent variables. We write y J when we want to highlight that the data y are a function of the sample size J . We assume that the underlying probability model allows for efficient posterior simulation.
Let v i = P (z i = 1|y) denote the marginal posterior probability for the ith comparison. Computation of v i could involve some analytical approximations, such as empirical Bayes estimates for hyperparameters. In Section 4 we introduce the probability model used in our implementation and discuss posterior inference in that model.
An important aspect of the problem is that the earlier decision about the sample size needs to take into account the later decision about gene selection. This will be either a selection (also referred to as a discovery or rejection, and denoted by d_i = 1 for comparison i) or not (also referred to as a negative and denoted by d_i = 0). Decision-theoretic approaches to sample size selection assume that the investigator is a rational decision maker choosing an action that minimizes the expected loss of the possible consequences, averaging with respect to all of the relevant unknowns (Raiffa and Schlaifer 1961; DeGroot 1970). At the time of the sample size decision, the relevant unknowns are the data y, the indicators z = (z_1, . . . , z_n), and the model parameters ω. The relevant probability model with respect to which we average is the prior probability on (z, ω) and the conditional sampling distribution on y given (z, ω). At the time of the decision about multiple comparisons, the data y are known and the relevant probability model is the posterior distribution conditional on y.
The solution proceeds in a traditional backward induction fashion by first considering the terminal multiple comparison decision of the gene selection. Knowing the optimal policy for the eventual gene selection, we can then approach the initial sample size problem. It is thus natural to first discuss inference about the multiple comparison decisions d i , i = 1, . . ., n.

Terminal Decision
The choice of a decision rule for multiple comparisons is driven by the following considerations. First, the rule should have a coherent justification as the solution that minimizes the expected loss under a sensible loss function. Second, inference about the multiple comparison decision is nested within the sample size selection, making computational efficiency an important issue. In the type of experiments considered here, a relatively small number of genes are regulated, and the noise levels are relatively high. Finally, although our approach is based on joint probability models on data and parameters (i.e., in essence Bayesian), we are concerned about frequentist operating characteristics for the proposed rule. The use of frequentist properties to validate Bayesian inference is common practice in the context of medical decision making.
With these considerations in mind, we propose loss functions that have the following characteristics: they capture the typical goals of genomic screening; they are easy to evaluate; they lead to simple decision rules; and they can be interpreted as generalizations of frequentist error rates. We consider four alternative loss functions that all lead to terminal decision rules of the same form. We start with notation for various summaries that formalize the two competing goals of controlling the false-negative and false-positive decisions. Writing D = Σ_i d_i for the number of discoveries, we let

FDR(d, z) = Σ_i d_i (1 − z_i) / (D + ε)  and  FNR(d, z) = Σ_i (1 − d_i) z_i / (n − D + ε)

denote the realized FDR and FNR. FDR(·) and FNR(·) are the proportions of wrong decisions, relative to the number of discoveries and negatives, where the additional term ε avoids a zero denominator. (See, e.g., Genovese and Wasserman 2003 for a discussion of FNR and FDR.) Conditioning on y and marginalizing with respect to z, we obtain the posterior expected FDR and FNR,

FDR(d, y) = ∫ FDR(d, z) dp(z|y) = Σ_i d_i (1 − v_i) / (D + ε)

and

FNR(d, y) = ∫ FNR(d, z) dp(z|y) = Σ_i (1 − d_i) v_i / (n − D + ε).

Let FD = Σ_i d_i (1 − v_i) and FN = Σ_i (1 − d_i) v_i denote the posterior expected counts of false discoveries and false negatives. We consider four ways of combining the goals of minimizing false discoveries and false negatives. The first two specifications combine the counts and the rates, leading to the posterior expected losses

L_N(d, y) = c · FD + FN  and  L_R(d, y) = c · FDR + FNR.

The loss function L_N is a natural extension of (0, 1, c) loss functions for traditional hypothesis testing problems (Lindley 1971). From this perspective, the combination of error rates in L_R seems less attractive: the loss for a false discovery or false negative depends on the total number of discoveries or negatives. Alternatively, we consider bivariate loss functions that explicitly acknowledge the two competing goals, leading to the posterior expected losses

L_2N(d, y) = (FD, FN)  and  L_2R(d, y) = (FDR, FNR).
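The posterior expected error summaries above depend on the data only through the marginal posterior probabilities v_i, so they are straightforward to evaluate once the v_i are available. A minimal sketch (the function name and the choice of ε are ours, not the authors'):

```python
import numpy as np

def posterior_error_summaries(v, d, eps=1e-10):
    """Posterior expected error summaries as defined in the text.
    v: marginal posterior probabilities v_i = P(z_i = 1 | y);
    d: 0/1 decision vector; eps avoids a zero denominator."""
    D = d.sum()                        # number of discoveries
    FD = np.sum(d * (1.0 - v))         # expected count of false discoveries
    FN = np.sum((1 - d) * v)           # expected count of false negatives
    FDR = FD / (D + eps)               # posterior expected FDR
    FNR = FN / (len(v) - D + eps)      # posterior expected FNR
    return FD, FN, FDR, FNR
```

All four loss functions of this section are simple combinations of these four quantities.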
Using posterior expectations, we marginalize with respect to the unknown parameters, leaving only d and y as the arguments of the loss function. The sample size is indirectly included in the dimension of the data vector y. For the bivariate loss functions, we need an additional specification to define the minimization of the bivariate functions. A traditional approach to select an action in multicriteria decision problems is to minimize one dimension of the loss function while enforcing a constraint on the other dimensions (Keeney, Raiffa, and Meyer 1976). We thus define the optimal decisions under L 2N as the minimization of FN subject to FD ≤ α N . Similarly, under L 2R , we minimize FNR subject to FDR ≤ α R . Under all four loss functions, the optimal decision for the multiple comparisons takes the same form.
Theorem 1. Under all four loss functions, the optimal decision takes the form d*_i = I(v_i ≥ t*), where t* is the D*-th largest value among {v_1, . . . , v_n} and D* is the optimal number of discoveries. See the proof in the Appendix for a constructive definition of D*.
The proof proceeds by straightforward algebra (see App. A for details). Under L R , L 2N , and L 2R , the optimal threshold t * depends on the observed data. The nature of the terminal decision rule d i is the same as that in the work of Genovese and Wasserman (2003), which is a more general rule allowing the decision to be determined by cutoffs on any univariate summary statistic v i . Using v i = P (z i = 1|y) is a special case.
For simplicity, in what follows we focus on L = L_2N only, omitting the subscript 2N to simplify notation. In Section 5.3 we revisit the other three loss functions. Also, by a slight abuse of notation, we write d = t for the decision rule d_i = I(v_i ≥ t). Finally, we note that not all loss functions lead to decisions of the form d_i = I(v_i ≥ t). For example, assuming a loss for a false negative that depends on the true level of differential expression would lead to different rules. One could argue that discovering a gene that shows a very small differential expression in a given experiment may not be as interesting as discovering one that shows a major change in its expression.
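Under L = L_2N, the optimal rule has a simple constructive form: rank comparisons by v_i and add discoveries while the posterior expected false-discovery count stays within the bound α_N. A hypothetical sketch (not the authors' code):

```python
import numpy as np

def optimal_discoveries(v, alpha_N):
    """Constructive rule under L_2N: rank comparisons by v_i (largest
    first) and add discoveries while the posterior expected count of
    false discoveries, FD = sum_i d_i (1 - v_i), stays <= alpha_N."""
    order = np.argsort(-v)                 # comparisons sorted by v_i, descending
    fd = np.cumsum(1.0 - v[order])         # expected FD after each inclusion
    D_star = int(np.sum(fd <= alpha_N))    # optimal number of discoveries D*
    d = np.zeros(len(v), dtype=int)
    d[order[:D_star]] = 1                  # d_i = I(v_i >= t*), t* = v_(D*)
    return d
```

Because each additional discovery adds v_i to the avoided false negatives at an expected FD cost of 1 − v_i, the greedy inclusion in descending order of v_i realizes the threshold rule of Theorem 1.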

Marginal FN and FNR.
In contrast to the terminal decision of the gene selection, which is made conditional on the observed data, the sample size is decided before conducting the experiment. Thus we now consider the marginal prior mean of the proposed loss functions, also known as the preposterior expected loss (Raiffa and Schlaifer 1961), after substituting the optimal terminal decision for the multiple-comparison decision. The relevant loss function L_m(J) for the sample size selection is

L_m(J) = ∫ min_d { FN(d, y_J) | FD(d, y_J) ≤ α_N } dp(y_J).   (2)

The conditioning bar in the nested optimization indicates that the minimization is subject to the bound on FD. The sequence of alternating between expectation and optimization is characteristic of sequential decision problems. (See, e.g., DeGroot 1970 and Berger 1985 for discussions of sequential decision problems in general.) The expectation is with respect to the prior probability model on the data y_J under a given sample size J. The only argument left after the expectation and the minimization is the sample size J. The nested minimization with respect to d is the solution of the multiple-comparison problem. It reduces to d_i = I(v_i ≥ t*(y_J)), with the bound on FD being implicit in the definition of t*(y_J). Thus we could alternatively write (2) as L_m(J) = FN_m(J). We use analogous definitions for FNR_m, FDR_m, and FD_m. The latter is equal to α_N by definition of t*(y_J).
In Section 3 we introduce a simulation-based algorithm for a practical evaluation of the expectation and nested optimization in (2). Using this algorithm, one could evaluate and then plot the marginal expected utility (i.e., FN m ) against J to select a sample size. At this time one could add a (deterministic) sampling cost, if desired. But in practical application this would require the difficult choice of a relative weight for the sampling cost versus an inference loss. Alternatively, we take a goal-oriented perspective and use the plot of L m (J ) versus the sample size J to find a sample size for any set goal of L m (J ). However, doing this entails a practical complication. For relevant sample sizes of J ≤ 20, the decrease in L m (J ) is too flat to allow a conclusive choice of sample size; see Figure 1(a) for an example. The slow rate of decrease is a general feature of FNR and FN.
Theorem 2. Consider the three loss functions L = L_2N, L_N, and L_2R. The false-negative rate FNR and false-negative count FN decrease asymptotically as

FNR(t*, y_J), FN(t*, y_J) = O(√(log J / J)),

where t* generically indicates the optimal cutoff under each of the three loss functions. For both results, we must assume that the genes are "randomly chosen," that is, that a fraction p, 0 < p < 1, of the genes is truly differentially expressed. In other words, we assume that the level of differential expression is neither always equal to 0 (or very small) nor always different from 0. A formal argument is given in Appendix B. The argument starts with a Laplace approximation for v_i = P(z_i = 1 | y_J). Based on this approximation, we show that only genes with a low or zero differential expression are included in the negative set, that is, the set of genes with d_i = 0. We then approximate the average in FNR (or FN) by an integral, exploiting the fact that these are genes with small differential expression. Finally, we recognize that the integral expression is of order √(log J / J).

Conditional Preposterior Expected Utility.
The relatively flat nature of the expected utility L_m(J) does not allow for a conclusive sample size recommendation. A natural next step is to investigate the expected utility as a function of an assumed true value for some key parameters of the probability model. Specifically, we assume that the probability model includes a parameter ρ_i that represents the level of differential expression for gene i, with ρ_i = 0 if z_i = 0 and ρ_i ≠ 0 when z_i = 1. For example, in the probability model discussed in Section 4, we would use ρ_i = log(θ_0i/θ_1i). We thus proceed by considering the expected utility, conditional on an assumed true level of ρ_i. Recall the definition of FN as the false-negative count. Conditioning on ρ_i changes only the term related to gene i. For large n, conditioning on ρ_i for one gene leads to only negligible changes in inference for the other genes (a posteriori as well as a priori). Finally, note that for ρ_i ≠ 0, gene i contributes only to FN, not to FD. Thus we can characterize the conditional expected utility as a function of ρ_i by considering

β(J, ρ_i) = E[ I{v_i(y_J) ≥ t*(y_J)} | ρ_i ],   (3)

writing v_i(y_J) to highlight the nature of the marginal posterior probability v_i as a function of the data. The expectation is with respect to the joint probability model on the data y_J, conditional on ρ_i.
In particular, the expected utility appropriately adjusts for dependencies, uncertainties on other model parameters, and the entire process of finding and applying t * ( y J ). Assuming that the genes are a priori exchangeable, the marginal expectation is the same for all i, allowing us to drop the i subindex.
The diagnostic β(J, ρ) has interesting interpretations. We define it as the term in the conditional expected utility that varies as a function of ρ_i. Our main reason for proposing it is its link with the traditional notion of power. The definition of β is essentially the power to test one hypothesis in the multiple-comparison decision, although with the added twist of marginalizing with respect to all other parameters. To simplify the terminology, we refer to β(J, ρ) as "power," with the understanding that the definition includes the mentioned marginalizations. Figure 2 shows a typical example.
Thus the following modification to the approach outlined in Section 2.2.1 emerges. The investigator fixes a minimum level of differential expression that is of interest in the given experiment and the desired probability of discovering a gene that is differentially expressed at that level. Inspection of a power plot like Figure 2, which shows β(J, ρ) in (3) as the average posterior probability of discovery conditional on the true level of differential expression ρ_i = log(θ_0i/θ_1i), together with FN_m and FD_m in the marginal loss function, allows the investigator to obtain an informed sample size choice. The FD_m and FN_m plots, along with plots of the related summaries FDR_m and FNR_m, add the experiment-wise dimension to the marginal, comparison-wise summary provided by the power plot. They tell the investigator how many false negatives might be missed, averaging over the range of likely differential expression levels and summing over all genes. Computation of β(J, ρ) is achieved within the same preposterior simulation that is used to evaluate FN_m and FD_m.

SIMULATION
The approach to sample size selection that we describe involves several calculations that are typically analytically intractable. Details depend on the specific probability model. Often the marginal posterior probabilities v_i, the threshold t*(y_J), and the expected FNR are not available in closed form; however, all can be computed by Monte Carlo simulation. In this section we describe how such Monte Carlo simulation is implemented. Before giving a step-by-step algorithm, we introduce notation and review the important steps of the algorithm in words. The discussion remains free of any reference to a particular probability model.
For a given sample size J , we simulate data y J ∼ p( y J ). Simulating from the marginal p( y J ) = p( y J |ω, z) dp(ω, z) is conveniently implemented by first generating "true" parameters (ω, z) from the prior, and then generating from the assumed sampling model p( y J |ω, z) given the simulated parameter. To distinguish this prior simulation from a posterior Markov chain Monte Carlo (MCMC) simulation that is required later in the algorithm, we mark the realizations of this prior simulation by a superindex as in ω o , and so on.
For each simulated dataset y_J, we compute the marginal posterior probabilities v_i = p(z_i = 1 | y_J) and evaluate FD(t, y_J) and FDR(t, y_J) on a grid over t to find the optimal cutoff t*(y_J). Plugging the optimal cutoff t* into d_i = I(v_i > t), we evaluate the posterior means FN(t*, y_J) and FNR(t*, y_J). Averaging over y_J by (independent) Monte Carlo simulation, with repeated simulation of y_J ∼ p(y_J), we compute

FN_m(J) = ∫ FN(t*(y_J), y_J) dp(y_J).   (4)

The nonlinear thresholding d_i = I(v_i > t*) implicit in the definition of FN hinders the interpretation of (4) as one joint integral with respect to the joint distribution p(ω, y_J) on parameters and data. Instead, we need to proceed with the two nested steps described earlier. Finally, evaluating (4) across J, we find the sample size J* that achieves a desired marginal expected loss. The information in the marginal loss L_m(J) is supplemented by the power curves β(J, ρ). The power β(J, ρ) defined in (3) is a summary of the preposterior expected utility. It is evaluated as part of the same simulation described earlier to find L_m. For each simulated experiment, we record (J, ρ^o_i, v_i, d_i), i = 1, . . . , n, where ρ^o_i is the true simulated level of differential expression. The recorded simulations are then arranged by J to compute FN_m as described earlier. Arranging the same simulations by J and ρ (possibly on a grid), we estimate β(J, ρ), which can be summarized in plots like those given in Figure 2.
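The nested structure of the preposterior loop can be sketched as follows. Both helper functions are hypothetical placeholders: in practice, simulate_data would draw from the prior and sampling model, and posterior_probs would run the MCMC described below.

```python
import numpy as np

def preposterior_summaries(J, n_sim, simulate_data, posterior_probs,
                           alpha_N, n_grid=101):
    """Preposterior Monte Carlo sketch. The two helpers are hypothetical:
    simulate_data(J) should draw (omega, z, y_J) from the prior and the
    sampling model; posterior_probs(y_J) should return the marginal
    posterior probabilities v_i = P(z_i = 1 | y_J), in practice by MCMC."""
    t_grid = np.linspace(0.0, 1.0, n_grid)
    fn_draws, fd_draws = [], []
    for _ in range(n_sim):
        omega, z, y_J = simulate_data(J)
        v = posterior_probs(y_J)
        # smallest cutoff t on the grid whose rejection region keeps FD <= alpha_N
        fd_t = np.array([np.sum((v > t) * (1.0 - v)) for t in t_grid])
        ok = np.flatnonzero(fd_t <= alpha_N)
        t_star = t_grid[ok[0]] if ok.size else 1.0
        d = (v > t_star).astype(int)
        fn_draws.append(np.sum((1 - d) * v))   # posterior expected FN
        fd_draws.append(np.sum(d * (1 - v)))   # posterior expected FD
    return float(np.mean(fn_draws)), float(np.mean(fd_draws))
```

Averaging the recorded FN and FD draws over the simulated datasets gives the Monte Carlo estimates of FN_m(J) and FD_m(J) in (4).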
Implementation is facilitated by several simplifications that increase the computational efficiency. First, we use common random numbers across J , in the following sense. We consider sample sizes on the interval J 0 ≤ J ≤ J 1 . We start by generating one large sample y J 1 , and use appropriate subsamples y J ⊂ y J 1 to compute FN m (J ), FD m (J ), FNR m (J ), and FDR m (J ) for J over a grid J 0 ≤ J ≤ J 1 . Using the common underlying data reduces sampling variation across J .
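The common-random-numbers device amounts to simulating one dataset of the maximum size and reusing nested column subsets; a small sketch with illustrative names and shapes:

```python
import numpy as np

# Common random numbers across sample sizes: simulate one large dataset
# with J1 replicates (columns) per gene, then reuse nested subsets
# y_J = y_J1[:, :J] for every smaller J. Shapes and the normal stand-in
# for the simulated expression data are illustrative assumptions.
rng = np.random.default_rng(1)
n, J0, J1 = 1000, 5, 20
y_J1 = rng.normal(size=(n, J1))          # stand-in for the full simulated data
subsamples = {J: y_J1[:, :J] for J in range(J0, J1 + 1)}
```

Because every y_J is a prefix of the same y_J1, the Monte Carlo noise in the summaries varies smoothly across J.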
Another simplification arises in the setup of the posterior simulations required to evaluate the posterior expected FN(t, y_J) and FD(t, y_J). Both require posterior simulation ω ∼ p(ω | y_J) by MCMC. In the context of the preposterior simulation, we can start the MCMC at the true parameter values ω^o used to simulate the data y_J. Details are explained in the step-by-step algorithm given later in this section.
Finally, when computing L_m(J), we borrow strength across different sample sizes. Instead of separately averaging, for each J, the computed values L(t*, y_J) for that J, we proceed as follows. Consider a scatterplot of all pairs (J, L(t*, y_J)). We fit a smooth curve L_m(J) through all points of the scatterplot. This formalizes borrowing strength across different sample sizes J, exploiting the fact that L_m(J) is smooth in J. In fact, we recommend constraining the smooth fit L_m to be monotone decreasing and to follow the √(log J/J) asymptotics. We used a least squares fit of a linear regression of the observed FN(t*, y_J) values on √(log J/J). For comparison, we fit a smoothing spline without any such constraints. The spline fit is practically indistinguishable from the simple regression, validating use of the asymptotic law for the curve fitting (see Sec. 5).
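The recommended regression on the asymptotic law is a one-line least squares fit; a minimal sketch with hypothetical inputs:

```python
import numpy as np

def fit_fn_curve(J_vals, fn_vals):
    """Least squares fit of the Monte Carlo FN estimates to the
    asymptotic law FN_m(J) ~ b0 + b1 * sqrt(log(J) / J), as recommended
    for smoothing across sample sizes. Inputs are the simulated sample
    sizes and the corresponding FN(t*, y_J) values."""
    x = np.sqrt(np.log(J_vals) / J_vals)
    b1, b0 = np.polyfit(x, fn_vals, 1)   # slope, intercept
    return lambda J: b0 + b1 * np.sqrt(np.log(J) / J)
```

The fitted curve can then be evaluated on a fine grid of J to read off the sample size that reaches a target marginal expected loss.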

THE PROBABILITY MODEL
Our approach to sample size selection assumes an encompassing probability model that specifies a joint distribution across comparisons and across repeated experiments. In general, the model should be sufficiently structured and detailed to reflect the prior expected levels of noise, a reasonable subjective judgment about the likely numbers of differentially expressed genes, and some assumption about dependencies, if relevant. It should also easily incorporate prior data when available.
The design argument can be developed with a simplified model, ignoring all details of the data-cleaning process, including the spatial dependence of measurement errors across the microarray, correction for misalignments, and so on. Although such detail is critical for the analysis of the observed microarray data, it is an unnecessary burden for the design stage. The variability resulting from preprocessing and normalization can be subsumed as an aggregate in the prior description of the expected noise. In what follows we thus assume that the data are appropriately standardized and normalized and that the noise distribution implicitly includes the consideration of those processes (see, e.g., Tseng, Oh, Rohlin, Liao, and Wong 2001; Baggerly et al. 2001, for a discussion of the process of normalization).
For the implementation in the example we choose a variation of the model introduced by Newton, Kendziorski, Richmond, Blattner, and Tsui (2001) and Newton and Kendziorski (2003). We focus on the comparison of two conditions and assume that data will be available as arrays of appropriately normalized intensity measurements X_ij and Y_ij for gene i, i = 1, . . . , n, and experiment j, j = 1, . . . , J, with X and Y denoting the intensities in the two conditions. Newton et al. (2001) proposed a hierarchical gamma/gamma model. The model starts by assuming that the observed intensities are sampled from gamma distributions, with a conjugate gamma prior on the scale parameters. The model includes a positive prior probability mass for matching the means of the gamma distribution for the same gene under the two conditions of interest. For the purpose of the sample size design, we extend the model to multiple experiments, j = 1, . . . , J. We assume a gamma sampling distribution for the observed intensities X_ij, Y_ij for gene i in sample j,

X_ij ∼ Ga(a, θ_0i) and Y_ij ∼ Ga(a, θ_1i).   (5)

The scale parameters are gene-specific random effects (θ_0i, θ_1i), with positive prior probability for a tie, P(θ_0i = θ_1i) = 1 − p. Conditional on latent indicators z_i for differential gene expression, z_i = I(θ_0i ≠ θ_1i), we assume conjugate gamma random-effects distributions,

θ_0i ∼ Ga(a_0, ν) and, if z_i = 1, θ_1i ∼ Ga(a_0, ν) independently; if z_i = 0, then θ_1i = θ_0i.   (6)

The model is completed with a prior p(η) for the parameters η = (a, a_0, ν, p). In the implementation for the example in Section 5, we fix ν. We assume a priori independence and use marginal gamma priors for a_0 and a and a conjugate beta prior for p. As in the approach of Newton et al. (2001), the foregoing model leads to closed-form marginal likelihoods p(X_i, Y_i | z_i = 0, η), p(X_i, Y_i | z_i = 1, η), and p(X_i, Y_i | η) after integrating out θ_1i, θ_0i but still conditional on η. This greatly simplifies the posterior simulation. We add two more generalizations to the model.
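Simulating from the basic gamma/gamma hierarchy above is a few lines; a sketch under the assumption that Ga(shape, rate) denotes the rate parameterization (NumPy's gamma takes shape and scale, hence the 1/rate):

```python
import numpy as np

def simulate_gamma_gamma(n, J, a, a0, nu, p, seed=0):
    """Simulate one dataset from the basic hierarchical gamma/gamma model:
      z_i   ~ Bernoulli(p)                  indicator of differential expression
      th0_i ~ Ga(a0, nu); th1_i = th0_i if z_i = 0, else an independent Ga(a0, nu)
      X_ij  ~ Ga(a, th0_i),  Y_ij ~ Ga(a, th1_i),  j = 1, ..., J."""
    rng = np.random.default_rng(seed)
    z = rng.binomial(1, p, size=n)
    theta0 = rng.gamma(a0, 1.0 / nu, size=n)
    theta1 = np.where(z == 1, rng.gamma(a0, 1.0 / nu, size=n), theta0)
    X = rng.gamma(a, 1.0 / theta0[:, None], size=(n, J))
    Y = rng.gamma(a, 1.0 / theta1[:, None], size=(n, J))
    return z, theta0, theta1, X, Y
```

Draws like these serve as the prior simulation step (ω^o, z^o, y_J) of the preposterior algorithm in Section 3.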
First, we want to modify the model to allow the use of a pilot dataset to learn about the sampling distribution of the observed gene expressions across genes and repeated samples. We envision a system wherein the investigator collects some pilot data (on control tissue) before going through the sample size argument. These pilot data can then be used to learn about the important features of the sampling distribution. If the observed pilot data can be adequately fit by the marginal model p(X i |z i = 0) under the gamma/gamma hierarchical model, then the sample size design should proceed as before. But if the pilot data show evidence against the gamma/gamma model, then the system should estimate a model extension and proceed with the extended model. A convenient way to achieve the desired extension is a scale mixture extension of the basic model (5). In particular, we assume that X ij ∼ ∫ Ga(a, θ 0i r ij ) dp(r ij |w, m) and Y ij ∼ ∫ Ga(a, θ 1i s ij ) dp(s ij |w, m), (7) where p(r|w, m) is a discrete mixing measure with P (r = m k ) = w k (k = 1, . . . , K). Locations m = (m 1 , . . . , m K ) and weights w = (w 1 , . . . , w K ) parameterize the mixture. To center the mixture model at the basic model, we fix m 1 = 1.0 and assume a high prior probability for a large weight w 1 . We use the same mixture for s ij , P (s ij = m k ) = w k . The model is completed with m k ∼ Ga(b, b), k > 1, and a Dirichlet prior w ∼ Dir K (M · W, W, . . . , W ). Selecting a large factor M in the Dirichlet prior assigns a high prior probability for a large w 1 . By assuming a dominating term with m 1 = 1.0 and E(m 2 ) = · · · = E(m K ) = 1, we allocate a large prior probability for the basic model and maintain the interpretation of ρ i = θ 0i /θ 1i as the level of differential expression.
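The scale-mixture draw in (7) amounts to picking a discrete scale factor and then drawing the gamma observation. A small sketch, with mixture locations and weights chosen purely for illustration (in the method they are learned from pilot data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Discrete mixing measure p(r | w, m): P(r = m_k) = w_k, with m_1 = 1.0 fixed
# so that a large w_1 recovers the basic model. Values below are illustrative.
m = np.array([1.0, 0.4, 2.5])        # mixture locations, m_1 = 1.0
w = np.array([0.85, 0.07, 0.08])     # mixture weights, large w_1

def sample_scaled_intensity(a, theta, shape):
    """Draw X ~ int Ga(a, theta * r) dp(r | w, m): first pick r from the
    discrete mixture, then draw a gamma variate with rate theta * r."""
    r = m[rng.choice(len(m), p=w, size=shape)]
    x = rng.gamma(a, 1.0 / (theta * r), size=shape)
    return x, r

x, r = sample_scaled_intensity(a=0.8, theta=1.3, shape=(4, 6))
```

The same construction, with slide-specific factors g j, gives the second mixture introduced below for outlier arrays.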
A concern related to microarray data experiments prompts us to introduce a further generalization to allow for the occasional presence of slides that are outliers compared with the other arrays in the experiment. This occurs for reasons unrelated to the biologic effect of interest, but needs to be accounted for in the modeling nevertheless. We achieve this by adding a second mixture to (7), with an additional slide-specific scale factor g j . Paralleling the definition of p(r ij |w, m), we use a finite discrete mixture P (g j = m gk ) = w gk , k = 1, . . . , L, with a Dirichlet prior (w g1 , . . . , w gL ) ∼ Dir L (M g · W g , W g , . . . , W g ), m g1 = 1, and m gk ∼ Ga(b g , b g ) for k = 2, . . . , L.
An important feature of the proposed mixture model is its computational simplicity. We proceed in two stages. First, we use the pilot data to fit the mixture model. Let X o ij , j = 1, . . ., J o , denote the pilot data. We use posterior MCMC simulation to estimate the posterior mean model. We do this once, before starting the optimal design. Posterior simulation in mixture models like (8) is a standard problem. We include reversible-jump moves to allow for random size mixtures (Green 1995).
We then fix the mixture model at the posterior modes K and L and the posterior means (w, m, w g , m g ) = E(w, m, w g , m g |X o , K, L). We proceed with the optimal sample size approach, using model (8) with the fixed mixtures. We conduct the procedure, including all posterior and marginal simulation, exactly as before, with only one modification. We add a step to impute r ij , s ij , and g j . Conditional on (r ij , s ij , g j ), we replace X ij by X ij /(r ij g j ) and Y ij by Y ij /(s ij g j ). Everything else remains unchanged. Updating the mixture variables r ij , s ij , and g j is straightforward. The following algorithm (Algorithm 2) summarizes the proposed approach with the pilot data. 1. Pilot data: Use posterior MCMC simulation to fit the mixture model (8) to the pilot data X o. 2. Fix the mixture: Fix the mixture model at the posterior modes K and L and the posterior means of (w, m, w g , m g ). 3. Optimal sample size: Proceed as in Algorithm 1, replacing X ij with X ij /(r ij g j ) and Y ij by Y ij /(s ij g j ), and adding an additional step in the posterior MCMC to update the mixture indicators r ij , s ij , and g j (step 1.2 in Algorithm 1).
The indicators are initialized with the (true) values from the data simulation. The mixture model parameters remain fixed at (w,m,w g ,m g , K, L ).
Rescaling with the iteratively updated latent scale factors r ij g j and s ij g j formalizes the use of the pilot data to inform the sample size selection by changing the prior simulation (as in Algorithm 1) to the preposterior simulation, conditional on the pilot data.
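The rescaling in step 3 of Algorithm 2 is a simple elementwise operation; a minimal sketch, with array shapes as in the text (X and r of dimension n × J, g of length J):

```python
import numpy as np

def rescale(X, r, g):
    """Algorithm 2, step 3: conditional on the imputed scale factors,
    replace X_ij by X_ij / (r_ij * g_j), so that the remaining sampling
    steps can proceed under the basic gamma/gamma model.
    X, r: arrays of shape (n, J); g: array of length J."""
    return X / (r * g[None, :])
```

The same call with (Y, s, g) handles the second condition; the division undoes the latent scale factors so the conditional model for the rescaled data is again (5).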

EXAMPLE
We analyze the data reported by Richmond, Glasner, Mau, Jin, and Blattner (1999). These data were also used as an illustration by Newton et al. (2001). We use the control data to plan for a hypothetical future experiment.

Implementation
We proceed as proposed in Algorithm 2. First, we estimate the mixture model (8), using the available control data as a pilot dataset X o . Estimation of (8) is implemented as an MCMC posterior simulation with reversible-jump moves. We use split-merge moves (Richardson and Green 1997) for both mixtures defined in (8). Recall that the mixtures are defined with respect to the discrete mixing measures p(r ij |w, m) and p(g j |w g , m g ). The third mixture, with respect to s ij , does not appear in the model because the pilot data include only the control data. We find the posterior mode for the size of the mixture models at K = 3 and L = 2.
To define the probability model for the design calculations, we fix K = 3 and L = 2 and set the mixture model parameters (m, w, m g , w g ) at their posterior means (conditional on the fixed size of the mixture). Maintaining the randomness of the mixture parameters in the design model would not significantly complicate the procedure, but it also would not contribute much to the final inference. Implementing Algorithm 1, we compute the expected losses and power β(J, ρ) across a grid of sample sizes J .
Algorithm 1 is implemented as three nested loops. The outer loop is simply a repeated simulation from model (8), with fixed priors for the scale factors r ij , s ij , and g j . We start by generating the hyperparameters η = (a, a 0 , p) for the prior model, given in (5)-(8) (step 1.1 of Algorithm 1). Let Ga(α, β) denote a gamma distribution with a shape parameter α and a mean of α/β. We use a Ga(2, 2) prior for a, a Ga(12, 1) prior for a 0 , and a Be(1, 10) beta prior for p. We include the additional prior constraints a < 1, a 0 < 1, and .01 < p < .15. Next, we generate indicators z i and random effects (θ 0i , θ 1i ), i = 1, . . . , n. We conclude simulation for the outer loop by simulating hypothetical data (X ij , Y ij ), i = 1, . . ., n, j = 1, . . . , J 1 (step 1.2).
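A sketch of the hyperparameter draw in step 1.1 under the stated priors and constraints. The constraints on a and p leave substantial prior mass and can be handled by simple rejection; the constraint a 0 < 1 cuts the Ga(12, 1) prior down to a tiny tail, so we draw a 0 by inverting the regularized incomplete gamma function instead.

```python
import numpy as np
from scipy.special import gammainc, gammaincinv

rng = np.random.default_rng(3)

def draw_hyperparameters():
    """Step 1.1 sketch: draw eta = (a, a0, p) from the Ga(2, 2), Ga(12, 1),
    and Be(1, 10) priors, truncated to a < 1, a0 < 1, and .01 < p < .15.
    Ga(alpha, beta) has mean alpha/beta, so numpy's scale is 1/beta."""
    while True:                              # P(Ga(2, 2) < 1) is sizable: rejection is fine
        a = rng.gamma(2.0, 0.5)
        if a < 1.0:
            break
    # inverse-CDF draw from Ga(12, 1) truncated to (0, 1); rate 1, so no rescaling
    u = rng.uniform(0.0, gammainc(12.0, 1.0))
    a0 = gammaincinv(12.0, u)
    while True:
        p = rng.beta(1.0, 10.0)
        if 0.01 < p < 0.15:
            break
    return a, a0, p

a, a0, p = draw_hyperparameters()
```

This is only a prior-simulation sketch; in Algorithm 1 the draw is followed by simulating z i , (θ 0i , θ 1i ), and the hypothetical data.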
We then proceed with the second loop, nested within the first, by iterating over J = J 1 , . . . , J 0 (step 1.3). For each J , we generate a posterior Monte Carlo sample from p(ω|y J ). This is achieved by the third nested loop (step 1.3.1), which implements a posterior MCMC simulation. Posterior simulation is initialized with the known true parameter values (saved from step 1.1). In each iteration of the MCMC simulation, we update r ij , s ij , g j , and the hyperparameters (a, a 0 , p). The first three steps are draws from the multinomial complete conditional posterior for the respective indicators. The last three steps are implemented as random-walk Metropolis-Hastings steps to update the hyperparameters. The random-walk proposals are generated from a truncated univariate normal centered at the current values of the respective parameter, with normal standard deviations of .05, .1, and .05 for a, a 0 , and p. Implementation of the MCMC is greatly simplified by noting that p(ω|y J ) can be analytically marginalized with respect to the random effects θ 0i , θ 1i , and z i . (See Newton et al. 2001 for a statement of the marginal likelihood.) At the end of each sweep of the posterior MCMC, we compute the posterior probabilities of the differential expression p(z i = 1|η, r, g, s, y J ).
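The random-walk Metropolis-Hastings update with a truncated normal proposal can be sketched as follows. Note that truncating the proposal at the current value's bounds makes it asymmetric, so the acceptance ratio must carry the ratio of truncation constants; the target below (a standard normal restricted to (0, 1)) is a toy stand-in for the actual conditional posteriors of a, a 0 , and p.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mh_step(x, log_target, sd, lo, hi):
    """One random-walk MH update with a normal proposal truncated to (lo, hi).
    The truncated proposal q(x -> y) has normalization
    Z(x) = Phi((hi - x)/sd) - Phi((lo - x)/sd); the normal kernels cancel,
    leaving the factor Z(x)/Z(y) in the acceptance ratio."""
    Z = lambda c: Phi((hi - c) / sd) - Phi((lo - c) / sd)
    while True:                      # rejection gives an exact truncated-normal draw
        y = rng.normal(x, sd)
        if lo < y < hi:
            break
    log_alpha = log_target(y) - log_target(x) + math.log(Z(x)) - math.log(Z(y))
    return y if np.log(rng.uniform()) < log_alpha else x

# toy check: sample a standard normal truncated to (0, 1)
chain = [0.5]
for _ in range(5000):
    chain.append(mh_step(chain[-1], lambda t: -0.5 * t * t, 0.3, 0.0, 1.0))
```

With the standard deviations .05, .1, and .05 from the text and the log marginal likelihood as `log_target`, the same function updates each hyperparameter in turn.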
On completion of the innermost loop, we use ergodic averages of the conditional probabilities p(z i = 1|η, r, g, s, y J ) to approximate v i = P (z i = 1|y J ) (step 1.3.2). Using the marginal posterior probabilities v i , we then evaluate the posterior false-discovery and false-negative counts and rates and corresponding decisions d i and record them for later use. We also record the triples (J, ρ o i , d i ) (step 1.3.3). On completion of the outer loop, we summarize the observed FDR, FD, FNR, FN, and the triples (J, ρ o i , d i ). For example, the sample average over the simulated values of FDR under a given sample size J provides a Monte Carlo estimate of the preposterior expected FDR m (J ). The fraction of d i = 1 under a given sample size and the true effect ρ o i = ρ provides a Monte Carlo estimate for β (J, ρ). To ensure sufficient Monte Carlo sample size, the latter is done for a grid on ρ.
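The Monte Carlo estimate of β(J, ρ) from the recorded triples is a grouped average. A small sketch, with a hypothetical `records` list standing in for the triples (J, ρ o i , d i ) saved in step 1.3.3:

```python
import numpy as np
from collections import defaultdict

def estimate_power(records, rho_grid):
    """records: iterable of triples (J, rho_true, d).  Returns a dict
    {(J, k): beta_hat}, where k indexes the cell of rho_grid containing
    rho_true and beta_hat is the fraction of d = 1 in that cell --
    the Monte Carlo estimate of beta(J, rho)."""
    hits = defaultdict(lambda: [0, 0])       # (sum of d, count) per (J, cell)
    for J, rho, d in records:
        k = int(np.digitize(rho, rho_grid))
        hits[(J, k)][0] += d
        hits[(J, k)][1] += 1
    return {key: s / n for key, (s, n) in hits.items()}
```

Averaging the recorded FDR values within each J in the same way gives the estimate of FDR m (J).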

Results
Recall that FN m (J ) and FNR m (J ) denote the preposterior expectations of FN and FNR. We use analogous notations D m (J ) and t * m (J ) to denote the preposterior expectations of the number of discoveries D and the threshold t * , computed under the loss L and sample size J , defined analogously to FNR m and FN m . We computed all inference in one run of Algorithm 1, collecting the necessary Monte Carlo averages for all summaries. Figure 1 shows the expected loss L m (J ) = FN m (J ) and other summaries under the loss function L = L 2N . We set the threshold for FD as α N = 7.1. We chose this to match a bound FDR ≤ α R for α R = 40%. We compute the value as α N = .1npα R /(1 − α R ), under the assumption that 10% of the true differential expressions are discovered. Under L 2N , the false-discovery count FD, and thus also the preposterior expectation FD m , is fixed by definition at FD = α N . To maintain the fixed FD, the procedure must eventually start lowering the threshold t * to reject comparisons with increasingly lower posterior probabilities of differential expression. The estimated curves FNR m (J ) and FN m (J ) are derived by fitting a linear model with predictor √(log(J + 1)/(J + 1)) to the observed pairs (J, FNR(t * N , y J )). This is motivated by the asymptotic results of Theorem 2 (with the offset +1 to avoid a singularity at J = 0). For comparison, we estimate the same curve with a cubic smoothing spline, using the smooth.spline function in R with default parameters. The corresponding curves for FNR m and FN m are shown as dashed lines. For FDR m , FD m , D m , and t * m , we use cubic smoothing splines to estimate the mean value as a function of J .
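The matching of the FD bound α N to an FDR bound α R follows from FD ≤ α R D and D = TD + FD, which give FD ≤ TD · α R /(1 − α R ), with the number of true discoveries approximated as TD ≈ .1np. A one-line sketch of the computation (the n and p values in the call are illustrative, not those of the example):

```python
def alpha_N(n, p, alpha_R, discovery_frac=0.1):
    """FD bound matched to an FDR bound:
    alpha_N = discovery_frac * n * p * alpha_R / (1 - alpha_R),
    assuming a fraction discovery_frac of the n*p truly differentially
    expressed genes are discovered."""
    return discovery_frac * n * p * alpha_R / (1.0 - alpha_R)

bound = alpha_N(n=1000, p=0.1, alpha_R=0.4)
```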
The relatively flat nature of FN m and FD m does not enable a conclusive sample size choice. We propose considering additional power curves, as defined in (3). Figure 2 shows β(J, ρ) as a function of ρ and J . Figure 2(a) plots the power against the assumed true level of differential expression ρ, with a separate curve for each sample size J . Figure 2(b) plots the same summary against the sample size J , arranged by the level of differential expression ρ. In practice, a sample size argument would then proceed as follows. First, the investigator determines a minimum level of differential expression that would be considered biologically meaningful, say two-fold expression at ρ = log 2. Using a pilot dataset, we proceed with Algorithm 2 to compute the expected FNR, FN, and power across the sample sizes. Inspection of the power plot for the level ρ of interest, together with the FNR and FN plots, informs the investigator about the minimum sample size needed to achieve the desired power and/or error rates.

Alternative Loss Functions
Although the general nature of the loss function as trading off false positives and false negatives is natural, the specific form of combining them is less clear. A strength of the proposed approach is that it allows us to evaluate, with minimal computational effort, alternative loss functions that combine the basic summaries FN, FD, FDR, and FNR in different ways. We discuss the results for three alternative loss functions, L 2R , L N , and L R (introduced in Sec. 2.1).
We have already established (in Thm. 1) that there is a common optimal terminal decision rule under all four loss functions. This allows us to easily adapt Algorithm 1 for all four loss functions. The only required change is in step 1.3.2. For L 2R , L N , and L R , different definitions of t * are required. The rest of the algorithm proceeds unchanged. It is possible to use a single implementation of Algorithm 1, recording J, t * , FN, FD, FNR, and FDR for all four loss functions. Appropriate summaries of the saved Monte Carlo samples allow us to produce summaries such as those shown in Figure 1 for all four loss functions, based on a single run of Algorithm 1.
An important implication of the different strategies for choosing the cutoff is the nature of FDR as a function of the sample size J . Under L 2R it remains, by design, constant over J . This has awkward implications. Imagine the asymptotic case with a large sample size, when the true z i are practically known. To achieve the desired FDR, we have to knowingly flag some genes as differentially expressed even when v i ≈ 0. By the same argument, the loss L 2N leads to similar asymptotics. However, fixing the count FD instead of the rate FDR slows the awkward decrease in the threshold that is required under L 2R to maintain the fixed FDR. The number of discoveries still increases, but starts at a higher level and avoids the steep increase seen under L 2R . In contrast, under L N the cutoff t is fixed across sample sizes, leading to a vanishing FDR in the limit as J → ∞, due to posterior consistency. But these problems might be of only asymptotic relevance and of no concern for moderate sample sizes. Apart from these concerns, all three loss functions L 2R , L N , and L 2N are very similar in terms of their properties, the nature of the inference, and implementation details.
Inference under L R leads to different behaviors among the various summaries. In contrast with the other three loss functions, the optimal decision under L R does not constrain an error, the error rate, or the cutoff. Considering plots similar to those in Figure 1, we find that at an intermediate sample size, the threshold t * moves swiftly from an initial value of t * ≈ 0 to the other extreme of t * ≈ 1. This unintuitive behavior confirms our initial reservations against L R for including penalties for a false discovery and false negative that depend on the total number of discoveries or negatives.
In summary, inference under the four loss functions differs in how the competing goals of reducing false positives and false negatives are balanced. The loss functions L 2R , L 2N , and L N define the trade-off implicitly by fixing FDR, FD, and t * . Under L R , the trade-off is explicitly included as a coefficient in the loss function. The constraint on FDR under L 2R has the awkward implication that with an increasing sample size, we must knowingly include an increasing number of false positives in the rejection region to maintain the set false-positive rate. The loss function L R induces counterintuitive jumps in FDR and t * , which leads us to favor L 2N and L N . Both lead to very similar inference, with L 2N having the advantage of a constraint on the practically more important FD, rather than t * , as in L N .

CONCLUSION
The design of microarray experiments for measuring gene expression is a critical aspect of genomic analyses in biology and medicine. Microarrays are costly, and difficult trade-offs must be evaluated when allocating resources to alternative investigations. Even in the simplest two-sample comparison setting, microarray analyses pose difficult challenges to traditional sample size approaches. First, in terms of hypothesis testing, they present a multitude of heterogeneous alternatives; second, they are generally performed with goals that are best captured by properties of the ensemble of the choices made; and third, they mandate the incorporation of existing knowledge, because signal-to-noise ratios vary significantly with the specific technology, the source of RNA, and the overall experience of the laboratory's personnel.
Our goal in this article has been to develop a formal decision-theoretic framework to address these challenges. This gives investigators the opportunity to quantify both the a priori uncertainty about the likely expression levels and the implications of sample size choices on the performance of inference about differential expression. The consequences of decisions are captured by loss functions related to genome-wise error rates. We argue for using posterior expected error rates for the terminal decision about the multiple comparisons, and marginal expected error rates for the design decision about the sample size, consistently with a Bayesian sequential approach. Similar issues recur in other high-dimensional multiple-comparison problems and in the detection of faint signal levels in noisy data; the methods that we propose are applicable more generally to those problems as well.
In situations requiring complex decision making, decision models such as ours are best thought of as decision support tools. As is common in simpler settings, we envision investigators exploring various scenarios rather than simply eliciting input and blindly trusting the emerging sample size recommendation. A reasonable situation is also one in which an investigator has in mind a certain sample size that is feasible within given resource constraints. The proposed method informs the investigator about the effect sizes that she or he can realistically expect to discover with the proposed sample size, and also about the ensuing error rates.
An interesting application of the proposed method is in a sequential framework. An investigator could proceed in steps, starting with an initial batch of experiments and stopping when a satisfactory balance of classification error rates is achieved. This could be implemented without preposterior calculations. Because genome-wise error rates refer to the ensemble of genes, an investigator could not sample to a foregone conclusion about any individual genes by using this stopping rule.
In our model we assume that genes are from a discrete mixture in which some genes are altered across the two samples and others are completely unaltered. This assumption is realistic in tightly controlled experiments, but less so in the comparison of RNA samples across organs or across organisms. These broader comparisons are often made to produce exploratory analyses, such as clusters. The choice of sample sizes in these circumstances is different from that used in controlled experiments. Some insight into this issue has been offered by Simon et al. (2002) and Bryan and van der Laan (2001).
An important practical indication for microarray design arises from the illustration described in Section 5. In particular, for a realistic set of parameters and pilot data, we have shown that the improvement in the genome-wide error rate appears to be nonconcave, with a small initial plateau at very small sample size. In some cases the payoff of increasing the sample size from, say, two to three appears to be negligible. This has implications for the common practice of planning experiments with only two or three replicates. We suggest that an analysis of the kind presented in Figure 1 can provide valuable information to investigators entertaining experiments with a very small number of replicates.

APPENDIX A: OPTIMAL TERMINAL DECISION
We prove Theorem 1. We start by considering L N , subject to a fixed total number of discoveries D. We find that L N (d, y|D) = cD − (c + 1) Σ i d i v i + Σ i v i . The last term does not involve the decision. For fixed D, the rest is minimized by setting d i = 1 for the D largest v i . In other words, for any D, the optimal rule is of the type d i = I (v i > t), where t is simply the (n − D)th order statistic of {v 1 , . . . , v n }. Thus we conclude that the global minimum must be of the same form, and it remains only to find the globally optimal t. Straightforward algebra shows that the global minimum is achieved for t * = c/(c + 1).
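The thresholding result can be checked numerically: for small n, an exhaustive search over all 2^n decision vectors recovers the rule d i = I(v i > c/(c + 1)). A sketch (v values arbitrary):

```python
import itertools
import numpy as np

def expected_loss(d, v, c):
    """Posterior expectation of L_N(d, z) = c*FD + FN, which expands to
    sum_i [ c*d_i*(1 - v_i) + (1 - d_i)*v_i ]
      = c*D - (c + 1) * sum_i d_i*v_i + sum_i v_i."""
    d = np.asarray(d, dtype=float)
    return float(np.sum(c * d * (1.0 - v) + (1.0 - d) * v))

def brute_force_optimum(v, c):
    """Search all 2^n decision vectors for the minimizer of E[L_N | y]."""
    best = min(itertools.product([0, 1], repeat=len(v)),
               key=lambda d: expected_loss(d, v, c))
    return np.array(best)

# the optimum coincides with thresholding at t* = c/(c+1)
d_star = brute_force_optimum(np.array([0.2, 0.7, 0.9]), c=1.0)
```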
A similar argument holds under L R (d, y). We find that L R (d, y|D) = C 1 (D) − C 2 (D) Σ i>n−D v (i) + C 3 (D) Σ i v i , with C 1 (D) = cD/(D + ε), C 2 (D) = c/(D + ε) + 1/(n − D + ε), C 3 (D) = 1/(n − D + ε), and v (i) the ith order statistic of the v i . The global optimum is found by minimizing L R (d, y|D) with respect to D to find the optimal D = D * . Thus the optimal decision is d i = I (v i > t) with t * R ( y) ≡ v (n−D * ) . Under L 2N and L 2R , we need an additional argument. To minimize FNR subject to FDR ≤ α, we write the Lagrangian function f λ (d) = FNR(d, y) + λ(FDR(d, y) − α). Using Lagrangian relaxation (Fisher 1985), we find a weight λ * ≥ 0 such that the minimization of f λ * (d) provides an approximate solution to the original constrained optimization problem. (The solution is only approximate because of the discrete nature of the decision space.) But f λ * = L R with c = λ * . Thus the solution must have the same form as described earlier. The only difference is that the implied coefficient c itself is a complicated function of the data. Knowing the structure of the solution, we can solve the decision problem by finding the cutoff t 2R ( y) = min{s : FDR(s, y) ≤ α}. A similar argument holds for L 2N , with t 2N ( y) = min{s : FD(s, y) ≤ α}. Note that the optimal cutoff t * in all three new loss functions is now a function of the data. We write t * L ( y J ) for the optimal cutoff under loss L given data y J .
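The cutoff t 2R ( y) = min{s : FDR(s, y) ≤ α} is easy to compute from the marginal probabilities v i , because the posterior expected FDR of rejecting the D largest v i is the mean of (1 − v i ) over the rejection set. A sketch:

```python
import numpy as np

def fdr_threshold(v, alpha):
    """t_2R(y) = min{ s : FDR(s, y) <= alpha }.  The posterior expected FDR
    of rejecting the D comparisons with the largest v_i is the mean of
    (1 - v_i) over that set, which is nondecreasing in D; so take the
    largest D meeting the bound and return the corresponding cutoff."""
    v_sorted = np.sort(v)[::-1]                       # largest first
    fdr = np.cumsum(1.0 - v_sorted) / np.arange(1, len(v) + 1)
    ok = np.flatnonzero(fdr <= alpha)
    if ok.size == 0:
        return 1.0                                    # no rejection meets the bound
    return v_sorted[ok[-1]]                           # reject all v_i >= this value
```

Replacing the running mean by the running sum of (1 − v i ) gives the analogous cutoff t 2N ( y) for the constraint FD ≤ α.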

APPENDIX B: ASYMPTOTIC FALSE-NEGATIVE RATE
We now prove Theorem 2, assuming a model with the same structure as in Section 4. The specific distributional assumptions, including the gamma sampling distribution for (X ij , Y ij ) and the gamma prior for (θ 0i , θ 1i ), are not critical. We start the argument by establishing an asymptotic approximation for P (z i = 1|y J ). We then use this result to argue that for large J , the rejection region necessarily has to include some genes with zero or a small level of true differential expression. This is true under all three loss functions, L 2R , L 2N , and L N . Thus the nonrejection region includes only small levels of true differential expression. We exploit this fact to approximate FNR by an integral that can be recognized as an expression of the order of √(log J/J ). The integral approximation is valid only if a large number of genes are in the nonrejection region, allowing us to approximate the sum in the definition of FNR by an integral. We conclude the argument by showing that this is the case under all three loss functions, for sufficiently large J .
We start with an asymptotic result for the posterior probability of differential expression. Let η = (a, a 0 , p) denote the hyperparameters, and let y i = {X ij , Y ij , j = 1, . . . , J } denote the data for gene i. Because the number n of genes is very large, we have, for each gene, P (z i = 1|y) = ∫ P (z i = 1|y i , η) dp(η|y) = P (z i = 1|y i , η̂)(1 + O P (n −1/2 )) = P (z i = 1|y i , η)(1 + O P (n −1/2 )), (B.1) where η̂ is the maximum likelihood estimator and η are the true hyperparameters. Here X n = O P (n k ), for a sequence of random variables X n , is defined as lim M→∞ lim sup n P [|X n |/n k > M] = 0.
Classical Laplace expansions imply that P (z i = 1|y i , η) = 1/(1 + c i √J exp{−J (θ 0i − θ 1i ) 2 τ i /2}), (B.2) with c i , τ i = O P (1) as J goes to infinity. The constant c i includes the ratio (1 − p)/p. Under suitable regularity conditions, this result is uniform in (θ 0i , θ 1i , η) over compact sets. In the noncompact case, some conditions on the tails of the priors need to be added (see, e.g., Guihenneuc and Rousseau 2002). Therefore, when |θ 0i − θ 1i | is large, P (z i = 1|y i , η) goes to 1 at an exponential rate, and thus P (z i = 1|y i ) is very close to 1 (the error being essentially of the order n −1 ). We now use (B.2) to study the asymptotic behavior of the terminal decision. In particular, we consider FDR, FD, FNR, and FN. Let v (1) ≤ · · · ≤ v (n) be the ordered posterior probabilities v i = P (z i = 1|y), and let D(t, y) = Σ i I (v i > t) denote the number of discoveries. We use N = Σ i I (v i < t), FP = Σ i I (v i > t)I (z i = 0), n 1 = Σ i I (z i = 1), and n 0 = n − n 1 to denote the numbers of negatives, false positives, and differentially expressed and nondifferentially expressed genes. We use A N , A FP , A 1 , and A 0 to denote the corresponding sets of genes. The foregoing expansions show that the ordering of the v i is asymptotically linked to the ordering of |θ 0i − θ 1i |, with v i monotone increasing in |θ 0i − θ 1i | and, asymptotically, v i ≈ 1 − c i √J exp{−J (θ 0i − θ 1i ) 2 τ i /2}.
The false-discovery rate FDR(t, y) as a function of t is a step function taking values in {1 − v (n) , . . . , 1 − (v (k) + · · · + v (n) )/(n − k + 1), . . . , 1 − (v (1) + · · · + v (n) )/n}. Similarly, FD(t, y) is a step function with values {1 − v (n) , . . . , (1 − v (1) ) + · · · + (1 − v (n) )}. Both are monotone decreasing in t. For large J , the previous discussion shows that any gene with |θ 0i − θ 1i | ≥ C √(log J/J ) (B.3) has a posterior probability of v i ≥ 1 − 1/√J when C is sufficiently large, uniformly in θ 0i and θ 1i belonging to some compact set, and with large probability. We let S = {i : |θ 0i − θ 1i | < C √(log J/J )} denote the set of genes with small |θ 0i − θ 1i |, that is, those that violate (B.3). We now show that under all three losses, only genes with small |θ 0i − θ 1i | are classified as nondifferentially expressed, that is, A N ⊆ S. Under L N , the argument is straightforward. For all genes satisfying (B.3), the posterior probability v i ≈ 1 − 1/√J is beyond t N = c/(c + 1) for sufficiently large J . Thus all genes in A N satisfy |θ 0i − θ 1i | < C √(log J/J ). To prove the claim under L 2R , we show that the opposite would violate the constraint on FDR. Assume that (B.3) holds for all i ∈ A D . Then FDR = [(1 − v (n−D+1) ) + · · · + (1 − v (n) )]/D ≤ 1/√J , which is not enough to reach the set level α required for L 2R . Thus the rejection region A D necessarily must include some genes that violate (B.3). Together with the monotonicity of |θ 0i − θ 1i | as a function of v i , this proves the claim. Finally, to show the same for L 2N , consider (B.3) with an even larger C. If C 2 > (1/τ i )[log n − log(α/2)], then 1 − v (i) ≤ α/(2n) for all genes that satisfy (B.3) with such a C. If only such genes are considered in the rejection region, then (1 − v (k) ) + · · · + (1 − v (n) ) ≤ α/2, which is not enough to reach the desired bound FN = α under L 2N .