Maximizing a Family of Optimal Statistics over a Nuisance Parameter with Applications to Genetic Data Analysis

This article was downloaded by: [New York University] On: 20 October 2014, At: 23:43
Publisher: Taylor & Francis
Journal of Applied Statistics. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/cjas20
Gang Zheng, Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, USA
Published online: 02 Aug 2010.
To cite this article: Gang Zheng (2004) Maximizing a Family of Optimal Statistics over a Nuisance Parameter with Applications to Genetic Data Analysis, Journal of Applied Statistics, 31:6, 661-671, DOI: 10.1080/1478881042000214640
To link to this article: http://dx.doi.org/10.1080/1478881042000214640
Journal of Applied Statistics, Vol. 31, No. 6, 661–671, July 2004

GANG ZHENG
Office of Biostatistics Research, National Heart, Lung and Blood Institute, Bethesda, USA

ABSTRACT In this article, a simple algorithm is used to maximize a family of optimal statistics for hypothesis testing with a nuisance parameter not defined under the null hypothesis. This situation arises in genetic linkage and association studies and in other hypothesis testing problems. The maximum of the optimal statistics over the nuisance parameter space can be used as a robust test in this situation. Here, we use the maximum and minimum statistics to examine the sensitivity of testing results with respect to the unknown nuisance parameter. Examples from genetic linkage analysis using affected sib pairs and from a candidate-gene association study with a case-parents trio design are studied.

KEY WORDS: Genetic analysis, maximal statistics, nuisance parameter, robust test

Introduction

Consider hypothesis testing based on a parametric model $f(x; \lambda, \theta)$, where $\lambda$ is the parameter of interest and $\theta \in [L, U]$ is a nuisance parameter, where $L$ and $U$ are known.
We are interested in testing $H_0: \lambda = 0$ against $H_a: \lambda > 0$ (or $H_a: \lambda \neq 0$) using random samples from the distribution $f(x; \lambda, \theta)$ such that

$$f(x; 0, \theta) = f(x) \tag{1}$$

and

$$\frac{\partial}{\partial\lambda} \log L(\lambda, \theta)\Big|_{\lambda=0} = g(\theta, x) = a(x)\theta + b(x) \tag{2}$$

where $L(\lambda, \theta)$ is the likelihood function. Condition (1) describes hypothesis testing in which the nuisance parameter ($\theta$) is defined only under the alternative hypothesis (e.g. Davies, 1977), and condition (2) specifies that the score function, evaluated under the null, is a linear function of the nuisance parameter.

Correspondence Address: Gang Zheng, Office of Biostatistics Research, National Heart, Lung and Blood Institute, 6701 Rockledge Drive, MSC 7938, Bethesda, MD 20892, USA. Email: [email protected]
0266-4763 Print/1360-0532 Online/04/060661-11 © 2004 Taylor & Francis Ltd. DOI: 10.1080/1478881042000214640

From conditions (1) and (2), the standardized optimal score statistic for testing $H_0: \lambda = 0$ can be generally written as

$$Z_\theta = \frac{\partial \log L(\lambda, \theta)/\partial\lambda\,\big|_{H_0}}{\{-E[\partial^2 \log L(\lambda, \theta)/\partial\lambda^2]_{H_0}\}^{1/2}} = \frac{a(x)\theta + b(x)}{(c\theta^2 + d\theta + e)^{1/2}} \tag{3}$$

where $c$, $d$ and $e$ are constants, which can be obtained as follows. The denominator of equation (3) is the square root of

$$\mathrm{Var}_{H_0}\!\left(\frac{\partial \log L(\lambda, \theta)}{\partial\lambda}\right) = \mathrm{Var}_{H_0}(a(X)\theta + b(X)) = \mathrm{Var}_{H_0}(a(X))\,\theta^2 + 2\,\mathrm{Cov}_{H_0}(a(X), b(X))\,\theta + \mathrm{Var}_{H_0}(b(X)),$$

from which $c = \mathrm{Var}_{H_0}(a(X))$, $d = 2\,\mathrm{Cov}_{H_0}(a(X), b(X))$ and $e = \mathrm{Var}_{H_0}(b(X))$. There are other testing problems (see the next section) for which conditions (1) and (2) are not directly satisfied, but the optimal statistic can still be written as equation (3).

For a given $\theta \in [L, U]$, $Z_\theta$ asymptotically follows a standard normal distribution under the null hypothesis. Hence, when $\theta$ is known, $H_0: \lambda = 0$ is rejected in favour of $H_a: \lambda > 0$ when $Z_\theta > z_\alpha$, the upper $\alpha$ percentile of the standard normal distribution. Often in practice, $\theta$ is unknown, so $Z_\theta$ cannot be used.
Davies (1977) proposed using the maximum of $Z_\theta$, $\mathrm{MAX} = \max_{\theta \in [L,U]} Z_\theta$, for testing $H_0$, and gave an approximation for the critical value $c$ of the rejection region $\{\mathrm{MAX} > c\}$, as the exact asymptotic null distribution of MAX is not available. In general, a closed-form expression for MAX is not available, and MAX is usually obtained numerically. However, when the function $g(\theta, x)$ is specified by equation (2), the exact value of MAX is available. In this article, we give a simple algorithm to find the maximum of $Z_\theta$ given by equation (3) when $g(\theta, x)$ is specified by condition (2). We present an application of the results to a simple preliminary data analysis.

In the following section, two examples are given of optimal statistics that can be expressed as equation (3). The algorithm is given in the third section for both two-sided and one-sided alternatives. Applications are described in the section after that, with numerical examples presented in the final section.

Two Examples

Genetic Linkage Analysis Using Affected Sib Pairs

In genetic linkage analysis using affected sib pairs (e.g. Blackwelder & Elston, 1985; Faraway, 1993; Holmans, 1993), one tests linkage (genetic distance) between a genetic marker (a known location on a chromosome) and a disease locus (an unknown location on the same chromosome, which determines the disease under study). Under the null hypothesis of no linkage, the observed frequency of genes at the genetic marker and the disease locus transmitted together from parents to offspring should be close to the expected probability. Since the disease locus is latent, if such a linkage exists, one may observe that affected sib pairs (both sibs have the disease) share more identical alleles (alternative forms of a gene) of the genetic marker than the expected number. This is because, under the alternative hypothesis, certain genes of the genetic marker and the disease locus are more likely to be transmitted together from parents to offspring than expected.

Consider a genetic marker with only two alleles M and N. There are three possible genotypes (formed by a pair of alleles) at this marker: MM, MN and NN. Two sibs with genotypes MN and MN share zero alleles identical-by-descent (IBD) if their Ms and Ns are transmitted to them from different parents, or from the same parent but different chromosomes. They share one allele IBD if either M or N, but not both, is transmitted to them from the same chromosome of the same parent, and two alleles IBD if both are. So two sibs with genotypes MM and NN always share zero alleles IBD, and two sibs with genotypes MM and MN share either one or zero alleles IBD. For a formal definition of IBD, see Hartl & Clark (1997).

Suppose $n$ independent sib pairs who have the disease are sampled, and the number of alleles (0, 1, 2) shared IBD by each sib pair is determined. The following family of parametric models for the probabilities of IBD sharing was studied by Whittemore & Tu (1998),

$$\mathcal{F} = \left\{ p : p = \lambda(0, \theta, 1-\theta) + (1-\lambda)\left(\tfrac14, \tfrac12, \tfrac14\right);\ 0 \le \lambda \le 1,\ 0 \le \theta \le 1/2 \right\}$$

where $p$ is a mixture of two trinomial probabilities and $\theta$ is a nuisance parameter determined by the underlying genetic model, e.g. the rare recessive and additive diseases correspond to $\theta = 0$ and $\theta = 1/2$, respectively. Under the null hypothesis of no linkage, $H_0: \lambda = 0$, $p = (1/4, 1/2, 1/4)$ is the trinomial probability that sib pairs share (0, 1, 2) alleles IBD, which is independent of the nuisance parameter $\theta$. Under the alternative hypothesis $H_a: \lambda > 0$, the likelihood function based on the trinomial probability $p \in \mathcal{F}$ is proportional to

$$(1-\lambda)^{n_0} \left\{\lambda\theta + \tfrac12(1-\lambda)\right\}^{n_1} \left\{\lambda(1-\theta) + \tfrac14(1-\lambda)\right\}^{n_2} \tag{4}$$

where $n_i$, $i = 0, 1, 2$, is the observed number of sib pairs sharing $i$ alleles IBD and $n = n_0 + n_1 + n_2$. Thus, the score function, evaluated under $H_0$, can be written as $(2n_1 - 4n_2)\theta + (3n_2 - n_0 - n_1)$, a special case of condition (2).
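As a minimal sketch (not the author's code), the sib-pair score statistic can be evaluated as below. The function names are hypothetical, the counts in the usage lines are made up for illustration, and the null variance $n(6\theta^2 - 8\theta + 3)$ is obtained from the trinomial null probabilities (1/4, 1/2, 1/4).

```python
import math

def sib_pair_coefficients(n0, n1, n2):
    """Coefficients (a, b, c, d, e) of equation (3) for IBD counts (n0, n1, n2)."""
    n = n0 + n1 + n2
    a = 2 * n1 - 4 * n2      # slope of the score in theta
    b = 3 * n2 - n0 - n1     # intercept of the score
    # Null variance of the score is n * (6*theta^2 - 8*theta + 3).
    return a, b, 6 * n, -8 * n, 3 * n

def sib_pair_z(theta, n0, n1, n2):
    """Standardized score statistic Z_theta for 0 <= theta <= 1/2."""
    a, b, c, d, e = sib_pair_coefficients(n0, n1, n2)
    return (a * theta + b) / math.sqrt(c * theta**2 + d * theta + e)

# Hypothetical counts: 15, 45 and 40 pairs sharing 0, 1 and 2 alleles IBD.
for theta in (0.0, 0.25, 0.5):
    print(theta, round(sib_pair_z(theta, 15, 45, 40), 4))
```

At $\theta = 1/2$ the numerator reduces to $n_2 - n_0$, and at $\theta = 0$ it reduces to $4n_2 - n$, so the two ends of the nuisance parameter range weight the IBD counts quite differently.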
In fact, for any trinomial probability $(p_1, p_2, p_3)$ with $p_1$ and $p_2$ constrained to a triangle in the $(p_1, p_2)$ plane, if the null hypothesis $(p_1, p_2) = (p_{10}, p_{20})$ is at a vertex of the triangle and the alternative corresponds to any point in the triangle except for the null, then the testing problem can be reparameterized so that the score function has the form (2) (as for $\mathcal{F}$).

Case-parents Trio Design for Genetic Association

The second example comes from the case-parents trio design for testing association between a genetic marker and a disease (Spielman et al., 1993). In this design, an affected offspring and both parents (a trio) are sampled and their marker genotypes are obtained. For a marker with two alleles M and N, with one high-risk allele (M) and one normal allele (N), there are only six parental mating types: (i) MM×MM; (ii) MM×MN; (iii) MM×NN; (iv) MN×MN; (v) NN×MN; and (vi) NN×NN, where × indicates mating. Any offspring gets only one allele from each parent. For example, offspring of mating type (i) can only have genotype MM. Thus, conditional on mating type (i), the probability that the offspring has genotype MM is one. Similarly, for mating types (iii) and (vi), the offspring genotype must be MN and NN, respectively. Thus, only mating types (ii), (iv) and (v) are informative, i.e. contribute to the likelihood function, conditional on the mating types.

Schaid & Sommer (1993) obtained the conditional likelihood function, given the three informative parental mating types (types (ii), (iv) and (v)) and their diseased offspring (i.e. cases). The genotype relative risks $r_1$ and $r_2$ are the ratios of the probability of having the disease with, respectively, one risk allele and two risk alleles relative to that with no risk alleles, that is, $r_1 = \Pr(\text{disease} \mid MN)/\Pr(\text{disease} \mid NN)$ and $r_2 = \Pr(\text{disease} \mid MM)/\Pr(\text{disease} \mid NN)$.
Under the null hypothesis of no association, all genotype relative risks are equal to 1 ($H_0: r_1 = r_2 = 1$). Under the alternative, $r_2 \ge r_1 \ge 1$ and at least one inequality is strict. Let $n_j$, $j = 2, 4, 5$, be the sample sizes of the three informative mating types with their cases, and $n = n_2 + n_4 + n_5$. From Schaid & Sommer (1993), conditional on $n_2$, $n_4$ and $n_5$, $n_{21}$ ($= n_2 - n_{22}$) and $n_{50}$ ($= n_5 - n_{51}$) follow the binomial distributions $\mathrm{bin}(n_2, r_1/(r_1 + r_2))$ and $\mathrm{bin}(n_5, 1/(1 + r_1))$, respectively, and $(n_{40}, n_{41}, n_{42})$ follows the trinomial distribution with probabilities $(1/(1 + 2r_1 + r_2),\ 2r_1/(1 + 2r_1 + r_2),\ r_2/(1 + 2r_1 + r_2))$, where $n_4 = n_{40} + n_{41} + n_{42}$. Here, $n_{2i}$, $n_{4i}$ and $n_{5i}$ are the counts of case genotypes with $i$ M alleles in mating types (ii), (iv) and (v), respectively, where $i = 1, 2$; $i = 0, 1, 2$; and $i = 0, 1$ for the three informative mating types. Given $n_j$, $j = 2, 4, 5$, the joint conditional likelihood function is the product of the two binomial distributions and the trinomial distribution, which can be written, up to a constant, as

$$L(r_1, r_2) = \frac{r_1^{\,n_{21} + n_{41} + n_{51}}\; r_2^{\,n_{22} + n_{42}}}{(r_1 + r_2)^{n_2} (1 + 2r_1 + r_2)^{n_4} (1 + r_1)^{n_5}}$$

Note that we have two parameters of interest $(r_1, r_2)$ and no nuisance parameters. To reduce the problem to equation (3), we consider the reparameterization $r_1 = 1 + \lambda \sin\varphi$ and $r_2 = 1 + \lambda \cos\varphi$, which establishes a one-to-one relation between $(r_1, r_2) \neq (1, 1)$ and $(\lambda, \varphi)$. The null hypothesis $H_0: r_1 = r_2 = 1$ is equivalent to $H_0: \lambda = 0$, and $\varphi$ is not defined when $\lambda = 0$. Clearly, $r_2 \ge r_1 \ge 1$ implies $\varphi \in [0, \pi/4]$. Using the reparameterization, it can be shown that the score statistic has the form

$$Z_\varphi = \frac{a \cos\varphi + b \sin\varphi}{[c \cos^2\varphi + d \sin^2\varphi + e \sin(2\varphi)]^{1/2}} \tag{5}$$

where $a = (n_{22} - n_2/2) + (n_{42} - n_4/4)$, $b = (n_{21} - n_2/2) + (n_{41} - n_4/2) + (n_{51} - n_5/2)$, $c = n_2/4 + 3n_4/16$, $d = (n_2 + n_4 + n_5)/4$ and $e = -(n_2/4 + n_4/8)$. Equation (5) can be written as equation (3) if we let $\theta = \tan\varphi \in [0, 1]$: dividing the numerator and denominator of (5) by $\cos\varphi$ gives $(b\theta + a)/(d\theta^2 + 2e\theta + c)^{1/2}$, so that $b$, $a$, $d$, $2e$ and $c$ of equation (5) play the roles of $a$, $b$, $c$, $d$ and $e$ in equation (3), respectively.
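The coefficients of equation (5) can be computed directly from the case genotype counts. The following is a minimal sketch rather than the author's implementation: the function names and the counts in the usage lines are hypothetical, and `trio_z_theta` assumes the substitution $\theta = \tan\varphi$ described above.

```python
import math

def trio_coefficients(n21, n22, n40, n41, n42, n50, n51):
    """Coefficients (a, b, c, d, e) of equation (5) from case genotype counts."""
    n2, n4, n5 = n21 + n22, n40 + n41 + n42, n50 + n51
    a = (n22 - n2 / 2) + (n42 - n4 / 4)
    b = (n21 - n2 / 2) + (n41 - n4 / 2) + (n51 - n5 / 2)
    c = n2 / 4 + 3 * n4 / 16
    d = (n2 + n4 + n5) / 4
    e = -(n2 / 4 + n4 / 8)
    return a, b, c, d, e

def trio_z(phi, counts):
    """Z_phi of equation (5) for phi in [0, pi/4]."""
    a, b, c, d, e = trio_coefficients(*counts)
    num = a * math.cos(phi) + b * math.sin(phi)
    var = c * math.cos(phi)**2 + d * math.sin(phi)**2 + e * math.sin(2 * phi)
    return num / math.sqrt(var)

def trio_z_theta(theta, counts):
    """The same statistic in the form of equation (3), with theta = tan(phi)."""
    a, b, c, d, e = trio_coefficients(*counts)
    return (b * theta + a) / math.sqrt(d * theta**2 + 2 * e * theta + c)

# Hypothetical counts (n21, n22, n40, n41, n42, n50, n51).
counts = (5, 9, 8, 20, 12, 30, 36)
phi = math.atan(0.5)
print(trio_z(phi, counts), trio_z_theta(0.5, counts))  # the two forms agree
```

Working in the $\theta$ form is convenient because the closed-form maximization for equation (3) then applies without change.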

Finding the Maximum of Optimal Statistics

The examples in the previous section show that optimal statistics from various genetic testing problems can be expressed as equation (3). Although MAX can be obtained numerically, basic calculus yields a simple closed-form solution when $Z_\theta$ is given by equation (3). We present the closed-form solutions in this section. The derivations are given in the Appendix.

Write $a(x)$ and $b(x)$ in equation (3) as $a$ and $b$. Assume that $c \neq 0$. Otherwise, if $c = 0$, $a$ must be a constant with probability one. Thus, $d = 0$ and the optimal statistic (3) becomes $(a\theta + b)/e^{1/2}$, whose maximum over $\theta \in [L, U]$ is trivial. Define $a^* = 2ab - a^2 d/c$ and $b^* = b^2 - a^2 e/c$. If $a^* = 0$ and $b^* = 0$, then $Z_\theta$ given in equation (3) is a constant.

First, consider a two-sided alternative. Define the following three cases:

(C1) when $a^* = 0$ and $b^* \neq 0$, let $\theta_1 = -d/(2c)$;

(C2) when $a^* \neq 0$ and $\Delta = b^{*2}c^2 + a^{*2}ce - a^* b^* cd \ge 0$, let

$$\theta_i = \frac{b^* c \pm \Delta^{1/2}}{-a^* c}, \quad i = 1, 2; \tag{6}$$

(C3) when $a^* \neq 0$ and $\Delta < 0$.

For (C1) and (C2), $\max_{\theta \in [L,U]} |Z_\theta|$ is given by

$$\max_{\theta \in [L,U]} |Z_\theta| = \max(|Z_L|, |Z_{\theta_1}|, |Z_U|) \quad \text{and} \quad \max_{\theta \in [L,U]} |Z_\theta| = \max(|Z_L|, |Z_{\theta_1}|, |Z_{\theta_2}|, |Z_U|),$$

respectively, where any root that falls outside $[L, U]$ is omitted. For (C3), $\max_{\theta \in [L,U]} |Z_\theta| = \max(|Z_L|, |Z_U|)$.

Second, for the one-sided alternative, we assume that $H_0$ is rejected in favour of $H_a: \lambda > 0$ for large values of $Z_\theta$. Thus, we are interested in the positive values of $Z_\theta$. Determine the interval $[L^*, U^*] \subseteq [L, U]$ such that $Z_\theta \ge 0$ if and only if $\theta \in [L^*, U^*]$. Given the data $x$, the interval $[L^*, U^*]$ is easy to obtain from the numerator of equation (3); e.g. if $a > 0$, then $[L^*, U^*] = [\theta^*, U]$ when $\theta^* = -b/a \in [L, U]$, $[L^*, U^*] = [L, U]$ when $\theta^* < L$, and $[L^*, U^*]$ is empty otherwise, i.e. $Z_\theta < 0$ for any $\theta \in [L, U]$. Suppose $[L^*, U^*]$ is not empty. Then $\max_{\theta \in [L,U]} Z_\theta = \max_{\theta \in [L^*,U^*]} Z_\theta = \max_{\theta \in [L^*,U^*]} |Z_\theta|$, so the results for the two-sided alternative, applied on $[L^*, U^*]$, give the maximum of $Z_\theta$ over $\theta \in [L, U]$.

In applications, we also use the minimum of $Z_\theta$ over $\theta \in [L, U]$, $\mathrm{MIN} = \min_{\theta \in [L,U]} Z_\theta$.
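The cases (C1) to (C3) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the author's macro: the function names are hypothetical, exact comparisons with zero are used where a tolerance may be preferable in practice, and instead of constructing $[L^*, U^*]$ explicitly the code evaluates $Z_\theta$ at the endpoints and at the roots lying in $[L, U]$. That simplification relies on the fact that every stationary point of $Z_\theta$ is also a stationary point of $Z_\theta^2$.

```python
import math

def z_stat(theta, a, b, c, d, e):
    """Equation (3): (a*theta + b) / sqrt(c*theta^2 + d*theta + e)."""
    return (a * theta + b) / math.sqrt(c * theta**2 + d * theta + e)

def stationary_points(a, b, c, d, e):
    """Roots of the numerator of dS/dtheta, following cases (C1)-(C3)."""
    if c == 0:                 # statistic is linear in theta; endpoints suffice
        return []
    a_star = 2 * a * b - a**2 * d / c
    b_star = b**2 - a**2 * e / c
    if a_star == 0:            # (C1); if b* is also 0 the statistic is constant
        return [-d / (2 * c)] if b_star != 0 else []
    delta = b_star**2 * c**2 + a_star**2 * c * e - a_star * b_star * c * d
    if delta < 0:              # (C3): no real roots
        return []
    root = math.sqrt(delta)    # (C2): equation (6)
    return [(b_star * c + root) / (-a_star * c),
            (b_star * c - root) / (-a_star * c)]

def max_min_z(a, b, c, d, e, lower, upper):
    """Return (MAX, MIN, max |Z|) of Z_theta over [lower, upper]."""
    cands = [lower, upper]
    cands += [t for t in stationary_points(a, b, c, d, e) if lower <= t <= upper]
    values = [z_stat(t, a, b, c, d, e) for t in cands]
    return max(values), min(values), max(abs(v) for v in values)

# Hypothetical sib-pair counts (n0, n1, n2) = (15, 45, 40), so n = 100 and
# a = 2*n1 - 4*n2, b = 3*n2 - n0 - n1, (c, d, e) = (6n, -8n, 3n).
mx, mn, mabs = max_min_z(-70, 60, 600, -800, 300, 0.0, 0.5)
print(round(mx, 4), round(mn, 4))  # 3.6742 3.4641
```

In this hypothetical example the maximum is attained at the interior root $\theta = 0.375$ and the minimum at the endpoint $\theta = 0$.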
MIN can usually be obtained as $\mathrm{MIN} = -\max_{\theta \in [L,U]} (-Z_\theta)$.

Applications

In the following, we consider testing $H_0: \lambda = 0$ against the one-sided alternative $H_a: \lambda > 0$. One possible application of the algorithm is to use MAX as a test statistic and reject the null hypothesis when $\mathrm{MAX} > c$. The exact null distribution of MAX is not available. The critical value $c$ can be obtained from the following approximation given by Davies (1977),

$$\Pr\nolimits_{H_0}(\mathrm{MAX} > c) \le \Phi(-c) + \frac{1}{2\pi} \exp(-c^2/2) \int_L^U \{-\rho_{11}(\theta)\}^{1/2}\, d\theta = \alpha \tag{7}$$

where $\rho_{11}(\theta) = (\partial^2/\partial\theta_1^2)\,\rho(\theta_1, \theta)\big|_{\theta_1 = \theta}$ and $\rho(\theta_1, \theta_2) = \mathrm{Cov}_{H_0}(Z_{\theta_1}, Z_{\theta_2})$ for $\theta_1, \theta_2 \in [L, U]$. Note that in some applications the bound provided by equation (7) is not sharp (e.g. Shoukri & Lathrop, 1993; Azais & Cierco-Ayrolles, 2002). Note also that, in our applications, the null distribution is known and data can be generated from condition (1) under the null hypothesis. Thus, the null distribution of MAX can be simulated, and the critical value $c$ can also be obtained as the upper $\alpha$ percentile of the simulated null distribution of MAX.

We generalize the applications of Kimeldorf et al. (1992) by applying MAX and MIN to hypothesis testing with a nuisance parameter not defined under the null. Examples of genetic linkage analysis using affected sib pairs and of candidate-gene association in the case-parents trio design are used for illustration. Kimeldorf et al. (1992) defined the non-straddling and straddling situations to examine the effect of the choice of scores on testing results in the Cochran–Armitage trend test. In terms of test statistic (3), the non-straddling situation refers to $\max_\theta Z_\theta < z_\alpha$ or $\min_\theta Z_\theta > z_\alpha$, and the straddling situation refers to $\max_\theta Z_\theta > z_\alpha > \min_\theta Z_\theta$, where $z_\alpha$ is the upper $\alpha$ percentile of the standard normal distribution. In the non-straddling situation, the test results do not depend on $\theta$. When $\max_\theta Z_\theta < z_\alpha$, one cannot reject the null whatever $\theta$ is. This indicates the weakest association.
When $\min_\theta Z_\theta > z_\alpha$, the null is rejected whatever $\theta$ is. This, on the other hand, indicates the strongest association. In the straddling situation, the conclusion (p-value) depends on $\theta$.

For the genetic linkage analysis using affected sib pairs and the association test using case-parents trios, we apply the algorithm of the previous section to find MAX and MIN and to examine the effect of an unknown genetic model on testing results. When $\mathrm{MAX} < 1.645$, we fail to reject the null at $\alpha = 0.05$ for any value of the nuisance parameter (genetic model), while when $\mathrm{MIN} > 1.645$, we reject the null for any value of the nuisance parameter (genetic model). In these two non-straddling situations, we can conclude that the testing results do not depend on the underlying model. Notice that the distribution of MAX or MIN is not standard normal under the null. Thus, $\mathrm{MAX} < 1.645$ or $\mathrm{MIN} > 1.645$ can only be used to examine the effect of a nuisance parameter on the score test $Z_\theta$. They are not used as test statistics at the 0.05 significance level. In the straddling situation, $\mathrm{MAX} > 1.645 > \mathrm{MIN}$, whether or not the test is significant at the 0.05 level depends on the genetic model. Generally, in this situation, one can consider robust tests, e.g. the maximin efficiency robust test (MERT) and the maximal test, developed by Gastwirth (1966, 1985) and Freidlin et al. (1999). For a general discussion of the maximal test and MERT and their relationship, see Freidlin et al. (1999).

In affected sib pair linkage analysis, several test statistics are available. For example, in the literature, $Z_{1/2}$ and $Z_0$ are referred to as the means test and the proportions test, respectively, which are optimal under the corresponding genetic models ($\theta$). When the underlying genetic model is unknown, a simple robust test, $Z_{1/4}$, is also available for testing $H_0$. This test was studied by Whittemore & Tu (1998), who showed that it is a linear combination of the two extreme tests $Z_0$ and $Z_{1/2}$, and it is equal to MERT (Gastwirth & Freidlin, 2000).
Another robust test, the maximum of the two extreme tests $Z_0$ and $Z_{1/2}$ (a maximal test), was also studied by Gastwirth & Freidlin (2000). Here, to apply the MAX/MIN algorithm, we simulate the probabilities of the non-straddling situations under the alternative hypotheses. These probabilities indicate how useful the procedure based on $\mathrm{MAX} < 1.645$ or $\mathrm{MIN} > 1.645$ would be in data analysis. For example, $\Pr(\mathrm{MIN} > 1.645) = 0.80$ indicates that 80% of the time MIN can be used to identify a non-straddling situation and make a conclusive decision. Note that, from equation (4), the score statistic for linkage analysis is given by

$$Z_\theta = \frac{(2n_1 - 4n_2)\theta + (3n_2 - n_0 - n_1)}{(3n - 8n\theta + 6n\theta^2)^{1/2}}, \quad \theta \in [0, 1/2].$$

Results are reported in the next section.

In the case-parents trio design, the transmission/disequilibrium test (TDT) of Spielman et al. (1993) is commonly used. It is equal to $Z_{\varphi_0}$ given by equation (5), where $\varphi_0 = \tan^{-1}(1/2)$. The TDT is optimal under the additive genetic model. Some robust tests were discussed by Zheng et al. (2002) for the case in which the genetic model is unknown. Our purpose is to apply the MAX/MIN algorithm to testing association using trios. To this end, we simulate a data set and apply the algorithm to examine whether or not we have a non-straddling situation. Results are reported next.

Simulation Results

To generate the numbers $(n_0, n_1, n_2)$ of sib pairs sharing (0, 1, 2) alleles IBD, in each of 10,000 replications we generated $n = 25, 50, 100, 200$ trinomial random variables from the distribution $p \in \mathcal{F}$ for given $\lambda = 0, 0.05, \ldots, 1$ and $\theta = 0, 0.25, 0.40, 0.50$, where $n = n_0 + n_1 + n_2$, and calculated $Z_\theta$. Then MAX and MIN were calculated using the algorithm for each combination of $(\lambda, \theta)$. Only four $\theta$ values were considered here as the true models since they were also studied by Gastwirth & Freidlin (2000) with $n = 200$.
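The simulation just described can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the reported results: the function names, the seed and the reduced number of replications are assumptions made here, and MAX and MIN are found by evaluating $Z_\theta$ at the endpoints and at the stationary points of $Z_\theta^2$ inside $[0, 1/2]$.

```python
import math
import random

def max_min_sib_pair(n0, n1, n2, lower=0.0, upper=0.5):
    """MAX and MIN of the sib-pair Z_theta over [lower, upper]."""
    n = n0 + n1 + n2
    a, b = 2 * n1 - 4 * n2, 3 * n2 - n0 - n1
    c, d, e = 6 * n, -8 * n, 3 * n

    def z(t):
        return (a * t + b) / math.sqrt(c * t * t + d * t + e)

    a_star = 2 * a * b - a * a * d / c
    b_star = b * b - a * a * e / c
    cands = [lower, upper]
    if a_star == 0:
        if b_star != 0:
            cands.append(-d / (2 * c))
    else:
        delta = (b_star * c) ** 2 + a_star ** 2 * c * e - a_star * b_star * c * d
        if delta >= 0:
            r = math.sqrt(delta)
            cands += [(b_star * c + r) / (-a_star * c),
                      (b_star * c - r) / (-a_star * c)]
    vals = [z(t) for t in cands if lower <= t <= upper]
    return max(vals), min(vals)

def non_straddling_probs(n, lam, theta, reps=2000, crit=1.645, seed=1):
    """Monte Carlo estimates of Pr(MAX < crit) and Pr(MIN > crit)."""
    rng = random.Random(seed)
    # Mixture probabilities p = lam*(0, theta, 1-theta) + (1-lam)*(1/4, 1/2, 1/4)
    p = [lam * u + (1 - lam) * v
         for u, v in zip((0.0, theta, 1 - theta), (0.25, 0.5, 0.25))]
    below = above = 0
    for _ in range(reps):
        draws = rng.choices((0, 1, 2), weights=p, k=n)
        mx, mn = max_min_sib_pair(draws.count(0), draws.count(1), draws.count(2))
        below += mx < crit
        above += mn > crit
    return below / reps, above / reps

print(non_straddling_probs(n=100, lam=0.2, theta=0.25))
```

Because the two events are disjoint, the two estimated probabilities sum to at most one; the remainder estimates the probability of the straddling situation.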
The probabilities of the two disjoint non-straddling situations, $\{\mathrm{MAX} < 1.645\}$ and $\{\mathrm{MIN} > 1.645\}$, were obtained from the 10,000 replications. These probabilities are plotted in Figure 1. When $\lambda$ is close to 0, Figure 1 (the left column) shows that the probability of the non-straddling situation is greater than 80%. The larger $\theta$ is, the higher the probability of $\{\mathrm{MAX} < 1.645\}$. From the right column of Figure 1, when $\lambda = 0.2$ and $n = 100$, the probability of $\{\mathrm{MIN} > 1.645\}$ ranges from about 25% to 85% depending on the true value of $\theta$. It is not surprising that, for given $\lambda$ and $\theta$, $\Pr(\mathrm{MAX} < 1.645)$ decreases and $\Pr(\mathrm{MIN} > 1.645)$ increases as the sample size increases.

For the case-parents design, we generated a data set under the alternative. We have to specify the genotype relative risks $(r_1, r_2)$ and the frequency ($q$) of the allele M, and assume that Hardy–Weinberg equilibrium holds (so that $\Pr(MM) = q^2$, $\Pr(MN) = 2q(1-q)$ and $\Pr(NN) = (1-q)^2$) in order to calculate the expected sample size of informative families for each informative mating type. Let $p_j$, $j = 1, \ldots, 6$, be the probabilities of the six parental mating types, given in Schaid & Sommer (1993, Table 1, second column). The probabilities of the three informative mating types are also given in the second column of Table 1. In practice, one has to screen for informative mating types. To get $n$ informative families, the expected screening size is $N = n/(p_2 + p_4 + p_5)$, which depends on the allele frequency and the relative risks. For example, assuming the allele frequency $q = 0.2$ and $r_1 = 1$, $r_2 = 2$, to get $n = 200$ informative families, the expected screening size is $N = 421$. If the allele frequency reduces to $q = 0.1$, then $N = 750$, with $r_1$ and $r_2$ unchanged. For a given total sample size $n$ of the three informative mating types, the expected sample sizes for mating types (ii), (iv) and (v) are given by $n_j = n p_j/(p_2 + p_4 + p_5)$, $j = 2, 4, 5$.
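The allocation $n_j = n p_j/(p_2 + p_4 + p_5)$ can be illustrated with a short sketch. The weights below are an assumption made for this example: each is the Hardy–Weinberg frequency of the mating type multiplied by the average relative risk of its offspring, with $q$ the frequency of M, and the common normalising constant is omitted because it cancels in the ratio. The function names are hypothetical.

```python
import math

def informative_mating_weights(q, r1, r2):
    """Unnormalised probabilities of mating types (ii), (iv), (v) given an affected child."""
    w2 = 2 * q**3 * (1 - q) * (r1 + r2)           # (ii) MM x MN
    w4 = q**2 * (1 - q)**2 * (r2 + 2 * r1 + 1)    # (iv) MN x MN
    w5 = 2 * q * (1 - q)**3 * (r1 + 1)            # (v)  NN x MN
    return w2, w4, w5

def expected_sizes(n, q, r1, r2):
    """Expected numbers of families (n2, n4, n5) among n informative families."""
    w = informative_mating_weights(q, r1, r2)
    total = sum(w)
    return tuple(n * wj / total for wj in w)

sizes = expected_sizes(200, q=0.2, r1=1, r2=2)
print([math.ceil(s) for s in sizes])  # [14, 45, 143]
```

Rounding each expected size up to the next integer is why the three sizes can add up to slightly more than $n$.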
Conditional on $n_2$ or $n_5$, the corresponding binomial random variables were generated from the distributions given in the second section. Similarly, conditional on $n_4$, trinomial random variables were generated from the trinomial distribution given in the second section.

Figure 1. Plots of probabilities of non-straddling situations for various genetic models and sample sizes. The plots in the four rows from the top correspond to sample sizes $n = 25, 50, 100, 200$, respectively.

The data set given in Table 1 is based on the frequency of the mutation allele (M) $q = 0.20$ and genotype relative risks $r_1 = 1$, $r_2 = 2$. For $n = 200$, the expected sample sizes (rounded up to the nearest integers) for the informative mating types are $n_2 = 14$, $n_4 = 45$ and $n_5 = 143$, respectively. Now we find MAX and MIN for this data set using the algorithm for the one-sided alternative. Note that $\theta = \tan\varphi = (r_1 - 1)/(r_2 - 1) = 0 \in [L, U] = [0, 1]$ is the true value. In the form of equation (3), $a = -5$ and $b = 15.75$, so $Z_\theta = (a\theta + b)/(c\theta^2 + d\theta + e)^{1/2} > 0$ for any $\theta \in [L^*, U^*] = [0, 1]$. Two roots were found, $\theta_1 = 0.1308$ and $\theta_2 = 3.15$, of which only $\theta_1$ lies in $[0, 1]$. Thus, we calculated $Z_1 = 1.6712$, $Z_0 = 5.2139$ and $Z_{0.1308} = 5.4752$, so that $\mathrm{MAX} = 5.4752$ and $\mathrm{MIN} = 1.6712$. It follows from $\mathrm{MIN} > 1.645$ that we reject the null hypothesis at $\alpha = 0.05$, even if we do not know the true value of $\theta$, i.e. the true genetic model. In this example, the conclusion is independent of the underlying genetic model (the nuisance parameter).

Table 1. Data set of case-parents trio design for candidate-gene association

| Informative mating type | Probability | Case genotype | Counts |
|---|---|---|---|
| (ii) MM×MN | $p_2 = 2(1-q)q^3(r_1 + r_2)/R$ | MN | $n_{21} = 3$ |
| | | MM | $n_{22} = 11$ |
| (iv) MN×MN | $p_4 = (1-q)^2 q^2 (r_2 + 2r_1 + 1)/R$ | NN | $n_{40} = 4$ |
| | | MN | $n_{41} = 18$ |
| | | MM | $n_{42} = 23$ |
| (v) NN×MN | $p_5 = 2(1-q)^3 q (r_1 + 1)/R$ | NN | $n_{50} = 68$ |
| | | MN | $n_{51} = 75$ |

$R = r_2 q^2 + 2r_1(1-q)q + (1-q)^2$ and $q$ is the frequency of M.
Note that the transmission/disequilibrium test (TDT) of Spielman et al. (1993) is given by $Z_{1/2}$. Applying the TDT to the data given in Table 1, $Z_{1/2} = 3.031$, which is also significant at the 0.05 level.

Discussion

For a family of optimal statistics obtained from hypothesis testing with a nuisance parameter not defined under the null hypothesis, we provided a simple algorithm to find the exact MAX (MIN) and applied it to genetic data analysis. The algorithm can be used to identify the non-straddling situations defined in Kimeldorf et al. (1992). It provides useful data analysis tools alongside other optimal tests and robust tests (MERT and maximal tests).

For a family of smooth test statistics, the maximum statistic can also be obtained by a numerical grid search. However, a simple exact solution is desirable if it exists. Moreover, it helps to reduce the computing time in simulation. For example, in linkage analysis using affected sib pairs, for $\theta \in [0, 0.5]$, we can choose $\theta$ from 0 to 0.5 with a step size of 0.0001, calculate all 5001 test statistics, and find the maximum test statistic. This is replicated 10,000 times. Most of the computing time in this approach comes from sorting the 5001 statistics to find their maximum in each of the 10,000 replicates. Using the exact solution for MAX programmed as a macro, there is no need to sort the data set.

From the simulation results (Figure 1), MAX and MIN become more useful as the sample size increases. In practice, the sample size is usually not chosen (because of the unknown nuisance parameter) to reach a prespecified power at a given significance level. Hence, MAX and MIN are helpful for preliminary data analysis to explore the effect of the unknown nuisance parameter. In the non-straddling situations, the conclusion is clear, while in the straddling situations, a simple score statistic may not be reliable. Thus, the more advanced robust tests of Davies (1977), Freidlin et al.
(1999) and Gastwirth & Freidlin (2000) can be applied. Although the results can be applied generally, for the genetic linkage testing problem the results may depend on the assumption that IBD can be uniquely determined from the available data. In practice, however, this may not be the case.

Acknowledgements

The author thanks a referee for helpful suggestions on improving the presentation.

References

Azais, J.-M. & Cierco-Ayrolles, C. (2002) An asymptotic test for quantitative gene detection, Ann. I. H. Poincaré: Probability and Statistics, 38, pp. 1087–1092.
Blackwelder, W. C. & Elston, R. C. (1985) A comparison of sib-pair linkage tests for disease susceptibility loci, Genetic Epidemiology, 2, pp. 85–97.
Davies, R. B. (1977) Hypothesis testing when a nuisance parameter is present only under the alternative, Biometrika, 64, pp. 247–254.
Faraway, J. J. (1993) Improved sib-pair test for disease susceptibility loci, Genetic Epidemiology, 10, pp. 225–233.
Freidlin, B., Podgor, M. J. & Gastwirth, J. L. (1999) Efficiency robust tests for survival or ordered categorical data, Biometrics, 55, pp. 883–886.
Gastwirth, J. L. (1966) On robust procedures, Journal of the American Statistical Association, 61, pp. 929–948.
Gastwirth, J. L. (1985) The use of maximin efficient robust tests in combining contingency tables and survival analysis, Journal of the American Statistical Association, 80, pp. 380–384.
Gastwirth, J. L. & Freidlin, B. (2000) On power and efficiency robust linkage tests for affected sibs, Annals of Human Genetics, 64, pp. 443–453.
Hartl, D. L. & Clark, A. G. (1997) Principles of Population Genetics, 3rd edn (Sunderland, MA: Sinauer Associates).
Holmans, P. (1993) Asymptotic properties of affected-sib-pair linkage analysis, American Journal of Human Genetics, 52, pp. 362–374.
Kimeldorf, G., Sampson, A. R. & Whitaker, L. R.
(1992) Min and max scorings for two-sample ordinal data, Journal of the American Statistical Association, 87, pp. 241–247.
Schaid, D. J. & Sommer, S. S. (1993) Genotype relative risks: methods for design and analysis of candidate-gene association studies, American Journal of Human Genetics, 53, pp. 1114–1126.
Shoukri, M. M. & Lathrop, G. M. (1993) Statistical testing of genetic linkage under heterogeneity, Biometrics, 49, pp. 151–161.
Spielman, R. S., McGinnis, R. E. & Ewens, W. J. (1993) Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM), American Journal of Human Genetics, 52, pp. 506–516.
Whittemore, A. S. & Tu, I.-P. (1998) Simple, robust linkage tests for affected sibs, American Journal of Human Genetics, 62, pp. 1228–1242.
Zheng, G., Freidlin, B. & Gastwirth, J. L. (2002) Robust TDT-type candidate-gene association tests, Annals of Human Genetics, 66, pp. 145–155.

Appendix

Consider the two-sided alternative and assume that $Z_\theta$ asymptotically follows a standard normal distribution under the null hypothesis. For the two-sided alternative, we consider

$$Z_\theta^2 = \frac{a^2\theta^2 + 2ab\theta + b^2}{c\theta^2 + d\theta + e} = \frac{a^2}{c} + \frac{\left(2ab - \frac{a^2 d}{c}\right)\theta + \left(b^2 - \frac{a^2 e}{c}\right)}{c\theta^2 + d\theta + e} = \frac{a^2}{c} + \frac{a^*\theta + b^*}{c\theta^2 + d\theta + e}$$

which, for a given $\theta$, asymptotically follows a chi-square distribution with 1 degree of freedom under the null hypothesis. Let $S_\theta = (a^*\theta + b^*)/(c\theta^2 + d\theta + e)$. Note that

$$\arg\max_{\theta \in [L,U]} |Z_\theta| = \arg\max_{\theta \in [L,U]} Z_\theta^2 = \arg\max_{\theta \in [L,U]} S_\theta.$$

To maximize $S_\theta$ with respect to $\theta \in [L, U]$, basic calculus is applied to examine where $S_\theta$ increases and where it decreases and, therefore, to find the local maximal points. Note that

$$\frac{\partial S_\theta}{\partial\theta} = \frac{(-a^* c)\theta^2 - 2b^* c\,\theta + (a^* e - b^* d)}{(c\theta^2 + d\theta + e)^2}.$$

Suppose that the numerator of $\partial S_\theta/\partial\theta$ has $k = 0, 1, 2$ real roots in $[L, U]$. The possible roots are given by

$$\theta_i = \frac{b^* c \pm (b^{*2}c^2 + a^{*2}ce - a^* b^* cd)^{1/2}}{-a^* c}$$

when $a^* \neq 0$, and there is only one root $\theta_1 = -d/(2c)$ when $a^* = 0$ and $b^* \neq 0$, where $i = 1, \ldots, k$.
These $k$ real roots in $[L, U]$ are the only possible local maximal points of $S_\theta$. Therefore,

$$\max_{\theta \in [L,U]} S_\theta = \max(S_L, S_{\theta_1}, \ldots, S_{\theta_k}, S_U).$$

When $k = 0$, i.e. there are no real roots in $[L, U]$, $\max_{\theta \in [L,U]} S_\theta = \max(S_L, S_U)$. Consequently, $\max_{\theta \in [L,U]} |Z_\theta| = \max(|Z_L|, |Z_{\theta_1}|, \ldots, |Z_{\theta_k}|, |Z_U|)$ when $k > 0$ and $\max_{\theta \in [L,U]} |Z_\theta| = \max(|Z_L|, |Z_U|)$ when $k = 0$.
