Takeshi Amemiya - Introduction to Statistics and Econometrics

CONTENTS

Preface xi

1 INTRODUCTION 1
1.1 What Is Probability? 1
1.2 What Is Statistics? 2

2 PROBABILITY 5
2.1 Introduction 5
2.2 Axioms of Probability 5
2.3 Counting Techniques 7
2.4 Conditional Probability and Independence 10
2.5 Probability Calculations 13
EXERCISES 17

3 RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS 19
3.1 Definitions of a Random Variable 19
3.2 Discrete Random Variables 20
3.3 Univariate Continuous Random Variables 27
3.4 Bivariate Continuous Random Variables 29
3.5 Distribution Function 43
3.6 Change of Variables 47
3.7 Joint Distribution of Discrete and Continuous Random Variables 57
EXERCISES 59

4 MOMENTS 61
4.1 Expected Value 61
4.2 Higher Moments 67
4.3 Covariance and Correlation 70
4.4 Conditional Mean and Variance 77
EXERCISES 83

5 BINOMIAL AND NORMAL RANDOM VARIABLES 87
5.1 Binomial Random Variables 87
5.2 Normal Random Variables 89
5.3 Bivariate Normal Random Variables 92
5.4 Multivariate Normal Random Variables 97
EXERCISES 98

6 LARGE SAMPLE THEORY 100
6.1 Modes of Convergence 100
6.2 Laws of Large Numbers and Central Limit Theorems 103
6.3 Normal Approximation of Binomial 104
6.4 Examples 107
EXERCISES 109

7 POINT ESTIMATION 112
7.1 What Is an Estimator? 112
7.2 Properties of Estimators 116
7.3 Maximum Likelihood Estimator: Definition and Computation 133
7.4 Maximum Likelihood Estimator: Properties 138
EXERCISES 151

8 INTERVAL ESTIMATION 160
8.1 Introduction 160
8.2 Confidence Intervals 161
8.3 Bayesian Method 168
EXERCISES 178

9 TESTS OF HYPOTHESES 182
9.1 Introduction 182
9.2 Type I and Type II Errors 183
9.3 Neyman-Pearson Lemma 188
9.4 Simple against Composite 194
9.5 Composite against Composite 201
9.6 Examples of Hypothesis Tests 205
9.7 Testing about a Vector Parameter 210
EXERCISES 219

10 BIVARIATE REGRESSION MODEL 228
10.1 Introduction 228
10.2 Least Squares Estimators 230
10.3 Tests of Hypotheses 248
EXERCISES 253

11 ELEMENTS OF MATRIX ANALYSIS 257
11.1 Definition of Basic Terms 257
11.2 Matrix Operations 259
11.3 Determinants and Inverses 260
11.4 Simultaneous Linear Equations 264
11.5 Properties of the Symmetric Matrix 270
EXERCISES 278

12 MULTIPLE REGRESSION MODEL 281
12.1 Introduction 281
12.2 Least Squares Estimators 283
12.3 Constrained Least Squares Estimators 296
12.4 Tests of Hypotheses 299
12.5 Selection of Regressors 308
EXERCISES 310

13 ECONOMETRIC MODELS 314
13.1 Generalized Least Squares 314
13.2 Time Series Regression 325
13.3 Simultaneous Equations Model 327
13.4 Nonlinear Regression Model 330
13.5 Qualitative Response Model 332
13.6 Censored or Truncated Regression Model (Tobit Model) 339
13.7 Duration Model 343

Appendix: Distribution Theory 353
References 357
Name Index 361
Subject Index 363

PREFACE

Although there are many textbooks on statistics, they usually contain only a cursory discussion of regression analysis and seldom cover various generalizations of the classical regression model important in econometrics and other social science applications. Moreover, in most of these textbooks the selection of topics is far from ideal from an econometrician's point of view. At the same time, there are many textbooks on econometrics, but either they do not include statistics proper, or they give it a superficial treatment. The present book is aimed at filling that gap.

Chapters 1 through 9 cover probability and statistics and can be taught in a semester course for advanced undergraduates or first-year graduate students. The prerequisites are one year of calculus and an ability to think mathematically. My own course on this material has been taken by both undergraduate and graduate students in economics, statistics, and other social science disciplines. In these chapters I emphasize certain topics which are important in econometrics but which are often overlooked by statistics textbooks at this level. Examples are best prediction and best linear prediction, the joint distribution of a continuous and a discrete random variable, conditional density of the form f(x | X < Y), large sample theory, and the properties of the maximum likelihood estimator. I discuss these topics without undue use of mathematics and with many illustrative examples and diagrams. I give a thorough analysis of the problem of choosing estimators, including a comparison of various criteria for ranking estimators. I also present a critical evaluation of the classical method of hypothesis testing, especially in the realistic case of testing a composite null against a composite alternative. I devote a lot of space to these and other fundamental concepts because I believe that it is far better for a student to have a solid knowledge of the basic facts about random variables than to have a superficial knowledge of the latest techniques.

I also believe that students should be trained to question the validity and reasonableness of conventional statistical techniques. Therefore, in discussing these issues as well as other problematic areas of classical statistics, I frequently have recourse to Bayesian statistics. I do so not because I believe it is superior (in fact, this book is written mainly from the classical point of view) but because it provides a pedagogically useful framework for consideration of many fundamental issues in statistical inference.

Chapter 10 presents the bivariate classical regression model in the conventional summation notation. Chapter 11 is a brief introduction to matrix analysis. By studying it in earnest, the reader should be able to understand Chapters 12 and 13 as well as the brief sections in Chapters 5 and 9 that use matrix notation. Chapter 12 gives the multiple classical regression model in matrix notation. In Chapters 10 and 12 the concepts and the methods studied in Chapters 1 through 9 in the framework of the i.i.d. (independent and identically distributed) sample are extended to the regression model. In Chapter 13 I discuss various generalizations of the classical regression model (Sections 13.1 through 13.4) and certain other statistical models extensively used in econometrics and other social science applications (13.5 through 13.7). The first part of the chapter is a quick overview of the topics. The second part, which discusses qualitative response models, censored and truncated regression models, and duration models, is a more extensive introduction to these important subjects.

Chapters 10 through 13 can be taught in the semester after the semester that covers Chapters 1 through 9. Alternatively, for students with less background, Chapters 1 through 12 may be taught in a year, and Chapter 13 studied independently. Under this plan, the material in Sections 13.1 through 13.4 needs to be supplemented by additional readings. In addition, many exercises are given at the end of each chapter (except Chapters 1 and 13).

At Stanford about half of the students who finish a year-long course in statistics and econometrics go on to take a year's course in advanced econometrics, for which I use my Advanced Econometrics (Harvard University Press, 1985). It is expected that those who complete the present textbook will be able to understand my advanced textbook.

I am grateful to Gene Savin, Peter Robinson, and James Powell, who read all or part of the manuscript and gave me valuable comments. I am also indebted to my students Fumihiro Goto and Dongseok Kim for carefully checking the entire manuscript for typographical and more substantial errors; Dongseok Kim also prepared all the figures in the book. I also thank Michael Aronson, general editor at Harvard University Press, for constant encouragement and guidance, and Elizabeth Gretz and Vivian Wheeler for carefully checking the manuscript and suggesting numerous stylistic changes that considerably enhanced its readability. I alone, however, take responsibility for the remaining errors.

Finally, I dedicate this book to my wife, Yoshiko, who for over twenty years has made a steadfast effort to bridge the gap between two cultures.

1 INTRODUCTION

1.1 WHAT IS PROBABILITY?

(3) The probability of obtaining heads when we toss a particular coin is 1/2.

Note that (1) is sometimes true and sometimes false as it is repeatedly observed, whereas statement (2) or (3) is either true or false as it deals with a particular thing, one of a kind. It may be argued that a frequency interpretation of (2) is possible to the extent that some of Plato's assertions have been proved true by a later study and some false. But in that case we are considering any assertion of Plato's, rather than the particular one regarding Atlantis. A Bayesian can talk about the probability of any one of these events or statements, whereas a classical statistician can do so only for the event (1), because only (1) is concerned with a repeatable event. As we shall see in later chapters, these two interpretations of probability lead to two different methods of statistical inference. The two methods are complementary, and different situations call for different methods. Although in this book I present mainly the classical method, I will present the Bayesian method whenever I believe it offers more attractive solutions.

1.2 WHAT IS STATISTICS?

In our everyday life we must continuously make decisions in the face of uncertainty, and in making decisions it is useful for us to know the probability of certain events. We want to know the probability of rain when we decide whether or not to take an umbrella in the morning. Before deciding to gamble, we would want to know the probability of winning. In determining the discount rate, the Federal Reserve Board needs to assess the probabilistic impact of a change in the rate on the unemployment rate and on inflation. It is advisable to determine these probabilities in a reasonable way; otherwise we will lose in the long run, although in the short run we may be lucky and avoid the consequences of a haphazard decision.

A reasonable way to determine a probability should take into account the past record of an event in question or, whenever possible, the results of a deliberate experiment. Consider estimating the probability of heads by tossing a particular coin many times. Most people will think it reasonable to use the ratio of heads over tosses as an estimate. In statistics we study whether it is indeed reasonable and, if so, in what sense. We are ready for our first working definition of statistics: Statistics is the science of assigning a probability to an event on the basis of experiments.

The statistician regards every event whose outcome is unknown to be like drawing a ball at random from a box that contains various types of balls in certain proportions. Tossing a coin with the probability of heads equal to p is identical to choosing a ball at random from a box that contains two types of balls, one of which corresponds to heads and the other to tails, with p being the proportion of heads balls.

For example, consider the question of whether or not cigarette smoking is associated with lung cancer. First, we need to paraphrase the question to make it more readily accessible to statistical analysis. One way is to ask, What is the probability that a person who smokes more than ten cigarettes a day will contract lung cancer? (This may not be the optimal way, but we choose it for the sake of illustration.) To apply the box-ball analogy to this example, we should imagine a box that contains balls corresponding to cigarette smokers; some of the balls have lung cancer marked on them and the rest do not. Drawing a ball at random corresponds to choosing a cigarette smoker at random and observing him until he dies to see whether or not he contracts lung cancer. Such an experiment would be a costly one. If we asked a related but different question (what is the probability that a man who died of lung cancer was a cigarette smoker?), the experiment would be simpler.

This example differs from the example of coin tossing in that in coin tossing we create our own sample, whereas in this example it is as though God (or a god) has tossed a coin and we simply observe the outcome. This is not an essential difference. Its only significance is that we can toss a coin as many times as we wish, whereas in the present example the statistician must work with whatever sample God has provided. In the physical sciences we are often able to conduct our own experiments, but in economics or other behavioral sciences we must often work with a limited sample, which may require specific tools of analysis.

A statistician looks at the world as full of balls that have been drawn by God and examines the balls in order to estimate the characteristics ("proportion") of the boxes from which the balls have been drawn. This mode of thinking is indispensable for a statistician. Thus we state a second working definition of statistics: Statistics is the science of observing data and making inferences about the characteristics of a random mechanism that has generated the data.
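As a minimal illustrative sketch of the ratio-of-heads estimate discussed above, the following Python snippet simulates repeated tosses of a hypothetical coin; the assumed true probability 0.6, the seed, and the sample sizes are arbitrary choices made only for illustration. As the number of tosses grows, the printed ratio settles near the unknown probability, which is the intuition behind the first working definition of statistics.

import random

def estimate_heads_probability(p_true, n_tosses, seed=0):
    """Toss a coin with P(heads) = p_true a total of n_tosses times
    and return the ratio of heads over tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p_true for _ in range(n_tosses))
    return heads / n_tosses

for n in (10, 100, 1000, 10000):
    print(n, estimate_heads_probability(0.6, n))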
Coin tossing is an example of a random mechanism whose outcomes are objects called heads and tails. In order to facilitate mathematical analysis, the statistician assigns numbers to objects: for example, 1 to heads and 0 to tails. A random mechanism whose outcomes are real numbers is called a random variable. The random mechanism whose outcome is the height (measured in feet) of a Stanford student is another random variable. The first is called a discrete random variable, and the second, a continuous random variable (assuming hypothetically that height can be measured to an infinite degree of accuracy). A discrete random variable is characterized by the probabilities of its outcomes. The characteristics of a continuous random variable are captured by a density function, which is defined in such a way that the probability of any interval is equal to the area under the density function over that interval. We use the term probability distribution as a broader concept which refers to either a set of discrete probabilities or a density function. Now we can compose a third and final definition: Statistics is the science of estimating the probability distribution of a random variable on the basis of repeated observations drawn from the same random variable.

2 PROBABILITY

2.1 INTRODUCTION

In this chapter we shall define probability mathematically and learn how to calculate the probability of a complex event when the probabilities of simple events are given. For example, what is the probability that a head comes up twice in a row when we toss an unbiased coin? We shall learn that the answer is 1/4. As a more complicated example, what is the probability that a student will be accepted by at least one graduate school if she applies to ten schools, for each of which the probability of acceptance is 0.1? The answer is 1 - 0.9^10, or about 0.65. (The answer is derived under the assumption that the ten schools make independent decisions.) Or what is the probability a person will win a game in tennis if the probability of his winning a point is p? The answer is

P(winning a game) = p^4(1 + 4q + 10q^2) + 20p^5q^3/(1 - 2pq), where q = 1 - p.

For example, if p = 0.51, the formula gives 0.525.

In these calculations we have not engaged in any statistical inference. Probability is a subject which can be studied independently of statistics; it forms the foundation of statistics.

2.2 AXIOMS OF PROBABILITY

Definitions of a few commonly used terms follow. These terms inevitably remain vague until they are illustrated; see Examples 2.2.1 and 2.2.2.

Sample space. The set of all the possible outcomes of an experiment.

Event. A subset of the sample space.

Simple event. An event which cannot be a union of other events.

Composite event. An event which is not a simple event.

EXAMPLE 2.2.1 Experiment: Tossing a coin twice. Sample space: {HH, HT, TH, TT}. The event that a head occurs at least once: HH ∪ HT ∪ TH.

EXAMPLE 2.2.2 Experiment: Reading the temperature (F) at Stanford at noon on October 1. Sample space: Real interval (0, 100). Events of interest are intervals contained in the above interval.

A probability is a nonnegative number we assign to every event. The axioms of probability are the rules we agree to follow when we assign probabilities.

Axioms of Probability

(1) P(A) ≥ 0 for any event A.
(2) P(S) = 1, where S is the sample space.
(3) If A1, A2, . . . are mutually exclusive (that is, Ai ∩ Aj = ∅ for all i ≠ j), then P(A1 ∪ A2 ∪ . . .) = P(A1) + P(A2) + . . .

The first two rules are reasonable and consistent with the everyday use of the word probability. The third rule is consistent with the frequency interpretation of probability, for relative frequency follows the same rule: if, at the roll of a die, A is the event that the die shows 1 and B the event that it shows 2, the relative frequency of A ∪ B (either 1 or 2) is clearly the sum of the relative frequencies of A and B. We want probability to follow the same rule. Axiom (3) also suggests that often the easiest way to calculate the probability of a composite event is to sum the probabilities of all the simple events that constitute it. The calculation is especially easy when the sample space consists of a finite number of simple events with equal probabilities, a situation which often occurs in practice.

When the sample space is discrete, it is possible to assign probability to every event (that is,
every possible subset of the sample space) in a way which is consistent with the probability axioms.2. as in Example 2. . Let n ( A ) be the number of the simple events contained in subset A of the sample space S. at the roll of a die. EXAMPLE 2 . When the sample space is discrete.1 Simple Events with Equal Probabilities A probability is a nonnegative number we assign to every event. An event which cannot be a union of other events. i = 1. Then S = {(i. however.j) represent the event that i shows on the first die and j on the second. A pair of fair dice are rolled once.2 . Sample space: { H H . If. 100) and their unions satisfies the condition. . where S is the sample space. it is not possible to do so. 6 )and hence n ( A ) = 3. so that n ( S ) = 36. the class of all the intervals contained in ( 0 .3.2. for relative frequency follows the same rule. . We have E X A M P L E 2. . TH. .2. Then we have Axioms of Probability Two examples of this rule follow. . The first two rules are reasonable and consistent with the everyday . When the sample space is continuous.1. Sample space: Real interval (0. The set of all the possible outcomes of an experiment. . I What is the probability that an even number will show in a roll of a fair die? We have n ( S ) = 6. . Event.3. 3 . j)I i. .]. Events of interest are intervals contained in the above interval. A is the event that the die shows 1 and B the event that it shows 2. it is possible to assign probability to every event (that is. 2 .3 COUNTING TECHNIQUES EXAMPLE 2 . . such as Chung (1974).HT. In the subsequent discussion we shall implicitly be dealing with such a class. . ( 2 ) P ( S ) = 1. Therefore. 2 Experiment: Reading the temperature (F) at Stanford at noon on October 1. TT}. 2 The number of combznatzons of taking r elements from n elements is the number of distinct sets consisting of r distinct elements which can be formed out of a set of n distinct elements and is denoted E X A M P L E 2. We must also count the number of the hands that contain three aces but not the ace of spades. we can choose two dice out of five which show i: there are C: ways to do so. C p . 2 ) . what is the probability that it will contain three aces? We shall take as the sample space the set of distinct poker hands without regard to a particular order in which the five cards are drawn. The number of permutations of taking r elements from n elements is the number of distinct ordered sets consisting of r distinct elements which can be formed out of a set of n distinct elements and is denoted by P:.2.3.3.r 1 elements.'. 3 ) . (2.1 and so on. R EXAMPLE 2.therefore. The desired probability P is thus given c?.3 Compute the probability of getting two of a kind and three of a kind (a "full house") when five dice are rolled. Therefore. we have P: = 6.so that n ( S ) = 65. and so on. Given a particular (i.1 For example. r! different permutations are possible. l ) . by In Example 2. Let the ordered pair (i.3 1 Counting Techniques 9 n ( i 4. 3 ) . Of these. Thus In the example of taking two numbers from three. ng. and 3 are (1. and finally in the rth position we can choose one of n . Let n. be the number on the ith die.( 3 .l ) ]= 1. the permutations of taking two numbers from the three numbers 1. THEOREM 2. ng. the number of the hands that contain three aces but not the ace of clubs is equal to the number of ways of choosing the 48 two remaining cards out of the 48 nonaces: namely. ( 2 . 1 ) make the same combination.8 2 I Probability = 2. . 1 ) . 
( 3 .2 ) .2 Permutations and Combinations Prooj It follows directly from the observation that for each combination. 2) and ( 2 . 2 ) ] = 3. (We define O! = 1.1 P:=n!/(n--r)!. THEOREM 2. See Exercise 2.. D E F I N 1 T I 0 N 2.2 n ( i + j = 3 ) = n [ ( l . therefore. 2 .3 ) . n 5 ) .3. j) mean that i is the number that appears twice and j is the number that appears three times. ( 1 .1. ( 3 . n ( S ) = c. n(z + j = 4 ) = n [ ( l . Therefore we conclude that the desired probability P is given by The formulae for the numbers of permutations and combinations are useful for counting the number of simple events in many practical problems.j = 2) n [ ( l . We shall take as the sample space the set of all the distinct ordered 5-tuples ( n l . j).3. Therefore.1 we shall solve the same problem in an alternative way. 2. the number of permutations is the product of all these numbers. Note that the order of the elements matters in permutation but not in combination.4 If a poker hand of five cards is drawn from a deck. .3. l ) .3. The number of the distinct ordered pairs.2) .wheren!readsnfactorialanddenotes n ( n . Therefore.l ) ( n . which is also and similarly for hearts and diamonds.5. ( 2 . 1)l = 2. by C:. is P:.) - - - - Proot In the first position we can choose any one of n elements and in the second position n . ( 1 . we must multiply ($ by four. nq. R + - D EF l N I T I 0 N 2. 2 ) .3. ( 2 . ) Proof. . Using the four axioms of conditional probability. THEOREM 2 . Thus we have proved that (2). Therefore. Therefore we can eliminate the last term of (2. Since E fl Al. . and establish axioms that govern the calculation of conditional probabilities. . i = 1. 2. .) ( 1 ) P ( A I B) r 0 for any event A. . I B) = P ( A I B) + p(A21 B ) + . Theorem 2.4. be mutually exclusive such that P(A1 U Ap U . . .4.3 may be used to calculate a conditional probability in this case.4.2 Bayes' Theorem Axioms of Conditional Probability (In the following axioms it is assumed that P ( B ) > 0. (2. Most textbooks adopt the latter approach.4 1 Conditional Probability and Independence 11 2. C P(E I A. 2 . But from axioms ( 2 ) and ( 3 ) we can easily deduce that P(C B ) = 0 i f C fl B = 0.1) P ( A I B) = P ( A f l B I B) + P ( A n fl( B). from axiom ( 3 ) of probability. provided P ( B ) > 0 . Theorem 2. are mutually exclusive and their union is equal to E. 1 Prooj From axiom ( 3 ) we have (2. E fl A. . ( 2 ) . more simply. ( 3 ) If (4fl BJ. . 4 . Then P(AzI E ) = . It is easy to show the converse. the counting techniques of Section 2. ( 4 ) If B 3 H and B 3 G and P(G) P(H IB) . A.1. . we can prove + 0 . then Bayes' theorem follows easily from the rules of probability but is listed separately here because of its special usefulness. . 2 (Bayes) Let eventsAl.1 as the only axiom of conditional probability. From Theorem 2. . E fl AZ. and ( 4 ) or. They mean that we can treat conditional probability just like probability by regarding B as the sample space.4. j=l P(E I Az)P(A. this conditional probability can be regarded as the limit of the ratio of the number of times one occurs to the number of times an odd number occurs.4. Axiom ( 4 ) is justified by observing that the relative frequency of H versus G remains the same before and after B is known to have happened. Therefore we may postulate either axioms ( 2 ) . In general we shall consider the "conditional probability of A given B. then P(A1 U A2 U .10 2 I Probability 2.4. . . ." denoted by P ( A I B ) .4. 
The theorem follows by putting H = A fl B and G = B in axiom ( 4 ) and noting P ( B I B) = 1 because of axiom ( 2 ) .) i = l . 4 . . O The reason axiom (I) was not used in the above proof is that axiom ( 1 ) follows from the other three axioms.P(H) Axioms ( I ) . Let E be an arbitrary event such that P ( E ) > 0. 4 . .1) to obtain I The concept of conditional probability is intuitively easy to understand.1 shows P ( A I B ) = n ( A fl B ) / n ( B ) . THEOREM 2 . .and ( 3 ) are analogous to the corresponding axioms of probability. .4 2. . are mutually exclusive. for any pair of events A and B in a sample space. U A.2) P(A B) I = P ( A fl Bl B ) . ( 2 ) P ( A B) = 1 for any event A 3 B.4. we have. and ( 4 ) imply Theorem 2. (3). For example. In the frequency interpretation. .1 CONDITIONAL PROBABILITY AND INDEPENDENCE Axioms of Conditional Probability where B denotes the complement of B. If the conditioning event B consists of simple events with equal p r o b ability. it makes sense to talk about the conditional probability that number one will show in a roll of a die given that an odd number has occurred.4.1. . . 2. ( 3 ) .)P(A. n. .) = 1 and P ( 4 ) > 0 for each i. we have P ( A I B ) =P(AflB)/P(B)foranypairofeventsAand B such that P ( B ) > 0. ) by Theorem 2.4. I -- Note that independence between A f l C and B or between B f l C and A follows from the above. aces and nonaces. P(A. we have E X A M P L E 2. without paying any attention to the other characteristics-suits or numbers. Then A. and C . Let A. To summarize. are said to be mutually independent if any proper subset of the events are mutually independent and P(A1 f l A2 The term "independence" has a clear intuitive meaning. Definition 2. . the above formula enables us to calculate the probability of obtaining heads twice in a row when tossing a fair coin to be 1/4.4. because P ( A f l B n C ) = P ( A f l B) = 1/4. . O 2. and let C be the event that either both tosses yield heads or both tosses yield tails. n A.) = P(AI)P(A2) . We shall first compute the probability that three aces turn up in a particular sequence: for example.4.5 1 Probability Calculations 13 Thus the theorem follows from (2.12 2 I Probability 2. Henceforth it will be referred to simply as "independence.4. and C are pairwise independent but not mutually independent. First we shall ask what we mean by the mutual independence of three events. by the repeated application of Theorem 2. It means that the probability of occurrence of A is not affected by whether or not B has occurred.). that is. the above equality is equivalent to P ( A ) P ( B ) = P ( A n B ) or to P ( B ) = P ( B I A ) . The following are examples of calculating probabilities using these rules. let B be the event that a head appears in the second toss. A*.3 Events Al. Let A be the event that a head appears in the first toss of a coin. Clearly we mean pairwise independence. independence in the sense of Definition 2.4. that is. whereas - P ( A ) P ( B ) P ( C )=I/.4.4. Thus we should have n . . suppose the first three cards are aces and the last two nonaces. Because of Theorem 2. D E F I N IT I o N 2 . A.3) and (2. we shall solve the same problem that appears in Example 2.1. which may be stated as the independence between A fl B and C.5. Then. denote the event that the ith card is a nonace. Since the outcome of the second toss of a coin can be reasonably assumed to be independent of the outcome of the first.4..1. We do not want A and B put together to influence C.3. . 
B.3 Statistical Independence We shall first define the concept of statistical (stochastic) independence for a pair of events.1 between any pair. Using the axioms of conditional probability. 2 Events A. B. .4) and by noting that P ( E n A. P ( A f l B) = P ( A n B I C). In the present approach we recognize only two types of cards. .4." We can now recursively define mutual independence for any number of events: D E F I N I T I O N 2. denote the event that the ith card is an ace and let N.5 PROBABILITY CALCULATIONS We have now studied all the rules of probability for discrete events: the axioms of probability and conditional probability and the definition of independence. The following example shows that pairwise independence does not imply mutual independence. and C are said to be mutually indqbendent if the following equalities hold: . 4 .)P(A. A . . 2.1. B. But that is not enough. .) = P ( E I A.4.1 needs to be generalized in order to define the mutual independence of three or more events. By repeated application of Theorem 2.4) 1 1 1 P(A f l B I C) = P(A I B)P(B I C ) = .5 / Probability Calculations 15 Similarly.) Calculate P ( A I C) and P ( A I Since A = ( A n B) U ( A fl B ) . Calculating P ( A 1 C) is left as a simple exercise. and C. P ( l and 2 1 1 ) = - P [ ( l and 2) n 11 P(l) P(l and 2) P(1) (2.2).5. by our assumptions.If we try Therefore.1. (2.5.4). we have = P(A I B fl C)P(B I C). What is the probability that 1 and 2 are drawn given that 1 or 2 is drawn? What is the probability that 1 and 2 are drawn given that 1 is drawn? By Theorem 2. .1 illustrates the relationship among the relevant events in this example. Figure 2. Suppose that there are four balls in a box with the numbers 1 through 4 written on them. 2 2 4 .5. P ( A I B ) = '/2.5. Furthermore. P ( A I B) = 1/3.P(l and 2) P ( l or 2) 2. Finally. affect each other in the following way: P ( B I C ) = '/2. E XA M PLE 2. A. (2.3 is somewhat counterintuitive: once we have learned that 1 or 2 has been drawn.1) by c:.5.5. assume that P ( A ( B 1 7 C) = P ( A B ) and that P ( A I B n C) = P ( A B ) . from (2. and PC. The result of Example 2. EXAMPLE 2. and (2. we have by axiom (3) of conditional probability E X A M PLE I c). pB. B..5) we obtain P ( A I C ) = 5/1?.4.5.4.1 we have (2..2 Suppose that three events.5. with respective probabilities PA.1). and each way has the same probability as (2.5. if B or B is known.= . A. Therefore the answer to the problem is obtained by multiplying (2. P ( B I C) = l/g. and we pick two balls at random.4 There is an experiment for which there are three outcomes. Similarly.3 Probability calculation may sometimes be counterintuitive. B.5. Ei * P(l and 2 1 1 or 2) = P [ ( l and 2) f l (1 or 2)] P ( l or 2) .2) P ( A I C) = P ( A f l B 1 C) + P ( A n B I C). (In other words. C or C does not affect the probability of A. and C.5. . learning further that 1 has been drawn does not seem to contain any more relevant information about the event that 1 and 2 are drawn.5. But it does.6) There are C! ways in which three aces can appear in five cards.14 2 I Probability 2. . what is the probability that A occurs before B does? Assume pc # 0.4.1) Show that Theorem 2.005 of the popu- 4.3.- . (Section 2.5 This is an application of Bayes' theorem.16 2 I Probability I Exercises 17 1and 4 lor2 Sample space lation actually has cancer. (3).1 Characterization of events this experiment repeatedly. which in this case has turned out to be correct. (Section 2.1 implies (9). similarly. 
compute the probability that a particular individual has cancer. be the event that A happens in the ith trial.4. 2.- which gives the same answer as the first method. (c) (A .C) n (B . substantiated by the result of the rigorous first approach.C. what is the probability that a Stanford faculty member who voted for Bush is a Republican? 5. The second method is an intuitive approach.4. . Let P be the desired probability.C) = (A n B) . Let C indicate the event that this individual actually has cancer and let T be the event that the test shows he has cancer. E X A M P L E 2. (b) A U ( B n C ) = ( A U B ) n ( A U C ) . --- .1) Complete Example 2. (Section 2.2) Prove (a) A n ( B U G ) = ( A n B) U ( A n C ) . Then we have by Theorem 2.4.4.3) Fill each of the seven disjoint regions described in the figure below ". Then we have EXERCISES 1. given that the test says he has cancer.2.3. (1) Let A. Assuming that 0. - (2) We claim that the desired probability is essentially the same thing as the conditional probability of A given that A C WB has o c c m d . (Section 2.2) Suppose that the Stanford faculty consists of 40 percent Democrats and 60 percent Republicans. If 10 percent of the Democrats and 70 percent of the Republicans vote for Bush. We shall solve this problem by two methods. and (4) of the axioms of conditional probability.5.2 (Bayes) I 3 and 4 F lC UR E 2. and define Bi and C. Thus 3. Suppose that a cancer diagnostic test is 95% accurate both on those who do have cancer and on those who do not have cancer. (Section 2. 5 that the ace will turn up at least once? ing to a certain probability distribution. (Section 2. and (2.1 2 A random variable is a real-valued function defined . (2.4.6). what is the probability that you will be admitted into at least one school? Find the limit of this probability as n goes to infinity. When we speak of a "variable. (2.4. and (2. We have mentioned discrete and continuous random variables: the discrete random variable takes a countable number of real numbers with preassigned probabilities.8) is not.5).8) is satisfied but (2.1.7) are not. (Section 2. How many rolls are necessary before the probability is at least 0. In general.4. We have already loosely defined the term random variable in Section 1.4.4.2 as a random mechanism whose outcomes are real numbers. R A N D O M VARIABLES A N D PROBABILITY DISTRIBUTIONS 6. what is the probability of winning a tennis game under the "no-ad" scoring? (The first player who wins four points wins the game.2. At our level of study.5) If the probability of winning a point is p. and the continuous random variable takes a continuum of values in the real line according to the rule determined by a density function.5) A die is rolled successively until the ace turns up.4.4.7) are satisfied but (2. (c) (2."we think of all the possible values it can take. over a sample space.4.5. 7. when we speak of a "random variable. Definition 3.4.) 3. 1 A random variable is a variable that takes values accord- 9.5).1 DEFINITIONS OF A RANDOM VARIABLE 8." we think in addition of the probability distribution according to which it takes all possible values.1 is just as good.5) Calculate P ( A ) C) in Example 2. 10. (Section 2.71.4.5) Compute the probability of obtaining four of a kind when five dice are rolled. (b) (2.8) are all satisfied. (2.5) If the probability of being admitted into a graduate school is l / n and you apply to n schools. I .5).4. we can simply state D E F I N ITI o N 3. 
(2.18 2 I Probability by an integer representing the number of simple events with equal probabilities in such a way that (a) (2. Later in this chapter we shall also mention a random variable that is a mixture of these two types. Defining a random variable as a function has a certain advantage which becomes apparent at a more advanced stage of probability theory.6). The customary textbook definition of a random variable is as follows: D E F I N IT I 0 N 3.4. (Section 2. Note that the idea of a . and (2. (Section 2.6). . .1 and Definition 3.1 DISCRETE RANDOM VARIABLES Univariate Random Variables The following are examples of several random variables defined over a given sample space. Note that Xi can hardly be distinguished from the sample space itself. It indicates the little difference there is between Definition 3.20 3 1 Random Variables and Probability Distributions EXAMPLE 3 . i = 1.1 to the case of a discrete random variable as follows: D E F I N ITI o N 3. In the next section.p. 2 3. We must. 1 Experiment: A throw of a fair die. we shallillustrate how a probability function defined over the events in a sample space determines the probability distribution of a random variable.2. . of course. 2 . . . The arrows indicate mappings from the sample space to the random variables.. 2 .1. we can forget about the original sample space and pay attention only to what values a random variable takes with what probabilities. It is customary to lues it takes by by a capi denote a random lowercase letters. It means the random variable X takes value x. with probability p.) = p.1. have Z:='=.= 1.1 A discrete random variable is a variable that takes a countable number of real numbers with certain probabilities. EXAMPLE 3 . Experiment: Tossing a fair coin twice.2 as well.. P 7 Probability Sample space H H I 7 1 P 1 HT I TH I TT I 3. P B P 6 1 X (number of heads in 1st toss) 1 6 1 6 Probability Sample space 6 Y (number of heads in 2nd toss) In almost all our problems involving random variables.1. Note that the probability distribution of X2 can be derived from the sample space: P(X2 = 1) = '/ 2 and P(X2 = 0) = %. 2. n. for a sample space always has a probability function associated with it and this determines the probability distribution of a particular random variable.2 1 Discrete Random Variables 21 probability distribution is firmly embedded in Definition 3.1.2 3.2. We specialize Definition 3. n may be w in some cases. The probability distribution of a discrete random variable is completely characterized by the equation P(X = x.2. I ifP(Y=yj)>o.2..2.pij.2.2.1. Since a quantity such as ( 1 . We have -P2j Plj (Y = yj) = P(X = x.) = P ( X = x. See Table 3. . we do not have a random variable here as defined in Definition 3.1. AMixed to the end of Table 3. We call pq the joint probability.Y = y. we can define Conditionalprobability P(X = xi :. Proo$ ("only if" part).1 3. ) . . T H Eo R E M 3.I :g . .. n. i = 1 .2. But it is convenient to have a name for a pair of random variables put together. m.2 1 Discrete Random Variables 23 3. . i = 1. It is instructive to represent the probability distribution of a bivariate random variable in an n X m table. Y = y .) for all i.3 which does not depend on j.4. equivalently. ) = x P ( X = x . P(X = x. J= 1 Discrete random variables X and Y with the probability distribution given in Table 3. P22 Pzm PZo The probability distribution of a bivariate random variable is determined by the equations P ( X = xi. 2. 
the probability distribution of one of the univariate random variables is given a special name: mcrginal probability distribution.. Using Theorem 2. If X and Y are independent. Consider. for example.1 we defined independence between a pair of events.2. When we have a bivariate random variable in mind. j. the first two rows are proportional to each other. j = 1 . Y = yj) = pq.2.2 A bivariate discrete random variable is a variable that takes a countable number of points on the plane with certain probabilities.) are independent for all i.1) P ( ~I I ~ j ) p ( ~-j )P(x1 I ~ P(x2 yj)Pbj) j ) I P(x2 I~ j ) for every j. 2. every column is proportional to any other column. = E~=lpij comes from the positions of the marginal probabilities in the table. .) and the event (Y = y. again. 1 ) is not a real number. That is to say.(The word marginal calculated by the rules pi. Therefore..) B y looking at the table we can quickly determine whether X and Y are independent or not according to the following theorem.1 are independent if and only if every row is proportional to any other row. j. In Definition 2.1 Marginal probability m P ( X = x . n.1 are a column and a row representing marginal probabilities and poi = Cy=.1.) P(Y = yj) (3. . . . Y = y. D E F I NIT Io N 3 .2 Bivariate Random Variables Probability distribution of a bivariate random variable The last row in Example 3. 3 Discrete random variables are said to be independent if the event ( X = x.)P(Y = y. we have by Definition 3. The same argument holds for any pair of rows and any pair of columns. 2. n and/or m may be in some cases. Because of probability axiom ( 3 ) of Section 2. . Thus we have D E F I N IT I o N 3. the marginal probability is related to the joint probabilities by the following relationship.22 3 1 Random Variables and Probability Distributions TABLE 3.. or. 2 .2 shows the values taken jointly by two random variables X and Y.. the first two rows.2. .4. Here we shall define independence between a pair of two discrete random variables. 1) we have (3. n. . Therefore X and Y are independent. Y. .=I h= 1 m 9 i = 1 . 2.3) and (3.24 3 1 Random Variables and Probability Distributions 3. . ..5 generalizes Definition 3.)=C. z = zk) > 0. 7%. .2 as follows.2 1 Discrete Random Variables 25 ("if' part).) and sum over j to get (3.I I X = I ) ='/.3 Multivariate Random Variables P(xiI yj) = cj . E X A M P L E 3..2.4 A T-variate discrete random variable is a variable that takes a countable number of points on the T-dimensional Euclidean space with certain probabilities.2. but x2and Y' Random variables X and Y defined below are not indeare independent. . m.4 =OF -- - Conditional p-obability P(X = x8.. E X AM PLE 3.6) ' P(q ' J for ) every i.3) P ( Y = . .) P(Y = y. From (3. 2 .) i f P ( Y = y.. Then from (3. P(xi) for every i and j.2.2.. .1.2. and ir. Y =y. z = 2.i = 1. .) = C C P ( X = x. which shows that X and Y are not independent.P(X~I foreveryi. ' Note that this example does not contradict Theorem 3. and therefore X cannot be regarded as a function of x2. P ( Y = l I X = O ) ='/4 P ( Y = O I X = O ) ='/. . .4) we have (3. The word function implies that each value of the domain is mapped to a uniquevalue of the range.) = ctk . Definition 3. Summing both sides over i. to be unity for every j.4) 0 ) = %. we can define Marginal probability P(X = x.k. z = 2. 2.2. .. As in Section 3. 2. j = 1.5.. . We can generalize Definition 3. q.3..) Multiply both sides of (3.2.. Z P(Z = zk) if P ( Z = zk) > 0 = zk) pendent. P(Y = -1 IX = I P(X.2. Y = y . Y = yl. 
P(xb) . I Y = y.andj. . . .3) by P(y. Z = zk). % % Then we have P(Y = 1 I X = 1) = (%)/(%I = % and P(Y = 1 I X = (%)/ (%) = 2/5. Suppose all the rows are proportional to each other.2.2. .~.2.2. .Y = yj I Z = zk) = P(X = x. we determine c. P(x.2.2. Z = zk) = pUk.2. Therefore (3. .3 Let the joint probability distribution of X and Y be given by The probability distribution of a trivariate random variable is determined by the equations P ( X = xi.2. j. k = 1. P(xk) for every i and k. 0 We shall give two examples of nonindependent random variables. D E F I N IT I o N 3. P(X = x.I~.2. . .5) ( J P(xJ . Z = zk) = P(X = x. Y = y. 3. as illustrated by Example 3.3. 2 . For such a 1)iPCg ?-. Z = 1 for i even. .1) simply as the area under f (x) over the interval [xl. Y = ~ ) a way analogous to Definition 3.. define EXAMPLE 3 . In most practical applications.= = Zk. For our discussion it is sufficient for the reader to regard the right-hand side of (3. There are many possible answers. then X is a continuous random variable and f (x) is called its density function. It is important to note that pairwise independence does not imply mutual independence. for example.2.3 UNlVARlATE CONTINUOUS RANDOM VARIABLES Density Function 3. 1 An example of mutually independent random variables follows.3 1 Univariate Continuous Random Variables 27 DEFINITION 3. and Z are mutually independent because Then X. Following Definition 3. . rule determined by a density function. We shall allow xl = -a. = 0 otherwise. = 0 otherwise Y=1 for35z56. Y = 1.3. 6 (3.I)* . Find three random variables (real-valued functions) defined over S which are mutually independent. and so on for all eight possible outcomes.. .2. Y = ~ ) P ( X = ~ . 5 Suppose X and Y are independent random variables which each take values 1 or -1 with probability 0. see. 3 .1: A continuous random variable is a variable that takes a continuum of values in the real line according to the = P(X = 1.26 3 1 Random Variables and Probability Distributions 3. and Z are not mutually Then Z is independent of either X or Y.2. Y = yj. z . Y = 1) = '/4.5 Afinite set of discrete randomvariablesX. j. For a precise definition. by axiom (2) of probability.3. Apostol (1974).2. = 0 otherwise.) = P ( X = xj)P(Y = y.1) P(xi 5 X 5 x2) = Ixy(x)dx XI X = 1 for i 5 4.5 and define Z = XY. We assume that the reader is familiar with the Riemann integral. Z = ~ ) = P ( ~ = I ~ X = ~ . x2 satisfying xl 5 XZ. The following defines a continuous random variable and a density at the same time. but X. x2].. P(X = l ) P ( Y = l ) P ( Z = 1) = '/s. ..1 that the probability that a continuous random variable takes any single value is zero. and/or x:! = w.3. are mutually independent if P(X = x. Z = 1) = P ( i = 4) = '/. and therefore it does not matter whether < or 5 is used within the probability bracket. but we can. 2 .1. Then. .. Or we can define it in P ( X = ~ . for example..1 whereas In Chapter 2 we briefly discussed a continuous sample space. f (x) will be continuous except possibly for a finite number of discontinuities. .)P(Z zk) . Y P ( X = 1. 2. forall i. k . we must have J"f(x)dx = 1.5.. we define a continuous random variable as a real-valued function defined over a continuous sample space. If there is a nonnegative function f (x) defined over the whole line such that D E F I N ITI o N 3 . . It follows from Definition 3. = P(X = l)P(Y = for any XI. EXAMPLE 3 . Y. We need to make this definition more precise. Y = ~ . however. Y independent because 3. . 
Let the sample space S be the set of eight integers 1 through 8 with the equal probability of '/s assigned to each of the eight integers. . it will satisfy (3. X Z .1) exists. it must be nonnegative and the total volume under it must be equal to 1 because of the probability axioms. For such a function we may change the order of integration so that we have (3.4. and therefore f ( x ) can be a density function.5) f( x I S ) = = P(X E S ) 0 f(x' for x S.1) and hence qualify as a joint density function. yl 5 B.4.2. y) defined over the whole plane such that given a (3. 3. as desired.for a 5 x 5 b. which was given for a univariate continuous random variable. is defined by I f f (x.4 1 Bivariate Continuous Random Variables 29 function the Riemann integral on the right-hand side of (3. 2 We may loosely say that the bivariate continuous random variable is a variable that takes a continuum of values on the plane according to the rule determined by a joint density function defined over the plane.1). y) is called the joint density function. We shall give a few examples concerning the joint density and the evaluation of the joint probability of the form given in (3.3.4. From the above result it is not difficult to understand the following generalization of Definition 3.1 Now we want to ask the question: Is there a function such that its integral over [ x l . We can easily verify that f ( x 1 a (3.3) 5 X 2 Let X have density f ( x ). denoted by f ( x 1 S ) . D E F I N I T I O N 3.3.1 If there is a nonnegative function f ( x . = for any X I .3. otherwise.x p ] is equal to the conditional probability given in (3.3. The second condition may be mathematically expressed as 0 otherwise. . for any closed interval [xl. Y) is a bivariatt. The conditional density of X b. is defined by I (4 f ( x I a I X 5 b) = -f .3. y) is continuous except possibly over a finite number of smooth curves on the plane. The rule is that the probability that a bivariate random variable falls into any region on the plane is equal to the volume of the space under the density function over that region. continuous random variable and f (x.3. denoted by f ( x a 5 X 5 b).3. y) to be a joint density function.2.3. and the details are provided by Definition 3. Then.1.we have from Theorem 2. D EFI N IT I o N 3 . y1. D E F l N IT I 0 N 3. b ] .4) 5 X I b) defined above satisfies P ( x l ~ X I x 2 1 a 5 X < b ) = f(xIa5XSb)dx I: whenever [a. yp satisfylng xl 5 xp.1 Bivariate Density Function Suppose that a random variable X has density f ( x ) and that [a.4 BlVARlATE CONTINUOUS RANDOM VARIABLES 3. b] 3 [ x l . We shall give a more precise definition similar to Definition 3. in addition to satisfylng the above two conditions.28 3 1 Random Variables and Probability Distributions 3. then (X.xp]. provided that $I: f (x)dx # 0.2)? The answer is yes.3. Then the conditional density of X given X E S. xp] contained in [a. 3 . In order for a function f (x.3 Let X have the density f ( x ) and let S be a subset of the real line such that P ( X E S ) > 0.2 Conditional Density Function 3. b] is a certain closed interval such that P ( a 5 X S b) > 0.4.4.3. Let S be as in Figure 3.9). V = y. y)dxdy. where S is a subset of the plane.e . Y ) falls into the shaded quarter of the unit circle in Figure 3. we need the following formula for integration by parts: FIGURE 3. 1 Iff(x. Y ) falls into the shaded triangle in Figure 3. We write this statement mathematically as (3. 
and performing the mathematical operation of integration.which will have a much more general usage than the geometric approach. as in the following example. (3.3 Domain of a double integral: I Domain of a double integral: 11 Domain of a double integral: 111 where U and V are functions of x. that the square in each figure indicates the total range of ( X . Since P ( X > Y ) is the volume under the density over the triangle.1)we have Y Y To evaluate each of the single integrals that appear in (3.4. Y). The event ( X > Y ) means that ( X . a = 1.4.7) = [-)us]: + = e l + [ .4. Y ) E Sl = I1 f (x. P ( X > Y ) = 1/2. y ) = 1 for 0 < x < 1 .y) = ~ y e ~ ( " + ~ ) .4).~ ~= : 1 . the probability that ( X .2. calculate P ( X > Y ) and P(X' + Y2 < 1 ) .3. it must equal the area of the triangle times the height of the density.4.4). Y ) . Putting U b = in (3. Then we have . as shown in Figure 3. y) over S . Assuming f ( x .4. y > 0 . y) is a complicated function.4. x > O .we have = -eLX.9)may not work iff (x. a n d O o t h e r w i s e . A region surrounded by two horizontal line segments and two functions of y may be similarly treated. y ) is a simple function. Therefore.9) P t (X. Therefore from (3. Iff (x. We shall show the algebraic approach to evaluating the double integral (3. y) plane. the inside of a circle) can be also evaluated as a double integral off (x. V = x. and (3. The event (x2+ Y* < 1 ) means that ( X .4 1 Bivariate Continuous Random Variables 31 E X A M P L E 3 . and b = 1 in (3.4.7)we obtain If a function f (x. y) is a joint density function of a bivariate random variable (X.4. We shall consider only the case where S is a region surrounded by two vertical line segments and two functions of x on the ( x .3.4.5) we have (3. p ( x 2 + Y2 < 1 ) = ~ / 4Note . EXAMPLE 3.4.6).1 FIGURE 3. Y ) falls into a more general region (for example. which is 1.2e-1.1. 4 .2 = Putting U = -eCY. we can treat any general region of practical interest since it can be expressed as a union of such regions.4.2 FIGURE 3. Y < I ) ? B y (3. w h a t is P ( X > 1 . y). The double integral on the right-hand side of (3. this intuitive interpretation will enable us to calculate the double integral without actually This geometric approach of evaluating the double integral (3. a = 0.30 3 1 Random Variables and Probability Distributions 3.5).9) may be intuitively interpreted as the volume of the space under f (x.4. 0 < y < 1 and 0 otherwise. Once we know how to evaluate the double integral over such a region. Therefore.4. 4. h(y) = 0 in (3. 2 0 4 Volume of the ith slice z f (xi.2 are special cases of region S depicted in Figure 3. To evaluate P(X > Y ) .4.4. y ) = 1.= 1 2 /s. 4 . .4 1 Bivariate Continuous Random Variables 33 Figures 3.~ ) .4. and h(x) = 0 in (3. Y) falls into the shaded region of Figure 3.) h(xJ f ( q .11) over i. we get (3. Note that the shaded regions in When we are considering a bivariate random variable (X.is called the marginalp-obability. y) = 1. The following examples use (3. EXAMPLE 3. The first slice of the loaf is also described in the figure.15) f (x)dx = XI / [ ~ ( t+'(tjdt.4.10) indeed gives the volume under f (x.y)dy .16) d o = 8 cos 0 1 'iT +o r =.-.2 M a r g i n a l Density But the limit of the right-hand side of (3.10) so that we have EXAMPLE 3 . g(y) = 1 . In order to apply (3.4. +'(t) denotes the derivative of with respect to t.I I 32 3 1 Random Variables and Probability Distributions 3.10). 
)] fl (3.12) as n goes to is clearly equal to the right-hand side of (3.2 using formula (3.1 and 3. the p r o b ability pertaining to one of the variables.10) to this problem.12) Suppose f (x. We have approximately (3. b = 1.10).4 the volume to be evaluated is drawn as a loaf-like figure. g(x) = x. we have + + (3. 3. Calculate P(Y < %).4.10) 11 S f (x. b 1.4.4. We shall show graphically why the right-hand side of (3.3 We shall calculate the two probabilities asked for in Example 3.x and = 0 otherwise. g ( x ) = p ( x 2 + Y* < 1) = 1: (17~) dx = To evaluate the last integral above.4 Double integral (3.4.4.4.4.x.The following relationship between a marginal probability and a joint probability is obviously true.4.x .yldxdy = 1: [r h(x) f(x. y) over S. such as P(xl 5 X 5 q ) or P(yl 5 Y 5 y2).10) so that we have m.3. Summing both sides of (3. Y ) .4. Here. a = 0 . (xi . 0 < y < 1 .we put f (x. y)dy] dx. Then. and h ( x ) = 0 so that we have (3.10) to evaluate the joint probability. In Figure 3. y ) = 24 xy for 0 < x < 1.4. we shall put x = cos 0. yjdy .y.411 if is a monotonic function such that +(tl) = xl and +(t2) = xp.14) To evaluate P(X' + y2 < l ) . (x.we put f ( x . since dx/d0 = -sin 0 and sin20 + cos20 = 1.5. . . Event (Y < %) means that ( X . Next. B * 0.4. 4 Volume z .). we must reverse the role of x and y and put a = 0. we need the following formula for integration by change of variables: FI CU R E 3. b = %. the density function of one of the variables is called the margznal density.4. 1 We are also interested in defining the conditional density for one of the variables above.4 More generally.4. A generalization of Definition 3. It may be instructive to write down the formula (3.Y) have thejoint density f(x.19) equation (3.4. Y ) .4. Similarly. b = m. Y) E S. THEOREM 3 .8. Under the first situation we shall define both the joint conditional density and the conditional density involving only one of the variables. 3 Let (X. it can be obtained by integrating f (x.y) a n d l e t s be a subset of the plane such that P[(X. Domain of a double integral for Example 3.34 3 1 Random Variables and Probability Distributions 3. D E F I N I T I O N 3 .4. Y) E S. Y) E S] > 0. y) E S.3.21) F l G UR E 3. denoted by f (x. see Example 3. We have For an application of this definition.3.3. Formally.22) explicitly for a simple case where a = -a. Proo$ We only need to show that the right-hand side of (3.3 Conditional Density We shall extend the notion of conditional density in Definitions 3.4. and g(x) = yn in Figure 3. is defined by I (3.4.We assume that P[(X.we have . Y ) E Sl 0 for (x. one may replace xl 5 X 5 x 2 in both sides of (3.y) and l e t s be a subset of the plane which has a shape as in Figure 3.4. given a conditioning event involving both X and Y. 4 .Y) have thejoint density f(x.1). We shall explicitly define it for the case that S has the form of Figure 3. Then = 0 otherwise. Y) given (X.3. say.18) by x E S where S is an arbitrary subset of the real line. Then the conditional density of (X.3 is straightforward: D E F I N I T I O N 3 . X. h(x) = yl. 4 .3. We shall consider first the situation where the conditioning event has a positive probability and second the situation where the conditioning event has zero probability. when we are considering a bivariate random variable (X.1 shows how a marginal density is related to a joint density. Theorem 3. is defined by Letf(x.4 1 Bivariate Continuous Random Variables 35 3.3. y S). y I S) with respect to y. 2 Let (X.2 and 3.3.5 f (x. 
Then the conditional density of X given (X. otherwise. y 1 S) = = f (ay) P[(X. 4 .y) be theQointdensityofXandYandletf(x) be the marginal density of X. Y) E S] > 0. Since in this case the subset S can be characterized as yl 5 Y 5 yn.3 to the case of bivariate random variables. denoted by f (x I S). xp].24). I yl 5 Y 5 y2).4. where yl and c are arbitrary constants.3. B y putting a = -m.4. 4 . we have from (3. and (3.4. y)dydx The conditional density f ( x Y = yl I + c x ) exists and is given by The reasonableness of Definition 3. DEFINITION 3.4 provided the denominator is positive. 4 .1. where yl < yz. (3. Proof. Next we shall seek to define the conditional probability when the conditioning event has zero probability. 5 The conditional density of X given Y = yl + cX.4. 0 Y 5 lim P(xl Y 2+Y X 5 xpr yl y:.y1 + cx)dx P(xl = 5 X 1 5 xz I Y = yl 5 + cX) + cX 5 Therefore the theorem follows from (3.4. xp satisfying xl Now we can prove 5 q. + cx in Figure 3.4. if it exists.9. since P(Y = yl + cX) = 0.4.4.23) f(x I yl 5 Y 5 yn) - Il' I f (x.x2].4. write as a separate theorem: .36 3 1 Random Variables and Probability Distributions 3. An alternative way to derive the conditional density (3.that when (3.4. We shall confine our attention to the case where the conditioning event S represents a line on the (x. h(x) = yl + cx.23) is integrated over an arbitrary interval [xl. We begin with the definition of the conditional probability Pixl 5 X 5 x g 1 Y = y1 + c X ) and then seek to obtain the function of x that yields this probability when it is integrated over the interval [xl. y) plane: that is to say. denoted by f ( x Y = yl cX). b = m.26) can be obtained by defining for all xl.4. S = { ( x . Note that this conditional probability cannot be subjected to Theorem 2. is defined to be a function that satisfies I + Then the formula (3.y) 1 y = yl + cx].24) + The conditional probability that X falls into [xl. (x. 2 1 Bivariate Continuous Random Variables 37 (3. We have J \&.2.26) is as follows.q ] cX is defined by rm f by the mean value theorem of integration.1 - - given Y = yl (3. YIuyuX by Theorem 2. y)dy f ( x .25). + cX).3 can be verified by noting .4 THEOREM 3 .22) - DEFINITION 3 .4. see Example 3. and g ( x ) = y:. Next we have - For an application of Theorem 3.27).4.4.4. it yields the conditional probability P(xl 5 X 5 x:. 3. its connection to Definition 3. In Example 3. . ye-? over S. The next definition generalizes Definition 3. . consider Examples 3.6 Joint density and marginal density i T H E o R E M 3.38 3 1 Random Variables and Probability Distributions 3. . THEOREM 3 . 6 Continuous random variables X and Y are said to be independent iff (x.) 3.4.y) >(lover S and f (x. X and Y are independent because S is a rectangle and xye-(x+~)= xe-X . This may be a timeconsuming process.4 1 Bivariate Continuous Random Variables 39 for all X . is more apparent. . y reader should be able to generalize Definition 3. X and Y are not independent because S is a triangle. yl.4. D E F 1 N I T I 0 N 3 . are said to be mutually independent if Finally.2. .3/q) is outside of the triangle whereas both f (x = %) and f (y = %) are clearly nonzero. y) = g(x) h(y) over S.4. 3/4) = 0 since the point (1/2. which defined independence for a pair of discrete random variables. Note that if g(x) = cf (x) for some c.4. 7 provided that f (yl) > 0.3 i The conditional density of X given Y = yl. Thus stated.2. conditional densitJr. where g(x) and h(y) are some functions of x and y . F I G u R E 3. 
we shall define the notion of independence between two continuous random variables. The next definition generalizes Definition 3.2.3, which defined independence for a pair of discrete random variables.

DEFINITION 3.4.6   Continuous random variables X and Y are said to be independent if f(x, y) = f(x) f(y) for all x and y.

DEFINITION 3.4.7   A finite set of continuous random variables X, Y, Z, . . . are said to be mutually independent if their joint density equals the product of their marginal densities. (We have never defined a multivariate joint density f(x, y, z, . . .), but the reader should be able to generalize Definition 3.4.1 to a multivariate case in the same way that Definition 3.4.7 generalizes Definition 3.4.6.)

Definition 3.4.6 implies that in order to check the independence of a pair of continuous random variables, we should obtain the marginal densities and check whether their product equals the joint density. This may be a time-consuming process. The following theorem, stated without proof, provides a quicker method for determining independence.

THEOREM 3.4.4   Let S be a subset of the plane such that f(x, y) > 0 over S and f(x, y) = 0 outside S. Then X and Y are independent if and only if S is a rectangle (allowing −∞ or ∞ to be an end point) with sides parallel to the axes and f(x, y) = g(x) h(y) over S, where g(x) and h(y) are some functions of x and y, respectively. Note that if g(x) = c f(x) for some constant c, then h(y) = c⁻¹ f(y).

As examples of using Theorem 3.4.4, consider Examples 3.4.1 and 3.4.2. In Example 3.4.1, X and Y are independent because S is a rectangle and xy e^{−(x+y)} = x e^{−x} · y e^{−y}. In Example 3.4.2, X and Y are not independent because S is a triangle, even though the joint density 24xy factors as the product of a function of x alone and a function of y alone over S. One can ascertain this fact by noting that f(½, ¾) = 0, since the point (½, ¾) is outside the triangle, whereas both f(x = ½) and f(y = ¾) are clearly nonzero.

3.4.5 Examples

We shall give examples involving marginal density, conditional density, and independence. Diagrams of the support, such as Figure 3.5, are very useful in this regard.

EXAMPLE 3.4.5   Suppose f(x, y) = ½ over the square determined by the four corner points (1, 0), (0, 1), (−1, 0), and (0, −1), and = 0 otherwise. Calculate the marginal density f(y). We should calculate f(y) separately for y ≥ 0 and y < 0, because the range of integration with respect to x differs in the two cases. We have

f(y) = ∫ f(x, y) dx = ∫_{−(1+y)}^{1+y} ½ dx = 1 + y   if −1 ≤ y < 0,

and, by symmetry, f(y) = 1 − y if 0 ≤ y ≤ 1.

EXAMPLE 3.4.6   Let f(x, y) = 24xy for 0 < x < 1, 0 < y < 1 − x, and = 0 otherwise, as in Example 3.4.2. Calculate P(0 < Y < ¾ | X = ½). The marginal density of X is f(x) = ∫_0^{1−x} 24xy dy = 12x(1 − x)², so by Definition 3.4.4

f(y | x = ½) = f(½, y) / f(½) = 8y   for 0 < y < ½,   and = 0 for ½ < y.

Therefore P(0 < Y < ¾ | X = ½) = ∫_0^{3/4} f(y | x = ½) dy; the range of the first integral is from 0 to ¾,
whereas the conditional density 8y is positive only on the range from 0 to ½ and equals 0 for ½ < y < ¾. Such an observation is very important in solving this kind of problem. Therefore

P(0 < Y < ¾ | X = ½) = ∫_0^{1/2} 8y dy = 1.

EXAMPLE 3.4.7   Assume f(x, y) = 1 for 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1, and = 0 otherwise, and obtain f(x | X < Y). This example is an application of Definition 3.4.3 with S = {(x, y) | x < y}. Since P(X < Y) = ½, integrating the joint density over the part of S with the given x yields

f(x | X < Y) = ∫_x^1 f(x, y) dy / P(X < Y) = 2(1 − x)   for 0 < x < 1.

EXAMPLE 3.4.8   Assume f(x, y) = 1 for 0 < x < 1, 0 < y < 1, and = 0 otherwise, and obtain the conditional density f(x | Y = 0.5 + X). This example is an application of Theorem 3.4.2. The answer is immediately obtained by putting y₁ = ½ and c = 1 in (3.4.26) and noting that the range of X given Y = ½ + X is the interval (0, ½):

f(x | Y = 0.5 + X) = 2   for 0 < x < ½,   and = 0 otherwise.

A further example of the same type gives a joint density on the unit square that is proportional to x² + y² and asks for P(0 < X < 0.5 | Y = 0.5) and P(0 < X < 0.5 | 0 < Y < 0.5), and whether X and Y are independent; that X and Y are not independent there is immediately known, because such a joint density cannot be expressed as the product of a function of x alone and a function of y alone.

3.5 DISTRIBUTION FUNCTION

As we have seen so far, a discrete random variable is characterized by specifying the probability with which the random variable takes each single value, whereas a continuous random variable is characterized by a density function, which a discrete random variable does not possess. This dichotomy can sometimes be a source of inconvenience, and in certain situations it is better to treat all random variables in a unified way. This can be done by using a cumulative distribution function (or, more simply, a distribution function), which can be defined for any random variable.

DEFINITION 3.5.1   The (cumulative) distribution function of a random variable X, denoted by F(·), is defined by

(3.5.1)   F(x) = P(X < x)   for every real x.

Some textbooks define the distribution function as F(x) = P(X ≤ x); the distribution function can then be shown to be continuous from the right. With the convention adopted here, it follows directly from the definition and the axioms of probability that F is a monotonically nondecreasing function, that F(−∞) = 0 and F(∞) = 1, and that F is continuous from the left.

Let X be a finite discrete random variable such that P(X = xᵢ) = pᵢ, i = 1, 2, . . . , n. Then its distribution function is a step function with a jump of length pᵢ at xᵢ, as shown in Figure 3.8. At each point of jump, the value of the distribution function is at the solid point instead of the empty point, indicating the fact that the function is continuous from the left. The distribution function of a continuous random variable X with density function f(·) is given by

(3.5.2)   F(x) = ∫_{−∞}^{x} f(t) dt.

EXAMPLE 3.5.1   Suppose f(x) = ½ e^{−x/2} for x > 0 and = 0 otherwise. Then F(x) = 0 if x ≤ 0, and for x > 0 we have

F(x) = ∫_0^x ½ e^{−t/2} dt = [−e^{−t/2}]_0^x = 1 − e^{−x/2}.

EXAMPLE 3.5.2   Suppose f(x) = 2(1 − x) for 0 < x < 1 and = 0 otherwise. Clearly, F(x) = 0 for x ≤ 0 and F(x) = 1 for x ≥ 1. For 0 < x < 1 we have

F(x) = 0 + 2 ∫_0^x (1 − t) dt = 2x − x².

From (3.5.2) we can deduce that the density function is the derivative of the distribution function and that the distribution function of a continuous random variable is continuous everywhere.
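The closed forms in Examples 3.5.1 and 3.5.2 are easy to verify numerically. The short Python sketch below (my own illustration, not from the text) compares a numerical integral of the density of Example 3.5.2 with the formula F(x) = 2x − x².

    def f(x):
        # density of Example 3.5.2
        return 2.0 * (1.0 - x) if 0.0 < x < 1.0 else 0.0

    def F_numeric(x, n=10000):
        # midpoint-rule approximation of F(x) = integral of f from 0 to x
        if x <= 0.0:
            return 0.0
        h = x / n
        return sum(f((i + 0.5) * h) for i in range(n)) * h

    for x in (0.25, 0.5, 0.9):
        print(x, F_numeric(x), 2 * x - x * x)   # the last two columns should agree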
The probability that a random variable falls into a closed interval can be easily evaluated if the distribution function is known, because of the following relationship:

(3.5.3)   P(x₁ ≤ X ≤ x₂) = F(x₂) − F(x₁) + P(X = x₂).

If X is a continuous random variable, P(X = x₂) = 0, hence that term may be omitted from (3.5.3). Example 3.5.3 gives the distribution function of a mixture of a discrete and a continuous random variable.

EXAMPLE 3.5.3   Consider

F(x) = 0 for x ≤ 0,   = ½ for 0 < x ≤ ½,   = x for ½ < x ≤ 1,   = 1 for x > 1.

This function is graphed in Figure 3.9. The random variable in question takes the value 0 with probability ½ and takes a continuum of values between ½ and 1 according to the uniform density with height 1 over that interval.

A mixture random variable is quite common in economic applications. For example, the amount of money a randomly chosen person spends on the purchase of a new car in a given year is such a random variable, because we can reasonably assume that it is equal to 0 with a positive probability and yet takes a continuum of values over an interval.

We have defined pairwise independence (Definitions 3.2.3 and 3.4.6) and mutual independence (Definitions 3.2.5 and 3.4.7), first for discrete random variables and then for continuous random variables. Here we shall give the definition of mutual independence that is valid for any sort of random variable: discrete or continuous or otherwise. We shall not give a separate definition of pairwise independence, because it is merely a special case of mutual independence. As a preliminary we need to define the multivariate distribution function F(x₁, x₂, . . . , xₙ) for n random variables X₁, X₂, . . . , Xₙ by

F(x₁, x₂, . . . , xₙ) = P(X₁ < x₁, X₂ < x₂, . . . , Xₙ < xₙ).

DEFINITION 3.5.2   Random variables X₁, X₂, . . . , Xₙ are said to be mutually independent if for any points x₁, x₂, . . . , xₙ,

(3.5.9)   F(x₁, x₂, . . . , xₙ) = F(x₁) F(x₂) · · · F(xₙ).

Equation (3.5.9) is equivalent to saying

(3.5.10)   P(X₁ ∈ S₁, X₂ ∈ S₂, . . .
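As an aside (not in the original), a mixture random variable like the one in Example 3.5.3 is easy to simulate, and its empirical distribution function reproduces the jump at 0 followed by the uniform part. A minimal Python sketch:

    import random

    def draw_mixture():
        # X = 0 with probability 1/2; otherwise uniform on (1/2, 1)
        return 0.0 if random.random() < 0.5 else random.uniform(0.5, 1.0)

    sample = [draw_mixture() for _ in range(100000)]

    def F_hat(x):
        # empirical counterpart of F(x) = P(X < x)
        return sum(1 for v in sample if v < x) / len(sample)

    for x in (0.0, 0.25, 0.75, 1.0):
        print(x, round(F_hat(x), 3))   # roughly 0, 0.5, 0.75, 1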
, Xₙ ∈ Sₙ) = P(X₁ ∈ S₁) P(X₂ ∈ S₂) · · · P(Xₙ ∈ Sₙ)

for any subsets S₁, S₂, . . . , Sₙ of the real line for which the probabilities in (3.5.10) make sense. Written thus, its connection to Definition 2.4.3 concerning the mutual independence of events is more apparent. Definitions 3.2.5 and 3.4.7 can be derived as theorems from Definition 3.5.2. We still need a few more definitions of independence, all of which pertain to general random variables.

DEFINITION 3.5.3   An infinite set of random variables are mutually independent if any finite subset of them are mutually independent.

DEFINITION 3.5.4   Bivariate random variables (X₁, Y₁), (X₂, Y₂), . . . , (Xₙ, Yₙ) are mutually independent if for any points x₁, x₂, . . . , xₙ and y₁, y₂, . . . , yₙ,

F(x₁, y₁, x₂, y₂, . . . , xₙ, yₙ) = F(x₁, y₁) F(x₂, y₂) · · · F(xₙ, yₙ).

Note that in Definition 3.5.4 nothing is said about the independence or nonindependence of Xᵢ and Yᵢ. Definition 3.5.4 can be straightforwardly generalized to trivariate random variables and so on, or to the case where the groups (the terms inside the parentheses) contain varying numbers of random variables; it can also be generalized to the case of an infinite sequence of bivariate random variables. We shall not state such generalizations here. Finally, we state without proof:

THEOREM 3.5.1   Let φ and ψ be arbitrary functions. If a finite set of random variables X, Y, Z, . . . are independent of another finite set of random variables U, V, W, . . . , then φ(X, Y, Z, . . .) is independent of ψ(U, V, W, . . .).

Just as we have defined conditional probability and conditional density, we can define the conditional distribution function.

DEFINITION 3.5.5   Let X and Y be random variables and let S be a subset of the (x, y) plane. Then the conditional distribution function of X given S, denoted by F(x | S), is defined by

(3.5.12)   F(x | S) = P(X < x | (X, Y) ∈ S).

Note that the conditional density f(x | S) defined in Definition 3.4.3 may be derived by differentiating (3.5.12) with respect to x.

3.6 CHANGE OF VARIABLES

In this section we shall primarily study how to derive the probability distribution of a random variable Y from that of another random variable X when Y is given as a function, say φ(X), of X. The problem is simple if X and Y are discrete, as we saw in Section 3.2; here we shall assume that they are continuous. We shall initially deal with monotonic functions (that is, either strictly increasing or strictly decreasing) and later consider other cases. We shall first prove a theorem formally and then illustrate it by a diagram.

THEOREM 3.6.1   Let f(x) be the density of X and let Y = φ(X), where φ is a monotonic differentiable function. Then the density g(y) of Y is given by

(3.6.1)   g(y) = f[φ⁻¹(y)] |dφ⁻¹/dy|,

where φ⁻¹ is the inverse function of φ. (Do not mistake it for 1 over φ.)

Proof. Suppose φ is increasing. Then P(Y < y) = P[X < φ⁻¹(y)]. Denoting the distribution functions of Y and X by G(·) and F(·), respectively, this can be written as G(y) = F[φ⁻¹(y)], and differentiating both sides with respect to y gives (3.6.1), since dφ⁻¹/dy > 0. Next, suppose φ is decreasing. Then P(Y < y) = P[X > φ⁻¹(y)], which can be rewritten as G(y) = 1 − F[φ⁻¹(y)]; differentiating both sides again yields (3.6.1), since now dφ⁻¹/dy < 0. The theorem follows from the two cases.

The term in absolute value on the right-hand side of (3.6.1) is called the Jacobian of the transformation. Since dφ⁻¹/dy = (dφ/dx)⁻¹, we can write (3.6.1) as

(3.6.9)   g(y) = f(x) / |dy/dx|   (or, mnemonically, g(y)|dy| = f(x)|dx|),

which is a more convenient formula than (3.6.1) in most cases. However, since the right-hand side of (3.6.9) is still given as a function of x, one must replace x with φ⁻¹(y) to obtain the final answer.

EXAMPLE 3.6.1   Suppose f(x) = 1 for 0 < x < 1 and = 0 otherwise. Assuming Y = X², obtain the density g(y) of Y. Since dy/dx = 2x and x = √y, we have by (3.6.9)

g(y) = 1/(2√y)   for 0 < y < 1.

It is a good idea to check the accuracy of the result by examining that the obtained function is indeed a density. The test is passed in this case, because g(y) is clearly nonnegative and ∫_0^1 (2√y)⁻¹ dy = 1. The same result can be obtained by using the distribution function without using Theorem 3.6.1: G(y) = P(X² < y) = P(X < √y) = √y, and differentiating with respect to y gives g(y) = 1/(2√y). This latter method is more lengthy, as it does not utilize the power of Theorem 3.6.1, but it has the advantage of being more fundamental.

Figure 3.10 illustrates the result of Theorem 3.6.1. Since Y lies between y and y + Δy if and only if X lies between x and x + Δx, shaded regions (1) and (2) must have the same area. If Δx is small, then Δy is also small, the area of (1) is approximately f(x)Δx, and the area of (2) is approximately g(y)Δy; therefore we have approximately g(y)Δy ≅ f(x)Δx, and since Δx can be made arbitrarily small, this relationship in fact leads exactly to (3.6.9). In this illustration we have considered an increasing function; it is clear that we would need the absolute value of dφ/dx if we were to consider a decreasing function instead.
In the case of a nonmonotonic function, the formula of Theorem 3.6.1 will not work, but we can get the correct result if we understand the process by which the formula is derived, either through the formal approach using the distribution function or through the graphic approach. Figure 3.11 is helpful even in the formal approach.

EXAMPLE 3.6.2   Given f(x) = ½ for −1 < x < 1 and = 0 otherwise, and Y = X², find g(y). We shall first employ the graphic approach. In Figure 3.11 the inverse image of a small interval (y, y + Δy) consists of two intervals, one near x = √y and one near x = −√y, so we must have area (3) = area (1) + area (2); therefore, approximately, g(y)Δy ≅ f(√y)Δx + f(−√y)Δx, which gives

(3.6.22)   g(y) = [f(√y) + f(−√y)] / (2√y) = 1/(2√y)   for 0 < y < 1.

Alternatively, using the distribution function,

(3.6.21)   G(y) = P(X² < y) = P(−√y < X < √y) = F(√y) − F(−√y),

and differentiating (3.6.21) with respect to y yields the same result.

The result of Example 3.6.2 can be generalized and stated formally as follows.

THEOREM 3.6.2   Suppose the inverse of y = φ(x) is multivalued and can be written as xᵢ = ψᵢ(y), i = 1, 2, . . . , n_y, where n_y indicates the possibility that the number of values of x varies with y. Then the density g(y) of Y = φ(X) is given by

g(y) = Σ_{i=1}^{n_y} f[ψᵢ(y)] / |φ′[ψᵢ(y)]|,

where f(·) is the density of X and φ′ is the derivative of φ.

So far we have studied the transformation of one random variable into another. In the next three examples we shall show how to obtain the density of a random variable which is a function of two other random variables. We shall use the method in which the distribution function is obtained first and then the density is obtained by differentiation; this method is more fundamental and will work even when the alternative method discussed later, the Jacobian method, fails.

EXAMPLE 3.6.3   Let X and Y have the joint density f(x, y) = 1 for 0 < x < 1,
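A quick Monte Carlo check of Examples 3.6.1 and 3.6.2 (my own sketch, not from the text): for X uniform on (0, 1), and also for X uniform on (−1, 1), the distribution function of Y = X² is √y in both cases, which is what the simulated frequencies should reproduce.

    import random

    def check(draw_x, label, n=200000):
        ys = [draw_x() ** 2 for _ in range(n)]
        for y in (0.09, 0.25, 0.64):
            p_hat = sum(1 for v in ys if v < y) / n
            print(label, y, round(p_hat, 3), "vs", y ** 0.5)   # G(y) = sqrt(y) in both examples

    check(lambda: random.uniform(0.0, 1.0), "Example 3.6.1")
    check(lambda: random.uniform(-1.0, 1.0), "Example 3.6.2")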
0 < y < 1, and = 0 otherwise. Calculate the density function g(z) of Z = max(X, Y). For any z, the event (Z ≤ z) is equivalent to the event (X ≤ z, Y ≤ z); hence the probability of the two events is the same. Since X and Y are independent, we have for 0 < z < 1

(3.6.25)   P(Z ≤ z) = P(X ≤ z) P(Y ≤ z) = z².

Since the density g(z) is the derivative of the distribution function, we conclude that g(z) = 2z for 0 < z < 1.

EXAMPLE 3.6.4   Assume again f(x, y) = 1 for 0 < x < 1, 0 < y < 1, and = 0 otherwise. Obtain the density of Z defined by Z = Y/X. Let F(·) be the distribution function of Z; the relevant areas are shown in Figure 3.12. For 0 < z < 1, the event (Y/X ≤ z) is the triangle below the line y = zx inside the unit square, so F(z) = z/2. For z ≥ 1, F(z) = 1 − area B = 1 − 1/(2z), where B is the triangle above the line y = zx. Differentiating,

(3.6.27)   g(z) = ½ for 0 < z < 1,   and   g(z) = 1/(2z²) for z ≥ 1.

EXAMPLE 3.6.5   Assume f(x, y) = 1 for 0 < x < 1, 0 < y < 1, and = 0 otherwise, and define Z = Y − X − 0.5. Obtain the conditional density f(x | Z = 0), which is the same as f(x | Y = 0.5 + X). This problem was solved earlier, in Example 3.4.8, using Theorem 3.4.2; the present solution is more complicated but serves as an exercise. The joint density of (X, Z) is f(x, z) = 1 over the region 0 < x < 1, −0.5 − x < z < 0.5 − x, indicated by the shaded region in Figure 3.13. Therefore

f(x | Z = 0) = f(x, 0) / ∫ f(x, 0) dx = 2   for 0 < x < 0.5,   and = 0 otherwise.

The alternative method mentioned above, the Jacobian method, generalizes Theorem 3.6.1 to a linear transformation of a bivariate random variable into another.

THEOREM 3.6.3   Let f(x₁, x₂) be the joint density of a bivariate random variable (X₁, X₂) and let (Y₁, Y₂) be defined by the linear transformation

(3.6.30)   Y₁ = a₁₁X₁ + a₁₂X₂,
(3.6.31)   Y₂ = a₂₁X₁ + a₂₂X₂.

Suppose a₁₁a₂₂ − a₁₂a₂₁ ≠ 0, so that (3.6.30) and (3.6.31) can be solved for X₁ and X₂ as

(3.6.32)   X₁ = b₁₁Y₁ + b₁₂Y₂,   (3.6.33)   X₂ = b₂₁Y₁ + b₂₂Y₂.

Then the joint density g(y₁, y₂) of (Y₁, Y₂) is given by

(3.6.35)   g(y₁, y₂) = f(b₁₁y₁ + b₁₂y₂, b₂₁y₁ + b₂₂y₂) / |a₁₁a₂₂ − a₁₂a₂₁|

over the support of g, that is, the range of (y₁, y₂) over which g is positive, which must be appropriately determined; g = 0 elsewhere. The absolute value |a₁₁a₂₂ − a₁₂a₂₁| appearing in (3.6.35) is called the Jacobian of the transformation; a₁₁a₂₂ − a₁₂a₂₁ is the determinant of the 2 × 2 matrix of the coefficients, as Chapter 11 shows. Theorem 3.6.3 can be generalized to a linear transformation of a general n-variate random variable into another.

That the support must be determined separately can best be understood by the following geometric consideration. Consider a small rectangle on the X₁-X₂ plane whose corners, counterclockwise starting from the southwest corner, are (X₁, X₂), (X₁ + ΔX₁, X₂), (X₁ + ΔX₁, X₂ + ΔX₂), and (X₁, X₂ + ΔX₂). The linear mapping (3.6.30)-(3.6.31) maps this rectangle to a parallelogram on the Y₁-Y₂ plane, whose corners are (a₁₁X₁ + a₁₂X₂, a₂₁X₁ + a₂₂X₂), (a₁₁X₁ + a₁₂X₂ + a₁₁ΔX₁, a₂₁X₁ + a₂₂X₂ + a₂₁ΔX₁), (a₁₁X₁ + a₁₂X₂ + a₁₁ΔX₁ + a₁₂ΔX₂, a₂₁X₁ + a₂₂X₂ + a₂₁ΔX₁ + a₂₂ΔX₂),
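The distribution-function method of Examples 3.6.3 and 3.6.4 can be verified by simulation. The following Python sketch (my own, not from the text) checks P(max(X, Y) ≤ z) = z², and P(Y/X ≤ z) = z/2 for z < 1 and 1 − 1/(2z) for z ≥ 1.

    import random

    n = 200000
    # 1 - random() lies in (0, 1], which avoids division by zero below
    pairs = [(1.0 - random.random(), random.random()) for _ in range(n)]

    def F_max(z):
        return sum(1 for x, y in pairs if max(x, y) <= z) / n

    def F_ratio(z):
        return sum(1 for x, y in pairs if y / x <= z) / n

    for z in (0.3, 0.7):
        print("max:", z, round(F_max(z), 3), "vs", z * z)
    for z in (0.5, 2.0):
        exact = z / 2 if z < 1 else 1 - 1 / (2 * z)
        print("ratio:", z, round(F_ratio(z), 3), "vs", exact)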
and (a₁₁X₁ + a₁₂X₂ + a₁₂ΔX₂, a₂₁X₁ + a₂₂X₂ + a₂₂ΔX₂). The area of the rectangle is ΔX₁ΔX₂, and if we suppose for simplicity that all the a's are positive and that a₁₁a₂₂ − a₁₂a₂₁ > 0, then the area of the parallelogram is (a₁₁a₂₂ − a₁₂a₂₁)ΔX₁ΔX₂. The transformation thus rescales areas by the factor |a₁₁a₂₂ − a₁₂a₂₁|, which is why this factor appears in the denominator of (3.6.35), and the support of g is the image of the support of f under the mapping, as illustrated in Figure 3.14.

EXAMPLE 3.6.6   Suppose f(x₁, x₂) = 4x₁x₂ for 0 ≤ x₁ ≤ 1 and 0 ≤ x₂ ≤ 1, and define

(3.6.36)   Y₁ = X₁ + 2X₂,   Y₂ = X₁ − X₂.

What is the joint density of (Y₁, Y₂)? Solving (3.6.36) for X₁ and X₂ gives X₁ = (Y₁ + 2Y₂)/3 and X₂ = (Y₁ − Y₂)/3, and the determinant of the transformation is (1)(−1) − (2)(1) = −3. Inserting the appropriate quantities into (3.6.35), we obtain

g(y₁, y₂) = (4/27)(y₁ + 2y₂)(y₁ − y₂)

over the support of g. Since 0 ≤ x₁ ≤ 1 and 0 ≤ x₂ ≤ 1, the support is given by 0 ≤ (y₁ + 2y₂)/3 ≤ 1 and 0 ≤ (y₁ − y₂)/3 ≤ 1, the inside of the parallelogram shown in Figure 3.14.

3.7 JOINT DISTRIBUTION OF DISCRETE AND CONTINUOUS RANDOM VARIABLES

In Section 3.2 we studied the joint distribution of discrete random variables and in Section 3.4 we studied the joint distribution of continuous random variables. In some applications we need to understand the characteristics of the joint distribution of a discrete and a continuous random variable, and in certain situations it is better to treat the two variables in a unified way. Let X be a continuous random variable with density f(x) and let Y be a discrete random variable taking values yᵢ, i = 1, 2, . . . , n, with probabilities P(yᵢ). If we assume that X and Y are related to each other, the best way to characterize the relationship seems to be to specify either the conditional density f(x | yᵢ) or the conditional probability P(yᵢ | x). In this section we ask two questions: (1) How are the four quantities f(x), P(yᵢ), f(x | yᵢ), and P(yᵢ | x) related to one another? (2) Is there a bivariate
function φ(x, yᵢ) that plays the role of a joint density, that is, a function such that for any a ≤ b and any subset I of the integers (1, 2, . . . , n),

P(a ≤ X ≤ b, Y ∈ S) = Σ_{i∈I} ∫_a^b φ(x, yᵢ) dx,   where S = {yᵢ | i ∈ I}?

To answer the first question, note that f(x | yᵢ), like any other conditional density defined in Section 3.4, involves the conditioning event Y = yᵢ, which has positive probability, so that

f(x | yᵢ) = lim_{ε→0} P(x ≤ X ≤ x + ε | Y = yᵢ) / ε,

whereas P(yᵢ | x) involves a conditioning event that happens with zero probability, so we must define it as the limit of P(yᵢ | x ≤ X ≤ x + ε) as ε goes to zero. Applying Theorem 2.4.1 to the approximating events and using the mean value theorem of integration, we obtain, in the limit,

P(yᵢ | x) f(x) = f(x | yᵢ) P(yᵢ),

which provides the answer to the first question. Either side of this equality serves as the function φ(x, yᵢ) asked for in the second question, since summing it over i ∈ I and integrating over [a, b] yields P(a ≤ X ≤ b, Y ∈ S). Thus f(x | yᵢ)P(yᵢ) plays the role of the bivariate joint density, and the four quantities are related in the same way as in the purely discrete or purely continuous cases.

EXERCISES

1. (Section 3.2) Let X, Y, and Z be binary random variables, each taking only the two values 1 and 0. Specify a joint distribution in such a way that the three variables are pairwise independent but not mutually independent.

2. (Section 3.3) Let X be the midterm score of a student and Y be his final exam score, each scaled to range between 0 and 1, with joint density f(x, y) = 2xy + 2(1 − x)(1 − y) for 0 < x < 1, 0 < y < 1. Grade A is given to a score between 0.8 and 1. What is the probability that he will get an A on the final if he got an A on the midterm?

3. (Section 3.4) Let the joint density of (X, Y) be f(x, y) = e^{−x} for x > 0, 0 < y < x, and = 0 otherwise. Calculate the marginal density f(y).

4. (Section 3.6) Let X have the density f(x) = e^{−x} for x > 0 and = 0 otherwise. Find the density of the variable (a) Y = 2X + 1, (b) Y = X², (c) Y = 1/X, (d) Y = log X. (The symbol log refers to natural logarithm throughout.)

5. (Section 3.6) Suppose that U and V are independent, each with density f(t) = e^{−t}, t > 0. Find the conditional density of X given Y if X = U and Y = U + V.

4 MOMENTS

4.1 EXPECTED VALUE

We shall define the expected value of a random variable, first for a discrete random variable in Definition 4.1.1 and second for a continuous random variable in Definition 4.1.2.

DEFINITION 4.1.1   Let X be a discrete random variable taking the value xᵢ with probability P(xᵢ), i = 1, 2, . . . . Then the expected value (expectation or mean) of X, denoted by EX, is defined to be EX = Σ_{i=1}^∞ xᵢP(xᵢ), if the series converges absolutely. We can write EX = Σ₊ xᵢP(xᵢ) + Σ₋ xᵢP(xᵢ), where in the first summation we sum over i such that xᵢ > 0 and in the second summation we sum over i such that xᵢ < 0. If Σ₊ = ∞ and Σ₋ is finite, we say EX = ∞; if Σ₊ is finite and Σ₋ = −∞, we say EX = −∞; if Σ₊ = ∞ and Σ₋ = −∞, we say the expected value does not exist.

The expected value has an important practical meaning. If X is the payoff of a gamble (that is, if you gain xᵢ dollars with probability P(xᵢ)), the expected value signifies the amount you can expect to gain on the average. For example, if a fair coin is tossed and we gain one dollar when a head comes up and nothing when a tail comes up, the expected value of this gamble is 50 cents: if we repeat the gamble many times, we will gain 50 cents on the average. We can formalize this statement as follows. Let Xᵢ be the payoff of a particular gamble made for the ith time. Then the average gain from repeating this gamble n times is n⁻¹Σ_{i=1}^n Xᵢ; it is the sample mean based on a sample of size n. As n increases, it converges to EX in probability. (For the exact definition of convergence in probability, see Definition 6.1.2.) This is a consequence of the law of large numbers discussed in Chapter 6. EX is sometimes called the population mean, so that it may be distinguished from the sample mean; we shall learn in Chapter 7 that the sample mean is a good estimator of the population mean.
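To illustrate the convergence of the sample mean to EX described above (a sketch of my own, not from the text), one can simulate the one-dollar coin gamble in a few lines of Python:

    import random

    def payoff():
        # one dollar if a head comes up, nothing otherwise; EX = 0.5
        return 1.0 if random.random() < 0.5 else 0.0

    for n in (10, 1000, 100000):
        sample_mean = sum(payoff() for _ in range(n)) / n
        print(n, sample_mean)   # approaches 0.5 as n grows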
More generally, decision making under uncertainty always means choosing one out of a set of random variables X(d) that vary as d varies within the decision set D; here X(d) is the random gain (in dollars) that results from choosing decision d. Choosing the value of d that maximizes EX(d) may not necessarily be a reasonable decision strategy, however. To illustrate this point, consider the so-called St. Petersburg Paradox. A coin is tossed repeatedly until a head comes up, and 2^i dollars are paid if a head comes up for the first time on the ith toss. The payoff of this gamble is represented by the random variable X that takes the value 2^i with probability 2^{−i}, so that, by Definition 4.1.1, EX = ∞. Obviously, nobody would pay ∞ dollars for this gamble. How much this gamble is worth depends upon the subjective evaluation of the risk involved. One way to resolve this paradox is to note that what one should maximize is not EX itself but, rather, EU(X), where U denotes the utility function. If, for example, the utility function is logarithmic, the real worth of the St. Petersburg gamble is merely

E log X = log 2 · Σ_{i=1}^∞ i(½)^i = log 4,

the utility of gaining four dollars for certainty. By changing the utility function, one can represent various degrees of risk-averseness. (This example is called the St. Petersburg Paradox because the Swiss mathematician Daniel Bernoulli wrote about it while visiting the St. Petersburg Academy. A good, simple exposition of this and related topics can be found in Arrow, 1965.)

Coming back to the aforementioned gamble that pays one dollar when a head comes up: the fact that its expected value is 50 cents does not mean that everybody should pay exactly 50 cents to play. A risk taker may be willing to pay as much as 90 cents to gamble, whereas a risk averter may pay only 10 cents. The decision to gamble for c cents or not can be thought of as choosing between two random variables, X₁ and X₂, where X₁ takes the value 1 with probability ½ and 0 with probability ½, and X₂ takes the value c with probability 1. Not all decision strategies can be regarded as the maximization of EU(X) for some U. For example, an extremely risk-averse person may choose the d that maximizes min X(d), where min X(d) means the minimum possible value X(d) can take. Such a strategy is called the minimax strategy, because it means minimizing the maximum loss (loss may be defined as negative gain). We can think of many other strategies which may be regarded as reasonable by certain people in certain situations.

Besides having an important practical meaning as the fair price of a gamble, the expected value is a very important characteristic of a probability distribution, being a measure of its central location. The other important measures of central location are the mode and the median. The mode is a value of x for which f(x) is the maximum, and the median m is defined by the equation P(X ≤ m) = ½.

DEFINITION 4.1.2   Let X be a continuous random variable with density f(x). Then the expected value of X, denoted by EX, is defined to be EX = ∫_{−∞}^∞ x f(x) dx if the integral is absolutely convergent. If ∫_0^∞ x f(x) dx = ∞ and ∫_{−∞}^0 x f(x) dx is finite, we write EX = ∞; if ∫_0^∞ x f(x) dx is finite and ∫_{−∞}^0 x f(x) dx = −∞, we write EX = −∞; if ∫_0^∞ x f(x) dx = ∞ and ∫_{−∞}^0 x f(x) dx = −∞, we say the expected value does not exist.

If the density function f(x) is bell-shaped and symmetric around x = μ, then μ = EX = m = mode(X). If the density is positively skewed as in Figure 4.1, then mode(X) < m < EX: the long right tail pulls the mean to the right of the median. The three measures of central location are computed in the following examples.
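The St. Petersburg computation above is easy to check numerically (an illustrative sketch of my own, not part of the text): the truncated series for E log X converges quickly to log 4, while simulated sample means of the payoff itself drift upward without settling, reflecting EX = ∞.

    import math
    import random

    # E log X = sum over i of 2^(-i) * i * log 2 = 2 log 2 = log 4
    e_log_x = sum(2.0 ** (-i) * i * math.log(2.0) for i in range(1, 60))
    print(e_log_x, math.log(4.0))            # both about 1.386

    def st_petersburg_payoff():
        # toss until the first head; pay 2^i dollars if it occurs on toss i
        i = 1
        while random.random() < 0.5:
            i += 1
        return 2.0 ** i

    for n in (10**3, 10**5):
        print(n, sum(st_petersburg_payoff() for _ in range(n)) / n)   # unstable, keeps growing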
EXAMPLE 4.1.1   Suppose f(x) = x^{−2} for 1 < x < ∞ and = 0 otherwise. Then EX = ∫_1^∞ x^{−1} dx = ∞. The median m must satisfy ½ = ∫_1^m x^{−2} dx = 1 − m^{−1}, so m = 2. The mode is clearly 1.

EXAMPLE 4.1.2   Suppose f(x) = 2x^{−3} for 1 < x < ∞ and = 0 otherwise. Then EX = ∫_1^∞ 2x^{−2} dx = 2. The median satisfies ½ = ∫_1^m 2x^{−3} dx = 1 − m^{−2}, so m = √2. The mode is again 1. Note that the density of Example 4.1.1 has a fatter tail than that of Example 4.1.2 (that is, it converges more slowly to 0 in the tail), which has pushed both the mean and the median to the right, affecting the mean much more than the median.

The following theorems show a simple way to calculate the expected value of a function of a random variable.

THEOREM 4.1.1   Let X be a discrete random variable taking the value xᵢ with probability P(xᵢ), i = 1, 2, . . . , and let φ(·) be an arbitrary function. Then

Eφ(X) = Σᵢ φ(xᵢ) P(xᵢ).

Proof. Define Y = φ(X). Then Y takes the value φ(xᵢ) with probability P(xᵢ), so the result follows from Definition 4.1.1.

THEOREM 4.1.2   Let X be a continuous random variable with density f(x) and let φ(·) be a function for which the integral below can be defined. Then

Eφ(X) = ∫_{−∞}^∞ φ(x) f(x) dx.

We shall not prove this theorem in full generality, because the proof involves a level of analysis more advanced than that of this book. If φ(·) is continuous, differentiable, and monotonic, the proof is an easy consequence of approximation (3.6.15): letting g(y) be the density of Y = φ(X), we have ∫ y g(y) dy = lim Σ y g(y)Δy = lim Σ φ(x) f(x)Δx = ∫ φ(x) f(x) dx. Note that, given f(x), Eφ(X) can be computed either directly from Theorem 4.1.2 or by first obtaining the density g(y) of Y = φ(X) by Theorem 3.6.1 and then using Definition 4.1.2; the same value is obtained by either procedure.

Theorems 4.1.1 and 4.1.2 can be easily generalized to the case of a random variable obtained as a function of two other random variables.

THEOREM 4.1.3   Let (X, Y) be a bivariate discrete random variable taking the value (xᵢ, yⱼ) with probability P(xᵢ, yⱼ), i, j = 1, 2, . . . , and let φ(·, ·) be an arbitrary function. Then Eφ(X, Y) = Σᵢ Σⱼ φ(xᵢ, yⱼ) P(xᵢ, yⱼ).

THEOREM 4.1.4   Let (X, Y) be a bivariate continuous random variable with joint density function f(x, y), and let φ(·, ·) be an arbitrary function. Then Eφ(X, Y) = ∫_{−∞}^∞ ∫_{−∞}^∞ φ(x, y) f(x, y) dx dy.

Although we have defined the expected value so far only for a discrete or a continuous random variable, the following three theorems, which characterize the properties of the operator E, are true for any random variable, provided that the expected values exist.

THEOREM 4.1.5   If α is a constant, Eα = α.

THEOREM 4.1.6   If X and Y are random variables and α and β are constants, E(αX + βY) = αEX + βEY.

THEOREM 4.1.7   If X and Y are independent random variables, EXY = EXEY.

The proofs of Theorems 4.1.5 and 4.1.6 are trivial or follow easily from Definitions 4.1.1 and 4.1.2 together with Theorems 4.1.3 and 4.1.4 when (X, Y) is either discrete or continuous; Theorem 4.1.7 is proved similarly, using the definition of independence.
Theorem 4.1.7 is very useful. For example, let X and Y denote the face numbers showing when two dice are rolled independently. Then, since EX = EY = 7/2, we have EXY = 49/4 by Theorem 4.1.7; calculating EXY directly from the joint distribution without using this theorem would be quite time-consuming.

Theorems 4.1.6 and 4.1.7 may be used to evaluate the expected value of a mixture random variable which is partly discrete and partly continuous. Let X be a discrete random variable taking the value xᵢ with probability P(xᵢ), i = 1, 2, . . . , and let Y be a continuous random variable with density f(y). Let W be a binary random variable taking the two values 1 and 0 with probabilities p and 1 − p, respectively, and, furthermore, assume that W is independent of both X and Y. Define a new random variable Z = WX + (1 − W)Y. Another way to define Z is to say that Z is equal to X with probability p and equal to Y with probability 1 − p. A random variable such as Z is called a mixture random variable. Using Theorems 4.1.6 and 4.1.7, we have EZ = EWEX + E(1 − W)EY, and since EW = p by Definition 4.1.1, we get EZ = pEX + (1 − p)EY. We shall write a generalization of this result as a theorem.

THEOREM 4.1.8   Let X be a mixture random variable taking the discrete values xᵢ, i = 1, 2, . . . , n, with probabilities pᵢ and a continuum of values in the interval [a, b] according to the density f(x): that is, if [a, b] ⊃ [x₁, x₂], then P(x₁ ≤ X ≤ x₂) = ∫_{x₁}^{x₂} f(x) dx. Then

EX = Σ_{i=1}^n xᵢpᵢ + ∫_a^b x f(x) dx.

(Note that we must have Σ_{i=1}^n pᵢ + ∫_a^b f(x) dx = 1.)

The following example from economics indicates another way in which a mixture random variable may be generated and its mean calculated.

EXAMPLE 4.1.3   Suppose that in a given year an individual buys a car if and only if his annual income is greater than 10 (ten thousand dollars), and that if he does buy a car, the one he buys costs one-fifth of his income. Assuming that his income X is a continuous random variable with uniform density over the interval [5, 15], compute the expected amount of money Y he spends on a car. Y is related to X by

(4.1.6)   Y = 0 if 5 ≤ X < 10,   Y = X/5 if 10 ≤ X ≤ 15.

Clearly, Y is a mixture random variable that takes 0 with probability ½ and a continuum of values in the interval [2, 3] according to the density f(y) = ½. Therefore, by Theorem 4.1.8,

(4.1.7)   EY = 0·½ + ∫_2^3 y·½ dy = 5/4.

Alternatively, EY may be obtained directly from Theorem 4.1.2 by taking φ to be the function defined in (4.1.6):

(4.1.8)   EY = ∫_5^{15} φ(x) f(x) dx = (1/10)[∫_5^{10} 0 dx + ∫_{10}^{15} (x/5) dx] = 5/4.

4.2 HIGHER MOMENTS

As noted in Section 4.1, the expected value, or the mean, is a measure of the central location of the probability distribution of a random variable. Although it is probably the single most important measure of the characteristics of a probability distribution, it alone cannot capture all of the characteristics. For example, in the coin-tossing gamble of the previous section, suppose one must choose between two random variables, X₁ and X₂, where X₁ is 1 or 0 with probability 0.5 for each value and X₂ is 0.5 with probability 1. Though the two random variables have the same mean, they are obviously very different. The characteristics of the probability distribution of a random variable X can be represented by a sequence of moments, defined either as

(4.2.1)   kth moment around zero = EX^k

or

(4.2.2)   kth moment around the mean = E(X − EX)^k.

Knowing all the moments (either around zero or around the mean) for k = 1, 2, . . . , is equivalent to knowing the probability distribution completely. The expected value (or the mean) is the first moment around zero. Since either x^k or (x − EX)^k is a continuous function of x, moments can be evaluated using the formulae in Theorems 4.1.1 and 4.1.2. As we defined the sample mean in the previous section, we can similarly define the sample kth moment around zero: let X₁, X₂, . . . , Xₙ be mutually independent and identically distributed as X; then n⁻¹Σ_{i=1}^n Xᵢ^k is the sample kth moment around zero based on a sample of size n. Like the sample mean, the sample kth moment converges to the population kth moment in probability, as will be shown in Chapter 6.

Next to the mean, by far the most important moment is the second moment around the mean, which is called the variance. Denoting the variance of X by VX, we have

DEFINITION 4.2.1
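The expected expenditure in Example 4.1.3 can be confirmed by simulation (my own sketch, not part of the text):

    import random

    def expenditure():
        income = random.uniform(5.0, 15.0)       # income uniform on [5, 15]
        return income / 5.0 if income > 10.0 else 0.0

    n = 200000
    print(sum(expenditure() for _ in range(n)) / n)   # approximately 1.25 = 5/4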
The square root of the variance is called the standard deviation and is denoted by a. (Therefore variance is sometimes denoted by a%nstead of V.) From the definition it is clear that X = 0 if and only if X = EX VX r 0 for any random variable and that V with probability 1. The variance measures the degree of dispersion of a probability distribution. In the example of the coin-tossing gamble we have V X I = '/4 and VX2 = 0. (As can be deduced from the definition, the variance of any constant is 0.) The following three examples indicate that the variance is an effective measure of dispersion. EXAMPLE 4 . 2 . 1 Next, using Theorem 4.1.1, - - - - - Therefore, by Definition 4.2.1, (4.2.5) V X = 21 - 169 = 20 ---9 9 . X has density f ( x ) = 2(1 - x) for 0 < x otherwise. Compute VX. By Definition 4.1.2 we have E X A M P L E 4.2.5 < 1 and = 0 X=a = -a with probability % with probability % B y Theorem 4.1.2 we have I V X = EX' = a2. - 70 4 1 Moments 4.3 1 Covariance and Correlation 71 (4.2.7) EX' = 2 1 /' (x2 - x3)dx = - . o 6 Therefore, by Definition 4.2.1, 6, we can show that the sample covariance converges to the population covariance in probability. It is apparent from the definition that Cov > 0 if X - EX and Y - EY tend to have the same sign and that Cov < 0 if they tend to have the opposite signs, which is illustrated by EXAMPLE 4 . 3 . 1 The following useful theorem is an easy consequence of the ckfmiticm of the variance. THEOREM 4 . 2 . 1 (X, Y) = (1, 1 = (- 1, -1) = with probability 4 2 , with probability 4 2 , with probability (1 - a)/2, with probability (1 - a ) / 2 . IfaandPareconstants,wehave (1, - 1 ( 11 Note that Theorem 4.2.1 shows that adding a constant to a random variable does not change its variance. This makes intuitive sense because adding a constant changes only the central location of the probability distribution and not its dispersion, of which the variance is a measure. We shall seldom need to know any other moment, but we mention the third moment around the mean. It is 0 if the probability distribution is symmetric around the mean, positive if it is positively skewed as in Figure 4.1, and negative if it is negatively skewed as the mirror image of Figure 4.1 would be. 4.3 = Since EX = EY = 0, = a - Cov(X, Y) = EXY (1 - a ) = 2a - 1. Note that in this example Cov = 0 if a = %, which is the case of independence between X and Y . More generally, we have T H E0 RE M 4 . 3 . 1 If X and Y are independent, Cov(X, Y) = 0 provided that VX and W exist. COVARIANCE A N D CORRELATION Covariance, denoted by Cov(X, Y) o r uxy, is a measure of the relationship between two random variables X and Y and is defined by DEFINITION 4.3.1 The proof follows immediately from the second formula of Definition 4.3.1 and Theorem 4.1.7. The next example shows that the converse of Theorem 4.3.1 is not necessarily true. EXAMPLE 4 . 3 . 2 COV(X,Y) = E[(X - EX)(Y - EY)] =EXY - EXEY. Let the joint probability dbtdb~tion of @ , Y ) be @v&n by The second equality follows from expanding (X - EX) (Y - EY) as the sum of four terms and then applying Theorem 4.1.6. Note that because of Theorem 4.1.6 the covariance can be also written as E[(X - EX)Y] or E[(Y - EY)X]. Let (XI,Y 1 ) , (X2,Y2), . . . , (Xn,Yn) be mutually independent in the sense of Definition 3.5.4 and identically distributed as (X, Y). Then we define the sample covariance by n-'~in,,(x~ - 2)(Yi - p), where 2 and P are the sample means of X and Y , respectively. Using the results of Chapter - 3.3. N = 5/8. 
f o r O < x < l = It is clear from Theorem 4.) = 0 for every pair such that i f j. Assume that the returns from the five stocks are pairwise independent.3. 49 1 Cov(X.X.6.3..3.by symmetry 12 7 : . .1 and 4. p.1.3. f(x.1.2. Therefore.S There are five stocks.3. (b) E(2 c&~x.1. and v(~c:=~xJ= 20a2 by Theorem 4. Cov(Income.1. Y).3. 2 V(X 'C Y) = V X + VY 'C 2 COV(X.3.1. what will be the mean and variance of the annual return on your portfolio? (b) What if you buy two shares of each stock? Let X. and the same variance of return.3.) = 100u2 by Theorem 4. and EXY = 1/4.1 for computing the covariance. 0 otherwise. by Definition 4.3. . E X AM P L E 4. i = 1. n. EY = . Then where the marginal probabilities are also s h m . and V(lOX... . Then.3 1 Covariance and Correlation 73 Clearly.72 4 I Moments 4. Y ) = . We have - : .3.3.)= 10p by Theorem 4.6. u2. each of which sells for $100 per share and has the same expected annual return per share. (a) If you buy ten shares of one stock. Consumption) is larger if both variables are measured in cents than in .--.) = 10p by Theorem 4.---.. Y) = '/4 . be the return per share from the ith stock.2. 3 . which we state as T H Eo R E M 4 .3 and 4. X and Y are not independent by Theorem 3. (a) E(lOX. Examples 4. THEOREM 4 . 2. we can easily show that the variance of the sum of independent random variables is equal to the sum of the variances.4 illustrate the use of the second formula of Definition 4. As an application of Theorem 4. Let the joint probability distribution of (X. For example.2.2 that the conclusion of Theorem 4. consider E X A M P L E 4 . 3 . be pairwise independent.2.1 and Theorem 4.3 Theorem 4. . but we have Cov(X..4 Letthejointdensitybe and O < y < 1.= . Y) .S/16 = ' /.3. Combining Theorems 4.2 gives a useful formula for computing the variance of the sum or the difference of two random variables. EXAMPLE 4. Y) be given by The proof follows immediately from the definitions of variance and covariance.y) = x + y . Calculate Cov(X. 3 .by Definition 4.3 holds if we merely assume Cov(X. 3 144 144 1 A weakness of covariance as a measure of relationship is that it depends on the units with which X and Y are measured. 3 - Let X.3.1. Cov(X. W have EX = 1/2. Y) = EXY = 0.3. 3 1 Covariance and Correlation 75 dollars.6) and (4. among all the possible linear functions of another random variable. c 6 Correlation (X.3). Y .3. u x ' UY Correlation is often denoted by pxy or simply p. PY) = Correlation (X. we can write the minimand.1) as a ZPEX' - 2EXY + 2aEX = 0.7) -= Proof. O If p = 0. X.3. Since the expected value of a nonnegative ran&m wriab$e is nonnegative. The latter will be called either the prediction error or the residual.3. Y) . 6 The best linear predictor (or more exactly.3.2) V X + X'VY . E[(X .3. It is easy to prove T H E OR E M 4 . we have (4.3. + Next we shall ask what proportion of VY is explained by a* + P*X and what proportion is left unexplained. This weakness is remedied by considering cumlation (coeflcient).3. We next consider the problem of finding the best predictor of one random variable. The problem can be mathematically formulated as (4. denoted by S.2k Cov 2 0 for any k.8) and (4. we get Expanding the squared term.2). Y). We also have THEOREM 4 . the minimum mean-squared-error linear predictor) of Y based on X is given by a* P*X. Solving (4.X(Y - 2 0 for any A. 3 . If p > 0 (p < 0). and (4. as If a and P are nonzero constants.3. Define = a* + P*X and U = Y ?. defined by DEFINITION 4. We shall solve this problem by calculus. 
we say X and Y are positively (negatively) correlated.4) Minimize E(Y .7) simultaneously for a and P and denoting the optimal values by a* and P*. Y) = Cov(X.3. 3 . as we shall see below. we obtain (4. we say X and Y are uncorrelated. 4 by the best linear predictor based on X turns out to be equal to the square of the correlation coefficient between Y and X. where a* and p* are defined by (4.2 . 3 . putting X = Cov/VY into the left-hand side of (4. Expanding the squared term.3. This problem has a bearing on the correlation coefficient because the proportion of the variance of Y that is explained T H E0 RE M 4 .2a - 2EY + PPEX = 0 s 5 1. and In particular.EX) .3.3. We shall interpret the word best in the sense of minimizing the mean squared error of prediction. 5 aa . we obtain the Cauchy-Schwartz inequality Thus we have proved The theorem follows immediately from (4. Equating the derivatives to zero.9).74 4 I Moments 4.a - PX)' with respect to a and P. we have (4.6) -- Correlation(aX. We have . . and also suppose that X has a symmetric density around EX = 0.)P(y. This result suggests that the correlation coefficient is a measure of the degree of a linear relationship between a pair of random variables. be an arbitrary function.3.1) EYlX+(X> Y) = ]=I C + ( X .11). . Let (X. (4. When a = 1/2. or it may be regarded as a random variable. Since V X =V Y = 1 in that example.Y) is a function only of X. we can define the conditional mean in a way similar to that of Definitions 4. using the conditional probability defined in Section 3. and (4.2. Y) I XI or by Edx+(X. j = 1. being a function of the random variable X. p = 2a .2. . denoted by E[+(X..3. 2 a) The conditional mean Eylx+(X. U).2 and the various types of conditional densities defined in Section 3.8) and (4. Y) =0 I @ by Definition 4. 4.4.1. Let P(y.4 1 Conditional Mean and Variance 77 (4. Y) given X is defined by DEFl N IT10 N 4 .3. Then the conditional mean of +(X. i.76 4 I Moments 4.3. we can take a further expectation of it using the probability distribution of X. Y) C O N D I T I O N A L M E A N A N D VARIANCE by Theorem 4. there is an exact linear relationship with a negative slope.3. Y ) given X.1 t i = p2VY by Definition 4.1 m = COV(?.3.2P* Cov(X. f') and the part which is uncorrelated with X (namely.1 = P*Cov(X. Suppose that there is an exact nonlinear relationship between X and Y defined by Y = x2.1. If we treat it as a random variable. As a further illustration of the point that p is a measure of linear dependence. is defined by D EF l N l T I 0 N 4. given X.3.4 = W + (P*)~vx. Let +(.3. In the next section we shall obtain the best predictor and compare it with the best linear predictor.3. Therefore p = 0.2. . I XI.1.3. Let +(.3.2. p p Let (X.1 and 4. Then Cov(X. Y ) be a bivariate continuous random variable with conditional densityf (y 1 x). Here we shall give two definitions: one for the discrete bivariate random variables and the other concerning the conditional density given in Theorem 3.1 again..10).3. Y).2.10). 2.4. It may be evaluated at a particular value that X assumes.3.4. there is an exact linear relationship between X and Y with a positive slope. 4 . (4. consider Example 4. . Y) = EXY = Ex3 = 0. Combining (4.p2 proportion to the second part.3. When a = I.4. This may be thought of as another example where no correlation does not imply independence. We have A nonlinear dependence may imply a very small value of p.) be an arbitrary function.2 and 3. The following theorem shows what happens.1 by (4. 
we can say that any random variable Y can be written as the sum of the two parts--the part which is expressed as a linear function of another random variable X (namely. the degree of linear dependence is at the minimum.3. 1 X) be the conditional probability of Y = y. a p2 proportion of the variance of Y is attributable to the first part and a 1 .12). We also have In Chapters 2 and 3 we noted that conditional probability and conditional density satisfy all the requirements for probability and density. y.2 = (1 - p2)w by Definition 4. Then the conditional mean of +(X.10) V? = (P*)~vx by Theorem 4. We call W the mean squared prediction errorof the best linear predictor. Therefore. y.).I @ by Definition 4. . . Y ) be a bivariate discrete random variable taking values (x. Y) . When a = 0. 4) V 4 ( X .4.4. use Definition 4. (The symbol Ex indicates that the expectation is taken treating X as a random variable. O < x 1. 2 EXAMPLE 4 . \ \ The following examples show the advantage of using the right-hand side of (4.-- . It says that the variance is equal to the mean of the conditional mriance plus the variance of the conditional mean. x. We have .7.3) and (4. use Theorem 4.) Proof: We shall prove it only for the case of continuous random variables.4. 2 ThemarginaldensityofXisgiven by ffx) = 1 . f ( y I x) Supposef(x) = l f o r O < x < 1 and =Ootherwiseand < y < x and = 0 otherwise.2: The following theorem is sometimes useful in computing variance.4.4. Calculate EY.4. Y ) = ExVylx4(X.4 1 Conditional Mean and Variance 79 THEOREM 4 .1 4. Proof Since P(Y=OIX=x)=l-x.E X ( E Y ~ X + ) ~ . The conditional probability of Y given X is given by P(Y = 1 IX = x) = < (4. The result obtained in Example 4. Y ) = ExEdx4(X. This problem may be solved in two ways.4. 4 . EY = 1 ExEylxY = ExX = .1.2 can be alternatively obtained using the result of Section 3.4) in computing the unconditional mean and variance. Y ) f VxEq&(X.4.1: = x-' for 0 Second. by adding both sides of (4. Y ) . THEOREM 4 . 4 . 4 .6) and (4. Find EY and W. the proof is easier for the case of discrete random variables. 2 But we have Therefore.7).- - 78 4 I Moments E X A M P L E 4. we have (4. as follows. 1 ( l a w o f i t e r a t e d m e a n s ) E 4 ( X ..4. First.6) ExVylx+ = ~4~. we must first compute the moments that appear on the right-hand sides of (4.Thus we have Despite the apparent complexity of the problem. the minimum meansquared-error predictor) of Y based on X is given by E(Y I X ) .4.9): EY = 7 / 2 .3 1(Y 1 X). Y = 0 ) = Poo = 0. Y = 1 ) = Pll = 0. Let Y be the face number showing. P(X = 1 .3.4 1 Conditional Mean and Variance 81 In the next example we shall compare the best predictor with the best linear predictor.+(x)])' where the values taken by Y and X are indicated by empty circles in Figure 4. EX' = EXY = 28/3. E X A M P L E 4 . EY' = 91/6. Y = 0 ) = Plo = 0.--- where the cross-product has dropped out because Therefore (4.8) and (4.3. VX = 16/3. there is a simple solution. We have 1 ) E[Y - (b(x)12 = E{[Y . EX = 2. Cov = 7 / 3 . Y = 1) = Pol = 0. P(X = 0 . Put ?= ( 2 1 / 8 ) + ( 7 / 1 6 ) X . Obtain the The best predictor (or more exactly.2. Define X by the rule X = Y if Y is even.2. Here we shall consider the problem of optimally predicting Y by a general function of X. We shall compute the mean squared error of prediction for each predictor: - .E(Y I X ) ] + [E(Y I X ) . 4 . W = 35/12.11) is clearly minimized by choosing + ( X ) Thus we have proved T H EoR EM 4. P(X = 0 .2. 4 . 
Therefore - Minimize E[Y - (b(x)12 with respect to +. I X 0 2 4 6 Therefore the mean and variance of Y can be shown to be the same as obtained above.4. = 0 if Y is odd. The following table gives E(Y X). E X A M P L E 4 . .80 4 1 Moments 4. using the definition of the mean and the variance of a discrete random variable which takes on either 1 o r 0. The problem can be mathematically formulated as (4. 3 A fair die is rolled. 4 Let the joint probability distribution of X and Y be given as follows: P ( X = 1.3.3.10) To compute the best linear predictor. In the previous section we solved the problem of optimally predicting Y by a linear function of X.4. E n d the best predictor and the best linear predictor of Y based on X. and Cov(X. but as an illustration we shall obtain two predictors separately.1) A station is served by two independent bus lines going to the same destination.14) MSPE =W - Cov(X9 vx v2= ap4. Inserting these values into equations (4.5. 1 % 3/s E(Y I X) = [ P I I / ( ~ I+ I PIO)IX+ [ P o I / ( ~ o+ I Poo)] (1 .4. Y) be given by PII/(PII + Pio) Poi/ (Poi + Poo).2X. Compute EX and VX assuming the probability of heads is equal to p. From (4.12) . EXERCISES F I G U R E 4.X).yEyix[~' E(Y x ) ~ = EY2 . and in the second line ten minutes. (Section 4.2) Let the probability distribution of (X. From (4.4.4. VX = W = 0. which is a linear function of X. (Section 4. We have E(Y 1 X = 1) and E(Y I X = 0) = = 1.82 4 I Moments = E.2) Let the density of X be given by Best predictor.3.2 Comparison of best predictor and best linear predictor best predictor and the best linear predictor of Y as functions of X and calculate the mean squared prediction error of each predictor.3.3. Y) = 0. Its mean squared prediction error (MSPE) can be calculated as follows: + . 4.11) we obtain (4.2) Let X be the number of tosses required until a head comes up.25. (Section 4. Find V(X I Y).Ex[E(Y x ) ~ ] + I 2YE(Y ( X ) ] I Best linear predictor. (Section 4. This result shows that the best predictor is identical with the best linear predictor. The moments of X and Y can be calculated as follows: EX = EY = 0. 3.8) and (4.4 0.2 and a* = 0.4. You get on the first bus that comes.05.12) we readily obtain E(Y I X) = 0.9) yields P* = 0. What is the expected waiting time? 2. Both equations can be combined into one as (4. In the first line buses come at a regular interval of five minutes. 6. y) = 1 for 0 Calculate VX. Y). 7.4-Prediction) Let X be uniformly distributed over [0. Y) = 0. 0 11. - - 8. P(Y = 1 1 X = 1) = 0.4. Calculate Cov(Z. P(Y = 1 1 X = 0) = 0. (Section 4. 12. (Section 4. P ( Z = 1 1 Y = 1) = 0.5.84 4 I Moments = I Exercises for 0 85 f( 4 x < x < 1. and Z be random variables. Compute Cov(X. but Y is of no value in predicting X. (Section 4. + y for 0 < x < 1 and 0 < < x < 1 and 0 < y < Obtain the best predictor and the best linear predictor of Y as fumtions of X and calculate the mean squared prediction error for eack predictor.3) Suppose X and Y are independent with EX = 1." 17. (Section 4. Define Z = X + Y and W = XY. + + - - . 10.4) With the same density as in Exercise 6. P(Z = 1 1 Y = 0) = 0. (Section 4.3) Let EX = EY = 0. y) = 2 for 0 x. 5. Let the conditional density of Y given X = 1 be uniform over 0 < y < 1 and given X = 0 be uniform over 0 < y < 2.3) Let (X.5.4-Prediction) Give an example in which X can be used to predict Y perfectly. Y) = 1. 15.7.3) Let the probability distribution o f (X. Y values: 1 and 0. find the best linear predictor of Y based on 2. 
Determine a and p so that V(aX + PY) = 1 and Cov(aX PY. obtain E(X Y = X I + 0. 14. (Section 4. Compute VX and Cov(X. Assume that the probability distribu- 16. otherwise. (Section 4. (Section 4.3. VX = V define Z = X Y.. (Section 4. EY = 2. (Section 4.5).(20/19)Y are uncorrelated. If we Suppose EX = EY = 0. y) = x y < 1. VX = W = 2. and Cov(X. (Section 4.- 9. Given P(X = 1) = 0.4) Let X = 1 with probability p and 0 with probability 1 . Y) have joint density f (x. VX = 1. 13. 1) define . Obtain Cov(X. 11 and for some c in (0.4) .4-Prediction) Y = 1. Y ) -begiven by < x < 1 and 0 < y < 1.3) Let (X. and Cov(X. Obtain E(X I X < Y). each of which takes only two Let X. X) = 0. (Section 4. (Section 4. find EZ and E(Z I X = 1). (b) Are X + Y and X .4-Prediction) Let the joint probability distribution of X and Y be given by I (a) Show that X + Y and X .(20/19)Y independent? 6. Y) have joint density f (x. and W = 1. Y). Supply your own definition of the phrase "no value in predicting X. =2-x = forl<x<2. Y).p. W).4) Let f (x. ). . . THEOREM 5. 1 Let (Y.)Find the best predictor and the best linear predictor of X + Y given X .1 B I N O M I A L R A N D O M VARIABLES t Let X be the number of successes in n independent trials of some experiment whose outcome is "success" or "failure" when the probability of success in each trial is p. p ) .1 (5. 19. where U = x .iableXwehave = P(X = k ) ~ t p ~ q ~ . More formally we state D E F I N I T I O N 5 .~ .1 for the definition. I .Y . with the probability distribution given by I Y. Compute their respective mean squared prediction errors and directly compare them. 2 and compare the BINOMIAL AND NORMAL R A N D O M VARIABLES 18. is called a binomial random variable. p). (See Section 5.1. each distributed as B ( l . .1) is distributed as B(l.1.) Define X = U + V and Y = UV. = 1 with probability p = 0 with probability 1 - p = q. Then the random variable X defined by n (5. the number of heads in n tosses) and is called a binomial random variable. 5. (T is exponentially distributed with parameter h if its density is given by h exp(-kt) for t > 0. Symbolically we write X . 2. denoted variances V(X) and V(U).1. Find the best predictor and the best linear predictor of Y given X. 1 .2. n .2) X = .3) Forthebinomialrando~1MI. i = 1.=I Y.1. which is called a binary or Bernoulli random variable. (Section 4.86 4 I Moments Find the best predictor of X given Y.4-Prediction) Suppose U and V are independent with exponential distribution with parameter A. p). .4-Prediction) Suppose that X and Y are independent. (Section 4. Note that Y . Such a random variable often appears in practice (for example. defined in (5.B ( n . and a.1.1.5. In this example we have n = 5 and p = 0.3). a 2 ) . b e ThenEX = pandvx = a'.11) for each k .2).1.25. 2 . and all positive u by a rather complicated procedure using polar coordinates.u'). Note that the above derivation of the mean and variance is much simpler than the direct derivation using (5.3.5) EX = np. = p E Y ~= p for every i for every i The normal distribution is by far the most important continuous distribution used in statistics.k trials are failures is equal to pkqn-k.1.N(y.1. however.1.5 and V X = 1.2 N O R M A L R A N D O M VARIABLES EY. combinations with an equal probability. p. Since k successes can occur in any of C.88 5 1 Binomial and Normal Random Variables 5. = p .1. We should mention that the binomial random variable X defined in Definition 5. See. Prooj The probability that the first k trials are successes and the remaining n .3). 
5.2 NORMAL RANDOM VARIABLES

The normal distribution is by far the most important continuous distribution used in statistics. Many reasons for its importance will become apparent as we study its properties below.

DEFINITION 5.2.1 The normal density is given by

(5.2.1) f(x) = [1/(√(2π) σ)] exp[−(x − μ)²/(2σ²)], −∞ < x < ∞.

When X has the above density, we write symbolically X ~ N(μ, σ²). We can verify that ∫ f(x) dx = 1 for all μ and all positive σ by a rather complicated procedure using polar coordinates; see, for example, Hoel (1984, p. 78). The normal density is completely characterized by two parameters, μ and σ². From (5.2.1) it is clear that f(x) is symmetric and bell-shaped around μ. To study the effect of σ on the shape of f(x), observe that f(μ) = 1/(√(2π) σ), which shows that the larger σ is, the flatter f(x) is.

THEOREM 5.2.1 Let X ~ N(μ, σ²). Then EX = μ and VX = σ².

Proof. Putting z = (x − μ)/σ, EX becomes the integral of (σz + μ) exp(−z²/2)/√(2π); the term in σz integrates to zero by symmetry, so EX = μ. The variance is the integral of σ²z² exp(−z²/2)/√(2π), which equals σ² by integration by parts.

THEOREM 5.2.2 Let X ~ N(μ, σ²) and let Y = α + βX with β ≠ 0. Then Y ~ N(α + βμ, β²σ²).

Proof. Using the change-of-variables theorem of Chapter 3, the density g(y) of Y is obtained from f, and the resulting expression is the density of N(α + βμ, β²σ²).

Theorem 5.2.2 shows an important property of a normal random variable: a linear function of a normal random variable is again normal. A useful corollary of Theorem 5.2.2 is that if X is N(μ, σ²), then Z = (X − μ)/σ is N(0, 1), which is called the standard normal random variable.

We will often need to evaluate the probability P(x1 < X < x2) when X is N(μ, σ²). The direct evaluation of such an integral is difficult because the normal density does not have an indefinite integral; it may be approximately evaluated from a normal probability table or by a computer program based on a numerical method. Defining Z as above, we have

(5.2.8) P(x1 < X < x2) = P[(x1 − μ)/σ < Z < (x2 − μ)/σ],

and the right-hand side of (5.2.8) can be evaluated from the probability table of the standard normal distribution.

EXAMPLE 5.2.1 Assuming X ~ N(10, 4), calculate P(4 < X < 8). We have

P(4 < X < 8) = P(−3 < Z < −1), where Z ~ N(0, 1),
             = P(Z < −1) − P(Z < −3)
             = 0.1587 − 0.0013 (from the standard normal table)
             = 0.1574.

Sometimes the problem specifies a probability and asks one to determine the variance, as in the following example.

EXAMPLE 5.2.2 Assume that the life in hours of a light bulb is normally distributed with mean 100. If it is required that the life should exceed 80 with at least 0.9 probability, what is the largest value that σ can have? Let X be the life of a light bulb; then X ~ N(100, σ²). We must determine σ² so as to satisfy P(X > 80) ≥ 0.9. Defining Z = (X − 100)/σ, this is equivalent to requiring that 20/σ be at least as large as the 0.9 quantile of the standard normal distribution, which the table gives as about 1.28. We conclude that we must have σ ≤ 15.6 approximately.
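The two examples above can also be checked without a printed table. The following Python sketch (not from the book; it simply uses the error function from the standard library, and the bisection tolerance is an arbitrary choice) reproduces the table lookups in Examples 5.2.1 and 5.2.2.

from math import erf, sqrt

def phi(z):
    # standard normal distribution function via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Example 5.2.1: X ~ N(10, 4), so sigma = 2
print(round(phi(-1) - phi(-3), 4))           # P(4 < X < 8) ~ 0.1574

# Example 5.2.2: largest sigma with P(X > 80) >= 0.9 when X ~ N(100, sigma^2).
# P(X > 80) = phi(20 / sigma), so we need 20 / sigma >= z with phi(z) = 0.9.
lo, hi = 0.0, 5.0
for _ in range(60):                           # bisection for the 0.9 quantile
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if phi(mid) < 0.9 else (lo, mid)
z90 = (lo + hi) / 2
print(round(z90, 4), round(20 / z90, 2))      # ~1.2816 and sigma at most ~15.6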
5.3 BIVARIATE NORMAL RANDOM VARIABLES

DEFINITION 5.3.1 The bivariate normal density is defined by

(5.3.1) f(x, y) = {1/[2π σx σy √(1 − ρ²)]} exp{ −[1/(2(1 − ρ²))] [ ((x − μx)/σx)² − 2ρ((x − μx)/σx)((y − μy)/σy) + ((y − μy)/σy)² ] }.

THEOREM 5.3.1 Let (X, Y) have the density (5.3.1). Then the marginal densities f(x) and f(y) and the conditional densities f(y | x) and f(x | y) are univariate normal densities, and we have EX = μx, VX = σx², EY = μy, VY = σy², and Correlation(X, Y) = ρ. Moreover, the conditional distribution of Y given X = x is

(5.3.2) N[μy + ρ σy σx⁻¹ (x − μx), σy²(1 − ρ²)].

Proof. The joint density f(x, y) can be rewritten as f = f1 · f2, where f1 is the density of N[μy + ρσyσx⁻¹(x − μx), σy²(1 − ρ²)] and f2 is the density of N(μx, σx²). Integrating with respect to y gives ∫ f dy = f2, because f1 is a normal density in y and f2 does not depend on y; therefore X ~ N(μx, σx²), and by symmetry Y ~ N(μy, σy²). It also follows that the conditional distribution of Y given X = x is as stated in (5.3.2). All that is left to show is that Correlation(X, Y) = ρ. Taking iterated expectations,

EXY = Ex E(XY | X) = Ex[X E(Y | X)] = Ex{X[μy + ρσyσx⁻¹(X − μx)]} = μxμy + ρσyσx,

hence Cov(X, Y) = ρσyσx and Correlation(X, Y) = ρ.

Note that the expression for E(Y | X) obtained in (5.3.2) is precisely the best linear predictor of Y based on X obtained in Chapter 4.

In the above discussion we have given the bivariate normal density (5.3.1) as a definition and then derived its various properties in Theorem 5.3.1. We can also prove that (5.3.1) is indeed the only function of x and y that possesses these properties. The next theorem shows a very important property of the bivariate normal distribution.

THEOREM 5.3.2 If X and Y are bivariate normal and α and β are constants, then αX + βY is normal.

Proof. Because of Theorem 5.2.2, we need to prove the theorem only for the case β = 1. Define W = αX + Y. Then

(5.3.7) P(W ≤ t) = ∫ [∫_{−∞}^{t−αx} f(y | x) dy] f(x) dx.

Differentiating both sides of (5.3.7) with respect to t and denoting the density of W by g(t), we have g(t) = ∫ f(t − αx | x) f(x) dx. Evaluating this integral shows that g is a normal density; if we define μ* = αμx + μy and (σ*)² = α²σx² + 2αρσxσy + σy², we can write W ~ N(μ*, (σ*)²).

THEOREM 5.3.3 If X and Y are bivariate normal and Cov(X, Y) = 0, then X and Y are independent.

Proof. If we put ρ = 0 in (5.3.1), we immediately see that f(x, y) = f(x) f(y). Therefore X and Y are independent by the definition of independence in Chapter 3.

By applying Theorem 5.3.2 repeatedly, we can easily prove that a linear combination of n-variate normal random variables is normal.
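As a purely numerical illustration of Theorem 5.3.1 (this sketch is not from the book; the parameter values, seed, window around x0, and number of draws are all arbitrary choices), the following Python fragment generates bivariate normal pairs and checks that the average of Y among draws with X near a fixed value x0 is close to μy + ρ(σy/σx)(x0 − μx).

import random
from math import sqrt

mu_x, mu_y, sig_x, sig_y, rho = 1.0, 2.0, 1.5, 0.5, 0.6
random.seed(1)

def draw_pair():
    # construct (X, Y) from two independent standard normals
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    x = mu_x + sig_x * z1
    y = mu_y + rho * sig_y * z1 + sig_y * sqrt(1 - rho**2) * z2
    return x, y

x0, band = 2.0, 0.05
ys = [y for x, y in (draw_pair() for _ in range(500_000)) if abs(x - x0) < band]
print("simulated E(Y | X near x0):", round(sum(ys) / len(ys), 3))
print("theoretical conditional mean:", round(mu_y + rho * sig_y / sig_x * (x0 - mu_x), 3))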
Recall from Chapter 4 that for any pair of random variables X and Y there exists a random variable Z such that Y can be decomposed as

(5.3.10) Y = μy + ρ(σy/σx)(X − μx) + σy Z, with EZ = 0 and VZ = 1 − ρ² and Cov(X, Z) = 0.

If X and Y are bivariate normal, Z is also normal because of Theorem 5.3.2, and Z and X are independent because of Theorem 5.3.3; hence E(Z | X) = EZ = 0 and V(Z | X) = VZ = 1 − ρ². Taking the conditional mean and variance of both sides of (5.3.10), we arrive again at (5.3.2). Since we showed in Chapter 4 that E(Y | X) is the best predictor of Y based on X, the best predictor and the best linear predictor coincide in the case of the normal distribution—another interesting feature of normality.

It is important to note that the conclusion of Theorem 5.3.2 does not necessarily follow if we merely assume that each of X and Y is univariately normal; see Ferguson (1967, p. 111) for an example of a pair of univariate normal random variables which are jointly not normal. Conversely, the linearity of E(Y | X) does not imply the joint normality of X and Y, as examples in Chapter 4 indicate.

THEOREM 5.3.4 Let {Xi}, i = 1, 2, . . . , n, be pairwise independent and identically distributed as N(μ, σ²). Then X̄ = (1/n) Σ_{i=1}^{n} Xi is N(μ, σ²/n).

The following two examples are applications of the preceding theorems.

EXAMPLE 5.3.1 Suppose X and Y are distributed jointly normal with EX = 1, EY = 2, VX = VY equal to a common value, and correlation coefficient ρ = 1/2. Calculate P(2.2 < Y < 3.2 | X = 3). By Theorem 5.3.1 the conditional distribution of Y given X = 3 is normal, with mean and variance given by (5.3.2), so the probability can be evaluated from the standard normal table.

EXAMPLE 5.3.2 If you wish to estimate the mean of a normal population whose variance is 9, how large a sample should you take so that the probability is at least 0.8 that your estimate will not be in error by more than 0.5? Put X̄ = (1/n) Σ Xi; then X̄ ~ N(μ, 9/n), and we want to choose n so that P(|X̄ − μ| < 0.5) ≥ 0.8. Defining the standard normal Z = √n (X̄ − μ)/3, the inequality above is equivalent to P(|Z| < √n/6) ≥ 0.8, which by the standard normal table implies n > 59; therefore, the answer is 60.

5.4 MULTIVARIATE NORMAL RANDOM VARIABLES

In this section we present results on multivariate normal variables in matrix notation. The student unfamiliar with matrix analysis should read Chapter 11 before this section. The results of this section will not be used directly until Section 9.7 and Chapters 12 and 13. (Throughout this section, a matrix is denoted by a boldface capital letter and a vector by a boldface lowercase letter.) Let x be an n-dimensional column vector with Ex = μ and Vx = Σ, where σij = Cov(xi, xj), i, j = 1, 2, . . . , n; we sometimes write σi² for σii.

DEFINITION 5.4.1 We say x is multivariate normal with mean μ and variance-covariance matrix Σ, denoted N(μ, Σ), if its density is given by

f(x) = (2π)^(−n/2) |Σ|^(−1/2) exp[ −(1/2)(x − μ)′ Σ⁻¹ (x − μ) ].

The reader should verify that in the case of n = 2 the above density reduces to the bivariate density (5.3.1).
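The following short sketch (not part of the original text; it uses NumPy, and the mean vector, covariance matrix, seed, and sample size are arbitrary choices) illustrates Definition 5.4.1 by drawing from N(μ, Σ) and checking that the sample mean vector and sample covariance matrix are close to μ and Σ.

import numpy as np

mu = np.array([1.0, 2.0, 0.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=200_000)

print(np.round(x.mean(axis=0), 2))      # close to mu
print(np.round(np.cov(x.T), 2))         # close to Sigma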
Now we state without proof generalizations of Theorems 5.3.1 through 5.3.3 to the multivariate case.

THEOREM 5.4.1 Let x ~ N(μ, Σ) and let A be an m × n matrix of constants such that m ≤ n and the rows of A are linearly independent. Then Ax ~ N(Aμ, AΣA′).

THEOREM 5.4.2 Let x ~ N(μ, Σ) and partition x′ = (y′, z′), where y is h-dimensional and z is k-dimensional with h + k = n. Partition Σ conformably as

Σ = [Σ11 Σ12; Σ21 Σ22],

where Σ11 = Vy = E[(y − Ey)(y − Ey)′], Σ22 = Vz = E[(z − Ez)(z − Ez)′], Σ12 = E[(y − Ey)(z − Ez)′], and Σ21 = (Σ12)′. Then any subvector of x, such as y or z, is multivariate normal, and the conditional distribution of y given z (similarly for z given y) is multivariate normal with

E(y | z) = Ey + Σ12 Σ22⁻¹ (z − Ez) and V(y | z) = Σ11 − Σ12 Σ22⁻¹ Σ21.

THEOREM 5.4.3 Let x ~ N(μ, Σ) and let y and z be defined as in Theorem 5.4.2. If Σ12 = 0, then y and z are independent; that is to say, f(x) = f(y) f(z), where f(y) and f(z) are the multivariate densities of y and z, respectively.

EXERCISES

1. (Section 5.1) Five fair dice are rolled once. Let X be the number of aces that turn up. Compute EX, VX, and P(X ≥ 4).

2. (Section 5.2) Let X = S and Y = T + TS², where S and T are independent and distributed as N(0, 1) and N(1, 1), respectively. Obtain E(Y | X) and V(Y | X).

3. (Section 5.2) Suppose X, Y, and W are mutually independent and distributed as X ~ N(1, 1), Y ~ N(2, 9), and W ~ B(1, 0.5). Calculate P(U < 5), where U = WX + (1 − W)Y.

4. (Section 5.3) Suppose (X, Y) ~ BN(0, 0, 1, 1, ρ), meaning that X and Y are bivariate normal with zero means and unit variances and correlation ρ. Find the best predictor and the best linear predictor of Y² given X and find their respective mean squared prediction errors.

5. (Section 5.3) Suppose U and V are independent and each is distributed as N(0, 1). Define X and Y by Y = X − 1 − U and X = 2Y − 3 − V. Find the best predictor and the best linear predictor of Y given X and calculate their respective mean squared prediction errors.

6. (Section 5.3) Let (Xi, Yi), i = 1, 2, . . . , n, be i.i.d. (independent and identically distributed) drawings from bivariate normal random variables with EX = 1, EY = 2, VX = 4, VY = 9, and Cov(X, Y) = 2. Calculate the indicated probability involving the sample means X̄ and Ȳ.
6 LARGE SAMPLE THEORY

We have already alluded to results in large sample theory without stating them in exact terms. In Chapter 1 we mentioned that the empirical frequency r/n, where r is the number of heads in n tosses of a coin, converges to the probability of heads; in Chapter 4, that a sample mean converges to the population mean; and in Chapter 5, that the binomial variable is approximately distributed as a normal variable. The first two are examples of a law of large numbers, and the third is an example of a central limit theorem. In this chapter we shall make the notions of these convergences more precise. Most of the theorems will be stated without proofs; for the proofs the reader should consult, for example, Chung (1974), Rao (1973), Serfling (1980), or Amemiya (1985).

6.1 MODES OF CONVERGENCE

Unlike the case of the convergence of a sequence of constants, for a sequence of random variables we need several modes of convergence: convergence in probability, convergence in mean square, and convergence in distribution. Let us first review the definition of the convergence of a sequence of real numbers.

DEFINITION 6.1.1 A sequence of real numbers {a_n}, n = 1, 2, . . . , is said to converge to a real number a if for any ε > 0 there exists an integer N such that for all n > N we have |a_n − a| < ε.

Now we want to generalize Definition 6.1.1 to a sequence of random variables. If X_n were a random variable, we could not require the inequality (6.1.1) exactly, because it would be sometimes true and sometimes false; we could only talk about the probability of (6.1.1) being true. This suggests that we should modify the definition in such a way that the conclusion states that (6.1.1) holds with a probability approaching 1 as n goes to infinity. Thus we have

DEFINITION 6.1.2 (convergence in probability) A sequence of random variables {X_n} is said to converge to a random variable X in probability if for any ε > 0 and δ > 0 there exists an integer N such that for all n > N we have P(|X_n − X| < ε) > 1 − δ. We write X_n →P X as n → ∞, or plim X_n = X. The last equality reads "the probability limit of X_n is X." (Alternatively, the if clause may be paraphrased as follows: if lim P(|X_n − X| < ε) = 1 for any ε > 0.)

DEFINITION 6.1.3 (convergence in mean square) A sequence {X_n} is said to converge to X in mean square if lim E(X_n − X)² = 0. We write X_n →M X.

DEFINITION 6.1.4 (convergence in distribution) A sequence {X_n} is said to converge to X in distribution if the distribution function F_n of X_n converges to the distribution function F of X at every continuity point of F. We write X_n →D X, and we call F the limit distribution of {X_n}. If {X_n} and {Y_n} have the same limit distribution, we write X_n LD= Y_n.

There is still another mode of convergence, almost sure convergence; a definition can be found in any of the aforementioned books, but we will not use it here.

The following two theorems state that convergence in mean square implies convergence in probability, which, in turn, implies convergence in distribution.
THEOREM 6.1.1 (Chebyshev) X_n →M X implies X_n →P X.

Theorem 6.1.1 is deduced from the following inequality due to Chebyshev:

(6.1.2) P[g(X) ≥ ε²] ≤ E g(X)/ε²,

where g(·) is any nonnegative continuous function. This inequality follows from the simple result

(6.1.3) E g(X) = ∫ g(x) f(x) dx ≥ ∫_S g(x) f(x) dx ≥ ε² P(S), where S = {x | g(x) ≥ ε²};

here we have assumed the existence of the density for simplicity. To prove Theorem 6.1.1, take g to be the square and apply (6.1.2) to X_n − X. In many applications the simplest way to show X_n →P X is to show E(X_n − X)² → 0 and then to apply Theorem 6.1.1.

THEOREM 6.1.2 X_n →P X implies X_n →D X.

The following two theorems are very useful in proving the convergence of a sequence of functions of random variables.

THEOREM 6.1.3 Let X_n be a vector of random variables with a fixed finite number of elements, and let g be a function continuous at a constant vector point a. Then X_n →P a implies g(X_n) →P g(a).

THEOREM 6.1.4 (Slutsky) If X_n →D X and Y_n →P α, then (i) X_n + Y_n →D X + α; (ii) X_n Y_n →D αX; (iii) X_n/Y_n →D X/α, provided α ≠ 0.

We state without proof the following generalization of the Slutsky theorem. Suppose that g is a continuous function except for finite discontinuities and that X_1n, X_2n, . . . , X_Kn converge jointly to X_1, X_2, . . . , X_K in distribution. Then the limit distribution of g(X_1n, . . . , X_Kn) is the same as the distribution of g(X_1, . . . , X_K). Here the joint convergence of {X_in} to {X_i} is an important necessary condition.

6.2 LAWS OF LARGE NUMBERS AND CENTRAL LIMIT THEOREMS

Given a sequence of random variables {X_i}, define X̄_n = n⁻¹ Σ_{i=1}^{n} X_i. A law of large numbers (LLN) specifies the conditions under which X̄_n − EX̄_n converges to 0 in probability. This law is sometimes referred to as a weak law of large numbers to distinguish it from a strong law of large numbers, which concerns almost sure convergence; we do not use the strong law in this book.

THEOREM 6.2.1 (Khinchine) Let {X_i} be i.i.d. with EX_i = μ. Then X̄_n →P μ.

If, instead, {X_i} are merely uncorrelated with common mean μ and common variance σ², then VX̄_n = σ²/n → 0, so X̄_n →M μ and therefore X̄_n →P μ by Theorem 6.1.1 (Chebyshev).

Now we ask the question: what is an approximate distribution of X̄_n when n is large? Suppose a law of large numbers holds, so that X̄_n converges to a constant. Its limit distribution is then degenerate and therefore uninteresting. It is more meaningful to inquire into the limit distribution of Z_n = (VX̄_n)^(−1/2)(X̄_n − EX̄_n), because VZ_n = 1 for all n. A central limit theorem (CLT) specifies the conditions under which Z_n converges in distribution to a standard normal random variable. We shall state two central limit theorems—Lindeberg-Lévy and Liapounov.

THEOREM 6.2.2 (Lindeberg-Lévy) Let {X_i} be i.i.d. with EX_i = μ and VX_i = σ². Then Z_n = √n (X̄_n − μ)/σ →D N(0, 1).

THEOREM 6.2.3 (Liapounov) Let {X_i} be independent with EX_i = μ_i, VX_i = σ_i², and E|X_i − μ_i|³ = m_3i. If a condition on the third absolute moments (the Liapounov condition) holds, then Z_n →D N(0, 1).

These two CLTs are complementary: the assumptions of one are more restrictive in some respects and less restrictive in other respects than those of the other. Both are special cases of the most general CLT, which is due to Lindeberg and Feller; we shall not use it in this book, because its condition is more difficult to verify.

We now introduce the term asymptotic distribution, which means the "approximate distribution when n is large." Given the mathematical result Z_n →D N(0, 1), we shall make statements such as "the asymptotic distribution of Z_n is N(0, 1)" (written Z_n →A N(0, 1)) or "the asymptotic distribution of X̄_n is N(EX̄_n, VX̄_n)." These statements should be regarded merely as more intuitive paraphrases of the result Z_n →D N(0, 1). We may also say that X̄_n is asymptotically normal with asymptotic mean EX̄_n and asymptotic variance VX̄_n.
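The following Monte Carlo sketch (not part of the original text; the exponential population, sample sizes, seed, and replication counts are arbitrary choices) illustrates Khinchine's LLN and the Lindeberg-Lévy CLT numerically: the sample mean settles near μ, and the standardized mean Z_n behaves approximately like N(0, 1).

import random
from math import sqrt

random.seed(2)
mu, sigma = 1.0, 1.0                     # exponential(1) has mean 1 and variance 1

def sample_mean(n):
    return sum(random.expovariate(1.0) for _ in range(n)) / n

# LLN: the sample mean approaches mu as n grows
for n in (10, 100, 10_000):
    print(n, round(sample_mean(n), 3))

# CLT: Z_n = sqrt(n)(Xbar - mu)/sigma is roughly N(0, 1) for large n
n, reps = 200, 5_000
zs = [sqrt(n) * (sample_mean(n) - mu) / sigma for _ in range(reps)]
print("fraction with |Z_n| < 1.96:", round(sum(abs(z) < 1.96 for z in zs) / reps, 3))  # ~0.95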
6.3 NORMAL APPROXIMATION OF BINOMIAL

Here we shall consider in detail the normal approximation of a binomial variable as an application of the Lindeberg-Lévy CLT (Theorem 6.2.2). In Section 5.1 we defined a binomial variable X as a sum of i.i.d. Bernoulli variables {Y_i}: X = Σ_{i=1}^{n} Y_i, where Y_i = 1 with probability p and 0 with probability q = 1 − p, so that EY_i = p and VY_i = pq. Since {Y_i} satisfy the conditions of the Lindeberg-Lévy CLT, we can conclude that

(X − np)/√(npq) →D N(0, 1).

We may state alternatively that X/n →A N(p, pq/n) or that X →A N(np, npq).

EXAMPLE 6.3.1 Let X be as defined in Example 5.1.1, the number of heads in five tosses of a fair coin. Since EX = 2.5 and VX = 1.25, we shall approximate the binomial X by the normal X* ~ N(2.5, 1.25). To compare the two distributions, we draw the probability step function of the binomial X and the density function f(x) of N(2.5, 1.25) in Figure 6.1. The figure suggests that P(X = 1) should be approximated by P(0.5 < X* < 1.5), P(X = 2) by P(1.5 < X* < 2.5), and so on. As for P(X = 0), it may be approximated either by P(X* < 0.5) or by P(−0.5 < X* < 0.5); the former seems preferable, because it makes the sum of the approximate probabilities equal to unity. The same is true of P(X = 5). The true probabilities and their approximations, after some rounding off, are given in Table 6.1.

TABLE 6.1 Normal approximation of B(5, 0.5)
X    Probability    Approximation
0    0.031          0.037
1    0.156          0.149
2    0.313          0.314
3    0.313          0.314
4    0.156          0.149
5    0.031          0.037

EXAMPLE 6.3.2 Change the above example to a value of p other than 0.5. The same procedure applies, and the true probabilities and their approximations are summarized in Table 6.2 and Figure 6.2.
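The entries of Table 6.1 can be reproduced directly. The following Python fragment (not from the book; it is only a numerical check, with the rounding an arbitrary choice) computes the exact B(5, 0.5) probabilities and the continuity-corrected N(2.5, 1.25) approximations described above.

from math import comb, erf, sqrt

n, p = 5, 0.5
mu, sd = n * p, sqrt(n * p * (1 - p))

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

for k in range(n + 1):
    exact = comb(n, k) * p**k * (1 - p) ** (n - k)
    lo = phi((k - 0.5 - mu) / sd) if k > 0 else 0.0   # open below for k = 0
    hi = phi((k + 0.5 - mu) / sd) if k < n else 1.0   # open above for k = n
    print(k, round(exact, 3), round(hi - lo, 3))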
EXAMPLE 6.3.3 If 5% of the labor force is unemployed, what is the probability that one finds three or more unemployed workers among twelve randomly chosen workers? What if 50% of the labor force is unemployed? Let X be the number of unemployed workers among the twelve workers; then X ~ B(12, p), where we first assume p = 0.05. We first calculate the exact probability P(X ≥ 3) from the binomial distribution and then compare it with the normal approximation; the comparison is repeated for p = 0.5.

6.4 EXAMPLES

EXAMPLE 6.4.1 Let {X_i} be i.i.d. with EX_i = μ and VX_i = σ². Prove that the sample variance S_n² = n⁻¹ Σ (X_i − X̄)² converges to σ² in probability. For this purpose note the identity S_n² = n⁻¹ Σ X_i² − X̄². Because S_n² is clearly a continuous function of n⁻¹ Σ X_i² and X̄, and each of these converges in probability to its population counterpart by Khinchine's LLN, the desired result follows from Theorem 6.1.3.

EXAMPLE 6.4.2 Let {X_i} be independent with EX_i = μ and VX_i = σ_i². Under what conditions on σ_i² does the standardized sample mean converge to N(0, 1)? The condition of Theorem 6.2.3 (Liapounov) supplies an answer; a sufficient condition is that the relevant ratio of moments should converge to 0 as n goes to infinity.

EXAMPLE 6.4.3 Let {X_i} and {Y_i} be i.i.d. with EX_i = μ_X > 0 and EY_i = μ_Y, independent of each other. By Khinchine's LLN, X̄ →P μ_X and Ȳ →P μ_Y, and therefore, by Theorem 6.1.3, Ȳ/X̄ →P μ_Y/μ_X.

EXAMPLE 6.4.4 With the same setup, obtain the asymptotic distribution of Ȳ/X̄. The next step is to find an appropriate normalization of (Ȳ/X̄ − μ_Y/μ_X) to make it converge to a proper random variable. Writing the difference with a common denominator, the numerator converges to a normal variable with an appropriate normalization, and the denominator converges to a constant in probability, so that (iii) of Theorem 6.1.4 (Slutsky) can be used.

EXERCISES

1. (Section 6.1) Prove that if a sequence of random variables converges to a constant in distribution, it converges to the same constant in probability.

2. (Section 6.1) Give an example of a sequence of random variables which converges to a constant in probability but not in mean square, and an example of a sequence which converges in distribution but not in probability.

3. (Section 6.2) Let {X_i} be independent with EX = μ and VX = σ². What more assumptions on {X_i} are needed in order for n⁻¹ Σ (X_i − X̄)² to converge to σ² in probability? What more assumptions are needed for its asymptotic normality?

4. (Section 6.3) It is known that 5% of a daily output of machines are defective. What is the probability that a sample of 10 contains 2 or more defective machines? Solve this exercise both by using the binomial distribution and by using the normal approximation.

5. (Section 6.3) There is a coin which produces heads with an unknown probability p. How many times should we throw this coin if the proportion of heads is to lie within 0.05 of p with probability at least 0.9?

6. (Section 6.4) Let {X_i} be i.i.d. with EX = 0 and VX = σ² < ∞. (a) Obtain plim n⁻¹ Σ X_i X_{i+1}. (b) Obtain the limit distribution of the same quantity after an appropriate normalization.

7. (Section 6.4) Suppose X ~ N[exp(αp), 1] and Y ~ N[exp(α), 1], independent of each other, and estimate p by p̂ = log X̄ / log Ȳ based on samples of size n on X and Y. Prove the consistency of p̂ (see Definition 7.2.5, p. 132) and derive its asymptotic distribution. At each step of the derivation, indicate clearly which theorems of Chapter 6 are used.

8. (Section 6.4) Let X have the Poisson distribution P(X = k) = λ^k e^(−λ)/k!, k = 0, 1, 2, . . . Note that EX = VX = λ and V(X²) = 4λ³ + 6λ² + λ. Derive the probability limit and the asymptotic distribution of the estimator of λ considered there.

9. (Section 6.4) Let {(X_i, Y_i)} be i.i.d. with means μ_X and μ_Y, variances σ_X² and σ_Y², and covariance σ_XY. Derive the asymptotic distribution of the indicated function of the sample means, explaining carefully each step of the derivation and indicating at each step which convergence theorems you have used.

10. (Section 6.4) Let {X_i} be i.i.d. with EX = μ_X ≠ 0 and VX = σ². Obtain the asymptotic distribution of (a) X̄² and (b) 1/X̄.
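One way to attack the coin-tossing sample-size exercise above is with the normal approximation of Section 6.3. The following sketch (not from the book; since p is unknown, the use of the worst case p(1 − p) ≤ 1/4 is an assumption, as are the tolerance and search strategy) finds the smallest n for which P(|r/n − p| < 0.05) ≥ 0.9 under that worst case.

from math import erf, sqrt

def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

def coverage(n, p=0.5, eps=0.05):
    sd = sqrt(p * (1 - p) / n)          # standard deviation of the sample proportion
    return phi(eps / sd) - phi(-eps / sd)

n = 1
while coverage(n) < 0.9:
    n += 1
print(n, round(coverage(n), 4))          # about 271 tosses under the worst-case variance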
7 POINT ESTIMATION

Chapters 7 and 8 are both concerned with estimation: Chapter 7 with point estimation and Chapter 8 with interval estimation. The goal of point estimation is to obtain a single-valued estimate of a parameter in question; the goal of interval estimation is to determine the degree of confidence we can attach to the statement that the true value of a parameter lies within a given interval. For example, suppose we want to estimate the probability (p) of heads on a given coin on the basis of five heads in ten tosses. Guessing p to be 0.5 is an act of point estimation. We can never be perfectly sure that the true value of p is 0.5; at most we can say that p lies within an interval, say (0.3, 0.7), with a particular degree of confidence. This is an act of interval estimation. In this chapter we discuss estimation from the standpoint of classical statistics. The Bayesian method, in which point estimation and interval estimation are more closely connected, will be discussed in Chapter 8.

7.1 WHAT IS AN ESTIMATOR?

7.1.1 Sample Moments

In Chapter 1 we stated that statistics is the science of estimating the probability distribution of a random variable on the basis of repeated observations drawn from the same random variable. If we denote the random variable in question by X, the n repeated observations in mathematical terms mean a sequence of n mutually independent random variables X1, X2, . . . , Xn, each of which has the same distribution as X. (We say that {Xi} are i.i.d.) We call the basic random variable X, whose probability distribution we wish to estimate, the population, and we call (X1, X2, . . . , Xn) a sample of size n. For example, if X is the height of a male Stanford student, Xi is the height of the ith student randomly chosen. If we want to estimate the probability (p) of heads for a given coin, we can define X = 1 if a head appears and X = 0 if a tail appears, and Xi represents the outcome of the ith toss of the same coin. Note that (X1, X2, . . . , Xn) are random variables before we observe them; once we observe them, they become a sequence of numbers, such as (1, 0, 0, 1, . . .) or (5.9, 6.2, . . .). These observed values will be denoted by lowercase letters (x1, x2, . . . , xn). If (X, Y) is a bivariate population, we call {(Xi, Yi)}, i = 1, 2, . . . , n, a bivariate sample of size n on (X, Y), where the pairs are mutually independent and have the same distribution as (X, Y).

In Chapter 4 we defined population moments of various kinds. Here we shall define the corresponding sample moments; sample moments are "natural" estimators of the corresponding population moments. We define

Sample mean: X̄ = (1/n) Σ Xi
Sample variance: S_X² = (1/n) Σ (Xi − X̄)²
Sample kth moment around the mean: (1/n) Σ (Xi − X̄)^k
Sample covariance: S_XY = (1/n) Σ (Xi − X̄)(Yi − Ȳ)
Sample correlation: S_XY / (S_X S_Y)
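As a small worked illustration of these definitions (this fragment is not part of the original text; the data values are arbitrary numbers chosen only for the example), the following Python lines compute the sample moments for a bivariate data set.

x = [1.2, 0.8, 1.5, 2.0, 0.9, 1.6]
y = [2.1, 1.7, 2.6, 3.2, 1.9, 2.5]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n
sx2 = sum((xi - xbar) ** 2 for xi in x) / n                         # sample variance of X
sy2 = sum((yi - ybar) ** 2 for yi in y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / n    # sample covariance
r = sxy / (sx2 ** 0.5 * sy2 ** 0.5)                                 # sample correlation

print(round(xbar, 3), round(sx2, 3), round(sxy, 3), round(r, 3))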
The observed values of the sample moments are also called by the same names; they are defined by replacing the capital letters in the definitions above by the corresponding lowercase letters, and the observed values of the sample mean and the sample variance are denoted by x̄ and s_x².

The following way of representing the observed values of the sample moments is instructive. Let (x1, x2, . . . , xn) be the observed values of a sample and define a discrete random variable X* such that P(X* = xi) = 1/n, i = 1, 2, . . . , n. Then the moments of X* are the observed values of the sample moments of X. We shall call X* the empirical image of X and its probability distribution the empirical distribution of X. Note that X* is always discrete, regardless of the type of X.

7.1.2 Estimators in General

We may sometimes want to estimate a parameter of a distribution other than a moment. An example is the probability (p1) that the ace will turn up in a roll of a die. A "natural" estimator in this case is the ratio of the number of times the ace appears in n rolls to n—denote it by p̂1. We shall show that it is in fact a function of moments. Let Xi be the outcome of the ith roll of a die and define Yi = 1 if Xi = 1 and Yi = 0 otherwise. Then p̂1 = (1/n) Σ Yi, and since Yi is a function of Xi, p̂1 is a function of X1, X2, . . . , Xn. More generally, let pj = P(X = j), j = 1, 2, . . . , 6, and consider the six identities

E(X^k) = Σ_{j=1}^{6} j^k pj, k = 0, 1, . . . , 5.

When k = 0, this reduces to the identity which states that the sum of the probabilities is unity, and the remaining five identities for k = 1, 2, . . . , 5 are the definitions of the first five moments around zero. We can solve these six equations for the six unknowns {pj} and express each pj as a function of the five moments. If we replace these five moments with their corresponding sample moments, we obtain estimators of {pj}. This method of obtaining estimators is known as the method of moments. In Section 7.3 we shall learn that p̂1 is also a maximum likelihood estimator; the method of moments estimator sometimes coincides with the maximum likelihood estimator, but in general it is not as good, because it does not use the information contained in the higher moments.

We call any function of a sample by the name statistic. Thus an estimator is a statistic used to estimate a parameter. Note that an estimator is a random variable; its observed value is called an estimate. In general, we estimate a parameter θ by some function of the sample.

We have mentioned that sample moments are "natural" estimators of population moments. Are they good estimators? This question cannot be answered precisely until we define the term "good" in Section 7.2. But let us concentrate on the sample mean and see what we can ascertain about its properties, using the term "good" in its loose everyday meaning. (1) Using the results of Chapter 4, EX̄ = EX, which means that the population mean is close to a "center" of the distribution of the sample mean. (2) If VX = σ² is finite, then VX̄ = σ²/n, which shows that the degree of dispersion of the distribution of the sample mean around the population mean is inversely proportional to the sample size n. (3) Using Theorem 6.2.1 (Khinchine's law of large numbers), plim X̄ = EX; if VX is finite, the same result also follows from (1) and (2) because of Theorem 6.1.1 (Chebyshev). On the basis of these results, we can say that the sample mean is a "good" estimator of the population mean.

7.1.3 Nonparametric Estimation

In parametric estimation we can use two methods. (1) Distribution-specific method: the distribution is assumed to belong to a class of functions that are characterized by a fixed and finite number of parameters—for example, the normal—and these parameters are estimated. (2) Distribution-free method: the distribution is not specified and the first few moments are estimated. In nonparametric estimation, by contrast, we attempt to estimate the probability distribution itself. In this book we shall discuss only parametric estimation.

The estimation of a probability distribution is simple for a discrete random variable taking a small number of values, but it poses problems for a continuous random variable. For example, suppose we want to estimate the density of the height of a Stanford male student, assuming that it is zero outside the interval [4, 7]. We must divide this interval into 3/d small intervals of length d and then estimate the ordinate of the density function over each small interval by the proportion of students whose height falls into that interval. The difficulty of this approach is characterized by a dilemma: if d is large, the approximation of a density by a probability step function cannot be good, but if d is small, many intervals will contain only a small number of observations unless n is very large. Nonparametric estimation for a continuous random variable is therefore useful only when the sample size is very large. The reader who wishes to study nonparametric density estimation should consult Silverman (1986).

7.2 PROPERTIES OF ESTIMATORS

7.2.1 Ranking Estimators

Inherent problems exist in ranking estimators, as illustrated by the following example.

EXAMPLE 7.2.1 Population: X = 1 with probability p, X = 0 with probability 1 − p. Sample: (X1, X2). Estimators: T = (X1 + X2)/2, S = X1, and W = 1/2.

In Figure 7.1 we show the probability step functions of the three estimators for four different values of the parameter p. This example shows two kinds of ambiguities which arise when we try to rank the three estimators. (1) For a particular value of the parameter, it is not clear which of the three estimators is preferred, and it is not easy to choose a particular one. (2) One estimator may be preferred for some values of the parameter and another estimator for other values; for example, T dominates W for values of p near 0 or 1, but W dominates T for p near 1/2. These ambiguities are due to the inherent nature of the problem and should not be lightly dealt with. Because we usually must choose one estimator over the others, however, we shall have to find some way to get around them.

7.2.2 Various Measures of Closeness

The ambiguity of the first kind is resolved once we decide on a measure of closeness between the estimator and the parameter. There are many reasonable measures of closeness, and there is no a priori reason to prefer one over the others in every situation. In the following discussion we denote two competing estimators by X and Y and the parameter by θ; note that θ is always a fixed number in the present analysis. Each of six statements gives a condition under which estimator X is preferred to estimator Y. Adopting a particular measure of closeness is thus equivalent to defining the term better; the term strictly better is defined analogously, and if X is preferred to Y and Y is not preferred to X, we say X is strictly preferred to Y.

The six criteria include: requiring |X − θ| ≤ |Y − θ| with probability one; requiring P(|X − θ| > ε) ≤ P(|Y − θ| > ε) for every ε; requiring Eg(X − θ) ≤ Eg(Y − θ) for every continuous function g which is nonincreasing for x < 0 and nondecreasing for x > 0; requiring Eg(|X − θ|) ≤ Eg(|Y − θ|) for every continuous and nondecreasing function g; requiring E(X − θ)² ≤ E(Y − θ)², which is criterion (5); and requiring P(|X − θ| ≤ |Y − θ|) ≥ P(|X − θ| > |Y − θ|), which is criterion (6). Criteria (3) and (4) are sometimes referred to as universal dominance and stochastic dominance, respectively; the idea of stochastic dominance is also used in the finance literature—see, for example, Huang and Litzenberger (1988)—and for a rigorous treatment of these comparisons see Hwang (1985).

Theorems 7.2.1 through 7.2.9 establish which of these criteria imply which: some pairs are equivalent, some imply others in one direction only, and some do not imply each other at all. The results are summarized in Figure 7.4, in which an arrow indicates the direction of an implication and a dashed line between a pair of criteria means that neither implies the other. Criteria (1) through (5) are transitive; criterion (6) is not. The proofs use Chebyshev-type inequalities, the fact that a continuous function can be approximated to any desired degree of accuracy by a linear combination of step functions, and simple counterexamples constructed on the sample space [0, 1] with the probability of any interval equal to its length (Figures 7.2 and 7.3, and the estimators S and T of Example 7.2.1).

7.2.3 Mean Squared Error

Although all the criteria defined in Section 7.2.2 are reasonable (except possibly criterion (6), because it is not transitive), statisticians have most frequently used criterion (5), known as the mean squared error. We shall follow this practice and define the term better in terms of this criterion throughout this book, unless otherwise noted.

DEFINITION 7.2.1 Let X and Y be two estimators of θ. We say X is better (or more efficient) than Y if E(X − θ)² ≤ E(Y − θ)² for all θ in Θ and E(X − θ)² < E(Y − θ)² for at least one value of θ in Θ. (Here Θ denotes the parameter space, the set of all the possible values the parameter can take; in Example 7.2.1 it is the closed interval [0, 1].)

DEFINITION 7.2.2 Let θ̂ be an estimator of θ. We call E(θ̂ − θ)² the mean squared error of the estimator.

By adopting the mean squared error criterion, we have eliminated (though somewhat arbitrarily) the ambiguity of the first kind. The ambiguity of the second kind remains, as we shall illustrate by referring again to Example 7.2.1. The mean squared errors of the three estimators as functions of p are obtained as

(7.2.1) E(T − p)² = p(1 − p)/2,
(7.2.2) E(S − p)² = p(1 − p),
(7.2.3) E(W − p)² = (1/2 − p)².

For example, at p = 3/4 these equal 3/32, 3/16, and 1/16, respectively.
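The comparison of the three mean squared errors can be tabulated directly. The following fragment (not part of the original text; the grid of p values is an arbitrary choice) evaluates (7.2.1)–(7.2.3) for T = (X1 + X2)/2, S = X1, and W = 1/2.

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    mse_t = p * (1 - p) / 2
    mse_s = p * (1 - p)
    mse_w = (0.5 - p) ** 2
    print(p, round(mse_t, 4), round(mse_s, 4), round(mse_w, 4))
# T is never worse than S, but T versus W depends on p: W does better near
# p = 1/2 and T does better near 0 or 1, which is the second kind of ambiguity.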
We can ignore all the inadmissible estimators and pay a t t a d o n only to the class of admissible estimators.2. In many .1. DEFINITION 7. in Example 7.2.5. we see that Z is chosen both by the subjective strategy with the uniform prior density for p and by the minimax strategy.1. therefore.1? Subective strategy. One possible way to combine the two estimators is to define (7.2. whereas T does well for the values of p near 0 or 1. This strategy is highly subjective. For example. a Bayesian would proceed in an entirely different manner. suppose we believe a priori that any value of p is equally likely and express this situation by a uniform density over the interval [0.5 and to choose one after considering the a priori likely values of p. practical situations the statistician prefers a biased estimator with a small mean squared error to an unbiased estimator with a large mean squared error.1. we say that the estimator is inadmissible. We see in Figure 7. We say that 6 is inadmissible if there is another estimator which is better in the sense of Definition 7.2. Minimax strategy. if we eliminate W and Z from our consideration.5 that W does well for the values of p around %.5 Best Linear Unbiased Estimator Neither of the two strategies discussed in Section 7.0 bias. This strategy may be regarded as the most pessimistic and risk-averse approach. Their primary strategy is that of defining a certain class of estimators within which we can find the best estimator in the sense of Definition 7. 11. W.2. it should not be regarded as an absolutely necessary condition.2. as in the case of S by T in the above example. It is more in the spirit of Bayesian statistics.2.2. 7. we choose the estimator for which the largest possible value of the mean squared error is the smallest. although.4 is the primary strategy of classical statisticians. When we compare the three estimators T. I t b a dninimaxestimatwif. In our example. In our example. for any other estimator 0. in Example 7.2 ) Properties of Estimators 125 When an estimator is dominated by another estimator. Although unbiasedness is a desirable property of an estimator.1. 7. A certain degree of arbitrariness is unavoidable in this strategy. T is preferred to W by this strategy. We formally define ~ e6 tbe an estimator of 0. One of the classes most commonly considered is that of linear unbiased estimators. as we shall explain in Chapter 8.2. T and W. T and W are equally good by this criterion. and Z. D E F I N I T I 0 N 7 . although the second is less objectionable to them.4) Z = XI + X2 + 1 4 Thus. For example. it is usually not discussed in a textbook written in the framework of classical statistics. We first define 6 issaidtobean unbimedestimatorof0if~6= Ofor all 0 E 8. T is the best estimator within the class consisting of only T and S.2. An estimator is admissible if it is not inadmissible.3 and is graphed as the dashed curve in Figure 7. 0 f Among the three estimators in Example 7. 1. we obtain ProoJ It follows from the identity From (7. The class of linear estimators consists of estimators which can be ex- . Therefore.2.2.2 1 Properties of Estimators 127 Theorem 7. We have The mean squared error is the sum of the variance and the bias squared. MSE(Z) < hrlSE(T) In the following example we shall generalize Example 7.P.~ 6( )~ 80)]= ( ~6 0 ) ~ (6 ~ 6 ) 0. Estimators: T = As we stated in Section 7.10 gives a formula which relates the bias to the mean squared error. 
the sample mean is generally an unbiased estimator of the population mean.2 Population: X = 1 with probability p. . as we show in (7. 0 Since ( n 1 ) / ( 2 n for everv n if + + 1 ) is a decreasing function of n. . = ( n .2.10.126 7 1 Point Estimation 7. E X A M P L E 7.1 instead of n to produce an unbiased estimator of the population variance. The same cannot necessarily be said of all the other moments defined in that section. Sample: (XI. T H E o R E M 7. This formula is convenient when we calculate the mean squared error of an estimator.1 to the case of a general sample of size n and compare the mean squared errors of the generalized versions of the estimators T and Z using Theorem 7. That is. the sample variance defined there is biased.8) and (7.2. = 0 with probability 1 . X2.2.2.l ) a 2 / n . for any estimator 6 of 0.2.11) we conclude that MSE(Z) only if < MSE(T) if and Note that the second equality above holds because E [ ( 6 . using Theorem 7.1.10. For example.1 0 where MSE stands for mean squared error. .2. .13).2.Xn).For this reason some authors define the sample variance by dividing the sum of squares by n . we have Therefore ES. We have Since E T = p. moreover. of the population mean.1 2 - L e t { X .2. The solution to this spect to {a.14) - EX alX. = p.1 are linear estimators. or BLUE. (In words.2.Xl = a2 Now consider the identity I af. 0 . 0 (Note that we could define the class of linear estimators as a0 + C:=la.2.17).nbeindependentandhavethe common mean p and variance a2. Theorem '7.2.2. This class is considered primarily for its mathematical convenience rather than for its practical usefulness.14) would ensure that a0 = 0.2.2.1 is merely a special case of this theorem. THEOREM 7.2.11 provides the ~ a respect : to (a. which has a wide applicability. we obtain where we used the condition Zr=la.2. Consider the identity (7.20) is the sum of squares and hence nonnegative.} subject to condition solution to minimizing ~ r ~ with CZ".."=la.15) VX 5 V /= a i X j for all {ail satisfy~ng(7.a. The unbiasedness condition (7.2. and (7.X P .Xl with a constant term. Consider the class of linear estimators of p which can be written in the form Z. (7.) Prooj We have Consider the problem of minimizing ~ : = ~ with a f re= 1.) subject to the condition C. From a purely mathematical standpoint.~=laa. . Therefore the theorem follows from (7.128 7 1 Point Estimation pressed as a linear function of the sample ( X I .Xl and impose the unbiasedness condition (7.2. . This would not change the theorem. 2 . .2.2. = l / n . . The theorem follows by noting that the left-hand side of the first equality of (7.16) u2 V z = -.19).2. = 1. t= n 1 Then (7. = l / n for all i. Despite the caveats we have expressed concerning unbiased estimators and linear estimators.2:14) implies Zr=lai = 1.2. . i = 1 . because the unbiasedness condition (7.18) is the sum of squared terms and hence nonnegative. = 1 to obtain the second equality.15) holds if and only if a.b. the sample mean is the best linear unbiased estimatol.X n ) All four estimators considered in Example 7.19) clearly holds if and only if a.2.17) V a.2. We shall prove a slightly more general minimization problem.) We now know that the dominance of T over S in Example '7.16). Therefore. the equality in (7.14) G=1 and. . . the following theorem is one of the most important in mathematical statistics. . THEOREM 7.2. problem is given by Proof. } .b. n and (7. noting that the left-hand side of (7.11 The equality in (7. 
The class of linear estimators consists of estimators which can be expressed as a linear function of the sample (X1, X2, . . . , Xn), that is, in the form Σ a_i X_i with constant coefficients {a_i}. (We could define the class as a_0 + Σ a_i X_i with a constant term; this would not change the theorem below, because the unbiasedness condition would ensure that a_0 = 0.) All four estimators considered in Example 7.2.1 are linear estimators. This class is considered primarily for its mathematical convenience rather than for its practical usefulness. Despite the caveats we have expressed concerning unbiased estimators and linear estimators, the following theorem is one of the most important in mathematical statistics.

THEOREM 7.2.11 Let {X_i}, i = 1, 2, . . . , n, be independent and have the common mean μ and variance σ². Consider the class of linear estimators of μ which can be written in the form Σ a_i X_i and impose the unbiasedness condition

(7.2.14) E Σ a_i X_i = μ for all μ,

which is equivalent to the condition Σ a_i = 1. Then

(7.2.16) VX̄ = σ²/n ≤ V(Σ a_i X_i) for all {a_i} satisfying (7.2.14),

and the equality holds if and only if a_i = 1/n for all i. (In words, the sample mean is the best linear unbiased estimator, or BLUE.)

We shall prove a slightly more general minimization problem, of which Theorem 7.2.11 is a special case.

THEOREM 7.2.12 The problem of minimizing Σ a_i² with respect to {a_i} subject to the condition Σ a_i b_i = 1 has the solution a_i = b_i / Σ b_j². 

Proof. Write Σ a_i² as the minimum value plus a sum of squared terms; since the left-hand side of the resulting identity is nonnegative and vanishes only at the stated solution, the theorem follows. Theorem 7.2.11 follows from Theorem 7.2.12 by putting b_i = 1 for all i.

We now give two examples of the application of Theorem 7.2.12.

EXAMPLE 7.2.3 Let X_i be the return per share of the ith stock and let c_i be the number of shares of the ith stock to purchase. Put EX_i = μ_i and VX_i = σ_i², and assume that the X_i are uncorrelated. Determine {c_i} so as to minimize V(Σ c_i X_i) subject to M = Σ c_i μ_i, where M is a known constant. Since V(Σ c_i X_i) = Σ c_i² σ_i², putting a_i = c_i σ_i and b_i = μ_i/(M σ_i) reduces this to the minimization problem of Theorem 7.2.12.

EXAMPLE 7.2.4 Let θ̂_i, i = 1, 2, . . . , n, be unbiased estimators of θ with variances σ_i². Assume that the θ̂_i are uncorrelated. Choose {c_i} so that Σ c_i θ̂_i is unbiased and has minimum variance. Since the unbiasedness condition is equivalent to the condition Σ c_i = 1, this problem is also reduced to the minimization problem of Theorem 7.2.12, and the solution is c_i = σ_i^(−2) / Σ σ_j^(−2).

Theorem 7.2.11 shows that the sample mean has a minimum variance (and hence minimum mean squared error) among all the linear unbiased estimators. We have already seen that a biased estimator, such as W or Z of Example 7.2.1, can have a smaller mean squared error than the sample mean for some values of the parameter. The following example provides a case in which the sample mean is dominated by an unbiased, nonlinear estimator.

EXAMPLE 7.2.5 Let X be uniformly distributed on (0, θ). We have EX = θ⁻¹ ∫ x dx = θ/2 and EX² = θ⁻¹ ∫ x² dx = θ²/3, so VX = θ²/12. Consider two estimators of θ: θ̂_1 = 2X̄, which is unbiased, and an estimator based on Z = max(X1, . . . , Xn). An intuitive motivation for the second estimator is as follows: since θ is the upper bound of X, we know that Z ≤ θ and Z approaches θ as n increases; therefore it makes sense to multiply Z by a factor which is greater than 1 but decreases monotonically to 1. More rigorously, let G(z) and g(z) be the distribution and density function of Z, respectively. Then for 0 < z < θ,

G(z) = P(Z < z) = P(X1 < z) P(X2 < z) · · · P(Xn < z) = (z/θ)^n,

and differentiating with respect to z gives g(z) = n z^(n−1)/θ^n, so that EZ = θ⁻ⁿ ∫ n zⁿ dz = nθ/(n + 1). Hence θ̂_2 = [(n + 1)/n] Z is unbiased; we shall show later that θ̂_2 is the bias-corrected maximum likelihood estimator. Comparing the mean squared errors of the two unbiased estimators, MSE(θ̂_1) = θ²/(3n) and MSE(θ̂_2) = θ²/[n(n + 2)], we conclude that MSE(θ̂_2) ≤ MSE(θ̂_1), with equality holding if and only if n = 1.

7.2.6 Asymptotic Properties

Thus far we have discussed only the finite sample properties of estimators. It is frequently difficult, however, to obtain the exact moments, let alone the exact distribution, of estimators. In such cases we must obtain an approximation of the distribution or the moments. Asymptotic approximation is obtained by considering the limit of the sample size going to infinity; in Chapter 6 we studied the techniques necessary for this most useful approximation. One of the most important asymptotic properties of an estimator is consistency.

DEFINITION 7.2.5 We say θ̂ is a consistent estimator of θ if plim θ̂ = θ. (See Definition 6.1.2.)

In Examples 6.4.1 and 6.4.3 we gave conditions under which the sample mean and the sample variance are consistent estimators of their respective population counterparts. We can also show that, under reasonable assumptions, all the sample moments are consistent estimators of their population values. Another desirable property of an estimator is asymptotic normality (see Section 6.2). In Example 6.4.2 we gave conditions under which the sample mean is asymptotically normal; under reasonable assumptions all the sample moments can be shown to be asymptotically normal, and we may even say that all the consistent estimators we are likely to encounter in practice are asymptotically normal. Consistent and asymptotically normal estimators can be ranked by Definition 7.2.1, using the asymptotic variance in lieu of the exact mean squared error; this defines the term asymptotically better or asymptotically efficient.

7.3 MAXIMUM LIKELIHOOD ESTIMATOR: DEFINITION AND COMPUTATION

7.3.1 Discrete Sample

Suppose we want to estimate the probability (p) that a head will appear for a particular coin; we toss it ten times and a head appears nine times. Call this event A. Then we suspect that the coin is loaded in favor of heads: in other words, we conclude that p = 1/2 is not likely. If p were 1/2, event A would be expected to occur only about once in a hundred times, since P(A | p = 1/2) = C(10, 9)(1/2)^10 ≈ 0.01. In the same situation p = 3/4 is more likely, because P(A | p = 3/4) = C(10, 9)(3/4)^9(1/4) ≈ 0.19, and p = 9/10 is even more likely, because P(A | p = 9/10) = C(10, 9)(9/10)^9(1/10) ≈ 0.39. Thus it makes sense to call P(A | p) = C(10, 9) p^9 (1 − p) the likelihood function of p given event A. Note that it is the probability of event A given p, but we give it a different name when we regard it as a function of p. The maximum likelihood estimator of p is the value of p that maximizes P(A | p), which in our example is equal to 9/10. More generally, we state

DEFINITION 7.3.1 Let (X1, X2, . . . , Xn) be a random sample on a discrete population characterized by a vector of parameters θ = (θ1, θ2, . . . , θK), and let xi be the observed value of Xi. Then we call L = P(X1 = x1, . . . , Xn = xn | θ) the likelihood function of θ given (x1, x2, . . . , xn), and we call the value of θ that maximizes L the maximum likelihood estimator.

Recall that the purpose of estimation is to pick a probability distribution among the many (usually infinite) probability distributions that could have generated the given observations. Maximum likelihood estimation means choosing that probability distribution under which the observed values could have occurred with the highest probability. It therefore makes good intuitive sense. In addition, we shall show in Section 7.4 that the maximum likelihood estimator has good asymptotic properties. The following two examples show how to derive the maximum likelihood estimator in the case of a discrete sample.
A function of a sample, such as Σⁿᵢ₌₁ xᵢ in the present case, that contains all the necessary information about a parameter is called a sufficient statistic.

EXAMPLE 7.3.2 This is a generalization of Example 7.3.1. Let Xᵢ, i = 1, 2, . . . , n, be a discrete random variable which takes K integer values 1, 2, . . . , K with probabilities p₁, p₂, . . . , p_K. This is called the multinomial distribution. (The subsequent argument is valid if Xᵢ takes a finite number of distinct values, not necessarily integers.) Let nⱼ, j = 1, 2, . . . , K, be the number of times we observe X = j. (Thus Σᴷⱼ₌₁ nⱼ = n.) The likelihood function is given by

(7.3.7)  L = c p₁^(n₁) p₂^(n₂) · · · p_K^(n_K),

where c = n!/(n₁! n₂! · · · n_K!). The log likelihood function is given by

(7.3.8)  log L = log c + Σᴷⱼ₌₁ nⱼ log pⱼ.

Differentiate (7.3.8) with respect to p₁, p₂, . . . , p_(K−1), noting that p_K = 1 − p₁ − p₂ − · · · − p_(K−1), and set the derivatives equal to zero:

(7.3.9)  ∂ log L/∂pⱼ = nⱼ/pⱼ − n_K/p_K = 0,  j = 1, 2, . . . , K − 1.

Adding the identity n_K/p_K = n_K/p_K to the above, we can write the K equations as

(7.3.10)  pⱼ = a nⱼ,  j = 1, 2, . . . , K,

where a is a constant which does not depend on j. Summing both sides of (7.3.10) with respect to j and noting that Σᴷⱼ₌₁ pⱼ = 1 and Σᴷⱼ₌₁ nⱼ = n yields

(7.3.11)  a = 1/n.

Therefore, from (7.3.10) and (7.3.11) we obtain the maximum likelihood estimator

(7.3.12)  p̂ⱼ = nⱼ/n,  j = 1, 2, . . . , K.

The die example of Section 7.1.2 is a special case of this example.
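The closed-form result p̂ⱼ = nⱼ/n of Example 7.3.2 can be checked numerically. The sketch below is an editorial addition (the counts are invented for illustration): it compares the log likelihood at the maximum likelihood estimate with its value at randomly drawn probability vectors, none of which should do better.

```python
# Numerical check of the multinomial result: with observed counts n_j, the MLE is n_j / n.
import numpy as np

counts = np.array([12, 30, 8, 50])            # hypothetical counts n_j, j = 1..K
n = counts.sum()
p_mle = counts / n

def log_lik(p):
    return np.sum(counts * np.log(p))          # log c omitted: it does not depend on p

rng = np.random.default_rng(1)
# Compare the MLE with random points on the probability simplex.
best_random = max(log_lik(rng.dirichlet(np.ones(len(counts)))) for _ in range(10_000))
print("p_mle        =", p_mle)
print("logL at MLE  =", log_lik(p_mle))
print("best random  =", best_random)           # never exceeds the value at the MLE
```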
7.3.2 Continuous Sample

For the continuous case, the principle of the maximum likelihood estimator is essentially the same as for the discrete case, and we need to modify Definition 7.3.1 only slightly.

DEFINITION 7.3.2 Let (X₁, X₂, . . . , Xₙ) be a random sample on a continuous population with a density function f(· | θ), where θ = (θ₁, θ₂, . . . , θ_K), and let xᵢ be the observed value of Xᵢ. Then we call L = ∏ⁿᵢ₌₁ f(xᵢ | θ) the likelihood function of θ given (x₁, x₂, . . . , xₙ), and the value of θ that maximizes L the maximum likelihood estimator.

EXAMPLE 7.3.3 Let {Xᵢ}, i = 1, 2, . . . , n, be a random sample on N(μ, σ²) and let {xᵢ} be their observed values. Then the likelihood function is given by

(7.3.13)  L = ∏ⁿᵢ₌₁ (2πσ²)^(−1/2) exp[−(xᵢ − μ)²/(2σ²)],

so that

(7.3.14)  log L = −(n/2) log(2π) − (n/2) log σ² − (1/(2σ²)) Σⁿᵢ₌₁ (xᵢ − μ)².

Equating the derivatives to zero, we obtain

(7.3.15)  ∂ log L/∂μ = (1/σ²) Σⁿᵢ₌₁ (xᵢ − μ) = 0

and

(7.3.16)  ∂ log L/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σⁿᵢ₌₁ (xᵢ − μ)² = 0.

The maximum likelihood estimators of μ and σ², denoted μ̂ and σ̂², are obtained by solving (7.3.15) and (7.3.16). (Do they indeed give a maximum?) Therefore we have

(7.3.17)  μ̂ = (1/n) Σⁿᵢ₌₁ xᵢ = x̄

and

(7.3.18)  σ̂² = (1/n) Σⁿᵢ₌₁ (xᵢ − x̄)².

They are the sample mean and the sample variance, respectively.

7.3.3 Computation

In all the examples of the maximum likelihood estimator in the preceding sections, it has been possible to solve the likelihood equation explicitly, equating the derivative of the log likelihood to zero, as in (7.3.3). The likelihood equation is often so highly nonlinear in the parameters, however, that it can be solved only by some method of iteration. The most common method of iteration is the Newton–Raphson method, which can be used to maximize or minimize a general function, not just the likelihood function, and is based on a quadratic approximation of the maximand or minimand. Let Q(θ) be the function we want to maximize (or minimize). Its quadratic Taylor expansion around an initial value θ₁ is given by

(7.3.19)  Q(θ) ≅ Q(θ₁) + Q′(θ₁)(θ − θ₁) + (1/2) Q″(θ₁)(θ − θ₁)²,

where the derivatives are evaluated at θ₁. The second-round estimator of the iteration, denoted θ₂, is the value of θ that maximizes the right-hand side of the above equation; therefore

(7.3.20)  θ₂ = θ₁ − Q′(θ₁)/Q″(θ₁).

Next, θ₂ can be used as the initial value to compute the third-round estimator, and the iteration should be repeated until it converges. Whether the iteration converges to the global maximum, rather than to some other stationary point, and, if it does, how fast it converges, depend upon the shape of Q and the initial value. Various modifications have been proposed to improve the convergence.
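The following sketch is an editorial addition, not from the text: it implements the Newton–Raphson iteration just described for a likelihood equation with no closed-form solution, the location parameter of a Cauchy sample (a model that also appears in the exercises). The data, starting value, and stopping rule are arbitrary choices; as noted above, convergence depends on the shape of Q and on the initial value, so the sample median is used as a reasonable starting point.

```python
# A minimal sketch of the Newton-Raphson iteration for the Cauchy location MLE.
import numpy as np

def newton_raphson_mle(x, theta1, tol=1e-10, max_iter=50):
    """Maximize Q(theta) = sum_i log f(x_i | theta) for the Cauchy density
    f(x | theta) = 1 / (pi * (1 + (x - theta)**2)), using theta_{r+1} = theta_r - Q'/Q''."""
    theta = theta1
    for _ in range(max_iter):
        u = x - theta
        q1 = np.sum(2 * u / (1 + u**2))                  # Q'(theta)
        q2 = np.sum((2 * u**2 - 2) / (1 + u**2) ** 2)    # Q''(theta)
        step = q1 / q2
        theta -= step
        if abs(step) < tol:
            break
    return theta

rng = np.random.default_rng(2)
x = rng.standard_cauchy(200) + 3.0          # simulated data with true location 3.0
print("MLE of the location:", newton_raphson_mle(x, theta1=np.median(x)))
```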
7.4 MAXIMUM LIKELIHOOD ESTIMATOR: PROPERTIES

In Section 7.3 the likelihood function was always evaluated at the observed values of the sample, because there we were concerned only with the definition and computation of the maximum likelihood estimator. In this section, where we are concerned with the properties of the maximum likelihood estimator, we need to evaluate the likelihood function at the random variables X₁, X₂, . . . , Xₙ, which makes the likelihood function itself a random variable.

In Section 7.4.1 we show that the maximum likelihood estimator is the best unbiased estimator under certain conditions; we show this by means of the Cramér–Rao lower bound. In Sections 7.4.2 and 7.4.3 we show the consistency and the asymptotic normality of the maximum likelihood estimator under general conditions. In Section 7.4.3 we also define the concept of asymptotic efficiency, which is closely related to the Cramér–Rao lower bound. In Section 7.4.4 examples are given. To avoid mathematical complexity, some results are given without full mathematical rigor; for a rigorous discussion, see Amemiya (1985).

7.4.1 Cramér–Rao Lower Bound

We shall derive a lower bound to the variance of an unbiased estimator and show that in certain cases the variance of the maximum likelihood estimator attains the lower bound.

THEOREM 7.4.1 (Cramér–Rao) Let L(X₁, X₂, . . . , Xₙ | θ) be the likelihood function and let θ̂(X₁, X₂, . . . , Xₙ) be an unbiased estimator of θ. Then, under general conditions, we have

(7.4.1)  V(θ̂) ≥ 1 / E[(∂ log L/∂θ)²] = −1 / E[∂² log L/∂θ²].

The right-hand side is known as the Cramér–Rao lower bound (CRLB).

Sketch of Proof. (A rigorous proof is not possible, because the theorem uses the phrase "under general conditions.") Put X = θ̂ and Y = ∂ log L/∂θ in the Cauchy–Schwartz inequality of Chapter 4, so that [Cov(θ̂, ∂ log L/∂θ)]² ≤ V(θ̂) V(∂ log L/∂θ). Here the expectation operation E is taken with respect to the random variables X₁, X₂, . . . , Xₙ, so that each expectation is an n-tuple integral with respect to x₁, x₂, . . . , xₙ. We have

(7.4.2)  E(∂ log L/∂θ) = ∫ (1/L)(∂L/∂θ) L dx = (∂/∂θ) ∫ L dx = 0,

since ∫ L dx = 1; and, using the unbiasedness of θ̂, E(θ̂ ∂ log L/∂θ) = ∫ θ̂ (∂L/∂θ) dx = (∂/∂θ) ∫ θ̂ L dx = ∂θ/∂θ = 1, so that Cov(θ̂, ∂ log L/∂θ) = 1. We also have

(7.4.3)  V(∂ log L/∂θ) = E(∂ log L/∂θ)² = −E(∂² log L/∂θ²),

where the last equality follows from noting that E[(1/L)(∂²L/∂θ²)] = ∫ (∂²L/∂θ²) dx = (∂²/∂θ²) ∫ L dx = 0. Therefore, from (7.4.2) and (7.4.3), we have

(7.4.4)  V(θ̂) ≥ −1 / E[∂² log L/∂θ²],

which is the desired result.

The unspecified general conditions, known as regularity conditions, are essentially the conditions on L which justify interchanging the derivative and the integration operations in the equalities above. If, for example, the support of L (the domain over which L is positive) depends on θ, the conditions are violated because those equalities do not hold.

We shall give two examples in which the maximum likelihood estimator attains the Cramér–Rao lower bound and hence is the best unbiased estimator.

EXAMPLE 7.4.1 Let X ~ B(n, p). Here we must treat L as a random variable, so we substitute X for k in (7.3.2) and write log L = log C(n,X) + X log p + (n − X) log(1 − p). Differentiating twice with respect to p and taking the expectation, we obtain −E(∂² log L/∂p²) = n/p + n/(1 − p) = n/[p(1 − p)], so that CRLB = p(1 − p)/n. Since Vp̂ = V(X/n) = p(1 − p)/n by the result of Chapter 5, the maximum likelihood estimator p̂ attains the Cramér–Rao lower bound; therefore p̂ is the best unbiased estimator.

EXAMPLE 7.4.2 Let {Xᵢ} be as in Example 7.3.3 (normal density) except that we now assume σ² is known, so that μ is the only parameter to estimate. Differentiating (7.3.15) again with respect to μ, we obtain ∂² log L/∂μ² = −n/σ², so that CRLB = σ²/n. But we have previously shown that V(X̄) = σ²/n. Therefore the maximum likelihood estimator X̄ attains the Cramér–Rao lower bound: X̄ is the best unbiased estimator. It can also be shown that, even if σ² is unknown and estimated, X̄ is the best unbiased estimator of μ.
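A short Monte Carlo check (not part of the original text) of Example 7.4.1: the simulated variance of p̂ = X/n should be close to the Cramér–Rao lower bound p(1 − p)/n. The parameter values below are arbitrary.

```python
# Simulated variance of the binomial MLE versus the Cramer-Rao lower bound.
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 50, 200_000

p_hat = rng.binomial(n, p, size=reps) / n
print("simulated Var(p_hat):", p_hat.var())
print("CRLB p(1-p)/n       :", p * (1 - p) / n)
```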
7.4.2 Consistency

The maximum likelihood estimator can be shown to be consistent under general conditions. Define

(7.4.10)  Q_n(θ) = (1/n) log L = (1/n) Σⁿᵢ₌₁ log f(Xᵢ, θ),

where a random variable Xᵢ appears in the argument of f because we need to consider the property of the likelihood function as a random variable. (In the present analysis it is essential to distinguish the true value of the parameter, denoted θ₀, from the argument θ of the likelihood function.) Note that Q_n(θ) is maximized at the maximum likelihood estimator θ̂. To prove the consistency of θ̂, we essentially need to show that Q_n(θ) converges in probability to a nonstochastic function of θ, denoted Q(θ), which attains its global maximum at the true value θ₀; this is illustrated in Figure 7.6. [Figure 7.6: Convergence of log likelihood functions.]

First, since (7.4.10) shows that Q_n(θ) is 1/n times the sum of i.i.d. random variables log f(Xᵢ, θ), we can apply Khinchine's law of large numbers to obtain, for each θ, plim_{n→∞} Q_n(θ) = E log f(X, θ) ≡ Q(θ). Next we shall show why we can expect Q(θ) to be maximized at θ₀, provided that E log f(X, θ) < ∞. For this we need

THEOREM 7.4.2 (Jensen) Let X be a proper random variable (that is, not a constant) and let g(·) be a strictly concave function, that is, g[λa + (1 − λ)b] > λg(a) + (1 − λ)g(b) for any a < b and 0 < λ < 1. Then Eg(X) < g(EX).

Taking g to be log and X to be f(X, θ)/f(X, θ₀) in Theorem 7.4.2, where the expectation is taken with respect to the true density f(x, θ₀), we obtain for θ ≠ θ₀

(7.4.12)  E log [f(X, θ)/f(X, θ₀)] < log E [f(X, θ)/f(X, θ₀)].

But the right-hand side of the above inequality is equal to zero, because

(7.4.13)  E [f(X, θ)/f(X, θ₀)] = ∫ [f(x, θ)/f(x, θ₀)] f(x, θ₀) dx = ∫ f(x, θ) dx = 1.

Therefore we conclude

(7.4.14)  E log f(X, θ) < E log f(X, θ₀)  if θ ≠ θ₀;

in other words, Q(θ) is maximized at θ₀. We have essentially proved the consistency of the global maximum likelihood estimator. To prove the consistency of a local maximum likelihood estimator (a root of the likelihood equation), we should replace (7.4.14) by the statement that the derivative of Q(θ) is zero at θ₀; assuming that we can interchange the derivative and the expectation operation, this follows because E[∂ log f(X, θ)/∂θ] evaluated at θ₀ is zero, as was shown in the proof of Theorem 7.4.1. Since the asymptotic normality can be proved only for this local maximum likelihood estimator, henceforth this is always what we mean by the maximum likelihood estimator.

7.4.3 Asymptotic Normality

THEOREM 7.4.3 Under general conditions, the maximum likelihood estimator θ̂ is asymptotically distributed as

(7.4.16)  θ̂ ~ N(θ₀, −[E ∂² log L/∂θ²]⁻¹),

where the derivative is evaluated at θ₀.

Sketch of Proof. (Here we interpret the maximum likelihood estimator as a solution to the likelihood equation obtained by equating the derivative to zero.) By definition, ∂ log L/∂θ evaluated at θ̂ is zero. We expand it in a Taylor series around θ₀ to obtain

0 = (∂ log L/∂θ)|_{θ₀} + (∂² log L/∂θ²)|_{θ*} (θ̂ − θ₀),

where θ* lies between θ̂ and θ₀. Solving for (θ̂ − θ₀), we obtain

(7.4.22)  √n (θ̂ − θ₀) = −[(1/n)(∂² log L/∂θ²)|_{θ*}]⁻¹ (1/√n)(∂ log L/∂θ)|_{θ₀}.

The second factor on the right-hand side is 1/√n times the sum of the i.i.d. random variables ∂ log f(Xᵢ, θ₀)/∂θ, which have mean zero and variance E[∂ log f(X, θ₀)/∂θ]²; under general conditions it converges in distribution to a normal random variable by the Lindeberg–Lévy central limit theorem. The term in brackets converges in probability to E[∂² log f(X, θ₀)/∂θ²] by Khinchine's law of large numbers, because θ* lies between θ̂ and θ₀ and therefore converges in probability to θ₀ by the consistency just proved. The conclusion of the theorem then follows from Slutsky's theorem and the identity E[∂ log f/∂θ]² = −E[∂² log f/∂θ²] established in (7.4.3).

EXAMPLE 7.4.3 Let X have the density f(x) = (1 + θ)x^θ, 0 < x < 1, θ > −1, and consider estimating p ≡ EX = (1 + θ)/(2 + θ) on the basis of n observations X₁, X₂, . . . , Xₙ. The log likelihood in terms of θ is

(7.4.29)  log L = n log(1 + θ) + θ Σⁿᵢ₌₁ log xᵢ,

and since p is a one-to-one function of θ with 0 < p < 1, the log likelihood can also be expressed in terms of p. The maximum likelihood estimator of p, the comparison of its asymptotic variance with the variance of the sample mean X̄, and the moments of log X needed for that comparison are worked out below.
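The following simulation (an editorial addition, not from the text) illustrates Theorem 7.4.3 for the density f(x) = (1 + θ)x^θ used in the example that is continued below. For this density −E[∂² log L/∂θ²] = n/(1 + θ)², so the theorem predicts that the variance of the maximum likelihood estimator of θ is approximately (1 + θ)²/n in large samples; the simulation design is my own.

```python
# Monte Carlo illustration of the asymptotic variance of the MLE for f(x) = (1+theta) x**theta.
import numpy as np

rng = np.random.default_rng(4)
theta0, n, reps = 1.0, 400, 20_000

u = rng.uniform(size=(reps, n))
x = u ** (1.0 / (1.0 + theta0))               # inverse-CDF draw: F(x) = x**(1 + theta)
theta_hat = -n / np.log(x).sum(axis=1) - 1.0  # MLE: solves n/(1+theta) + sum(log x) = 0

print("mean of theta_hat        :", theta_hat.mean())
print("simulated Var(theta_hat) :", theta_hat.var())
print("asymptotic (1+theta)^2/n :", (1 + theta0) ** 2 / n)
```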
(xelog x)dx = --EL Let X have density f (x) = (1 + O)xe. 1 Differentiating (7.4.4. .4.20) follows from noting that The log likelihood function in terms of 0 is given by (7.p13t = l Since we have.33) . Somewhat more loosely than the above.4.(I .31) a log L dk n *. A significant consequence of Theorem '7. Differentiating (7.4. .1).29) log L = n. we must have and that the right-hand side satisfies the conditions for the Lindeberg-L6vy CLT (Theorem 6.144 7 1 Point Estimation I I " 7.28) into (7. we obtain (7. EXAMPLE 7. Therefore we define DEFINITION 7. we obtain the maximum likelihood estimator A consistent estimator is said to be asymptoticalZy eficient if its asymptotic distribution is given by ('7.p12 (1 .31) again. we have > -1.4.27) defines a one-to-one function and 0 0 < p < 1.33) 7. I: p2(1 .30) with respect to p yields (7.16).19) follows from noting that 1 Since (7. Obtain the maximum likelihood estimator of EX(= p) based on n we obtain from (7.4.4. log(1 S 0) + 0 C log x.4. Solving (7.4. (7. This is almost like (but not quite the same as) saying that the maximum likelihood estimator has the smallest asymptotic variance among all the consistent estimators. the conclusion of the theorem follows from the identity observations XI.4 Examples We shall give three examples to illustrate the properties of the maximum likelihood estimator and to compare it with the other estimators.4. the asymptotic variance of @. That is why our asymptotic results are useful. In this example.4. solving alogL/ap = 0 for p led to the closed-form solution (7.3: . by Theorem 7. where the maximum likelihood estimator is explicitly written as a function of the sample.4. Remark 2 .is given by (7.39). with mean ( p .1.i. 7.2.146 7 1 Point Estimation I 7. the asymptotic variance can be obtained by the method presented here.4.2. has been substituted for x. as in Examples 7. which we state as remarks.1) Therefore 2 Therefore the consistency of ji follows from Theorem 6.4.440) I < 1.) are i. in such cases the maximum likelihood estimator can be defined only implicitly by the likelihood equation.3.4.because we must treat @ as a random variable.d. as defined in Definition 6.4 1 Maximum Likelihood Estimator: Properties 147 Therefore. without appealing to the general result of Section 7.34).36) and (7.4.32) as AV(@)= p2(1 . Since @ in (7. and 7.3. Even then. For this purpose rewrite (7. But since {log X.3. This is not possible in many applications. Remark 1. Remark 4.2.4.1 ) / p as given in (7. from (7.32). In a situation such as this example.3. ~8 .1. let alone the exact distribution. we conclude (7. Remark 3. denoted AV(@). the Therefore.4.4.4. We have where X . we can derive the asymptotic normality directly without appealing to Theorem 7. of the estimator are difficult to find.3. however.p12 n ~ C Next we obtain the variance of the sample mean.we have by Khinchine's LLN (Theorem 6. means that both sides have the same limit distribution and is a conse- .32) is a nonlinear function of the sample.36) consistency can be directly shown by using the convergence theorems of Chapter 6. we can state The second equality with LD in (7.which expressed @ as an explicit function of the sample. - Finally.4.3.AV(P) = (2 'I4 > 0 for 0 < - P)n There are several points worth noting with regard to this example.43).3. . as pointed out in Section 7.3.4.1. the exact mean and variance. Similarly. The convergence in distribution appearing next in (7.3.7 I Illustration for function (7.x.3.x. one negative. 
> 0 and at a negative value of p when C.4. which can be obtained as follows: B y integration by parts. so that p is the sole parameter of the distribution. given by (7.7. < 0.4.4 Assuming u2 = p2 in Example 7. The function A is an even function depicted in Figure 7. From these two figures it is clear that log L is maximized at a positive value of y when CElx.7 (xi ..4.p12 2 2 2~ t = 1 C which can be written as There are two roots for the above.30).43) is a consequence of the Lindeberg-L6vy CLT (Theorem 6. < 0.4.".-log p . We know from the argument of the preceding paragraph that the positive root is the maximum likelihood estimator if Z.48) We shall study the shape of log L as a function of y.8. E X A M P L E 7.4.30) and found the value of p that maximizes (7. We would get the same estimator if we maximized (7. one positive.1.29) with respect to 0 and inserted the maximum likelihood estimator of 0 into the last term of (7.47) with respect to p equal to zero yields (7.lx.3 (normal density).27).49) B = -* := 1 C xi CL and C is a constant term ahat does not depend on the parameter k. Assume that p Z 0. Here we need the variance of log X.4 (Slutsky). Setting the derivative of (7.> 0 and the negative root if Cr=lx. and looks like Figure 7.the respective maximum likelihood estimators are related by 8. if two parameters and O2 are related by a one-to-ne continuous function O1 = g(02). Therefore Remark 5 .4. The shape of the function B depends on the sign of C:=. More generally. = g(&).2).47) log L = - n n 2 1 ' ' -log (2. obtain the maximum likelihood estimator of p and directly prove its consistency. From (7.4.4. G .4. We first expressed the log likelihood function in terms of p in (7.") .4.2.148 7 1 Point Estimation T64 f Maximum Likelihood Edrnat~r: Pmperties 129 quence of (iii) of Theorem 6.14) we have F I c u R E 7. 56) 1 L = .=ljkfi. 0). Therefore. We have. therefore. . n where z = max(xl.1.4. 2.1. 1.150 7 1 Point Estimation I Exercises 151 p are the same with probability approaching one as n goes to infinity. we have (7. 1 w) 2 = (7.4. take three values 1.5 Let the model be the same as in Example 7. Compare max(X1. . Therefore the maximum likelihood estimator is consistent.=~X. (Section 7. the support of the likelihood function depends on the unknown parameter 0 and.for 8 en O 2 z.1.5. Thus we see that &..4. = otherwise. j = 1. < 0. and root is consistent if p Suppose X1 and X 2 are independent.3. and 3 with probabilities pl. .2. Clearly.4. (b) Find a method of moments estimate of A different from the one in (a). therefore. the maximum likelihood estimator of 8 is 2. and ps.55) plim n-tm &=2 (-p 1 5 .2) which shows that the positive root is consistent if IJ. = j and Y. and 2.lx. (Section 7. the signs of Z.1).(-p + qpl) 2. 2. But because of (7. Xn be independent with exponential distribution with parameter A. the regularity conditions do not hold.3..53) 1=1 XI plim ---.IJ. st j. The likelihood function of the model is given by (7.2. xn).2. k = 0. j = 1.". = n-' Z:=&.4. Define Yj.53). > 0 and the negative . .3. = Z. the observed value of Z defined in Example 7. defined in that example is the biascorrected maximum likelihood estimator.49) Next we shall directly prove the consistency of the maximum likelihood estimator in this example. Since p = 0/2. satisfies n n-1 Z.54) plim n+m 1=I -- n 2p2.4. Therefore the asymptotic distribution cannot be obtained by the standard procedure given in Section '7. . XZ)and X1 X2 as two estimators of 0 by criterion (6) in Section 7.4. and 3. 
using Khinchine's LLN (Theorem 6. each distributed as U(0.4. 3. xp. = 1 if X.2.. the maximum likelihood estimator of p is 2/2 because of remark 5 of Example 7.2. .5. by Theorem 6.8 Illustration for function (7. + . = 0 if X. In this example. (a) Find a method of moments estimate of A. FIGuRE 7.2.2) Let X1.2) Let X.4. pl. . . n--tm n 1. Then show that j. 2. and 3. Further define $ . EXERCISES C (7. E X A M P L E 7. (Section 7. whereas (6) is not.3.i. If it is known that the probability a defendant has committed a crime is 0. (Section 7. Define Z = X Y . Let two estimators of p be defined by T = (X1 + X2)/2 and S = XI.2) Let XI and X2 be independent.p) 5 Eg(S . Define the following two estimators of 0 = p(l . = 0) = 1 . = P(X=O) = 1 . and X3 be independently distributed r18 B(l.p. (Section 7.d. (Section 7. The remaining I .5. where = (XI + X2 + X3)/3.) is convex if for any a < b and any 0 < A < 1.2.p.1) Let X B(n. Show that Eg(T .2. P ( Y = 1) =I/.2. You may consider the special case where n = 2 and the true value of p is equal to 3/4. Ag(a) (1 . 5. Note that a function g(. (Section 7. "2 = 1" eight times.0 1 < E) for every E > 0 and > for at least one value of E.5) Suppose we define better in the following way: "Estimator X is better than Y in the estimation of 0 if P(Ix .) 9. =p. p). Assume that these forecasts are independent and that each forecast is accurate with a known probability n.3.152 7 1 Point Estimation I Exercises 153 4.A)b]. and "2 = 0" eight times.3. p) and kt two estimators of p be defined as follows: Obtain the mean squared errors of the two estimators.1) Suppose the probability distribution of X and Y is given as follows: P(X=l) P(Y = 7.p. observations on Z yield "Z = 2" four times. Find the maximum likelihood estimator of p based - .3) Let XI and X2 be independent. If r of them say Stanford will win. and let each take 1 and 0 with probability p and 1 . (Section 7.2. 0) 5/4. 12.2) Show that criteria ( I ) through (5) are transitive. how would you estimate p? Justify your choice of estimator. X2.1) Suppose we want to estimate the probability that Stanford will win a football game. Note that we o b serve neither X nor Y . If n = 5 and r = 3.p) based on XI and X2. + - 6. = 1) = p and P(X.A)g(b) 2 g[Aa + (1 .. Suppose the only information we have about p consists of the forecasts of n people published in the Stanford Daily.3) Let XI. Define two estimators = and = (~/2) + (1/4). denoted by p. and X and Y are independent. (A more general theorem can be proved: in this model the sample mean is the best linear unbiased estimator of p in terms of an arbitrary convex loss function.0 1 < c) 2 P(IY . Can you say one estimator is better than the other? 1 13. X2.p proportion ofjurors acquit a defendantwho has not committed a crime with probability 0." Consider the binary model: P(X. and X 3 be independent binary random variables taking 1 with probability p and 0 with probability 1 .3) Let X1. (Section 7. find the maximum likelihood estimator of p when we observe that r jurors have acquitted the defendant.2.9 and acquit a criminal with probability 0.2. Supposing that twenty i.1) A proportion p of n jurors always acquit everyone. (Section 7. each taking the value of 1 with probability p and 0 with probability 1 . Show that the sample mean 2 is not the best linear unbiased estimator. what is your maximum likelihood estimator of p? + Which estimator do you prefer? Why? 8. 11. (Section 7. compute the maximum likelihood estimator of p. (Section 7. 
regardless of whether a defendant has committed a crime or not.3.p) for any convex function g and for any p. For what values of p is the mean squared error of jn smaller than that of $I? x 10.2.p . (Section 7.p. Assuming that a sample of size 4 from this distribution yielded observations 1. Assume XI < xp.14).3. n.5.5.2) i and 6' obtained by solving (7. (a) Show that if n = 1. (b) Show that it is also equal to the median of (Y. Suppose X N(p. (c) Show that the maximum likelihood estimator of EX is best unbiased. . Obtain the maximum likelihood estimator of p based on Nx i. (Section 7.3.d.3. X.d. are i. i = 1.2) Let the density function of X be given by f (x) = 2x/0 = for 0 5 x 5 0.15) and (7. assuming you know a priori that 0 5 p 5 0. (Section 7. (Section 7. .01.51. 0 > 0.5. (Section 7. and 4. . x > 0. 21. (a) Given i.1) for 0 < x 5 1. .0)2])-1. - Suppose that XI. otherwise. . (a) Derive the second-round estimator k2 of the Newton-Raphson iteration (7. 19. 2(x . independent of each other. and the maximum likelihood estimator is not unique.2) Let XI. I ) . n.4. (Section '7. .1)/(0 .3) 16. This is called the Box-Cox transformation. = 1) = a.4. sample Y1. . (Section 7.3. starting from an initial guess that ^XI = 1. . = 1/(40) for 0 =0 22. (Section 7. the maximum likelihood estimator of 0 is XI. .0.i.. . i = 1. 1) and Y N(2p.1) < x 5 20. X.1x1) (the Laplace or double-exponential density). Find the maximum likelihood estimator of 0.3. . . observations on X and Ny i. Y2.3.a.Ox). derive the maximum likelihood estimator of 0. (b) Show that if n = 2. . (Section 7. be a sample from the Cauchy distribution with the density f (x.3. (a) Find the maximum likelihood estimator of 0. = 0) = 1 . (b) For the following data.]. . .3. .i. 20.3. 3. (Section 7.3. where 0 < 0 < 1. P(X. the likelihood function has multiple maxima.2) Let XI. 14.i. 0 + 0. compute k2: 17.3.16) indeed Show that i maximize log L given by (7. observations on Y and show that it is best unbiased. Supposing that two independent observations on X yield xl and x2. .3. Y. . See Box and Cox (1964).2) Suppose that XI . . 18.1) Given f (x) = 0 exp(.154 7 1 Point Estimation ( Exercises 155 on a single observation on X. - - . are independent and that it is known that (x.2) The density of X is given by f(x) = 3/(40) f o r 0 5 x 5 0. (a) Show that the maximum likelihood estimator of 0 is the same as the least absolute deviations estimator that minimizes C(X.d.5. (b) Find the maximum likelihood estimator of EX.1) Suppose the probability distribution of X and Y is given as foBows: P(X. .)" 10 has a standard normal distribution. (Section 7.0.19). (b) Find the exact mean and variance of the maximum likelihood estimator of a assuming that n = 4 and the true value of a is 1. 15. .d. 0) = { ~ [+ l (x .i. . . with the common density f (x) = (1/2) exp(. X. 23. 2. find the maximum likelihood estimator of a. . calculate the maximum likelihood estimator of 0. be a sample drawn from a uniform distribution U[0 . Derive its variance for the case of n = 3. . p).3. (c) Show that the maximum likelihood estimator attains the Cram&Rao lower bound in this model. . . Find the maximum likelihood estimator of p assuming we know 0 < p 5 0. (a) Find the maximum likelihood estimate of p. p). .d.4. . where we assume p > 0. with P(X > t ) = exp(-kt).3) Suppose f (x) = 0/(1 + x ) ' " . . . * . H i n t :E log (1 X) = 0-'.2. is unknown. . 28. . i = 1. .4.1. (Section 7. be i.]and {Y. Y. p). .5. .) are (-2. respectively. 27. 26. 
In each experiment we toss the coin until a head appears and record the number of tosses required. -1.2) Let (XI. (Section 7. 2.i. 31. X = 1 with probability p . (Section 7. .3) Let (X. 1) and N(0.4.4. (Section 7. (Section 7. . .4.]. Obtain the maximum likelihood estimator of p and prove its consistency. Define 0 = exp(-A). what is your estimate of the true proportion of people who have not committed a crime? Also obtain the estimate of the mean squared error of your estimator. (Section 7. Assume that all the X's are independent of all the Y's.5. Calculate the maximum likelihood estimator of p and an estimate of its asymptotic variance. If 30 people are acquitted among 100 people who are brought to the court.4.d. (Section 7. i = 1.4. . (Section 7. where p > 0. .2. p. be the number of times and of ji2 = X = i in N trials. n. z = 1. Suppose we observe X = 1 three times.z = 1. .2.d. 32. 2.3) Let {X. be a random sample on N(p. 2. .3) Let {Xi].4.1. . + + . be i.3) In the same model as in Exercise 30. . where 0 < p < 1. ~ ( p ' I). . n.p). . be i. X = 2 four times. Derive the asymptotic variance of the maximum likelihood estimator of p based on a combined sample of (XI. Suppose the experiments yielded the following sequence of numbers: 1.3. 1.i = 1. 2 .X2.0. = 3 with probability (1 - p)2. -3. each distributed as B(1. 2. .156 7 1 Point Estimation 1 $9. 25. 5. let N.i = 1.3) Suppose that X has the Hardy-Weinberg distribution. (Section 7.3) Exercises 157 24.8 probability. and X = 3 three times.4. . .3) Using a coin whose probability of a head. Suppose that the observed values of {X. I ) and let (Y. 30. n. Prove its consistency. Y 2 . 2. . .i. 33. (b) Using the Cramtr-Rao lower bound.3. .].0. We are to observe XI. N(p.4. . 0 < x < m. . respectively. .]. be independent of each other and across i. we perform ten experiments. Also obtain its asymptotic variance and compare it with the variance of the sample mean.i. Xn) and (Yi. . 1.]. Find the maximum likelihood estimator of 0 and its asymptotic variance. V log (1 X) = o .).]and {X2. Prove the consistency of jil = 1 and obtain their asymptotic distributions as N goes to infinity. It is also known that those who have committed a crime are acquitted with 0. Let X and Y be independent and distributed as N(p. Find the maximum likelihood estimator of 0 based on a sample of size n from f and obtain its asymptotic variance in two ways: (a) Using an explicit formula for the maximum likelihood estimator. 5. Compute the maximum likelihood estimator of p and an estimate of its asymptotic variance. -1) and (1..5). n.3) It is known that in a certain criminal court those who have not committed a crime are always acquitted. = 2 with probability 2p(1 .~ . 0 > 0. (Section 7. (b) Obtain an estimate of the variance of the maximum likelihood estimator.2 probability and are convicted with 0. X2. .4. . (Section 7. (b) 8 = w.4. P(X = 0) = 1 - 0.g)'/n. Observe a sample of size n from f . (Section 7. 36. P(Y = 0 1 X = 0) = 0.4. 273. where we assume 0.4. - 37. Compare the asymptotic variances of the following two estimators of 0 = b . (c) If five drawings produced numbers 411. P(Y=OIX=l) = I . and P(X = 3) = '/S Suppose Xis observed N times and let N.3) Abox contains cards on which are written consecutive whole numbers 1 through N.4. (Section 7. .3) Suppose f (x) = l / ( b . Let X.I. and 585.25 5 0 5 1. P(X = 2) = (1 + 0)/3. Do you think f i is a good estimator what is the numerical value of k? of N? Why or why not? 39. 
(Section 7.158 7 1 Point Estimation I Exercises 159 - .2.4.15) in Examples 7.5.4.5.a: (a) 8 = maximum likelihood estimator (derive it). Derive the maximum likelihood estimator and compute its asymptotic variance. P ( Y = l I X = l ) =0.a) for a < x < b. (b) 8 = 243~(x. and we see that Y = 1 happens N1 times in N trials. (a) Find EX. Suppose we observe only Y and not X. 0 > 0. Find EN and V N . N 35.-- 34.Compare the asymptotic variances of the following two estimators of 0: (a) 8 = maximum likelihood estimator (derive it). where the K numbers drawn. where N is unknown.. 38.1 and 7.4.@ .1. denote the number obtained on the ith drawing. We are to draw cards at random from the box with replacement.0)/3. is the average value of (b) Define estimator = 2 z . P ( Y = 1 1 X = 0) = 0. . Find an explicit formula for the maximum likelihood estimator of 0 and derive its asymptotic distribution. 950. x r 0.3) Suppose that P(X = 1) = (1 .4. Define = I . Compute their variances.(3N1/N) and G2 = (3N2/ N) . (Section 7.3) Suppose f (x) = 0-hexp(-x/0). 156. (Section 7.4. and VX. Observe a sample of size n from f.3) Verify (7.3) Let the joint distribution of X and Y be given as follows: P(X = 1) = 0. be the number of times X = i. If we toss the coin 100 times and get 50 heads. equivalently. Therefore. N [ p . More information is contained in 8: namely. Xn) is a given estimator of a parameter 0 based on the sample XI.1. Note that we have used the word conjidence here and have deliberately avoided using the word probability. n. suppose we want to know the true probability.1 T =X A Let X. the classical and Bayesian methods of inference often lead to a conclusion that is essentially the same except for a difference in the choice of words. . "a 0. Xn The estimator 8 summarizes the information concerning 0 contained in the sample. we will have more confidence that p is very close to %. . be distributed as B(1. the smaller the mean squared error of 8. we do not use it concerning parameters. we have . . p.5. Then ." A confidence interval is constructed using some estimator of the parameter in question.95 conjidence interval for 0 is [a. because we will have. We toss it ten times and get five heads. X2. given 6.9 or 0. in classical statistics we use the word probability only when a probabilistic statement can be tested by repeated observations.4." This degree of confidence obviously depends on how good an estimator is. 2 . (An exception occurs in Example 8. 0. Our point estimate using the sample mean is %.1.1 INTRODUCTION Obtaining an estimate of a parameter is not the final purpose of statistical inference. Because we can never be certain that the true value of the parameter is exactly equal to an estimate. however. How should we express the information contained in 8 about 0 in the most meaningful way? Writing down the observed value of 8 is not enough-this is the act of point estimation.) The concept of confidence or confidence intervals can be best understood through examples. b] . because most reasonable estimators are at least asymp totically normal." or. we shall define a confidence interval mainly when the estimator used to construct it is either normal or asymptotically normal. The better the estimator. As discussed in Section 1. E X A M P L E 8.8. the more fully it captures the relevant information contained in the sample. In Section 8.p ) / n ] . This is an act of interval estimation and utilizes more information contained in 8. X2. 
where a chisquare distribution is used to construct a confidence interval concerning a variance.p ( l . carries out statistical inference. b] with 0.2. but we must still allow for the possibility that p may be. say. 8. Although there are certain important differences.2 1 Confidence Intervals 161 8.2. a better estimator. who uses the word probability for any situation. the greater confidence we have that 0 is close to the observed value of 8.2 CONFIDENCE INTERVALS 1 IE We shall assume that confidence is a number between 0 and 1 and use it in statements such as "a parameter 0 lies in the interval [a. we would like to know how much confidence we can have that 0 lies in a given interval. . . Thus. Although some textbooks define it in a more general way. i = 3. . of getting a head on a given coin. We would like to be able to make a statement such as "the true value is believed to lie within a certain interval with such and such wnjidence. . therefore. . in effect. This restriction is not a serious one. . which may be biased in either direction.95 confidence. . although we are fairly certain that p will not be 0.3 we shall examine how the Bayesian statistician.6 or 0. p ) . The word confidence. More generally. we would like to know how close the true value is likely to be to an estimated value in addition to just obtaining the estimate. For example. . suppose that 8(x1. has the same practical connotation as the word probability. The classical statistician's use of the word confidence may be somewhat like letting probability in through the back door. " Definition (8.p)'/p(l .8) is appealing as it equates the probability that a random interval contains p to the confidence that an observed value of the random interval contains p.4).2. Confidence is not as useful as probability. Equation (8. which in this example is the interval [0. Moreover.4) is motivated by the observation that the probability that T lies within a certain distance from p is equal to the confidence that p lies within the same distance from an observed value of T. however.p) as a function of p for fixed values of n and t. 11. we have from (8. we have a proportionately large confidence that the parameter is close to the observed value of the estimator. we have C ( I 1 )2 C(Z2). The probabilistic statement (8. We have also drawn horizontal lines whose ordinates are equal to k%nd k*2.5. From (8.96. Figure 8. b*) correspond to the yk and yk* confidence intervals.3) may be equivalently written as = 0.yk*. even after such an extension.2. By definition. assuming n = 10 and t = 0.2.This suggests that we may extend the definition of confidence to a Iarger class of sets than in (8.5.1 shows that if interval Zl contains interval Z2.6) is a legitimate one.2. consider n ( t . Next we want to study how confidence intervals change ask changes for fixed values of n and t. and C(a* < p < b*) = yk*. t ) and increasing in ( t .2. and we may further define C [ ( a< p < a*) U (b* < p < b)l = yk . we have which may be further rewritten as where Thus a 95% confidence interval keeps getting shorter as n increases-a reasonable result.2 1 Confidence Intervals 163 Let Z be N ( 0 . Then.4). For this purpose.2. This function is graphed in Figure 8.4) can be written as yk = P(IZI < k). respectively. since yk = 0. Then we could . so that confidence satisfies probability axiom (3) as well.2. Let us construct a 95% confidence interval of p. l ) and define (8.2.1. (8. For example.2.2) Similarly. It states that a random interval [hl ( T ) . 
and is decreasing in the interval ( 0 . 1 ) . Then we define mnjkhw b y If n = 100 and t which reads "the confidence that P lies in the interval defined by the inequality within the bracket is y c or "the yk confidence interval of p is as indicated by the inequality within the bracket.95 when k = 1. because it concerns a random variable T. It is easy to see that this function shoots up to infinity near P = 0 and 1.2. (8. For example. This is definitely a shortcoming of the confidence approach.2. confidence clearly satisfies probability axioms ( 1 ) and ( 2 ) if in ( 2 ) we interpret the sample space as the parameter space. Note that this definition establishes a kind of mutual relationship between the estimator and the parameter in the sense that if the estimator as a random variable is close to the parameter with a large probability.h2(T) 1 contains pwith probability yk. C(a < p < a*) cannot be uniquely determined from definition (8. Definition (8. b) and (a*.162 8 1 Interval Estimation 8.4) defines C(a < p < b) = yk. In Bayesian statistics we would be able to treat p as a random variable and hence construct its density function. attains the minimum value of 0 at p = t.1) we have approximately Suppose an observed value of T is t . because there are many important sets for which confidence cannot be defined.8) Then we can evaluate the value of yk for various values of k from the standard normal table. Thus the intervals (a. 2 E X A M P L E 8 . calculate the probability that p lies in a given interval simply as the area under the density function over that interval.2. a2/n) as a confidence density for p. ~et~.2.p) is known and has been tabulated. Note that (8. 1) by the right amount.wherepisunknown and a2is known. where yk for various values of k can be computed or read from the Student's t table.96 in (8. n = 10.2. . i~.-ll - ~ < k). ) .(x. .164 8 ( Interval Estimation I I 8. shown above.--~(p.cr~ 1. a2/n). with both p and a2unknown.88 < p < 6. Define - - (8.2. Its density is symmetric around 0 and approaches that of N(0. Then we have (8.12). ~ ( p .1 Therefore the interval is (5. Then the probability distribution ~-'c:=. It depends only on n and is called the Student's t distributionwith n . there is no unique function ("confidence density.12) as F I G U R E 8. For example. This is not possible in the confidence approach.12).1 degrees offeedom. We have T If N(p." so to speak) such that the area under the curve over an interval gives the confidence of the interval as defined in (8. the function obtained by eliminating the left half of the normal density N(t. We can construct a 95% confidence interval by putting t = 6.1 Construction of confidence intervals in Example 8.2 1 Confidence Intervals 165 I Therefore. We may be tempted to define N(t. a2= 0.4).2. given T = t.2.2.12) defines confidence only for intervals with the center at t. o * )i . and k = 1. 0. Define EXAMPLE 8. For.n.-l = S-'(T .04) and that the average height of ten students is observed to be 6 (in feet). 2 . Let T = be the estimator of p and let s2 = be the estimator of a2. = . the greater the confidence that p lies within the same distance from t . but this is one among infinite functions for which the area under the curve gives the confidence defined in (8. 1) as n goes to infinity.2.15) S .2. of t.14) yn = P(lt. t) and lowering the portion over (t.04. 2 . given one such function. 02/n) and doubling the right half will also serve as a confidence density. In other words. = 1 . 
we define confidence Thus the greater the probability that T lies within a certain distance from p. Suppose that the height of the Stanford male student is distributed as N(p. . 3 Suppose that X. n . . we can construct another by raising the portion of the function over (0. See Theorem 5 of the Appendix for its derivation. . s$ = (2. .2. 5 Let X. j. This example differs from the examples we have considered so far in that (8. and d? We reverse the procedure and start out with a statistic of which we know the distribution and see if we can form an interval like that in (8.18) does not follow from equation (11) of the Appendix.95 confidence interval for the true difference between the average lengths of unemployment spells for female and male workers. I The assumption that X and Y have the same where yk = ~ ( l t ~ . we get Therefore the 95% confidence interval of p is (5. c. s i . 2. but assume that u2is unknown and estimated by s'.19) yields Consider the same data on Stanford students used in Example 8.5 days. E X A M P L E 8. . as in Example 8. consider constructing a 0.85 < p < 6.166 8 1 Interval Estimation 8. we define confidence by average with a standard deviation of 2 days. variance is crucial.16). i = 1. Using it. letYi N ( p y . ny. nx. .2.3.2. given in Theorem 3 of the Appendix. as shown in Theorem 6 of the Appendix. The larger interval seems reasonable. .It is natural to use the sample variance defined by s2 = n n 1 C. then.2.2.1 for a method which can be f a. u 2 ) . 2 .15). we can define the confidence interval for px . + ~ . because otherwise (8.i = 1 .2 in (8. and proceed as follows: - Thus.5) . . L e t x i N ( p x . because in the present example we have less precise information.=l(X.2.2. j = 40. u 2 ) . b. n. N ( p . . c. We begin by observing ns2/u2 X:-l.2. and d.2. 2 inserting k = 2. Since ~ ( l t< ~ 2~ ) ( 0. which is observed to be 0.4 E X A M P L E 8 .95. = 2 ' into (8. .. used in the case of A s an application of the formula (8.2. can we calculate the probability of the event (8.3.2 1 Confidence Intervals 167 Given T = t. and that a random sample of 40 unemployment spells of male workers lasted 40 days on the where k1 and k2 are chosen so as to satisfy (8. This time we want to define a confidence interval on a2.Then. given the observed values 2.04.24) does not determine kl and k2 uniquely. and - - where we can get varying intervals by varying a. and s. and s. nx = 35. a'). ny = 40.py by rk Therefore. assume that {Xi} are independent of {Y. . Note that this interval is slightly larger than the one obtained in the previous example. we would like to define a confidence interval of the form - x)2. i = 1. .21). 2 = 42.. .2.]. b. a - y-confidence interval is < -~ k). Putting t = 6 and s = 0. .24) ui P(kl < X:-l < k2) = 7 . See Section 10.2.19). given that a random sample of 35 unemployment spells of female workers lasted 42 days on the average with a standard deviation of 2. with both p and u2 unknown. A crucial question is. given the observed value defined by - ? of s'. S = s. In practice it is customary to determine these two values so as to satisfy .21) for various values of a. 3 BAYESIAN METHOD E X A M P L E 8 . which he has before he draws any marble.4. After the posterior distribution has been obtained. 1 Suppose there is a sack containing a mixture of red marbles and white marbles.1. 8. 3 .50). defined over the parameter space. as will be shown below. Inserting n = 100.3 1 Bayesian Method 1 69 Given y.168 8 1 Interval Estimation 8. which was proved in Theorem 2. 
there are many situations where T. consider the same data given at the end of Example 8. we obtain an alternative confidence interval.2. but a shortcoming of this method is i 1 [ i Suppose he obtains three red marbles and two white marbles in five drawings. we can define a density function over the parameter space and thereby consider the probability that a parameter lies in any given interval.79.56 into (8. The fraction of the marbles that are red is known to be either p = '/s or p = %.2.2. is either normal or asymptotically normal. whereas in the Bayesian statistics the posterior distribution is obtained directly from the sample without defining any estimator. we can define estimators using the posterior distribution if we wish. even though we can also use the approximate method proposed in this example. This probability distribution. We shall subsequently show examples of the posterior distribution and how to derive it. where the true value of the parameter is likely to lie. (26. because in it we can treat a parameter as a random variable and therefore define a probability distribution for it.2. as estimator of 0. kl and kg can be computed or read from the table of chi-square distribution.2. If.2.3. the variance of T is consistently estimated by some estimator V. consider constructing a 95% confidence interval for the true variance u2 of the height of the Stanford male student. It is derived by Bayes' theorem. (8. using the estimator. 2 . see DeGroot (1970) and Zellner (1971).27) S ' A N ( U ~ u4/50).23). Assume that the height is normally distributed.2. This is accomplished by constructing confidence intervals.02.2. . kl = 74. respectively. so that his prior distribution is We have stated earlier that the goal of statistical inference is not merely to obtain an estimator but to be able to say. denoted by A. E X A M P L E 8 .26). we may define confidence approximately by where Z is N ( 0 . If the parameter space is continuous. The Bayesian expresses the subjective a priori belief about the value of p. that confidence can be defined only for a certain restricted sets of intervals. by Theorem 4 of the Appendix. Then. it is better to define confidence by the method given under the respective examples. 6 Besides the preceding five examples. If the situations of Examples 8. moreover. l ) and t and v are the observed values of T and V. Suppose he believes that p = % is as three times as likely as p = '/2. We are to guess the value of p after taking a sample of five drawings (replacing each marble drawn before drawing another).2. 45.5. which does not differ greatly from the one obtained by the more exact method in Example 8. or 8. Then the posterior distribution of P given the sample. In the Bayesian method this problem is alleviated.2. As an application of the formula (8. 3 = 36.26). called the posterior distribution.5. For more discussion of the Bayesian method. 8. and k p = 129. Note that in classical statistics an estimator is defined first and then confidence intervals are constructed using the estimator.98). embodies all the information an investigator can obtain from the sample as well as from the a priori information. As an application of the formula (8.5 actually occur.4. Estimating the asymptotic variance a4/50 by 36'/50 and using (8. 48. is calculated via Bayes' theorem as follows: .22. The two methods are thus opposite in this respect. 8. given that the sample variance computed from a random sample of 100 students gave 36 inches. 
in the form of what is called the prior distribution.23) yields the confidence interval (27.2. as is usually the case.2. = y2. In this case the Bayesian's point estimate will be p = %. as given in Table 8. the classical statistician can talk meaningfully about the probability of this event. instead? Denoting this event by B. There are four sacks containing red marbles and white marbles. This estimate is different from the maximum likelihood estimate obtained by the classical statistician under the same circumstances.39 as before. The difference occurs because the classical statistician o b tains information only from the sample. For example. It indicates a higher value of p than the Bayesian's a priori beliefs: it has yielded the posterior distribution (8.174) 8 1 Interval Estimation TABLE 8 . then his estimate would be the same as the maximum likelihood estimate. the posterior distribution now becomes . What if we drew five red marbles and two white marbles. whereas the Bayesian allows his conclusion to be influenced by a strong prior belief indicating a greater probability that p = %. In contrast. The prior probability in the previous question is purely subjective. where the expectation is calculated using the posterior distribution. in the previous question.3. One of them contains an equal number of red and white marbles and three of them contain twice as many white marbles as red marbles.2). however. Suppose we change the question slightly as follows. he would consider the loss he would incur in making a wrong decision. hence. The reader should recognize the subtle difference between this and the previous question. 1 8.3.3. whereas the corresponding probability in the present question has an objective basis. He chooses the decision for which the expected loss is the smallest. because probability to him merely represents the degree of uncertainty. what is the probability that the sack with the equal number of red and white marbles was picked? Answering this question using Bayes' theorem.2) would be sufficient. which indicates a greater likelihood that p = y2than p = '/s.1)has been modified by the sample. If three red and two white marbles are drawn. If he wanted to make a point estimate. We are to pick one of the four sacks at random and draw five marbles. In the wording of the present question.3 1 Bayesian Method 171 Loss matrix in estimation State of Nature Decision p=% p=% This calculation shows how the prior information embodied in (8. The Bayesian.1.3. therefore. (8. Given the posterior distribution (8. In the present example. is free to assign a probability to it. he chooses p = '/3 as his point estimate if For simplicity. he incurs a loss ye Thus the Bayesian regards the act of choosing a point estimate as a game played against nature.the Bayesian may or may not wish to pick either p = '/3 or p = % as his point estimate. the classical statistician must view the event ( p = %) as a statement which is either true or false and cannot assign a probability to it. there is only one sack. let us assume y.2).1). because it contains all he could possibly know about the situation. which assigns a larger probability to the event p = %.3. if he chooses p = 1/3 when p = % is in fact true. If he simply wanted to know the truth of the situation. If the Bayesian's prior distribution assigned equal probability to p = l/g and p = % instead of (8. we obtain 0. Since this is a repeatable event. 
the event ( p = %) means the event that we pick the sack that contains the equal number of red marbles and white marbles. 2.1 and compare the Bayesian posterior probability with the confidence obtained there.7634).2.1.7). We shall now consider the general problem of choosing a point estimate of a parameter 0 given its posterior density. Equating the derivative of (8.3.9) we obtained the 95% confidence interval as (0. Suppose we a priori think any value of p between 0 and 1 is equally likely. say.3. n! m! Note that the expectation is taken with respect to 0 in the above equation.3.7) Suppose the observed value of X is k and we want to derive the posterior density of p.the Bayesian can evaluate the probability that p falls into any given interval.3.7.7) In this case the Bayesian would also pick p = % as his estimate. Using the result of Section 3.3.13) with respect to 8 to 0 .8) for nonnegative integers ( l + n + m ) ! nandm. we obtain .3. This situation can be expressed by the prior density From (8.2366 < p < 0. This problem is a generalization of the game against nature considered earlier. Then the Bayesian chooses 8 so as to minimize - ( n + I ) ! pk(l . In Example 8. assuming y. Let 8 be the estimate and assume that the loss of making a wrong estimate is given by (8. In (8.3. f l ( 0 ) . we can write Bayes' formula in this example as These calculations show that the Bayesian inference based on the uniform prior density leads to results similar to those obtained in classical inference. Therefore we have Loss = (8 . because the information contained in the sample has now dominated his a priori information. We shall assume that n = 10 and k = 5 in the model of Example 8.8) we can calculate the 80% confidence interval We have from (8. we assumed that p could take only two values. It is more realistic to assume that p can take any real number between 0 and 1.12) where the denominator is the marginal probability that X = k. p ) .3.2 Let X be distributed as B ( n . for the purpose of illustration.We have from (8. the conditional density of p given X = k. since 0 is the random variable and 6 is the control variable.k)! where the second equality above follows from the identity (8. that is. E X A M P L E 8.p)n-k k!(n .3 I Bayesian Method 173 Using (8. = y2 as before.o ) ~ .172 8 1 Interval Estimation 8.2. 2 and found to have a smaller mean squared error than the maximum likelihood estimator k / n over a relatively wide range of the parameter value.3. Note that if the prior density is uniform..3.3 1 Bayesian Method 175 Note that in obtaining (8. k2). Let x. was proposed by Birnbaum (1962). For example. where a2 is assumed known.3. an intermediate step between classical and Bayesian principles. . Let us apply the result (8. where cl is chosen to satisfy JLf ( p ( x) d t ~= . . . Although the classical statistician uses an intuitive word. Therefore we finally obtain E X A M P L E 8 .5). . It gives a more reasonable estimate of p than the maximum likelihood estimator when n is small. "likelihood. the Bayes estimator under the squared error loss). 3 . if a head comes up in a single toss (k = 1. because it can give her an estimator which may prove to be desirable by her own standard. that is. As this example shows. Using the formula (8. whereas the latter chooses its average." she is not willing to make full use of its implication. as in (8. As n approaches infinity. .2. from the Bayesian point of view. 
for ignoring the shape of the posterior density except for the location of its maximum.We shall write the exponent part successively as . Then the posterior density of p given x is by the Bayes rule This is exactly the estimator Z that was defined in Example 7.7). The likelihood principle.1 74 8 1 Interval Estimation 8. both estimators converge to the true value of p in probability. u2). x2. we obtain I 1: Suppose the prior density of p is N ( p O . Then the likelihood function of p given the vector x = (xl.3. p = 1. however. as long as the estimator is judged to be a good one by the standard of classical statistics. 2. n = I). more precisely.3. In this case the difference between the maximum likelihood estimator and the Bayes estimator can be characterized by saying that the former chooses the maximum of the posterior density. . the Bayes estimator is the expected value of 0 where the expectation is taken with respect to its posterior distribution.13). Nothing prevents the classical statistician from using an estimator derived following the Bayesian principle.14) we have assumed that it is permissible to differentiate the integrand in (8. be the observed value of X.7). 1.3. the posterior density is proportional to the likelihood function. the Bayesian method is sometimes a useful tool of analysis even for the classical statistician. as we can see from (8. Classical statistics may therefore be criticized. the Bayesian estimate p = % seems more reasonable than the maximum likelihood estimate.3. ) be independent and identically distributed as N ( p .) is given by We call this the Bayes estimator (or. x.15) to our example by putting 0 = p and f l(p) = f ( p 1 X = k) in (8. i = 1. .8) again. . 3 Let { X . In words. n. 23) is what we mentioned as one possible confidence density in Example 8.(Cf. is not always possible. might unduly influence statistical inference. Then (8.4. we have (8.5 < 0 < 9.22) becomes - - 1 (log 0.3. The classical school.3.3 1 Bayesian Method 177 Therefore we have f (0) = = 5 for 9.21) by calculating One weakness of Bayesian statistics is the possibility that a prior distribution. Example 7.3.2. in fact.N ( p .3. if parameters 0 and p are related by 0 = p-'. A. the prior distribution approaches that which represents total prior ignorance. where cg = ( Therefore we conclude that the posterior distribution of p given x is I calculate the posterior density of 0 given that an observation of X is 10. we could have obtained a result identical to (8. n-lo2).24).24) f (3 1 p) = exp G u [- n -( i .3.2. We might think that a uniform density over the whole parameter domain is the right prior that represents total ignorance. This.1. Assuming that the prior density of 0 is given by 8.3. Example 7.3.3. a uniform prior over p. 0 otherwise.2 and 8.2. but this is not necessarily so. EXAMPLE Let X be uniformly distributed over the interval (0.0) Note that (8. was developed by R.3.22). This weakness could be eliminated if statisticians could agree upon a reasonable prior distribution which represents total prior ignorance (such as the one considered in Examples 8.) Since we have . Note that the right-hand side of (8.5 . The probability calculated by (8.5). 10.12) whenever the latter is defined. however. For example. as it is the optimally weighted average of 2 and PO. Fisher and his followers in an effort to establish statistics as an objective science. for 1 < p < 2 .4 =0 otherwise.22) depends on the sample x only through 3. implies a nonuniform prior over 0: . 
We have E ( p I x) in the above formula is suggestive. This result is a consequence of the fact that 2 is a sufficient statistic for the estimation of p. which is the product of a researcher's subjective judgment.) As we let A approach infinity in (8.1 76 8 1 Interval Estimation 8. f ( p ) = l .7.3.3.3.8)(10.3) in every case.2. I / * ) ( / U X ) in order for f ( p X) to be a density. (Cf.23) coincides with the confidence given by (8.p)2] 2u2 Using (8. i = 1. 5. construct an 80% confidence interval on the difference of the mean rates of recovery of the two groups ( p l . Obtain an 80% confidence interval for 0 assuming 2 = 10. 2.5 and P ( p = 5/4) = 0. Let the prior probabilities of p be given by P ( p = %) = 0. How many times should you toss the coin in order that the 95% confidence interval for p is less than or equal to 0.p2).2.7. .2) If 50 students in an econometrics class took on the average 35 minutes to solve an exam problem with a variance of 10 minutes.3) Suppose the density of X is given by /I f(x1 0) = 1/0 for 0 =0 5 x 5 0. Comparison of Bayesian and classical schools Classical school Use confidence intervals as substitute. even by classical standard. Answer using both exact and asymptotic methods. assuming the height is normally distributed. Note: Asterisk indicates school's advantage. Bayesian school *Can make exact inference using posterior distribution. . Exercises 179 f(0) = K 2 . for 1/2 < 0 = O otherwise.2) The heights (in feet) of five randomly chosen male Stanford students were 6. 0 5 p I 1.2) Suppose X.2 4. (Section 8. e2).5. of a particular coin. . N(0. "Objective inference.. 6. 5. - Table 8. otherwise. Find a 90% confidence interval for the mean height.- 5. 100.. (Section 8. Calculate the posterior probabilities Pl(p) = P ( p ) X1 = 1) and P2(P) = P(p I X1 = 1. and two heads appear in two tosses. Assuming that 60 patients of Group 1 and 50 patients of Group 2 recovered. 6. and 6.- 178 8 1 Interval Estimation 1 < 1.2 summarizes the advantages and disadvantages of Bayesian school vis-2-vis classical statistics. If sample size is large. *No need to calculate complicated integrals. *Can use good estimator such as sample mean without assuming any distribution. Also calculate P ( p I Xp = 1) using Pl (p) as the prior probabilities. p). 3. X2 = 1). construct a 90% confidence interval for the true standard deviation of the time it takes students to solve the given problem. p.p).3) - -A -- . what is her estimate of p? - EXERCISES 1. *No need to obtain distribution of estimator. (Section 8. (Section 8.8. If her prior density is f (p) = 6p(l . *Bayes estimator is good. (Section 8.- - A - -- - A Bayesian is to estimate the probability of a head. and the prior density of 0 is given by I * f(0) = 1/0' for 0 r 1. TAB L E 8.-. maximum likelihood estimator is just as good. and no drug was given to another group of 100 patients (Group 2). Compare it with P2(p). otherwise. (Section 8. (Section 8. Bayes inference may be robust against misspecification of distribution. Use prior distribution that represents total ignorance.2.2) A particular drug was given to a group of 100 patients (Group I ) . . (Section 8. .3.1.3) Let XI and X2 be independent and let each be B(1.5 in length? 8.2) Suppose you have a coin for which the probability of a head is the unknown parameter p. =0 . find the Bayes estimate of 0. (Section 8. where we assume 0 r 0 5 2.3) The density of X.3) Let ( X i ]be i. assuming that the observed value of X is 2. lei. 11.1) for 0 < x 5 1.d. p. 
change the loss function to L(e) = Obtain the Bayes estimate of p. find the Bayes estimator of p. Obtain the Bayes estimate of p. In the experiment of tossing the coin until a head appears.3) Let the density function of X be given by f(x) = =0 otherwise. If your prior probability distribution of the probability of a head. f (0) = 0. and f 2 ( . 15. given X = 1. Derive the maxiwhere f mum likelihood estimator of A based on one observation on X.1)/(0 .$1.i. Assuming the prior densityf (0) = 60(1 . otherwise.3) We have a coin for which the probability of a head is p. (a) Find the maximum likelihood estimator of 0. (b) Assuming the prior density of 0 is f (0) = 0-2 for 0 2 1. given an unknown parameter A E [O. Assuming the uniform prior density. (Section 8. (Section 8.p. ) are known density functions. is given by P ( p = %) = 1/3.3) Suppose the density of X is given by f (x I A) = Af1(x) + (1 .1]. 9.3) B(1. derive the Bayes estimator of 0 based on a single observation of X. and P ( p = 4/5) = 1/3 and your loss function is given by Ip . is given by where e = p . (a) Find the maximum likelihood estimator of 0 and obtain its exact mean squared error. 11.3) Suppose that a head comes u p in one toss of a coin. (Section 8.A)fz(x). we observe that a head appears in the kth toss.) by 10.5 for 0 5 0 =0 2. (Section 8. (b) Find the Bayes estimator of 0 using the uniform prior density of 0 given by 2x/0 for 0 5 x 5 0. = 2(x . (Section 8. p) and the prior density of p is given by f (p) = 1 for Let X 0 5 p 5 1. i = 1 and 2. Obtain its exact mean squared error. what is your estimate p? What if your prior density of p is given by f (p) = 1 for 0 5 p 5 12 and define {Y. derive the Bayes estimator of X based on one observation on X. given X = 1. We do not observe { X i ) . Assuming the prior density of A is uniform over the interval [0. with the density I Exercises 181 Obtain the Bayes estimate of 0. (Section 8.180 8 1 Interval Estimation 13. 16. 14.0). P(p = %) = l/g. 12. ) is given by - Suppose we observe {Yi).3) In the preceding exercise. and find Y 1 = 1 and Y p = 0. (Section 8. Suppose we want to estimate 0 on the basis of one observation on X . where 0 < 0 < 1. Suppose the loss function L ( . . and the alternative hypothesis. compared with some other set. and the hypothesis that the mean of a normally distributed sample is equal to a certain value is an example of the second. For example. The set R is called the region o f rejection or the critical region of the test. In practice. . and we accept Ha (and therefore reject H.3. 1 . Sections 9. The purpose of estimation is to consider the whole parameter space and guess what values of the parameter are more likely than others. it is called composite. Xn). In hypothesis testing we pay special attention to a particular set of values of the parameter space and decide if that set is likely or not. denoted Ho. a test of a hypothesis is often based on the value of a real function of the sample (a statistic). the assumption that p = % in the binomial distribution is a simple hypothesis and the assumption that p > '/2 is a composite hypothesis.1 . D E F l N IT10 N 9. The hypothesis that a sample follows the normal distribution rather than some other distribution is an example of the first. . A statistic which is used to test a hypothesis is called a test statistic. A hypothesis may be either simple or composite. In Chapter '7 we called a statistic used to estimate a parameter an estimator. In the general discussion that follows. 9. 2 . 
Throughout this chapter we shall deal with tests of hypotheses of the second kind only. true.2 and 9. D E F l N ITION 9 . Otherwise. .2 TYPE I AND TYPE II ERRORS 9. ndimensional Euclidean space. In hypothesis tests we choose between two competing hypotheses: the null hypothesis.4 and 9. Most textbooks. we shall treat a critical region as a subset of E. however. Specifjrlng the mean of a normal distribution is a composite hypothesis if its variance is unspecified. the complement of R in En. A Type 1 when H I is true). the most interesting case is testing a composite hypothesis against a composite hypothesis. Thus the question of hypothesis testing mathematically concerns how we determine the critical region.2 1 Type I and Type I1 Errors 183 9 TESTS OF HYPOTHESES As we shall show in Section 9. denoted HI. A Type I error is the error of re~ecting Ha when it is 1error is the error of accepting Howhen it is false (that is. . In this regard it is useful to define the following two types of error.) if X E R. Thus X is an n-variate random variable taking values in En. If T(X) is such a statistic. devote the greatest amount of space to the study of the simple against simple case.1 INTRODUCTION There are two kinds of hypotheses: one concerns the form of a probability distribution.3 we shall assume that both the null and the alternative hypotheses are simple.X2. Then a test of the hypothesis Hamathematically means determining a subset R of En such that we reject Ho (and therefore accept H I ) if X E R. We make the decision on the basis of the sample (XI. because the event T(X) E R can always be regarded as defining a subset of the space of X.. In Sections 9. - The question of how to determine the critical region ideally should depend on the cost of making a wrong decision. There are two reasons: one is that we can learn about a more complicated realistic case by studying a simpler case. 1 A hypothesis is called simple if it specifies the values of all the parameters of a probability distribution.5 will deal with the case where one or both of the two competing hypotheses may be composite. the critical region is a subset R of the real line such that we re~ect H a if T(X) E R. the other is that the classical theory of hypothesis testing is woefully inadequate for the realistic case. and the other concerns the parameters of a probability distribution when its form is known. denoted simply as X.9. 2 plots the characteristics of the eight tests on the a.2. Definition 9. the probabilities of for R .3) The following examples will illustrate the relationship between a and as well as the notion of admissible tests.P I ) and (a2. this much is certain: given two tests with the same value of a. and R2 are ( a l . if Type I error is much more costly than Type I1 error.6 ) a 2 and 6 3 = SP. and she decides in advance that she will choose R1 if a toss of the coin yields a head and R2 otherwise. Even if we do not know the relative costs of the two types of error. We denote the probability of Type I error by a and that of Type I1 error by P. respectively. Construct all possible nonrandomized tests for this problem and calculate the values of a and p for each test. P a = 6a1 + ( 1 . Figure 9. P plane. R1 and R P . Otherwise it is called admissible. In Section 9. 3 A test is called inadmissible if there exists another test which is better in the sense of Definition 9.1 Relationship between a and P Pq) be the characteristics of two DEFlN I T I O N 9 . Such a test is called a randomized test. 
we must consider the relative costs of the two types of errors. denoted as ( a .184 9 1 Hypotheses 9. 1 Let X be distributed as B(2. p). therefore.3 we shall show how the Bayesian statistician determines the best test by explicit consideration of the costs. DE F I N IT10 N 9 .2. 2 . The first test is better (or more powerful) than the second test if al 5 a 2 and PI 5 Pp with a strict inequality holding for at least one of the 5.1 describes the characteristics of all the nonrandomized tests.1. Therefore we can write mathematically 9. should ideally be devised by considering the relative costs of the two types of error. because a consideration of the costs tends to bring in a subjective element. 2 . are given by (9. the two types of error for the randomized test. Sometimes it is useful to consider a test which chooses two critical regions. The remaining tests that we need to consider are termed admissible tests. Classical statisticians usually fail to do this.P). p ) the characteristics of the test.2 1 Type I and Type II Errors 185 sider only the critical regions of the form R = { x I x > c]. we should choose the one with the smaller value of P.8)P2. or the so-called loss function. We want to use a test for which both a and / Making a small tends to make P large and vice versa. Thus we define FIG u R E 9.2. + ( 1 . tests. and > The probability of Type I error is also called the size of a test.6 respectively.with probabilities 6 and 1 . 3 are as small as possible. however. If the probabilities of the two types of error P2). Such a test can be performed if a researcher has a coin whose probability of a head is 6 .2.2. and suppose we are to test Ho: p = % against HI: p = 3/4 on the basis of one observation on X.2 is useful to the extent that we can eliminate from consideration any test which is "worse" than another test. 2 Let (al. as illustrated in Figure 9. Ifwe cannot determine that one test is better than another by Definition The probabilities of the two types of error are crucial in the choice of a critical region. Table 9. In the figure the densities of X under the null and the alternative hypotheses are f ( x 1 H o ) and f ( x I H I ) . We call the values of ( a . where 6 is chosen a priori. A n optimal test. 2 . we should devise a test so as to make a small even though it would imply a large value for P. P I ) and (ap. a and P are represented by the areas of the shaded regions. say.respectively. Any point on the line segments connecting (1)-(4)-(7)-(8) except the end points themselves represents the characteristics of an admissible ran- . For example. If we con- E X A M P L E 9 .2. E X A M P L E 9. D E F I N I T I O N 9.2.2. if X = 1. We are to test Ho: 0 = 0 against H1: 0 = 1 on the basis of a single observation on X.2 1 Type I and Type II Errors 187 T A B L E 9. In stating these definitions we identify a test with a critical region. but the definitions apply to a randomized test as well. The most powerful randomized test of size S/s is S/4 . Represent graphically the characteristics of all the admissible tests.3. =1+0-x domized test.2 FIG u R E 9. such that a(R.3. and.) Risthemostpowerfultestofleuelaifa(R) s o l a n d f o r any test R1 of level a (that is. otherwise.4 R i s the m o s t p o w e r f u ~ t e s t o f s i z e a i f a ( R= ) a and for any test R1 of size a . (R) 5 (R1).2.2. Tests (2). Although test (6) is not dominated by any other nonrandomized test.1.1 where all the possible tests are enumerated. size and level. choosing HI =0 for 0 5 x 5 0 + 1 .2. 
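The enumeration asked for in Example 9.2.1 is easy to reproduce numerically. The following sketch (Python with SciPy, an assumption of this illustration) lists every subset of {0, 1, 2} as a candidate critical region and computes alpha = P(X in R | p = 1/2) and beta = P(X not in R | p = 3/4), the quantities tabulated in Table 9.1.

# All nonrandomized tests for X ~ B(2, p), H0: p = 1/2 versus H1: p = 3/4.
# A test is a critical region R, a subset of {0, 1, 2}; alpha and beta are the
# probabilities of Type I and Type II error for that R.
from itertools import combinations
from scipy.stats import binom

values = [0, 1, 2]
pmf0 = binom.pmf(values, 2, 0.5)    # distribution of X under H0
pmf1 = binom.pmf(values, 2, 0.75)   # distribution of X under H1

for size in range(len(values) + 1):
    for R in combinations(values, size):
        alpha = sum(pmf0[x] for x in R)
        beta_err = 1.0 - sum(pmf1[x] for x in R)
        print(R, round(alpha, 4), round(beta_err, 4))

Plotting these (alpha, beta) pairs reproduces the pattern of Figure 9.2, from which the admissible tests can be read off.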
In Definition 9.2.5 We shall illustrate the two terms using Example 9. (It may not be unique. we do not need to use the word &el.1 5 x < 8. the reader should carefully distinguish two terms.) 5 a ) . D E F I N I T I O N 9.2.4) f (x) = 1 . denoted by fo(x) and fl(x). The densities of X under the two hypotheses. Note that if we are allowed randomization. Intuitively it is obvious that the critical region of an admissible nonrandomized test is a half-line of the form [t. (7). w) where 0 5 t 5 1.0 +x for 0 . In Figure 9. . and (5) are all dominated by (4) in the sense of Definition 9.1 TWO types of errors in a binomial example Test R R a = P(RI Ho) P= I HI) if X = 2. (4) + '/. For example. The most powerful nonrandomized test of level S/s is (4). Such a randomized test can be performed by choosing Hoif X = 0. We can state: The most powerful test of size '/4 is (4). flipping a coin and choosing H o if it is a head and H I otherwise. it is inadmissible because it is dominated by some randomized tests based on (4) and (7). the randomized test that chooses the critical regions of tests (4) and (7) with the equal probability of '/2 has the characteristics a = % and p = 1/4 and therefore dominates (6). (3).2 we defined the more powerful of two tests. it is natural to talk about the most powerful test.2. It is clear that the set of tests whose characteristics lie on the line segments constitutes the set of all the admissible tests.186 9 1 Hypotheses 9.2 LetX have the density Two types of errors in a binomial example (9. When we consider a specific problem such as Example 9. P(R) 5 P(Rl). a is represented by the area of the .2. are graphed in Figure 9. In the two definitions that follow. 3.3 1 Neyman-Pearson Lemma 189 F lG U R E 9 . we incur a loss y2. 3 Densities under two hypotheses lightly shaded triangle and algebraically.2. &. For her it is a matter of choosing between Ho and H I given the posten'orprobabilities P ( H oI x ) and P ( H l I x) where x is the observed value of X. Assuming that the Bayesian chooses the decision for which the expected loss is smaller.P plane is a continuous. We first consider how the Bayesian would solve the problem of hypothesis testing.2.1 Reject Ho if y l p ( H o1 x) < y2P(Hl I x ) .5) yields and is stated in the lemma that bears their names.3. Suppose the loss of making a wrong decision is as given in Table 9. . Because of the convexity of the curve. her solution is given by the rule (9. the Bayesian problem may be formulated as that of determining a critical region R in the domain of X so as to (9.188 9 1 Hypotheses 9. no randomized test can be admissible in this situation.6) is graphed in Figure 9. monotonically decreasing.3 NEYMAN-PEARSON LEMMA In this section we study the Bayesian strategy of choosing an optimal test among all the admissible tests and a practical method which enables us to find a best test of a given size.1) Equation (9. The set of admissible characteristics plotted on the a.3) Minimize $ ( R ) = ylP(HoI X E R) P(X E R) + yzP(H1I X E R)P(x E R). For example. FIG u R E 9 . which we state without proof. her critical region. A Bayesian interpretation of the Neyman-Pearson lemma will be pedagogically useful here. 4 A set of admissible characteristics Eliminating t from (9. convex function which starts at a point within [O. I ] on the ol axis. by the area of the darker triangle. if we choose H o when H I is in fact true.2.2. A more general result concerning the set of admissible characteristics is given in the following theorem. In other words. 
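For Example 9.2.2 the characteristics of the half-line critical regions [t, infinity), 0 <= t <= 1, can be traced numerically. The sketch below (Python with SciPy, an assumption of this illustration) integrates the two densities, assuming, as the garbled display (9.2.4) appears to state, the triangular density 1 - |x - theta| on [theta - 1, theta + 1]; the resulting curve of alpha(t) against beta(t) is the convex set of admissible characteristics discussed next.

# alpha(t) and beta(t) for the critical region [t, infinity) in Example 9.2.2,
# testing H0: theta = 0 against H1: theta = 1 with the triangular density.
import numpy as np
from scipy.integrate import quad

f0 = lambda x: 1.0 - abs(x)          # density under H0 (theta = 0)
f1 = lambda x: 1.0 - abs(x - 1.0)    # density under H1 (theta = 1)

for t in np.linspace(0.0, 1.0, 6):
    alpha, _ = quad(f0, t, 1.0)      # P(X >= t | H0), area of the upper triangle
    beta_err, _ = quad(f1, 0.0, t)   # P(X <  t | H1)
    print(round(t, 1), round(alpha, 3), round(beta_err, 3))   # (1-t)^2/2 and t^2/2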
Every point on the curve represents the characteristics of an admissible nonrandomized test. where the expectation is taken with respect to the posterior distribution. The latter is due to Neyman and Pearson Alternatively. Therefore.4.1] on the p axis and ends at a point within [0. is given by 9. T H EoREM 9. 3.3. By virtue of Theorem 9. The best he can do.1. Multiply both sides of the inequality in (9. is to obtain the set of admissible tests. P ( H o ) and P ( H l ) are the pior where q o = ylP(Ho).3. ) .) .7) cannot be carried out.3.3.3.5) 444) = y1P(HoI Rl n Ro)P(R1 n Ro) + YlP(H0 l Rl n RO)P(Rl n 4) n Ro)P(R. ) P ( H .hence he does not wish to specify the ratio q o / q l .2) is indeed the solution of (9.190 T A B L E 9. Compare the terms on the right-hand side of (9.3).5). 1 ( N e y m a n . without which the minimization of (9. therefore. If the curve is differentiable as in Figure 9. 0 may be a vector.4. respectively. touches the line that lies closest to the origin among all the straight lines with the slope equal to -qo/ql. probabilities for the two hypotheses. When the minimand is written in the provided that such c exists. against H I : 0 = 01. Let R1 be some other set in the domain of X. f i l fl6). such as those drawn in Figures 9. n 4) + y2P(Hl ( R. (Here.2 and 9.3. The first and fourth terms are identical. the point of the characteristics of the Bayesian optimal test is the point of tangency between the curve of admissible characteristics and the straight line with slope -qo/ql The classical statistician does not wish to specify the losses yl and y2 or the prior probabilities P ( H o )and P ( H . Let L ( x ) be thejoint density or probability of X depending on whether X is continuous or discrete.5).3. ) .4) are smaller than the third and the second terms of (9. the best critical region of size a is given by We can rewrite + ( R ) as where L is the likelihood function and c (the critical value) is determined so as to satisfy = Y ~ P ( H and ~ ) .4) with those on the right-hand side of (9. This attitude of the classical statistician is analogous to that of the economist who obtains the Pareto optimality condition without specifying the weights on two people's utilities in the social welfare function. which shows the convexity of the curve of admissible characteristics. This fact is the basis of the Neyman-Pearson lemma.p plane.3. The second and the third terms of (9. because of the definition of & given in (9.i = 0.4. n & ) ~ ( .l x ) L ( x ) with L ( x I H . as well as in the following analysis. Therefore we have + Y2PW1 I Rl form of (9. Then we have and (9.2 9 1 Hypotheses 9.P e a r s o n lemma) In testing Ho: 8 = 8.2) by L ( x ) and replace P(H.2.2). the above analysis implies that every admissible test is the Bayesian optimal test corresponding to some value of the ratio q o / q l . 3 .3. 1. Then the Bayesian optimal test & can be written as Thus we have proved THEOREM 9 . it becomes clear that the Bayesian optimal test Ro is determined at the point where the curve of the admissible characteristics on the a.3.7).3 ) Neyman-Pearson Lemma 191 LOSS matrix in hypothesis testing State of Nature Decision' Ho H1 Yn 0 Ho H1 o Y1 We shall show that & as defined by (9. 3. who should determine it based on subjective evaluation of the relative costs of the two types of error.3. Therefore the best critical region of size a is side of (9.2).9). 3 . The best critical region for testing Ho: p = po against H. 
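The Bayesian decision rule (9.3.1), reject H0 when gamma1 * P(H0 | x) < gamma2 * P(H1 | x), can be made concrete with a small sketch (Python with SciPy; the function name and the particular losses and prior probabilities are this illustration's assumptions, not the text's). It uses the binomial setting of Example 9.2.1 and shows that the rule amounts to a likelihood-ratio test with critical value gamma1 * P(H0) / (gamma2 * P(H1)).

# Bayes-optimal test of H0: p = 1/2 against H1: p = 3/4 for X ~ B(2, p).
# Reject H0 when gamma1 * P(H0 | x) < gamma2 * P(H1 | x), equivalently when
# L(x | H1) / L(x | H0) > gamma1 * P(H0) / (gamma2 * P(H1)).
from scipy.stats import binom

def bayes_critical_region(gamma1, gamma2, prior0, prior1, n=2, p0=0.5, p1=0.75):
    region = []
    for x in range(n + 1):
        L0 = binom.pmf(x, n, p0)
        L1 = binom.pmf(x, n, p1)
        post0 = L0 * prior0 / (L0 * prior0 + L1 * prior1)
        post1 = 1.0 - post0
        if gamma1 * post0 < gamma2 * post1:   # expected loss of rejecting H0 is smaller
            region.append(x)
    return region

# Equal priors and equal losses: reject H0 whenever the likelihood ratio exceeds 1.
print(bayes_critical_region(gamma1=1.0, gamma2=1.0, prior0=0.5, prior1=0.5))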
we should do well if we used the best available estimator of a parameter as a test statistic to test a hypothesis concerning the parameter.1. we (9.i= 1.2.15) Proof.. 2 LetXibedistributedasN(p. E X A M P L E 9 .7). We shall consider a few examples of application of Theorem 9. the inequality in (9. If p1 < po.16) ( p l . (4).3. If pl < Po.3. 1 Let X be distributed as B(n.3. EXAMPLE 9 . Intuition. is.the inequality in (9.14) is reversed.01.10) for a = %. and (8). . Then the term inside the parentheses on the left-hand is positive.12) "(l pa . Let xibe the observed value of Xi.15) and collecting terms. it is not possible to have a(R)5 a(Ro) and P(R) 5 P(&) with a strict inequality in at least one.3. and there is no c that satisfies (9. however.05 or 0.: p = p. the better the test becomes.3.17) f > PO.&).12) and collecting terms. from (9.17)is reversed. (7).3. This result is also consistent with our intuition. 3 > c for some c. Given a test statistic. Common sense tells us that the better the estimator we use as a test statistic. PO)"-" Taking the logarithm of both sides of (9. (9. does not always work.3.13) defined by (9.3. The Bayes test is admissible.3. by (9. Therefore.where a2is assumed known.1. There is a tendency.such a statistic is called a test statistic. where d is determined by P(X > d I H. 3 . . however. p) and let x be its observed value. There are often situations where a univariate statistic is used to test a hypothesis about a parameter.3.3.2 9.3. CI The Neyman-Pearson test is admissible because it is a Bayes test. for the classical statistician automatically to choose a = 0. T H Eo R E M 9 . from (9.3. Taking the logarithm of both sides of (9. in Example 9. The result is consistent with our intuition. it is often possible to find a reasonable critical region on an intuitive ground. In both examples the critical region is reduced to a subset of the domain of a univariate statistic (which in both cases is a sufficient statistic).A small value is often selected because of the classical statistician's reluctance to abandon the null hypothesis until the evidence of the sample becomes overwhelming..1 the Neyman-Pearson test consists of (I). The choice of a is in principle left to the researcher.9). Therefore.the best critical region of size a is of the form > d. as the following counterexample shows.3 1 Neyman-Pearson Lemma 193 The last clause in the theorem is necessary because. > c for some c.3. for example. even in situations where the Neyman-Pearson lemma does not indicate the best test of a given size a.n.3. The best critical region for testing Ho: p = po against H I : p = p1 is. Let & be as defined in (9.po) i= 1 xi > o log c + 2 ( p l . where d is determined by P(X > d I Ho) = a. we obtain n (9.2. 2 " 2 Therefore if (9.14) Let the density of X be given by x > d. 3 .P1)n-x - get Suppose p1 > Po.a2). As stated in Section 9. Then.) = a.192 9 1 Hypotheses EXAMPLE 9.. In this regard it is useful to consider the concept of the power function. is identified in Figure 9.4 SIMPLE AGAINST COMPOSITE D E F I N I T I O N 9 .4 1 Simple against Composite F I G u R E 9.4 defined the concept of the most powerful test of size ct in the case of testing a simple against a simple I Using the idea of the power function. In the figureVitis drawn assuming 0. denoted R.3. LetQ1(0) a n d a ( 0 ) be the power functions oftwo tests respectively. 
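For the normal-mean example above (Example 9.3.2) the critical value d is easy to compute, because the sample mean is N(mu0, sigma^2/n) under H0. The sketch below (Python with SciPy; the numerical values of mu0, mu1, sigma, n, and alpha are illustrative assumptions) computes d and the resulting probability of Type II error.

# Best (Neyman-Pearson) test of H0: mu = mu0 against H1: mu = mu1 > mu0
# based on the sample mean of n observations from N(mu, sigma^2), sigma^2 known:
# reject H0 when xbar > d, with d chosen so that P(Xbar > d | H0) = alpha.
import numpy as np
from scipy.stats import norm

mu0, mu1, sigma, n, alpha = 0.0, 1.0, 2.0, 25, 0.05   # illustrative values
se = sigma / np.sqrt(n)

d = mu0 + norm.ppf(1.0 - alpha) * se        # critical value
beta_err = norm.cdf((d - mu1) / se)         # P(Xbar <= d | mu = mu1)
print(round(d, 3), round(beta_err, 4))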
The Find the Neyrnan-Pearson test of Ho: 0 = 0 against HI: 0 = densities under Hoand H1 are shown in Figure 9.194 9 1 Hypotheses 9.6 The Neyman-Pearson critical region in a counterintuitive case hypothesis. . 1 If the distribution of the sample X depends on a vector of parameters 0. If 8 1 consists of a single element. because here the P value (the probability of accepting Ho when H1 is true) is not uniquely determined if 8 1 contains more than one element. The Neyman-Pearson critical region.4. = 1.2 - el (9. Now we shall turn to the case where the null hypothesis is simple and the alternative hypothesis is composite. We have I F 1 G U R E 9.2) Q1(8) 2 &(0) for all 0 E 8. we can rank tests of a simple null hypothesis against a composite alternative hypothesis by the following definition.4. 4 . 9.1. In the present case we need to modify it.19) changes with O1.5 Densities in a counterintuitive case f > 0. Definition 9.1.5. We can mathematically express the present case as testing Ho: 0 = O0 against HI: 0 E 81. we define the power function of the test based on the critical region R by We have so far considered only situations in which both the null and the alternative hypotheses are simple in the sense of Definition 9. where 8 1 is a subset of the parameter space. The shape of the function (9. Then we say that the first test is uniformly better (or unvormly more powerful) than the second test in testing Ho: 0 = Oo against H1: 0 E if Q1(OO)= &(OO) and D E F I N I T I O N 9.2.6. it is reduced to the simple hypothesis considered in the previous sections. In some of them the test is UMP. we have I 0 ) for any 0 E 0.p). against HI: 0 E if P(R I 0. B y (9.) = ( 5 )a.4.: 0 = 1 against H I : 0 > 1 on the basis of one observation on X. .5) L(00) < c. P(R I 0 ) 2 P(Rl where c is chosen to satisfy P(A < c I H o ) = a for a certain specified value of a. The so-called likelihood ratio test.L(x.) element equal to 01.: 0 = 0. is what we earlier called a. the UMP test of a given size a may not always exist. 4 . 4 Let L(x 1 0 ) be the likelihood function and let the where null and alternative hypotheses be H.3) Q I ( 0 ) > @ ( 0 ) for at least one 0 E Note that Q(0.4. F Ic u R E 9 .4 1 Simple against Composite 197 and (9.2.5. Note that we have 0 5 A 5 1 because the subset of the parameter space within which the supremum is taken contains 8. This time we shall state it for size and indicate the necessary modification for level in parentheses.: p = against H I :p > Po. the Neyman-Pearson lemma provides a practical way to find the most powerful test of a given size a. . We are to test H.7.4.4. Its graph is shown in Figure 9. even when it does not.6) = p a . where the alternative hypothesis is composite. against H I is defined by the critical region = 0 otherwise. p).196 9 1 Hypotheses 9.-'.1) we have el. E X A M P L E 9. Then the likelihood ratio test of H . which means that H o is accepted for any value of a less than 1.4.) = (5) a and for any other test R1 such that P(R11 8. and H 1 : 0 E is a subset of the parameter space 0. standing for supremum. 7 The following is an example of the power function. usually gives (9.. R ------sup L(0) . Therefore the critical region of the likelihood ratio test is given by E X A M PL E 9.and if el consists of a single p. however. clearly A = 1.2.we have Q(O1) = 1 - el. means the least upper bound and is equal to the maximum if the latter exists. p) = C:px(l .2 In the case where both the null and the alternative hypotheses are simple. 
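Continuing the normal-mean test of the previous sketch, the power function Q(mu) = P(Xbar > d | mu) of Definition 9.4.1 can be traced over a grid of parameter values. This is a sketch under the same illustrative assumptions (Python with SciPy); note that Q(mu0) reproduces the size alpha.

# Power function Q(mu) = P(Xbar > d | mu) of the one-sided normal-mean test.
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 2.0, 25, 0.05
se = sigma / np.sqrt(n)
d = mu0 + norm.ppf(1.0 - alpha) * se

for mu in np.linspace(-0.5, 1.5, 9):
    power = norm.sf((d - mu) / se)          # 1 - Phi((d - mu)/se)
    print(round(mu, 2), round(power, 3))    # equals alpha at mu = mu0, rises with mu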
If x / n > Po.4 and 9.PO)"-" < c for a certain c. maxpZp.: 0 = 0. p) is attained at p = x / n . In the present case.4. D E F I N IT I o N 9 .1 Power function LetXhave thedensity the UMP test if a UMP test exists. Below we give several examples of the likelihood ratio test. DEFINITION 9 . a ) . which may be thought of as a generalization of the Neyman-Pearson test. Let X be distributed as B ( n . 4 . Obtain and draw the power function of the test based on the critical region R = [0. If x / n 5 po. el (9. We are to test H. The following is a generalization of Definitions 9. given the observation X = x. the likelihood ratio test is known to have good asymptotic properties. but in others it is not.75. The likelihood function is L(x. 3 A test R is the uniformly most powerful (UMP) test of size (level) a for testing H. Sup.. 4.198 9 1 Hypotheses Taking the logarithm of both sides of (9.2. Since it can be shown by differentiation that the left-hand side of (9. The assumptions are the same as those of Example 9.2 the 1 . i = 1 . If x 5 0.7) I 9. .a confidence interval of y is defined by 1 2< d. therefore.3.6) and divid'ig by -n.4.where a2is assumed known.4 1 Simple against Composite 199 Therefore.3 P(I~- > d I HO) = a. n.12) are called twetail tests.4. the likelihood ratio test in this case is characterized by the critical region (9. So suppose 2 > po. Therefore the likelihood ratio test is to reject Ho if x > d.4. log c > --.4.12).4.2.8) does not depend on the value of p.4. where d is chosen appropriately.8) and (9.10). Then the denominator in (9. Tests such as (9.3. .3.p)2 = Z(xi . be the observed value of X. it is equivalent to (9. .6 We are to test No: 0 = 1/2 against H I :0 Suppose Xhasauniformdensityover [0. Therefore. In a two-tail test such as (9. . (Note that c need not be determined. Therefore assume x > 0. .. From Example 8. because it is not a Neyman-Pearson test against a specific value of P.4.4.4. Then the denominator of h attains a maximum at y = f . whereas tests such as (9.(1 . we accept Ho. E X A M P L E 9 .~ ( p . 4 . 4 12 > d. since f > yo. We are to test Ho: y = po against H I : p > po The likelihood ratio test is to reject Ho if If n 5 yo.11) are called one-tail tests. Then the numerator of A is equal to 1/[2(1 + x)'] and the denominator is equal to 1/2.8) under the null hypothesis (approximately) equal to a.1) and because the test defined by (9.a)' i n (~f)'.4. which was obtained in Example 9. Let x. 5 Consider the model of Example 9. . attaining its maximum at p = f. t log 1 + (1 . we obtain (9. c r ' ) . then h = 1. but this time without the further constraint that f > PO. 1/2 on the basis of one observation + . where d is determined by X where d should be determined so as to make the probability of event (9.t log po . That the UMP test does not exist in this case can be seen more readily by noting that the Neyman-Pearson test in this example depends on a particular value of 0. 4 . this test is UMP.8) For the same reason as in the previous example.6].7) is an increasing function of t whenever t > Po. we have AMPLE 9. 2 .t) . Therefore we again obtain (9.3 and test Ho: 0 = 0 against H I : 0 > 0 on the basis of one observation x. This test cannot be UMP. Therefore. then A = 1 because we can write C(x.) This test is UMP because it is the Neyrnan-Pearson test against any specific value of p > Po (see Example 9. Ho should be rejected if and only if yo lies outside the confidence interval.3 except that H I : y f yo. 
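The likelihood ratio test of H0: p = p0 against H1: p > p0 derived above rejects when x > d, with d determined from the binomial tail probability. Because X is discrete, a prescribed size can usually be attained only approximately, or exactly only with randomization as discussed in Section 9.2. The sketch below (Python with SciPy; the values of n, p0, and alpha are illustrative assumptions) finds the smallest d whose exact size does not exceed alpha.

# One-tail binomial test: reject H0: p = p0 in favor of H1: p > p0 when x > d.
# Choose the smallest d with P(X > d | p0) <= alpha and report the exact size.
from scipy.stats import binom

n, p0, alpha = 20, 0.5, 0.05       # illustrative values
d = 0
while binom.sf(d, n, p0) > alpha:  # binom.sf(d, n, p0) = P(X > d | p = p0)
    d += 1
print(d, binom.sf(d, n, p0))       # critical value and exact size of the test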
This test is not UMP because it is not a Neyrnan-Pearson test.4.4. n where we have put t = x/n. L e t t h e s a m p l e b e X .12) we could perform the same test using a confidence interval.3. where d is determined by P(8 > d 1 Ho) = a. .&) f > d. EXAMPLE9. as discussed in Section 8. so we accept Ho.t ) log (1 .4. Therefore the critical region is EXAMPLE 9 .4.O < 6 5 1.4.9) is maximized with respect to the freely varying P.4.t ) log (1 . where d is the same as in (9. 5. 5 . and show that it is UMP. To show that this is UMP. Since the numerator of the likelihood ratio attains its maximum at p = Po. This implies that c = %. if there are r parameters and Ha specifies the values of all of them. draw its power function. where €lo and el are subsets of the parameter space 8. DEFlN I T I O N 9. 4 . the exact probability of A < c can be either calculated exactly or read from appropriate tables. however. In all the examples of the likelihood ratio test considered thus far. We conclude that the critical region is [O.. Henceforth suppose x / n > po.A is the same as in (9. THEOREM 9 . Next we must determine d so as to satisfy EXAMPLE 9 . Then we have For the present situation we define the likelihood ratio test as follows. (For example. In such a case the following theorem is useful.8 x. this situation is the most realistic. 1 I Let A be the likelihood ratio test statistic defined in (9.8. Consider the same model as in Example 9.5). The following are examples of the likelihood ratio test. 11. therefore. 11 should be part of any reasonable critical region.200 9 1 Hypotheses 9. Its power function is depicted as a solid curve in Figure 9. 0. 11.5.4.251 U [0. Derive the likelihood ratio test of size %. r0. Therefore the critical region is again given by (9. Here we define the concept of the UMP test as follows: Power function of a likelihood ratio test A test R is the uniformly most powerful test of size P(R I 0) = (I a ) and for any other test R1 such that (level) a if supeEeo supeEeo P(Rl 1 0) = ( I a) we have P(R 1 0) 2 P(RI 1 0) for any 0 E el.251 is removed from the critical region and the portion B is added in such a way that the size remains the same. This completes the proof. accept Ho. Then. Next assume that x E [O. 1 Fl G UR E 9. D E F I N IT I o N 9 .5. where c should satisfy P(2X < c I Ha) = %. Then part of the power function shifts downward to the broken curve.5. Suppose the portion A of [O.5.2 Let L(x 0) be the likelihood function.5 Composite against Composite 1 201 A H.) 9. First.8). 5 . 0.. 11 should be part of the critical region. note that A = 0 for x E [0.4.2. As noted earlier. 0. There are cases. 1 where c is chosen to satisfy supeoP(A < c 0) = a for a certain specified value of a. -2 log A is asymptotically distributed as chi-square with the degrees of freedo~n equal to the number of exact restrictions implied by .5). where P(A < c) cannot be easily evaluated.4.4. first note that [0. A = 1.5 COMPOSITE AGAINST COMPOSITE In this section we consider testing a composite null hypothesis against a composite alternative hypothesis. Then the likelihood ratio test of Ha against H I is defined by the critical region I Therefore we reject Ho if 2x < c. Let the null and alternative hypotheses be Ha: 0 E 8 0 and H I : 0 E el. I f x / n 5 p. because this portion does not affect the size and can only increase the power. the degrees of freedom are r. but here test Ha:p 5 Po against H I :p > Po. therefore.6). 1 degrees of freedom.7) (62/&2)-n'2< c for some c. the critical region (9. 
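Theorem 9.4.1 can be checked by simulation. The sketch below (Python with NumPy and SciPy, an assumption of this illustration) simulates the likelihood ratio statistic for the simple null H0: p = p0 in a binomial model, where H0 imposes one restriction, and compares the empirical 95th percentile of -2 log(lambda) with the chi-square critical value with 1 degree of freedom.

# Monte Carlo check that -2 log(lambda) is approximately chi-square with
# 1 degree of freedom when H0: p = p0 is true (one restriction).
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p0, reps = 200, 0.3, 20_000
x = rng.binomial(n, p0, size=reps)
phat = x / n

# -2 log lambda = 2 [ x log(phat/p0) + (n - x) log((1 - phat)/(1 - p0)) ];
# xlogy handles the boundary cases x = 0 and x = n correctly (0 log 0 = 0).
lr = 2.0 * (xlogy(x, phat / p0) + xlogy(n - x, (1.0 - phat) / (1.0 - p0)))

print(np.quantile(lr, 0.95), chi2.ppf(0.95, df=1))   # both should be near 3.84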
we have P ( R I I p) 5 P ( R I p) for all p > po by the result of Example 9. To see this.5 Composite against Composite 1 203 But since P ( X / n > d I p) can be shown to be a monomnicaily increasing function of p.4. In Section 9. . Here. If x/n 5 po.5. is reduced to which can be equivalently written as Recall that (9.l ~ r = l (~R i ) Therefore ~ ~ the critical region is (9.8). = y2 for 0 r OO. that L1( 0 ) and & ( O ) are simple step functions defined by (9. let R be the test defined above and let R1 be some other test such that I p) 5 a.4. in addition to the problem of not being able to evaluate q. i= 1..5. Let us first see how the Bayesian would solve the problem of testing Ho: 0 5 O0 against H I : 0 > OO. ./ql.But since R is the U M P test of Ho: p = la.5. Then we have L1(0) = 0 = yl for 0 > 0. .(x. and we shall see that the classical theory of hypothesis testing becomes more problematic. and L1(8) by choosing H1. accept Ho. We are to test Ho: p. for 0 5 O0 and (9.O and 0 < u2 < m. Note that since P ( R I H o ) is uniquely determined in this example in spite of the composite null hypothesis. This test is UMP.6) where sup L(0) = 00u0. u2)by 0. &&ore.l)-'Z. N(p.2 where e2is the unbiased estimator of u2 defined by ( n .2."=.10) where b2= n-'Z~=l(x. Then it follows that P ( R .~ 1 . u" with unknown u2.-- 202 9 1 Hypotheses 9. for simplicity. we have Therefore the value of d is also the same as in Example 9.5. against H I : p > Po. we have - Theref ore where f (0 ( x) is the posterior density of 0.5.5. 02). . 2. therefore (9. (~T)-~'~($)-"$X~ [. A = 1. k can be computed or read from the appropriate table. = po and 0 < u2< m against H I : p. .8) is distributed as Student's t with n . In this case the same test can be performed using the confidence interval defined in Example 8. In this case the losses are as given in Table 9. 1 Po) 5 a.]: .3 we gave a Bayesian interpretation of the classical method for the case of testing a simple null hypothesis against a simple alternative hypothesis. Then the Bayesian rejects H o if .5. > IJ. Henceforth suppose x / n > Po. Let h ( 0 ) be the loss incurred by choosing Ho.. the . Suppose. Here we shall do the same for the composite against composite case.2. n.9).3.8) If the alternative hypothesis specifies p should be modified by putting the absolute value sign around the lefthand side.3. . as can be seen in (9.11) is the basis for interpreting the Neyman-Pearson test.2. # po. E X A M P L E 9.5.2. Denoting ( p . Let the sample be X . e2= n .z ) ~ Since the left-hand side of (9. there is no need to compute the supremum. which establishes the result that the left-hand side of (9.: p = % and H I : p > %.however.5. To see this. Note that in this decision problem. (9.5. put f ( p x) = f(p [ y).13) is an increw ing function in x.P ) ~ . The likelihood ratio test is essentially equivalent to rejecting No if 1.7).9).3. We toss this coin ten times and a head comes up eight times.5.15) such that p* # 0 or (9.13) is equivalent to (9.5.13).17) j ( \>(p)f(p 1 y)dp > j>(p)f(p 1 W P .5.6 EXAMPLES OF HYPOTHESIS TESTS Now cross only once.12) may not be a good substitute for the left-hand side of (9.* . .5.5. Let p be the probability of a cure when the drug is administered to a patient. suppose y > x. assuming for simplicity that f ( p I x) is derived from a uniform prior density: that is.11).4 204 1 9 1 Hypotheses 9. hypothesis testing on the parameter p is not explicitly considered. 9.13) is essentially the same kind as (9.6.nondecreasing in p.18) x > c. 
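The Student's t statistic described above is easy to compute directly. The following sketch (Python with NumPy and SciPy; the data vector is an illustrative assumption) forms sqrt(n)(xbar - mu0)/sigma-hat with the unbiased variance estimator and reports the one-sided p-value from the t distribution with n - 1 degrees of freedom.

# One-sided t test of H0: mu <= mu0 against H1: mu > mu0 with unknown variance.
import numpy as np
from scipy.stats import t

def one_sided_t_test(x, mu0):
    x = np.asarray(x, dtype=float)
    n = len(x)
    s2 = np.var(x, ddof=1)                      # unbiased estimator of sigma^2
    t_stat = np.sqrt(n) * (np.mean(x) - mu0) / np.sqrt(s2)
    p_value = t.sf(t_stat, df=n - 1)            # P(t_{n-1} > t_stat)
    return t_stat, p_value

x = [10.3, 9.8, 10.9, 10.4, 10.1, 9.9, 10.6, 10.2]   # illustrative data
print(one_sided_t_test(x, mu0=10.0))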
Therefore (9.13) more explicitly as an inequality concerning x. The classical statistician facing this decision problem will. she would have to engage in a rather intricate thought process in order to let her Po and a reflect the utility consideration. Next we try to express (9.5. we should approve the drug if (9.whereas the right-hand side is smaller than U ( p * ) .f 205 classical statistician faces the additional problem of not being able to make sense of L(x [ H I ) and L(x I Ho).16) is equivalent to A problem here is that the left-hand side of (9.6.15) is 0 if p = 1 and is monotonically increasing as p decreases to 0.5. The decision rule (9. If p f 0 or 1.1) H. Let p* be the solution to (9. Then j1 U(P)h(P)dp> l ~ ( p ) k ( p ) d' p h(P)dP Tk(p)dp P* 0 because the left-hand side is greater than U ( p * ) . In this section we shall apply it to various practical problems.6 1 Examples of Hypothesis Tests and k(p) = f(p 1 x) .14) where c is determined by (9. size)?What if the significance level is lo%? From the wording of the question we know we must put I The left-hand side of (9.1 ( m e a n o f b i n o m i a l ) It is expected that a particular coin is biased in such a way that a head is more probable than a tail.5. where f ( p I x) is the posterior density of p given x. and assume that the net benefit to society of approving the drug can be represented by a function U ( p ) .5.18).5.5.f we have (9. Her decision rule is of the same form as (9. Should we conclude that the coin is biased at the 5% significance level (more precisely.If the classical statistician were to approximate the Bayesian decision.5. For example. consider the problem of deciding whether or not we should approve a certain drug on the basis of observing x cures in n independent trials.5. except possibly at p = 0 or 1. first. from (8. and define h(p) = f (p I y ) . Then f ( p I x) and f(p I y) f In the preceding sections we have studied the theory of hypothesis testing.But (9. E X A M P L E 9. Sometimes a statistical decision problem we face in practice need not and/or cannot be phrased as the problem of testing a hypothesis on a parameter. paraphrase the problem into that of testing hypothesis Ho: p 2 po versus H I : p < po for a certain constant po and then use the likelihood ratio test. except that she will determine c so as to conform to a preassigned size a.5. According to the Bayesian principle.16) p* ( p 1 x) ( p I y). this equality can be written as ( p I x) = ( n + l)Czpx(l . v a r i a n c e k n o w n ) Therefore we conclude that Ho should be accepted at the 5% significance level but rejected at 10%. the critical region should be chosen as (9.05 or a = 0. Then we should calculate the pvalue (9. in this particular question there is no value of c which exactly satisfies (9.6.2. where c is determined by P(L? > c [ Ho) = a. Note that.1.4 ( d i f f e r e n c e o f means o f n o r m a l .6. as before. as the test statistic.6) Since L? x > c.6.8 against H. rather than determining the critical region for a given size.4 foot.N(5.7) we conclude that H o should be accepted if a is 5% and rejected if it is 10%." a two-tail test would be indicated.6.8. This decision can sometimes be difficult.3) for a given value of a.6. In fact. the number of heads in ten tosses.6. If the sample average of 10 students yields 6 . "but the direction of bias is a priori unknown.8) > c. In this kind of question there is no need to determine c by solving (9.8) a ( x.2.2. In such a case we must provide our own. 0. Suppose theheight ' ) . 
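For the coin-tossing example above, with eight heads in ten tosses, the relevant tail probability is P(X >= 8) when X ~ B(10, 1/2). A short sketch (Python with SciPy, an assumption of this illustration) computes this p-value; it comes to 56/1024, about 0.055, which is why the null hypothesis is accepted at the 5% level but rejected at 10%.

# Exact p-value for observing 8 or more heads in 10 tosses of a fair coin:
# P(X >= 8 | p = 1/2) with X ~ B(10, 1/2).
from scipy.stats import binom

p_value = binom.sf(7, 10, 0.5)     # P(X > 7) = P(X >= 8)
print(p_value)                     # about 0.0547: accept H0 at 5%, reject at 10%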
Another caveat: Sometimes a problem may not specify the size. (9. For example.9 1 Hypotheses 9.6.7) where Z is N ( 0 .3. we know that the best test of a given size a should use as the test statistic and its critical region should be given by x where c (the critical value) should be chosen to satisfy (9. Assume the ' is unknown and we same model as in Example 9.a EXAMPLE 9 . 6 . 2 ( m e a n o f n o r m a l .58) = 0.8) = t 9 . to say "Hoshould be accepted if a < 0. Instead we should calculate the probability that we will obtain the values of X greater than or equal to the observed value under the null hypothesis. except that now a have the unbiased estimator of variance 6' = 0.5. what if the italicized phrase were removed from Example 9. v a r i a n c e k n o w n ) Suppose that in 1970 the average height of 25 Stanford male students was 6 feet with a standard deviation of 0. we know that we should use X B(10. We have under Ho From (9.2. wherec is determined by P (t9 > c) = or. 1 ) . v a r i a n c e u n k n o w n ) . should we accept Ho at the 5% significance level? What if the significance level is lo%? From Example 9.6) and then checking if the observed value f falls into the region is equivalent to calculating the pvalue P ( x > f I Ho) and then checking if it is smaller than a.6.055 and rejected if a > 0. where a is the prescribed size. and the critical region should be of the form - to be 0.3) for either a = 0.016) under Ho. 6 Therefore. = 6. EXAMPLE 9. called the p-value: that is. EXAMPLE 9 . .05 and rejected if a = 0.6. we have P ( g > 6 ) = P ( Z > 1.6 1 Examples of Hypothesis Tests 207 From Example 9. From (9.5.1? Then the matter becomes somewhat ambiguous. where a ' is known of the Stanford male student is distributed as N ( F .: p. we were to add. If.16.055. We must determine whether to use a one-tail test or a two-tail test from the wording of the problem. 6 . instead of the italicized phrase. while in 1990 the average .4. p). which would imply a different conclusion from the previous one. 3 ( m e a n o f n o r m a l . however. determining the critical region by (9.16.9) We have m(x 6 - 5.6.6. It is perfectly appropriate.1. by Example 9. We are to test Ho: = 5.6.4) we conclude that Ho should be accepted if a = 0. Student's t with 9 degrees of freedom.0571." This is another reason why it is wise to calculate the pvalue. 18) above.4 except that now we shall not assume that the sample standard deviation is equal to the population standard deviation. we shall assume a .3 foot. EXAMPLE 9 . However. ) / ~ O .6.)/300 and = (2. and 46 of 200 women favored it. )a/ n Assuming the normality and independence of XI and Y.6.097 and rejected if u > 0.6. = 1 if the ith man favors the proposition and = 0 otherwise. By Theorem 6 of the Appendix we have under Ho x Inserting nx = 30. The competing hypotheses are (9.6.. We have Therefore we conclude that Ho should be accepted if a < 0. If we define 7 = (C:!: Y. But since px = f i under Ho. we have asymptotically This example is the same as Example 9. Once we formulate the problem mathematically as (9. Using the latter method. Y = 6. ny = 25. respectively.3. Is there a real difference of opinion between men and women on this proposition? Define Y .15) Ho:px - fi = 0 and H I :px . we have where we have assumed the independence of X and Y. EXAMPLE 9 .021 and . Let Y . Similarly define X . The only difference is that in the present example the variance of the test statistic X .4 into (9.6. 
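The calculations of Examples 9.6.2 and 9.6.3 can be reproduced as follows (Python with SciPy, an assumption of this illustration). The same statistic, sqrt(10)(6 - 5.8)/0.4, roughly 1.58, is referred to the standard normal distribution when sigma = 0.4 is known and to the Student's t distribution with 9 degrees of freedom when sigma^2 is replaced by the unbiased estimate 0.16; the t p-value is necessarily larger, so the conclusion is more conservative in the second case.

# Test H0: mu = 5.8 against H1: mu > 5.8 with n = 10 and sample mean 6,
# first with sigma = 0.4 known, then with the estimated variance 0.16.
import numpy as np
from scipy.stats import norm, t

n, xbar, mu0 = 10, 6.0, 5.8
stat = np.sqrt(n) * (xbar - mu0) / 0.4      # about 1.58 in both cases

print(norm.sf(stat))    # known-variance p-value, roughly 0.057
print(t.sf(stat, 9))    # Student's t with 9 degrees of freedom, roughly 0.074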
5 ( d i f f e r e n c e s o f means o f b i n o m i a l ) In ap0ll51 of 300 men favored a certain proposition.6 1 Examples of Hypothesis Tests 209 height of 30 students was 6.2.4. and Sy = 0. be the height of the ith student in the 1970 sample and 2d 5x = ( Z : ! ~ X . v a r i a n c e u n k n o w n ) we conclude that Ho should be accepted if a < 0. we realize that this example is actually of the same type as Example 9. 6 ( d i f f e r e n c e o f means o f n o r m a l . 6 .6. we can get a better estimate of the common value by pooling the two samples to obtain (46 + 51)/(200 + 300). Since we have under Ho We have chosen H I as in (9. and X .6.208 9 1 Hypotheses 9.2.12).02 and rejected if a > 0. =a : . = 1 if the ith woman favors the proposition and = 0 otherwise. Since we have under Ho we conclude that Ho should be accepted if a < 0. = 1 ) . Should we conclude that the mean height of Stanford male students increased in this period? Assume that the sample standard deviation is equal to the population standard deviation.097. we calculate the observed value of the Student's t variable to be 2. = 6.12) Ho: px - py = 0 and H I : px - py > 0. Define 7 = ( ~ : 2 ~ ~ . Define py = P(Yl = 1 ) and px = P ( X . One way is to estimate px by 46/200 and fi by 51/300. Therefore this example essentially belongs to the same category as Example 9.12) because it is believed that the height of young American males has been increasing during these years.2 with a standard deviation of 0.6. we have under Ho - hypotheses can be expressed as (9.11) and (9. the 1990 sample. Sx = 0.:: X.6. 6 .fi # 0.I? under Ho should be estimated in a special way.)/200.02.077. we have.7 1 Testing about a Vector Parameter 21 1 rejected if a > 0.6. That would amount to the test: (9.0 ~ 0 ) ~c > 9. . F I C U R E 9.where 0 is a K-dimensional vector of parameters.9. We consider the problem of testing Ho:0 = O0 against H1:0 00. An undesirable feature of this choice can be demonstrated as follows: Suppose ~6~ is much larger than ~ 6 2 Then .6.6 into the left-hand side of (9.136. Therefore.810)2 + ( 6 2 . for the latter could be a result of the large variability of 6 1 rather than the falseness of the null hypothesis.7.0)'.6. In using the Student's t test in Example 9. .136 and is a rejected if a > 0.1 Variance-Covariance Matrix Assumed Known Since a two-tail test is appropriate here (that is. a . we need to assume a$ = a . In this particular example the use of the Student's t statistic has not changed the result of Example 9.6. The results of this chapter will not be needed to understand Chapter 10. ) . : u Section 9. It is intuitively reasonable that an optimal critical region should be outside some enclosure containing eO.010 l2 + (61 .7.02012 a2 2 'c. (Throughout this section a matrix is denoted by a boldface capital letter and a vector by a boldface lower-case letter. it is wise to test this hypothesis.021. ) where .6.7 TESTING ABOUT A VECTOR PARAMETER Those who are not familiar with matrix analysis should study Chapter 11 before reading this section.559.6. 02)'and O0 = (010. 2 = ~ (6 0 ) ( 8 . we have E X A M P L E 9.6.21). By Theorem 3 of the Appendix. as depicted in Figure 9. Insofar as possible. + Consider the case of K = 2. This weakness is alleviated by the following strategy: (9. Inserting the same numerical values as in Example 9.22) yields the value 0. where cis chosen so as to make the probability of Type I error equal to a given value a. and in Section 9.4 very much.7.Olol.) In + - for some c. a : . 
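The pooled-proportion calculation for the poll example above (the difference of binomial means) is reproduced below (Python with SciPy, an assumption of this illustration). Under H0: px = py the common proportion is estimated by pooling the two samples, (46 + 51)/(200 + 300), the standardized difference is referred to the standard normal distribution, and the two-tail p-value comes to about 0.097, matching the cut-off quoted in the text.

# Test H0: px = py against H1: px != py, where 51 of 300 men and 46 of 200
# women favored the proposition.  The variance of the difference is estimated
# with the pooled proportion, as appropriate under H0.
import numpy as np
from scipy.stats import norm

x_m, n_m = 51, 300          # men
x_w, n_w = 46, 200          # women
p_m, p_w = x_m / n_m, x_w / n_w
p_pool = (x_m + x_w) / (n_m + n_w)

se = np.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_m + 1.0 / n_w))
z = (p_w - p_m) / se
p_value = 2.0 * norm.sf(abs(z))         # two-tail test
print(round(z, 3), round(p_value, 4))   # z about 1.66, p-value about 0.097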
the alternative hypothesis .21) - nys. 4 - Xny-1. under the null hypothesis : a =. a should be more cause for rejecting No than an large value of 162 equally large value of 10.1 we consider the case where Z is completely known.2 the case where 2 is known only u p to a scalar multiple. we conclude that Ho should be accepted if a < 0. We are to use the test statistic 0 N(0.7. I . 2 is a K X K variance-covariance matrix: that is.210 9 1 Hypotheses 0 2 9. But we have 9. We can write 0 = (el.7 (difference of variances) i and (9.2) Reject Ho if (61 .6. OZ0)'. What should be the specific shape of the enclosure? An obvious first choice would be a circle with O0 at its center.1) Reject Ho if (61 .9 2 Critical region for testing about two parameters Applying Definition 3 of the Appendix and (9.7. we shall illustrate our results in the tw~dimensional case. 3 .1).7.4) is the fact that 022 - - 4 4 1 under the null hypothesis. a12 = ~ o v ( 682). and p3. to (9.1) if a : =a .7.) I 1i (0 .4). Reject Ho if (A0 .1).7. We should not be completely satisfied by this solution either. we obtain A-'"(x . THE 0 R E M 9 . That is. the optimal test should be defined by (9.7. A s an illustration of the above.(0) @ But the maximand in the denominator clearly attains its unique &m at 0 = 0.5) is reduced to (9.7) Test Ho:p. An additional justification of the test (9. consider a three-sided die (assume that such a die exists) which yields numbers 1. because the fact that this critical region does not depend on the covariance.7.5.4) is provided by the fact that it is a likelihood ratio test.3) can be written as (9. which implies Z-' = A'A. ~ . To see this.~ /(x ' . By this transformation the original testing problem can be paraphrased as against A0 # ABO using A0 N(AOO.7.~ .@ ) l ~ . = A-'(Ar)-'.) > c. where A is a positive definite matrix.2) represents the region outside an ellipse with O0 at its center. using -h - Proof.p ) ' ~ .p) X : . Then (x .versus HI:not Ho.A B ~ ) ' ( Suppose x is an n-vector distributed as N ( p . note that by (5. Therefore.5. not necessarily diagonal nor identity.Following (11. and 3 with respective p r o b abilities pl.2) if u12= 0 and. H'AH (9. the inequality in (9. But AXA' = I implies I. where <g / I 3 a .*:. We are to test the hypothesis that the die is not loaded versus the hypothesis that it is loaded on the basis of n independent rolls.7 1 Testing about a Vector Parameter 213 ! where a : = vil and a : = ~ 8 Geometrically.= 0.7. by our premise.4.define ~ . that I Then by Theorem 11.5. 2. 1 (9. I).l ( x . (x . By Definition 1 of the Appendix.7. 7 . is a positive definite matrix. . Then.1) and (9. that is.i * where A-"' is the diagonal matrix obtained by taking the (-1/2)th power of each diagonal element of A.4) Reject Ho if (0 .5) Reject Ho if Therefore. p2.p) 0 - .1 we can find a matrix A such that AZA' = I.' / ~ A . so that c can be computed to conform to a specified value of a. We shall now proceed on the intuitively reasonable premise that if a: = and 01.p) ' A . ~ .1 (~ 0.212 9 1 Hypotheses 9.Suppose .7. 1 (9.4.1 / 2= H ~ . = p. Let H be the orthogonal matrix which diagonalizes A. = A.1 / 2 ~ ~ 1 In the twodimensional case. = p.7. further. Another attractive feature of the test (9.7.3) A AO0) ~ > c.This result is a consequence of the following important theorem. (9. where A is the diagonal matrix of the characteristic roots of A (see Theorem 11.O ~ ) ' Z . Thus. = .5). A). Note that (9. we should n= exp 0 [-+(6 - % ) 1 ~ .I) as the test testing Ho:A0 = AOO statistic. 
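The quadratic-form test described above, reject H0 when (theta-hat - theta0)' Sigma^{-1} (theta-hat - theta0) exceeds the appropriate chi-square critical value, is easy to package as a function. The sketch below (Python with NumPy and SciPy; the function name and the numerical values are assumptions of this sketch) applies it with K = 2 and a known, non-diagonal covariance matrix.

# Test of H0: theta = theta0 when theta-hat ~ N(theta, Sigma) with Sigma known:
# reject H0 if the quadratic form exceeds the chi-square critical value with K
# degrees of freedom, K being the dimension of theta.
import numpy as np
from scipy.stats import chi2

def chi_square_wald(theta_hat, theta0, Sigma, alpha=0.05):
    d = np.asarray(theta_hat, dtype=float) - np.asarray(theta0, dtype=float)
    stat = float(d @ np.linalg.solve(np.asarray(Sigma, dtype=float), d))
    crit = chi2.ppf(1.0 - alpha, df=len(d))
    return stat, crit, stat > crit

Sigma = [[0.04, 0.01], [0.01, 0.09]]            # illustrative known covariance matrix
print(chi_square_wald([0.3, -0.2], [0.0, 0.0], Sigma))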
we can easily show that A-'/~A~ . suggests its deficiency.'(~ 0.1 / 2= 1 and ~-'/2~-1/2 = A-1.p ) N ( 0 .7. elongated horizontally. 7. (9.8) Test Ho:pl The left-hand side of the above inequality is asymptotically distributed as under the null hypothesis.7. we can write (9. Since (9./n and defining c equivalently as (9. but that would not be entirely satisfactory. In order to make use of Theorem 9. under the null hypothesis. the test (9. as we shall show below.5) to the original problem (9. a reasonable test would be to (9. the null hypothesis can be stated as 1 . where n. which can be approximately determined from the standard normal table because of the asymptotic normality of pl and In Figure 9.7.7) is obvious: an outcome such as j1 = 0 and j2 = 1. we decide to test the hypothesis that the expected value of the outcome of the roll is consistent with that of an unloaded die. namely. Therefore (9.11) Reject H O FIG u R E 9.7 1 Testing about a Vector Parameter 215 (9.7. Now we apply the test (9.7). (9.7. We have.10 Critical region for testing the mean of a three-sided die If we should be constrained to use any of the univariate testing methods expounded in the preceding sections.7. we would somehow have to reduce the problem to one with a single parameter.9) Reject Ho if 1 1 .10 the critical region of the test (9.7. is the number of times j appears in n rolls.pl . Noting that j .4.7.7. will lead to an acceptance by this test. which is extremely unlikely under the original null hypothesis. = n. + jl log pl + log j2+ j3 log j3) > c.p2 Since p = 0.7.7. A weakness of the test (9.9) is outside the parallel dashed lines and inside the triangle that defines the total feasible region.13) A.214 9 1 Hypotheses 9.11) is not identical with the likelihood ratio test. Next we derive the likelihood ratio test and compare it with the generalized Wald test.7.10) holds only asymptotically. Suppose.7. By Definition 9.14) 2n(log 3 -2 log d.13) -2 log A = g = 1 .5) becomes (97.4 the likelihood ratio test of the p r o b lem (9.p2.9) as a solution of the original testing problem > -2 log d.7. for example.7. In such a case.2pl .1 we transform the above inequality to (9.4. 3p3 f 2. If we define j l and $2 as the relative frequencies of 1 and 2 in n rolls. .11) is called the generalized Wald test.2j1 .7) is Xg - + 2p2 + 3p3 = 2 versus HI: p1 + 2p2 3.$21 >c 2 (n log 3 + nl log jl + n2log j2+ n3 log p3) = for some c. In this case it seems intuitively reasonable to reject H o if 1= (9. 7 . on the basis of nx and ny independent observations on X and Y. we obtain by Definition 3 of the (0 . Assuming the availability of such W may seem arbitrary. use the Taylor expansion - .17) Xi in (9.7 1 Testing about a Vector Parameter 21 7 Wald Test Likelihood Ratio Test One solution is presented below. We assume that the common variance u2is unknown.7.61. We first note (9.14).1 1 describes the acceptance region of the two tests for the case of n = 50 and c = 6. Suppose that 8 = n ~ ' ~ and ~ E 2 =~ ~ . 1 Suppose X pendent of each other. ~ ( p ~ .7. so that we can determine c so as to conform to a given size a? c2 . :L~ We Y .7.7.Pxo .1 1 The 5% acceptance Mans of the generalized Wald test and the likelihood ratio test Therefore.~ ~X . respectively.Xn.2 Variance-Covariance Matrix Known u p t o a Scalar Multiple and because of Theorems 1 and 3 of the Appendix.For what kind of estimator where can we compute the distribution of the statistic above. have. sometimes a good solution if 2 = a 2 ~ where . 
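For the three-sided die, the generalized Wald test of H0: p1 = p2 = p3 = 1/3 uses the quadratic form in (p1-hat - 1/3, p2-hat - 1/3) with the covariance matrix of the relative frequencies evaluated under H0. The sketch below (Python with NumPy and SciPy; the counts are illustrative, not taken from the text) computes the statistic and compares it with the chi-square critical value with 2 degrees of freedom, which at the 5% level is about 6.

# Generalized Wald test that a three-sided die is unloaded (p1 = p2 = p3 = 1/3).
# Under H0, Var(phat_j) = (1/3)(2/3)/n and Cov(phat_1, phat_2) = -(1/9)/n.
import numpy as np
from scipy.stats import chi2

counts = np.array([22, 18, 10])        # illustrative outcome of n = 50 rolls
n = counts.sum()
phat = counts / n

d = phat[:2] - 1.0 / 3.0               # the first two frequencies determine the third
V0 = np.array([[2.0 / 9.0, -1.0 / 9.0],
               [-1.0 / 9.0, 2.0 / 9.0]]) / n
stat = float(d @ np.linalg.solve(V0, d))

print(round(stat, 3), round(chi2.ppf(0.95, df=2), 3))   # reject H0 if stat exceeds about 5.99

With the covariance matrix evaluated under H0, this quadratic form coincides with Pearson's chi-square statistic, n times the sum of (phat_j - 1/3)^2 / (1/3).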
is some reasonable estimator of u2.+n. Q is a known positive definite matrix and u2 is an unknown scalar parameter.'( ~Oo)/K W/M . If we are given a statistic W such that which is independent of Appendix (9. There is no optimal solution to our problem if Z is completely unknown. Figure 9. u4 9. F l G U R E 9.14). defining 6 ' = W / M will enable us to determine c appropriately. u2)are inde- Px .N ( p y . a and ') Y E X A M P L E 9 . from Definition 1 and Theorem 1 of the Appendix. by Definition 3 of the Appendix. M). so we shall give a couple of examples.F ( K .05. Note that ~ ( ~ > 22 6) 0. . however.7. We are to test He: To show the approximate equality of the left-hand side of (9.19) 1 I= I u 2 2 .[ p r o ] versus HI: [ z j * [g] and apply it to the three similar terms within the parentheses of the left-hand side of (9.7.7.O ~ ) ' Q .6).11) and (9.1 216 9 1 Hypotheses 9.7. There is.-2. Therefore. 61. X1 = x . Define hl = px . and 2 be the sample averages based : be on nx.7. Find a critical region of a = 0. we have under Ho. X2 = p~ .2) Suppose that X has the following probability distribution: X = 1 with probability 0 where where 0 5 9 5 1/3. E X A M P L E 9 . respectively. by Theorems 1 and 3 of the Appendix.1. respectively. we are to test the hypothesis Ho:0 = 2 against HI: 0 = 3 by means of a single observed value of X. Let y.7. b e c a w of Theorem 9. and S the sample variances based on nx. Assuming the prior probabilities P(Ho) = P(HI) = 0. is distributed as N(p. and n~ observations. 16). we have Decision where e is Euler's e ( = 2.u2). lets.3. Suppose the loss matrix is given by - True state Since the chi-square variables in (9. (b) Find the most powerful nonrandomized test of size 0. = 2 against HI: p = 3 on the basis of four independent observations on X.5. . s.25 on the basis of one observation on X. and Z. Y N(py. . a2).3) Let X N(p.4. which we can determine from (9. and nz observations. and Z N(pz.SQ-l k.71 . ny. We want to test Ho: y But. respectively.2.218 9 1 Hypotheses EXERCISES I Exercises 219 We should reject Ho if the above statistic is larger than a certain value. (Section 9. ). Then we have 1.. We are to test Ho: px = py = pz versus HI: not H o on the basis of nx.7. derive the Bayesian optimal critical region. 4). Similarly. 0 < x < 0.2 against H I : 8 = 0. Y. nr.y.py. 4. and we want to test Ho: p = 25 against HI: p = 30. Is the region unique? If not.7.. (c) Find the most powerful randomized test of size 0. (Section 9.5 which minimizes p and compute the value of p. . define the class of such regions. and 0 elsewhere.23) are independent.] ^- 3. and ^X2 = . and nz independent observations on X. ny.20) to conform to a preassigned size of the test. - - - a.2) Given the density f(x) = 1/0. (Section 9. u2 Xn . 7 . (a) List all the nonrandomized admissible tests. find the Bayesian optimal critical region.3) An estimator T of a parameter p. Assuming that the prior probabilities of Hoand H1 are equal and the costs of the Type I and I1 errors are equal. u2) are mutually independent. 2 Suppose that X N(px.pz. We are to test Ho: 0 = 0. (Section 9. 2.22) and (9. define the Neyman-Pearson test. Y).times. and 6 occurs with probability (a) If number j appears N.4) Random variables X and Y have a joint density 6. 13. assuming a = 0. j = 1. Suppose that 100 independent tosses yielded N I ones. . we are to test Ho: 0 = O0 against H1: 0 = O1 < Oo. Y) with a = 0.25. 12. 
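The likelihood ratio statistic for the same die problem, given in the display numbered (9.7.14) above, can be computed alongside the Wald statistic of the previous sketch; by Theorem 9.4.1 it is asymptotically chi-square with 2 degrees of freedom under H0, and the 5% critical value is about 6, the value of c used for Figure 9.11. Python with NumPy and SciPy, and the same illustrative counts, are assumptions of this sketch.

# Likelihood ratio test that the three-sided die is unloaded:
# -2 log(lambda) = 2n( log 3 + sum_j phat_j log phat_j ),
# asymptotically chi-square with 2 degrees of freedom under H0.
import numpy as np
from scipy.special import xlogy
from scipy.stats import chi2

counts = np.array([22, 18, 10])           # same illustrative counts as above, n = 50
n = counts.sum()
phat = counts / n

lr = 2.0 * n * (np.log(3.0) + np.sum(xlogy(phat, phat)))   # xlogy handles zero counts
print(round(lr, 3), round(chi2.ppf(0.95, df=2), 3))        # compare with c of about 6

For these counts the two statistics are close, as the asymptotic equivalence of the two tests suggests they should be.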
and N g threes.2.220 9 1 Hypotheses I Exercises 221 Calculate the probabilities of Type I a m d Ttrpe I f emrs for this critical region. 0 > 0. (Section 9. (Section 9. should you reject the null hypothesis at the 5% significance level? What about at lo%? You may use the normal approximation. Derive: (a) the Neyman-Pearson optimal critical region.3) Supposing f (x) = (1 O)xe. (c) Prove that the likelihood ratio test of the problem is the uniformly most powerful test of size 0. N 2 twos. test the hypothesis Ho: 0 = 2 against HI: 0 > 2 by means of a single observed value of X. . 0 < x < 0.05.3) Let X be the outcome of tossing a three-sided die with the numbers 1. . and 0 elsewhere. Indicate how to determine the critical region if the size of the test is a. and N6 = 18.9.01 based on a single observation of X and Y .5 against H1: 0 # 0. define the best test you can think of and justify it from either intuitive or logical consideration. (a) Find the likelihood ratio test of size 0 < a < 1.3) Letf(x) = 0 exp( -Ox). . P(X = k) = p(l . 11. Find the power function for testing Ho: p = 1/4 if the critical region consists of the numbers k = 1. (Section 9. 3. y 5 1.05. Consider the test which rejects Hoif X > c.2.4) Suppose that a bivariate random variable (X. 0 5 Y 5 A . Find the uniformly most powerful test of the hypothesis 0 = 1 of size a = 0. 0 > 0. 2. (Section 9. assuming that P(Ho) = P(H1) and that the loss of Type I error is 2 and the loss of Type I1 error is 5. and 0 < h < m. . + 7. 2. . Compare it with the power function of the critical region consisting of the numbers {1. 2. 6 . xO.3) We wish to test the null hypothesis that a die is fair against the alternative hypothesis that each of numbers 1. We are to test Ho: p = A = 1 versus H I : not Hoon the basis of one observation on (X. 4 and 5 each occurs with probability 1/5. x r 0.P)~-'.25. That is.4) Suppose (X. N g = 14. Determine c so that a = 1/4 and draw the graph of its power function. (Section 9. (a) Derive the likelihood ratio test. (Section 9. y) 9. p2. 2.4) Let X be the number of trials needed before a success (with probabilityp) occurs. (Section 9. (b) the Bayesian optimal critical region. and 3 occurring with probabilities PI. (b) If N = 2. . = 1/ ( p i ) . . .5 on the basis of a single observation on (X. (b) Show that it is not the uniformly most powerful test of size a. . (b) Obtain the power function of the likelihood ratio test (or your alternative test) and sketch its graph. in N throws of the die. Find the Neyman-Pearson test based on a sample of size n. Y) is uniformly distributed over the square defined by 0 5 x.).8.and p3.k = 1. we are to x 5 p. Derive its power function. Choose a = 0. obtain the most powerful test of size '/4 and compute its p value. 10.4) Given the density f (x) = 1/0. We are to test Ho: 0 = 0. Y) have density f (x. PI = p2 = 2/5 against H I : PI = Obtain a Neyman-Pearson test of Ho: % and p2 = %. N5 = 17. . 0 < x < 1. (Section 9. N4 = 22. where we assume 0 5 0 < 1. (c) If N1 = 16. and 3 occurs with probability Xo. . We want to test Ho: 0 = 1 against HI: 0 = 2 on the basis of one observation on X. 0 0 < I. If you cannot. < M. N p = 13. (Section 9. You may use the normal approximation.L. 5) Let p be the probability that a patient having a particular disease is cured by a new drug. Suppose that the density of X given 0 is f (x 1 0) = 2x/e2. y) = 20-* for x = + y 5 0. Obtain the likelihood ratio test of Ho: 0 = 2 against H1:0 < 2 on the basis of one observation of X at CY = 0. 
0 I and the prior density of 0 is f (0) = 20. (Section 9. Formulate a Bayesian decision rule regarding whether or not the drug should be approved. Assume that the loss matrix is given by Suppose that a prior density of p is uniform over the interval [0. (a) Derive the Bayes estimate of 0. how large should x be for the drug to be approved? 20.6) Suppose you roll a die 100 times and the average number showing on the face turns out to be 4. (Section 9. 0 5 x.5) :x 5 0.25. (Section 9. 1.5. (Section 9. (a) Derive the likelihood ratio test of size 0. where we a 5 1. should you decide that the stimulant has an effect at the 1% significance level? What about at 5%? 17. 21 on the basis of one observation on X.6) One hundred randomly selected people are polled on their preference between George Bush and Bill Clinton.5.51 versus H I : 0 E (1. We test Ho: 0 = 0. 1 comes up four times and 2 comes up seven . 18.4) The joint density of X and Y is given by f (x. Suppose that the net social utility from a commercial production of the drug is given by U(p) = -0. (c) Show that it is the uniformly most powerful test of size 0.5) +1 for -2 5 0 5 2 a n d O 5 x 5 1. 15. Show that this test is h e uniformly most powerful test of size 0. 0. for 0 . (Section 9. (b) Derive its power function and draw its graph.9 for 0.5) Random variables X and Y have a joint density f (x.05. (Section 9. 19. (Section 9. 16.222 9 1 Hypotheses I Exercises 223 14.25.4) The density of X is given by f(x) = 0(x .5 against H I : 0 # 0. Suppose that we are given a single observation x of X.O < 0 < 1. 22. 21. Y). in which one runner is given a stimulant and another is not. If n = 2.6) Thirty races are run. show how a Bayesian tests Ho: 0 5 0.1 5 0 5 1. Assuming that the prior density of 0 is uniform over [ l . (Section 9.0.5 =2(p-0. (b) Assuming that the costs of the Type I and I1 errors are the same.5) Let X be uniformly distributed over [O.1 5 0 I :1.5) for 0 5 p 5 0. 0 5 y.5 against H1:0 > 0. assuming the prior density single observation of each of X and Y f (0) = 1/0.5. Find the Bayesian test of Ho: 0 2 % against H I : 0 < % based on a .6) We throw a die 20 times. (Section 9. oI 5 0. Y 10) = 0-' for o 5x I :0. 0 otherwise.5. find the Bayes test of Ho: 0 E [ l . How large a percentage point difference must be observed for you to be able to conclude that Clinton is ahead of Bush at the significance level of 5%? 21. Assume that the loss matrix is the same as in Exercise 16. If twenty races are won by the stimulated runner. 01. on the basis of one observation on (X. Is it reasonable to conclude that the die is loaded? Why? 23.05. 11 and that x patients out of n randomly chosen homogeneous patients have been observed to be cured by the drug. (Section 9. 5 < p 5 1. can we conclude that training has an effect at the 5% significance level? What about at lo%? 2'7. can we conclude that the fertilizer is effective at the 5% significance level? Is it at the 1% significance level? Assume that the yields are normally distributed. based on a sample of 5000 voters. Let pl be the probability that 1 comes up and p2 be the probability that 2 comes up. Other things being equal. (Section 9. Participant A B C D Weight after (lbs) 126 125 129 131 128 130 135 142 Farm A B C Y~eld without fertilizer (tons) 5 6 7 8 9 Yield with fertilizer (tons) 7 8 7 10 10 25.6) It is claimed that a new diet will reduce a person's weight by an average of 10 pounds in two weeks. whereas another poll. 
Would you conclude that the graduation record of athletes is superior to that of nonathletes at the 1% or 5% significance level? 29. Among the 1024 students were 84 athletes. and Y be the duration for those with training: x y 35 31 42 37 17 21 55 10 24 28 Assuming the two-sample normal model with equal variances.6) According to the Stanford Observer (October 19'77). test the hypothesis pl = p2 = '/6 against the negation of that hypothesis. recorded before and after the twc-week period of dieting.224 9 1 Hypotheses times. On the basis of our experiment. Would you accept the claim made for the diet? Weight before (lbs) unemployment for those without tdnhg. based on a sample of 3000 voters. showed Clinton ahead by 20 points. The weights of seven women who followed the diet. Let X be the duration of .6) The following data are from an experiment to study the effect of training on the duration of unemployment. (Section 9. 1024 male students entered Stanford in the fall of 1972 and 885 graduated. Should we reject the hypothesis at 5%? What about at 10%? 24.-~)' 18 10 2 28.6) One preelection poll. (Section 9. of which 78 graduated. Assume that the prices are normally distributed with the same variance (unknown) in each city. (Section 9. Are the results significantly different at the 5% significance level? How about at lo%? 26.6) The price of a certain food item was sampled in various stores in two cities. City A n X D E City B 9 9 2 n-"(x.6) The accompanying table shows the yields (tons per hectare) of a certain agricultural product in five experimental farms with and without an application of a certain fertilizer. Test the hypothesis that there is no difference between the mean prices of the particular food item in the two cities using the 5% and 10% significance levels. and the results were as given below. (Section 9. (Section 9. showed Clinton ahead by 23 points. are given in the accompanying table. rl of nl students passed a test. p3).6) Using the data of Exercise 27 abow*test the e q * at the 10% significance level. b 3 ) . Students are homogeneous within each group.1 Class 2 7. Use it to answer problem (a) above.7) In Exercise 25 above.7) In Group 1.3. The students all took the same test. b2. Let pl and p2 be the probability that a student in Group 1 and in Group 2. (Section 9. q = 40.and @a are 4. passes the test. (Section 9.8 34. Choose the 5% significance level. 35. and their test scores were as shown in the accompanying table.8 7. should you re~ect H o at a = 0. 31. where 6' = (bl.a 2 )for class i = 1. Choose the size of the test to be 1% and 5%.6) Using the data of Exercise 26 above. (Section 9. . and 1. respectively. r1 = 14. (Section 9. and fis having the joint distribution j i N(p. Given nl = 20.2. in Group 2. b2. 2.r2 of n2 students passed the test.3 8.3 Class 3 Test the hypothesis that the mean prices in the three cities are the same. (Section 9. test the equa£irp of the dances at the 10% significance level.1? (b) Derive the likelihood ratio test for the problem. Assuming that the test scores are independently distributed as N ( p i . = 0.0 6. We are to test Ho: pl = p.5 against H I : not Ho. and 7 2 = 16.7) There are three classes of five students each. Assume that the test results across the students are independent.05 or at a = 0.226 9 1 Hypotheses I Exercises 227 1 30. derive the Wald test for the problem. 7. add one more column as follows: City C ' - Assume that the observed values of b l . $2. (a) Using the asymptotic normality of jl = r l / n l and jP = r2/n2. 
of the variances 32. Score in Class 1 8. (Section 9. test Ho: p1 = p2 = pg against H I : not Ho. pP = ( ~ 1 p2> . and respectively.A).7) Test the hypothesis p1 = ~2 = p5 using the estimators bI. . where {y. A variable such as y is called a dependent variable or an endogenous vanable. f (y I x). .}. We also assume that x. t = 1. we shall need the additional assumption that {u. . In the present c h a p ter we shall consider the relationship between two random variables. From now on we shall drop the convention of denoting a random variable by a capital letter and its observed value by a lowercase letter because of the need to denote a matrix by a capital letter. . 2. We call this bivariate (more generally. Regression analysis is useful in situations where the value of one variable. 2 . Let (X. It is wise to choose as the independent variable the variable whose values are easier to predict. and a ' are unknown parameters that we wish to estimate. I I 10. is determined through a certain physical or behavioral process after the value of the other variable. y) = f (y I x)f (x).). = a + fix. .) of W t l . We shall continue to call x. = a ' . This is equivalent to assuming that (10.. 12.] to be known constants rather than random variables. and a variable such as x is called an independent variable. . be a sequence of independent random variables with the same distribution F. Since we can always write f (x.}and {y. we may not always try to estimate the conditional density itself. Let us assume that x and y are continuous random variables with the joint density function f (x.1) specifies the mean and variance of the conditional distribution of y given x. y. . regression analysis implies that for the moment we ignore the estimation off (x).]are normally .)are known constants. The problem we want to examine is how to make an inference about f (x. We define the bivariate linear regression model as follows: (10. Thus far we have considered statistical inference about F based on the observed values {x. we mean the inference about the joint distribution of x and y. 2. and 13 we shall study statistical inference about the relationship among more than one random variable. in a consumption function consumption is usually regarded as a dependent variable since it is assumed to depend on the value of income. the independent variable. x and y. As in the single variate statistical inference. T. T. is determined. and {u. fi. At some points in the subsequent discussion. {x. y) . we often want to estimate only the first few moments of the density-notably. Thus. whereas income is regarded as an independent variable since its value may safely be assumed to be determined independently of consumption. In situations where theory does not clearly designate which of the two variables should be the dependent variable or the independent variable. a. y) on the basis of independent observations {x. + u. mutivariate) statistical analysis. The linear regression model with all the above assumptions is called the clcwr sical regression model. instead. The reader should determine from the context whether a symbol denotes a random variable or its observed value.d.T. . B y the inference about the relationship between two random variables x and y. an exogenous vanabk. Here. we can state that the purpose of bivariate regression analysis is to make a statistical inference on the conditional density f (y I x) based on independent observations of x and y. with Eu. one can determine this question empirically. . 
In this chapter we shall assume that the conditional mean is linear in x and the conditional variance is a constant independent of x.i. on x and y. but it is not essential for the argument. = 0 and Vu. We make this assumption to simpllfy the following explanation.] are unobservable random variables which are i.1 INTRODUCTION In Chapters 1 through 9 we studied statistical inference about the distribution of a single random variable on the basis of independent observations on the variable. .]are observable random variables. In Chapters 10. . say. .1.1) y. the mean and the variance. or a regressox For example.1 1 Introduction 229 i 1 10 BIVARIATE REGRESSION MODEL 10.1. is not equal to a constant for all t. t = 1. x. Bivariate regression analysis is a branch of bivariate statistical analysis in which attention is focused on the conditional density of one variable given the other.t = 1.. Note that we assume {x. . as in Figure 10. since if E(y* I x*) is nonlinear in x*. t refers to the tth period (year. Alternatively. xi). Another simple method would be simply to connect the two dots signifying the largest and smallest values of x. y = logy* and x = log x*. P. for example. and so forth. Since Eu.230 10 1 Bivariate Regression Model 10.1. In that figure each dot represents a vector of observations on y and x.2. it is possible that E(y I x) is linear in x after a suitable transformation--such as. and u2in the linear regression model because of its computational simplicity and certain other desirable properties. # 0 even if t s) or heteroscedastic (that is.. E(y I x) is. tth firm. but there are a multitude of ways to draw such a line. In Figure 10.1) specifies completely the conditional distribution of y given x. as we have seen in Chapters 4 and 5. tth nation.1. and u2 in the bivariate linear regression model (10. we can draw a line so as to minimize the sum of absolute deviations.. x. which we shall show below. We have labeled one dot as the vector (y. varies with t). The T observations on y and x can be plotted in a secalled scatter diagram. Then the problem of estimating a and p can be geometrically interpreted as the problem of drawing a straight line such that its slope is an estimate of P and its intercept is an estimate of a.4 we shall briefly discuss nonlinear regression models. consumption and income). Eu. Still. Vu. Minimizing the sum of squares of distances in any other direction would result in a different line. We can go on forever defining different lines. The assumption that the conditional mean of y is linear in x is made for the sake of mathematic& convenience. = 0.] may also be regarded as a starting point. or the sum of the fourth power of the deviations. In Section 13. it should by no means be regarded as the best estimator in every situation.1 Definition In this section we study the estimation of the parameters a. If we are dealing with a time series of observations. The linearity assumption may be regarded simply as a starting point. We first consider estimating a and p. a reasonable person would draw a line somewhere through a configuration of the scattered dots. Gauss in a publication dated 1821 proposed the least squares method in which a line is drawn in such a way that the sum of squares of the vertical distances between the line and each dot is minimized. Our assumption concerning {u. In the subsequent discussion the reader should pay special attention to the following question: In what sense and under what conditions is the least squares estimator the best estimator? 
Algebraically. where y* and x* are the original variables (say. in general.) as (5. the linearity assumption is not so stringent as it may seem. But in some applications t may represent the tth person. the least squares (LS) estimators of a and P.2 1 Least Squares Estimators 231 distributed.). Given a joint distribution of x and y.1. and the like.1 Scatter diagram + 10.1. month. the vertical distance between the line and the point (y. how shall we choose one method? The least squares method has proved to be by far the most popular method for estimating a. x. taking only two values). Then (10. denoted by & and p. We have also drawn a straight line through the scattered dots and labeled the point of intersection between the line and the dashed perpendicular line that goes through (j. In Chapter 13 we shall also briefly discuss models in which (q) are serially correlated (that is. and so on).) is indicated by h.. nonlinear in x. We have used the subscript t to denote a particular observation on each variable.1). Data which are not time series are called cross-section data. can be defined as the values of CY and P which minimize the sum of squared residuals . Two notable exceptions are the cases where x and y are jointly normal and x is binary (that is. However.2 LEAST SQUARES ESTIMATORS 10.u. x. F 1 G u R E 1 0. P. ]as known constants.] after the effect of the unity regressor has been removed. and I5 as measuring the effect of the unity regressor on {y. So far we have treated a and p in a nonsymmetric way. we are predicting x.2.2) and (10. = T-'zxtyt . we can treat a and p on an equal basis by regarding a which is actually the deviation of x. we can express the orthogonality as 1 and (10. where we have defined 7 = T-'c%.3) -- as aP . a sequence of T ones.9)..3) simultaneously for a and P yields the following solutions: zLI as the coefficient on another sequence of known constants-namely.2. T). = 0.2. . In Section 10.. Define the error of the predictor as and call it the least squares residual.2) and (10. and sXy is the sample covariance. on x.27.3).2.9) follow from (10.8) and (10. (See the general definition in Section 11. Solving (10.* without the 6. based on the unity regressor and {x. where X should be understood to mean unless otherwise noted. Under this symmetric treatment we should call j. We shall call this sequence of ones the unity regressox This symmetric treatment is useful in understanding the mechanism of the least squares method. y. that is.2.232 10 ) Bivariate Regression Model 10. = 2. We define the error made by the least squares predictor as where jJ is the value that minimizes X(x. by the sample mean. where t is not included in the sample period ( 1 .2. We shall present a useful interpretation of the least squares estimators 1 5and by means of the above-mentioned symmetric treatment.} has been removed.}on {y.2 ( Least Squares Estimators 233 Differentiating S with respect to a and tives to 0. respectively. We shall call this fact the orthogonality between the least squares residual and a regressor.x.}.2.] after the effect of {x.6) the least squares predictor of y. can be interpreted as the least squares estimator of the coefficient of the regression of y. But as long as we can regard (x. Note that J and 2 are sample means.-y)'. In other words. 2 = T-'zx.2. . .3.2.. s: = T .a - ~ x . We define Note that (10. and calling a the intercept. as defined in (10.2.2Z(yt .3. which defined the best linear unbiased predictor. . 
The least squares estimator (j can be interpreted as measuring the effect of {x.2. 0. jJ = 2. regarding P as the slope coefficient on the only independent variable of the model.5) can be obtained by substituting sample moments for the corresponding population moments in the formulae (4.. namely x. ) x= . that is.4).) Mathematically.2. The precise meaning of the statement above is as follows.4) and (10.2. Then defined in (10. There is an important relationship between the error of the least squares pre&ction and the regressors: the sum of the product of the least squares residual and a regressor is zero. .' c ~:2 ' and s.8) and (4. 2 . from the sample mean since ? . based on the unit regressor as p and call it the least squares predictor of y.6 below we discuss the prediction of a "future" value of y. Thus the least squares estimates can be regarded as the natural estimates of the coefficients of the best linear predictor of y given x. s: is the sample variance of x.9) Czi..2. we obtain P and equating the partial den* 7 and (10. Define the least squares predictor of x. It is interesting to note that (10. . Then.1) into (10.21) 0 P. A problem arises if x.3 Note that this formula of & has a form similar to as given in (10. we have froan (10.1) into (10. their unbiasedness is clearly a desirable property.)*l2 by Theorem 4.a ~ .14) cl:ut &.1 vp = - 1 [Z(xf1 I u2 V(C x:u. In other the coefficient of the regression of y.2 1 Least Squares Estimators 235 intercept: that is.2. on 1 )~ that words.).2.20) ~p = p. 0. where 8 minimizes C(l . *so (10. Therefore under to go to zero at about the reasonable circumstances we can expect same rate as the inverse of the sample size T.2.16) rather than (10. & minimizes C(y. The variance of & has a similar property.2.5).17) yields Reversing the roles of {x.2.6. minimizes C(y.2.2.]and the unity regressor.2.23) 0 can be evaluated as follows: by Theorem 4. inserting (10. This is another desirable property. defined in (10.2 Properties of & and fi First. Using (10.18) yields (10. cx: .12) and (10.1.1.21) and 10.1. = 0 and {x:] are constants by m assumptions.2. we obtain from (10. .12) and using (10. if we define which implies (10.~ x : ) ~In . Next.2.as we can easily verify that 0 Similarly.2.16) and using (10. we d&* squares predictor of the unit regressor based on {x.19).2. Therefore Cx.)on the unity regressor or in the regression of the unity regressor on {x. In other words.2.5). ~(11.)~ We call it the predictor of 1 for the sake of symmetric treatment.2.}as the least Since Eu.a =-. the variance of we can show that &. we obtain the means and the variances of the least squares estimators & and For this purpose it is convenient to use the formulae (10. First.2.12).) ~(x:)~ [C(x2.2.22) E& = a.4) and (10. is an unbiased estimator of (10. let us see what we can learn from the means and the variances obtained above.2.. is the least squares estimator of : without the intercept.2.19) and Theorem 4. How good are the least squares estimators? Before we compare them with other estimators. .2.2.234 10 1 Bivariate Regression Model 10. Similarly.23) is equal to T times the sample variance of x. even though there is of course no need to predict 1 in the usual sense. The orthogonality between the least squares residual and a regressor is also true in the regression of {x. (10.6xJ2.3.2. 6 = -. VO . note that the denominator of the expression for given in (10. this intwpfeatiorlitis more natural to write fi as 0 Inserting (10. 2.28). = P for all ci and f3.]: (10. Equation (10. 
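The orthogonality between the least squares residual and each regressor, Σû_t = 0 and Σû_t x_t = 0, can be checked directly on simulated data. The following minimal sketch (made-up data, NumPy assumed) does so.

import numpy as np

rng = np.random.default_rng(1)
T = 50
x = rng.uniform(0, 10, size=T)
y = 1.0 + 0.5 * x + rng.normal(size=T)   # made-up data-generating values

xbar, ybar = x.mean(), y.mean()
beta_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
alpha_hat = ybar - beta_hat * xbar
u_hat = y - alpha_hat - beta_hat * x     # least squares residuals

# Orthogonality with the unity regressor and with x (up to floating-point error):
print(np.isclose(u_hat.sum(), 0.0), np.isclose((u_hat * x).sum(), 0.0))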
which may be regarded as a natural generalization of the sample mean.2.12: Recall that in Chapter '7 we showed that we can define a variety of estimators with mean squared errors smaller than that of the sample mean for some values of the parameter to be estimated. where {c.) on the unity regressor is precisely the sample mean of {y.2.236 10 1 Bivariate Regression Model 10.2. The proof of the best linear unbiasedness of & is similar and therefore left as an exercise.26) ECc. we give only a partial proof here. (10.y. = x:/~(x. although its significance for the desirability of the estimators will not be discussed until Chapter 12. .2.33) Cctx.*)~-in other words. is a constant for all t.) Intuitively speaking.2.23). = dl and (10. = 0 and since Zc. (Note that the least squares estimator of the coefficient in the regression of {y.30) -5 I ~(x. for then both c(x:)~ and ~ ( 1 : ) are ~ small.2. and (10.2.2. The unbiasedness condition implies + From (10. We can establish the same fact regarding the least squares estimators. Then dl& f d2Pis the best linear unbiased estimator of dla + drip. = d2. we see that the condition (10. = 0.30) if and only if c. and inasmuch as we shall present a much simpler proof using matrix analysis in Chapter 12.y.]satisfymg (10. (Note that when we defined the bivariate regression model we excluded the possibility that x.2 1 Least Squares Estimators 237 stays nearly constant for all t.12) we can easily verify that unbiased estimators. Because the proof of this general result is lengthy.}and the unity regressor on (y.2.2.2. The class of linear unbiased estimators is defined by imposing the following condition on {c.1) into the left-hand side of (10. we can prove a stronger result.29) and (10.y.2. The problem of large variances caused by a closeness of regressors is called the problem of multicollinearity.28).27) Zc. Comparing (10.* = 1 using (10.2.) are arbitrary constants. Inserting (10.2. the least squares estimator.2'7) and (10.31) also shows that equality holds in (10.27). but that the sample mean is best (in the sense of smallest mean squared error) among all the linear unbiased estimators. We have p is a member of the class of linear (10.2. and putting dl = 1 and d2 = 0 for the estimation of a. Again. we note that proving that $ is the best linear unbiased estimator (BLUE) of P is equivalent to proving (10.32) Cc.2.31) because the left-hand side of (10.1.x.26) is equivalent to the conditions (10.) when xt is nearly constant.30) follows from (10.2.26) and using Eu. Note that (10. The class of linear estimators of p is defined by Z.2.2.11).) Let us consider the estimation of p. But (10.30) follows from the following identity similar to the one used in the proof of Theorem 7. we define the class of linear estimators of dla d2p by E. For the sake of completeness we shall derive the covariance between & and p. we cannot clearly distinguish the effects of {x. since in that case the least squares estimators cannot be defined.2.].31) is the sum of squared terms and hence is nonnegative.)' CC: for all {c. Consider the estimation of an arbitrary linear combination of the parameters dla -t d2P. Actually.2. The results obtained above can be derived as special cases of this general result by putting dl = 0 and d2 = 1 for the estimation of p. 2. Using 6' we can estimate and V& given in (10.36) by Q. we have We omit the proof of this identity. If we prefer an unbiased estimator.41).38) and (10. 
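A small Monte Carlo experiment makes the unbiasedness of β̂ and the variance formula Vβ̂ = σ²/Σ(x_t − x̄)² concrete. In the sketch below all design choices (T = 30, α = 1, β = 0.5, σ = 1, 20,000 replications) are arbitrary, and {x_t} is drawn once and then held fixed across replications, as the model assumes.

import numpy as np

rng = np.random.default_rng(2)
T, alpha, beta, sigma = 30, 1.0, 0.5, 1.0       # made-up values
x = rng.uniform(0, 10, size=T)                  # kept fixed across replications
theory_var = sigma**2 / np.sum((x - x.mean())**2)

def beta_hat(y):
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

draws = np.array([beta_hat(alpha + beta * x + rng.normal(0, sigma, size=T))
                  for _ in range(20000)])
# The Monte Carlo mean should be near beta = 0.5 and the Monte Carlo variance
# should be near the theoretical variance sigma^2 / sum (x_t - xbar)^2.
print(draws.mean(), draws.var(), theory_var)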
although the bias diminishes to zero as T goes to infinity.2.8) and (10.23) and (10.18).37) 10. we obtain from (10.40)..2.44) implies that ~6~= T-'(T . we should note that the proof of the best linear unbiasedness of the least squares estimator depends on our assumption that {u. we must first predict them by the least squares residuals (Q.33) imply Cc.43) we conclude that which we shall call the least squares estimator of u2.2.2.1'7).38).] are not observable. Similarly. It is well to remember at this point that we can construct many biased and/or nonlinear estimators which have smaller mean squared errors than the least squares estimators for certain values of the parameters. its The f )best ~ .2. Moreover.* = ~ ( c : ) ~ .238 10 I Bivariate Regression Model (10..42) Finally.2 1 Least Squares Estimators 239 The variance of Zctytis again given by (10.2. summing over t.11) by : x and summing over t yields We shall now consider the estimation of u2.2.)'.2. except to note that (10.36) and (10.u.32) and (10.29).2. linear unbiasedness of the least variance is given by ~ ~ ~ ( e squares estimator follows from the identity (10. Then u2 can be estimated by because of (10.9) yields We shall not obtain the variance of 6 ' here. Also.2.2.2.] were observable.2.) defined in (10.2. and using (10. and (10.2. the most natural estimator of u2 would be the sample variance T-'XU:.2)u2 and hence 62is a biased estimator of u2.39) yields But multiplying both sides of (10. We shall evaluate ~ 6From ~ (10. from (10.2. . we can use the estimator defined by Multiplying both sides of (10.2. Since {u.Q. in certain situations some of these estimators may be more desirable than the least squares.38) cC: = XU: - C(Q.2. (10.3 Estimation of 0 ' Taking the expectation of (10.15) by 1 : and summing over t yields because of (10. 10.] are serially uncorrelated with a constant variance. in Section 10.7) . multiplying both sides of (10.3 we shall . we use it because it is an estimator based on the least squares residuals. Although the use of the term bast squares here is not as compelling as in the case of & and p.2.2. Dkfine CG: = Zu. from which we obtain Then the least squares estimator dl& + d2P can be written as C C ~ and ~ .24) by substituting 6' for u2 in the respective formulae. we can write Equation (10.2.7).2. If {u. Using (10.2.c.2. Therefore.2. 12) into the right-hand side of (10. we must choose between two regression equations and .2 1 Least Squares Estimators 241 E indicate the distribution of Z C : .].we shall use the measure of the goodness of fit known as R square.]. it makes sense to choose the equation with the higher R'.]. which states that convergence in mean square implies consistency. P - a .240 10 1 Bivariate Regression Model 10. A systematic pattern in that plot indicates that the assumptions of the model may be incorrect. assuming the normality of (%I. That is.2.2.they should behave much like (u.51) we obtain In this section we prove the consistency and the asymptotic normality of and the consistency of (i2 under the least squares estimators & and suitable assumptions about the regressor {x.] are the predictors of {u. 10.2. as well as its variance.2. = T-'IC(~.]. Since both & and are unbiased estimators of the respective parameters.24) converge to zero. we need only show that the variances given in (10.46) and (10.5.47) TI?' = minqy. and so on.]. Since {C.55) lim x(lt*)' T4- =w and which is the square of the sample correlation coefficient between {y. it is usually a good idea for a researcher to plot (6. 
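The bias factor (T − 2)/T in Eσ̂² = T⁻¹(T − 2)σ² is easy to see in a simulation with a small T. The sketch below (made-up values, NumPy assumed) compares the Monte Carlo mean of σ̂² = T⁻¹Σû_t² with (T − 2)σ²/T.

import numpy as np

rng = np.random.default_rng(3)
T, sigma = 10, 1.0                      # a small T makes the bias visible; made-up values
x = rng.uniform(0, 10, size=T)

def sigma2_hat(y):
    xbar, ybar = x.mean(), y.mean()
    b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)
    a = ybar - b * xbar
    u = y - a - b * x
    return np.mean(u**2)                # biased estimator: (1/T) * sum of squared residuals

draws = np.array([sigma2_hat(1.0 + 0.5 * x + rng.normal(0, sigma, size=T))
                  for _ in range(20000)])
print(draws.mean(), (T - 2) / T * sigma**2)   # both close to 0.8 for these values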
The statistic G2 may be regarded as a measure of the goodness f i t . . therefore.50) CQ: = C(y.] or on another independent sequence {s.] or (xJ.2.] and {x.2.a)'. .1. a.]. a we have (i2 5 s : . fi fi (10. Then we must respecify the model. how accurately the future values of the independent variables can be predicted.2.j)' + fi'~(x2.2.]. from which one could derive more information.]. One purpose of regression analysis is to explain the variation of {y.2. * ~ . Therefore.46) as the square of the sample correlation between {y. Other things being equal.2. = minZ(y.). we shall discuss this issue further. namely.or by including other independent variables in the right-hand side of the regression equation. .7). This Here s statistic does not depend on the unit of measurement of either {y. ' the better the fit. In Section 12. The statistic G2 is merely one statistic derived from the least squares residual {zi. we conclude that & and are consistent if fi p. the regression is good.px. where In practice we often face a situation in which we must decide whether {y.)and btl. is the sample variance of {y. - Inserting (10.) against time.50) yields Finally.}by we say that the fit of the variation of {x.2.].)' and (10. 0 5 R' 5 1.1. Since (10.2.11).From (10.] are explained well by {x.2. we have (10.)~2 f i ~ x .} are to be regressed on {x. s. using (10.5) and (10.j)'. . To prove the consistency of & and we use Theorem 6.2. from (10. This decision should be made on the basis of various considerations such as how accurate and plausible the estimates of regression coefficients are.).2. We can interpret R2 defined in (10. perhaps by allowing serial correlation or heteroscedasticity in {u.4 Asymptotic Properties of Least Squares Estimators Therefore. However.If {y.48) TS. since 6' depends on the unit of measurement of {y.23) and (10. the smaller the G . . 242 10 1 Bivariate Regression Model 10.2 ( Least Squares Estimators 243 We shall rewrite these conditions in terms of the original variables {x,}. Since ~ ( 1 : ) and ~ (ZZ:)~ are the sums of squared prediction errors in predicting the unity regressor by {x,) and in predicting {x,) by the unity regressor, respectively, the condition that the two regressors are distinctly different in some sense is essential for (10.2.55)and (10.2.56)to hold. and (z,], t = 1,2, . . . , T, we measure Given the sequences of constants (x,] the degree of closeness of the two sequences by the index Finally, we state our result as T H E O R E M 10.2.1 In the bivariate regression model (10.1.1), the least squares estimators & and fi are consistent if (10.2.64) lim EX: = a T-+m i and Then we have 0 5 p ; 5 1. To show p ; 5 1, consider the identity Note that when we defined the bivariate regression model in Section Since (10.2.58)holds for any A, it holds in particular when 10.1, we assumed PT f 1.The assumption (10.2.65)states that p~ 1 holds is in general not restrictive. in the limit as well. The condition (10.2.64) Examples of sequences that do not satisfy (10.2.64)are x, = t-' and x, = 2-" but we do not commonly encounter these sequences in practice. we have Next we prove the consistency of 6'. From (10.2.38) (10.2.66) + xu: 1 2 6* = -- - Z(d,- u,). T T Inserting (10.2.59) into the right-hand side of (10.2.58)and noting that the left-hand side of (10.2.58) is the sum of nonnegative terms and hence is nonnegative, we obtain the Cauchy-Schwartz inequality: Since {u:) are i.i.d. 
with mean a2,we have by the law of large numbers (Theorem 6.2.1) Equation (10.2.43)and the Chebyshev's inequality (6.1.2)i m p l y (See Theorem 4.3.5 for another version of the CauchySchwartz inequality.) The desired inequality p : 5 1 follows from (10.2.60). Note that p ; = I if and only if x, = z, for all t and &. = 0 if and only if {x,)and {z,) are x , %= 0). orthogonal (that is, E Using the index (10.2.57) with z, = 1,we can write Therefore the consistency of 62follows from (10.2.66), (10.2.67), and (10.2.68) because of Theorem 6.1.3. We shall prove the asymptotic normality of & and From (10.2.19) and (10.2.21)we note that both fi - P and & - a can be written in expressions of the form 0. and where {z,] is a certain sequence of constants. Since the variance of (10.2.69) goes to zero if Cz; goes to infinity, we transform (10.2.69)so that I 244 10 1 Bivariate Regression Model 10.2 1 Least Squares Estimators 245 the transformed sequence has a constant variance for atE T. This is accomplished by considering the sequence Next, using (10.2.61) and (10.2.62), we have since the variance of (10.2.70) is unity for all T. We need to obtain the conditions on {z,}such that the limit distribution of (10.2.70) is N(0, 1). The answer is provided by the following theorem: THEOREM 1 0 . 2 . 2 Let {u,} be i.i.d. with mean zero and a constantvariance u2 as in the model (10.1.1). If n,ax z: (10.2.71) lim Tjm - 0, zz: Therefore (1:) satisfy the condition (10.2.71) if we assume (10.2.65) and (10.2.74). Thus we have proved that Theorem 10.2.2 implies the following theorem: T H E O R E M 1 0.2.3 then In the bivariate regression model (10.1.1), assume further (10.2.65) and (10.2.74). Then we have Note that if zt = 1 for all t, (10.2.71) is clearly satisfied and this theorem is reduced to the Lindeberg-L6vy central limit theorem (Theorem 6.2.2). Accordingly, this theorem may be regarded as a generalization of the Lindeberg-L6vy theorem. It can be proved using the Lindeberg-Feller central limit theorem; see Amemiya (1985, p. 96). 3 and & - a by putting z, = We shall apply the result (10.2.72) to fi - 6 x : and z, = 1 : in turn. Using (10.2.63), we have (10.2.73) max (x:)' : - max (xt - 2) 5 4 max x ~(x:)~ (1 - p;)zx; (1 - p;)Zx: 2 and Using the terminology introduced in Section 6.2, we can say that & and p are asymptotically normal with their respective means and variances. Therefore {x:] satisfy the condition (10.2.71) if we assume (10.2.65) and I (10.2.74) lirn T+m ' 2 max xt Ex; '- 0. - Note that the condition (10.2.74) is stronger than (10.2.64), which was required for the consistency proof; this is not surprising since the asymp totic normality is a stronger result than consistency. We should point out, however, that (10.2.74) is only mildly more restrictive than (10.2.64). In of this fact, the reader should try to construct a order to be co~lvinced sequence which satisfies (10.2.64) but not (10.2.74). The conclusion of Theorem 10.2.3 states that & and fi are asymptotically 246 10 1 Bivariate Regression Model 10.2 1 Least Squares Estimators 247 normal when each estimator is considered separately. The assumptions of that theorem are actually sufficient to prove the joint asymptotic normality of & and that is, the joint distribution of the random variables defined in (10.2.76) and (10.2.77) converges to a joint normal distribution with zero means, unit variances, and the covariance which is equal to the limit of the covariance. 
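The content of the asymptotic normality result can also be illustrated by simulation: even when {u_t} are not normal, the standardized estimator (β̂ − β)√Σ(x_t − x̄)²/σ behaves approximately like an N(0, 1) variable for moderately large T. The design below (uniform errors rescaled to unit variance, T = 200, 10,000 replications; NumPy and SciPy assumed) is made up for illustration.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
T, beta, sigma = 200, 0.5, 1.0
x = rng.uniform(0, 10, size=T)
scale = np.sqrt(np.sum((x - x.mean())**2))

def standardized_beta_hat():
    # uniform errors on (-sqrt(3), sqrt(3)) have mean 0 and variance 1, but are not normal
    u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=T)
    y = 1.0 + beta * x + u
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    return (b - beta) * scale / sigma

z = np.array([standardized_beta_hat() for _ in range(10000)])
# Compare the empirical 95th percentile with the standard normal one (about 1.645).
print(np.quantile(z, 0.95), norm.ppf(0.95))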
We shall state this result as a theorem in Chapter 12, where we discuss the general regression model in matrix notation. 6; Solving (10.2.81) for a2yields the maximum likelihood estimator, which is identical to the least squares estimator k2. These results constitute a generalization of the results in Example 7.3.3. In Section 12.2.5 we shall show that the least squares estimators & and are best unbiased if {u,]are normal. p 10.2.6 Prediction 10.2.5 Maximum Likelihood Estimators In this section we show that if we assume the normality of {u,]in the model (10.1.1), the least squares estimators 2, and 6 ' are also the maximum likelihood estimators. The likelihood function of the parameters (that is, the joint density of y1, y2, . . . , p)is given by 6, The need to predict a value of the dependent variable outside the sample (a future value if we are dealing with time series) when the corresponding value of the independent variable is known arises frequently in practice. We add the following "prediction period" equation to the model (10.1.1): where yp and u p are both unobservable, xp is a known constant, and up is independent of {u,),t = 1, 2, . . . , T, with Eup = 0 and Vup = u2.Note that the parameters a,p, and a2are the same as in the model (10.1.1). Consider the class of predictors of yp which can be written in the form Taking the natural logarithm of both sides of (10.2.78), we have (10.2.79) log L = -- T log 2 n - T log u 2 - 1 E(y, - a - PX,)? 2 2 where & and are arbitrary unbiased estimators of a and P, which are linear in {Y,], t = 1, 2, . . . , T. We call this the class of linear unbiased predictors of yp. The mean squared prediction error of jp is given by (10.2.84) p Since log L depends on a and P only via the last term of the right-hand side of (10.2.79), the maximum likelihood estimators of a and P are identical to the least squares estimators. Inserting & and into the right-hand side of (10.2.79), we obtain the socalled concentrated log-likelihood function, which depends only on a2. E(yp - jP)' + pxp) - (a+ = u2 + V ( &+ Pxp), = E{up - [(& p where the second equality follows from the independence of u p and {y,), t = l , 2 , . . . , T. The least squares predictw of yp is given by (10.2.80) log L* T log = -2 T log a2 - --1 2~:. 2a - 2 2a2 It is clearly a member of the class defined in (10.2.83). Since V(& Oxp) 5 V ( &i-pxp)because of the result of Section 10.2.2, we conclude that the least squares predictor is the best linear unbiased predictol: We have now reduced the problem of prediction to the problem of estimating a linear combination of a and P. Differentiating (10.2.80) with respect to u2 and equating the derivative to zero yields (10.2.81) + d log L* = T --- du2 2a2 1 +- E2i: =O. za4 3.5. the distribution of would be completely specified and we could perform the standard normal test. where Po is a known specified value.3. which is usually the case. A test on the null hypothesis a = a. Therefore. We could use either a one-tail or a two-tail test.2 Tests for Structural Change 1 i \ j Suppose we have two regression regimes i and I where each equation satisfies the assumptions of the model (10.1. If u2 is unknown.}are not normal.1) and (10. we obtain In Section 9. it implies their independence by Theorem 5.248 10 ( Bivariate Regression Model 10.19).3. so we see from (10. however. it is reasonable to expect that a test statistic which essentially depends on 0 is also a good one.2.1 Student's t Test Therefore. Using Definition 2 of the Appendix. 
but since they are jointly normal.4) shows that the covariance between and ii.3.] .3 ( Tests of Hypotheses 249 10. Because of the asymptotic normality given in (10.i. provided that the assumptions for the asymptotic normality are satisfied. we have 1 I 10. Since this proof is rather cumbersome.3.2) are independent. we shall postpone it until Chapter 12.2.2 independent standard normal variables.2.2) are independent by Theorem 3.3. we need a chi-square variable that is distributed independently of (10. where a simpler proof using matrix analysis is given. we must use a Student's t test.2. Since is a good estimator of P.11). is zero. Throughout this section we assume that {u.17).5) is a p proximately correct for a large sample even if (u.3.3 TESTS OF HYPOTHESES 10. we assume that {ult]and {u2.1).3.3.3. (10. using (10.Po divided by the square root of an unbiased estimate of its variance.4. A hypothesis on a can be similarly dealt with.2. In the next two paragraphs we show that U-~ZG: fits this specification.5) and (10.3.5) is simply @ .5 we showed that a hypothesis on the mean of a nomral i. We now prove that (10. Equation (10.2. can be performed using a similar result: To prove (10. We denote Vult = o : and VuZt= u: .2. In addition.d.2) we must show that u-' Cu ^f can be written as a sum of the squares of T . if u2 were known. Note that the left-hand side of (10.1. A similar test can be devised for testing hypothkses on a and P in the bivariate regession model. as in the proof of Theorem 3 of the Appendix. We shall consider the null hypothesis Ho: f3 = Po. From Definition 2 of the Appendix we know that in order to construct a Student's t statistic. We can do so by the method of induction. We state without proof that where 1 5 ' is the unbiased estimator of 0' defined in (10.3.] are normally distributed. sample with an unknown variance can be tested using the Student's t statistic. The test is not exact if {u. the test based on (10.2.3. Using (10.] are not normal.45).76). Therefore. we conclude that under Ho 6 under the null hypothesis Ho.1) and (10.1).19) that p since Zxf = 0 by (10. depending on the alternative hypothesis. A linear combination of normal random variables is normally distributed by Theorem 5. 8).12) are independent. Several procedures are available to cope with this so-called Behrm-Fisher . = P2 without asBefore discussing the difficult problem of testing : = a : .t (10. C2i.3.14) cannot be derived from (10.3. (10. We can construct a Student's t statistic similar to the one defined in (10.we have under Ho Setting a : =a .X 2T . we have p : . let us consider testing the null hypothesis Ho: a : = suming a a : . by Theorem 1 of the Appendix. . we study the test of the null hypothesis Ho: PI = P2. For example.4) ($2 using (10.3.7) may represent a relationship between y and x in the prewar period and (10.8)..2.3 ( Tests of Hypotheses 251 are normally distributed and independent of each other.12) t=1 -------- TI Tz + t= 1 . The difficulty of this situation arises from the fact that assuming a (10. A one-tail or a two-tail test should be used.11).13) without assuming a : = a.3.3. t t= 1 - - 2 XT.14) in either a one-tail or a two-tail test.3. Then. defining xTt = XI.13) simplifies it to Let {2ilt} and (2i2J be the least squares residuals calculated from (10. A simple test of this hypothesis can be constructed by using the chi-square variables defined in (10.]are independent. depending hypothesis a on the alternative hypothesis. in (10.3. and x .3. t = x2.2) implies and T2 f . First. 
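The Student's t test of H0: β = β0 described here reduces to a few lines of code. The sketch below (made-up data, β0 = 0, NumPy and SciPy assumed) forms the statistic (β̂ − β0)/√(s²/Σ(x_t − x̄)²), with s² = Σû_t²/(T − 2) the unbiased variance estimate, and reports a two-tail p-value from the t distribution with T − 2 degrees of freedom.

import numpy as np
from scipy.stats import t as t_dist

rng = np.random.default_rng(6)
T, beta0 = 30, 0.0                          # null hypothesis H0: beta = 0 (made-up example)
x = rng.uniform(0, 10, size=T)
y = 1.0 + 0.5 * x + rng.normal(size=T)

xbar, ybar = x.mean(), y.mean()
Sxx = np.sum((x - xbar)**2)
beta_hat = np.sum((x - xbar) * (y - ybar)) / Sxx
alpha_hat = ybar - beta_hat * xbar
s2 = np.sum((y - alpha_hat - beta_hat * x)**2) / (T - 2)   # unbiased estimate of sigma^2

t_stat = (beta_hat - beta0) / np.sqrt(s2 / Sxx)
p_value = 2 * t_dist.sf(abs(t_stat), df=T - 2)             # two-tail test
print(t_stat, p_value)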
Since they are independent of each other because {ult} and (up. under either the null or the alternative hypothesis.11).3.3. assuming a : = a . we have by Definition 2 of the Appendix Note that a : and a : drop out of the formula above under the null : = a.respectively.zi:.-2 ~ Therefore.7) and (10. Finally. we have by Definition 3 of the Appendix + ZTZ (10. Since (10.3.250 10 1 Bivariate Regression Model 10.3.3.8) in the postwar period.10) and (10.R2 as in (10.. we consider a test of the null hypothesis Ho: pl = P2 without : = a.5).3. respectively. a2 0 : where we have set TI + T2 = T..3.Let 1 3 1 and 132 be the least squares estimators of PI and pp obtained from equations (10.3.~ .7) and (10. depending on the alternative hypothesis.3. The null hypothesis can be tested where Ci2 = (T . Then (10. This two-regression model is useful to analyze the possible occurrence of a structural change from one period to another.11) p . .3.).9) and (10.3. 2)2i:. Welch's method is based on the assumption that the following is a p proximately true when appropriate degrees of freedom. z i : .3.d. based on the assumption a = P. and covariance a . Show that if in fact a = p.1) obtain the constrained kmt squares estimator of P. Obtain the mean and mean squared error of the reverse least squares estimator (minimizing the sum of squares of the deviations in the direction of the x-axis) defined by = & l y : / ~ ~ = l y tand x t compare them with those of the 3 3 2 least squares estimator p = Zl=lytx. = x. where (y?} and ( ~ $ are 1 unknown constants. ./C. {y. p~ p pR 4.1.2)-'ZT2 .74).For other methods see Kendall and Stuart (1973). and 3. random variable with mean zero and vari2 ances ut and u . into the right-hand side of (10. 5. therefore.).2. with the distribution P(ut = 1 ) = P(ul = -1) = 0. .2) Following the proof of the best linear unbiasedness of same for &. = t fort = 1. EXERCISES I . .3. The problem is to estimate the unknown parameter p in the relationship y$ = P$.where ( is defined by cT=P~ denoted by 0. prove the J 2.16). is approximately distributed as X: for an appropriately chosen value of v.) and {v. 2. where 6: = (TI ..5.1'7) to obtain (10. Since E( = 1 and since EX: = v by Theorem 2 of the Appendix.)are unobservable random variables.. That is to say. (Section 10.d. t = 1. . . v must be estimated by inserting I?. . are i.]are observable random variables.1) assume that a = 0. . t + 1.3.px. (Section 10. and 3.2) In the model (10.64) but not (10. t = 1. the mean squared error of is smaller than that of the least squares estimator p p p.2. denoted v. on the basis of observations (y. Then we can apply Definition 2 of the Appendix to (10. T = 3. 2.2. In practice. . we .) is a bivariate i. T. The assump tion that (10.18) and then choosing the integer that most closely satisfies v = 2 ( ~ 6 ) . but we shall present only one-the method proposed by Welch (1938). T.i.252 10 I Bivariate Regression Model I Exercises 253 problem.16) is approximately true is equivalent to the assumption that US. (Section 10.9) and (10. (Section 10. We now equate the variances of ve and x:: 3.4) Suppose that y. minimizes z T = ~ ( ~-p .i.]and (x. Also assume that {q). 2.)'. (Section 10.2. 13. Y + v. . and (u.2.2.2) In the model (10. Create your own data by generating {ut} according to the above scheme and calculate and for T = 25 and T = 50. 2. and x.3. .3. P = I . The remaining question. we should determine v by v = 2(V()-'.} and {x.2. = (T2 . Assume (u.' . are chosen: 5. . 
Derive its mean squared error without assuming that or = P.have Evt = EX:.=lxt. and 6. = y$ + ut and x. v. .1. is how we should determine the degrees of freedom v in such a way that vS is approximately X:.4) Give an example of a sequence that satisfies (10.Obtain the probability limit Since vX: = 2v by Theorem 2 of the Appendix. and . 75 3.6) Consider a bivariate regression model y.) and (y. t = 1.)in ascending order and define x(l) 5 x(2) 5 . + u.R1 = c sistency of and & defined by < lim. 2. . (Section 10.i. p .d.80 5.Let S be T / 2 if T is even and (T + 1)/2 if T is odd.3. (Section 10..} are i. .50 Period 2 4. data on hourly wage rates (y) and labor productivity (x) in two periods: Period 1.00 4. 0..30 4.2.3. = 0 and Vu. .. .S. assume also that c= 1 and obtain the probability limit of T T Obtain the mean squared prediction errors of the two predictors. = a px. x . N(0. (Section 10.i.5'1 h= - Pz. 1980-1986. We wish to predict y 5 on the basis of observations (yl.25 99. where (x. For what values of a and p is j5 preferred to j5? 9. yp. (Section 10.]are known constants and {u. 5 x(q. . .2. y3. (Source: Economic Report of the President.69 100. 22 = d < a. limT+ known as an errors-in-variables model. .=I Cx2 7.24 93. with Eu.53 95. . (Section 10.}are known constants and equal to (2.30 5. depending on whether the ith person is male or female. Exercises 255 T 2 of = ~ ~ l ~ . 1992. D. 2. We assume that {u. . 5. . t = 1.40 6. = 0'. p This is 6. u2).86 98.Prove the con- P=7 and X2 .d. 19721979.00 3 where we assume limT. Government Printing Office. 10.60 3.).d. with Eu.70 92. = 1 or 0. = 0 and Vu.4) Consider a bivariate regression model y.}are i.16 99.C. . / ~ .assuming = ~ x . We consider two predictors of y5: + Y : x: 3. T.1) Test the hypothesis that there is no gender difference in the wage rate by estimating the regression model P=Cy.] are i.2) The accompanying table gives the annual U. Also define + + 1 where y. is the wage rate (dollars per hour) of the ith person and x.) Are these estimators better or worse than the least squares e s t i m a m and &? Explain.2..i.. = a fix.30 (1) j5 = & + +x5. p Period 1 8.94 95. Washington. u. . The data are given by the following table: Number of people Male Female 20 10 Sample mean of wage rate 5 4 Sample variance of wage rate 3. 0. where {x.2' - I n .. 4) and {u.I 254 10 1 Bivariate Regression Model I = E.4) In the model of the preceding exercise. = u? Arrange {x. y4). where & and are the least squares estimators based on the first four observations on (x. 2.. and Period 2. 256 10 1 Bivariate Regression Model (a) Calculate the linear regression equations of y and x for each period and test whether the two lines differ in slope. A matrix. assuming that the error variances are the same in both regressions. for example. Symmetric matrices play a major role in statistics. and Bellman's discussion of them is especially good. For the other proofs we refer the reader to Bellman (1970). may be found in a compact paperback volume. chapter 4). (b) Test the equality df the error variances. Graybill (1969) described specific a p plications in statistics. 11. In this chapter we present basic results in matrix analysis. especially with respect to nonsymmetric matrices. or Arnemiya (1985.1 D E F I N I T I O N OF BASIC TERMS Matrix. The multiple regression model with many independent variables can be much more effectively analyzed by using vector and matrix notation. Additional useful results. appendix). here denoted by a boldface capital letter. Anderson (1984. 
(c) Test the equality of the slope coefficients without assuming the equality of the variances. is a rectangular array of real numbers arranged as follows: . . Marcus and Minc (1964). Et6MlLMTS OF M A T R I X A N A L Y S I S I In Chapter 10 we discussed the bivariate regression model using summation notation. appendix). Johnston (1984. For concise introductions to matrix analysis see. we prove only those theorems which are so fundamental that the reader can learn important facts from the process of proof itself. Since our goal is to familiarize the reader with basic results. denoted by A'. If A and B are matrices of the same size and A = {a.1) and Matrix multiplication.1. a. If A and B are matrices of the same size and A A +. Then the transpose of A.. Thus. (A vector will be denoted by a boldface lowercase letter. . From the definition it is clear that matrix multiplication is defined only when the number of columns of the first matrix is equal to the number of rows of the second matrix. jth element is equal to a]. a vector with a prime (transpose sign) means a row vector and a vector without a prime signifies a column vector.1) is a square matrix if n = m.]. A square matrix whose offdiagonal elements are all zero is called a diagonal matrix. = b. .] and B = {b.].1. A in (11.].. 2 b.1) and suppose that n = m (square matrix).1. Sometimes it is more simply written as I. jth element (the element in the ith row and jth column) is u. For example. Matrix A may also be denoted by the symbol {a. we define cA or Ac. Diagonal matrix. For example.. The other elements are off-diagonal elements. jth element is equal to a .258 11 I Elements of Matrix Analysis 11. . Square matrix. If a square matrix A is the same as its transpose. both AB and BA are defined and are square matrices of the same size as A and B. Kcto?: An n X 1 matrix is called an n-component column vector. as in (11.1. then Note that the transpose of a matrix is o b t a i n 4 by rewriting its columns as rows. The exception is when one of the matrices is a scalar-the case for which multiplication was previously defined. Let A be as in (11.] and B B if and only if a. For example. AB and BA are not in general equal.1) and let c be a scalar (that is. we have = {a. If A and B are square matrices of the same size. b' (transpose of b) is a row vector.1. to be an n X m matrix whose i. Then.2 1 Matrix Operations 259 A matrix such as A in (11. Transpose. jth element is ca. A is called a symmetric matrix. Let A be as in (11. 11. Scalar multiplication.]. Normally. An n X n diagonal matrix whose diagonal elements are all ones is called the identity matrix of size n and is denoted by I.) If b is a column vector. every element of A is multiplied by c.. . a square matrix A is symmetric if A' = A.1). Addition or subtraction. Symmetric matrix.2 MATRIX OPERATIONS = {b. and a 1 X n matrix is called an n-component row vector. if the size of the matrix is apparent from the context. is called an n X m (read "n by m") matrix. . A matrix which has the same number of rows and columns is called a square matrix. jth element cll is equal to C&larkbkl. Elements all. Let A be an n X m matrix {azl) let B be an m X r matrix {b. Identity matrix. then we write A = Equality.1).. The following example illustrates the definition of matrix multiplication: is a symmetric matrix. a real number). are called diagonal elements. az2. For example. B is a matrix of the same size as A and B whose i.1. indicating that its i. is defined as an m X n matrix whose i. for every i and j. 
the product of a scalar and a matrix. which has n rows and m columns. C = AB is an n X r matrix whose i. In other words. Let A be as in (11. Then.. However.]. In other words. . 1 ) I ( n . Clearly. a. r .. . a . a. .jI is called the cofactor of the changing the value of /A]. Then. . 2.by [al ( i ) . I I A l =x n 2 (. the second number is an element of a2.3 DETERMINANTS AND INVERSES 1 Alternatively. .a. be the identity matrices of size n and m. r2( i ). .b2. . . element as I 1 11.l~. The determinant of a 1 X I matrix. a Z 3 ) . =1 IfAB is defined. Throughout this section. .1.A 1= { a I . 1 Now we present a formal definition.3.) and let b be a column vector such that its transpose b' = ( b l . Before we give a formal definition of the determinant of a square matrix. given inductively on the assumption that the &terminant of an ( n .l)'+'a. b. a n d l e t A z . namely a'a.1) matrix has already been defined. One can define n! distinct such sequences and denote the ith sequence. ( i ) ] .. THEOREM 1 1 .(i). and consider the sequence [rl ( i ) . . Then it is easy to show that 1. .u p . and so on.3. . .N = 1 for ( a l l . denoted by [A[. First. the determinant may be defined as follows. . . n ] . The proof of the following useful theorem is simple and is left as an exercise. and I. a$$).as (1 1 .A = A and Ah. . Let A be an n X m matrix and let I.. are n X 1 column vectors. denoted by [A/or det A.. N = 0 for the sequence ( a l l . .3 1 Determinants and Inverses 261 is given by [:] :] [Y I: = " In describing AB. we may say either that B is premultiplied by A.Let rl (i) be the row number of al ( i ) . chosen in such a way that none of the elements lie on the same row. a r b = bra.b. Let N ( i ) be the smallest number of transpositions by which [rl ( i ).a l s ) . b e t h e 1 / ( n . ..(i) ] can be obtained from [ l .as2. . ] b e a n n X n m a t r i x . . .a 2 ( i ) . which is called the vector product of a and b . Consider a sequence of n numbers defined by the rule that the first number is an element of al (the first column of A ) . we have arb = Cy=laa. . . Let a' be a row vector ( a l . Vectors a and b are said to be orthogonal if a'b = 0. is called the inner product of a.Then we have n I 11. or that A is postmultiplied by B. . . and so on. i = 1. Consider a 2 X 2 matrix r 1 A = ( a l . . . Its determinant.an.i 260 11 I Elements of Matrix Analysis 11. (AB)' = B'A'.1) matrix obtained by deleting the ath row and the jth column from A. . .). r. 3 e t. by the above rule of matrix multiplication. is the scalar itself. a. . .) I A l= t=l (-1)N(')al(i)a2(i) . respectively. let us give some examples. a. . and N = 2 for (az1. a 2. . = A. in the case of a 3 X 3 matrix.). i / ~ ~ ~ l N l T l 0 ~ 1 1L . . . or a scalar. n!. all the matrices are square and n X n. . 2 . The vector product of a and itself. . .azp. is defined by The determinant of a 3 X 3 matrix where al. I The j above can be arbitrarily chosen as any integer 1 through n without The term (-l)'+J!A. For example. 3 . Then we define the determinant of A. we write A as a collection of its columns: (11. ( i ) ] .1) X ( n . .r 2 ( i ) . 2 . as given in Definition 11.5). Because of the theorem. T H Eo R E M 1 1.3. The determinant of a matrix in which any row is a zero vector is also zero because of Theorem 11. the determinant is zero..3.3.3. is the matrix defined by b 1 = IA'(. a n d D b e matrices such that If any two columns are identical.1 The inverse of a matrix A. 3.4. The proof of Theorem 11.3). T H E O R E M 1 1 . then (AB)-' = B-'A-'. 
The proof of this theorem is apparent from (11. It implies that if AB = I.3. This theorem follows immediately from Theorem 11. then B = A-' and B-I = A..2 and Theorem 11.3.3.3. If any column consists only of zeroes.3.4 If A and B are square matrices of the same size such that (A1 f 0 and 1 B I # 0. we can easily prove the theorem without including the word "adjacent.1.l. THEOREM 1 1 .1 and 11.5 is rather involved.3.1. 3 .3. the determinant I is zero. The theorem follows immediately from the identity ABB-lA-' = I.3. denoted by A-'. (As a corollary.5).I] is the matrix whose z. 3 . but B and C need not be.l is the cofactor of a!.3. I A I This theorem can be proved directly from (11.5).3. THEOREM 11.B.") THEOREM 11.3. but only for a matrix with a nonzero determinant. we may state all the results concerning the determinant in terms of the column vectors only as we have done in (11.3. 7 If the two adjacent columns are interchanged.262 11 I Elements of Matrix Analysis DEF~NITION 11. We now define the inverse of a square matrix.5 is square and ID\ st 0.8 LetA. T H E0 RE M 1 1 . The use of the word "inverse" is justified by the following theorem. Theorem 11.3) and (11. jth element is (-l)''~~A.1. (11. Here (-l)'"l&. 3 This theorem can be easily proved from Definitions 11.2 r provided that bl # 0. since the same results would hold in terms of the row vectors.3.2 11. where 0 denotes a matrix of appropriate size which consists entirely of .6) A-I = - 1 {(-l)'''l~.3.3.. C. and {(-l)'+'IA.2 follows immediately from (11. but can be directly derived from Definition 11.l~. (Note that A and D must be square.3 1 Determinants and Inverses 263 Let us state several useful theorems concerning the detest THEOREM 11.3. since the effect of interchanging adjacent columns is either increasing or decreasing N ( i ) by one. We have same size. the determinant changes the sign.3.) Then 1 ~ = ~ IAl 1IBI if A and B are square matrices of the Proof. yl for any yl and yp. . . K. ~. It can be shown that if A satisfies the said condition.1.BD-'c.y.4. .)' and y = (yl..)' and let A be as in (11. . Ax = y. 1 A s e t o f v e c t o r s x l . Generally. x2. The matrix where E = A .4 1 Simultaneous Linear Equations 265 zeroes.8) is unity and the determinant of A [Dl.2) D E F I N I T I O N 1 1 .8) and using Theorem 11. .4.t. A will denote an n X n square matrix and X a matrix that is not necessarily square. we shall assume that X is nXKwithKSn. V (for any). . .3. 4 . Consider the following n linear equations: I This matrix satisfies the condition because constitute the unique solution to XI = 2yl - n.CA-'B. Therefore. + + does not satisfy the above-mentioned condition because there is dearly no solution to the linear system Proot To prove this theorem. . E-' = A-' A-'BF-'CA-'.3.5 yields (11. Using the notation 3 (there exists).3.3.5). = 0 for all i = 1. . Define x = (xl. ~.5) that the determinant of the first matrix of the left-hand side of (11.8) is equal to I taking the determinant of both sides of (11. . y2. l = 0 implies c. Then (11. . . the solution to Ax = y is unique. . F = D . We can ascertain from (11. and F-' = D-' D-'cE-'BD-'.x:! = n .4 SIMULTANEOUS LINEAR EQUATIONS 3 1 since any point on the line xl + 2x2 = 1 satisfies (11. Next consider Throughout this section. in which the major results are given as a series of definitions and theorems. In general. if A is such that Ax = y has no solution for some y. x ~ i s s a i d to belinearly =x.264 11 I Elements of Matrix Analysis 11. (such that).1) with n = m.7). 
it is linearly dependent. We now embark on a general discussion. we can express the last clause of the previous sentence as THEOREM 1 1 .1) can be written in matrix notation as (114. .3.x. O BD-'cI 1 1i 1 A major goal of this section is to obtain a necessary and sufficient condition on A such that (11.2) can be solved in terms of x for any y. 2.4. provided that the inverse on the left-hand side exists. Otherwise independent if ~ f < c.3. it has infinite solutions for some other y. 9 Let us consider a couple of examples. and s. . the right-hand side of (11. simply premultiply both sides by ' 1 But there are infinite solutions to the system : 11. 3 . 3. . . IfX' is K X n.) and noting that (el. . A is CI. .a2. ( ProoJ Assume K = n .4. . Assume that it is true for n. a2.4. where K < n. Ax = y. O THEOREM 11. .2. .) + x3F(a3.2 and the A l = 0.x.8) is zero by Theorem 11. DE F I N ITI o N 1 I . .5. we say the matrix is row independent (RI) .Xf is not CI.4 1 Simultaneous Linear Equations 267 For example. where xl is the first element of x. 8 Pro$ (a) If IAl f 0.1 equations for which the theorem was assumed to hold. . .1 v y 3 x s.t. We prove the theorem by induction. = en may be summarized as AX = I by setting X = (xl. THEOREM 11. .5 A is CI + IAi # 0.4.a Combining Theorems 11. y) as X' in Theorem 11. c2. But (1. the vectors ( 1 . .+~) and assume without loss of generality that xn.) + . . y)c = 0. Q IA~ F(Ax. Therefore 1 The converse of Theorem 11.4.) and IAl = F(al. and consider n + 1. Ax = y.. Assume xl # 0 without loss of generality. Premultiplying both sides by A yields Ax = y because of Theorem 11. .3.en) = I. . .3. a2.cn+l)'?Write the nth row of X' as (xnl. if the row vectors are linearly independent. But the left-hand side of (11.t. x2. . 2) are Imearly iaadepenht because 2 only for cl = c = 0.x. f 0 follows from Theorem 11.3. Ax. as. . . D THEOREM 11. Then Axl = el. Since 1 1 1 = 1.4. . .4....4.4. .4.. we say the matrix is column independent (abbreviated as CI).4) are linearly dependmt hecause c2 = I:[ I:[ + [:] The theorem is clearly true for n = 2. .K row vectors of zeroes at the bottom of X'.3 A is CI =3 can be satisfied by setting cl = -2 and c p = 1.3. for otherwise we can affix n . we prove three other theorems first. is more difficult to prove.2) and (2. Pmo$ Write A = (al.1. . Since A is not CI.5. Q right-hand side is xlbl by Theorem 11. a. e2.a. .3.p. x. namely. . So the prescribed c exists. . (%) Let e. there is a vector x # 0 such that Ax = 0. a. If the column vectors of a matrix (not necessarily square) are linearly independent.4. But x # 0 means that at least one element of x is nonzero. where X' is n X (n + 1) and c = (cl. we have ( 1 .4 IAl f 0 a v y 3 x s. y) is not CI. ..) = ~11-41 + x2F(a2. Set x = A . Therefore there exists c f 0 such that (A.1 .4.4. Solving the last equation of X'c = 0 for cn+l and inserting its value into the remaining equations yields n .3 and 11.2 shows that (A.a2. Prooj Using the matrix (A. . ap. .' ~ . Is there a vector c f 0 such that X'c = 0. l ) and (1.+~f 0. From Definition 11. . A i s n o t CI =-3 b l = 0. be a column vector with 1 in the ith position and 0 everywhere else. .4.4. the coefficient on y in c is nonzero and solving for y yields Ax = y.). THEOREM 11. therefore. A-"exists.2 From the results derived thus far.4 immediately yields the converse of Theorem 11. we conclude that the following five statements are equivalent: [A[f 0. . A-' exists by Definition 11. .1. . . Since A is CI. a. 
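The role of a nonzero determinant in solving simultaneous linear equations, and the cofactor (adjugate) formula for the inverse, can be illustrated as follows. This is only a sketch: inverse_adjugate is a hypothetical helper, not a library function, and the coefficient matrices are invented.

```python
import numpy as np

def inverse_adjugate(A):
    """Inverse via the cofactor formula A^{-1} = adj(A)/|A| (illustration only)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    cof = np.empty_like(A)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, i, axis=0), j, axis=1)
            cof[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return cof.T / np.linalg.det(A)

A = np.array([[2., 1.],
              [1., 3.]])
y = np.array([1., 2.])
print(np.linalg.solve(A, y))                     # unique solution since |A| != 0
print(np.allclose(inverse_adjugate(A), np.linalg.inv(A)))

# A singular coefficient matrix: the two equations are proportional,
# so Ax = y has either no solution or infinitely many.
S = np.array([[1., 2.],
              [2., 4.]])
print(np.isclose(np.linalg.det(S), 0.0))         # True: no unique solution exists
```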
a3.266 11 I Elements of Matrix Analysis 11.1 without loss of generality. stated below as Theorem 11. .6.2 THEOREM 11.1. a. Ax2 = e2. . r ) matrix.K such that Z'X = 0.2.4. suppose that rank(X) = r < K. Premultiplying by S t yields S'Sc = 0. z. where Y is an arbitrary r X (K . To prove the "if" part. is called the rank of X. there exists avector zl # 0 such that X'z. where K 5 n.11.4. Then there exists an n X ( n . 4 .4. z)( # 0. which is a contradiction.1 and 11. where B. note that X'Xc = 0 =0 c1X'Xc = 0 + c = 0. The remainder of this section is concerned with the relationship between nonsingularity and the concept called full ran$ see Definition 11..Y) is RI.K. we say A L nonsingulal: If A is nonsingular.268 11 I Elements of Matrix Analysis 1 1. . But since CR(X) = RR(X). 4 . If rank(X) = min(number of rows. If RR(X) < K. where K < n.10. Therefore. Z) is nonsingular and X'Z = 0. . d = 0. Suppose rank(X) = r > K .5 below. Then by Theorem 11. RR(X) = K.2. where K 5 n. we can similarly show RR(X) r CR(X). If W'X = 0 for some matrix W with n columns implies that rank(W) 5 n . Then CR[(X. k:] . number of columns). In this method x. Therefore. is determined by fill rank. c = 0. Next. Collect these n . Because of Theorem 11.2. Then rank(X) 5 K. but this contradicts the assumption that X is CI.4. note that Xc = 0 3 X'Xc 0 Proof. Premultiplying by Z' yields Z'Zd = 0. z1)'z2 = 0. Again by Theorem 11. Then. by Theorem 11. and so on. Let S be r linearly independent columns of X. B y reversing the rows and columns in the above argument. M~:B daat a q w e matrix is full rank if and only if it is nonsingular.. is the n X n matrix obtained by replacing the ith column of A by the vector y. alternative solution is given by CramerS rule. the ith element of x. we say X is Let X be a matrix not necessarily square. 7 where D = Z'Z is a diagonal matrix. 4 . 1 0 ProoJ Suppose that X is n X K. R THEOREM 1 1 . Clearly.4. 6 Let an n X K matrixx be full rank. (x'xI f 0 and ID\ # 0. o D EF 1 N ITI 0 N 1 1. we can solve (11.4 / Simultaneous Linear Equations 269 D E F I N IT I o N 1 1 . For an arbitrary matrixx.5 CR(X) or. and CR(X) = r 5 K.K ) matrix Z such that (X. Then (XI1. RR(X). Z)] > n. 4 .8. there exists a vector cw # 0 such that X a = 0 because of Theorem 11.3. 4 . 9 Proof. THEOREM 1 1 . Let S be as defined above.' ~ An .8. Therefore RR(X) 2 CR(X).4. Ll T H Eo R E M 1 1. X'Z = 0. I(x. . Suppose Sc + Zd = 0 for some vectors c and d.5. RR(X1) = r. .K vectors and define Z as (zl. THEOREM 1 1 .4. + c = 0. We have . we denote the maximal number of linearly independent column vectors of X by CR(X) (read "column rank of X") and the maximal number of linearly independent row vectors of X by RR(X) (read "row rank of X")..4. there exists a vector z2 # 0 such that (X. . z.4. Proof.4.3 If any of the above five statements holds. 8 An n X K matrix X. by Theorems 11. is full rank if and only if X'X is nonsingular. equivalently. By the same theorem.4. DEFINITION 11.4.6.K) matrix Z of rank n . and suppose that there exists an n X ( n .4 THEOREM 1 1 . then rank(X) = K. = 0. Let X1 consist of a subset of the column vectors of X such that Xl is n X r and CI. K 5 n. TO prove the "only if' part.3.2) for x as x = A . is r X r and RI. = [ X'X 0 9 Pro$ RR(X) > K contradicts Theorem 11.-K). Let an n X K matrix X. . Let Xll consist of a subset of the row vectors of XI such that XI. be C1. Then. 4. 2 where A is a diagonal matrix. Premultiplying (1 1. (115.-'). Then clearly A-' = D(x.5. if we ignore the order in which they are arranged. 
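Linear independence of a set of column vectors can be checked numerically through the rank. A small sketch, with vectors chosen to mimic the kind of examples discussed here (the specific numbers are illustrative):

```python
import numpy as np

# The columns (1,1) and (1,2) are linearly independent; appending their sum
# produces a linearly dependent set of three vectors.
X1 = np.array([[1., 1.],
               [1., 2.]])
X2 = np.column_stack([X1, X1 @ np.array([1., 1.])])   # third column = col1 + col2

print(np.linalg.matrix_rank(X1))   # 2: the columns are linearly independent
print(np.linalg.matrix_rank(X2))   # still 2: the added column is redundant

# For any matrix, the column rank equals the row rank.
print(np.linalg.matrix_rank(X2) == np.linalg.matrix_rank(X2.T))
```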
(To see this. T H E0 R E M 1 1 5 .2) and using Theorem 11. a square matrix satisfying H r H = I) such that (11. We can write Y'X = Y'B-'BX.1) has enabled us to reduce the calculation of the matrix inversion to that of the ordinary scalar inversion.. For example. More generally. in the ith diagonal position. rank(BX) = rank(X) if B is nonsingular (NS).1) by H and H' respectively. 5 .5. Thus the orthogonal diagonalization (11. p.5. rank(X) = K . for example. Since B is NS.9. Inverting both sides of (11. Theorem 11. Denote A by D(A.6) Ah = hh and (115.r) matrix W such that S'W = 0 and rank(W) = n . 0 THEOREM 1 1. The reader should verify. Premultiplying and postmultiplying (11. Therefore. The diagonal elements of A are called the characteristic roots (or eigenualues) of A. and noting that HH' = H'H = I.54 f (A) = HD[f(h.5) A'(=AA) = H D ( A )HI. Proof. which is the ith diagonal element of A.re) X n matrix Y such that Y'X = 0.1) H'AH = A.Then a'Z1 = 0 because B is NS. Then.4. there exists an orthogonal . But this contradicts the assumption of the theorem. matrix H (that is. Proo_f: See Bellman (1970.rl) X n matrix Z' such that Z'BX = 0.5.) Therefore r2 5 rl by the first part of Theorem 11.2 is related to the usual inverse of a scalar in the following sense.10. 2.3. suppose that a'ZrB = 0 for some a.3.4. Let rank(BX) and rank(X) be rl and rz. The following theorem about the diagonalization of a symmetric matrix is central to this section. We shall often assume that X is n X K with K 5 n.5.1) by H yields . and rl = rz. Throughout this section.270 11 I Elements of Matrix Analysis 11. which play a major role in multivariate statistical analysis. we obtain (115.1 1 For any matrix X not necessarily square. there exists afull-rank (n . since Z is full rank. Y'B-' is full rank. the inverse of a matrix defined in Definition 11. By Theorem 11.7) I A . Therefore rl 5 r:. 54).2) A =Hm'. But this implies that a = 0.r. Note that H and A are not uniquely determined for a given symmetric matrix A. The ith column of H is called the characteristic vector (or eigenuector) of A corresponding to the characteristic root of A. O 11.).)]H1. indicating that it is a diagonal matrix with A.A11 = 0. that (115. there exists an n X ( n . ~ For any symmetric matrix A. a matrix operation f (A) can be reduced to the corresponding scalar operation by the formula (11.1 since H'H = I implies H-' = H'. Z'B is also full rank. I I Given a symmetric matrix A.5 1 Properties of the Symmetric Matrix 271 1 4 3 by Theorem 11. Proof.4.5. Also by Theorem 11. how can we find A and H? The faglowing theorem will aid us.1 is important in that it establishes a close relationship between matrix operations and scalar operations.5 PROPERTIES OF THE SYMMETRIC MATRIX H.9.9 there exists a full-rank (n .7 yields Now we shall study the properties of symmetric matrices. The set of the characteristic roots of a given matrix is unique. however. A will denote an n X n symmetric matrix and X a matrix that is not necessarily square. respectively. Clearly. since H'AH = A would still hold if we changed the order of the diagonal elements of A and the order of the corresponding columns of 14 Let be a characteristic root of A and let h be the corresponding characteristic vector.4. THEOREM 1 1 . 5.7 by Theorem 11. The characteristic roots of any square matrix can be also defined by (11. have Therefore the characteristic roots are 3 and -1. we obtain xl = x2 = *-I.) The diagonalization (11. = A.5.1 proves (11.1) can be written in this case as . 
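Cramer's rule and the equivalence between full column rank of X and nonsingularity of X'X can both be illustrated numerically. A minimal sketch, in which cramer_solve is a hypothetical helper and the matrices are invented:

```python
import numpy as np

def cramer_solve(A, y):
    """Solve Ax = y by Cramer's rule: x_i = |A_i| / |A|, where A_i is A with
    its ith column replaced by y.  Illustration only; requires |A| != 0."""
    A = np.asarray(A, dtype=float)
    detA = np.linalg.det(A)
    x = np.empty(A.shape[0])
    for i in range(A.shape[0]):
        Ai = A.copy()
        Ai[:, i] = y
        x[i] = np.linalg.det(Ai) / detA
    return x

A = np.array([[2., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])
y = np.array([1., 0., 1.])
print(np.allclose(cramer_solve(A, y), np.linalg.solve(A, y)))

# An n x K matrix X (K <= n) is full rank exactly when X'X is nonsingular.
X = np.array([[1., 0.], [1., 1.], [1., 2.]])
print(np.linalg.matrix_rank(X) == X.shape[1],
      abs(np.linalg.det(X.T @ X)) > 1e-12)
```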
The following are useful theorems concerning characteristic roots.11 by Theorem 11. is the ith diagonal element of A and h. p.is the ith column of H. we obtain yl = fi-' and y2 = -fi-'~ (yl = -@-I and y2 = @-I also constitute a solution. be the largest and the smallest characteristic roots. Proof. From this definition some of the theorems presented below hold for general square matrices. p.272 11 I Elements of Matrix Analysis THE 0 RE M 1 1 . Using (11. we shall prove it only for symmetric matrices. See Bellman (1970. of an n X n symmetric matrix A.5. 5 . O Let us find the characteristic roots and vectors of the matrix The rank of a square matrix is equal to the number of its nonzero characteristic roots.6). where A.6) as (A . respectively.4.11 .4. the nonzero characteristic roots of XY and YX are the same. ProoJ: See Bellman (1970.5.we have .7). 4 = rank(A) [: I:[ = 3 [I:] and xy + xf = 1 For any matrices X and Y not necessarily square. Whenever we speak of the characteristic roots of a matrix. Then for every nonzero n-component vector x.XI)h = 0 and using Theorem 11..5 simultaneously for yl and yp. simultaneously for x l and xp.5. we have - - = rank(AHr) = rank(HA) by Theorem 11. Proof.7).4.4. Using (11. Writing (11. 96).5. 56). Even when a theorem holds for a general square matrix.5.5.6 Let XI and A . Then A and B can be diagonalized by the same orthogonal matrix if and only if AB = BA.8) yields Ah. the reader may assume that the matrix in question is symmetric. Suppose that n l of the roots are nonzero.2). This proves (11. We shall prove the theorem for an n X n symmetric matrix A.1) and HH' = I . Solving THE 0 R E M 1 1 .h. T H E o R E hn 1 1 .5 ( Properties of the Symmetric Matrix 273 Singling out the ith column of both sides of (11. Let A and B be symmetric matrices of the same size.5.5. whenever both XY and YX are defined.3 11. Solving Proof. T H E O R E M 1 1 5 . we write A > 0.1 and 11.3). Taking the determinant of both sides of (11.11) 1 Properties of the Symmetric Matrix = 275 where z = H'x. If x'Ax 2 0. we have . We deal only with symmetric matrices.7 tr A = tr HAH' tr AH'H = tr A .X.5. T H E0 RE M 1 1 .8.) If A is positive definite. Therefore A > 0 implies d'Ad > 0 and X'AX > 0.9 LetAbeannXnsymmetricmatrixandletXbean n X K matri'k where K 5 n.5.5. T H E O R E M 1 1.5. If X is full rank. then A > 0 3 X'AX > 0. (If A is diagonal.5." or "nonpositive.2) and Theorem 11.12 There is a close connection between the trace and the characteristic roots.1 1 A >0 J A-I > 0.9) follow from e'(hlI .1 0 + Proof: We shall prove the theorem only for a symmetric matrix A.Aisposztivedefinite The determinant of a square matrix is the product of its characteristic roots. A symmetric matrix is positive definite if and only if its characteristic roots are all positive.B is positive definite.5.") Proof.5. is another important scalar representation of a matrix. T H E O R E M 1 1.5. Proof: We shall prove the theorem only for a symmetric matrix A. (Negative definite and nonpositive definite or negatzve semzdefinite are similarly defined.5. Then csX'AXc = dsAd. The following theorem establishes a close connection between the two concepts.5. Let X and Y be any matrices.I)z 2 0. H'H = I implies I H I = 1. R Each characteristic rootbf a matrix can be regarded as a real function of the matrix which captures certain characteristics of that matrix.5. then d # 0.2 1 IfAisannXnsymmetricmatrix. D E F I N I T I o N 1 1 .5.5. if rank(X) = K. we write A > B. T H E0 RE M 1 1 .8 THEOREM 11. if A . tr XY = tr YX. 
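The orthogonal diagonalization H'AH = Lambda of a symmetric matrix can be computed with a standard eigenvalue routine. In the sketch below, the 2 x 2 matrix is a reconstruction chosen so that its characteristic roots are 3 and -1, consistent with the roots worked out in the text; the matrices X and Y are arbitrary.

```python
import numpy as np

A = np.array([[1., 2.],
              [2., 1.]])            # symmetric, with roots 3 and -1

lam, H = np.linalg.eigh(A)          # columns of H are orthonormal eigenvectors
print(lam)                          # [-1.  3.]
print(np.allclose(H.T @ A @ H, np.diag(lam)))   # H'AH = Lambda
print(np.allclose(H @ H.T, np.eye(2)))          # H is orthogonal

# A matrix function reduces to a scalar operation on the roots:
# f(A) = H f(Lambda) H', here with f the squaring function.
print(np.allclose(H @ np.diag(lam**2) @ H.T, A @ A))

# The nonzero characteristic roots of XY and YX coincide.
X = np.array([[1., 0.], [2., 1.], [0., 1.]])    # 3 x 2
Y = np.array([[1., 1., 0.], [0., 1., 2.]])      # 2 x 3
ev_xy = np.linalg.eigvals(X @ Y)                # three roots, one of them ~0
ev_yx = np.linalg.eigvals(Y @ X)                # two roots
print(np.sort(ev_xy.real)[-2:], np.sort(ev_yx.real))
```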
0 IAJ We now define another important scalar representation of a square matrix called the trace. we say that A is nonnegative dejinite or positive semidejinite. D E F I N I T I O N 11. Moreover. Proof: The theorem follows from Theorem 11. The inequalities (11. we have X'AX r 0. not necessarily square. The determinant of a matrix.6. For nonnegative definiteness. R The following useful theorem can be proved directly from the definition of matrix multiplication.2) and using Theorems Similarly. and define d = Xc. The trace of a square matrix is the sum of its characProof. A > 0 does imply that all the diagonal elements are positive. Using (11." "negative. Let c be an arbitrary nonzero vector of K components.1 The trace of a square matrix.10. 0 teristic roots. The inequality symbol should not be regarded as meaning that every element of A is positive.) More generally.274 11 I Elements of Matrix Analysis < 11. R We now introduce an important concept called positive definiteness. is defined as the sum of the diagonal elements of the matrix. R THEOREM 11.3. Then A 2 0 *X'AX 2 0.h ) z 2 O and z l ( A . since the characteristic roots of A-' are the reciprocals of the characteristic roots of A because of (11. implies the theorem. such that XY and YX are both defined. since the determinant of Therefore a diagonal matrix is the product of the diagonal elements.3. (The theorem is also true if we change the word "positive" to "nonnegative. which we examined in Section 11. which plays an important role in statistics.5 yields IA~ = IHI' I A~. Then.Since A 2 0 implies d'Ad 2 0. we use the symbol 2.3.5 (11. if x'Ax > 0 for every n-vector x such that x 0.5. 11. The theorem follows immediately from Theorem 11. denoted by the notation tr. = [ A ( which .5.5. Since ( X .5 1. Then A r B + B-' 2 A-'. Hence we call P a projection matrix. For example.5. A = ~ ( -0 0 ) ( 0 .0 ) ' and B = ~ ( -8 0 ) (8 .13). where X is an n X K matrix of rank K. By Theorem 11.Z) . the converse is not necessarily true. A = [:. Unfortunately.12). My = y2. 1 5 1 . since the resulting vector yl = Xcl is a linear combination of the columns of X .5. we cannot always rank two estimators by this definition alone.2.P - M ) ( X . T HE0R E M 1 1. 5 . where Z is as defined in the proof of Theorem 11.' X I . it implies that c'0 is at least as good as c'0 for estimating c ' 0 for an arbitrary vector c of the same size as 0 .K ) matrix Z such that ( X .14) by ( X . We must use some other criteria to rank estimators. 5 .Z) = 0 . We have (11.1 to the case of vector estimation. In (11. and in (11.5. Then clearly Pyl = yl and Py2 = 0 . there exists an n X (n . The proof is somewhat involved and hence is omitted.5. 5 . Z) is nonsingular. This matrix plays a very important role in the theory of the least squares estimator developed in Chapter 12.4. consider (11. More generally. Let A and B be their respective mean squared error matrix.5. (Both A and B can be shown to be nonnegative definite directly from Definition 11. Then we say that 8 is better than 0 if A 5 B for any parameter value and A f B for at least one value of the parameter.5.trA < tr B.0 ) ' .(X.14) (I .9. It immediately follows from Theorem 11. R + Note that if 0 is better than 8 in the sense of this definition. The question we now pose is.P = M. that is. THEOREM 1 1 . 1 6 P = P' = p2. In each case. In the remainder of this section we discuss the properties of a particular positive definite matrix of the form P = x ( x ' x ) . Note that A < B implies tr A < tr B because of Theorem 11.14.5. 
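The relations between the trace, the determinant, and the characteristic roots, and the characterization of positive definiteness, can be verified numerically. A sketch with an invented symmetric matrix:

```python
import numpy as np

A = np.array([[4., 1., 0.],
              [1., 3., 1.],
              [0., 1., 2.]])            # symmetric (illustrative values)
lam = np.linalg.eigvalsh(A)

print(np.isclose(np.trace(A), lam.sum()))        # trace = sum of the roots
print(np.isclose(np.linalg.det(A), lam.prod()))  # determinant = product of the roots

# A symmetric matrix is positive definite iff all its roots are positive,
# in which case x'Ax > 0 for every nonzero x.
print(np.all(lam > 0))
x = np.array([1., -2., 3.])
print(x @ A @ x > 0)

# Nonnegative definiteness is inherited: if A >= 0, then X'AX >= 0.
X = np.array([[1., 0.], [1., 1.], [0., 2.]])
print(np.all(np.linalg.eigvalsh(X.T @ A @ X) >= -1e-12))
```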
Theorem 11. 0 is at least as good as 8 for estimating any element of 0 .] and B = [2 O 0 0.2. Z )= (X. Proof: See Bellman (19?0. An arbitrary n-dimensional vector y can be written as Proof. Any square matrix A for which A = called an idempotent m a t r i x . Ci THEOREM 1 1 .2. Let 0 and 8 be estimators of a vector parameter 0 . Recall that in Definition '7. .(0. p. 3 I A l < IBI.5. Postmultiplying both sides of (11. We call this operation the projection of y onto the space spanned by the columns of X .2. The projection matrix M = z(z'z)-'z'.16 states that P is a symmetric idempotent matrix. z)-' yields the desired result. plays the opposite role from the projection matrix P.5. .5. there exists an n-vector c such that y = ( X .1.) D E F I N ITI o N 1 1 . Z ) c = Xcl Zcp. It can be also shown that A < B implies IAl < IBI. Next we discuss application of the above theorems concerning a positive definite matrix to the theory of estimation of multiple parameters. The two most commonly used are the trace and the determinant.5.14 that Py = yl. Set yl = Xcl and y2 = Zc2.O) . is This can be easily verified.14 y = yl + y2 such that Pyl = yl and Py2 = 0 . and A > B 3 B-' > A-'. 9 3 ) . How do we compare two vector estimators of a vector of parameters? The following is a natural generalization of Definition 7. Thus we see that this definition is a reasonable generalization of Definition 7. In neither example can we establish that A 2 B or B r A.5 1 Properties of the Symmetric Matrix 277 T H € 0 RE M 1 1 . 5 . Z) is nonsingular and X'Z = 0. 1 3 Let A and B be symmetric positive definite matrices of the same size.9. Proof. Namely.1 we defined the goodness of an estimator using the mean squared error as the criterion.12.- - 276 11 Elements of Matrix Analysis 11. and second. where .5. there exists an n X (n .) R THEOREM 1 1.278 11 1 Elements of Matrix Analysis I A= Exercises 279 THEOREM 1 1 .3) Using Theorem 11. + X. by using the inverse of the matrix.10.4 the nonzero characteristic roots of x(x'x)-'x'and (x'x)-'X'X are the same. 5 . Suppose PW = 0 for some matrix W with n rows. But since the second matrix is the identity of size K . ] then x ~ . [:4k:] I:[ = 6. x2.5. (An alternative proof is to use Theorem 11. As we have s h o w in the proof of Theorem 11.~ x .)-"x. 5.PartitionXas X = (XI.18 below.we have x(x'x)-'x' = xl(x. (Section 11. first. which in turn implies A = 0. (Section 11. W = XA + ZB for some matrices A and B. prove its corollary obtai word "adjacent" from the theorem.4. Therefore W = ZB. verify it for the A and B given in Exercise 3 above. which implies rank(W) 5 n . Thus the theorem follows from Theorem 11. 3. and second. (Section 11. PW = 0 implies XA = 0.5. 1 Proof. Since. The theorem follows from noting that 2. first. and xs. S . 5 .*)-'X.14.*(~. by using Cramer's rule: Characteristic roots of P consist of K ones and n . by Theorem 11. Proof.3. If you cannot prove it.3) Venfy = IAl IBI.K.4) Solve the following equations for xl for x2.3) (A + B)-' = A-'(A-\ fB-')-'A-' whenever all the Prove A-" inverses exist.5. (Section 11. [::] and B = [: i i.K) full-rank matrix Z such that PZ = 0.3. By Theorem 11.~X.4) Solve the following equations for xl. its characteristic roots are K ones.x ~ ( ~ .3 and Theorem 11. (Section 11. If we define X l = [I . (Section 11. by using Cramer's rule: I 7.K zeroes.Xp) such that X1 is n X K I and Xg is n X K2 and K1 + K2 = K . 1 9 L e t X b e a n n X KmatrixofrankK.x. by using the inverse of the matrix.*~. I 8 4.14. x ~ ) . 
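The properties of the projection matrix P = X(X'X)^{-1}X' discussed here (symmetry, idempotency, rank K, and the orthogonal decomposition of an arbitrary vector) are easy to check numerically. A sketch with randomly generated data; the sizes and the random seed are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 6, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # n x K, full rank

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

print(np.allclose(P, P.T), np.allclose(P, P @ P))   # P is symmetric idempotent
print(np.isclose(np.trace(P), K))                   # rank(P) = tr(P) = K

y = rng.normal(size=n)
y1 = P @ y                                          # projection onto the columns of X
y2 = M @ y                                          # the orthogonal component
print(np.allclose(y, y1 + y2), np.isclose(y1 @ y2, 0.0))
print(np.allclose(P @ X, X))                        # PX = X, so MX = 0
```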
R THEOREM 1 1 .5.4) Find the rank of the matrix Proof. 1 7 rank(P) = K. and PI. = 0 and Vu. T are known constants.8.5) Find the inverse of the matrix I dimension as I. We define the multiple linear regres sion model as follows: and its characteristic vectors and roots. = a ' . . .i.5. PZ. with Eu. . it is positive definite.280 11 I Elements of Matrix Analysis and compute A' 10. As before. The multiple regression model should be distinguished from the multivariate regression model. . a boldface capital letter will denote a matrix and a boldface lowercase letter will denote a vector.) are observable random variables.4. after we rewrite (12.1 INTRODUCTION 13. In this chapter we consider the multiple regression model-the regression model with one dependent variable and many independent variables. 12. (Section 11. . (Section 11. which refers to a set of many regression equations.1) in matrix notation. . The results of Chapter 11 on matrix analysis will be used extensively.}. {xi. The organization of this chapter is similar to that of Chapter 10.i = 1. (u.5) Suppose that A and B are symmetric positive definite matrices of the same size. 15.d.5) Compute 5. PR and (r2 are unknown parameters that we wish to estimate. We shall state the assump tion on {x. . (Section 11. . except for the discussion of F tests in Section 12.5) Let A be a symmetric matrix whose characteristic r o w are less than one in absolute value. - 12 MULTIPLE REGRESSION MODEL Prove Theorem 11.]later. (Section 11. . (Section 11. K and t = 1. The linear regression model with these assumptions is called the classical regression model. 2.5) Define r + xx' where x is a vector of the same In Chapter 10 we considered the bivariate regression model-the regression model with one dependent variable and one independent variable. 14.] are unobservable random variables which are i. . Compute x(x'x)-'X where {y. . . 2. Show that 12.1. Show that if AB is symmetric. . Most of the results of this chapter are multivariate generalizations of those in Chapter 10. 5) fully. We denote the columns of X by x(. . 2 (12. K. . .13) y = X$ + u.) The assumption that X is full rank is not restrictive.X(*). Therefore we can rewrite the regression equation (12. .5) is of the size T.=~u:. Then (12. in the multiple regression model (12. t = 1. . 1 We assume rank(X) = K.)imply in terms of the vector u (121. Define the Kdimensional row vector $ = (xtl. But we shall not make this assumption specifically as part of the linear regression model. by Definition 11.8. Note that u'u. .2. Taking the trace of both sides of (12. To understand (12.2). the reader should write out the elements of the T X T matrix uu'. .1) are defined as the values of {P. Suppose rank(X) = K1 < K.1. PK)I .3) as 12. .A).2.Then we can write X2 = XIA for some K1 X K2 matrix A.4. .1.2 In practice. . . x ( ~is .). .1) in vector and matrix notation in two steps.. .1 Definition The least squares estimators of the regression coefficients {PC]. 0 denotes a column vector consisting of T zeroes.xT). Then we can rewrite (12. we can find a subset of K1 columns of X which are linearly independent.% ! 1 yt = xlP + u.1. .4.p ) ' and u = (ul. Note that in the bivariate regression model this assumption is equivalent to assuming that x..1. . T is a scalar and can be written in the summation notation as c. . (See Theorem 11. . Thus. .8 yields Eu'u = U'T. . Our assumptions on {u. is not constant for all t. and hence X = XI(I. i = 1.1) with respect to pcand equating the partial derivative to 0. 
Note that x_t'β, being a row vector times a column vector, is a scalar. Define the K-dimensional row vector x_t' = (x_t1, x_t2, . . . , x_tK) and the K-dimensional column vector β = (β_1, β_2, . . . , β_K)'. Then we can rewrite (12.1.1) as

(12.1.2)   y_t = x_t'β + u_t,   t = 1, 2, . . . , T.

We shall rewrite (12.1.2) in vector and matrix notation. Define the column vectors y = (y_1, y_2, . . . , y_T)' and u = (u_1, u_2, . . . , u_T)', and define the T X K matrix X whose tth row is equal to x_t', so that X' = (x_1, x_2, . . . , x_T). Then (12.1.2) can be written as

(12.1.3)   y = Xβ + u.

Although we have simplified the notation by going from (12.1.1) to (12.1.3), the real advantage of matrix notation is that we can write the T equations in (12.1.2) as a single vector equation. We shall denote a vector consisting of only zeroes and a matrix consisting of only zeroes by the same symbol 0; the reader must infer what 0 represents from the context. In (12.1.4) below, 0 denotes a column vector consisting of T zeroes.

Our assumptions on {u_t} imply, in terms of the vector u,

(12.1.4)   Eu = 0

and

(12.1.5)   Euu' = σ²I,

where the identity matrix on the right-hand side of (12.1.5) is of the size T. To understand (12.1.5), the reader should write out the elements of the T X T matrix uu'. Note that u'u, a row vector times a column vector, is a scalar and can be written in the summation notation as \sum_{t=1}^{T} u_t^2. Taking the trace of both sides of (12.1.5) and using the properties of the trace given in Chapter 11 yields Eu'u = σ²T.

We denote the columns of X by x_(1), x_(2), . . . , x_(K), so that X = [x_(1), x_(2), . . . , x_(K)]. We assume rank(X) = K, which is equivalent to assuming that x_(1), x_(2), . . . , x_(K) are linearly independent. Another way to express this assumption is to state that X'X is nonsingular (see Section 11.4). Note that in the bivariate regression model this assumption is equivalent to assuming that x_t is not constant for all t.

The assumption that X is full rank is not restrictive, because of the following observation. Suppose rank(X) = K_1 < K. Then we can find a subset of K_1 columns of X which are linearly independent. Without loss of generality assume that the subset consists of the first K_1 columns of X, and partition X = (X_1, X_2), where X_1 is T X K_1 and X_2 is T X K_2 and K_1 + K_2 = K. Then we can write X_2 = X_1 A for some K_1 X K_2 matrix A, and hence X = X_1(I, A). Therefore we can rewrite the regression equation (12.1.3) as y = X_1 β_1 + u, where β_1 = (I, A)β, and X_1 is full rank.

In practice the first column of X is usually taken to be the vector consisting of T ones (the constant term). But we shall not make this assumption specifically as part of the linear regression model, for most of our results do not require it. If we assume K = 2 and put x_(1) = 1 and x_(2) = x, the multiple regression model is reduced to the bivariate regression model discussed in Chapter 10.

12.2 LEAST SQUARES ESTIMATORS

12.2.1 Definition

The least squares estimators of the regression coefficients {β_i}, i = 1, 2, . . . , K, in (12.1.1) are defined as the values of {β_i} which minimize the sum of squared residuals

(12.2.1)   S(β) = \sum_{t=1}^{T} (y_t - x_t'β)^2.

Differentiating (12.2.1) with respect to β_i and equating the partial derivative to 0, we obtain the K equations

(12.2.2)   \sum_{t=1}^{T} x_{ti} y_t = \sum_{j=1}^{K} β_j \sum_{t=1}^{T} x_{ti} x_{tj},   i = 1, 2, . . . , K.
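To make the matrix formulation concrete, the model can be generated directly in the form y = Xβ + u. The following NumPy sketch is purely illustrative; the parameter values, sample size, and random seed are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(1)
T, K = 50, 3
beta = np.array([1.0, 0.5, -2.0])         # illustrative parameter values
sigma = 1.5

# X contains a column of ones (the constant term) plus K - 1 regressors.
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
u = rng.normal(scale=sigma, size=T)       # u_t i.i.d. with Eu_t = 0, Vu_t = sigma^2
y = X @ beta + u                          # all T equations y_t = x_t'beta + u_t at once

print(y.shape, X.shape, np.linalg.matrix_rank(X) == K)
```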
.13) represents the decomposition of y into the vector which is spanned by the columns of X and the vector which is orthogonal to the columns of X. .4) an explicit formula for a subvector of f i . we can rewrite (12. As a generalization of (10. It is useful to derive from (12.We put PI = @. whose properties were d i s cussed in Section 11. Premultiplying (12.2..2).2. .12) by X' and noting that X'M = 0. + + ..2.284 12 1 Multiple Regression Model 12.xT)'.2.2.2.14.2. .sidtds is defined by where C should be understood to mean c : = . . we obtain S(p) = (y . .1) in vector notation as (12. unless otherwise noted.8) to 0 and solving for 3 f yields (12. .2. 1 ( C a u s s .2.17) (j2 and inserting it into (12. If X'X is nearly singular.' X .20). 2 .we have = = ( X ' X )-'XI (EUU')x(x'x) uZ(x'x)-'.2.15) yields (jl = (x.2.17) and (12.16) and (10. In the special case where X1 = 1. and Note that ~ ( isj equal to C O V ( ~ .16) for (12.' X M.20) to the variance of the ith element of (j.24).2. X 1 ) . Since p* = P I.22) yields the variances V& and given.23) is equivalent to (12.y.2.2.24).iZ 286 12 1 Multiple Regression Model 12. pJ). The theorem derived below shows that (j is the best linear unbiased estimator (BLUE). THEOREM 1 2 .2. C'X = I.1. Define the variance-covariance matrix of 6.1.2 1 Least Squares Estimators 287 1 4 Solving (12. The off-diagonal element yields Cov(&.we obtain Since X is a matrix of constants (a nonstochastic matrix) and Eu = 0 k y the assumptions of our model. the least squares estimator (j is unbiased.2. VP O). since each element is a linear combination of the ? of the matrix (x'x)-'x'uu'x(x'x)-' elements of the matrix uu'. M2y.2.25). The reader should verift. -' Thus the class of linear unbiased estimators is the class of estimators which can be written as C'y. where we have used the word best in the sense of Definition 11.2. In other words.1.2. from Theorem 4. x 2 ) .3) into the right-hand side of (12.23) as the diagonal elements of the 2 X 2 variance-covariance matrix. the vector of T ones.2.M&~)-'x.2.1. (12. Inserting (12.3) into the left-hand side of (12. we impose vP.19) and (12. where C is a T X K constant matrix satisfying (12.2. or more exactly.2.22) exists and is finite. in (10.' X . . (12.2. as every variance-covariance matrix should be. the variancecovariance matrix (12.2 Finite Sample Properties of B We shall obtain the mean and the variance-covariance matrix of the least squares estimator (j.2. and X2 = X . j ~ ( =j p + (xrx)-'X'EU = p. as we can see from the definition of the inverse given in Definition 11.23) EC'y = p.2. The least squares estimator (j is a member of this class where C' = (x'x) -'xl.we have . . the elements of ~ ( are j large.X I ( X . The fourth equality follows from the assump tion Euu' = 0'1.2. We call this largeness of the elements of ~ ( due to the near-singularity of X'X the problem of multicollinearity.23).2. which was obtained in (10. using (12. The ith diagonal element of the variancecovariance matrix ~ ( is jequal Let p* = C ' y where C is a T X K constant matrix such that C'X = I.6.2.24) ~ ( =j ~ ( ( j ~(j)(( j~ 6 ) ' .24) and (10.2.respectively.x ) in (12. Then.. where M1 = I . we note that (12.18) (j2 = (x.5). Then.2.M a r k o v ) The third equality above follows from Theorem 4. that setting X = (1. symmetric.5. . respectively. . Next we shall prove that the least squares estimator (j is the best linear unbiased estimator of p. The i.18) are reduced to (10. Proof.2.2. Similarly.6.(X. (12. formulae (12. 
Partition β̂' = (β̂_1', β̂_2'), where β̂_1 is a K_1-vector and β̂_2 is a K_2-vector such that K_1 + K_2 = K, and partition X = (X_1, X_2) conformably. Then the two subvectors can be written explicitly as

β̂_1 = (X_1'M_2X_1)^{-1}X_1'M_2y,   where M_2 = I - X_2(X_2'X_2)^{-1}X_2',

and

β̂_2 = (X_2'M_1X_2)^{-1}X_2'M_1y,   where M_1 = I - X_1(X_1'X_1)^{-1}X_1'.

In the special case where X_1 = 1, the vector of T ones, and X_2 = x, these formulas reduce to the formulas for α̂ and β̂ obtained in Chapter 10; the reader should verify this.

12.2.2 Finite Sample Properties of β̂

We shall obtain the mean and the variance-covariance matrix of the least squares estimator β̂. Inserting (12.1.3) into β̂ = (X'X)^{-1}X'y yields β̂ = β + (X'X)^{-1}X'u. Since X is a matrix of constants (a nonstochastic matrix) and Eu = 0 by the assumptions of our model, we have

Eβ̂ = β + (X'X)^{-1}X'Eu = β.

In other words, the least squares estimator β̂ is unbiased. Define the variance-covariance matrix of β̂, denoted Vβ̂, by

Vβ̂ = E(β̂ - β)(β̂ - β)'.

Then we have

Vβ̂ = E(X'X)^{-1}X'uu'X(X'X)^{-1} = (X'X)^{-1}X'(Euu')X(X'X)^{-1} = σ²(X'X)^{-1},

where the last equality follows from the assumption Euu' = σ²I. The ith diagonal element of Vβ̂ is the variance of β̂_i, and the i, jth off-diagonal element is the covariance between β̂_i and β̂_j. Note that Vβ̂ is symmetric, as every variance-covariance matrix should be. If the determinant of X'X is nearly zero, the elements of (X'X)^{-1}, and hence the elements of Vβ̂, are large, as we can see from the definition of the inverse given in Chapter 11. We call this largeness of the elements of Vβ̂ due to the near-singularity of X'X the problem of multicollinearity.

Next we shall prove that the least squares estimator β̂ is the best linear unbiased estimator (BLUE) of β, where we use the word best in the sense of the comparison of mean squared error matrices introduced in Section 11.5. We define the class of linear estimators of β to be the class of estimators which can be written as C'y for some T X K constant matrix C, and the class of linear unbiased estimators as the subset of the class of linear estimators which are unbiased, that is, which satisfy EC'y = β for all β. Since EC'y = C'Xβ, this condition is equivalent to C'X = I. Thus the class of linear unbiased estimators is the class of estimators which can be written as C'y, where C is a T X K constant matrix satisfying C'X = I. The least squares estimator β̂ is a member of this class, with C' = (X'X)^{-1}X'.

THEOREM 12.2.1 (Gauss-Markov) Let β* = C'y, where C is a T X K constant matrix such that C'X = I. Then β̂ is better than β* if β* ≠ β̂.

Proof. Since β* = β + C'u because of C'X = I, we have

Vβ* = EC'uu'C = σ²C'C.

Set Z = C - X(X'X)^{-1}. Multiplying out the four terms and using C'X = I, we obtain

C'C - (X'X)^{-1} = Z'Z.

But Z'Z is a nonnegative definite matrix. Therefore Vβ* - Vβ̂ = σ²Z'Z is nonnegative definite, which proves that β̂ is better
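The unbiasedness of β̂ and the formula Vβ̂ = σ²(X'X)^{-1} can be checked by simulation. The following sketch is illustrative only; the design matrix, parameter values, number of replications, and seed are all invented:

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, sigma = 100, 3, 1.0
beta = np.array([1.0, 0.5, -2.0])
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
XtX_inv = np.linalg.inv(X.T @ X)

# Monte Carlo check of E(beta_hat) = beta and V(beta_hat) = sigma^2 (X'X)^{-1}
draws = np.empty((5000, K))
for r in range(5000):
    y = X @ beta + rng.normal(scale=sigma, size=T)
    draws[r] = XtX_inv @ X.T @ y            # the least squares estimator

print(draws.mean(axis=0))                   # close to beta: unbiasedness
print(np.abs(np.cov(draws.T) - sigma**2 * XtX_inv).max())   # small discrepancy
```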
than Pf Thus the best linear unbiasedness of 6i and fi proved in Section 10.3. 4 Asymptotic Properties of Least Squares Estimators so that <. XI[(XfX)-'1 -+ 0 implies tr (x'x)-~ -+ 0 because the trace is the sum of the charac= u2(x'x)-" as we obtained teristic roots by Theorem 11. the ith diagonal element of X'X. tr (X'X)-' + 0 implies tr 1 2 . Because of (12. 2.3) shows that the characteristic roots of (x'x)-' are the reciprocals of the characteristic roots of X'X. as we show below. 12.)x(.5. the first of which consists of T ones and the second of which has zero in the first position and one elsewhere. the square root of R'.2. the least squares estimator @ is a consistent estimator of p if XS(X1X) -+ m. Since + 0. Therefore AS(XrX) + m implies <. .2 1 Least Squares Estimators 291 # ! VP.---T T 9 .1).2. Therefore we do not have X. Suppose X has two columns.34) defines the least squares predictor of y* based on X& SOwe can rewrite (12. Then.1.2. which in turn implies in (12. THEOREM Pro$ Equation (11. i = 1. For this purpose define y* = Ly.(XfX) -+ m. in this example.2.2. the variance of the least squares estimatbr of each of the two parameters converges to u2/2 as T goes to infinity.+ m. Then.1).5 for a discussion of the necessity to modify R2 in o r d e r _ _ to use it as a criterion for choosing a regression equation. Since the characteristic roots of (x'x)-"re all positive. ) x ( .35) R' = (y*'7*12 (Y*'Y*) .2 .)x(.33) that gmeralizes the intepretation given by (10. we have ! which is the square of the sample correlation coefficient between y* and f*.33) as (12.5. (7*'7*) But the right-hand side of (12.(XrX) + implies that every diagonal element of X'X goes to infinity as T goes to infinity.27) we have .2. we have from (12.1.3. VD 1 Proof.36) is greater than or equal to XS(XrX) by Theorem 11.-. p for X. Solving In this section we prove the consistency and the asymptotic normality of and the consistency of 6 ' under suitable assumptions on the regressor matrix X. + 0 f a i = I.9.. K.40) u .1.1.35) we sometimes call R. The converse of this result is not true.. 3 In the multiple regression model (12. Thus.1. From (12.2. )can be written as \ Y We now seek an interpretation of (12.. Using the results of Section 11.2.u'u ulPu (12. 2 .2. We can prove this as follows. 2 . < .2. Therefore Xs(XfX) + m implies x~[(x'x)-'1 + 0.26) and (12. 2.5. 2 In the multiple regression model (12.52). Let e.2. See Section 12. and Note that (12.37) THEOREM 1 2 .2. . be the T-vector that has 1 in the ith position and 0 elsewhere. . Note that the assumption X.26) is a consistent estimator of 0 ' . X: = LX2. Our theorem follows from Theorem 6.2. where XS(X1X) denotes the smallest characteristic root of X'X.22). where Al denotes the largest characteristic root.290 12 1 Multiple Regression Model 12.6. 62 as defined in (12.2. we find that the characteristic roots of X'X are 1 and 2T . the multiple correlation coefJident.+ m. . 2. T Define Z = XS-'.2. 1985. 20 Inserting (12.2.1. 2 . B THEOREM 1 2 .we have Eu'Pu = u 2 ~Therefore. Z'Z = R exists and is nonsingular.2.2. see Amemiya.2.40).26).2.45) is a generalization of equations (10. We also show that (j is the best unbiased estimator in this case.1.xf3)'(y .2. Since (12. Then s(B .2.28).292 12 = 1 1 Multiple Regression Model 1 4 12.19) and (10.3. chapter 3). 
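The unbiased variance estimator σ̂² = û'û/(T - K) and the coefficient of determination R² discussed here can be computed directly. A small sketch with simulated data (all numerical values invented):

```python
import numpy as np

rng = np.random.default_rng(3)
T, K = 60, 3
beta = np.array([2.0, 1.0, -0.5])
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ beta + rng.normal(size=T)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat

sigma2_hat = u_hat @ u_hat / (T - K)       # unbiased estimator of sigma^2
print(sigma2_hat)

# R^2 when X contains a constant term: one minus the residual sum of squares
# over the total sum of squares, which equals the squared correlation
# between y and its least squares prediction.
y_dev = y - y.mean()
R2 = 1.0 - (u_hat @ u_hat) / (y_dev @ y_dev)
print(R2, np.corrcoef(y, X @ beta_hat)[0, 1] ** 2)   # the two agree
```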
To show that is best unbiased under the normality assumption of u.21).81) as that defined in (12.2.1..2).45) has the same form as the expression (10.2.4.we can write the ith element of (j as which is a generalization of equation (10.dO.1).1).) = 0.2.68).the likelihood function of the parameters p and u2 can be written in vector notation as . From (12..R a o ) Let 6" be any unbiased estimator of a vector parameter 0 and let VO* be its variance-covariance matrix.2.43) plim -= u'P u T 0 The consistency of 6' follows from (12. 2 .44) and noting that M(-&.41) x(x'x)-'x' - as before.. (12.5 Maximum Likelihood Estimators B y a derivation similar to (12.80) and (10. U'R-'). Define In this section we show that if we assume the normality of {u. 0 Let x ( ~be ) the ith column of X and let X(-o be the T X (K submatrix of X obtained by removing x ( ~from ) X.2.2 and shows that the elements of (j are jointly asymptotically normal under the given assump tions.41).2. 4 1 Least Squares Estimators 293 where P (12.the reader should follow the discussion in Section 10. we have Note that (12. B. Taking the natural logarithm of (12.]in the model (12.2.67).)x(.2 THEOREM 1 2 .3) into the right-hand side of (12.47) yields (12.1). jth element is equal to d2 log L/de.)]'I2 as its ith diagonal element.2.2. and assume that lim.2.2. that appears in equations (10.17). In the multiple regression model (12.43) because of Theorem 6.2.12).2.48) log L = --log T 2 T 2~ --log 2 2 1 u -7 (y .2. a d (12.assume that 4 plim u'u 2 =a .Xp). 12. where S is the K X K diagonal matrix with [x. As we show& i equation (10.1.45) can be obtained from Theorem 10.2.2.48) it is apparent that the maximum likelihood estimator of p is identical to the least squares estimator To show that the maximum likelihood estimator of u2 is the 6 ' defined in (12. the least squares estimators (j and 6' are also the maximum likelihood estimators. by Chebyshev's inequality (6. . the sufficient condition for the asymptotic normality of (12. (For proof of the theorem.1. Suppose that a2 log L/d0aO1 denotes a matrix whose i.2.2. The following theorem generalizes Theorem 10.p) -+ N(0.2.77).we have for any e2 which implies (12.5 by regarding the 2. we need the following vector generalization.1) Using (12. Using the multivariate normal density (5. 5 ( C r a m B r . Then we have . 53)..1): where yp and up are both unobservable and xp is a K-vector of known constants.1). (12.1. We assume that upis distributed independently of the vector u.56) The second equality follows from the independence of up and u in view of Theorem 3.2.We have fi is best unbiased.2. The right-hand side of (12. Equation (12. Let fi and p* be the two estimators of p. I f u i s n o r m a l i n themodel (12.5.2.52). theleastsquares 1 3 i i 12.2. u 1 From (12.2.2.6.5.2. it shows that if an estimator fi is better than an estimator p* in the sense of Definition 11.2.PI' or the reverse inequality.54) we obtain From (12.294 12 [ Multiple Regression Model T H E O R E M 12.x$* in the sense that the former has the smaller mean squared prediction error.X'XP)..1.55) we conclude that if (gX is any unb-d estimator of 0.3.49) is called the C r a k . and (12.54) -------- a2 log L - apa(U2) .2.2. Thus.jp12 = E[up-xi@ - P)]' (12.Note that p and u2 are the same as in (12. if not.2.2. We put 0 = ( P I .6 Prediction As in Section 10.56) is the variance-covariance matrix of fi. we a f f i x the following "prediction period" equation to the model (12. 
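The consistency discussion can be illustrated by simulation: as T grows, the smallest characteristic root of X'X increases and the sampling variance of β̂ shrinks. A rough sketch under invented settings:

```python
import numpy as np

rng = np.random.default_rng(4)
beta = np.array([1.0, -1.0])

for T in (25, 100, 400, 1600):
    X = np.column_stack([np.ones(T), rng.normal(size=T)])
    est = []
    for _ in range(500):
        y = X @ beta + rng.normal(size=T)
        est.append(np.linalg.solve(X.T @ X, X.T @ y))
    print(T,
          np.min(np.linalg.eigvalsh(X.T @ X)).round(1),   # smallest root of X'X
          np.var(np.array(est), axis=0).round(4))         # shrinking variances
```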
and Eup = 0 and Vup= u2.P)(P* .5 we demonstrated that we may not be always able to show either L vp* 2 $(x'x)-I. In particular. the corresponding predictor jp = xi6 is better than yp* -.4.1.2. But since the right-hand side of (12. we immediately see that the least squares predictor jp = x# is the best linear unbiased predictor of yp.2. (12.2. Let fi be an arbitrary estimator of f3 based on y and define the predictor i I : i We obtain the mean squared prediction error af j pconditional as (12. we can rank the two estimators by the t .1. we have proved the following generalization of l h m p l e 7.59) establishes a close relationship between the criterion of prediction and the criterion of estimation.R a o (matrix) lower bound.P)' 5 E(P* .48).2.5. In Section 11.1).59) x+ E(yp. u2)' and calculate the CramCr-Rao lower bound for the log L given in (12. i E(fi . by restricting 6 to the class of linear unbiased estimators.6 12.2 1 Least Squares Estimators 295 1 f estimator where r is in the sense given in connection with Definition 11.P)(fi .49) and (12.2.(X'y . the case where Q' is a row vector of ones and c = 1 corresponds to the restriction that the sum of the regression parameters is unity.2. One weakness of this criterion is that xp may not always be known at the time when we must choose the estimator.3.3.p)(p .58).2. provides another scalar criterion by which we can rank estimators.3.1).3 cannot be found. We shall call (12.3) under (12. Using (12.3. is defined to be the value of f3 that minimizes the sum of the squared residuals: + which means that the second moments of the regressors remain the same from the sample period to the prediction period. we minimize (12. we must often predict xp before we can predict yp.2. whereas the remaining K2 elements are allowed to vary freely.1 we showed that (12. The study of this subject is useful for its own sake.The right-hand side of (12.61) plus a2 the unconditional mmn squared prediction erro?: 12. The constrained least squares (CLS) estimator of P.5) for 6 gives .5.3 1 Constrained Least Squares Estimators 297 trace or the determinant of the mean squared error matrix.3.2. The solution is obtained by equating the derivatives of p with respect to 6 and a q-vector of Lagrange multipliers X to zero. The essential part of the mean squared prediction error (12. if Q' = (I. it also provides a basis for the next section.2) as S(D) where E* denotes the expectation taken with respect to xp. We assume q < K and rank(Q) = q.59). As another example.p)'xp. denoted P+. K 2 = K.3.2.3. Accordingly we now treat xp as a random vector and take the expectation of x P ( 6 . Solving (12.1. where we shall discuss tests of the linear hypothesis (12.3.We assume that Constraints of the form (12.p = 6 and Q'@.3 C O N S T R A I N E D LEAST SQUARES ESTIMATORS Instead of directly minimizing (12. For example.1). In practice. We assume that the constraints are of the form and where Q is a K X q matrix of known constants and c is q-vector of known constants. Writing for the sum of the squares of the least squares residuals. Thus. which is mathematically simpler. Put .3.2) subject to (12. we can rewrite (12.2) is minimized without constraint at the least squares estimator 0.1).1).61) is a useful criterion by which to choose an estimator in situations where the best estimator in the sense of Definition 11.c = y. In this section we consider the estimation of the parameters P and a2in the model (12. 
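The least squares predictor x_p'β̂ and its conditional mean squared prediction error σ²(1 + x_p'(X'X)^{-1}x_p) can be computed as follows. The sketch assumes x_p is known; all numerical values are invented:

```python
import numpy as np

rng = np.random.default_rng(5)
T, K, sigma = 80, 3, 1.0
beta = np.array([1.0, 0.5, -2.0])
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ beta + rng.normal(scale=sigma, size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

xp = np.array([1.0, 0.3, -0.7])          # regressors for the prediction period
y_pred = xp @ beta_hat                   # the least squares predictor x_p'beta_hat

# Conditional mean squared prediction error: sigma^2 (1 + x_p'(X'X)^{-1} x_p)
mspe = sigma**2 * (1.0 + xp @ XtX_inv @ xp)
print(y_pred, mspe)
```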
Then the problem is equivalent to the minimization of 6'XrX6subject to Q'6 = y.296 12 ( Multiple Regression Model 12. 0) where I is the identity matrix of size K1 and 0 is the K 1 X K2 matrix of zeroes such that K .3. we obtain subject to the constraints specified in (12. In Section 12.1) when there are certain linear constraints about the elements of p. the constraints mean that a Kl-component subset of p is specified to take certain values.1) embody many common constraints which occur in practice. From (12.1) the null hypothesis. the CLS P+ is the best linear unbiased estimator.4.c).298 12 1 Multiple Regression Model 12. Z2 and c are all known constants. for fi ignores the constraints.3. I The corresponding estimator of u2 is given by Taking the expectation of (12. Throughout the section we assume the multiple regression model (12.7) into (12. R'Q = 0. where Z = XA-'.3. (The proof can be found in Amemiya.8) back into (12.10) y .) 12.1. We can give an alternative derivation of the CLS P+. The R is not unique. chapter 1.] are normal in the model (12. we obtain (12.1.We should expect this result. I:[ In (12. we get (12.4.q parameters without constraint. 4 Finally.3. Finding R is easy for the following reason.3. (~ Zlc) VP' = u2{(~'X)-" (x'x)-'Q[Qr (X'X)-'Q]-'Q~ (x'x)-'1. We shall call (12.Zlc is the vector of dependent variables and a (K . a = R'P.Q(Q'Q)-'Q'IS satisfies our two conditions.3.5) and sdvlng for 8.3.4 TESTS OF HYPOTHESES 12. R)'.14) represents a multiple regression model in which y .12) Since Z1.9)-vector a constitutes the unknown regression coefficients.3. second.R(R'X'XR)-'R'x'x]Q(Q'Q)-".3.6) and solving for A.14) I E I Transforming 6 and y into the original variables.3.10) under the assumption that (12.15) I & = ( z . z ~ ) .[Q' (x'x)-~Q]-'~. inserting (12. we have VP' 5 V f i .3. (12. 1985.1) as a testable hypothesis and develop testing procedures.3.q) matrix S such that (Q.10) as (12. S) is nonsingular.q) matrix R such that. equation (12.4 1 Tests of Hypotheses 299 Inserting (12. Suppose we find a K X (K . Zl consists of the first q columns of Z.3.8) A 6 = .13).16) P + = A-' = p by Since the second term within the braces above is nonnegative definite.9) = (x'x)-'Q[Q'(X'X)-~Q]-'~.1) and if (12.3) as follows: R(R'X'XR)-'R'x'~ + [I . 1 p ' = fi - (x'x)-'Q[Q'(x'x)-'Q]-'(Q'fi . Then R defined by R = [I .3. Using A we can transform equation (12. R) is nonsingular and. since the distribution of the test statistics we use is derived under the normality assumption.1) is true.1 Introduction In this section we regard the linear constraints of (12. Now define A = (Q. We discuss . We can evaluate V P + from (12.1) is true. If {u.~ z . we have reduced the problem of estimating K parameters subject to q constraints to the problem of estimating K . the constrained least squares estimators pt and 02+ are the maximum likelihood estimators.3. We can apply the least squares method to (12.14) to obtain (12.3. we can write the solution as (12.3. and & consists of the last K .3. It can be shown that if (12.16) we have used the same symbol as the CLS 0' because the right-hand side of (12. first.10) if X'X is nonsingular.1) with the normality of u.9 assures us that we can find a K X (K . and then estimate (12.ZIc = Z2a + U.q columns of Z. (Q.13).3. Theorem 11. any value that satisfies these conditions will do.3.3.3.16) can be shown to be identical to the right-hand side of (12.1.1) is true.3.3. we immediately see that ED' = P. Thus. by the transformation of (12.3.3. the F test.we can write 0 . 
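The constrained least squares estimator can be computed from the closed-form expression β⁺ = β̂ - (X'X)^{-1}Q[Q'(X'X)^{-1}Q]^{-1}(Q'β̂ - c). A sketch with an invented data set and an illustrative single constraint (that the two slope coefficients sum to one):

```python
import numpy as np

rng = np.random.default_rng(6)
T, K = 50, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, 0.5, 0.5]) + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

Q = np.array([[0.], [1.], [1.]])          # K x q constraint matrix with q = 1
c = np.array([1.0])

A = Q.T @ XtX_inv @ Q
beta_cls = beta_hat - XtX_inv @ Q @ np.linalg.solve(A, Q.T @ beta_hat - c)

print(Q.T @ beta_cls)                     # equals c: the constraint is satisfied
# The constrained fit never has a smaller sum of squared residuals.
print(np.sum((y - X @ beta_cls)**2) >= np.sum((y - X @ beta_hat)**2))
```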
by Definition 2 of the Appendix.5.1).4. we obtain THEOREM 1 2 .4. we have T-K where w = H'v and wiis the ith element of w.1 LetQbeasdefinedin(12. Note that here Q' is a row vector and c is a scalar.2. where the right-hand side of (12.2 to (j defined in (12. Since (j is normal as shown above.N ( 0 .18.1) with the normality of u.1) and (12. discussed in the next section.9) is an estimate of the standard deviation of the numerator.300 12 I Multiple Regression Model 12.5).4.we immediately see that (j is normally distributed if y is normal.5.K degrees of freedom.4.4. Hence.29). I). must be used if q > 1. The F test. a 2 ~(x'x)-'Q] ' under the null hypothesis (that is.1) and (12. I ) . we derive the distribution of (j and G'ii and related results. where 6 is the square root of the unbiased estimator of u2 defined in equation (12. 4 .4. Student's t with T .27). and a test for structural change (a special case of the F test).8) are independent because of Theorem 12. Applying Theorem 5.4.N [c.2.4.4. that is. 0 zT~:w? - .2.4. depending on the alternative hypothesis.2. there exists a T X T matrix H such that H'H = I and Q'$ . q = 1.4.7) Because of Theorem 11. As preliminaries. Since w X.2. if Q'P = c).2 Student's t Test The t test is ideal when we have a single constraint. But since (j and Q are jointly normally distributed by Theorem 5. we have v from (12. Therefore.4. Pro$ If we define v = u-'u.K diagonal positions and zero elsewhere. Note that the denominator in (12.9).2). This follows from (12.N(0. with the normality of u we have (x'x)-'x' (Euu') M = a2 (x'x)-'X'M =o. we have (12.4 1 Tests of Hypotheses 301 Student's t test. .4. Therefore. The null hypothesis Q'P = c can be tested by the statistic (12.We use a one-tail or two-tail test.6) E(B .4.1. in that order. Pro$ We need only show the independence of (j and Q because of Theorem 3.4. the random variables defined in (12. Using the mean and variance obtained in Section 12. 2 In the model (12. Since iilii = ulMu 12. we need only show that (j and Q are uncorrelated.-K by Definition 1 of the Appendix.2.1.12).2) are independent.2) and (12.4) denotes a diagonal matrix that has one in the first T . Next. The random variables defined in (12. let us show the independence of (12.4.2.2.P)Q' = E(x'x)-'X'UU'M = THEOREM 12.Inthemadel(121. 11) Therefore.~ q T-K ---- 4 .S(@) 7 = ----.2) and (12. ( Q r @.4.17) are independent because [I . K2 ii'ii .10) right away and reject the null hypothesis if the left-hand side were greater than a certain value.4. (12. The reader will recall from Section 9.4. Partition X as X = (XI.K ) . Consider the case where the f3 is partitioned as P' = ( P i . the F statistic (12.7. = C.10) are independent.3 The (12.c)' [ Q r ( X ' x ) .11) alternatively as (12.4. (12.4.18) ((3= ) U'Z~(Z.12) we have (12.Inserting these values into (12.19.F ( K 2 .13). s ( P +) ~ ( ( 3 )= ( ~ ' (3 c)'[Q'(x'x)-'Q] 1 I From equations (12.P + ) .we see that if q = 1 (and therefore Q ' is a row vector).4. unspecified.10) and (12. ] z .18) somewhat. and c = 0 . T .4.4.4. Q is a K-vector of ones.9). by Theorem 9.17) S(P+) . where d is determined so that P ( q > d ) is equal to a certain prescribed significance level under the null hypothesis. and the closer Q'f3 is to c .~B.4.3. From equation (12. 4 ~(li) - F Test In this section we consider the test of the null hypothesis Q ' P = c against the alternative hypothesis Q'f3 Z c when it involves more than one constraint (that is. Then the null hypothesis becomes Pi = Pi. 
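For a single constraint Q'β = c, the t statistic takes the form (Q'β̂ - c)/[σ̂² Q'(X'X)^{-1}Q]^{1/2} and is compared with Student's t with T - K degrees of freedom. A numerical sketch with invented data (no table lookup is performed here):

```python
import numpy as np

rng = np.random.default_rng(7)
T, K = 40, 3
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
beta = np.array([1.0, 0.0, 2.0])          # the first slope is truly zero
y = X @ beta + rng.normal(size=T)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
s2 = u_hat @ u_hat / (T - K)              # unbiased estimate of sigma^2

# Single constraint: Q picks out the first slope coefficient, c = 0.
Q = np.array([0.0, 1.0, 0.0])
c = 0.0
t_stat = (Q @ beta_hat - c) / np.sqrt(s2 * (Q @ XtX_inv @ Q))
print(t_stat)          # compare with Student's t with T - K degrees of freedom
```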
The F statistic 7 given in (12.11) yields (12. Note that s ( P +) is always nonnegative by the definition of p f and (3.14) provides a more convenient form for computation than (12. I ) .z(z'z)-'Z'IZ.c).11) takes on a variety of forms as we insert specific values into Q and c .-vector and p2 is a K2-vector such that K 1 + K2 = K.we have ~(8) ~(0) and If a ' were known. we have (12.4.14) T .z~)-'z.15) and T . The F statistic can be alternatively written as follows.4 1 Tests of Hypotheses 303 a The following are some of the values of Q and c that frequently occur in practice: The ith element of Q is unity and all other elements are zero.4.3. The ith and jth elements of Q are 1 and -1.12 1 Multiple Regression Model 12. P.).4. The result (12.3) we have where Z1 = [I (12.4. The distribuAgain ~ ' (tion of ~ ' (given 3 in (12.13) (Q'B . 3 c will play a central role in the test statistic.4. z~(z. . Finally. We can simplify (12. 12.4. and the null hypothesis specifies p2 = p2 and leaves p.K s(P+) .5. X2) conform- .7) is valid even if g > 1 because of Theorem 5. Using the regression equation (12.11) is the square of the t statistic (12. q > 1 ) .4.4.' Q ] .' ( ~ ' ( j C) ii'ii .4.U. the chi-square variables (12. we could use the test statistic (12.T . I)(x'x)-'(o.11). Therefore we can write (12.1. Since (3 and ii are independent as shown in the argument leading to Theorem 12.p2) q = ------.4.4.9) and (12. by Theorem 11.11) if constrained least squares residuals can be easily computed.4.4.7 that this would be the likelihood ratio test if (3 were normal and the generalized Wald test if (3 were only asymptotically normal. = 0 .P + ) ' x ' x ( ( ~. by Definition 3 of the Appendix..P2)'[(0. where P1 is a K.2. Comparing (12.4. respectively. I is the identity matrix of size K 2 .K ( @ 2 . Then the null hypothesis is simply Pi= c.K).This fact indicates that if q = 1 we must use the t test rather than the F test.4.4.4.Z~)-~Z. Therefore. where 0 is the K2 X K1 matrix of zeroes. Also note that (12.. The null hypothesis Q ' P = c is rejected if 7 > d. F(q.14) may be directly verified. Therefore.12) s ( P +) ~ ( p =) ((3 . T-K).F(q. In this case the t test cannot be used. since a one-tail test is possible only with the t test. Then the null hypothesis becomes Z$. the smaller s ( P + ) becomes. and c = D2.3.2. This hypothesis can be written in the form Q ' P = c by putting Q ' = ( 0 . 1)'1-'(@2. 3.4. This test can be handled as a special case of the F test presented in the preceding section.that is. We can represent our hypothesis P1 = p2 as a standard linear hypothesis of the form (12. X ~ ) . T . K = 2K*. residuals from (12.4 1 Tests of Hypotheses 305 ably with the partition of P.4.~~ fi2 = ( X . ~ y 12.4. we have from equation (12. 1 ) .F(Kp.25) have T 2 rows. .24) have T 1 rows and those in (12. we can combine the two equations as where the vectors and matrices in (12.I ) .3. Let s(B) be the sum of the squared where we have defined Z = (x.4.2.24) and (12.2.25) as Therefore. X1 is a T I X K* matrix and X2 is a T 2 X K* matrix.4.9.33).20) where K 1 = 1.~ X . ~ ~ 2by ) (12. so that p1 is a scalar coefficient on the first column of X . ~ 1 ~ 2 ( 88 22 ) iilii .20) can now be written as ' given in (12.4.2. In (12.26) is the same as the model (12.. combine equations (12. which we assume to be the vector of ones (denoted by 1 ) .4.20) 'q =. and c = 0. where Therefore (12.4. Then.18) as (12.14). .2.' Xand We can obtain the same result using (12.4. 
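The F statistic based on the constrained and unconstrained sums of squared residuals, [S(β⁺) - S(β̂)]/q divided by û'û/(T - K), can be computed as in the following sketch. The data and the hypothesis are invented; the restricted fit simply drops the restricted regressors, which is equivalent to imposing that their coefficients are zero:

```python
import numpy as np

rng = np.random.default_rng(8)
T, K = 60, 4
X = np.column_stack([np.ones(T), rng.normal(size=(T, K - 1))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=T)

# Unrestricted fit
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
ssr_u = np.sum((y - X @ beta_hat)**2)

# Restricted fit under the null hypothesis that the last two coefficients are zero
X1 = X[:, :2]
b1 = np.linalg.solve(X1.T @ X1, X1.T @ y)
ssr_r = np.sum((y - X1 @ b1)**2)

q = 2                                     # number of constraints
F = ((ssr_r - ssr_u) / q) / (ssr_u / (T - K))
print(F)              # compare with the F(q, T - K) distribution
```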
Of particular interest is the special case of item 4 above in which K₁ = 1 and X₁ is the vector of ones (denoted by 1), so that β₁ is the coefficient on the constant term, and the null hypothesis specifies β₂ = 0. Putting Q' = (0, I) and c = 0 in (12.4.11), and using M₁ = I − T⁻¹11' together with the definition of the coefficient of determination R², the test statistic can be expressed as

η = [R²/K₂] / [(1 − R²)/(T − K)]  ~  F(K₂, T − K),  with K₂ = K − 1.

12.4.4 Tests for Structural Change

Suppose we have two regression regimes

(12.4.24)  y₁ = X₁β₁ + u₁    and    (12.4.25)  y₂ = X₂β₂ + u₂,

where the vectors and matrices in (12.4.24) have T₁ rows and those in (12.4.25) have T₂ rows, X₁ is a T₁ × K* matrix and X₂ is a T₂ × K* matrix, and u₁ and u₂ are normally distributed with zero means and variance-covariance matrices σ₁²I and σ₂²I, independent of each other. We want to test the null hypothesis β₁ = β₂ assuming σ₁² = σ₂² (= σ²), that is, to test whether the same regression relationship holds in the two regimes.

To apply the F test to this problem, we combine the two equations as

(12.4.26)  y = Zβ* + u,

where y = (y₁', y₂')', u = (u₁', u₂')', β*' = (β₁', β₂'), and Z is the block-diagonal matrix with X₁ and X₂ on the diagonal and zeroes elsewhere. Note that (12.4.26) is the same as the model (12.1.1) with normality if we put T = T₁ + T₂ and K = 2K*. We can represent the hypothesis β₁ = β₂ as a standard linear hypothesis of the form Q'β* = c by putting q = K*, Q' = (I, −I), and c = 0. Inserting these values into (12.4.11) yields the test statistic

η = (β̂₁ − β̂₂)'[(X₁'X₁)⁻¹ + (X₂'X₂)⁻¹]⁻¹(β̂₁ − β̂₂) / (K* σ̂²),

where β̂₁ = (X₁'X₁)⁻¹X₁'y₁, β̂₂ = (X₂'X₂)⁻¹X₂'y₂, and σ̂² is the unbiased variance estimate from (12.4.26). Equivalently, we may use the form (12.4.14), as shown next.
Let S(β̂) be the sum of the squared residuals from the unconstrained regression (12.4.26), which equals the sum of the residual sums of squares from fitting (12.4.24) and (12.4.25) separately, and let S(β⁺) be the sum of the squared residuals from the constrained regression obtained by imposing β₁ = β₂, that is, by regressing y on the stacked matrix (X₁', X₂')'. Then, by (12.4.14),

η = { [S(β⁺) − S(β̂)]/K* } / { S(β̂)/(T − 2K*) }  ~  F(K*, T − 2K*)

under the null hypothesis, and β₁ = β₂ is rejected when η is large. Although this form and the form obtained directly from (12.4.11) look very different, they can be shown to be equivalent in the same way that we showed the equivalence between (12.4.11) and (12.4.14).

The hypothesis β₁ = β₂ is merely one of the many linear hypotheses that can be imposed on the β of the model (12.4.26). There may be a situation where we want to test the equality of a subset of β₁ to the corresponding subset of β₂; in that case we put T = T₁ + T₂ and K = 2K* as before, but q equals the number of elements in the subset, and Q must be modified accordingly. If we wish to test the equality of a single element of β₁ to the corresponding element of β₂, we use the t test rather than the F test, for the reason given in the last section.

So far we have considered the test of the hypothesis β₁ = β₂ under the assumption that σ₁² = σ₂². Earlier we presented Welch's method of testing the equality of regression coefficients without assuming the equality of variances in the bivariate regression model. Unfortunately, Welch's approximate t test does not effectively generalize to the multiple regression model. We therefore mention two simple procedures that can be used when the variances are unequal; both are valid only asymptotically, that is, when the sample size is large.

The first is the likelihood ratio test. The likelihood function of the model defined by (12.4.24) and (12.4.25) is the product of the two normal likelihoods. The value attained when it is maximized without constraint can be obtained by evaluating the parameters at β̂₁, β̂₂, σ̂₁² = T₁⁻¹(y₁ − X₁β̂₁)'(y₁ − X₁β̂₁), and σ̂₂² = T₂⁻¹(y₂ − X₂β̂₂)'(y₂ − X₂β̂₂). The value attained when it is maximized subject to the constraint β₁ = β₂ can be obtained by evaluating the parameters at the constrained maximum likelihood estimates β̃₁ = β̃₂ (= β̃), σ̃₁², and σ̃₂². Minus twice the logarithm of the likelihood ratio is asymptotically chi-square distributed under the null hypothesis.

The second test is derived by the following simple procedure. Define δ = σ₁/σ₂ and its consistent estimate δ̂ = σ̂₁/σ̂₂.

Step 1. Calculate β̂₁, β̂₂, σ̂₁², and σ̂₂².
Step 2. Multiply both sides of (12.4.25) by δ̂ and define the new equation, whose error term has approximately the same variance as u₁.
Step 3. Treat (12.4.24) and the new equation as the given equations and perform the F test described above on them.
Step 4. If desired, repeat steps 1 and 2, substituting the new estimates, and continue the process until the estimates converge; the estimates obtained at the end of the first round may, however, be used without changing the asymptotic result.

The method works asymptotically because the variance of δ̂u₂ is approximately the same as that of u₁ when T₁ and T₂ are large.

Finally, we may wish to test the hypothesis σ₁² = σ₂² itself before performing the F test discussed above. Under the null hypothesis that σ₁² = σ₂² (= σ²) we have

(y₁ − X₁β̂₁)'(y₁ − X₁β̂₁)/σ² ~ χ²(T₁ − K*)    and    (y₂ − X₂β̂₂)'(y₂ − X₂β̂₂)/σ² ~ χ²(T₂ − K*).

Since these two chi-square variables are independent by the assumption of the model, the ratio of the two sums of squares, each divided by its degrees of freedom, is distributed as F(T₁ − K*, T₂ − K*) by Definition 3 of the Appendix. We should use a two-tail test here, since either a large or a small value of the statistic is a reason for rejecting the null hypothesis.
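A minimal sketch of the structural-change F test under σ₁² = σ₂², in the sum-of-squared-residuals form above; it is not from the text, and the function and variable names are mine.

```python
# Structural-change (Chow-type) F test for H0: beta_1 = beta_2, equal variances assumed
import numpy as np

def structural_change_f(y1, X1, y2, X2):
    """Returns the F statistic, distributed F(K*, T1+T2-2K*) under H0."""
    def ssr(y, X):
        b = np.linalg.lstsq(X, y, rcond=None)[0]
        e = y - X @ b
        return e @ e
    T1, K = X1.shape
    T2 = X2.shape[0]
    ssr_sep = ssr(y1, X1) + ssr(y2, X2)                 # unconstrained: separate fits
    ssr_pooled = ssr(np.concatenate([y1, y2]),          # constrained: common beta
                     np.vstack([X1, X2]))
    num = (ssr_pooled - ssr_sep) / K
    den = ssr_sep / (T1 + T2 - 2 * K)
    return num / den
```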
12.5 SELECTION OF REGRESSORS

In Section 10.3 we briefly discussed the problem of choosing between two bivariate regression equations with the same dependent variable. Here we consider choosing between two multiple regression equations,

(12.5.1)  y = Xβ + u    and    (12.5.2)  y = Sγ + v,

where each equation satisfies the assumptions of model (12.1.1). Suppose X has K columns and S has H columns. We stated earlier that, other things being equal, it makes sense to choose the equation with the higher R². If H ≠ K, however, this no longer makes sense, because the greater the number of regressors, the larger R² tends to be, other things being equal; in the extreme case where the number of regressors equals the number of observations, R² = 1. So if we are to use R² as a criterion for choosing a regression equation, we need to adjust it somehow for the degrees of freedom.

Theil (1961, p. 213) proposed one such adjustment. Theil's corrected R², denoted R̄², is defined by

1 − R̄² = [(T − 1)/(T − K)](1 − R²),

where K is the number of regressors. Theil offers the following justification for his corrected R². Choosing the equation with the largest R̄² is equivalent to choosing the equation with the smallest unbiased estimate of the error variance; and he shows that the expected value of the unbiased variance estimator from (12.5.1) is no greater than that from (12.5.2) if the expectation is taken assuming that (12.5.1) is the true model, where the latter estimator is y'[I − S(S'S)⁻¹S']y/(T − H). The justification is merely intuitive and not very strong.

An important special case of the problem considered above is when S is a subset of X. Without loss of generality, assume X = (X₁, X₂) and S = X₁, and partition β conformably as β' = (β₁', β₂'). Then choosing (12.5.2) over (12.5.1) is equivalent to accepting the hypothesis β₂ = 0. But the F test of this hypothesis accepts it if η < c, where η is as given in (12.4.11) with Q' = (0, I) and c = 0, and where the critical value c is determined by the prescribed significance level. It can be shown that the use of Theil's R̄² is equivalent to setting c = 1. In this sense, any decision rule of this kind can be made equivalent to the choice of a particular value of c.
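A small sketch, not from the text, of Theil's corrected R̄² next to the ordinary R²; the names are illustrative, and the design matrix is assumed to contain a constant column so that R² has its usual interpretation.

```python
# R^2 and Theil's corrected R-bar^2 = 1 - (T-1)/(T-K) * (1 - R^2)
import numpy as np

def r2_and_corrected(y, X):
    T, K = X.shape
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    tss = np.sum((y - y.mean()) ** 2)     # total sum of squares about the mean
    r2 = 1.0 - (resid @ resid) / tss
    r2_bar = 1.0 - (T - 1) / (T - K) * (1.0 - r2)
    return r2, r2_bar
```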
What value of c is implied by the customary choice of the 5% significance level? The answer depends on the degrees of freedom of the F test, K − H and T − K. (Note that K − H appears as K₂ in (12.4.11) applied to this problem.) Table 12.1 gives the value of c for selected values of the degrees of freedom; the table is calculated by solving for c in P[F(K − H, T − K) > c] = 0.05.

TABLE 12.1  Critical values of F test implied by 5% significance level (entries for selected values of K − H and T − K; values not reproduced here)

Mallows (1964), Akaike (1973), and Sawa and Hiromatsu (1973) obtained solutions to this problem on the basis of three different principles and arrived at similar recommendations, in which the value of c ranges roughly from 1.8 to 2. These results suggest that Theil's R̄², though an improvement over the unadjusted R², still tends to favor a regression equation with more regressors. The results also cast some doubt on the customary choice of 5%.

EXERCISES

1. (Section 12.2) Consider the regression model y = βx + u, where β is a scalar unknown parameter, x is a T-vector consisting entirely of ones, and u is a T-vector such that Eu = 0 and Euu' = σ²I_T. Obtain the mean squared errors of the two estimators β̂ = x'y/(x'x) and β̃ = z'y/(z'x), where z' = (1, 0, . . . , 0). Which estimator is preferred? Answer directly, without using Theorem 12.2.1.

2. (Section 12.2) In the regression model y = Xβ + u, show directly that β̂ = (X'X)⁻¹X'y is a better estimator than β̃ = (S'X)⁻¹S'y, without using Theorem 12.2.1. Also derive an explicit formula for the maximum likelihood estimator of σ² based on an i.i.d. sample, derive the Cramér-Rao lower bound, and derive the estimator's asymptotic distribution directly.

3. (Section 12.2.3) Show that R² defined in the paragraphs before (12.2.13) satisfies the two conditions given there.

4. (Section 12.2.6) In the model y = Xβ + u with predictand y_p = x_p'β + u_p, obtain the unconditional mean squared prediction errors of the predictors x_p'β̂ and x_p'β̂*. Under what circumstances can the second predictor be regarded as superior to the first?

5. (Section 12.4.2) Given the data of the exercise, test the null hypothesis β₂ = β₁ against the alternative hypothesis β₂ > β₁ at the 5% significance level.

6. (Section 12.4.3) We want to estimate a Cobb-Douglas production function log Q_t = β₁ + β₂ log K_t + β₃ log L_t + u_t, t = 1, 2, . . . , T, in each of three industries A, B, and C, where Q, K, and L denote output, capital input, and labor input. We assume that β₁ varies among the three industries, that the u_t are normal with mean zero, that their variance is constant for all t and for all three industries, and that the u_t are distributed independently of K and L. Test the hypothesis that β₂ is the same for industries A and B and β₃ is the same for industries B and C (jointly, not separately) at the 5% significance level. Write detailed instructions on how to perform such a test.

7. (Section 12.4.3) Consider the regression model y = Xβ + u, where y and u are eight-component vectors, X is an 8 × 3 matrix, u ~ N(0, σ²I), and β is a three-component vector of unknown parameters. Using the observed values given in the exercise, test the stated hypotheses on the elements of β at the 5% significance level.

8. (Section 12.4.3) Consider the bivariate regression models y₁ = α₁x₁ + β₁z₁ + u₁ and y₂ = α₂x₂ + β₂z₂ + u₂, where the u's are independent normal with zero mean and constant variance. Using the given data, test H₀: α₁ = α₂ and β₁ = β₂ versus H₁: not H₀, and test "α₁ = α₂" alone, each at the 5% significance level. For three such models, test β₁ = β₂ = β₃ at the 5% significance level.

9. (Section 12.4.3) Solve Exercise 35 of Chapter 9 in a regression framework.
13 ECONOMETRIC MODELS

The multiple regression model studied in Chapter 12 is by far the most frequently used statistical model in all the applied disciplines, including econometrics. For these reasons the model is sometimes called the classical regression model or the standard regression model. It is also the basic model from which various other models can be derived. In this chapter we study various other models frequently used in applied research. We have given them the common term "econometric models," but all of them have been used by researchers in other disciplines as well.

The models of Section 13.1 arise as the assumption of independence or homoscedasticity (constant variance) is removed from the classical regression model. The models of Sections 13.2 and 13.3 arise as the assumption of exogeneity of the regressors is removed. The models of Section 13.4 arise as the linearity assumption is removed. The models discussed in Sections 13.1 through 13.4 may be properly called regression models (models in which the conditional mean of the dependent variable is specified as a function of the independent variables), whereas those discussed in Sections 13.5 through 13.7 are more general than regression models. Our presentation will focus on the fundamental results. For a more detailed study the reader is referred to Amemiya (1985).

13.1 GENERALIZED LEAST SQUARES

In this section we consider the regression model

(13.1.1)  y = Xβ + u,

where we assume that X is a full-rank T × K matrix of known constants and u is a T-dimensional vector of random variables such that Eu = 0 and

(13.1.2)  Euu' = Σ.

We assume only that Σ is a positive definite matrix. This model differs from the classical regression model only in its general specification of the variance-covariance matrix given in (13.1.2).

13.1.1 Known Variance-Covariance Matrix

In this subsection we develop the theory of generalized least squares under the assumption that Σ is known (known up to a scalar multiple, to be precise). In the remaining subsections we discuss various ways the elements of Σ are specified as functions of a finite number of parameters so that they can be consistently estimated.
Since Σ is symmetric, by Theorem 11.5.1 we can find an orthogonal matrix H which diagonalizes Σ as H'ΣH = Λ, where Λ is the diagonal matrix consisting of the characteristic roots of Σ. Since Σ is positive definite, the diagonal elements of Λ are positive by Theorem 11.5.10. Using these matrices, we define Σ^(-1/2) = HΛ^(-1/2)H', where Λ^(-1/2) = D{λ_i^(-1/2)} and λ_i is the ith diagonal element of Λ. (The reader should verify that Σ^(1/2)Σ^(1/2) = Σ, that Σ^(-1/2)Σ^(-1/2) = Σ⁻¹, and that (Σ^(-1/2))' = Σ^(-1/2) from the definitions of these matrices.)

Premultiplying (13.1.1) by Σ^(-1/2), we obtain

(13.1.3)  y* = X*β + u*,

where y* = Σ^(-1/2)y, X* = Σ^(-1/2)X, and u* = Σ^(-1/2)u. Then Eu* = 0 and

(13.1.4)  Eu*u*' = EΣ^(-1/2)uu'(Σ^(-1/2))' = Σ^(-1/2)Σ(Σ^(-1/2))' = I.

Therefore (13.1.3) is a classical regression model, and hence the least squares estimator applied to (13.1.3) has all the good properties derived in Chapter 12.
The least squares estimator applied to (13.1.3) is called the generalized least squares (GLS) estimator applied to the original model (13.1.1). Denoting it by β̂_G, we have

(13.1.5)  β̂_G = (X*'X*)⁻¹X*'y* = (X'Σ⁻¹X)⁻¹X'Σ⁻¹y.

Using (13.1.1) and (13.1.5), we can readily show that Eβ̂_G = β and

(13.1.6)  Vβ̂_G = (X'Σ⁻¹X)⁻¹.

Since the GLS estimator is the best linear unbiased estimator under the model (13.1.3), and the LS estimator is a linear estimator, GLS is at least as efficient as LS. The consistency and the asymptotic normality of the GLS estimator follow from Section 12.2.

It is important to study the properties of the LS estimator applied to the model (13.1.1), because a researcher may use the LS estimator under the mistaken assumption that his model is (at least approximately) the classical regression model. The LS estimator β̂ is unbiased even under the model (13.1.1). Its variance-covariance matrix, however, is

(13.1.9)  Vβ̂ = E(X'X)⁻¹X'uu'X(X'X)⁻¹ = (X'X)⁻¹X'ΣX(X'X)⁻¹,

which is different from either (13.1.6) or σ²(X'X)⁻¹, and we have

(13.1.10)  (X'X)⁻¹X'ΣX(X'X)⁻¹ ≥ (X'Σ⁻¹X)⁻¹.

Although strict inequality generally holds in (13.1.10), there are cases where equality holds. The LS estimator can also be shown to be consistent and asymptotically normal under general conditions in the model (13.1.1). The above results can be directly verified using theorems in Chapter 11.

Suppose Σ is known up to a scalar multiple, say Σ = σ²Q, where σ² is a positive unknown scalar parameter and Q is a known positive definite matrix (the classical regression model is the special case in which Q = I). Then σ² drops out of formula (13.1.5) and β̂_G = (X'Q⁻¹X)⁻¹X'Q⁻¹y can still be computed.

If Σ is unknown, its elements cannot be consistently estimated unless we specify them to be functions of a finite number of parameters. Let θ be a vector of unknown parameters of a finite dimension, and express the dependence of Σ on θ by the symbol Σ(θ). In each of the models to be discussed, we shall indicate how θ can be consistently estimated. Denoting the consistent estimator by θ̂, we can define the feasible generalized least squares (FGLS) estimator

(13.1.11)  β̂_F = [X'Σ(θ̂)⁻¹X]⁻¹X'Σ(θ̂)⁻¹y.

Under general conditions, β̂_F is consistent and √T(β̂_F − β) has the same limit distribution as √T(β̂_G − β). (See Amemiya, 1985, section 6.2.)

13.1.2 Heteroscedasticity

In the classical regression model it is assumed that the variance of the error term is constant (homoscedastic). Here we relax this assumption and specify more generally that Eu_t² = σ_t², t = 1, 2, . . . , T. This assumption of nonconstant variances is called heteroscedasticity. In this case Σ is a diagonal matrix whose tth diagonal element is equal to σ_t², and the GLS estimator is given a special name, the weighted least squares estimator.

If the variances are unknown, we must specify them as depending on a finite number of parameters. There are two main methods of parameterization. In the first method, the variances are assumed to remain at a constant value σ₁² in the period t = 1, 2, . . . , T₁ and then change to a new constant value σ₂² in the period t = T₁ + 1, T₁ + 2, . . . , T. If T₁ is known, σ₁² and σ₂² can be consistently estimated from the least squares residuals of the two subperiods, and the FGLS estimator can then be computed; it is not difficult to generalize to the case where the variances assume more than two values. If T₁ is unknown, it must be estimated as well, and the computation and the statistical inference become much more complex.
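The following is a minimal numerical sketch of the GLS estimator (13.1.5) and its variance (13.1.6) for a known Σ; it is not from the text, and the names are illustrative. With a diagonal Σ it is exactly the weighted least squares estimator mentioned above.

```python
# GLS: beta_G = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y, with variance (X' Sigma^{-1} X)^{-1}
import numpy as np

def gls(y, X, Sigma):
    Si = np.linalg.inv(Sigma)
    A = X.T @ Si @ X
    beta_g = np.linalg.solve(A, X.T @ Si @ y)
    return beta_g, np.linalg.inv(A)

# Example with heteroscedastic (diagonal) Sigma, i.e. weighted least squares.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(40), rng.normal(size=40)])
sig2 = np.linspace(0.5, 3.0, 40)              # known variances sigma_t^2
y = X @ np.array([2.0, 1.0]) + rng.normal(size=40) * np.sqrt(sig2)
beta_g, var_g = gls(y, X, np.diag(sig2))
```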
In the second method, it is specified that

(13.1.12)  σ_t² = g(z_t'α),

where g(·) is a known function, z_t is a vector of known constants not necessarily related to x_t, and α is a vector of unknown parameters. Goldfeld and Quandt (1972) considered the case where g(·) is a linear function and proposed estimating α consistently by regressing the squared least squares residuals û_t² on z_t, where {û_t} are the least squares residuals defined in Chapter 12. If g(·) is nonlinear, û_t² must be treated as the dependent variable of a nonlinear regression model (see Section 13.4 below). See Goldfeld and Quandt (1976) for further discussion of this case. Using the resulting estimates of σ_t², we can compute the FGLS estimator.

Even if we do not specify σ_t² as a function of a finite number of parameters, we can consistently estimate the variance-covariance matrix of the LS estimator given by (13.1.9). Let {û_t} be the least squares residuals and define the diagonal matrix D whose tth diagonal element is equal to û_t². Then the heteroscedasticity-consistent estimator of (13.1.9) is defined by

(X'X)⁻¹X'DX(X'X)⁻¹,

which under general conditions can be shown to converge appropriately to (13.1.9). See Eicker (1963) and White (1980).

13.1.3 Serial Correlation

In this section we allow a nonzero correlation between u_t and u_s for s ≠ t in the model (13.1.1). Correlation between the values at different periods of a time series is called serial correlation or autocorrelation. It can be specified in infinitely various ways; here we consider one particular form of serial correlation associated with the stationary first-order autoregressive model. It is defined by

(13.1.15)  u_t = ρu_{t−1} + ε_t,  t = 1, 2, . . . , T,  |ρ| < 1,

where {ε_t} are i.i.d. with Eε_t = 0 and Vε_t = σ², and the initial value u₀ is independent of ε₁, . . . , ε_T with Eu₀ = 0 and Vu₀ = σ²/(1 − ρ²).

Taking the expectation of both sides of (13.1.15) for t = 1 and using our assumptions, we see that Eu₁ = ρEu₀ + Eε₁ = 0; repeating the same procedure for t = 2, 3, . . . , T, we conclude that Eu_t = 0 for all t. Taking the variance of both sides of (13.1.15), and noting that u_{t−1} and ε_t are independent, we conclude that Vu_t = σ²/(1 − ρ²) for all t. Next we evaluate the covariances of {u_t}: multiplying both sides of (13.1.15) by u_{t−1} and taking the expectation, we obtain Eu_tu_{t−1} = ρσ²/(1 − ρ²); repeating this process, we obtain

(13.1.20)  Eu_t u_{t−s} = σ²ρ^s/(1 − ρ²)  for all t and s ≥ 0.

These conditions, a constant mean, a constant variance, and covariances that depend only on the time difference, constitute stationarity (more precisely, weak stationarity).
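A small sketch, not from the text, that builds the variance-covariance matrix implied by (13.1.20) for a stationary AR(1) error; the names are illustrative.

```python
# Sigma for a stationary AR(1) error: E u_t u_s = sigma^2 rho^{|t-s|} / (1 - rho^2)
import numpy as np

def ar1_covariance(T, rho, sigma2):
    t = np.arange(T)
    return sigma2 / (1.0 - rho ** 2) * rho ** np.abs(t[:, None] - t[None, :])

Sigma = ar1_covariance(T=5, rho=0.6, sigma2=1.0)   # 5 x 5 example
```

This matrix could be passed directly to a GLS routine such as the sketch given earlier.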
In matrix notation, the conditions above can be written as Euu' = Σ, where Σ has σ²ρ^|t−s|/(1 − ρ²) in its (t, s)th position. Its inverse can be shown to satisfy

(13.1.22)  Σ⁻¹ = (1/σ²) R'R,

where, except for the first row, premultiplication of a T-vector z = (z₁, z₂, . . . , z_T)' by R performs the operation z_t − ρz_{t−1}, and the first row performs the operation √(1 − ρ²) z₁. Using R, we can write the GLS estimator (13.1.5) as

(13.1.25)  β̂_G = (X'R'RX)⁻¹X'R'Ry.

Thus the GLS estimator is computed as the LS estimator after this operation is performed on the dependent and the independent variables. Note that σ² need not be known, because it drops out of the formula (13.1.25). The asymptotic distribution of the GLS estimator is unchanged if the first row of R is deleted in defining the estimator by (13.1.25).

We consider next the estimation of ρ in the regression model defined by (13.1.1) and (13.1.15). If {u_t} were observable, we could estimate ρ by the LS estimator applied to (13.1.15), that is,

(13.1.28)  ρ̂ = Σ_{t=2} u_t u_{t−1} / Σ_{t=2} u_{t−1}².

Since (13.1.15) itself cannot be regarded as the classical regression model (u_{t−1} cannot be regarded as nonstochastic), ρ̂ does not possess all the properties of the LS estimator under the classical regression model; in particular, it can be shown that ρ̂ is generally biased. But it can also be shown that ρ̂ is consistent, and its asymptotic distribution is given by

(13.1.29)  √T(ρ̂ − ρ) → N(0, 1 − ρ²).

Since {u_t} are in fact unobservable, it should be reasonable to replace them in (13.1.28) by the least squares residuals û_t = y_t − x_t'β̂, where β̂ is the LS estimator, and define ρ̃ accordingly. It can be shown that ρ̃ is consistent and has the same asymptotic distribution as ρ̂ given in (13.1.29). Inserting ρ̃ into R in (13.1.25), we can compute the FGLS estimator.

In the remainder of this section we consider the test of independence against serial correlation. That is, we take the classical regression model as the null hypothesis and the model defined by (13.1.1) and (13.1.15) as the alternative hypothesis; this test is equivalent to testing H₀: ρ = 0 versus H₁: ρ ≠ 0 in (13.1.15). It is customary to use the Durbin-Watson statistic

(13.1.30)  d = Σ_{t=2} (û_t − û_{t−1})² / Σ_{t=1} û_t²,

which is approximately equal to 2 − 2ρ̃, rather than ρ̃ itself, because its distribution can be more easily computed. Before the days of modern computer technology, researchers used the table of the upper and lower bounds of the statistic compiled by Durbin and Watson (1951); today the exact p-value of the statistic can be computed.

Many economic variables exhibit a pattern of serial correlation similar to that in (13.1.15); to the extent that the error term of a regression may be regarded as the sum of omitted independent variables, (13.1.15) is an empirically useful model. If we believe that {u_t} follow a higher-order autoregressive process, say a pth order autoregressive model u_t = Σ ρ_j u_{t−j} + ε_t, we should appropriately modify the definition of R used in (13.1.25). Another important process that gives rise to serial correlation is the moving-average process. Computation of the GLS estimator is still possible in this case, but with more difficulty than for an autoregressive process; a moving-average process can, however, be well approximated by an autoregressive process as long as its order is taken high enough.
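The following is a minimal sketch, not from the text, of the feasible procedure just described: estimate ρ from LS residuals as in (13.1.28), compute the Durbin-Watson statistic, and apply the quasi-difference transformation implied by R before re-running LS. The names are mine, and the first observation is kept with the √(1 − ρ²) weighting.

```python
# FGLS under AR(1) errors: estimate rho from residuals, transform, re-run LS
import numpy as np

def ar1_fgls(y, X):
    b_ls = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b_ls                                    # LS residuals
    rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])          # as in (13.1.28)
    dw = np.sum(np.diff(u) ** 2) / (u @ u)              # Durbin-Watson, about 2(1 - rho)
    w = np.sqrt(1.0 - rho ** 2)
    y_star = np.concatenate([[w * y[0]], y[1:] - rho * y[:-1]])
    X_star = np.vstack([w * X[0], X[1:] - rho * X[:-1]])
    b_fgls = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
    return b_fgls, rho, dw
```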
because it does not require estimation of the y's.32) in the form of (13.2 1 Time Series Rwesslon I 325 decide how to write (13.32) in the form of (13.. .5). with E E ~ = 0 and V E = ~ u2. using the consistent estimators of y l .1.26) only in that the {y. . we need the following definition..1. The model was extensively used by econometricians in the 1950s and 1960s. = u121. see Amemiva and Fuller (1967). under general conditions. or the Koyck lag.6) are distributed over j. .2).U. obtained earlier.and z.2). this problem arises whenever plim T-'X'U # 0 in the regression model y = XP u. u~(x'x)-']. ~ and . The asymptotic variancecovariance matrix above suggests that.1). This particular lag distribution is referred to as the geometric lag.3 SIMULTANEOUS EQUATIONS MODEL This model can be equivalently written as m (13. plim T-' z ~ = ~ # ~ 0. the more S is correlated with X. which tells her .2. Then.5) to (13. loosely speaking. in (13.2. the above result shows that even though (13. is a special case of (13.3. Vu2 = U ~ Iand .2. where Z = Euu'. = p-1 q.2.3) Section 13.3).3 1 Simultaneous Equations Model 327 The LS estimator fi = (Y'Y)-'Y'~ is generally biased but is consistent with the asymptotic distribution (13. we can asymptotically treat it as if' it were.2. ] are serially correlated. u~(Y'Y)-'].4).6) is called the inversion of the autoregressive process. Essentially the same asymptotic results hold for this model as for (13. and ul and u2 are unobservable random variables such that Eul = Eu2 = 0. A buyer comes to the market with the schedule (13. For a more efficient set of instrumental variables.3.29).2. Since Theorem 12. we have where Z is a known nonstochastic matrix.2.2) y 2 yzy. Although it was more frequently employed in macroeconomic analysis. + where yl and y2 are T-dimensional vectors of dependent variables. It should be noted that the nonstochasticness of S assures plim T-~S'U= 0 under fairly general a s sumptions on u in spite of the fact that the u are serially correlated. although the results are more difficult to prove.2.1. In this case.=o A study of the simultaneous equations model was initiated by the researchers of the Cowles Commission at the University of Chicago in the 1940s.2. The transformation from (13.3. In such a case we can consistently estimate mental variables (IV) estimator defined by P by the instru- fi N [ p . A similar transformation is possible for a higher-order process.(13. which tells him what price (yl) he should offer for each amount (y2) of a good he is to buy at each time period t. Note that (13.5) presents a special problem if { E . We now consider a simple special case of (13.2) is not a classical regression model.2.4) as if it were a classical regression model with the combined regressors X = (Y. See a survey of the topic by Griliches (1967).6) y.2) by including the independent variables on the right-hand side as where S is a known nonstochastic matrix of the same dimension as X.4 implies that 0 A N[P.2. A seller comes to the market with the schedule (13.3. + x2P2 + up. such that plim T-'S'X is a nonsingular matrix. It is useful to generalize (13. The reverse transformation is the inversion of the moving-average process. we can asymptotically treat (13. the better. the above consideration suggests that z . Z).- - 326 13 1 Econometric Models 13.2. corresponding to the tth element of the vector. In general. Consider where w . The term "distributed lag" describes the manner in which the coefficients on q-. 
Economists call this model the distributed-lag model.We give these equations the following interpretation. EU. we shall illustrate it by a supply and demand model. X1 and X2 are known nonstochastic matrices.5). We shall encounter another such example in + and . as it originated in the work of Koyck (1954). That is. VuI = U:I.-1 constitute a reasonable set of instrumental variables. = -yCdzt-. To return to the specific model (13.2. The estimation of p and -y in (13.~ therefore E ~ the LS estimators of p and y are not consistent.+ w ~ .2. 13. and (13.3. we assume the joint normality of ul and u2. Since a reduced-form equation constitutes a classical regression model.l . Solving the above two equations for yl and y2. discussed in Section 13. A structural equation describes the behavior of an economic unit such as a buyer or a seller.rrl or m2.3.3.3. m2. and v2 are appropriately defined. plim T(Z'PZ)-' 5 plim T ( s ' z ) .l ~ ' ~ ( ~ ' ~ ) .we can derive the likelihood function from equations (13. ii2) = g(yl.1) y2 is correlated with ul because y on ul. we obtain the two-stage least kquares (2SLS) estimator = x(x'x)-'x'.6).3.X) and a = (y.3. we consider the consistent estimation of the parameters of structural equations. Then.9).3.3. the equilibrium price and quantity-are determined. p2). the values of yl and y2 that satisfy both equations simultaneously-namely. whereas a reduced-form equation represents a purely statistical relatic--ship. Express that fact as (13.) is one-to-one. and a's yields a consistent and asymptotically efficient estimator. rewrite the equation as We call (13. If.328 13 1 Econometric Models 13. Consider the estimation of y1 and PI in (13. It can be shown that (13..6) y 2 = Xm2: * V2. Rewrite the reduced-form equations as where Z = (y2.7) k2s A N [ a . = g(y1.1). Note that . any solution to the equation ( e l .6).2.3.6) yields a consistent estimator of .nl and . For example. Then the instrumental variables (W) estimator of a is defined by Under general conditions it is consistent and asymptotically (13.3. If m a p ping g ( .3) and (13. ) is many-toane.rr2 are functions of y's and P's. PI. Maximizing that function with respect to y's. however.1) and (13.3. we can uniquely determine In other words.10) ikw A N[a. Next. c r : (s'z)-~s's(z's)-~]. Let S be a known nonstochastic matrix of the same size as Z such that plim T-'S'Z is nonsingular.3. or trial and error). Let X be as defined after (13.3. Nowadays the simultaneous equations model is not so frequently used . p's. we obtain (provided that Y1Y2 # 1): (13.2) the structural equations and (13. 72.3.3 1 Simultaneous Equations Model 329 how much (y2) she should sell at each value (yl) of the price offered at each t . It is consistent and asymp totically (13.4) the reduced-fm equations.3.5) and (13.3. and define the projection matrix P If we insert S = PZ on the right-hand side of (13. This estimator was proposed by Theil (1953). P2).4).where eland ii2 are the LS estimators.3. P1. If mapping g(.5) or (13. y2. A salient feature of a structural equation is that the LS estimator is inconsistent because of the correlation between the dependent variable that appears on the right-hand side of a regression equation and the error 2 depends term. as we can see in (13. vl. the help of an auctioneer.3. is still consistent but in general not efficient. by some kind of market mechanism (for example. the LS estimator applied to (13. 
the twostage least squares estimator is asymptotically more efficient than any instrumental variables estimator.3) y1 = 1 1 @lP1 + X2~1P2 + U1 + ~ 1 ~ 2 ) and the estimates of y's and p's from the LS estimators of ml and n2. known as the full information maximum likelihood estimator A simple consistent estimator of y's and p's is provided by the instrumental variables method.3.and the resulting estimators are expected to possess desirable properties. in (13.12) where X consists of the distinct columns of XI and X2 after elimination of any redundant vector and ml.13) ( ~ 1m2) . PI)'. for example. For this purpose. and hence of vl and v2. u:(z'Pz)-'1. or S.d. Another example is the CES production function (see Arrow et al. for the simplest such model and Fuller (1987) for a discussion in depth. Let be the initial value. is smaller. may not necessarily be of the same dimension as P.3. The case when the researcher knows is called sample separatiolz. and u .16) with the equilibrium condition D. especially when data with time intervals finer than annual are used. the Gauss-Newton method. The parameters of this model can be consistently and efficiently estimated by the maximum likelihood estimator.2)...4. = ylPt + X& + ult where f.3. = 0 and V u . Let us illustrate. f .3.@) in a Taylor as series around p = 0 pl Bl . J ' . K.3. There are two different likelihood functions. capital input. again with the supply and demand model.3. 1961): where Dl is the quantity the buyer desires to buy at price P. for the tth element. We can write (13. = min (Dl. with Eu.] are i.3. is the quantity the seller desires to sell at price Pt.i.5. where xt is a vector of exogenous variables which. Expand f. we can estimate a2by The nonlinear regression model is defined by The estimators and 6 ' can be shown to be the maximum likelihood estimators if (u.2. An example of the nonlinear regression model is the Cobb-Douglas production function with an additive error term. The computation of the maximum likelihood estimator in the second instance is cumbersome.3 can be used for this purpose.1) in vector notation as This is the disequilzbn'um model proposed by Fair and Jaffee (1972). is specifically designed for the nonlinear regression model. and L denote output. We do not observe Dl or St.. Although the simultaneous equations model is of limited use. In practice we often specify and {u. which is defined by (13. 13. p). depending on whether the research knows which of the two variables D. = I ft($) = f (xt..16) Q. The minimization of ST(@) must generally be done by an iterative method.(-) is a known function.4 1 Nonlinear Regression Model 331 as in the 1950s and 1960s. and S. See Chapter 11. One reason is that a multivariate time series model has proved to be more useful for prediction than the simultaneous equations model. Consider (13.St). = St leads to a simultaneous equations model similar to (13. and u are T-vectors having y. unlike the linear regression model. we have the case of no sample separation. We have already seen one such example in Section 13. Note that replacing (13.1) and (13. be it an estimator or a mere guess. Exercise 5. when the researcher does not know. but instead observe the actually traded amount Q. f.2.330 13 1 Econometric Models 13. respectively. estimators such as the instrumental variables and the twestage least squares are valuable because they can be effectively used whenever a correlation between the regressors and the error term exists. 
The nonlinear least squares (NLLS) estimator of P is defined as the value of p that minimizes Denoting the NLLS estimator by 0. .] are assumed to be jointly normal. respectively. Another reason is that a disequilibrium model is believed to be more realistic than an equilibrium model.4 NONLINEAR REGRESSION MODEL where y.14) D.. The derivation is analogous to the linear case given in Section 12. and where Q. Another iterative method. Another example is the error-in-vanab. and labor input. The Newton-Raphson method described in Section 7.ks model. p is a K-vector of unknown parameters. and P is a vector of unknown parameters. In this book we consider only models that involve a single dependent variable.7) into the right-hand side of (13.2. as well as many other issues not dealt with here. The second-round estimator of the iteration.5. In Section 13. takes the values 1 or 0. whereas Newton-Raphson requires the second derivatives as well.2.is obtained as the LS estimator applied to (13. b2.. x. multinomial. y.8). where it was used to analyze phenomena such as whether a patient was cured by a medical treatment.9) is that of variables. how many cars a household owns.4.4. whether o r not a worker is unemployed at a given time. multivariate. It is simpler than the Newton-Raphson method because it requires computation of only the first derivatives off. but differs from a regression model in that not all of the information of the model is fully captured by specifying conditional means and variances of the dependent variables. chapter 9). We can show that under general assumptions the NLLS estimator @ is consistent and The above result is analogous to the asymptotic normality of the LS estimator given in Theorem 12. and the vector x. 13. If. however.5 QUALITATIVE RESPONSE MODEL The qualitative response model or discrete variables model is the statistical model that specifies the probability distribution of one or more discrete dependent variables as a function of independent variables.4. and in Section 13. or whether insects died after the administration of an insecticide.4. Many of these data are discrete.4. where the dependent variable takes more than two values.1 Binary Model i F t j I 9 2 i i / i 1 i We formally define the univariate &nary model by The asymptotic variancecovariance matrix above is comparable to formula (12. where the dependent variable takes two values.5.5. the x. / apr is a K-dimensional row vector whose jth element is the derivative off. The &fference is that df/dpr depends on the unknown parameter p and hence is unknown. The multivariate model. As in a regression model. It is analogous to a regression model in that it characterizes a relationship between two sets where we assume that y. 13. treating the entire left-hand side as the dependent variable and aft/aprlsl as the vector of regressors. as extensive sample survey data describing the behavior of individuals have become available. = 1 represents the fact that the ith person buys a car. for example. provided that we use a f / ~ ? p 'for ( ~ X.need not be the original variables such as price and income.4. Recently the model has gained great popularity among econometricians.1) and rearranging terms. the next two.1 we examine the binary model. We can test hypotheses about in the nonlinear regression model by the methods presented in Section 12. The qualitative response model originated in the biometric field.22) for the LS estimator. 
and by what mode of transportation during what time interval a commuter travels to his workplace. what type of occupation a person's job is considered.4.2 we look at the multinomial model. whereas X is assumed to be known.2.will include among other factors the price of the car and the person's income. The practical implication of (13.7) holds approximately because the derivatives are evaluated by Inserting (13. and the last. with respect to the jth element of p. we apply the model to study whether or not a person buys a car in a given year. The assumption that y takes the . given the independent variables. Note that (13. F is a known distribution function. The iteration is repeated until it converges. they could be functions of the original variables. The first two examples are binary. The following are some examples: whether or not a consumer buys a car in a given year. Note that df/dpr above is just like X in Theorem 12. we obtain: Dl.is a known nonstochastic vector.5 1 Qualitative Response Model 333 where df.332 13 1 Econometric Models 13.4. The same remark holds for the models of the subsequent two sections. are discussed at an introductory level in Arnemiya (1981) and at a more advanced level in Arnemiya (1985. 4. which for convenience we associate with three integers 1. but the econometrician assumed it to be F.5 1 Qualitative Response Model 335 values 1 or 0 is made for mathematical convenience.and U. where X I . and (ul.- 334 13 1 Econometric Models 13. The following two distribution functions are most frequently used: standard normal @ and logistic A. which may be regarded as the omitted independent variables known to the decision maker but unobservable to the statistician.3) is straightforward. The two distribution functions have similar shapes. > Uo.1) can be derived from the principle of utility maximization as follows. and %.5.2 and '7. Model (13.5. as he normally would in a regression model. The standard normal distributionfunction (see Section 5. We assume that (13. the vector aF/ax. The essential features of the model are unchanged if we choose any other pair of distinct real numbers. with ah/&. The asymptotic distribution of the 3 is given by maximum likelihood estimator f where and the logistic distribution function is defined by where f is the density function of F. suppose that the true distribution function is G.d.3. random variables.> uo.d. To the extent that the econometrician experiments with various transformations of the original independent variables.5..) appropriately. 13...5.5.) are bivariate i. Therefore. sample here because xi varies with i. Then.1) is by the maximum likelihood method.In most cases these two derivatives will take very similar values. When @ is used in (13. %.5. be the ith person's utilities associated with the alternatives 1 and 0. Let U1. Another example is the worker's choice of three types of employment: being fully employed. The best way to estimate model (13.1). The important quantity is.u l i is F and define xi = xli . rather. If one researcher fits a given set of data using a probit model and another researcher fits the same data using a logit model.1) the regression coefficients p do not have any intrinsic meaning. except that the logistic has a slightly fatter tail than the standard normal.5. he can always satisfy G(X: P) = ~ [ h ( x P)] : .i. 2.2) Uli = xiif3 + uli and Uoj = x b i ~ + %i.. To see this. We must instead compare a@/ax.3) It is important to remember that in model (13. Although we do not have the i. and self-employed.5. 
it would be meaningless to compare the two estimates of P. Thus we have (13. We assume that the zth person chooses alternative 1 if and only if Ul. The likelihood function of the model is given by P(y.2 Multinomial Model I We illustrate the multinomial model by considering the case of three alternatives. partially employed. the choice of F is not crucial.. the model is called logit. by choosing a function h ( . the likelihood function is globally concave in P.i. and train.. where the three alternatives are private car. maximizing L with respect to P by any standard iterative method such as the Newton-Raphson (see Section 7.2) is defined by When F is either @ or A. One example of the three-response model is the commuter's choice of mode of transportation. bus. the model is called probit.1) if we assume that the distribution function of uoi ..3. = 1 ) = P(U1.) We obtain model (13.x. when A is used. we can prove the consistency and the asymptotic normality of the maximum likelihood estimator by an argument similar to the one presented in Sections 7. respectively. . are nonstochastic and known.4. and 3. We can assume without loss of generality that their means are zeroes and one of the variances is unity.d. which can be mathematically stated as where Pli = P(yi = 1 ) and P2. 272).336 13 1 Econometric Models 13. although an advance in the simulation method (see McFadden. if we represent the ith person's discrete choice by the variable y. > Ul. up. In the normal model we must evaluate the probabilities as definite integrals of a joint normal density. An iterative method must be used for maximizing the above with respect to P and 0.. It is assumed that the individual chooses the alternative with the largest utility. The probabilities are explicitly given by where (ul. U21> U3. u3. Let us consider whether or not this assumption is reasonable in the two examples we mentioned at the beginning of this section. . Besides the advantage of having explicit formulae for the probabilities. the likelihood hnction of the model is given by This model is called the multinomial logit model..5. .by yJ.2) to the case of three alternatives as the errors are mutually independent (in addition to being independent across i ) and that each is distributed as This was called the Type 1 extreme-value distribution by Johnson and Kotz (1970.. we should expect inequality < to hold in the place of equality in (13. = j. this model has the computational advantage of a globally concave likelihood function. 2. respectively. It is easy to criticize the multinomial logit model from a theoretical point of view.5 1 Qualitative Response Model 337 We extend (13.5. j = 1.. In the first example.. to cite McFadden's well-known example.)are i. Given that a person has not chosen red bus. If we specify the joint distribution of (ul. and private car. 2. and suppose that a person is known to have chosen either bus or car. It is perhaps reasonable to surmise that the nonselection of train indicates the person's dislike of public transportation. Given this information. we might expect her to be more likely to choose car over bus.i. One way to spec* the distribution of the u9swould be to assume them to be jointly normal. train. instead of bus and train. Therefore.5. the model implies independence from irrelmant alternatives.13) means that the information that a person has not chosen alternative 2 does not alter the probability that the person prefers 3 to 1. p. 2.) The equality (13.). 
He assumed that and similar equalities involving the two other possible pairs of utilities. and the latter because multiplication of the three utilities by an identical positive constant does not change their ranking.. t of the errors that makes McFadden (1974) proposed a ~ o i ndistribution possible an explicit representation of the probabilities. u2. 1989) has made the problem more manageable than formerly. our model is defined by P(yz = 1) = P(ulz > U21. If this reasoning is correct. First. An analogous model based on the normality assumption was estimated by Hausman and Wise (1978). We should generally allow for nonzero correlation among the three error terms. .uQt) up to an unknown parameter vector 0. we can express the above probabilities as a function of p and 0.13). . 1. Second. = 1 if y. This argument would be more convincing if alternatives 1 and 2 corresponded to blue bus and red bus. = 2) = P(U2. it is likely that she . (We have suppressed the subscript z above to simplify the notation. This is cumbersome if the number of alternatives is larger than five. ulr > U31) P(y. and 3 correspond to bus. suppose that alternatives 1.n. no economist is ready to argue that the utility should be distributed according to the Type I extreme-value distribution. If we define binary variables yJ. The former assumption is possible because the nonstochastic part can absorb nonzero means. = P(yi = 2 ) . This model is called the censmed regression model or the Tobit model (after Tobin. . In the aforementioned examples. that is. The joint distribution was named Gumbel's Type B bivariate extreme-value distribution by Johnson and Kotz (1972. Clearly. It is apparent there that the LS estimator of the slope coefficient obtained by regressing .5. suppose that alternatives 1. it is natural to pair bus and train or fully employed and partially employed.5. see McFadden (1981) or Amemiya (1985. where z. or u2 to infinity. If the observations corresponding to yT 5 0 are totally lost. lists several representative applications. + p log q]. N(0.6 CENSORED OR TRUNCATED REGRESSION MODEL (TOBIT MODEL) Tobin (1958) proposed the following important model: (136.12) as a purely statistical model. Any multinomial model can be approximated by a multinomial logit model if the researcher is allowed to manipulate the nonstochastic parts of the utilities. Tobin used this model to explain a household's expenditure (y) on a durable good in a given year as a function of independent variables (x). p. if' p = 1 the model is reduced to the multinomial logit model.)are not observed whenever y.16) P& = 1 or 2) = A[(xzi . It is possible to generalize the multinomial logit model in such a way that the assumption of independence from irrelevant alternatives is removed.i.6. Suppose that u3 is distributed as (13. 2.]and {x.xs. but u 1 and u 2 follow the joint distribution and (13.11). including the price of the durable good and the household's income. It is assumed that {y. we can readily see that each marginal distribution is the same as (13. a2)and x.5 1 msared or Truncated Regression Model 339 In the second example. and 3 correspond to fully employed. p. if y : >0 z = 1 . and self-employed.3.6). .11) and independent of u. . The Tobit model has been used in many areas of economics.1) y. in analogy to probit). ify.5.d. For generalization of the nested logit model to the case of more than three alternatives and to the case of higher-level nesting.5.*>O.]are observed for all i.and U Q .338 13 1 Econometric Models 13. If. 256). 
and it is hypothesized f the desired expenditure that a household does not buy the durable good i is zero or negative (a negative expenditure is not possible).13).3. .The parameter p measures the (inverse) degree of association between ul and U .5 and 9.1.5.5. such that p = 1 implies independence. The variable y interpreted as the desired amount of expenditure. and (13.2) y. partially employed. The probabilities of the above three-response nested logit model are specified by where (u. In a given practical problem the researcher must choose a priori which two alternatives should be paired in the nested logit model. we view (13. Again. we would expect inequality < in (13.* 5 0. to the extent that the nonselection of "partially employed" can be taken to mean an aversion to work for others. not necessarily derived from utility maximization. = x : p =0 + u. if {x. yet the probabilities can be explicitly derived. it is much more general than it appears. Amemiya (1985). but {y:] are unobserved if y4 5 0.1) as long as the researcher experiments with various transformations of the independent variables. the observed data on y and x in the Tobit model will normally look like Figure 13. precisely for the same reason that the choice of F does not matter much in (13. Therefore it is useful to estimate this model and test the hypothesis p = 1.]are assumed to be i. n . and if the researcher does not know how many obser: 5 0.)'P exp[(x~. sections 9. .* = xi6 + u. the model is called the truncated regmszon vations exist for which y model. however. The above model is necessitated by the fact that there are likely to be many : may be households for which the expenditure is zero. We shall explain the nested logit model proposed by McFadden (1977) in the model of three alternatives.XZ~)'P/PI f 1. By taking either u.2. 365. = 13. is a known nonstochastic vector. If there is a single independent variable x. . section 10. It is assumed that xl. yet the maximum likelihood estimator can be shown to be consistent and asymptotically normal.> 01.@(x:P/u)l n 1 where (ul. : is observed and that y2*. see Amemiya (1985.5) yields the maximum likelihood estimator of Pl/~l. Since a test of Type 1 versus Type 2 cannot be translated as a test about the parameters of the Type 2 Tobit model.1 Anexampleofcensoreddata and all the y's (including those that are zeroes) on x will be biased and inconsistent. . > 0.I y .7) u -1 +[(y. only the sign of y . For discussion of these cases. is observed only when y?.aP2 + u2t FIGURE 13. respectively.6) classifies them into five broad types. It is a peculiar product of a probability and a density.6. For the proof. The likelihood function of the model is given by (13. .d..) are i. ..xlP)/ul.~Ux:P)/u].) are serially correlated (see Robinson.3) L = n 0 [l . Amemiya (1985. ~~[(~. Type 2 Tobit is the simplest natural generalization of the Tobit model (Type 1) and is defined by (13.] is either nonnormal or hetercscedastic. 1982). when the true distribution of {u. The Tobit model (Type 1) is a special case of the Type 2 Tobit. while the probit maximum likelihood estimator applied to the first equation of (13. up. respectively.. however. are independent.i. Many generalizations of the Tobit model have been used in empirical research. = 0 and y21Z 0. The likelihood function of the truncated regression model can be written as (13. it can further be shown that the LS estimator using only the positive y's is also biased and inconsistent. on xzayields the maximum likelihood estimators of P2 and oi. 
The likelihood function of the truncated regression model can be written as

(13.6.4)   L = Π σ^{-1} φ[(y_i − x_i'β)/σ] / Φ(x_i'β/σ),

where the product is taken over all the (necessarily positive) observations. Olsen (1978) proved the global concavity of (13.6.3). The Tobit maximum likelihood estimator remains consistent even when {u_i} are serially correlated (see Robinson, 1982). It loses its consistency, however, when the true distribution of {u_i} is either nonnormal or heteroscedastic. Amemiya (1973) proved the consistency and the asymptotic normality of the Tobit maximum likelihood estimator; in the truncated regression model, too, the maximum likelihood estimator can be shown to be consistent and asymptotically normal, and for the proof see Amemiya (1973) as well.

Many generalizations of the Tobit model have been used in empirical research. Amemiya (1985, section 10.6) classifies them into five broad types, of which we shall discuss only Types 2 and 5. Type 2 Tobit is the simplest natural generalization of the Tobit model (Type 1) and is defined by

(13.6.5)   y_1i* = x_1i'β_1 + u_1i,
           y_2i* = x_2i'β_2 + u_2i,
           y_2i = y_2i*   if y_1i* > 0,
           y_2i = 0       if y_1i* ≤ 0,   i = 1, 2, . . . , n,

where (u_1i, u_2i) are i.i.d. drawings from a bivariate normal distribution with zero means, variances σ_1² and σ_2², and covariance σ_12. It is assumed that {x_1i} are observed for all i, but that x_2i need not be observed for those i such that y_2i = 0; of y_1i* only the sign is observed. The likelihood function of this model is given by

(13.6.6)   L = Π_0 P(y_1i* ≤ 0) · Π_1 f(y_2i | y_1i* > 0) P(y_1i* > 0),

where f(· | y_1i* > 0) stands for the conditional density of y_2i* given y_1i* > 0, and Π_0 and Π_1 stand for the product over those i for which y_2i = 0 and y_2i ≠ 0, respectively.

The Tobit model (Type 1) is a special case of the Type 2 Tobit, in which y_1i* = y_2i*. Another special case of Type 2 is the one in which u_1i and u_2i are independent. In this case the LS regression of the positive y_2i on x_2i yields the maximum likelihood estimators of β_2 and σ_2², while the probit maximum likelihood estimator applied to the first equation of (13.6.5) yields the maximum likelihood estimator of β_1/σ_1. Since a test of Type 1 versus Type 2 cannot be translated into a test about the parameters of the Type 2 Tobit model, the choice between the two models must be made in a nonclassical way.
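Under the bivariate normality assumption, (13.6.6) can be evaluated with univariate normal probabilities by using the factorization f(y_2) P(y_1* > 0 | y_2) for the observed part. The sketch below is an illustration only, with σ_1 normalized to one (only β_1/σ_1 is identified) and the correlation reparameterized for numerical convenience; it is not reproduced from the text.

```python
import numpy as np
from scipy.stats import norm

def type2_tobit_negloglik(params, x1, x2, y2, d):
    """Negative log-likelihood of the Type 2 Tobit model, a sketch of (13.6.6).
    d[i] = 1 if y_2i is observed (y_1i* > 0) and 0 otherwise.
    sigma_1 is normalized to 1; rho is the correlation of (u_1, u_2)."""
    k1, k2 = x1.shape[1], x2.shape[1]
    b1 = params[:k1]
    b2 = params[k1:k1 + k2]
    s2 = np.exp(params[k1 + k2])
    rho = np.tanh(params[k1 + k2 + 1])            # keeps the correlation in (-1, 1)

    xb1 = x1 @ b1
    xb2 = x2 @ b2
    # Censored observations: P(y_1* <= 0) = Phi(-x_1'b_1).
    ll_cens = norm.logcdf(-xb1[d == 0])
    # Observed observations: f(y_2) * P(y_1* > 0 | y_2).
    z = (y2[d == 1] - xb2[d == 1]) / s2
    cond_mean = xb1[d == 1] + rho * z             # E[y_1* | y_2] with sigma_1 = 1
    cond_sd = np.sqrt(1.0 - rho ** 2)
    ll_obs = norm.logpdf(z) - np.log(s2) + norm.logcdf(cond_mean / cond_sd)
    return -(ll_cens.sum() + ll_obs.sum())
```

Minimizing this function with a numerical optimizer gives the maximum likelihood estimates. When ρ = 0 the two parts of the log-likelihood separate into a probit part and an ordinary regression part, which is exactly the independent special case discussed above.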
Type 5 Tobit is defined by

(13.6.8)   z_ji* = x_ji'γ_j + v_ji,   y_ji* = x_ji'β_j + u_ji,   j = 1, 2, . . . , J,   i = 1, 2, . . . , n,
           y_i = y_ji*   if z_ji* = max_{1≤k≤J} z_ki*.

It is assumed that (u_ji, v_ji) are i.i.d. across i but may be correlated across j, and that for each i and j the two random variables may be correlated with each other; their joint distribution is variously specified by researchers. The likelihood function of the model is given by

(13.6.9)   L = Π_j Π_{i∈I_j} f(y_ji | z_ji* is the maximum) P_ji,

where Π_{i∈I_j} is the product over those i for which z_ji* is the maximum and P_ji = P(z_ji* is the maximum). In some applications the maximum in (13.6.8) can be replaced by the minimum without changing the essential features of the model. The disequilibrium model defined earlier in this chapter becomes this type if we assume sample separation.

Models of this form, or closely related forms, have been used in several empirical studies. In the work of Gronau (1973), z_i* represents the offered wage minus the reservation wage (the lowest wage the worker is willing to accept) and y_i* represents the offered wage; only when the offered wage exceeds the reservation wage do we observe the actual wage, which is equal to the offered wage. In the model of Lee (1978), z_1i* represents the wage rate of the ith worker in case he joins the union and z_2i* his wage rate in case he does not, and the researcher observes the actual wage rate y_i, which is the greater of the two. (We have slightly simplified Lee's model.) In the work of Dudley and Montmarquette (1976), z_i* signifies the measure of the U.S. inclination to give aid to the ith country, so that aid is given if z_i* is positive, and y_i* determines the actual amount of aid. In the model of Dubin and McFadden (1984), z_ji* is the utility of the jth portfolio of electric and gas appliances of the ith household, and the vector y_ji consists of the gas and electricity consumption associated with the jth portfolio in the ith household. In the model of Duncan (1980), z_ji* is the net profit accruing to the ith firm from the plant to be built in the jth location, and y_ji is the input-output vector at the jth location.

13.7  DURATION MODEL

The duration model purports to explain the distribution function of a duration variable as a function of independent variables. The duration variable may be human life (for example, how long a patient lives after an operation), the life of a machine, or the duration of unemployment. As is evident from these examples, the duration model is useful in many disciplines, including medicine, engineering, and economics. Introductory books on duration analysis emphasizing each of these areas of application are Kalbfleisch and Prentice (1980), Miller (1981), and Lancaster (1990). We shall initially explain the basic facts about the duration model in the setting of an i.i.d. sample, and then later introduce the independent variables.

In duration analysis the concept known as hazard plays an important role. Denoting the duration variable by T, we define

(13.7.1)   F(t) = P(T < t).

Assuming that the density function f(t) exists, to simplify the analysis, we define

(13.7.2)   Hazard(t, t + Δt) = P(t < T < t + Δt | T > t)

and call it the hazard of the interval (t, t + Δt). If T refers to the life of a person, the above signifies the probability that she dies in the time interval (t, t + Δt), given that she has lived up to time t. We define the hazard function, denoted λ(t), by the approximation Hazard(t, t + Δt) ≅ λ(t)Δt, where the approximation gets better as Δt gets smaller. Since f(t) = ∂F(t)/∂t, we can completely characterize the duration model in the i.i.d. case by specifying the distribution function F(t).
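The hazard of an interval can be computed directly from the distribution function. The sketch below uses a hypothetical Weibull-type distribution function, chosen only for illustration, and shows numerically that Hazard(t, t + Δt)/Δt stabilizes as Δt shrinks, which is the limiting quantity λ(t).

```python
import numpy as np

def interval_hazard(F, t, dt):
    """P(t < T < t + dt | T > t) computed from a distribution function F, as in (13.7.2)."""
    return (F(t + dt) - F(t)) / (1.0 - F(t))

# A hypothetical duration distribution: F(t) = 1 - exp(-0.5 * t**1.5).
F = lambda t: 1.0 - np.exp(-0.5 * t ** 1.5)
f = lambda t: 0.75 * t ** 0.5 * np.exp(-0.5 * t ** 1.5)   # its density

t = 2.0
for dt in (1.0, 0.1, 0.01, 0.001):
    print(dt, interval_hazard(F, t, dt) / dt)

# The ratio approaches the hazard function evaluated at t.
print("f(t) / (1 - F(t)) =", f(t) / (1.0 - F(t)))
```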
There is a one-to-one correspondence between F(t) and λ(t). We have

(13.7.4)   λ(t) = f(t) / [1 − F(t)],

which shows that λ(t) is known once F(t) is known; therefore λ(t) contains no new information beyond what is contained in F(t). The next equation shows the converse:

(13.7.5)   F(t) = 1 − exp[ −∫_0^t λ(s) ds ].

Nevertheless, it is useful to define this concept, because sometimes the researcher has a better feel for the hazard function than for the distribution function; hence it is easier for him to specify the former than the latter.

The simplest duration model is the one for which the hazard function is constant: λ(t) = λ. This is called the exponential model. From (13.7.5) we have for this model F(t) = 1 − e^{−λt} and f(t) = λe^{−λt}. The exponential model for the life of a machine implies that the machine is always like new, regardless of how old it may be. This model would not be realistic to use for human life, for it would imply that the probability a person dies within the next minute, say, is the same for persons of every age. A more realistic model for human life would be one in which λ(t) has a U shape, remaining high for age 0 to 1, attaining a minimum at youth, and then rising again with age. For some other applications (for example, the duration of a marriage) an inverted U shape may be more realistic.

The simplest generalization of the exponential model is the Weibull model, in which the hazard function is specified as λ(t) = λαt^{α−1}. When α = 1, the Weibull model is reduced to the exponential model. From (13.7.5) we have for this model F(t) = 1 − exp(−λt^α), and differentiating with respect to t yields f(t) = λαt^{α−1} exp(−λt^α). Thus the Weibull model can accommodate an increasing or a decreasing hazard function, but neither a U-shaped nor an inverted U-shaped hazard function. The researcher can therefore test exponential versus Weibull by testing α = 1 in the Weibull model.

Lancaster (1979) estimated a Weibull model of unemployment duration. He introduced independent variables into the model by specifying the hazard function of the ith unemployed worker as

(13.7.9)   λ_i(t) = exp(x_i'β) αt^{α−1}.

The vector x_i contains log age, the log unemployment rate of the area, and log replacement (unemployment benefit divided by earnings from the last job). Lancaster was interested in testing α = 1, because economic theory does not clearly indicate whether α should be larger or smaller than 1. He found, curiously, that his maximum likelihood estimator of α approached 1 from below as he kept adding independent variables, starting with the constant term only.

As Lancaster showed, this phenomenon is due to the fact that even if the hazard function is constant over time for each individual, if different individuals are associated with different levels of the hazard function, an aggregate estimate of the hazard function obtained by treating all the individuals homogeneously will exhibit a declining hazard function (that is, ∂λ/∂t < 0). We explain this fact by the illustrative example in Table 13.1. In this example three groups of individuals are associated with three different constant levels of the hazard rate, and initially there are 1000 people in each group. The first row shows, for example, that 500 people remain at the end of period 1 and the beginning of period 2, and so on. The last row indicates the ratio of the aggregate number of people who die in each period to the number of people who remain at the beginning of the period; this ratio declines from period to period even though each group's hazard is constant.

The heterogeneity of the sample may not be totally explained by all the independent variables that the researcher can observe. In such a case it would be advisable to introduce into the model an unobservable random variable, known as the unobserved heterogeneity, which acts as a surrogate for the omitted independent variables. If L_i(v_i) denotes the conditional likelihood function for the ith person, given the unobserved heterogeneity v_i, the likelihood function of the model with the unobserved heterogeneity is given by

    L = Π_i E[L_i(v_i)],

where {v_i} are i.i.d. with a specified distribution and the expectation is taken with respect to the distribution of v_i.
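The mechanism behind Table 13.1, and hence behind Lancaster's finding, is easy to reproduce numerically. The sketch below uses hypothetical hazard levels (the table's actual values are not reproduced here) and tracks three groups with different constant hazards; the aggregate ratio of exits to survivors falls over time even though no individual hazard changes.

```python
import numpy as np

# Hypothetical constant hazard rates for three groups (illustrative values only).
hazards = np.array([0.1, 0.3, 0.5])
survivors = np.array([1000.0, 1000.0, 1000.0])   # 1000 people in each group initially

for period in range(1, 6):
    exits = hazards * survivors                   # expected exits during this period
    aggregate_hazard = exits.sum() / survivors.sum()
    print(f"period {period}: aggregate hazard = {aggregate_hazard:.3f}")
    survivors = survivors - exits                 # those remaining at the start of the next period

# The printed aggregate hazard declines toward the smallest group hazard,
# because the high-hazard groups are depleted first.
```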
When Lancaster introduced the unobserved heterogeneity into his model, his estimate of α approached 1 still further. Heckman and Singer (1984) studied the properties of the maximum likelihood estimator of the distribution of the unobserved heterogeneity, without parametrically specifying it, in a general duration model. They showed that the maximum likelihood estimator of the distribution is discrete. The unobserved heterogeneity can be used with a model more general than the Weibull model. (The likelihood function of the model without the unobserved heterogeneity will be given later.)

A hazard function with independent variables may be written more generally as

(13.7.12)   λ_i(t) = λ_0(t) exp[x_i(t)'β],

where λ_0(t) is referred to as the baseline hazard function. This formulation is more general than (13.7.9), first, in the sense that the baseline hazard function is left general and, second, in the sense that x depends on time t as well as on the individual i. Some examples of baseline hazard functions which have been used in econometric applications are exponentiated polynomials in t, such as λ_0(t) = exp(γ_1 t + γ_2 t²) and λ_0(t) = exp(γ_1 t + γ_2 t² + γ_3 t^k), and the Weibull-type form λ_0(t) = ρkt^{k−1}; see Flinn and Heckman (1982), Gritz (1993), and Sturm (1991).

Next we consider the derivation of the likelihood function of the duration model with the hazard function of the form (13.7.12). The first step is to obtain the distribution function by the formula

(13.7.16)   F_i(t) = 1 − exp[ −∫_0^t λ_0(s) exp(x_i(s)'β) ds ]

and then the density function, by differentiating the above, as

(13.7.17)   f_i(t) = λ_0(t) exp[x_i(t)'β] exp[ −∫_0^t λ_0(s) exp(x_i(s)'β) ds ].

The computation of the integral in these two formulae presents a problem, in that we must specify the independent variable vector x_i(s) as a continuous function of s.

The duration model with a hazard function that can be written as the product of a term that depends only on t and a term that depends only on i is called the proportional hazard model; note that Lancaster's model (13.7.9) is a special case of such a model. Cox (1972) showed that in the proportional hazard model β can be estimated without specifying the baseline hazard λ_0(t). This estimator of β is called the partial maximum likelihood estimator. The baseline hazard λ_0(t) can be nonparametrically estimated by the Kaplan-Meier estimator (Kaplan and Meier, 1958). For an econometric application of these estimators, see Lehrer (1988).

The likelihood function depends on the sampling scheme, and we have deliberately chosen the following heart transplant example to illustrate two sampling schemes. First, let us assume that our data consist of the survival durations of all those who had heart transplant operations at Stanford University from the day of the first such operation there until December 31, 1992. There are two categories of data: those who died before December 31, 1992, and those who were still living on that date. The survival durations of the patients still living on the last day of observation (in this example December 31, 1992) are said to be right censored. The contribution of a patient in the first category to the likelihood function is the density function evaluated at the observed survival duration, and the contribution of a patient in the second category is the probability that he lived at least until December 31, 1992. Thus the likelihood function is given by

(13.7.18)   L = Π_0 f_i(t_i) · Π_1 [1 − F_i(t_i)],

where Π_0 is the product over those individuals who died before December 31, 1992, and Π_1 is the product over those individuals who were still living on that date. Note that for patients of the first category t_i refers to the time from the operation to death, whereas for patients of the second category t_i refers to the time from the operation to December 31, 1992. Note also a similarity between the above likelihood function and the likelihood function of the Tobit model given in (13.6.3).
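A likelihood of the form (13.7.18) is simple to code once f_i and F_i are available in closed form. The sketch below is an illustration rather than the text's estimator: it assumes a Weibull-type proportional hazard λ_i(t) = αt^{α−1} exp(x_i'β) with time-invariant regressors, so that the integrated hazard is t^α exp(x_i'β), and it combines the log-density for completed durations with the log-survivor function for right-censored ones.

```python
import numpy as np

def weibull_ph_loglik(params, t, x, died):
    """Log-likelihood of the form (13.7.18) for the hazard
    lambda_i(t) = alpha * t**(alpha - 1) * exp(x_i'beta) with time-invariant x_i.
    t[i] is the observed duration; died[i] = 1 if the spell ended at t[i],
    and died[i] = 0 if it was right censored at t[i]."""
    alpha = np.exp(params[0])               # keeps alpha positive
    beta = params[1:]
    xb = x @ beta
    cum_hazard = t ** alpha * np.exp(xb)    # integrated hazard Lambda_i(t)
    log_density = np.log(alpha) + (alpha - 1.0) * np.log(t) + xb - cum_hazard
    log_survivor = -cum_hazard
    return np.sum(np.where(died == 1, log_density, log_survivor))
```

Maximizing this function numerically over (α, β) gives the maximum likelihood estimates; setting α = 1 gives the right-censored exponential model.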
Now consider another sampling scheme with the same heart transplant data. Suppose we observe only those patients who either had their operations between January 1, 1980, and December 31, 1992, or had their operations before January 1, 1980, but were still living on that date. The survival durations of the patients who had their operations before the first day of observation (in this example January 1, 1980) and were still living on that date are said to be left censored. Under this sampling scheme (13.7.18) is no longer the correct likelihood function, because the scheme tends to include more long-surviving patients than short-surviving patients among those who had their operations before January 1, 1980; maximizing (13.7.18) would therefore overestimate the survival duration. In order to obtain consistent estimates of the parameters of this model, we must either maximize the correct likelihood function or eliminate from the sample all the patients living on January 1, 1980. For the correct likelihood function of the second sampling scheme, with left censoring, see Amemiya (1991). With data such as unemployment spells, the first sampling scheme is practically impossible, because the history of unemployment goes back very far.

We mentioned earlier a problem of computing the integral in (13.7.16) or (13.7.17), which arises when we specify the hazard function generally as (13.7.12). It is customary in practice to divide the sample period into intervals and assume that x_i(t) remains constant within each interval; this assumption simplifies the integral considerably. The general model with the hazard function (13.7.12) may also be estimated by a discrete approximation; see Moffitt (1985) for the maximum likelihood estimator of a duration model using a discrete approximation. In the discrete-time setting λ_i(t) must be interpreted as the probability that the spell of the ith person ends in the interval (t, t + 1). The contribution to the likelihood function of a spell that ends after k periods is λ_i(k) Π_{t=1}^{k−1} [1 − λ_i(t)], whereas the contribution of a spell that lasts at least k periods is Π_{t=1}^{k} [1 − λ_i(t)].
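The two discrete-time likelihood contributions just described can be computed directly from the per-period hazards. A minimal sketch follows; the logistic specification of the hazard and the numerical values are assumptions made only for the example, not taken from the text.

```python
import numpy as np

def spell_log_contribution(hazards, completed):
    """Log-likelihood contribution of one spell observed for k = len(hazards) periods.
    hazards[t] is the probability that the spell ends in period t + 1 given survival.
    completed = True  -> spell ended in period k:  lambda_k * prod_{t<k} (1 - lambda_t)
    completed = False -> spell censored at k:      prod_{t<=k} (1 - lambda_t)."""
    hazards = np.asarray(hazards, dtype=float)
    if completed:
        return np.log(hazards[-1]) + np.sum(np.log(1.0 - hazards[:-1]))
    return np.sum(np.log(1.0 - hazards))

# Example with a hypothetical per-period hazard depending on a time-varying regressor.
beta0, beta1 = -2.0, 0.5
x_it = np.array([0.2, 0.4, 0.9, 1.3])                  # one worker, four periods
lam = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x_it)))    # per-period hazards
print(spell_log_contribution(lam, completed=True))     # spell ended in period 4
print(spell_log_contribution(lam, completed=False))    # spell still ongoing after period 4
```

Summing such contributions over individuals gives the sample log-likelihood, which can then be maximized over the parameters of the hazard specification.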
Next we demonstrate how the exponential duration model can be derived from utility maximization in a simple job-search model. We do so first in the case of discrete time, and second in the case of continuous time.

Consider a particular unemployed worker. In every period there is a probability λ that a wage offer will arrive, and if it does arrive, its size is distributed i.i.d. as G. If the worker accepts the offer, he will receive the same wage forever; if he rejects it, he incurs the search cost c until he is employed. The discount rate is δ, so that a wage W received in every period is worth δ^{-1}W to the worker. Let V(t) be the maximum expected utility attainable at time t. Then the Bellman equation is

    V(t) = max[ δ^{-1}W(t), (1 − δ)EV(t + 1) − c ]

in a period in which an offer arrives, where W(t) has been written simply as W because of our i.i.d. assumption, and V(t) = (1 − δ)EV(t + 1) − c in a period in which no offer arrives. Taking the expectation of both sides and putting EV(t) = V because of stationarity, we have

    V = δ^{-1}λ E[max(W, R)] + δ^{-1}(1 − λ)R,   where R = δ[(1 − δ)V − c]

and

    E[max(W, R)] = ∫_R^∞ w dG(w) + R G(R).

Note that V appears on both sides of this equation. Solve it for V, call the solution V*, and define the reservation wage R* = δ[(1 − δ)V* − c]. The worker should accept the wage offer if and only if W > R*. Define P = P(W > R*). Then the likelihood function of a worker who accepted a wage offer in the (t + 1)st period is

(13.7.23)   (1 − λP)^t λP.

The next model we consider is the continuous-time version of the previous model. We define c and δ as before. The duration T until a wage offer arrives is distributed exponentially with the rate λ; that is, P(T > t) = exp(−λt). When an offer arrives, the wage is again distributed i.i.d. as G. The Bellman equation can be written analogously, and it is easy to show that the reservation wage R* satisfies a similar implicit equation. The worker accepts an offer if and only if W > R*, so that acceptable offers arrive at the rate λP, where P = P(W > R*) as before. Let f(t) be the density function of the unemployment duration. Then

(13.7.28)   f(t) = λP exp(−λPt).

Thus we have obtained the exponential model. For a small value of λP, (13.7.28) is approximately equal to (13.7.23).

Many extensions of this basic model have been estimated in econometric applications, of which we mention only two. The model of Wolpin (1987) introduces the following extensions: first, the planning horizon is finite; second, the wage is observed with an error. A new feature in the model of Pakes (1986), in which W is the net return from the renewal of a patent, is that W(t) is serially correlated. This feature makes solution of the Bellman equation considerably more cumbersome. A fuller discussion of the model can be found, for example, in Lippman and McCall (1976).
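The stationary equation above can be solved numerically by fixed-point iteration on V. The sketch below is only a sketch: the uniform wage-offer distribution and the values of δ, c, and λ are hypothetical choices made for concreteness. It recovers the reservation wage R*, the acceptance probability P, and the implied exit rate λP of the unemployment duration.

```python
import numpy as np

# Hypothetical primitives: discount rate delta, search cost c, offer arrival
# probability lam, and wage offers W distributed Uniform(0, 1).
delta, c, lam = 0.05, 0.1, 0.4
rng = np.random.default_rng(1)
w_draws = rng.uniform(0.0, 1.0, 200_000)     # used to approximate E[max(W, R)]

V = 0.0
for _ in range(2000):                        # fixed-point iteration on the stationary equation
    R = delta * ((1.0 - delta) * V - c)
    emax = np.mean(np.maximum(w_draws, R))   # E[max(W, R)] = int_R w dG(w) + R G(R)
    V_new = (lam * emax + (1.0 - lam) * R) / delta
    if abs(V_new - V) < 1e-10:
        break
    V = V_new

R_star = delta * ((1.0 - delta) * V - c)     # reservation wage
P = np.mean(w_draws > R_star)                # acceptance probability P(W > R*)
print(f"V* = {V:.3f}, R* = {R_star:.3f}, P = {P:.3f}, exit rate lam*P = {lam * P:.3f}")
# The duration of unemployment is then geometric with parameter lam*P in discrete time,
# as in (13.7.23), and approximately exponential with rate lam*P, as in (13.7.28).
```

The iteration converges because the mapping from V to the right-hand side is a contraction whenever the discount rate is positive, so a crude starting value such as V = 0 suffices.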
APPENDIX: DISTRIBUTION THEORY

DEFINITION 1 (Chi-square Distribution)   Let {Z_i}, i = 1, 2, . . . , n, be i.i.d. as N(0, 1). Then the distribution of Σ_{i=1}^n Z_i² is called the chi-square distribution with n degrees of freedom and is denoted χ²_n.

THEOREM 1   If X ~ χ²_n and Y ~ χ²_m and if X and Y are independent, then X + Y ~ χ²_{n+m}.

THEOREM 2   If X ~ χ²_n, then EX = n and VX = 2n.

DEFINITION 2 (Student's t Distribution)   Let Y be N(0, 1) and independent of a chi-square variable X ~ χ²_n. Then the distribution of √n Y/√X is called the Student's t distribution with n degrees of freedom. We shall denote this distribution by t_n.

THEOREM 3   Let {X_i}, i = 1, 2, . . . , n, be i.i.d. as N(μ_X, σ²), and let X̄ = n^{-1} Σ_{i=1}^n X_i. Then X̄ ~ N(μ_X, n^{-1}σ²).

THEOREM 4   Let {X_i} be as in Theorem 3. Then Σ_{i=1}^n (X_i − X̄)²/σ² ~ χ²_{n−1}.
Proof (sketch). It suffices to prove the result for standard normal variables Z_i. Since (Z_1 − Z_2)/√2 ~ N(0, 1), its square is χ²_1 by Definition 1, so the theorem is true for n = 2. Assume it is true for n and consider n + 1. The sum of squared deviations for n + 1 observations can be written as the sum of squared deviations for the first n observations plus [n/(n + 1)](Z_{n+1} − Z̄_n)², and the second term is the square of an N(0, 1) variable that is independent of the first term (E(Z_i) = 0 and the relevant covariances are zero, so joint normality implies independence). The result then follows from Theorem 1 and Definition 1.

THEOREM 5   Let {X_i} be as in Theorem 3. Then X̄ and Σ_{i=1}^n (X_i − X̄)² are independent.
Proof (sketch). X̄ and each deviation X_i − X̄ are jointly normal with zero covariance, and jointly normal random variables with zero covariance are independent.

THEOREM 6   Let {X_i}, i = 1, 2, . . . , n_X, be i.i.d. as N(μ_X, σ_X²) and let {Y_i}, i = 1, 2, . . . , n_Y, be i.i.d. as N(μ_Y, σ_Y²), and assume that {X_i} are independent of {Y_i}. Let X̄ and Ȳ be the sample means and let S_X² and S_Y² be the sample variances. Then, if σ_X² = σ_Y²,

    [ (X̄ − Ȳ) − (μ_X − μ_Y) ] / √[ (n_X S_X² + n_Y S_Y²)/(n_X + n_Y − 2) · (1/n_X + 1/n_Y) ]  ~  t_{n_X + n_Y − 2}.

Proof (sketch). The numerator, divided by σ√(1/n_X + 1/n_Y), is N(0, 1) by Theorem 3; (n_X S_X² + n_Y S_Y²)/σ² is χ²_{n_X + n_Y − 2} by Theorems 1 and 4; the two are independent by Theorem 5; the result then follows from Definition 2.

DEFINITION 3 (F Distribution)   If X ~ χ²_n and Y ~ χ²_m and if X and Y are independent, then (X/n)/(Y/m) is distributed as F with n and m degrees of freedom, denoted F(n, m). This is known as the F distribution. Here n is called the numerator degrees of freedom, and m the denominator degrees of freedom.

REFERENCES

Akaike, H. 1973. "Information Theory and an Extension of the Maximum Likelihood Principle." In B. N. Petrov and F. Csaki, eds., Second International Symposium on Information Theory, pp. 267-281. Budapest: Akademiai Kiado.
Amemiya, T. 1973. "Regression Analysis When the Dependent Variable Is Truncated Normal." Econometrica 41: 997-1016.
Amemiya, T. 1981. "Qualitative Response Models: A Survey." Journal of Economic Literature 19: 1483-1536.
Amemiya, T. 1985. Advanced Econometrics. Cambridge, Mass.: Harvard University Press.
Amemiya, T. 1991. "A Note on Left Censoring." Technical Paper no. 235, CEPR, Stanford University.
Amemiya, T., and W. A. Fuller. 1967. "A Comparative Study of Alternative Estimators in a Distributed-Lag Model." Econometrica 35: 509-529.
Anderson, T. W. 1984. An Introduction to Multivariate Statistical Analysis, 2nd ed. New York: John Wiley & Sons.
Apostol, T. M. 1974. Mathematical Analysis, 2nd ed. Reading, Mass.: Addison-Wesley.
Arrow, K. J. 1965. Aspects of the Theory of Risk-Bearing. Helsinki: Academic Book Store.
Arrow, K. J., H. B. Chenery, B. S. Minhas, and R. M. Solow. 1961. "Capital-Labor Substitution and Economic Efficiency." Review of Economics and Statistics 43: 225-250.
Bellman, R. 1970. Introduction to Matrix Analysis, 2nd ed. New York: McGraw-Hill.
Birnbaum, A. 1962. "On the Foundations of Statistical Inference." Journal of the American Statistical Association 57: 269-326 (with discussion).
Box, G. E. P., and D. R. Cox. 1964. "An Analysis of Transformations." Journal of the Royal Statistical Society, ser. B, 26: 211-252 (with discussion).
Chung, K. L. 1974. A Course in Probability Theory, 2nd ed. New York: Academic Press.
Cox, D. R. 1972. "Regression Models and Life Tables." Journal of the Royal Statistical Society, ser. B, 34: 187-220 (with discussion).
DeGroot, M. H. 1970. Optimal Statistical Decisions. New York: McGraw-Hill.
Dubin, J. A., and D. McFadden. 1984. "An Econometric Analysis of Residential Electric Appliance Holdings and Consumption." Econometrica 52: 345-362.
Dudley, L., and C. Montmarquette. 1976. "A Model of the Supply of Bilateral Foreign Aid." American Economic Review 66: 132-142.
Duncan, G. M. 1980. "Formulation and Statistical Analysis of the Mixed, Continuous/Discrete Dependent Variable Model in Classical Production Theory." Econometrica 48: 839-852.
Durbin, J., and G. S. Watson. 1951. "Testing for Serial Correlation in Least Squares Regression. II." Biometrika 38: 159-178.
Eicker, F. 1963. "Asymptotic Normality and Consistency of the Least Squares Estimators for Families of Linear Regressions." Annals of Mathematical Statistics 34: 447-456.
Fair, R. C., and D. M. Jaffee. 1972. "Methods of Estimation for Markets in Disequilibrium." Econometrica 40: 497-514.
Ferguson, T. S. 1967. Mathematical Statistics. New York: Academic Press.
Flinn, C., and J. Heckman. 1982. "Models for the Analysis of Labor Force Dynamics." Advances in Econometrics 1: 35-95.
Fuller, W. A. 1987. Measurement Error Models. New York: John Wiley & Sons.
Goldfeld, S. M., and R. E. Quandt. 1972. Nonlinear Methods in Econometrics. Amsterdam: North-Holland Publishing.
Graybill, F. A. 1969. Introduction to Matrices with Applications in Statistics. Belmont, Calif.: Wadsworth Publishing.
Griliches, Z. 1967. "Distributed Lag Models: A Survey." Econometrica 35: 16-49.
Gritz, R. M. 1993. "The Impact of Training on the Frequency and Duration of Employment." Journal of Econometrics 57: 21-51.
Gronau, R. 1973. "The Effects of Children on the Household's Value of Time." Journal of Political Economy 81: S168-S199.
Hausman, J. A., and D. A. Wise. 1978. "A Conditional Probit Model for Qualitative Choice: Discrete Decisions Recognizing Interdependence and Heterogeneous Preferences." Econometrica 46: 403-426.
Heckman, J., and B. Singer. 1984. "A Method for Minimizing the Impact of Distributional Assumptions in Econometric Models for Duration Data." Econometrica 52: 271-320.
Hoel, P. G. 1984. Introduction to Mathematical Statistics, 5th ed. New York: John Wiley & Sons.
Huang, C., and R. H. Litzenberger. 1988. Foundations for Financial Economics. Amsterdam: North-Holland Publishing.
Hwang, J. T. 1985. "Universal Domination and Stochastic Domination: Estimation Simultaneously under a Broad Class of Loss Functions." Annals of Statistics 13: 295-314.
Johnson, N. L., and S. Kotz. 1970. Distributions in Statistics: Continuous Univariate Distributions-1. Boston: Houghton Mifflin.
Johnson, N. L., and S. Kotz. 1972. Distributions in Statistics: Continuous Multivariate Distributions. New York: John Wiley & Sons.
Johnston, J. 1984. Econometric Methods, 3rd ed. New York: McGraw-Hill.
Kalbfleisch, J. D., and R. L. Prentice. 1980. The Statistical Analysis of Failure Time Data. New York: John Wiley & Sons.
Kaplan, E. L., and P. Meier. 1958. "Nonparametric Estimation from Incomplete Observations." Journal of the American Statistical Association 53: 457-481.
Kendall, M. G., and A. Stuart. 1973. The Advanced Theory of Statistics, vol. 2, 3rd ed. New York: Hafner Press.
Koyck, L. M. 1954. Distributed Lags and Investment Analysis. Amsterdam: North-Holland Publishing.
Lancaster, T. 1979. "Econometric Methods for the Duration of Unemployment." Econometrica 47: 939-956.
Lancaster, T. 1990. The Econometric Analysis of Transition Data. New York: Cambridge University Press.
Lee, L. F. 1978. "Unionism and Wage Rates: A Simultaneous Equations Model with Qualitative and Limited Dependent Variables." International Economic Review 19: 415-433.
Lehrer, E. L. 1988. "Determinants of Marital Instability: A Cox Regression Model." Applied Economics 20: 195-210.
Lippman, S. A., and J. J. McCall. 1976. "The Economics of Job Search: A Survey." Economic Inquiry 14: 155-189.
Mallows, C. L. 1964. "Choosing Variables in a Linear Regression: A Graphical Aid." Paper presented at the Central Regional Meeting of the Institute of Mathematical Statistics, Manhattan, Kans.
Marcus, M., and H. Minc. 1964. A Survey of Matrix Theory and Matrix Inequalities. Boston: Prindle, Weber & Schmidt.
McFadden, D. 1974. "Conditional Logit Analysis of Qualitative Choice Behavior." In P. Zarembka, ed., Frontiers in Econometrics, pp. 105-142. New York: Academic Press.
McFadden, D. 1977. "Qualitative Methods for Analyzing Travel Behavior of Individuals: Some Recent Developments." Cowles Foundation Discussion Paper no. 474.
McFadden, D. 1981. "Econometric Models of Probabilistic Choice." In C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications, pp. 198-272. Cambridge, Mass.: MIT Press.
McFadden, D. 1989. "A Method of Simulated Moments for Estimation of Discrete Response Models without Numerical Integration." Econometrica 57: 995-1026.
Miller, R. G., Jr. 1981. Survival Analysis. New York: John Wiley & Sons.
Moffitt, R. 1985. "Unemployment Insurance and the Distribution of Unemployment Spells." Journal of Econometrics 28: 85-101.
Olsen, R. J. 1978. "Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model." Econometrica 46: 1211-1215.
Pakes, A. 1986. "Patents as Options: Some Estimates of the Value of Holding European Patent Stocks." Econometrica 54: 755-784.
Rao, C. R. 1973. Linear Statistical Inference and Its Applications, 2nd ed. New York: John Wiley & Sons.
Robinson, P. M. 1982. "On the Asymptotic Properties of Estimators of Models Containing Limited Dependent Variables." Econometrica 50: 27-41.
Sawa, T., and T. Hiromatsu. 1973. "Minimax Regret Significance Points for a Preliminary Test in Regression Analysis." Econometrica 41: 1093-1101.
Serfling, R. J. 1980. Approximation Theorems of Mathematical Statistics. New York: John Wiley & Sons.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. London: Chapman & Hall.
Sturm, R. 1991. "Reliability and Maintenance in European Nuclear Power Plants: A Structural Analysis of a Controlled Stochastic Process." Ph.D. dissertation, Stanford University.
Theil, H. 1953. "Repeated Least Squares Applied to Complete Equation Systems." Mimeographed paper. The Hague: Central Planning Bureau.
Theil, H. 1961. Economic Forecasts and Policy, 2nd ed. Amsterdam: North-Holland Publishing.
Tobin, J. 1958. "Estimation of Relationships for Limited Dependent Variables." Econometrica 26: 24-36.
Welch, B. L. 1938. "The Significance of the Difference between Two Means When the Population Variances Are Unequal." Biometrika 29: 350-362.
White, H. 1980. "A Heteroscedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroscedasticity." Econometrica 48: 817-838.
Wolpin, K. I. 1987. "Estimating a Structural Search Model: The Transition from School to Work." Econometrica 55: 801-817.
Zellner, A. 1971. An Introduction to Bayesian Inference in Econometrics. New York: John Wiley & Sons.