Simulating from a multinomial distribution with large number of categories

Computational Statistics & Data Analysis 51 (2007) 5471–5476 www.elsevier.com/locate/csda Short Communication Simulating from a multinomial distribution with large number of categories Sonia Malefaki, George Iliopoulos∗ Department of Statistics and Insurance Science, University of Piraeus, 80 Karaoli & Dimitriou Str., 18534 Piraeus, Greece Received 4 August 2006; received in revised form 30 March 2007; accepted 31 March 2007 Available online 5 April 2007 Abstract The multinomial distribution is a key-distribution for several applications. For this reason, many methods have been proposed so far in the literature in order to deal with the problem of simulation from it. A slight modiﬁcation is suggested which can be used in conjunction with any of the standard schemes. The proposed variation is a two-stage procedure based on the property of the multinomial distribution that for any partition of the set of outcomes the vector of total frequencies of each part follows also a multinomial distribution with parameters adjusted accordingly. It is empirically exhibited that this variation is faster than the original procedures in case the numbers of independent trials and possible outcomes are both large. The time reduction is illustrated via a simulation study for several programming languages such as R, Matlab, and others. © 2007 Elsevier B.V. All rights reserved. Keywords: Multinomial distribution; Random number generation 1. Introduction Nowadays, as computational power is rapidly growing, computer simulation is widely used in many disciplines. Simulation of random variables is often used when either no explicit theoretical results can be obtained or in order to conﬁrm existing ones. There is a huge number of general as well as special methods for generating random variates from discrete or continuous distributions. An introduction on this topic can be found in Devroye (1986); for more recent results see Robert and Casella (2004). In this paper we focus on generating random samples from the multinomial distribution. The multinomial distribution is a multidimensional generalization of the binomial distribution and is one of the most frequently used distributions in practical applications. Its most frequent use is in the analysis of contingency tables but it may be applied as well whenever data are grouped in a ﬁnite number of categories. Some recent applications can be found in McCulloch and Rossi (1994) and Imai and Van Dyk (2005). For more on multinomial distribution see Johnson et al. (1996) and the references therein. ∗ Corresponding author. Tel.: +30 210 414 2406, fax: +30 210 414 2340. E-mail address: [email protected] (G. Iliopoulos). 0167-9473/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2007.03.022 5472 S. Malefaki, G. Iliopoulos / Computational Statistics & Data Analysis 51 (2007) 5471–5476 Let P = (p1, . . . , pn) be a probability vector, that is, 0 S. Malefaki, G. Iliopoulos / Computational Statistics & Data Analysis 51 (2007) 5471–5476 5473 Proof. By conditioning on K = k we get P(X = x|K = k) = m∏ j=1 P(X(j) = x(j)|K = k) = m∏ j=1 kj !∏i∈Bj pxii q kj j ∏ i∈Bj xi ! I ⎛ ⎝∑ i∈Bj xi = kj ⎞ ⎠ and summation over all partitions k of T gives P(X = x) = ∑ (k1,...,km) P(X = x|K = k)P(K = k) = ⎧⎨ ⎩ m∏ j=1 kj !∏i∈Bj pxii q kj j ∏ i∈Bj xi ! I ⎛ ⎝∑ i∈Bj xi = kj ⎞ ⎠ ⎫⎬ ⎭ T !∏mj=1qkjj∏m j=1kj ! I ⎛ ⎝ m∑ j=1 kj = T ⎞ ⎠ = T !∏n i=1xi ! n∏ i=1 p xi i I ( n∑ i=1 xi = T ) . The second equality holds true since kj =∑i∈Bj xi and thus to each x = (x1, . . . , xn) corresponds a unique allocation (k1, . . . , km). � According to Lemma 2.1, in order to generate from the multinomial distributionMn(T ;p) one is allowed to break the task into the following steps: Step 0: Partition the set of categories into m subsets consisting of n1, . . . , nm outcomes, respectively. Step 1: Generation of the number of observations K1, . . . , Km corresponding to each of the m subsets. Step 2: Generation of Kj observations within the jth subset for j = 1, . . . , m. Assuming for convenience that n = km, one possible choice is to deﬁne the jth subset to contain the categories with labels ranging from (j − 1)k + 1 to jk. In this case, qj =∑jki=(j−1)k+1pi , j = 1, . . . , m. Of course, one may rank the categories with respect to the magnitude of their probabilities before forming the subsets. Theoretically, the generation of an observation using the proposed variation in conjunction with any standard method should have the same expected time as the method itself, but in practice this is not the case as we have found empirically. Indeed, a simulation study presented in the following section shows that the use of the proposed variation may reduce dramatically the time needed to generate one observation when T and n are large. 3. Simulations In order to compare the CPU time needed for generating a random vector from a multinomial distribution us- ing the proposed variation with that of the standard method we consider the multinomial distribution Mn(T ;p = (p1, . . . , pn)) with p1 =· · ·=pn=1/n for several values of n and T. The procedure has been implemented in Matlab, Mathematica, R, and Fortran and the corresponding CPU times have been recorded. In case of existence, the internal functions of the above programming languages were used. R and Fortran produce random samples from the multinomial distribution using the conditional method with Kachitvichyanukul (1982) algorithm for simulation from the binomial distribution. Mathematica uses also the conditional method but the samples from the binomial distributions are drawn by summing independent Bernoulli random variates. In Matlab there is no internal subrou- tine generating multinomial random variates. Therefore, we downloaded two functions available in “Matlab central ﬁle exchange” (http://www.mathworks.com/matlabcentral/fileexchange) and “Lightspeed Matlab toolbox” (http://research.microsoft.com/∼minka/software/lightspeed). The ﬁrst one is writ- ten by Trujillo–Ortiz et al. and uses the direct method whilst the second is written by Minka and utilizes the conditional method for the multinomial distribution together with the sequential search or Kemp’s method for the binomial distri- bution depending on its mean. According to our simulation results (presented in Table 1), in both Mathematica and R the reduction in CPU time is very signiﬁcant in almost all cases, especially for large values of T and n. For instance, in Mathematica our variation can become up to 50 times faster than the standard procedure (T = 104, n = 105,m = 103). In R, it is about 5474 S. Malefaki, G. Iliopoulos / Computational Statistics & Data Analysis 51 (2007) 5471–5476 Table 1 CPU times needed for generating a random sample from the multinomial distributionMn(T ;p = (1/n, . . . , 1/n)) in several packages for various T , n, and m using the proposed (P) and the standard (S) method T n m Matlab Mathematica R Fortran T M S P S P S P S P S P 102 103 10 .06 .03 .07 .06 .13 .05 .16 .25 .07 .06 102 .02 .05 .09 .28 .04 104 10 2.17 .42 .62 .52 2.64 .71 .51 .30 .58 .56 102 .13 .33 .57 .28 .25 103 .04 .12 .84 .34 .11 105 10 267.36 20.31 6.14 5.31 192.33 20.47 7.61 .65 5.78 6.91 102 2.28 3.03 7.37 .33 2.49 103 .25 .57 5.42 .34 .46 104 2.21 .71 9.98 .71 .67 103 103 10 .15 .06 .07 .07 .46 .09 1.55 2.38 .08 .07 102 .05 .07 .13 2.40 .07 104 10 2.96 .49 .64 .61 6.04 1.02 4.88 2.65 .64 .63 102 .26 .54 .63 2.49 .57 103 .31 .42 1.21 3.01 .37 105 10 278.19 24.31 6.22 6.04 228.12 24.03 74.89 5.99 6.17 6.76 102 4.30 5.23 7.78 2.87 5.66 103 1.43 3.24 5.93 3.10 2.62 104 3.41 1.19 14.14 6.34 1.10 104 103 10 1.08 .12 .09 .09 3.74 .45 15.5 23.46 .12 .12 102 .15 .10 .51 23.85 .12 104 10 12.52 1.31 .64 .65 41.08 4.30 49.39 26.83 .76 .78 102 .45 .64 1.35 24.58 .73 103 1.38 .65 4.59 27.15 .78 105 10 375.01 29.48 6.22 6.18 567.98 57.25 772.81 59.20 6.70 7.31 102 5.04 6.08 11.77 27.90 6.48 103 3.56 5.41 10.18 28.15 6.04 104 14.32 4.19 47.21 61.66 3.81 Columns T and M show the results using Trujillo–Ortiz et al. function and Minka’s function respectively in Matlab. The CPU times in Fortran are multiplied by 100. 25 times faster but only for large n. The time reduction in Matlab depends on which function for simulating from the multinomial distribution is used. For the function based on the direct method the results are very similar with those in R and Mathematica. Here, the greatest reduction in CPU time was achieved in the case where T = 100, n = 105 and m= 1000 where our variation was more than 1000 times faster than the standard procedure. On the other hand, for Minka’s function a signiﬁcant improvement is achieved mainly when T>n. In this point, it is interesting to note that the ﬁrst Matlab function combined with our variation for appropriate choice of m, results to a much faster procedure than Minka’s one regardless of whether the variation is used in the latter or not. In Fig. 1 the CPU times needed by Matlab and R to simulate a multinomial random vector using the standard algorithm and our two–stage procedure for T = 100 and various n are shown. In Fortran, the CPU time reduction is not as impressive as in the other languages and is achieved mainly when T>n. On the contrary, when T is of the same magnitude as n, the reduction is not very signiﬁcant. This behavior is probably a consequence of the fact that, contrary to the other languages tested, Fortran is a compiled programming language designed to create fast executable code for scientiﬁc programming. So, the reduction is smaller than that of the S. Malefaki, G. Iliopoulos / Computational Statistics & Data Analysis 51 (2007) 5471–5476 5475 2000 3000 4000 5000 6000 0.2 0.4 0.6 0.8 5000 10000 15000 20000 0.2 0.4 0.6 0.8 1 1.2 1.4 Fig. 1. Mean CPU times needed for the standard algorithm (solid line) and our two-stage procedure with m = 100 (dashes) and m = 1000 (dots) in order to generate an observation from the multinomial distributionMn(T = 100;p= (1/n, . . . , 1/n)) in Matlab (left) and R (right) for various n. 2x106 4x106 6x106 8x106 1x107 2x106 4x106 6x106 8x106 1x107 500 1000 1500 2000 2500 3000 500 1000 1500 2000 2500 3000 Fig. 2. Optimal m in R when T = 1000 (left) and T = 10 000 (right). The solid line is the graph of √n. other languages tested due to the following two reasons. First, the initial CPU time needed for generating a multinomial random vector is several orders of magnitude less than that of the other languages. Second, the compilation produces an executable program which is essentially unaffected by the number of loops required for the simulation. On the other hand, Mathematica, Matlab, and R are interpreted languages, hence the CPU time needed increases with the number of loops. In conclusion, there is no much reason to use our variation in Fortran. The whole procedure has been also run for some other probability vectors. More speciﬁcally, we generated indepen- dent uniformU(0, 1) random variates U1, . . . , Un and set pi = Ui/∑nj=1Uj for i = 1, . . . , n. We obtained almost the same results as above. Next, in order to investigate whether sorting the probabilities can affect the results, we generated fromMn(T ,p(1)) andMn(T ,p(2)), where p(1)i = 2i/{n(n+ 1)} and p(2)i =p(1)n−i+1, i = 1, . . . , n. Of course, the CPU time needed for simulating from the second multinomial distribution is less than that needed for the ﬁrst one when the standard procedure is used. However, the time reduction induced by our variation was similar in both cases. Concerning the choice of the number of subsets m, extensive simulation studies have shown that the optimal value m, say m∗, does not depend on the probability vector p. In R and Matlab, m∗ is of the order of √ n whereas in Mathematica it seems to be linear in n. Furthermore, in R, m∗ does not depend on the number of trials T either. In this case, taking m to be equal to √ n or a divisor of n close to it, is an appropriate choice (see Fig. 2). On the other hand, in Matlab, the dependence of m∗ on T is of the order of log(T ), so m∗ is given as the divisor of n which is closer to log(T ) √ n. In Mathematica, a simple linear regression of m∗ on n ﬁts almost perfectly (R2 > .95 for all T). The number of trials affect only the constant term, which in most cases is not statistically signiﬁcant. Moreover, for all T, the slope is estimated about .004. So, we conclude that in Mathematica, the value .004n (or the closest divisor of n) is a quite satisfactory choice for m. It is also worth mentioning that the variation of the CPU times is very small when 5476 S. Malefaki, G. Iliopoulos / Computational Statistics & Data Analysis 51 (2007) 5471–5476 m is set equal to any divisors of n near to the above values for each programming language. The simulation results are available by the authors upon request. As mentioned in the Introduction, our main interest was to run a SIR algorithm in order to estimate a model for a particular dataset. The reason we do not present such an example here is that the CPU time reduction when applying SIR using our variation in the resampling step is actually the same as in the simulation results reported in Table 1. In closing this section, it can be pointed out that the proposed method is adequate for applications in R, Matlab and Mathematica, especially for large values of T and n. 4. Conclusion In this paper we proposed a variation for sampling from a multinomial distribution with large number of both categories and trials. Although it requires simulation of m+1 multinomial random variates (instead of a single one), in many cases reduces the computational time of the task. Since it can be used in conjunction with any simulation scheme for the multinomial distribution it may be readily applied in all programming languages. Therefore, it can be proved quite helpful for reducing the runtime of computationally challenging methods such as SIR or PMC. References Ahrens, J.H., Dieter, U., 1980. Sampling from binomial and Poisson distributions: a method with bounded computation times. Computing 25, 193–208. Brown, M.B., Bromberg, J., 1984. An efﬁcient two-stage algorithm procedure for generating random variates from the multinomial distribution. Amer. Statist. 38, 216–219. Cappé, O., Guillin, A., Marin, J.M., Roberts, C., 2004. Population Monte Carlo. J. Comput. Graph. Statist. 13, 907–929. Davis, C.S., 1993. The computer generation of the multinomial random variates. Comput. Statist. Data Anal. 16, 205–217. Devroye, L., 1986. Non-Uniform Random Variate Generation. Springer, New York. Imai, K., Van Dyk, D.A., 2005. A Bayesian analysis of the multinomial probit model using marginal data augmentation. J. Econometrics 124, 311–334. Johnson, N.L., Kotz, S., Balakrishnan, N., 1996. Discrete multivariate distributions. Wiley, New York. Kachitvichyanukul, V., 1982. Computer generation of Poisson, binomial and hypergeometric random variates. Ph.D Thesis, Purdue University. Kemp, A.W., 1986. A modal method for generating binomial variables. Comm. Statist. Theory Methods 15, 805–813. McAllister, M.K., Ianelli, J.N., 1997. Bayesian stock assessment using catch-age data and the sampling-importance resampling algorithm. Canad. J. Fish. Aquat. Sci. 54, 284–300. McCulloch, R., Rossi, P.E., 1994. An exact likelihood of the multinomial probit model. J. Econometrics 64, 207–240. Robert, C., Casella, G., 2004. Monte Carlo Statistical Methods. second ed. Springer, New York. Rubin, D.B., 1987. Comment on The calculation of posterior distributions by data augmentation by Tanner and Wong. J. Amer. Statist. Assoc. 82, 543–546. Rubin, D.B., 1988. Using the SIR algorithm to simulate posterior distributions. Bayesian Statist. 3, 395–402. Walker, A.J., 1977. An efﬁcient method for generating discrete random variables with general distributions. ACM Trans. Math. Software 3, 253–256. Simulating from a multinomial distribution withlarge number of categories Introduction The two-stage procedure Simulations Conclusion References

Simulating from a multinomial distribution with large number of categories

Description

Comments