Quantitative linguistic study of DNA sequences

Available online at www.sciencedirect.com
Physica A 321 (2003) 189–192
www.elsevier.com/locate/physa

S.P. Li a,∗, Ka-Lok Ng b, M.C. Chung a
a Institute of Physics, Academia Sinica, Nankang, Taipei 115, Taiwan
b Department of Information Management, Ling Tung College, Tai-chung, Nantun 408, Taiwan

Abstract

A new family of compound Poisson distribution functions from quantitative linguistics is used to study linguistic features of DNA sequences that go beyond Zipf's law. The relative frequency distribution of n-tuples and the results of the compositional segmentation study can be fitted reasonably well using this new family of distribution functions. In addition, the absolute values of the relative frequency come out naturally from the linguistic model without ambiguity. This suggests that DNA sequences have features that resemble natural language and that they may be modeled by linguistic methodology.

© 2002 Elsevier Science B.V. All rights reserved.

PACS: 87.14.Gg
Keywords: DNA segmentation; Statistical linguistic; Compound Poisson distribution; Jensen–Shannon divergence measure

1. Introduction

In an early attempt, researchers [1] used Zipf's law [2] to study the statistical features embedded in DNA sequences. Zipf's law was first proposed in 1932, when George Zipf made an empirical observation on some statistical regularities of human writing; it has since become the most prominent statement of statistical linguistics. It is described as follows. Let us label a particular word by an index r equal to its rank, and denote by f(r) the normalized frequency of occurrence of that word, i.e., the number of times it appears in the text divided by the total number of words N. Zipf's law states that there is an approximate relation between f(r) and r,

$$f(r) = \frac{A}{r^{\zeta}}, \qquad (1)$$

∗ Corresponding author. E-mail addresses: [email protected], [email protected] (S.P. Li).
0378-4371/03/$ - see front matter © 2002 Elsevier Science B.V.
All rights reserved. doi:10.1016/S0378-4371(02)01787-9

where ζ and A are constants. This relation was used in Ref. [1] to study the statistical features of DNA sequences, where similar scaling behavior was found. It was noted, however, that for sequences composed primarily of coding regions the data were well fitted by a logarithmic function [3]. And, just as in linguistics, Zipf's law could account for only a limited range of the rank variable.

In the early days of quantitative linguistics, researchers had already suggested that the mathematical relation (1) proposed by Zipf was unsatisfactory. Families of compound Poisson distribution functions were later introduced to fit word frequencies as well as sentence lengths [4]. In Ref. [4], Sichel introduced a family of compound Poisson distribution functions which takes the form

$$\phi(r) = \left[\left((1-\theta)^{1/2}\right)^{-\gamma} K_{\gamma}\!\left(\alpha(1-\theta)^{1/2}\right) - K_{\gamma}(\alpha)\right]^{-1} \frac{(\alpha\theta/2)^{r}}{r!}\, K_{r+\gamma}(\alpha) \qquad (2)$$

for r ≥ 1, where −∞ < γ < ∞, 0 < θ < 1 and α > 0 are constants and K_γ is the modified Bessel function of the second kind of order γ. This is the Sichel model for word frequencies in its most general form.

A natural question to ask is whether the quantitative studies made in linguistics can be carried out in a similar fashion for DNA sequences. In particular, we would like to know whether the compound Poisson distribution functions introduced in quantitative linguistics are universal, in the sense that they can be used to study human-designed languages, such as the languages we use every day and computer programming languages, as well as the language used by nature, namely the information stored in DNA sequences. We answer this question by carrying out quantitative studies of DNA sequences using these compound Poisson distribution functions.

In Section 2, we use this family of compound Poisson distribution functions to study the statistical features of the word frequencies in DNA sequences.
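As a numerical aside on Eq. (2), the sketch below evaluates the zero-truncated Sichel probabilities for the special case γ = −1/2 (the value adopted in Section 2). In that case K_{±1/2}(x) = (π/2x)^{1/2} e^{−x} is elementary, and higher orders follow from the standard recurrence K_{ν+1}(x) = K_{ν−1}(x) + (2ν/x) K_ν(x). The function name `sichel_pmf` and the parameter values in the usage lines are illustrative assumptions, not taken from the paper, and a general γ would need a library Bessel routine instead.

```python
import math


def sichel_pmf(alpha, theta, r_max):
    """Return [phi(1), ..., phi(r_max)] from Eq. (2) with gamma = -1/2.

    Sketch only: restricted to gamma = -1/2 so that the modified Bessel
    functions K_{r-1/2} can be built with the standard library alone.
    """
    if alpha <= 0 or not (0 < theta < 1) or r_max < 1:
        raise ValueError("need alpha > 0, 0 < theta < 1 and r_max >= 1")

    # K_{1/2}(alpha) = K_{-1/2}(alpha) = sqrt(pi / (2 alpha)) * exp(-alpha)
    k_half = math.sqrt(math.pi / (2 * alpha)) * math.exp(-alpha)

    # Bracket of Eq. (2) for gamma = -1/2:
    # (1 - theta)^{1/4} K_{1/2}(alpha sqrt(1 - theta)) - K_{1/2}(alpha)
    root = math.sqrt(1 - theta)
    k_half_root = math.sqrt(math.pi / (2 * alpha * root)) * math.exp(-alpha * root)
    norm = (1 - theta) ** 0.25 * k_half_root - k_half

    # u_r = (alpha*theta/2)^r / r! * K_{r-1/2}(alpha); the Bessel recurrence
    # is rewritten in terms of u_r so that no large factorials appear.
    half = alpha * theta / 2
    u_prev, u_cur = k_half, half * k_half  # u_0 and u_1
    pmf = [u_cur / norm]
    for r in range(1, r_max):
        u_next = (half ** 2 / (r * (r + 1))) * u_prev \
            + (theta * (2 * r - 1) / (2 * (r + 1))) * u_cur
        pmf.append(u_next / norm)
        u_prev, u_cur = u_cur, u_next
    return pmf


if __name__ == "__main__":
    # Illustrative parameter values only.
    probs = sichel_pmf(alpha=2.0, theta=0.9, r_max=2000)
    print("phi(1..5):", [round(p, 4) for p in probs[:5]])
    print("sum over r:", sum(probs))
```

Summing the returned values over a long enough range of r gives a quick check that the bracket in Eq. (2) normalizes the distribution for r ≥ 1.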
Section 3 is a statistical study of the sentence length in DNA sequences. Section 4 is the discussion and summary.

2. Word frequencies in DNA sequences

In this section, we use the Sichel model to study the word frequencies in DNA sequences. In order to adapt the Sichel model to the quantitative study of DNA sequences, the concept of a word must first be defined. In the case of coding regions, the words are the 64 3-tuples which code for the amino acids: AAA, AAT, etc. For non-coding regions, however, the words are unknown. It is therefore better to treat the word length n as a free parameter and perform analyses for n from, say, 3 to 8, as was done in Ref. [1]. The number of possible n-tuples is 4^n; thus, for n = 6, there are 4096 possible 6-tuples. To obtain the word frequency for each n-tuple, we start from the first base pair of the DNA sequence under study and progressively shift a window of length n by one base at a time. For a DNA sequence containing L base pairs, the total number of words is L − n + 1.

To avoid any bias in DNA sequence selection, we analyzed [5] sequences of eukaryotes (13 sequences), invertebrates (4 sequences), eukaryotic viruses (10 sequences), prokaryotes (7 sequences) and bacteriophages (2 sequences), all comprising between 54 000 and 230 000 base pairs, from GenBank Release 128.

In Sichel's model, φ(r) is the fraction of the total number of words that have a frequency of appearance r in the article under study. For example, φ(1) is the fraction of the words used that appear exactly once in the article. To implement our analysis using the Sichel model, we first record the total number N of words (n-tuples) that are used in the DNA sequence. For each frequency of appearance r, we then record the number N(r) of words (n-tuples) that have that frequency in the DNA sequence.
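The counting procedure described so far can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names are hypothetical, ambiguous bases are not filtered, and N is taken here to be the number of distinct n-tuples that actually occur, which is one reading of "words that are used" and is the choice that makes the fractions N(r)/N sum to one.

```python
from collections import Counter


def count_ntuples(sequence, n):
    """Count overlapping n-tuples using a window shifted one base at a time."""
    seq = sequence.upper()
    # A sequence of L bases yields L - n + 1 windows.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def frequency_spectrum(word_counts):
    """Map each frequency r to N(r), the number of distinct words seen r times."""
    return Counter(word_counts.values())


def relative_spectrum(word_counts):
    """Return {r: N(r) / N}, with N assumed to be the number of distinct words."""
    spectrum = frequency_spectrum(word_counts)
    n_distinct = len(word_counts)
    return {r: spectrum[r] / n_distinct for r in sorted(spectrum)}


if __name__ == "__main__":
    # Toy sequence for illustration only; real inputs would be GenBank records.
    toy = "ACGTACGTTTGACCA"
    counts = count_ntuples(toy, 3)
    print("total windows:", sum(counts.values()))
    print("N(r):", dict(frequency_spectrum(counts)))
    print("N(r)/N:", relative_spectrum(counts))
```

For n = 6 a dictionary-based counter holds at most 4096 keys, so memory is not a concern for sequences of the lengths quoted above.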
We divide N(r) by N, call the result φ(r), and plot φ(r) against r. A χ² test is used to obtain the best fit of the data to φ(r) in Eq. (2). Fig. 1 shows some of the sequences we have studied, for which we used γ = −1/2 and performed the χ² test.

3. Sentence length in DNA sequences

To study the sentence length of DNA sequences, one needs to define what a sentence is. In linguistics, it is easy to identify a sentence. In the case of DNA sequences, what exactly a sentence should be is unknown. We proceed here with the following strategy. We divide a DNA sequence into segments in such a way as to maximize the divergence in nucleotide composition (such as the CpG domain) between the resulting DNA domains, until a stopping criterion is reached. We then identify each segment as a sentence in the DNA sequence. In our analysis, we use the segmentation method proposed by Bernaola-Galvan et al. [6] and Grosse et al. [7], which is based on the Jensen–Shannon divergence measure, D_JS, and study the bacterial DNA sequence Eco110K as an example.

The Jensen–Shannon divergence measure is an information-theoretical functional which quantifies the difference between two or more probability distributions and can be used to compare the symbol composition of different sequences. D_JS has been used for measuring the distance between random graphs [8], in the analysis of DNA sequences [9] and in the segmentation of texture images [10]. In the analysis of Bernaola-Galvan et al. [6], the segmentation procedure is applied to a DNA sequence in order to partition it into domains of homogeneous nucleotide composition. The Jensen–Shannon divergence is used to measure the compositional difference between the two subsequences on either side of a cut. In the actual computation, one maximizes D_JS in order to maximize the compositional difference between the two subsequences. Finally, a stopping criterion is introduced in order to distinguish partitioning due to true heterogeneity from that due to random fluctuation.
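A rough sketch of this kind of segmentation is given below, assuming the length-weighted form D_JS = H(S) − (i/L) H(S_left) − ((L − i)/L) H(S_right), where H is the Shannon entropy of the nucleotide composition and i is the cut position. The stopping rule is only a crude stand-in: the hypothetical parameter `min_score` thresholds L · D_JS and is not the significance criterion of Refs. [6,7]. All function names are illustrative.

```python
import math
from collections import Counter


def entropy(counts, total):
    """Shannon entropy in bits of the composition given by symbol counts."""
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def best_cut(seq):
    """Return (position, D_JS) for the cut that maximizes the divergence.

    Returns (None, 0.0) when no cut gives a positive divergence.
    """
    total = len(seq)
    left, right = Counter(), Counter(seq)
    h_whole = entropy(right, total)
    best_pos, best_d = None, 0.0
    for i in range(1, total):
        # Move one base from the right part to the left part.
        base = seq[i - 1]
        left[base] += 1
        right[base] -= 1
        d = (h_whole
             - (i / total) * entropy(left, i)
             - ((total - i) / total) * entropy(right, total - i))
        if d > best_d:
            best_pos, best_d = i, d
    return best_pos, best_d


def segment(seq, min_score):
    """Recursively split seq; stop when len(seq) * D_JS falls below min_score.

    min_score is an assumed placeholder for a proper significance test.
    """
    pos, d = best_cut(seq)
    if pos is None or len(seq) * d < min_score:
        return [seq]
    return segment(seq[:pos], min_score) + segment(seq[pos:], min_score)


if __name__ == "__main__":
    # Toy input: two compositionally distinct halves.
    toy = "AT" * 40 + "GC" * 40
    sentences = segment(toy, min_score=20.0)
    # Histogram of segment ("sentence") lengths.
    print(Counter(len(s) for s in sentences))
```

The recursion depth grows with the number of nested cuts, so for long sequences and a permissive threshold an explicit stack may be preferable to recursion.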
We should remind the reader that any other segmentation method could be used to study the sentence length in DNA sequences. The number of segments N(r) of length r is then recorded. We again divide N(r) by the total number of segments to obtain the relative frequency distribution of segments of length r and plot it against r, as shown in Fig. 2.

4. Summary and discussion

In the above, we have introduced a family of compound Poisson distribution functions to the statistical study of DNA sequences. We have used these distribution functions to fit both the n-tuple and the segment distributions of DNA sequences. In both cases, we have obtained a reasonable fit, both in shape and in normalization. Interestingly, the relative frequency distribution of n-tuples follows the inverse Gaussian distribution across different types of species. Furthermore, in the 6-tuple and segmentation studies, the absolute value of the relative frequency comes out naturally from the linguistic model without ambiguity.

Here we highlight some observations made by comparing the natural language study with the DNA study. For the 6-tuple study, the height of the inverse Gaussian peak lies between 0.03 and 0.08 (note that Sichel reported the rank study, not the word frequency study). For the sentence-length study, the peak value lies between 0.16 and 0.25 for different authors [4], and is around 0.08 for the Eco110K sequence.

It may be premature to conclude that our results imply that DNA sequences resemble a natural language. They do suggest, however, that DNA sequences have features resembling those of natural language and that they may be modeled by linguistic methodology.

References

[1] R.N. Mantegna, et al., Phys. Rev. Lett. 73 (1994) 3169.
[2] G.K. Zipf, Selected Studies of the Principle of Relative Frequency in Language, Harvard University Press, Cambridge, MA, 1932.
[3] M.Yu. Borodovsky, S.M. Gusein-Zade, J. Biomol. Struct. Dyn. 6 (1989) 1001.
[4] H.S. Sichel, J. Am. Stat. Assoc. 70 (1975) 542; H.S. Sichel, J. R. Stat. Soc. A 137 (1974) 25.
[5] M.C. Chung, K.L. Ng, S.P. Li, unpublished.
[6] P. Bernaola-Galvan, R. Roman-Roldan, J.L. Oliver, Phys. Rev. E 53 (1996) 5181.
[7] I. Grosse, P. Bernaola-Galvan, P. Carpena, R. Roman-Roldan, J. Oliver, H. Stanley, Phys. Rev. E 65 (2002) 041905.
[8] A.K.C. Wong, M. You, IEEE Trans. Pattern Anal. Mach. Intell. 7 (1985) 599.
[9] W. Li, G. Stolovitzky, P. Bernaola-Galvan, J.L. Oliver, Genome Res. 8 (1998) 916; W. Li, Phys. Rev. Lett. 86 (2001) 5815.
[10] V. Barranco-Lopez, P. Luque-Escamilla, J. Martinez-Aroza, R. Roman-Roldan, Electron. Lett. 31 (1995) 867.
