Author(s): Shuilin Jin 1, Renjie Tan 2, Qinghua Jiang 3, Li Xu 4, Jiajie Peng 2, Yong Wang 1,*, Yadong Wang 2,*

Introduction

The concept of entropy was first introduced by Shannon [8] as a measure of the complexity of a set of symbols. It can be written as $H = -\sum_{i} p_i \log p_i$, where $p_i$ is the probability of the $i$-th symbol. Since then, the notion of entropy has appeared in many forms, such as metric entropy, topological entropy, Kolmogorov-Sinai entropy and Rényi entropy [7]. All of these concepts serve one purpose: the quantitative description of the complexity or simplicity of symbolic dynamics.
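As a minimal sketch of Shannon's measure, the snippet below estimates the entropy of a symbol string from its empirical symbol frequencies. The function name `shannon_entropy` and the choice of base-2 logarithms (entropy in bits) are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter
from math import log2


def shannon_entropy(symbols):
    """Shannon entropy in bits of the empirical symbol distribution.

    Hypothetical helper: probabilities are estimated as relative
    frequencies of each symbol in `symbols`.
    """
    counts = Counter(symbols)
    total = len(symbols)
    # H = -sum_i p_i * log2(p_i), with p_i = count_i / total
    return -sum((c / total) * log2(c / total) for c in counts.values())


print(shannon_entropy("ACGT"))      # four equally likely symbols
print(shannon_entropy("CGCGCGCG"))  # two equally likely symbols
```

With these inputs, the first call reflects four equiprobable symbols and the second reflects two, so the second value is smaller.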

The complexity of DNA sequences, a special kind of symbolic sequence over the alphabet {A, C, G, T}, can be measured by entropy. Kirillova [5] analyzed DNA sequences of different organisms using topological and metric entropies. Vinga and Almeida [9] introduced Rényi's quadratic entropy to evaluate the randomness of DNA sequences. Zhao, Yang and Wang [10] investigated the complexity of human promoter sequences using a diffusion entropy. Bose and Chouhan [3] studied the superinformation of DNA sequences. Recently, Koslicki [6] introduced a topological entropy for finite sequences and showed that, for each chromosome, the complexity of introns is higher than that of exons.

In this paper, a generalized topological entropy is introduced. We compare it with the topological entropy and show that the topological entropy is a special case of the generalized topological entropy. Using the generalized topological entropy avoids high-dimensional problems, and the definition can be applied to sequences of different lengths. Finally, we apply the generalized topological entropy to the human genome to compute the complexity of introns, exons and promoter regions.

Methods

Let $\omega$ be a DNA sequence of length $|\omega|$, and let $p_\omega(n)$ be the number of different subwords of length $n$ that appear in $\omega$. If the sequence is infinite, the topological entropy is defined as follows.

Definition 1

For an infinite sequence $\omega$ formed over {A, C, G, T}, the topological entropy is

$$H_{top}(\omega) = \lim_{n \to \infty} \frac{\log_4 p_\omega(n)}{n}.$$

Take the symbol sequence $\omega$ = CGCGCGCG··· as an example. For any $n$, the number of different subwords of length $n$ is 2, so the topological entropy of the DNA sequence CGCGCGCG··· is $\lim_{n \to \infty} \frac{\log_4 2}{n} = 0$.
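The CGCGCGCG··· example can be checked numerically with a short sketch. It counts the distinct subwords of length n in a finite prefix of the sequence, which stands in for the infinite sequence, and evaluates the base-4 logarithm of that count divided by n. The function names `subword_complexity` and `entropy_ratio` and the prefix length of 100 are hypothetical choices for illustration, and n is assumed not to exceed the prefix length.

```python
from math import log


def subword_complexity(seq, n):
    """Number of distinct contiguous subwords of length n in seq."""
    return len({seq[i:i + n] for i in range(len(seq) - n + 1)})


def entropy_ratio(seq, n):
    """Base-4 log of the subword count, divided by n.

    This is the quantity whose limit (n -> infinity) defines the
    topological entropy; it assumes 1 <= n <= len(seq).
    """
    return log(subword_complexity(seq, n), 4) / n


# Finite prefix used as a stand-in for the infinite sequence (assumption).
prefix = "CG" * 50

for n in (1, 2, 5, 10, 20):
    print(n, subword_complexity(prefix, n), entropy_ratio(prefix, n))
```

For every n listed, the subword count stays at 2 (one subword starting with C and one starting with G), so the ratio shrinks as n grows, in line with a limit of zero.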

However, a real DNA sequence has finite length, so by Definition 1 its complexity is zero as $n$ tends to infinity. Colosimo and Luca [4] gave a precise description of the shape of the complexity function, and then Koslicki defined an...