Abstract
Background
In the postgenomic era several methods of computational genomics are emerging to understand how the whole information is structured within genomes. Literature of last five years accounts for several alignmentfree methods, arisen as alternative metrics for dissimilarity of biological sequences. Among the others, recent approaches are based on empirical frequencies of DNA kmers in whole genomes.
Results
Any set of words (factors) occurring in a genome provides a genomic dictionary. About sixty genomes were analyzed by means of informational indexes based on genomic dictionaries, where a systemic view replaces a local sequence analysis. A software prototype applying a methodology here outlined carried out some computations on genomic data. We computed informational indexes, built the genomic dictionaries with different sizes, along with frequency distributions. The software performed three main tasks: computation of informational indexes, storage of these in a database, index analysis and visualization. The validation was done by investigating genomes of various organisms. A systematic analysis of genomic repeats of several lengths, which is of vivid interest in biology (for example to compute excessively represented functional sequences, such as promoters), was discussed, and suggested a method to define synthetic genetic networks.
Conclusions
We introduced a methodology based on dictionaries, and an efficient motiffinding software application for comparative genomics. This approach could be extended along many investigation lines, namely exported in other contexts of computational genomics, as a basis for discrimination of genomic pathologies.
Keywords:
Comparative genomics; Computational genomics; Genome clustering; Information theory; Sequence analysisBackground
Genomes are sequences of nucleotides from hundreds to billions of base pairs long. As sequences of symbols they determine dictionaries, that is, formal languages constituted by words occurring in them. They encode the language of life, as dictating the functioning of all the organisms we consider living beings. A main open problem in science is to find a key to understand such an encrypted language, which more or less directly affects the structure and the interaction of all the cellular and multicellular components [1]. It is like having at hand a book, the language of which has still to be deciphered [2,3]. Namely, the international longterm project ENCODE [4] is searching for encyclopedias, lexicons, catalogs, of DNA biochemically annotated elements in human genome.
Working on genomic dictionaries requires the elaboration of enormous moles of data. As an example, the dictionary of all the substrings of length 18 occurring in Drosophila melanogaster’s genome has more than 116 millions of words, which require, only to be stored, nontrivial implementations of ad hoc procedures. To the best of our knowledge, exhaustive studies on collections of kmers were carried out for values of k which do not exceed 13 (see for example [58]).
The starting point of our analysis was the computation of all kmers, with k = 6,12,18, of given genomes, listed in Table 1. Some properties of such specific dictionaries and their compared statistics guided our research along lines of development which were in part already present in the literature [9,10], and in part took us towards new topics, which emerged just from the empirical evidence of computed data. An interesting concept in this context is that of hapax (a Greek term, meaning “once”, coming from philology, where it is used for denoting a “word said once”). In manuscripts these words are relevant for authorship attribution, in genomes they seem to play essential roles in the genome organization as opposed to repeat strings, which instead occur more than once.
Table 1. A list of genomes investigated in the paper
In Table 1 a list is reported of twelve (out of the sixty we have investigated) genomic sequences, to which we applied the methodology described below. They correspond to genomes of well known organisms, constituting biological models, of relevance in various kinds of genomic analysis. The sequences were downloaded from public websites as FASTA files, and processed by a dedicated Java software that we developed.
In the following basic terminology for genomic dictionaries and multisets, and genomic profiles/distributions, is introduced, along with a simple example focused on a specific DNA sequence. Results are reported in terms of both an analysis of dictionaries of klong hapaxes and repeats, together with the introduction of three related dictionarybased informational indexes, and the definition of krepeat sharing gene networks. Section Discussion is then developed around a phasetransition observed in kdictionaries from k = 12 to k = 18, and around the structure of genomic information which emerges when dictionary cardinality trends and multplicitycomultiplicity distributions are compared with those of randomly permuted sequences. A description of the software suite developed to perform all our computations is finally presented in section Methods.
Basic notations
Let us denote by Γ the genomic alphabet of four symbols (characters, or letters, associated to nucleotides): Γ={A,T,C,G} (then Γ^{⋆}, as usual, denotes the set of all possible words over Γ).
A genome G is representable by a sequence over Γ, that is, a table assigning a symbol of Γto each position (from 1 to the length of G). Symbols are written in a linear order, from left to right, according to the standard writing system of west languages, and to the chemical orientation 5^{′}−3^{′} of DNA molecules. By associating to each symbol of Γthe set of positions where it occurs, G may be equivalently identified by four sets of numbers.
All factors (fragments) of a genome G are collected in the set D(G), while we call kgenomic dictionary ofG (for some k ≤ G), denoted by D_{k}(G), the set of all the klong substrings of genome G. The kgenomic tableT_{k}(G), which mathematically corresponds to a multiset, is defined by equipping the words of D_{k}(G) with their multiplicities, that is, the number of their respective occurrences in G. Let α(G) denote the multiplicity of α and pos_{G}(α) gives the set of positions of α in a genome G (that is, the positions where the first symbol of α is placed). Of course, it holds α(G) = pos_{G}(α). Hence, the table T_{k}(G) may be represented by an association of strings to their corresponding multiplicities: α ↦ α(G), with α ∈ D_{k}(G). The sum of all the multiplicities of elements in D_{k}(G) is called the size of T_{k}(G), denoted by T_{k}(G), with the same sign for string length and for set cardinality (but the context of use should avoid any confusion). It is easy to realize that:
Word distribution in a genome may be represented along a graphical profile, which measures the number of kwords having a given number of occurrences. Words having the same multiplicity in a kgenomic table T_{k}(G) can be grouped and their number is called comultiplicity. As an instance, for the sequence ATTAGGATCTTAAT, we have: six 2words occurring once (i.e., AA, AG, TC, CT, GA, GG), two words occurring twice (i.e., TA, TT), one word (i.e., AT) occurring 3 times, and seven 2words which do not occur at all.
If we report 2words multiplicities on the xaxis and their number (comultiplicity) on the yaxis, we obtain the chart in Figure 1a. We call such curves multiplicitycomultiplicitykdistribution (see Figure 2) of a genome. This kind of charts [5] represents a recent approach in genome analysis, opening new investigation lines about the internal logic underlying genome organizations. The same information may be graphically reported as a rankmultiplicity Zipf map (usually employed to study word frequencies in natural languages [11]). As one may notice by looking at Figure 2, both the middle and final inclination of Zipf’s curves is different for four of our organisms, accounting for the multiplicity range in which we have a major density of strings. In all cases, we have few units with maximal multiplicity, indeed Zipf curves initially slope down steeply.
Figure 1. (a) Multiplicitycomultiplicity 2distribution of the sequence ATTAGGATCTTAAT. A simple example of a multiplicitycomultiplicity 2distribution diagram for the specific sequence ATTAGGATCTTAAT is here reported (b) Localization of some repeats. A diagram is shown for localization of repeats in the range 1.3  1.6 ×10^{4}of N. equitans’ genome, where one repeat of 130 occurs, after a few shorter ones (about 30). Positions versus repeat lengths are respectively reported on the axes.
Figure 2. Multiplicitycomultiplicity and rankmultiplicity distributions. Some examples of multiplicitycomultiplicity kdistributions and Zipf’s curves [11] are reported, related to the 6genomic dictionary of Escherichia coli, Saccharomyces cervisiae, Drosophila melanogaster, and chromosome 19 of Homo sapiens respectively. On the left, we report a given multiplicity on the xaxis, and the number of 6words having that multiplicity on the yaxis. On the right, we have the corresponding Zipf’s distributions, where the words of the 6genomic dictionary are reported on the xaxis, according to their decreasing number of occurrences (words with a same number of occurrences are lexicographically ordered), which is on the yaxis, in logarithmic form.
Several other nice representations of genomic frequencies may be found in the literature, for example by means of images (in [7], distance between images results in a measure of phylogenetic proximity, especially to distinguish eukaryotes from prokaryotes).
Results
Two important types of factors of genomes are hapaxes and repeats. A hapax of a genome G is a factor α of G such that α(G) = 1. A repeat of G is a factor α of G such that α(G) > 1. Two or more contiguous occurrences of one repeat form a sequence technically called tandem repeat, if the repeated sequence is shorter than 10 nucleotides, one has a minisatellite or short tandem repeat. They describe patterns helpful to determine individual’s inherited traits, namely to determine parentage or genealogical information.
Back to the dictionaries, the set H(G) of hapaxes of G and the set R(G) of repeats of G of course constitute a bipartition of D(G) (at least one element of Γ is a repeat and G is a hapax, therefore H(G) and R(G) are nonempty, also disjoint sets, such that their union is D(G)). We set
where ∩ is the settheoretic intersection.
Therefore, given a genome G of length n, for any k ≤ n we can read it according to the bipartition of its kgenomic dictionaries H_{k}(G) and R_{k}(G). Size variations of kgenomic, khapax and krepeat dictionaries, for k = 1,…,18, are analyzed in the following (see Tables 2, 3, 4 for numerical data), while the size of “forbidden dictionaries” (those composed by “nonappearing” kwords, said also “nullomers” [12]), for given genomes, is of course exponentially increasing with k.
Table 2. Indexes related to D_{6}(G)
Table 3. Indexes related to D_{12}(G)
Table 4. Indexes related to D_{18}(G)
According to data reported in Table 2, in the first three genomes of the list, D_{6}(G) slightly decreases and repetitiveness slightly increases for longer genomes. When the analyzed genomes length exceeds about 1,800,000 base pairs, the decomposition of D_{6 }in hapaxes and repeats keeps the identical respective cardinalities. All the 6genomic dictionaries are composed by only repeat words (i.e., they do not contain any hapax).
In Table 3, the number of hapax words H_{12 }(G) appears not related to the length of genome G, and neither to the cardinality of D_{12}(G); while the ratio of 12hapaxes over 12repeats HR_{12 }appears roughly decreasing with the genome length. This is due to the fact that 12repeat words constitute a considerable portion of 12genomic dictionary, actually a percentage (called RD_{12}) which increases with the genome length (from 11% to 90%). The average 12factors repeatability index, in the last column, accounts for the average repeatability of 12repeats in all the genomes.
In Table 4, cardinality of D_{18}and H_{18}increase with the genome length, as expected. As a notable result though, we can see that the 18repeatfactor ratio RD_{18} is firmly fixed (over all the genomes) on a very small portion of the 18genomic dictionary, mostly ranging from 0.01 to 0.07 (and always less than 1%), independently on the genome length. The 18hapaxrepeat ratio HR_{18} does not show a regular behavior with respect to the length, but its values are considerably greater for longer words (according to the data, for k = 12 and k = 18). The average 18factor repeatability index does not exhibit the regularity of the average 12factor repeatability with respect to the genome length, it even shows an exceptionally high value for the chromosome 19 of H. sapiens.
It is easy to see that any genomic factor containing a hapax as a substring is an hapax as well. Hence an hapax within the genome may be elongated (by keeping its property to be an hapax) up to reach the genome itself, which is of course an hapax. It is then interesting to evaluate, for each genome G: i) how H_{k}(G) varies with k (see http://www.cbmc.it/external/Infogenomics3 webcite), ii) the khapax positions (that is, how densely hapax words fall in the genetic regions), and iii) the shortest length of an hapax. Also, a ksimilarity between genomes G and G^{′} could be measured by H_{k}(G)∩H_{k}(G^{′}) (we have some work in progress on the computation of dictionary intersections).
The concepts of hapax and repeat provide a great number of related notions which permit to define important aspects in the analysis of real genomes. In following sections we will discuss numerical data, reported in tables, diagrams, and figures, which include the measure of the ratio between H_{k}(G) and R_{k}(G) as a function of k (that is, how the number of hapax words of a given length increases or decreases with respect to the number of repeats of that length). We observed a sort of transition phase effect in the passage from D_{12}(G) to D_{18}(G), in almost all genomes of Table 1, where a clear inversion appears in the ratio hapaxcardinality/repeatcardinality.
Dictionary based indexes
For a genome G we may define klexicality, that is, the ratio L_{k}(G) = D_{k}(G)/T_{k}(G), which expresses the percentage of distinct kfactors of G with respect to the all the kfactors present in G (in Tables 2, 3, 4, it is clear that the klexicality increases with the word length k, and does not exhibit any regularity with the genome length). Of course, the inverse of this ratio provides an average repeatability of kfactors in G.
A more refined measure for the averagekfactors repeatability in G may be now given as:
where khapaxes have been excluded by both the kgenomic multiset and the kgenomic dictionary (the symbol ∖ represents the settheoretic difference). Index AR_{k}(G) counts the proper (average) repeatability of krepeats in genome G (see Tables 3 and 4 for computed numerical values).
Finally, maximal repeats of a genome G are substrings occurring at least twice and having maximal length. Some numerical indexes related to this concept are i) the maximal repeat length MR(G), ii) the number of different maximal repeat sequences, and iii) the number of times each maximal subsequence is repeated (see Table 5).
Table 5. MR index and MRrepeat distance
All genomes turned out to have only one repeat having maximal length (and multiplicity 2), and the distance of the two positions (in proportion to the genome length) is reported in Table 5. They are in most cases relatively very close. Although for k = 6,12,18, R_{k} increases with the genome length n, there is no apparent correlation between n and the MR index (in all cases R_{MR} = 2).
Any substring of a repeat word is still a repeat, with an own multiplicity along the genome, and inside the repeat word itself. A further index is thus defined over genomes G, called MR(G) (maximal repeat length), as the maximal length of words Γsuch that γ(G) > 1. An algorithmic way to find it (for our genomes) starts from repeats out of D_{18}(G) (that are less than three a half millions) and checks how much they may be elongated on the genome by keeping their status of repeat words. Data related to the MR index computed over our genomes are reported in Table 5, where the only MRlong repeat of each genome exhibits a nontrivial structure (that is, different than polymers with a same nucleotide or similar patterns), and complex repeats are obtained for many lengths.
The importance of word repeatability is crucial in understanding the information content of texts. A genome analysis in terms of (shortest) hapaxes and (maximal) repeats, providing their relative distribution within the genome, highlights the associative nature of DNA as a container of information [13]. Localization (see Figure 1b) and frequency (see Figure 2) of DNA fragments of specific length is indeed crucial in understanding the information organization of genomes [14].
Repeatsharing gene networks
Once we discovered that the percentage of repeats in dictionaries is “low” (and decreasing with k), we focused on studying the positions of 18repeats along the genome, in order to check if they are more densely present in encoding regions or noncoding ones. This investigation allowed us to design a synthetic gene network in the following way: nodes are genes, and they are connected by an edge if they have at least one common repeat (that is, there exists a repeat which is a proper factor common to the two genes). An interest for this kind of diagram (see examples in Figures 3 and 4) finds a motivation in the hypothetic communication between genes due to competitions for short endogenous RNA sequences (around 20 bases long) proposed in [15].
Figure 3. Repeat sharing gene network of N. equitans. A subgraph is pointed out of the 18repeat sharing gene network of Nanoarcheaum equitans, a short genome (see Table 1) which is mostly (93%) formed by genes. As we may notice on the right, the gene NEQ475 is linked with the NEQ186 and NEQ457. It contains at least two occurrences of each of three different repeats, has 8 distinct repeats in common with NEQ186 and only one with NEQ 457.
Figure 4. Repeat sharing gene network ofE. coli. A subgraph is pointed out of the 18repeat sharing gene network of Escherichia coli, whose genome has an high percentage (89%) of genes. Four genes in the figure on the right turn out all connected, by only one repeat in half of the connections, and a quite high number of common repeat in the others.
We have work in progress to investigate these kparametrized labeled gene networks by standard methods of graph theory and network analysis. Gene nodes with higher degrees turned out to be actually involved in important long genetic pathways, and for specific values of k, between 16 and 18, drastic changes may be observed in the network conformation, while emerging several clusters of genes. However, this is out of the scope of this work, even if it will be a natural extension of it.
Discussion
In this session we would like to specifically discuss the computational results reported in all the tables, and the importance of reading a genome by its mutliplicitycomultiplicity kdistribution. In both cases internal structural properties of genomes emerge which highlight regularity indicators, based on the number and distribution of repeats.
For all our genomes of Table 1, listed according to an increasing genome length order, we report in Tables 2, 3, and 4 numerical data related to the computation of D_{k}(G),H_{k}(G),R_{k}(G) for k = 6, 12, and 18, respectively^{a}.
A peculiar phenomenon regarding hapax statistical distribution may be observed passing from the 12 to the 18genomic dictionary (see Tables 3 and 4). For all the genomes, by enlarging the k value, the number of hapax increases, even relatively to the number of repeats (roughly speaking, “most of the 12words are repeats while most of 18words are hapax”). Indeed, by computing for k = 12,18, we see that repeatability generally increases with genome length for k = 6,12, while this regularity disappears for k = 18.
More interestingly, the (relative) amount of hapaxes increases by some orders of magnitude with k passing from 12 to 18. Based on this observation coming from computational experiments, one could suppose that by increasing the word size, genomic dictionaries composed of only hapaxes may be computed (which would have been good news for genome reconstruction algorithms [16,17]). This intuition though has been invalidated by further computations (see Table 2). In fact, repeats having length of several thousands have been found within each of our genomes (see for example Figure 5, and the website http://www.cbmc.it/external/Infogenomics3 webcite), and 12→18 represents a sort of phase transition from scarce to abundant hapax/repeat distribution. This phenomenon would surely deserve a more detailed and generalized analysis.
Figure 5. Lengthcardinality repeat distributions. In this figure four examples are reported, related to the MR computations of Mycoplasma mycoides, Escherichia coli, Pseudomonas aeruginosa, and Sorangium cellulosum. Here we observe that R_{k} has an exponential decay with the word length k. Moreover, very long repeat words were found for any of the genomes we analyzed.
Random vs real genomes
We have carried out a systematic study of repeat distribution, of real and randomly permuted genomes (that are, random sequences having the same nucleotide frequencies of the original genome), in order to get new information on the structure of such relevant motifs [14].
We produced some diagrams showing how the number of genomic, hapax, and repeat words of a given length varies with respect to the length (see website http://www.cbmc.it/external/Infogenomics3 webcite), and a common remarkable finding is the similar shapes of the curves, where the transition aforementioned occurs. Cardinality trends of sets D_{k}(G) (dictionary words), R_{k}(G) (repeat words), and H_{k}(G) (hapax words), for k = 1,…,18 are compared for genomes and their random permutations, and specifically for Human chromosome, a greater difference between random and nonrandom situation may be clearly observed (see Figure 6).
Figure 6. Cardinality trends ofD_{k}(G)(chart on top),H_{k}(G)(second chart), andR_{k}(G) (bottom chart), for G being theHomo sapiens (chromosome 19), and fork = 1,…,18. Blue lines (big dots) represent dictionary trends of real genomic sequence, and red lines (small dots) represent dictionary trends for a random permutation of the same genomic sequence. First chart: number of genomic kwords; Second chart: number of hapax kwords; Third chart: number of repeat kwords.
If we compare the dictionaries of the genome with those of its random permutation (in Figure 6, respectively, big blue versus small red dots), we find quite similar curves. However, even when diagrams follow the same general trends, specific characters of these curves correspond to features which are typical of the single genomes [18]. In general, random values are always considerably greater than nonrandom values, for both hapax and whole dictionaries, while the opposite appears for repeats, before and after the distribution peaks.
All the data were confirmed along with several random permutations. However, apart of the comparison with permuted sequences, we would like to observe the shape of R_{k} in itself. Only in a limited range of values for k, R_{k}has a significant size, and such a range is [7,17] for all the analyzed genomes, with a pick around the value k=10, while both shifting towards the values 11, 12 for the pick, with the increasing of genome length.
Multiplicitycomultiplicity charts have been computed for all the genomes as well, by means of an application of the software described in the Methods section. displays some of them for 6words of four organisms: Escherichia coli, Saccharomyces cervisiae, Drosophila melanogaster and Homo sapiens (chromosome 19). Blue bars are related to real genome sequences and red bars concern random permutations of the same sequences. At a first glance, in real genome distributions (blue bars) we notice a common trend, very similar to a Poisson distribution, with specific peculiarities which characterize each genome. On the other hand, random permutations of genomic sequences have multimodal distributions which depend on base frequencies.
We observe that the multplicitycomultiplicity distribution of Escherichia coli has multiplicities (xaxis) between about 0 and about 5,400, whereas Drosophila melanogaster has multiplicities between about 5,000 and about 330,000. On the other hand, the maximum comultiplicity is 8 for Escherichia coli, and is 3 for Drosophila melanogaster (in Figure 7, see the yaxis of the first and the third charts). These parameters are very different even if the “shape” of the genomic sequences in the two charts is quite similar. In order to perform a comprehensive analysis of multplicitycomultiplicity distribution we have dealt with them as probability distributions, and we have computed about 25 statistical indexes which characterize them, such as, maximum, minimum and mean multiplicity, maximum, minimum and mean comultiplicity, standard deviation, kurtosis, skewness, mode, entropy, etc. In [18] these indexes have been successfully employed to classify genomes according to their organism kingdom.
Figure 7. Multiplicitycomultiplicity 6distributions forEscherichia coli,Saccharomyces cervisiae,Drosophila melanogaster andHomo sapiens (chromosome 19). Blue bars represent distributions of real genomic sequences, while red bars represent distributions of random permutations of genomic sequences
As a conclusion, in Figure 7 we would like to point out that in cases of random permutations of genomes, multimodal shapes may be observed, which depend on the base frequencies of genomes. However, the apparently more ordered concentrations of word multiplicities, around the modes, can be explained by considering that frequencies allow us to classify (and count) words corresponding to the same multiset (Parikh vector equivalent). Consequently, due to the random effect, being the words with the same multiset equally probable, they concentrate around the multiplicity associated to this probability. These distribution differences between randomly permuted genomes and real genomes is another measure of the information content that genomes have with respect to casual sequences.
Conclusions
Bipartition of a genomic dictionary in hapax and repeat words emphasizes the roots of precise string categories which are related to the functional organization of genomes. The set of 18repeats in our genomes has a digital size which is a couple of orders smaller than the whole genome, and it seems to have a role of “lexical” coding, that is, a semantics external to the genome. Other elements, with a notably bigger digital size, seem to have a role of addressing, delimiting, coordinating, just like positionidentification tags.
The definition, computation, and analysis of well characterized dictionary based genomic indexes have pointed out some phenomena of genomic regularity and specificity. They can highlight our knowledge about the internal logic of genome structure and organization, as well as about evolutional and functional attributes of genomes (as in [18], specifically devoted to genome clustering).
Future work
There are several lines of development that our research is intended to pursue. We are already working on some of these, mainly focused on the study of intersections among genomic dictionaries. It would be interesting to check the relationship between words recurrent in dictionary intersections and those which are known to be conserved along the evolutive lineages. Another research line concerns the intergenomic character of hapaxes and repeats. The question is about which hapaxes (respectively repeats) of a given genome occur in other genomes of a certain class by keeping their status of hapax (resp. repeat) when compared to the new context of words.
Finally, we conclude with a fundamental question which points out a novel perspective related to the approach developed in the paper: what is the essence of a genome? For genome functions, two aspects are essential: the presence of some factors and their relative positions. Discovering which factors are essential, the classes related to their roles, and the mechanisms for expressing their relative positions, could provide essential properties of genomes, even without a detailed knowledge of their whole sequence. The approach outlined in this paper could be considered as a first step in the exploration of this perspective.
Methods
The genome analysis described so far requires a rigorous protocol and a sophisticated technological infrastructure in order to be performed systematically. Dictionaries, tables, distributions and related indexes, described so far, need a lot of computational resources to be calculated, and advanced data exploration and visualization tools to be analyzed. We have developed a process (and a related software suite), shown in Figure 8, for informational index generation and analysis. It involves three main phases: (i) acquisition of genomic sequences from public databases, (ii) computation of informational indexes, which are subsequently stored in a database, (iii) visualization, exploration and quantitative analysis of these informational indexes.
Figure 8. Genome analysis process and software architecture.
Sequences were downloaded as FASTA files from NCBI genome database[19], UCSC Genome Bioinformatics website [20] and EMBLEBI website [21], and they were stored, with their accession numbers and identification data, on our server. About sixty sequences have been analyzed so far, corresponding to genomes of well known organisms, often constituting biological models, of remarkable relevance in the genomic analysis. All classes of Archea, Bacteria, and Eucaryotes^{b} are represented.
The software employed to process genomic sequences and to compute informational indexes is a sophisticated service oriented architecture based on Java web services. The Java EE application model guarantees the scalability, accessibility, and manageability needed by our application. Each index is computed by a specific web service which receives as an input a genomic sequence with some additional parameters, and stores the results in a MySQL database, representing the data warehouse of our infrastructure.
Optimized data structures and algorithms were required to perform index computation since huge amount of data had to be processed. The entire application is hosted by a high performance server having 16 processors and 24GB of RAM. Our index database currently contains about 100GB of data, consisting of 300 millions of records. The amount of information generated by web services is sometimes very large (e.g., a 12genomic dictionary D_{12}(G) could have up to 4^{12 }≈ 16 millions of words) and the storage of this information in databases could require quite a lot of time and specific database setting. The advantage to use web services to compute informational indexes is that they can be called by many kinds of application clients. In this section we have described only a Java application client, but web clients or nonJava clients (e.g., Microsoft.Net or Matlab clients) could be employed as well. Web services guarantee a great interoperability and extensibility to our application.
The visualization and exploration of such an enormous dataset requires specific tools as well. We have adopted a data access solution, called Qlik^{®}View [22], coming from the world of Business Intelligence (where sophisticated elaborations of huge moles of economic and financial data are performed). This tool enables an interactive exploration of large and complex datasets by means of a patented inmemory associative technology.
Figure 9 shows a screenshot of the Qlik^{®}View application, which has two main sections. A navigation menu, on the left, by which the user can select genome sequences, organism kingdoms and dictionary parameters. A central area containing visualization elements of genomic indexes, such as tables, charts, lists of words, and diagrams.
Figure 9. Visualization and exploration of informational indexes by means of a Qlik^{®}View application called InfoGenomics. Multiplicitycomultiplicity distributions of four genomic sequences are visualized in the same (central) chart in order to visually compare their profiles. The figure shows a table where number of occurrences and related number of words are listed, and can be selected in order to focus the exploration on specific features. A second chart, placed on the right, shows cumulative distributions, and a table placed on the bottom shows statistical indexes (e.g., mean, standard deviation) related to the distributions.
Tabs differ only in the central area, where informational indexes are displayed by means of several kinds of graphical objects provided by Qlik^{®}View. This way to visualize and browse the information is very powerful and enables the user to achieve a deep insight into the genomes. The following list summarizes the functionalities developed so far which contained in the tabs: genome basic indexes (genome identificators, base frequencies, gccontent, etc.); kDictionaries and MultiplicityComultiplicity distributions; normalizations of indexes at the previous item; statistical parameters (e.g., mean, standard deviation, mode, kempirical entropy, etc.) related to MultiplicityComultiplicity distributions; dictionary intersections; maximal repeat lengths; dictionary size trends.
Endnotes
^{a}When analyzing downloaded genomes, in some cases we have found a number num of unavoidable words, defined as those containing IUPAC (variable) symbols, which can assume one of the values A, T, C, G (see http://www.mun.ca/biochem/courses/3107/symbols.html webcite). When they are present in a genome, such as the case of Haemophilus Influenzae, they are eliminated from the computation of all words in the genome, then the kgenomic dictionary is built up not from n−k + 1 genomic klong words, but from the n−k−num + 1 regular words. Specifically, as value of num we have found: for H. influenzae’s 6/12/18genomic dictionary, respectively 646, 1,271, 1,877; for D. melanogaster’s 6/12/18genomic dictionary, respectively 1,225,656, 1,226,400, 1,227,144; for H. sapiens’ 6/12/18genomic dictionary, respectively 1,171,155, 1,173,045, 1,174,935.
^{b}A most detailed description of these genomes may be found in: http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/G/GenomeSizes.html webcite.
Competing interests
The authors declare that they have no competing interests.
Author’s contributions
This paper is a first step towards a project, called Infogenomics, conceived and designed by the last author VM, who defined an initial kernel of informational indexes, to be investigated and compared on specific genomes. A sophisticated service oriented architecture (SOA) based on Java web services and a Qlik^{®}View application, has been developed by the first author AC, in order to make possible the high amount of computations necessary for the informational analysis of genomes. All the authors discussed and agreed on the interpretation of experimental results, with a main role of GF in the preparation of the paper. All authors read and approved the final manuscript.
Acknowledgements
The first author was funded by CBMC (Center for Biomedical Computing), in Verona, Italy, which also provided us with the highperformance server where all the computations were performed.
References

Gibson DG, et al.: Creation of a Bacterial Cell Controlled by a Chemically Synthesized Genome.
Science 2010, 329(5987):5256. PubMed Abstract  Publisher Full Text

Gibson G, Muse SV: A Primer of Genome Science. Sinauer Associates Inc; 2009.

Percus JK: Mathematics of Genome Analysis. Cambridge Studies in Mathematical Biology: Cambridge University Press; 2007.

The ENCODE Project consortium: ENCODE.
Nature 2012, 489(7414):45113. PubMed Abstract  Publisher Full Text

Chor B, Horn D, Goldman N, Levy Y, Massingham T: Genomic DNA kmer spectra: models and modalities.
Genome Biol 2009, 10:R108. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Zhou F, Olman V, Xu Y: Barcodes for genomes and applications.
BMC Bioinf 2008, 9:546. BioMed Central Full Text

Deschavanne PJ, Giron A, Vilain J, Fagot G: Genomic Signature: Characterization and Classification of Species Assessed by Chaos Game Representation of Sequences.
Mol Biol Evol 1999, 16(10):13911399. PubMed Abstract  Publisher Full Text

Hao B, Qi J: Prokaryote phylogeny without sequence alignment: from avoidance signature to composition distance.
J Bioinf and Comput Biol 2004, 2:119. Publisher Full Text

Fofanov Y, Luo Y, Katili C, Wang J, Belosludtsev Y, Powdrill T, Belapurkar C, Fofanov V, Li TB, Chumakov S, Pettitt B: How independent are the appearances of nmers in different genomes?

Vinga S, Almeida: Alignmentfree sequence comparison  a review.
Bioinformatics 2003, 19(4):513523. PubMed Abstract  Publisher Full Text

Mantegna RN, Buldyrev S, Godberger A, Havlin S, Peng C, Simons M, Stanley H: Linguistic Features of Noncoding DNA Sequences.
Phys Rev Lett 1994, 73(23):31693172. PubMed Abstract  Publisher Full Text

Hampikian G, Andersen T: Absent sequences: nullomers and primes.

Haubold B, Pierstorff N, Möller F, Wiehe T: Genome comparison without alignment using shortest unique substrings.
BCM Bioinf 2005, 6:123. BioMed Central Full Text

Ichinose N, Yada T, Gotoh O: Largescale motif discovery using DNA Gray code and equiprobable oligomers.
Bioinformatics 2012, 28:2531. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tai Y, et al.: CodingIndependent Regulation of the Tumor Suppressor PTEN by Competing Endogenous mRNAs.
Cell 2011, 147:344357. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Fici G, Mignosi F, Restivo A, Sciortino M: Word assembly through minimal forbidden words.
Theor Comput Sci 2006, 359:214230. Publisher Full Text

Cicalese F, Erdös P, Lipták Z: Efficient reconstruction of RCequivalent strings.
In IWOCA 2010  LNCS 6460 Edited by Iliopoulos C, Smyth WF . 2011, 349362.

Castellini A, Manca V, Compri S, Tosadori G, Bicego M: Genome classification by dictionarybased indexes. In Poster, presented at the Int. Conf. on Pattern Recognition in Bioinformatics (PRIB2011). TU Delft; 2011.

NCBI Genome database [http://www.ncbi.nlm.nih.gov/sites/genome webcite]

UCSC Genome Bioinformatics website [http://hgdownload.cse.ucsc.edu/downloads.html webcite]

EMBLEBI website [http://www.ebi.ac.uk/genomes/ webcite]

QlikView website [http://www.qlikview.com/ webcite]