Table 8

The top 25 words in the entire genome

| Word | Unmasked S | Unmasked ES | Unmasked O | Unmasked EO | Unmasked OlnOEO | Masked S | Masked ES | Masked O | Masked EO | Masked OlnOEO | Unmasked RevComp | Unmasked RC_Pos | Unmasked Pal | Unmasked PValues |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AAAAAAAA | 5 | 5 | 128631 | 119310 | 9675.67 | 5 | 5 | 101229 | 95334 | 6073.66 | TTTTTTTT | 1 | No | 0 |
| TTTTTTTT | 5 | 5 | 126533 | 117302 | 9585.11 | 5 | 5 | 98883 | 93091.2 | 5968.36 | AAAAAAAA | 0 | No | 1.67E-15 |
| TATATATA | 5 | 5 | 58215 | 49385.7 | 9575.32 | 5 | 5 | 29264 | 27159.9 | 2183.54 | TATATATA | 2 | Yes | 3.89E-15 |
| ATATATAT | 5 | 5 | 59429 | 53453 | 6298.28 | 5 | 5 | 30192 | 29596.8 | 601.111 | ATATATAT | 3 | Yes | 3.00E-15 |
| TAAAAAAT | 5 | 5 | 14823 | 11276.3 | 4053.8 | 5 | 5 | 11492 | 9148.23 | 2621.21 | ATTTTTTA | 5 | No | 4.44E-16 |
| ATTTTTTA | 5 | 5 | 14743 | 11385.1 | 3810.52 | 5 | 5 | 11392 | 9219.87 | 2409.99 | TAAAAAAT | 4 | No | 3.33E-16 |
| GAAGAAGA | 5 | 5 | 30102 | 26908.7 | 3375.68 | 5 | 5 | 22784 | 20523.6 | 2380.53 | TCTTCTTC | 7 | No | 0 |
| TCTTCTTC | 5 | 5 | 30267 | 27090.3 | 3356.11 | 5 | 5 | 23044 | 20902.7 | 2247.42 | GAAGAAGA | 6 | No | 0 |
| TTTTAAAA | 5 | 5 | 29354 | 26314.9 | 3208.24 | 5 | 5 | 19409 | 17519.9 | 1987.46 | TTTTAAAA | 8 | Yes | 2.55E-15 |
| AATATATT | 5 | 5 | 14170 | 11353.5 | 3140.06 | 5 | 5 | 11168 | 10179.5 | 1035.06 | AATATATT | 9 | Yes | 1.11E-16 |
| TTTTCTTT | 5 | 5 | 31066 | 28174.8 | 3034.69 | 5 | 5 | 26876 | 24423.6 | 2571.58 | AAAGAAAA | 11 | No | 0 |
| AAAGAAAA | 5 | 5 | 31033 | 28187.3 | 2984.8 | 5 | 5 | 26861 | 24502.1 | 2469 | TTTTCTTT | 10 | No | 1.11E-16 |
| AGAGAGAG | 5 | 5 | 19376 | 16630.5 | 2960.63 | 5 | 5 | 12615 | 11397.8 | 1280.05 | CTCTCTCT | 16 | No | 1.11E-16 |
| TCTCTCTC | 5 | 5 | 19179 | 16519.7 | 2862.73 | 5 | 5 | 12912 | 11634.1 | 1345.64 | GAGAGAGA | 14 | No | 4.44E-16 |
| GAGAGAGA | 5 | 5 | 20064 | 17413.4 | 2842.81 | 5 | 5 | 13136 | 11970.7 | 1220.21 | TCTCTCTC | 13 | No | 1.89E-15 |
| AAGAAGAA | 5 | 5 | 32397 | 29731.9 | 2781.12 | 5 | 5 | 24352 | 23296.2 | 1079.35 | TTCTTCTT | 19 | No | 0 |
| CTCTCTCT | 5 | 5 | 18513 | 15956.1 | 2751.61 | 5 | 5 | 12312 | 11212.7 | 1151.45 | AGAGAGAG | 12 | No | 1.11E-16 |
| AGAAGAAG | 5 | 5 | 26477 | 24049.7 | 2545.91 | 5 | 5 | 19161 | 18013.6 | 1183.17 | CTTCTTCT | 20 | No | 8.88E-16 |
| TTATATAA | 5 | 5 | 11402 | 9138.11 | 2523.66 | 5 | 5 | 9262 | 8518.12 | 775.46 | TTATATAA | 18 | Yes | 1.11E-15 |
| TTCTTCTT | 5 | 5 | 32333 | 29910 | 2518.58 | 5 | 5 | 24550 | 23579.9 | 989.811 | AAGAAGAA | 15 | No | 0 |
| CTTCTTCT | 5 | 5 | 26463 | 24183.9 | 2383.23 | 5 | 5 | 19432 | 18332.3 | 1132.03 | AGAAGAAG | 17 | No | 0 |
| TTTTTCTT | 5 | 5 | 30561 | 28331 | 2315.57 | 5 | 5 | 26516 | 24717.1 | 1862.84 | AAGAAAAA | 22 | No | 0 |
| AAGAAAAA | 5 | 5 | 30461 | 28234.7 | 2311.9 | 5 | 5 | 26488 | 24756.8 | 1790.32 | TTTTTCTT | 21 | No | 4.44E-16 |
| TTTGTTTT | 5 | 5 | 32141 | 29931 | 2289.6 | 5 | 5 | 27813 | 26102.2 | 1765.71 | AAAACAAA | 36 | No | 8.88E-16 |


Top 25 overrepresented words in the entire genome of Arabidopsis thaliana. The Word attribute gives the short nucleotide sequence of a putative word. S and ES give the number of chromosomes in which a word occurs and the number in which it was expected to occur, respectively; O and EO give the total number of occurrences and the expected total number of occurrences. The OlnOEO score measures the statistical overrepresentation of the word in the genome and is based on a Markov chain background model. Each set of attributes was computed for both the masked and the unmasked version of the corresponding segment, with emphasis on the unmasked version (i.e., the table is sorted by the unmasked OlnOEO score).
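
The tabulated scores are consistent with reading OlnOEO as O × ln(O / EO). The following is a minimal sketch under that assumption; the function name `oln_oeo` is hypothetical, and EO is taken directly from the table rather than derived from the Markov chain background model, which is not specified here.

```python
import math


def oln_oeo(observed: float, expected: float) -> float:
    """Overrepresentation score, assumed here to be O * ln(O / EO).

    observed: total number of occurrences of the word (O)
    expected: expected number of occurrences under the background model (EO)
    """
    if observed <= 0 or expected <= 0:
        raise ValueError("observed and expected counts must be positive")
    return observed * math.log(observed / expected)


# Unmasked values for AAAAAAAA and TATATATA, copied from the table.
score_poly_a = oln_oeo(128631, 119310)
score_tata = oln_oeo(58215, 49385.7)
print(round(score_poly_a, 1), round(score_tata, 1))
```

Because the table reports EO to only about six significant digits, the recomputed scores agree with the tabulated OlnOEO values (9675.67 and 9575.32) only approximately.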

Further information about each word is provided by its reverse complement (RevComp), the position of the reverse complement in the set of results (RC_Pos), and a flag indicating whether the word is a genomic palindrome (Pal).
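
The RevComp and Pal columns can be reproduced with a short sketch. It assumes that a genomic palindrome is a word equal to its own reverse complement, which matches the Yes/No entries in the table; the function names are illustrative and are not taken from the authors' software.

```python
# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Return the reverse complement of a DNA word (uppercase A, C, G, T)."""
    return word.translate(_COMPLEMENT)[::-1]


def is_genomic_palindrome(word: str) -> bool:
    """A word is treated as a genomic palindrome if it equals its reverse complement."""
    return word == reverse_complement(word)


for w in ("AAAAAAAA", "TATATATA", "GAAGAAGA", "TTTGTTTT"):
    print(w, reverse_complement(w), "Yes" if is_genomic_palindrome(w) else "No")
```

RC_Pos would then be the index of `reverse_complement(word)` in the ranked result list; the entries above suggest that this index starts at 0.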

Finally, PValues gives a p-value for each word, which helps determine whether the word is statistically relevant or was flagged as interesting by random chance.

Lichtenberg et al. BMC Genomics 2009 10:463   doi:10.1186/1471-2164-10-463
