Table 2

GenBank data sets

Organism group                Vertebrata  Arthropoda   Fungi  Magnoliophyta

Total CDS with introns             54729       34336   31441          95711

pseudogene                          1899          90     101            789
not experimental                    1204         515     504           9150
incomplete 5' end (<)              15622       10583   11143          11659
incomplete 3' end (>)               5417        1664     569           1561
cross-reference                    10445         231       2             60
join (complement)                      0          16       0             34
contains 'X'                         106         120      71            100
contains 'U'                          26           4       0              0
no initial 'M'                       222          51       9             34
zero or negative length               36           7      17             35
annotated gap                        480           6       0             25
length mismatch                      466          19      11             18

Used for length statistics         18807       21030   19014          72247

non-gt...ag                         1734         818    1159           3368
intron too short                     550        1244    2354          12125

CDS accepted                       16523       18968   15501          56754

After homology reduction            3542        4179    4525          12751

With signal peptides                 755         769     431           1051
Without signal peptides             2552        3202    3814          10370

Number of genes (CDS features) found in GenBank for each of the four organism groups studied, the number discarded for each of the listed reasons, the number kept after homology reduction, and the numbers predicted to contain or not to contain a signal peptide.
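The table does not state how the rows relate arithmetically. The figures are nonetheless consistent with one reading of the final filtering step: the "CDS accepted" count equals the "Used for length statistics" count minus the "non-gt...ag" and "intron too short" counts. The sketch below checks that reading against the numbers in the table. The variable and function names are illustrative only and do not come from the paper.

```python
# Counts copied from Table 2, in column order.
GROUPS = ["Vertebrata", "Arthropoda", "Fungi", "Magnoliophyta"]

used_for_length_stats = dict(zip(GROUPS, [18807, 21030, 19014, 72247]))
non_gt_ag = dict(zip(GROUPS, [1734, 818, 1159, 3368]))
intron_too_short = dict(zip(GROUPS, [550, 1244, 2354, 12125]))
cds_accepted = dict(zip(GROUPS, [16523, 18968, 15501, 56754]))


def remaining_after_intron_filters(group):
    """Assumed relationship: subtract both intron-based discards
    from the set used for length statistics."""
    return (used_for_length_stats[group]
            - non_gt_ag[group]
            - intron_too_short[group])


for group in GROUPS:
    computed = remaining_after_intron_filters(group)
    status = "matches" if computed == cds_accepted[group] else "differs from"
    print(f"{group}: computed {computed} {status} table value {cds_accepted[group]}")
```

The same kind of check does not hold for every block of the table. For example, the "With signal peptides" and "Without signal peptides" counts do not sum to the "After homology reduction" count in any column, so the rows should not be assumed to be additive in general.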

Nielsen and Wernersson BMC Genomics 2006 7:256   doi:10.1186/1471-2164-7-256
