Table 1

GENETAG corpus statistics. The 20K sentences were split into four subsets called Train, Test, Round1 and Round2.

                                                          Train     Test   Round1   Round2    Total
Number of Sentences                                       7,500    2,500    5,000    5,000   20,000
Number of Words                                         204,195   68,043  137,586  137,977  547,801
Number of Tagged Genes = G                                8,935    2,987    5,949    6,125   23,996
Total Number of Alternative Forms of Gene Names in G      6,583    2,158    4,275    4,505   17,531
Number of Gene Names in G with Alternative Forms = N      4,675    1,522    3,057    3,186   12,440
Average Number of Alternatives per Gene Name in N          1.66     1.67     1.62     1.65     1.65

Tanabe et al. BMC Bioinformatics 2005 6(Suppl 1):S3   doi:10.1186/1471-2105-6-S1-S3
