Table 1. GENETAG corpus statistics. The 20K sentences were split into four subsets called Train, Test, Round1 and Round2.

| | Train | Test | Round1 | Round2 | Total |
|---|---|---|---|---|---|
| Number of Sentences | 7,500 | 2,500 | 5,000 | 5,000 | 20,000 |
| Number of Words | 204,195 | 68,043 | 137,586 | 137,977 | 547,801 |
| Number of Tagged Genes = G | 8,935 | 2,987 | 5,949 | 6,125 | 23,996 |
| Total Number of Alternative Forms of Gene Names in G | 6,583 | 2,158 | 4,275 | 4,505 | 17,531 |
| Number of Gene Names in G with Alternative Forms = N | 4,675 | 1,522 | 3,057 | 3,186 | 12,440 |
| Average Number of Alternatives per Gene Name in N | 1.66 | 1.67 | 1.62 | 1.65 | 1.65 |

Tanabe et al. BMC Bioinformatics 2005 6(Suppl 1):S3 doi:10.1186/1471-2105-6-S1-S3