StrainInfo (http://www.straininfo.net) is a global catalog of microbial material that builds upon the catalogs of Biological Resource Centers (BRCs) by integrating catalog entries of equivalent microbial material. Currently, the integration algorithm resolves the equivalent cultures and links all downstream information. However, to increase the information content of StrainInfo, fine-grained semantic information must be added. This information enters StrainInfo at the culture level (through synchronization with BRC catalogs), but must be integrated at the strain level (i.e. the set of equivalent cultures) in order to be presented on so-called strain passports.
The adoption of Microbiological Common Language (MCL) XML synchronization quickly increased the volume of semantic data in StrainInfo. However, the actual data values of the different semantic fields are still raw textual entries: they vary in detail, appear in different forms or languages, and sometimes contain inconsistencies or outright errors. Consequently, generating a strain-level consensus value for each field requires a specialized semantic integration of this data. As a case study for semantic integration in StrainInfo, the focus was put on the isolation habitat and location fields because of their importance from both a biological and a legal (IP rights) perspective. An example of such data can be found in Table 1.
Table 1. Example isolation habitat and location data of a Pichia guilliermondii strain, as listed by different BRCs. For each column, we want to calculate a consensus value for the complete strain.
To integrate geographical information, named entity recognition is performed by annotating all geographic names with features from the GeoNames ontology. This yields a multitude of annotations, each matching a name with one or more geographical features. Because many geographic names are not unique (e.g. Cambridge is annotated with both the USA and the UK instance), irrelevant annotations are removed by using other higher-order features, such as countries or continents, found in the same strain. In addition, the most specific feature is selected by removing the higher-order features, since these are redundant and can be inferred from the ontology. The remaining annotation is the integration result; multiple remaining annotations, or features that are too distant from each other, indicate inconsistent data.
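The disambiguation and specificity steps described above can be sketched as follows. This is only an illustrative sketch, not the StrainInfo implementation: the toy gazetteer, the feature identifiers and the function names are hypothetical stand-ins for GeoNames features, name matching is reduced to exact case-insensitive comparison, and the distance check between remaining features is omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feature:
    fid: str               # hypothetical identifier, not a real GeoNames id
    name: str
    parent: Optional[str]  # fid of the containing (higher-order) feature


# Toy gazetteer standing in for the GeoNames ontology (assumption).
GAZETTEER = [
    Feature("eu", "Europe", None),
    Feature("na", "North America", None),
    Feature("uk", "United Kingdom", "eu"),
    Feature("us", "United States", "na"),
    Feature("cam_uk", "Cambridge", "uk"),
    Feature("cam_us", "Cambridge", "us"),
]
BY_ID = {f.fid: f for f in GAZETTEER}


def ancestors(fid):
    """Return the ids of all higher-order features containing `fid`."""
    found = set()
    parent = BY_ID[fid].parent
    while parent is not None:
        found.add(parent)
        parent = BY_ID[parent].parent
    return found


def annotate(names):
    """Match every name with all gazetteer features carrying that name."""
    return {
        n: {f.fid for f in GAZETTEER if f.name.lower() == n.lower()}
        for n in names
    }


def integrate(names):
    """Return the most specific features left after disambiguation."""
    annotations = annotate(names)
    # Unambiguous annotations serve as context for ambiguous ones.
    context = {next(iter(c)) for c in annotations.values() if len(c) == 1}
    kept = set()
    for candidates in annotations.values():
        if len(candidates) > 1:
            # Keep only candidates contained in a feature seen in the context;
            # if none is supported, the ambiguity is left unresolved.
            supported = {c for c in candidates if ancestors(c) & context}
            candidates = supported or candidates
        kept |= candidates
    # Higher-order features are redundant when a contained feature is kept.
    implied = set().union(*(ancestors(f) for f in kept))
    return kept - implied
```

With this sketch, `integrate(["Cambridge", "United Kingdom", "Europe"])` leaves a single feature, whereas `integrate(["Cambridge"])` leaves two candidates; under the reading above, the latter would be flagged as ambiguous or inconsistent rather than reported as a consensus.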
The habitat fields can be integrated using a similar algorithm. However, to obtain sufficient ontological coverage, a combination of the Environmental Ontology (EnvO), the NCBI Taxonomy and the Foundational Model of Anatomy (FMA) ontology is used. This may yield multiple orthogonal annotations, but for this field, having multiple annotations increases the information content and therefore does not indicate inconsistencies.
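A minimal sketch of how habitat annotations from several ontologies might be combined is given below. The three small hierarchies are hypothetical, heavily simplified stand-ins for EnvO, the NCBI Taxonomy and the FMA, and the naive substring matching is an assumption made for brevity, not the recognition method used in StrainInfo. Within each ontology only the most specific terms are kept, while annotations from different ontologies are retained side by side.

```python
# Hypothetical mini-ontologies: term -> parent term (None for roots).
ONTOLOGIES = {
    "envo": {
        "environmental material": None,
        "soil": "environmental material",
        "forest soil": "soil",
    },
    "ncbi_taxonomy": {"Viridiplantae": None, "Quercus": "Viridiplantae"},
    "fma": {"organ": None, "lung": "organ"},
}


def ancestors(ontology, term):
    """Return all higher-order terms of `term` within one ontology."""
    found = set()
    parent = ontology[term]
    while parent is not None:
        found.add(parent)
        parent = ontology[parent]
    return found


def annotate_habitat(text):
    """Return the most specific matching terms, grouped per ontology."""
    lowered = text.lower()
    result = {}
    for name, ontology in ONTOLOGIES.items():
        # Naive substring matching (assumption); real recognition is richer.
        hits = {t for t in ontology if t.lower() in lowered}
        implied = set().union(*(ancestors(ontology, t) for t in hits))
        specific = hits - implied
        if specific:
            result[name] = specific
    return result
```

For a raw entry such as "Forest soil under Quercus", this sketch returns one term from the EnvO stand-in and one from the taxonomy stand-in; both are kept because they describe orthogonal aspects of the habitat.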
IEEE Trans. Knowl. Data Eng. 2005, 17:1111-1126.
Res. Microbiol. 2010, 161(6):439-445. doi:10.1016/j.resmic.2010.02.005