Additional File 2 This file provides the BioWarehouse schema as an SQL definition that can be converted to either Oracle or MySQL format using a tool provided within the BioWarehouse distribution. @- This file is preprocessed by translate-schema.lisp. @- NOTE: @- comments do not appear in preprocessor output file. @- Changes from oracle schema in constructing generic template: @- Change '--SD' to '-- SD' since MySQL didnt like the former syntax. @- Prefixed foreign key constraints with FKEY macro, so it can expand to '--' @- for MySQL until foreign keys are worked out in the schema. @- Added TABLETYPE macro to end of each table (MySQL - "TYPE=INNODB") @- Different table types enable transactions, foreign key support, etc. @- The schema template file uses macro symbols (those beginning with @) @- to specify DBMS-dependent syntax. @- @- Abstract data types: @- @wid: a Warehouse Identifier (a number that is unique throughout the Warehouse) @- @int16, @int32, @int64: signed integral types using 16, 32, and 64 bits resp. @- @real32, @real64: signed floating point types using 32, and 64 bits resp. @- @number: any number for which exact precision is preserved. @- @varchar(nnn): a varying-length string of length <= nnn. @- @string10, @string16, @string32: a varying-length string @- of length <= 2047, 65535, 4294967295 resp. @- @byte32: binary data of length <= 4294967295 bytes. @- @datetime: a date and/or a time. @- @- Other macro symbols: @- @widsequence: Defines a SEQUENCE used to generate WIDs. @- @initialwid: Defines starting value of the above SEQUENCE. @- @index-tablespace: Defines a TABLESPACE in which an INDEX resides. @- @using-index: Defines a TABLESPACE in which an INDEX referenced in a CONSTRAINT resides. @- @tabletype: Specifies a TABLE type (MySQL supports several). @- @autoinc: Specifies the AUTO-INCREMENT property for a numerical row. @- @index10: Marks INDEXes that use long strings. @- The schema template file is subject to certain conventions, for the use of @- our HTML documentation generator: @- o Comment lines directly above a create table are assumed to be @- a description of that table; @- o Comments on the same line (or on a following line with no SQL @- code before the start comment `--') are assumed to describe @- the column of the table; @- o Constraints must be within, and at the end, of a create table @- statement, not following it; @- o Create index statements must follow the table they refer to; @- o For de-normalized tables, to allow the documentation tool to @- pick up the reference, insert a standard syntax constraint @- line with the special comment start `-- SD ' (SD = SchemaDoc). @- These will be ignored by the database, but interpreted by the @- Schemadoc tool. @- ----------------------------------------------------------------------------- @- Tables that are used to define the Warehouse contents and its current state. @- ----------------------------------------------------------------------------- -- Describes this version of the BioSPICE Warehouse. -- This table contains exactly one row. -- CREATE TABLE Warehouse ( Version @number NOT NULL, -- Numbered version of the warehouse schema that is loaded LoadDate @datetime NOT NULL, -- Date that this schema was loaded MaxSpecialWID @wid NOT NULL, -- Maximum value allowed for SpecialWIDTable.PreviousWID / specialWID sequence Description @string10 -- Free-form text as desired by the warehouse maintainers ) @tabletype; INSERT INTO Warehouse (Version, LoadDate, MaxSpecialWID) VALUES (3.0, @now, 999); -- Warehouse ID (WID) allocation -- -- Generate a unlimited sequence of unique numbers to do the bookkeeping for -- Warehouse ID (WID) numbers, which serve as keys for all Warehouse entities. -- WIDs 0 and 1 are reserved and not allocated. -- @widsequence -- To allow for compatibility with RDBMS systems that do not support -- sequences, this table may be used to store the last used WID. -- MySQL dataset loaders use the following protocol for WID assignment: -- DELETE FROM WIDTable; -- INSERT INTO WIDTable VALUES (); -- SELECT last_insert_id(); -- It may also be used in RDBMS systems that do not support auto-increment -- of a counter; however additional protocol like table locking may be -- needed to assure the integrity of WIDs during concurrent assignment. -- This table contains exactly one row at any given time. -- CREATE TABLE WIDTable ( PreviousWID @wid @autoinc NOT NULL, -- Last-used Warehouse identifier. -- CONSTRAINT PK_WIDTable PRIMARY KEY (PreviousWID) @using-index ) @tabletype; INSERT INTO WIDTable (PreviousWID) values (@maxspecialwid); -- Generate a finite sequence of WIDs for use as special Warehouse entities, -- including DataSet WIDs. These are small integers. -- @specialwidsequence -- Special WIDs are low-valued WIDs, assigned to globally and frequently used -- WIDs within a dataset, such as its DataSetWID. They are both more -- mnemonic, and potentially more efficient to store. Warehouse.MaxSpecialWID -- delimits special and regular WIDs. -- Analogous to WIDTable, this table is used to allocate special WIDs. -- MySQL dataset loaders use the following protocol for WID assignment: -- DELETE FROM SpecialWIDTable; -- INSERT INTO SpecialWIDTable VALUES (); -- SELECT last_insert_id(); -- This table contains exactly one row at any given time. -- CREATE TABLE SpecialWIDTable ( PreviousWID @wid @autoinc NOT NULL, -- Last-used special WID CONSTRAINT PK_SpecialWIDTable PRIMARY KEY (PreviousWID) @using-index ) @tabletype; INSERT INTO SpecialWIDTable (PreviousWID) values (1); -- This table defines enumerated types used within the warehouse. Essentially, -- enumerated types correspond to small controlled vocabularies used within -- one or more warehouse tables. For example, the warehouse table Feature has -- a column called Type that defines the type of a feature within an amino-acid -- sequence. Examples types might be PHOS-RESIDUE and GLY-RESIDUE, corresponding -- to residues in a protein that are phosphorylated and glycosylated, respectively. -- Those two enumerated types would be entered into two rows of this Enumeration -- table. Note that in some cases the values actually stored in the warehouse -- table are numbers, which are stored in the Value field of the Enumeration table -- as the string form of that number. -- CREATE TABLE Enumeration ( TableName @varchar(50) NOT NULL, -- Table this enumeration is used in ColumnName @varchar(50) NOT NULL, -- Column this enumeration is used in Value @varchar(50) NOT NULL, -- The enumeration value Meaning @string10 -- A comment describing its meaning ) @tabletype; -- Table whose rows define each dataset currently loaded into the warehouse. -- CREATE TABLE DataSet ( WID @wid NOT NULL, -- Warehouse identifier for this entry. Name @varchar(255) NOT NULL, -- Name of the database. Version @varchar(50), -- Version of the database described by this warehouse dataset ReleaseDate @varchar(50), -- Date that the original database was first released. LoadDate @datetime NOT NULL, -- Date that the load of this data set into the warehouse began. ChangeDate @datetime, -- Date of the last change to this dataset, or the date the load -- completed if no changes have occurred. NULL means that the -- is currently loading, or a fatal error occurred on the load. -- NOTE: This feature is not supported by most loaders until version 4.0. HomeURL @varchar(255), -- Web address of the home page of the original database. QueryURL @varchar(255), -- A URL at which we can retrieve objects in this database via the WWW -- by substituting the unique ID of the object we wish to retrieve for the -- string "%s" in the query URL. An example URL is: -- http://database.university.edu/get-object/5050%s -- CONSTRAINT PK_DataSet1 PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines metadata describing a warehouse object. -- Every warehouse object (that is, anything with a WID column) must have an associated Entry row. -- CREATE TABLE Entry ( OtherWID @wid NOT NULL, -- Warehouse ID of the entity InsertDate @datetime NOT NULL, -- Date entity was stored in the warehouse CreationDate @datetime, -- Date entity was created in its source dataset ModifiedDate @datetime, -- Date that the entry was last modified, either -- in the warehouse or in the source dataset LoadError CHAR(1) NOT NULL, -- 'T' if an error (eg. parse error on some field of the entry) -- occured during loading of this entry into the warehouse, else 'F'. LineNumber @int32, -- Either line number or entry number from dataset source file ErrorMessage @string10, DatasetWID @wid NOT NULL @fkeycomma -- The dataset the entity is from -- @FKEY CONSTRAINT FK_Entry FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE -- SD CONSTRAINT FK_ENTRY1 FOREIGN KEY (OtherWID) REFERENCES Chemical(WID) -- SD CONSTRAINT FK_ENTRY2 FOREIGN KEY (OtherWID) REFERENCES Element(WID) -- SD CONSTRAINT FK_ENTRY3 FOREIGN KEY (OtherWID) REFERENCES Reaction(WID) -- SD CONSTRAINT FK_ENTRY4 FOREIGN KEY (OtherWID) REFERENCES Protein(WID) -- SD CONSTRAINT FK_ENTRY5 FOREIGN KEY (OtherWID) REFERENCES Feature(WID) -- SD CONSTRAINT FK_ENTRY6 FOREIGN KEY (OtherWID) REFERENCES Function(WID) -- SD CONSTRAINT FK_ENTRY7 FOREIGN KEY (OtherWID) REFERENCES EnzymaticReaction(WID) -- SD CONSTRAINT FK_ENTRY8 FOREIGN KEY (OtherWID) REFERENCES Gene(WID) -- SD CONSTRAINT FK_ENTRY9 FOREIGN KEY (OtherWID) REFERENCES BioSource(WID) -- SD CONSTRAINT FK_ENTRY10 FOREIGN KEY (OtherWID) REFERENCES Pathway(WID) -- SD CONSTRAINT FK_ENTRY11 FOREIGN KEY (OtherWID) REFERENCES NucleicAcid(WID) -- SD CONSTRAINT FK_ENTRY12 FOREIGN KEY (OtherWID) REFERENCES Term(WID) -- SD CONSTRAINT FK_ENTRY13 FOREIGN KEY (OtherWID) REFERENCES Support(WID) -- SD CONSTRAINT FK_ENTRY14 FOREIGN KEY (OtherWID) REFERENCES Citation(WID) -- SD CONSTRAINT FK_ENTRY15 FOREIGN KEY (OtherWID) REFERENCES Division(WID) -- SD CONSTRAINT FK_ENTRY16 FOREIGN KEY (OtherWID) REFERENCES GeneticCode(WID) -- SD CONSTRAINT FK_ENTRY17 FOREIGN KEY (OtherWID) REFERENCES Taxon(WID) ) @tabletype; @- ----------------------------------------------------------------------------- @- Tables that contain universally- or widely-accepted scientific data. @- These tables are independent of any dataset. @- ----------------------------------------------------------------------------- -- The periodic table of elements -- CREATE TABLE Element ( WID @wid NOT NULL, -- WID of this element Name @varchar(15) NOT NULL, -- Long name (e.g. 'Oxygen') ElementSymbol @varchar(2) NOT NULL, -- Symbol (e.g. 'O') AtomicWeight @real32 NOT NULL, -- Weighted average weight (e.g. 16.x) AtomicNumber @int16 NOT NULL, -- Number of protons -- CONSTRAINT PK_Element PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines one or more valences for each element. -- CREATE TABLE Valence ( OtherWID @wid NOT NULL, -- WID of the element Valence @int16 NOT NULL @fkeycomma -- Number of valence electrons -- @FKEY CONSTRAINT FK_Valence FOREIGN KEY (OtherWID) REFERENCES Element(WID) ON DELETE CASCADE ) @tabletype; @- ----------------------------------------------------------------------------- @- Object tables. These represent the principal objects in the Warehouse. @- All object tables have a WID column that is globally unique identifier, @- and a DataSetWID column that refers to the DataSet row of the dataset that @- loaded the object. @- ----------------------------------------------------------------------------- -- Defines small molecular weight chemical compounds. -- CREATE TABLE Chemical ( WID @wid NOT NULL, -- Warehouse idetification number for this chemical Name @varchar(255) NOT NULL, -- Common name of the chemical BeilsteinName @varchar(50), -- The Beilstein name of this chemical SystematicName @varchar(255), -- The systematic name for this chemical CAS @varchar(50), -- The Chemical Abstract Service number Charge @int16, -- Charge of the chemical EmpiricalFormula @varchar(50), -- The empirical formula of this chemical -- Format: { ElementSymbol Integer }+ -- E.g. 'H2O1', 'C2H6O1', 'Na1Cl1'. MolecularWeightCalc @real32, -- Molecular weight calculated from chemical formula. -- Units: Daltons. MolecularWeightExp @real32, -- Molecular Weight determined through experimentation. Units: Daltons. OctH2OPartitionCoeff @varchar(50), -- Octanol-water-partition coefficient PKA1 @real32, -- First ionization potential PKA2 @real32, -- Second ionization potential PKA3 @real32, -- Third ionization potential WaterSolubility CHAR(1), -- 'T' if soluble in water, else 'F' Smiles @varchar(255), -- Ascii representation of the chemical structure -- Simplified Molecular Input Line Entry System -- http://www.epa.gov/medatwrk/databases/smiles.html DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from -- @FKEY CONSTRAINT FK_Chemical1 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Chemical1 PRIMARY KEY (WID) @using-index ) @tabletype; CREATE INDEX CHEMICAL_DWID_NAME ON Chemical(DataSetWID, Name) @index-tablespace; CREATE INDEX CHEMICAL_NAME ON Chemical(Name) @index-tablespace; -- Defines a chemical reaction. The reaction could be enzyme catalyzed -- or occur spontaneously. The reaction could involve small molecules, macromolecules, -- or a combination of the two. Every reaction will be stored in the warehouse in -- a given direction, for example, every reaction that has an assigned EC number is -- written in a direction assigned by the enzyme commission. In physiological settings, -- the reaction could occur in the direction the reaction is stored, the reverse direction, -- or both directions. -- -- Restrictions: Reaction only stores fully qualified EC numbers, e.g., NOT of the form "X.Y.Z.-" CREATE TABLE Reaction ( WID @wid NOT NULL, -- Warehouse ID for this entiry DeltaG @varchar(50), -- Delta G subzero prime value for this reaction ECNumber @varchar(50), -- Official enzyme-commission-number. ECNumberProposed @varchar(50), -- Proposed enzyme-commission-number Spontaneous CHAR(1), -- 'T' if the reaction occurs spontaneously, else 'F' DataSetWID @wid NOT NULL, -- WID of the dataset the reaction is from -- @FKEY CONSTRAINT FK_Reaction FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Reaction PRIMARY KEY (WID) @using-index ) @tabletype; CREATE INDEX REACTION_DWID ON Reaction(DataSetWID) @index-tablespace; -- Defines a specific protein, that is, a real-world protein which was either purified and isolated, -- OR is reasonably inferred by genomic analysis or other means, such as enzymological characterization. -- The protein could be a monomer or a multimer. In the latter case, a sequence would not be stored for such a record. -- CREATE TABLE Protein ( WID @wid NOT NULL, -- The warehouse ID of this protein -- Name @varchar(1000), -- Common name of the protein, if it exists Name @string10, -- Common name of the protein, if it exists AASequence @string32, -- Amino-acid sequence for this protein, if available Length @int32, -- Length of the amino-acid sequence for this protein, if available -- This value is not necessarily exact, as can be modified by LengthApproximate LengthApproximate @varchar(10), -- Indicates that the Length is an approximate. It could be 'gt' for -- greater than, 'lt' for less than and 'ne' to indicate that it is not -- equal. This is a controlled vocabulary. Charge @int16, -- Charge of the chemical Fragment CHAR(1), -- 'T' if protein is a fragment, else 'F' MolecularWeightCalc @real32, -- Molecular weight calculated from sequence. -- Units: kiloDaltons. MolecularWeightExp @real32, -- Molecular Weight determined through experimentation. -- Units: kiloDaltons. PICalc @varchar(50), -- Isoelectric point (pI) calculated from its sequence. PIExp @varchar(50), -- Isoelectric point (pI) determined through experimentation. DataSetWID @wid NOT NULL, -- WID of the dataset the protein is from -- @FKEY CONSTRAINT FK_Protein FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Protein PRIMARY KEY (WID) @using-index ) @tabletype; CREATE INDEX PROTEIN_DWID ON Protein(DataSetWID) @index-tablespace; -- Features define regions or points of interest on the protein sequence or -- nucleic acid sequence specified by SequenceWID. -- Features are not used to describe objects such as Genes, BioSources, etc, -- because attributes of those objects are described in their own tables. -- Exceptions: -- 1. Since pseudogenes are not entered in the Gene table, they are listed as sequence features in Feature -- CREATE TABLE Feature ( WID @wid NOT NULL, -- Warehouse identifier of this feature. Description @varchar(255), -- Textual description of this feature. Type @varchar(50), -- Type of feature. These type values come from the source dataset and -- are not necessarily enumerated. Example: "Promoter CYZ". Class @varchar(50), -- Class of feature. Assigns our typology for types of features, or qualifiers -- associated with features. These are our own enumerated type values, to allow us -- to classify features without losing the original (author-provided) type values -- stored in Type. Example: Class=Promoter. SequenceType CHAR(1) NOT NULL, -- Enumeration (see also SequenceWID) that indicates whether the sequence -- is protein or nucleic and how the sequence (if available) is represented: -- If 'P', feature resides on a protein. -- If 'S' or 'N', feature resides on a nucleic acid. SequenceWID @wid, -- References the Protein or Subsequence containing the sequence on which we -- are defining a feature: -- SequenceType of 'S' implies SequenceWID is nonNULL and references a Subsequence- -- sequence = Subsequence.Sequence (i.e., it is stored explicitly), -- SequenceType of 'N' implies SequenceWID (if nonNULL) references a NucleicAcid- -- sequence is the substring Subsequence.Sequence[StartPosition : EndPosition] -- where Subsequence is the full Subsequence of the nucleic acid. -- SequenceType of 'P' implies SequenceWID (if nonNULL) references a Protein- -- sequence is the substring Protein.AASequence[StartPosition : EndPosition]. StartPosition @int32, -- Start position of the feature within the NucleicAcid or Protein sequence. EndPosition @int32, -- End position of the feature within the NucleicAcid or Protein sequence. StartPositionApproximate @varchar(10), -- Indicates that the Start position of the coding region is an approximate value -- It could be 'gt' for greater than, 'lt' for less than and 'ne' to indicate that it is not -- equal. This is a controlled vocabulary. EndPositionApproximate @varchar(10), -- Indicates that the End position of the coding region is an approximate value. -- It could be 'gt' for greater than, 'lt' for less than and 'ne' to indicate that it is not -- equal. This is a controlled vocabulary. ExperimentalSupport CHAR(1), -- 'T' if the feature is supported by experimental evidence, else 'F' ComputationalSupport CHAR(1), -- 'T' if the feature is supported by computational evidence, else 'F' DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from -- @FKEY CONSTRAINT FK_Feature FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Feature PRIMARY KEY (WID) @using-index -- SD CONSTRAINT FK_Feature1 FOREIGN KEY (SequenceWID) REFERENCES Subsequence(WID) -- SD CONSTRAINT FK_Feature2 FOREIGN KEY (SequenceWID) REFERENCES NucleicAcid(WID) -- SD CONSTRAINT FK_Feature3 FOREIGN KEY (SequenceWID) REFERENCES Protein(WID) ) @tabletype; -- Describes the non-enzymatic function(s) of proteins. That is, -- this table should not be used to describe the enzymatic function of a protein whose enzymatic -- function is described using the EnzymaticReaction table. Function names are stored as -- strings with no particular format or interpretation. -- CREATE TABLE Function ( WID @wid NOT NULL, -- Warehouse identifier for this function. Name @varchar(255), -- Name of this function. DataSetWID @wid NOT NULL, -- Data set the entity came from -- @FKEY CONSTRAINT FK_Function FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Function PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines an association between a reaction and an enzyme that catalyzes -- that reaction (ProteinWID). In the case where we are defining an -- association between a subunit of a larger enzyme complex, and a -- reaction catalyzed by that subunit only when the subunit is part of -- that larger complex, ComplexWID specifies that larger complex (assumes -- that complex is in the warehouse). -- This table requires a WID as it is referenced in multiple linking tables -- to associate alternate compounds, cofactors, etc., to allow for multiple -- valued attributes of the enzymatic reaction. -- CREATE TABLE EnzymaticReaction ( WID @wid NOT NULL, -- WID of this enzymatic reaction ReactionWID @wid NOT NULL, -- WID of the catalyzed reaction ProteinWID @wid NOT NULL, -- WID of the enzyme ComplexWID @wid, -- WID of protein of which ProteinWID -- must be a subunit of for this reaction ReactionDirection @varchar(30), -- Reaction Direction Enumeration DataSetWID @wid NOT NULL, -- WID of the dataset the reaction is from -- @FKEY CONSTRAINT FK_EnzymaticReaction1 FOREIGN KEY (ReactionWID) REFERENCES Reaction(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_EnzymaticReaction2 FOREIGN KEY (ProteinWID) REFERENCES Protein(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_EnzymaticReaction3 FOREIGN KEY (ComplexWID) REFERENCES Protein(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_EnzymaticReaction4 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_EnzymaticReaction PRIMARY KEY (WID) @using-index ) @tabletype; CREATE INDEX ER_DATASETWID ON EnzymaticReaction(DataSetWID) @index-tablespace; -- Defines the genetic codes with a name, translation table and a start codon for each code. -- See http://anatomy.med.unsw.edu.au/CBL/embryo/DNA/Genetic_Codes.htm for more information. -- Genbank ID is stored as DBID.XID, where DBID.OtherWID = GeneticCode.WID. -- CREATE TABLE GeneticCode ( WID @wid NOT NULL, -- Warehouse ID for this GeneticCode. NCBIID @varchar(2), -- 1- or 2-digit ID assigned to this code. See URL above. Name @varchar(100), -- Genetic code name. TranslationTable @varchar(64), -- Translation table for this genetic code. See URL above for format. StartCodon @varchar(64), -- Start Codon for this genetic code. See URL above for format DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from. @FKEY CONSTRAINT FK_GeneticCode FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_GeneticCode PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines a GenBank Division, used in a ***taxonomic sense***. Contains the name of the division and a three letter code. -- modified 5/16/04 by YP to reflect our inferrence that there are two different definitions -- used by NCBI for the term "Division": either as a data organization -- (e.g., the data files associated with the CON division) or with a taxonomic usage of the term (e.g., the BCT division). -- According to this inferrence, there will never be a CON division in the Division table. -- CREATE TABLE Division ( WID @wid NOT NULL, -- Warehouse ID for this Division. Code @varchar(10), -- Genbank Division code (3 characters) which is an abbreviation for the GenBank Name. Name @varchar(100), -- Division name. DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from. @FKEY CONSTRAINT FK_Division FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Division PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines a taxon, which could be a species, genus, family or some other rank. -- These ranks are controlled vocabulary and are stored in the Enumeration table. -- The taxon table associates a taxon with its parent, gives the name, division code and the -- genetic code information for the taxon. -- CREATE TABLE Taxon ( WID @wid NOT NULL, -- Warehouse ID for this Taxon ParentWID @wid, -- Warehouse ID of the parent of this taxon Name @varchar(100), -- Taxonomic Name of this taxon Rank @varchar(100), -- Rank of this taxon (kingdom, superkingdom ..) DivisionWID @wid, -- Warehouse ID of the division this taxon belongs to. InheritedDivision CHAR(1), -- 'T' if division is inherited from parent, else 'F' GencodeWID @wid, -- Warehouse ID of the genetic code for this taxon. InheritedGencode CHAR(1), -- 'T' if gencode is inherited from parent, else 'F' MCGencodeWID @wid, -- Warehouse ID of the mitochondrial genetic code for this taxon. InheritedMCGencode CHAR(1), -- 'T' if the mitochondrial gencode is inherited from parent, else 'F' DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from @FKEY CONSTRAINT FK_Taxon FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Taxon_Division FOREIGN KEY (DivisionWID) REFERENCES Division(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Taxon_GeneticCode FOREIGN KEY (GencodeWID) REFERENCES GeneticCode(WID) ON DELETE CASCADE, CONSTRAINT PK_Taxon PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines the biological source of an entry in the Protein, NucleicAcid or Gene tables ("the object"), -- as well as other tables eventually storing experimental data related to e.g., microarray or proteomic -- experiments. For example, BioSource might be used to specify the organism, organ, and tissue in -- which gene expression experiments were performed. -- DEPENDENCIES: -- 1. As BioSource references Taxon.WID to specify the species, the NCBI taxonomy dataset must have ALREADY been loaded -- prior to populating BioSource. -- 2. BioSource stores subtype information in BioSubtype via the BioSourceWIDBioSubtypeWID linking table @- Created by YP 4/27/04 -- CREATE TABLE BioSource ( WID @wid NOT NULL, -- Warehouse identifier for biological source. TaxonWID @wid, -- Warehouse GUID identifying WID of Taxon entry for the species. Always -- references species level or higher. Name @varchar(200), -- Informal name assigned to this source. -- Ex. (CMR): 'Escherichia coli K12-MG1655' Strain @varchar(50), -- Strain of organism from which object is derived, if applicable. This is a -- subspecies taxonomic classification not supported by Taxon because it is not -- a rigorous concept. -- Strain is defined as a population of homogeneous organisms possessing a set of -- defined characteristics. In bacteriology, the set of descendants that retains -- the characteristics of the ancestor. Members of a strain that subsequently -- differ from the original isolate are regarded as belonging either to a substrain -- of the original strain, or to a new strain. Organ @varchar(50), -- Organ of organism from which object is derived, if applicable. -- Organ is defined as any part of the body exercising a specific function, such as -- respiration, secretion, or digestion, if applicable. Organelle @varchar(50), -- Organelle of organism from which object is derived, if applicable. Tissue @varchar(50), -- Tissue of organism from which object is derived, if applicable. Can be used in -- conjunction with Organ. -- Tissue is defined as a collection of similar cells and the intercellular substances -- surrounding them. Multiple tissues can be part of an Organ. CellType @varchar(50), -- Cell type of organism from which object is derived, if applicable. -- Cell type is defined as the type of an individual cell, e.g., "myoblast". This is a -- cellular-level classification. CellLine @varchar(50), -- Cell line from which object is derived, if applicable -- Cell line usually pertains to an immortalized version of a given cell type DevelopmentStage @varchar(50), -- Stage of development associated with the object, if applicable Sex @varchar(15), -- Sex of the organism from which object is derived, if applicable -- Enumerated. Values=Male/Female/Hermaphrodite/Asexual/NULL (NULL here means not provided by dataset) DataSetWID @wid NOT NULL, -- The dataset this entity came from or is associated with -- @FKEY CONSTRAINT FK_BioSource1 FOREIGN KEY (TaxonWID) REFERENCES Taxon(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_BioSource2 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_BioSource PRIMARY KEY (WID) @using-index ) @tabletype; -- Describes idiosyncratic sub-species descriptions such as "serotype", "variety", etc. -- These can be many to one (1 BioSource - many BioSubtypes) linked via BioSourceWIDBioSubtypeWID @- Created by YP 4/27/04 @- Added Version 3.0 -- CREATE TABLE BioSubtype ( WID @wid NOT NULL, -- Warehouse GUID for this BioSubtype Type @varchar(15), -- Type of subtype, e.g., "serotype" Value @varchar(50) NOT NULL, -- Value of the subtype, e.g., "O157" DataSetWID @wid NOT NULL, -- The dataset this entity came from or is associated with -- @FKEY CONSTRAINT FK_BioSubtype2 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_BioSubtype PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines a specific nucleic acid molecule, such as DNA or of RNA. -- Entries in this table correspond to a real-world DNA or RNA molecules -- which at one point were purified and isolated ("something you can point at"). A row in this table -- can define a complete molecule, a fragment of a molecule, or a molecule that has been partially -- sequenced in different regions. -- This table will be used in several ways: (1) to associate a sequence with an entire -- replicon, or a region of a replicon, when the sequence of that replicon is known; (2) to associate -- a DNA sequence with a single gene; (3) to define a DNA or RNA molecule that has not been -- sequenced; (4) to define RNA molecules such as tRNAs. -- -- Features on a NucleicAcid molecule (such as promoters or binding sites) can be defined using -- the Feature table. -- -- The Subsequence table contains zero or more full or partial sequences contents of this molecule. -- If one represents the full sequence of this molecule, Subsequence.FullSequence = 'T' where -- Subsequence.NucleicAcidWID references this molecule. In this case, NucleicAcid.FullySequenced = 'T' as well. -- @- TODO: Could add Unsequenced flag -- @- Updated extensively Version 3.0 CREATE TABLE NucleicAcid ( WID @wid NOT NULL, -- Warehouse identifier for this NucleicAcid. Name @varchar(200), -- Name or description of this molecule. -- Ex: (CMR) 'Chromosome II Brucella melitensis 16M'. Type @varchar(30) NOT NULL, -- Enumeration: 'DNA' 'RNA' or 'na' Class @varchar(30) NOT NULL, -- Enumeration: describes the molecule as it exists in the organism; stores values such as RNA subtype ("pre-RNA", "mRNA"), replicon type (e.g., "chromosome", "plasmid") Topology @varchar(30), -- Enumeration: 'circular', 'linear' or 'other'. Strandedness @varchar(30), -- Enumeration indicating whether Nucleic Acid is single stranded, double stranded or mixed stranded. SequenceDerivation @varchar(30), -- Enumeration describing how the sequence was generated -- (e.g., from single clone, from assembly of sequences -- from other sequence entries, collection of different clones, etc). Fragment CHAR(1), -- 'T' if this is a fragment of a molecule, -- 'F' if this NucleicAcid describes an entire molecule. FullySequenced CHAR(1), -- 'T' if the molecule is completely -- sequenced within this dataset, else 'F'. MoleculeLength @int32, -- The length of the molecule, in nucleotides. This value can be an -- approximation. Even if this value is known to be exact, -- it may not necessarily be identical to that of TotalLength, as the molecule -- This value is calculated if it is not explicitely stated in the source -- dataset MoleculeLengthApproximate @varchar(10), -- Enumeration; specifies whether MoleculeLength stores an approximate value: -- 'gt' for greater than, -- 'lt' for less than, or -- 'ne' for not equal. CumulativeLength @int32, -- The cumulative number of nucleotides for Subsequences referenced by this NucleicAcid entry, -- whether contiguous or not. This value is a summation of the number of nucleotides for these Subsequences. -- If the molecule is completely sequenced, this value should be identical -- to that of MoleculeLength; in this case, both fields are populated. CumulativeLengthApproximate @varchar(10), -- Enumeration; specifies whether TotalLength approximates actual total length: -- 'gt' for greater than, -- 'lt' for less than, or -- 'ne' for not equal. GeneticCodeWID @wid, -- References the genetic code of this molecule. BioSourceWID @wid, -- References the biological source of this molecule. DataSetWID @wid NOT NULL, -- References the data set from which the entity came from -- @FKEY CONSTRAINT FK_NucleicAcid1 FOREIGN KEY (GeneticCodeWID) REFERENCES GeneticCode(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_NucleicAcid2 FOREIGN KEY (BioSourceWID) REFERENCES BioSource(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_NucleicAcid3 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_NucleicAcid PRIMARY KEY (WID) @using-index ) @tabletype; -- Contains a contiguous sequence of a nucleic acid molecule. Multiple subsequences can make up a given NucleicAcid entry. -- Contains either the full or partial nucleotide sequence of the nucleic acid molecule stored in NucleicAcid. -- Todo: Currently the schema has no way of indicating an approximate position -- of this Subsequence on the NucleicAcid as a whole. @- Ordering/map issues avoided for this release as too complex @- Added Version 3.0 -- CREATE TABLE Subsequence ( WID @wid NOT NULL, -- Warehouse identifier for this Subsequence. NucleicAcidWID @wid NOT NULL, -- References the containing nucleic acid molecule. FullSequence CHAR(1), -- 'T' if this is the complete sequence of the nucleic acid molecule -- referenced by NucleicAcidWID, else 'F'. Sequence @string32, -- Nucleotide sequence of this Subsequence. Length @int32, -- Number of nucleotides (and characters) in Sequence. This value -- is calculated if it is not explicitely stated in the source -- dataset LengthApproximate @varchar(10), -- Enumeration; indicates that Length approximates actual Sequence length: -- 'gt' for greater than, -- 'lt' for less than, or -- 'ne' for not equal. PercentGC @real32, -- Percentage of Sequence nucleotides that are either guanine or cytosine. -- This value is calculated if it is not explicitely stated in the source -- dataset Version @varchar(30), -- Dataset-specific information to indicate the version of this Subsequence. DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from -- @FKEY CONSTRAINT FK_Subsequence1 FOREIGN KEY (NucleicAcidWID) REFERENCES NucleicAcid(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Subsequence2 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Subsequence PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines a notion of gene that is limited to procaryotic aspects of -- genes. Later versions of the warehouse will expand this definition to include -- eukaryotic aspects of genes. Separate tables define associations between a gene and -- (1) its biological source(s), -- (2) its protein product(s), if any, and -- (3) its RNA product(s), if any. -- CodingRegionStart is the position of the first nucleotide in the start codon; -- CodingRegionEnd is the position of the last nucleotide in the codon preceding -- the stop codon. @- Modified for 3.0 -- CREATE TABLE Gene ( WID @wid NOT NULL, -- Warehouse identifier this gene. Name @varchar(255), -- Name of the gene. NucleicAcidWID @wid, -- Reference to the NucleicAcid molecule (replicon) this gene resides upon. SubsequenceWID @wid, -- Reference to the Subsequence containing the nucleotide sequence of this Gene. @- Added 3.0 Type @varchar(100), -- Describes the type of molecule which is known to be *ultimately* produced by this gene -- enumerated values (polypeptide, pre-mRNA, rRNA, tRNA, etc) GenomeID @varchar(30), -- Unique ID assigned to this gene, such as by a genome project CodingRegionStart @int32, -- Base position of start of coding region. Start is always less than End, -- except for genes that wrap around the origin of a circular chromosome. CodingRegionEnd @int32, -- Base position of end of coding region. CodingRegionStartApproximate @varchar(10), -- Indicates that the Start position of the coding region is an approximate value. -- It could be 'gt' for greater than, 'lt' for less than and 'ne' to indicate that it is not -- equal. This is a controlled vocabulary. CodingRegionEndApproximate @varchar(10), -- Indicates that the End position of the coding region is an approximate value. -- It could be 'gt' for greater than, 'lt' for less than and 'ne' to indicate that it is not -- equal. This is a controlled vocabulary. Direction @varchar(25), -- Direction of transcription as defined in the enumeration table Interrupted CHAR(1), -- 'T' if the gene is interrupted, else 'F'. DataSetWID @wid NOT NULL, -- Reference to the data set from which the entity came from -- @FKEY CONSTRAINT FK_Gene1 FOREIGN KEY (NucleicAcidWID) REFERENCES NucleicAcid(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Gene2 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Gene PRIMARY KEY (WID) @using-index ) @tabletype; CREATE INDEX GENE_DATASETWID ON Gene(DATASETWID) @index-tablespace; -- Pathways are graphs of reactions, grouped together according to a -- higher biological function they perform. Some pathways may be -- "template", "reference", "model" or "sum of organisms" pathways. Those -- pathways do not contain an BioSourceWID reference, and have the type -- field set to "R" (Reference/Model/Theoretical). For real organisms, -- the type is "O". -- CREATE TABLE Pathway ( WID @wid NOT NULL, -- Warehouse ID for this pathway Name @varchar(255) NOT NULL, -- Name of the pathway Type CHAR(1) NOT NULL, -- Type of pathway (see comment above) BioSourceWID @wid, -- BioSource this pathway occurs in DataSetWID @wid NOT NULL, -- WID of the dataset the pathway is from -- @FKEY CONSTRAINT FK_Pathway1 FOREIGN KEY (BioSourceWID) REFERENCES BioSource(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Pathway2 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Pathway PRIMARY KEY (WID) @using-index ) @tabletype; @- Does KEGG loader use these indexes? CREATE INDEX PATHWAY_BSWID_WID_DWID ON Pathway(BioSourceWID, WID, DataSetWID)@index-tablespace; CREATE INDEX PATHWAY_TYPE_WID_DWID ON Pathway(TYPE, WID, DataSetWID)@index-tablespace ; CREATE INDEX PATHWAY_DWID ON Pathway(DataSetWID) @index-tablespace; -- A term is a controlled vocabulary term used within a particular dataset. -- E.g. the keyword attached to proteins in SwissProt. Terms can be arranged -- in a hierarchy (see the TermWIDParentWID table). -- CREATE TABLE Term ( WID @wid NOT NULL, -- Warehouse identifier for this term Name @varchar(100) NOT NULL, -- Name for the term DataSetWID @wid NOT NULL, -- WID of the dataset the pathway is from -- CONSTRAINT PK_Term PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines a parameterized computation that has been performed on objects in the warehouse. -- TODO: Parameters are specified in the Parameter table. -- CREATE TABLE Computation ( WID @wid NOT NULL, -- WID of this computation Name @varchar(50) NOT NULL, -- Name of the algorithm, procedure, or program, e.g. 'BLAST' Description @string16, -- Description of the computation performed DataSetWID @wid NOT NULL, -- @FKEY CONSTRAINT FK_Computation FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Computation PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines a literature citation. -- Typically a citation is associated with one or more Warehouse objects via the CitationWIDOtherWID table. -- CREATE TABLE Citation ( WID @wid NOT NULL, -- WID of this citation Citation @string10 NOT NULL, -- Line of text containing the citation PMID @number, -- Pubmed ID of the citation, if known DataSetWID @wid NOT NULL, -- @FKEY CONSTRAINT FK_Citation FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Citation PRIMARY KEY (WID) @using-index ) @tabletype; -- Describes point of contact information for an individual and/or an organization. -- CREATE TABLE PointOfContact ( WID @wid NOT NULL, -- Warehouse identifier for this entity Organization @varchar(100), -- Affiliated organization (company, university, institute) Department @varchar(100), -- Research group, lab, academic dept., etc., of organization Person @varchar(100), -- Name of person to contact: 'Lastname, Firstname' Email @varchar(50), -- E-mail address Phone @varchar(50), -- Telephone number Fax @varchar(50), -- Fax number URL @varchar(100), -- Web address of this contact Address @varchar(255), -- Postal address -- CONSTRAINT PK_POC PRIMARY KEY (WID) @using-index ) @tabletype; -- Defines an association between a warehouse entity and an external representation or depiction -- of that entity in a well-defined format. -- Can be used to store the contents itself, or to indicate that an archive is present at a URL. -- CREATE TABLE Archive ( WID @wid NOT NULL, -- Warehouse identifier for this entity OtherWID @wid NOT NULL, -- WID of entity archive is associated with Format @varchar(10) NOT NULL, -- Storage/compression format used eg. 'zip' 'tar' 'tgz' 'jpeg' Contents @byte32, -- Archive contents URL @string10, -- Web address the archive is found at, ideally a specific file ToolName @varchar(50), -- The name of the tool or software kit the archive is used by -- eg. 'MIAMExpress' DataSetWID @wid NOT NULL, -- DataSet.WID this entity is associated with -- CONSTRAINT PK_ARCHIVE PRIMARY KEY (WID) @using-index -- SD CONSTRAINT FK_Archive1 FOREIGN KEY (ExperimentWID) REFERENCES Experiment(WID) ) @tabletype; -- Defines an experiment. Provides a context for associating experimental data with it. -- Allows tree-structured experiments consisting of heterogeneous subexperiments, -- subexperiments corresponding to time-series observations, -- and repeated trials of identical experiments. -- For a hierarchical experiment, data should be associated with the Experiment -- at the appropriate level. For example, if data reflects results from averaging -- numerous identically conducted trials, that data should be associated with the -- Experiment representing the group of these trials. -- A Comment table entry may be created to contain discussion of results, etc. -- A DBID table entry may be created if the (sub)experiment has a unique name within its DataSet. -- SynonymTable table entries may be created to associate names with the (sub)experiment. -- If published, a reference may be created in Citation table. -- CREATE TABLE Experiment ( WID @wid NOT NULL, -- Warehouse identifier for this entity Type @varchar(50) NOT NULL, -- Mnemonic for the general type of experiment, e.g. 'microarray' ContactWID @wid, -- PointOfContact.WID for submitter/investigator ArchiveWID @wid, -- Archive.WID of external rep'n of experiment and its data. StartDate @datetime, -- Date+time experiment was started EndDate @datetime, -- Date+time experiment was completed Description @string16, -- Description of this experiment GroupWID @wid, -- Experiment.WID defining the parent group of this subexperiment GroupType @varchar(50), -- Enumeration indicating type of any and all direct subexperiments -- Exs: 'replicate' 'variant' 'step' 'time-series' GroupSize @int32 NOT NULL, -- Number of child experiments loaded GroupIndex @int32, -- Counter value identifying this experiment in a sequence or set -- of subexperiments (eg. time-series). NULL if GroupWID is NULL. TimePoint @int32, -- Time offset indicating when associated observations in a -- time-series experiment were conducted TimeUnit @varchar(20), -- Enumeration for units of TimePoint DataSetWID @wid NOT NULL, -- DataSet.WID this entity is associated with -- @FKEY CONSTRAINT FK_Experiment2 FOREIGN KEY (GroupWID) REFERENCES Experiment(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Experiment3 FOREIGN KEY (ContactWID) REFERENCES PointOfContact(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Experiment4 FOREIGN KEY (ArchiveWID) REFERENCES Archive(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_Experiment5 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_Experiment PRIMARY KEY (WID) @using-index ) @tabletype; -- Specifies a relationship between one data entity and an experiment in which it was -- recorded as an observation or used in some other fashion. -- CREATE TABLE ExperimentData ( WID @wid NOT NULL, -- Warehouse identifier for this entity ExperimentWID @wid NOT NULL, -- Experiment entity Data @string32 NOT NULL, -- Data associated with experiment Role @varchar(50) NOT NULL, -- Describes how data is used in experiment -- ex: 'green intensity' Kind CHAR(1) NOT NULL, -- Characterizes relationship of data to experiment: -- 'O' = observation, 'C' = computed from observation(s), -- 'P' = parameter, 'M' = metadata describing other data DateProduced @datetime, -- When the observation was taken, the computation was performed, etc. DataSetWID @wid NOT NULL, -- The data set this entity came from or is associated with -- @FKEY CONSTRAINT FK_ExpData1 FOREIGN KEY (ExperimentWID) REFERENCES Experiment(WID) ON DELETE CASCADE, @FKEY CONSTRAINT FK_ExpData2 FOREIGN KEY (DataSetWID) REFERENCES DataSet(WID) ON DELETE CASCADE, CONSTRAINT PK_ExpData PRIMARY KEY (WID) @using-index ) @tabletype; -- Describes the source and strength of evidence for a particular fact in the warehouse. -- E.g. use to record that there are two sources of support for the function of a given -- protein, one computational and one experimental. This table has a WID for use in the -- CitationWIDOtherWID table. -- CREATE TABLE Support ( WID @wid NOT NULL, -- WID of this support OtherWID @wid NOT NULL, -- WID of supported entity, e.g. Protein Function Type @varchar(10), -- reference to an enumeration type Confidence @real32, -- 0 < confidence <= 1. DataSetWID @wid NOT NULL, -- The data set this entity came from or is associated with -- CONSTRAINT CK_Support CHECK (Confidence > 0 AND Confidence <= 1), CONSTRAINT PK_Support PRIMARY KEY (WID) @using-index -- SD CONSTRAINT FK_Support1 FOREIGN KEY (OtherWID) REFERENCES Chemical(WID) -- SD CONSTRAINT FK_Support2 FOREIGN KEY (OtherWID) REFERENCES Element(WID) -- SD CONSTRAINT FK_Support3 FOREIGN KEY (OtherWID) REFERENCES Reaction(WID) -- SD CONSTRAINT FK_Support4 FOREIGN KEY (OtherWID) REFERENCES Protein(WID) -- SD CONSTRAINT FK_Support5 FOREIGN KEY (OtherWID) REFERENCES Feature(WID) -- SD CONSTRAINT FK_Support6 FOREIGN KEY (OtherWID) REFERENCES Function(WID) -- SD CONSTRAINT FK_Support7 FOREIGN KEY (OtherWID) REFERENCES EnzymaticReaction(WID) -- SD CONSTRAINT FK_Support8 FOREIGN KEY (OtherWID) REFERENCES Gene(WID) -- SD CONSTRAINT FK_Support9 FOREIGN KEY (OtherWID) REFERENCES BioSource(WID) -- SD CONSTRAINT FK_Support10 FOREIGN KEY (OtherWID) REFERENCES Pathway(WID) -- SD CONSTRAINT FK_Support11 FOREIGN KEY (OtherWID) REFERENCES NucleicAcid(WID) -- SD CONSTRAINT FK_Support14 FOREIGN KEY (OtherWID) REFERENCES Citation(WID) ) @tabletype; @- ----------------------------------------------------------------------------- @- Tables that implement arbitrary relations among Warehouse objects. @- Naming convention is to use