<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2000-1-4-reviews2001</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Tutorial</dochead>
      <bibl>
         <title>
            <p>Bases and spaces: resources on the web for accessing the draft human genome</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Semple</snm>
               <fnm>Colin</fnm>
               <insr iid="I1"/>
               <email>Colin.Semple@ed.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Medical Genetics Section, Department of Medical Sciences, The University of Edinburgh, Molecular Medicine Centre, Western General Hospital, Edinburgh, EH4 2XU, UK</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2000</pubdate>
         <volume>1</volume>
         <issue>4</issue>
         <fpage>reviews2001.1</fpage>
         <lpage>reviews2001.5</lpage>
         <url>http://genomebiology.com/2000/1/4/reviews/2001</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2000-1-4-reviews2001</pubid>
               <pubid idtype="pmpid">11178254</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <pub>
            <date>
               <day>16</day>
               <month>10</month>
               <year>2000</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2000</year>
         <collab>GenomeBiology.com</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Summary</p>
            </st>
            <p>Much is expected of the draft human genome sequence, and yet there is no central resource to host the plethora of sequence and mapping information available. Consequently, finding the most useful and reliable human genome data and resources currently available on the web can be challenging, but is not impossible.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Nice press release, shame about the data</p>
         </st>
         <p>The entire sequence of the human genome is not expected to be finished for some time, and gaps are expected to persist into 2003 [<abbr bid="B1">1</abbr>]. In the meantime, the genome exists in 'draft' form: multiple segments of sequence in which we have high confidence, placed relative to one another by mapping information of lower confidence. Many biologists study particular regions of the genome, such as those involved in positional cloning of disease genes, and this type of work is greatly accelerated by having most of the sequence of the region of interest. The draft human genome now includes this information for most of the genome. Unfortunately, no single resource unites the available human genomic sequences with their locations and their gene content, but by combining the varied resources currently available it is possible to devise strategies that fully exploit the draft genome data. So what resources and information are available so far and where can we find them? Note that the databases and resources mentioned in this article and the corresponding URLs are listed in Table <tblr tid="T1">1</tblr>.</p>
      </sec>
      <sec>
         <st>
            <p>Raw sequence data</p>
         </st>
         <p>As recently as 1996, the entire GenBank (Table <tblr tid="T1">1</tblr>) database contained around 0.65 Gb of DNA sequence, whereas the draft human genome sequence alone runs to more than 3.08 Gb. Most of the draft sequence is present in GenBank (Table <tblr tid="T1">1</tblr>) as unfinished, fragmentary BAC (bacterial artificial chromosome) sequences. Each of these consists of a number of non-overlapping, arbitrarily ordered fragments, or 'contigs', which have been artificially concatenated to produce a single sequence entry for each BAC. Typically, each contig within a BAC is separated from the next by a long run of bases labeled 'N'. All unfinished BAC entries are subject to irregular updates until they are finished, and this might alter the number and size of the contigs they contain. The most straightforward web interface for retrieving BAC sequence (and various other types of data) is Entrez (Table <tblr tid="T1">1</tblr>) at the National Center for Biotechnology Information (NCBI), which also includes substantial online documentation.</p>
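         <p>Because the contigs of an unfinished BAC entry are joined by runs of 'N', they can be recovered from a downloaded sequence by splitting at those runs. The following is a minimal Python sketch under stated assumptions: the function name 'split_contigs' and the 'min_gap' threshold of 10 are illustrative choices, not part of any GenBank or Entrez tool, and the example sequence is invented.</p>
         <p>
```python
import re


def split_contigs(bac_sequence, min_gap=10):
    """Split a concatenated BAC entry into its contigs at runs of N.

    min_gap is an assumed threshold: a run of at least this many N
    characters is treated as a separator between contigs.
    """
    gap = re.compile("N{%d,}" % min_gap, re.IGNORECASE)
    # Drop empty pieces that arise if the entry starts or ends with a gap.
    return [piece for piece in gap.split(bac_sequence) if piece]


# Invented toy entry: three short contigs joined by two runs of 100 N.
example = "ACGTACGT" + "N" * 100 + "GGCCTTAA" + "N" * 100 + "TTTT"
contigs = split_contigs(example)
print(len(contigs), [len(c) for c in contigs])
```
         </p>
         <p>As the order of contigs within an unfinished entry is arbitrary, the list returned here should not be read as a statement about their true order in the clone.</p>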
         <p>Most BAC sequence entries contain information about the BAC in the 'DEFINITION' field near the top of the Entrez display. For example, the DEFINITION field in the Entrez entry AP001002 (Table <tblr tid="T1">1</tblr>) contains the BAC clone name (678K21) and the cytogenetic band to which it has been localized (11q14). Many entries give much less annotation; for example, at present, Entrez entry AC007104 (Table <tblr tid="T1">1</tblr>) provides no clone name, nor even the clone library, and gives the location as simply 'chromosome 4'. I will discuss ways to find a more specific location for this clone below, but we can retrieve the clone name using a little-known feature of Entrez. Under the 'Display' pull-down menu simply select 'ASN.1' (which is a sequence format used internally at NCBI) and redisplay the entry. The sequence data are now unreadable, but near the top of the file is the clone name '301J10'. The same information is retrievable from the 'XML' and 'Graphics' display formats, but not under the default GenBank format.</p>
         <p>A related site, the Human BAC Ends (Table <tblr tid="T1">1</tblr>) site at The Institute for Genomic Research (TIGR), provides access to more than 743,000 end sequences from 470,000 BAC clones. (Typically, end sequences consist of several hundred base pairs from the clone ends.) It is possible to search the sequences with either a clone name or a sequence of interest. As the unfinished BAC sequences in GenBank do not always contain the sequences from the BAC ends, the BAC end sequences may provide extra sequence data for a clone of interest. In addition, the end sequences can help to identify the fragments of unfinished BAC sequences that represent the ends of the clone. One caveat is that, as usual, the annotation of these sequences should be treated with a certain degree of caution, because clone ends have been known to be attributed to the wrong clone [<abbr bid="B2">2</abbr>]. Conveniently, the BAC end sequences at TIGR are provided with any repetitive sub-sequences masked (they are replaced with runs of the letter X). Repetitive sequences are a recurring problem in dealing with genomic sequence, particularly interspersed repeats  (regions of very similar sequence descended from various classes of transposable elements) [<abbr bid="B3">3</abbr>]. Interspersed repeats often span hundreds or thousands of bases and so can appear as spurious overlaps between genomic sequence fragments. The excellent program RepeatMasker (Table <tblr tid="T1">1</tblr>) does a good job of masking both interspersed and simple repeats. Simple repeats are stretches of sequence made up of units consisting of one or more bases, which may be repeated hundreds of times. They can be used as genetic markers, for example, in disease association studies, so finding them and annotating them properly is an important task. 
The Sputnik (Table <tblr tid="T1">1</tblr>) program provides a fast and elegant method for annotating simple repeats, giving each repeat's location, classification (on the basis of repeat unit length - dinucleotides, trinucleotides and so on) and sequence.</p>
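         <p>To make the idea of a simple repeat concrete, the sketch below finds perfect tandem repeats with a regular expression and reports each repeat's location, class and unit. It is an illustration only and is not the algorithm used by Sputnik or RepeatMasker; the function name, the 'min_copies' threshold and the example sequence are assumptions, and imperfect repeats are ignored.</p>
         <p>
```python
import re

UNIT_NAMES = {1: "mononucleotide", 2: "dinucleotide", 3: "trinucleotide",
              4: "tetranucleotide", 5: "pentanucleotide"}


def find_simple_repeats(sequence, min_copies=4):
    """Report perfect tandem repeats of units one to five bases long."""
    # The shortest possible unit is captured, then required to recur
    # at least (min_copies - 1) more times immediately afterwards.
    pattern = re.compile(r"([ACGT]{1,5}?)\1{%d,}" % (min_copies - 1))
    repeats = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        repeats.append({
            "start": match.start() + 1,  # 1-based, inclusive
            "end": match.end(),
            "unit": unit,
            "class": UNIT_NAMES[len(unit)],
            "copies": (match.end() - match.start()) // len(unit),
        })
    return repeats


# Invented example containing five copies of the dinucleotide AC.
for repeat in find_simple_repeats("TTGACACACACACAGGT"):
    print(repeat)
```
         </p>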
         <p>Ideally, one would retrieve the genomic sequence of a user-defined region of interest, rather than multiple segments restricted to the size of BAC clones. A heroic, preliminary assembly of the draft genome sequence is available on the Working Draft Sequence (Table <tblr tid="T1">1</tblr>) site from David Haussler's group at the University of California, Santa Cruz (UCSC). Although this assembly contains over 200,000 gaps as well as some misassemblies and incorrectly ordered sequences, it will become an important resource as it is updated with more sequence data. The information is incorporated into the Entrez <it>Homo sapiens</it> genome view (Table <tblr tid="T1">1</tblr>) at NCBI, which is a graphical viewer designed to integrate sequence data with mapping information from various sources. Again, this NCBI interface will be a potent tool when more sequence data are available; it is already the best integration of data for finished chromosomes such as 21 and 22.</p>
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Referenced URLs</p>
            </caption>
            <tblbdy cols="2">
               <r>
                  <c ca="left">
                     <p>Website</p>
                  </c>
                  <c ca="left">
                     <p>URL</p>
                  </c>
               </r>
               <r>
                  <c cspan="2">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>BLAST</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov/BLAST/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>CEPH-Genethon</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.cephb.fr/ceph-genethon-map.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Electronic-PCR (e-PCR)</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Ensembl</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ensembl.org/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Entrez</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Entrez entry AP001002 </p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&amp;db=Nucleotide&amp;list_uids=8117673&amp;dopt=GenBank</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Entrez entry AC007104 </p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&amp;db=Nucleotide&amp;list_uids=5523795&amp;dopt=GenBank</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Entrez <it>Homo sapiens</it> genome view</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/hum_srch?chr=hum_chr.inf&amp;query</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>EuroGeneIndexes</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://corba.ebi.ac.uk/EST/egi.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>FPC</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.sanger.ac.uk/Software/fpc</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GenBank</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>GeneMap'99</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov/genemap99/page.cgi?F=Home.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Genome Database (GDB)</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.gdb.org/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>The Genome Channel</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://compbio.ornl.gov/tools/channel/index.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Human Accession Map</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://genome.wustl.edu:8021/pub/gsc1/fpc_files/freeze_2000_06_15/MAP/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Human BAC Ends</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.tigr.org/tdb/humgen/bac_end_search/bac_end_intro.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Human Gene Index</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.tigr.org/tdb/hgi/index.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Human Genome BAC map</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://genome.wustl.edu/gsc/human/Mapping/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>NIX</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.hgmp.mrc.ac.uk/Registered/Webapp/nix/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>RepeatMasker</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.genome.washington.edu/UWGC/analysistools/repeatmask.htm</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Sputnik</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.abajian.com/sputnik/</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>STACK</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.sanbi.ac.za/Dbases.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>TIGR Gene Indices</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.tigr.org/tdb/tgi.shtml</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>TNG4 radiation hybrid map</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www-shgc.stanford.edu/Mapping/Marker/RHTNG4index.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>UniGene</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://www.ncbi.nlm.nih.gov/UniGene/index.html</url>
                     </p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>Working Draft Sequence</p>
                  </c>
                  <c ca="left">
                     <p>
                        <url>http://genome.ucsc.edu</url>
                     </p>
                  </c>
               </r>
            </tblbdy>
         </tbl>
      </sec>
      <sec>
         <st>
            <p>Expressed sequence data</p>
         </st>
         <p>Before the flood of genomic sequence from the Human Genome Project, full sequences were available for only a small proportion of human genes. Most human genes were represented only by expressed sequence tags (ESTs; fragments of mRNA sequences). Various efforts have been made to cluster overlapping EST sequences to give a longer representative sequence for each gene [<abbr bid="B4">4</abbr>]. The most comprehensive of these efforts is the human UniGene (Table <tblr tid="T1">1</tblr>) database at the NCBI, in which ESTs and mRNAs from GenBank (Table <tblr tid="T1">1</tblr>) that share overlapping subsequences have been grouped together into clusters. UniGene can be searched either with UniGene (Table <tblr tid="T1">1</tblr>) cluster accession numbers or with GenBank (Table <tblr tid="T1">1</tblr>) sequence accession numbers for ESTs or mRNAs. Clusters are linked to related mapping, sequence and expression data at the NCBI, and each cluster should represent a separate gene. As UniGene (Table <tblr tid="T1">1</tblr>) is automatically and regularly generated, it often contains errors. One serious problem is chimeric clusters, produced as a consequence of sequencing from chimeric clones (artifactual cDNAs that contain sequences from two different genes). TIGR also maintains a clustered EST database, called the Human Gene Index (HGI, Table <tblr tid="T1">1</tblr>), which has more stringent clustering criteria than UniGene (Table <tblr tid="T1">1</tblr>). Another human expressed sequence database, called STACK (Table <tblr tid="T1">1</tblr>), is held at the South African National Bioinformatics Institute (SANBI). In STACK (Table <tblr tid="T1">1</tblr>), expressed sequences are separated with respect to tissue of origin before clustering, and an attempt is made to represent differently spliced transcripts of the same gene. 
Unlike UniGene (Table <tblr tid="T1">1</tblr>), the STACK (Table <tblr tid="T1">1</tblr>), HGI (Table <tblr tid="T1">1</tblr>) and EuroGeneIndexes (Table <tblr tid="T1">1</tblr>) sites produce consensus sequences for clusters. Transcribed sequence databases are also available for species other than human; UniGene (Table <tblr tid="T1">1</tblr>) holds data for mouse, rat and zebrafish and there are TIGR Gene Indices (Table <tblr tid="T1">1</tblr>) for various other species. The EuroGeneIndexes (Table <tblr tid="T1">1</tblr>) at the European Bioinformatics Institute (EBI) also contain expressed sequence clusters for a number of non-human species. It is worth remembering that all expressed sequence databases will contain repetitive sequences because much of the sequence is from untranslated regions of genes.</p>
      </sec>
      <sec>
         <st>
            <p>Mapping data</p>
         </st>
         <p>The fragmentary human genome sequence is of little use without some idea of how the pieces fit together, so a map is needed that relates distinct landmarks, or sequence tagged sites (STSs), around the genome. Different types of mapping data provide maps of different resolutions. Genetic maps, based on the frequency of recombination events between STSs, are of relatively poor resolution - on the order of hundreds, or more often thousands, of kilobases. Physical mapping techniques can resolve STSs only tens of kilobases apart. In the early stages of the Human Genome Project, an important task was to construct a high-resolution genetic map of the genome, and the Genome Database (GDB, Table <tblr tid="T1">1</tblr>) was set up to curate such data. Genetic mapping data allowed genomic regions to be broadly defined, and efforts proceeded to physical mapping for finer distinctions. Various physical mapping projects have confirmed the physical order of genetic maps and extended genome maps to include further STSs and transcribed sequences. A physical map of the genome based on overlapping yeast artificial chromosome (YAC) contigs was among the first to be published and the data are available from CEPH-Genethon (Table <tblr tid="T1">1</tblr>). One of the most important physical mapping techniques to emerge has been radiation hybrid (RH) mapping [<abbr bid="B5">5</abbr>]. RH maps are orderings of STSs based on assay scores of the STSs against a whole genome radiation hybrid 'panel'. Such panels consist of hybrid cell lines that contain different fragments of human genomic DNA. Each STS is assayed against each cell line to discover whether it is present in the genomic fragments particular to that cell line. The pattern of presence or absence in the cell lines making up a panel constitutes the retention pattern of the STS, and, by comparing STS retention patterns, the distance between STSs can be estimated. 
In this way, the TNG4 radiation hybrid map (Table <tblr tid="T1">1</tblr>) was generated at the Stanford Human Genome Center (SHGC) and provides an average resolution of 60 kb across the genome. Such impressive estimates of resolution must be tempered, however, by the ambiguity that often accompanies RH-derived marker ordering. Comparisons of STS orders in sequenced regions of the genome with orders derived from RH maps suggest that RH map orders may be wrong up to 50% of the time [<abbr bid="B6">6</abbr>]. A consortium of RH mapping centres has produced a transcript map of the genome based on RH mapping data, named GeneMap'99 (Table <tblr tid="T1">1</tblr>), which is accessible at the NCBI. The STS content of a sequence of interest can be determined online using the electronic-PCR (e-PCR, Table <tblr tid="T1">1</tblr>) program at NCBI. This is a rapid sequence-search algorithm that searches your sequence for occurrences of the STS sequences in GenBank (Table <tblr tid="T1">1</tblr>).</p>
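         <p>The comparison of retention patterns can be illustrated with a deliberately crude sketch. Each STS is represented here as a string with one character per cell line ('1' for present, '0' for absent), and the fraction of cell lines in which two STSs disagree is used as a rough proxy for their separation. This encoding, the function name and the example patterns are assumptions made for illustration; real RH map construction relies on statistical models rather than this simple count.</p>
         <p>
```python
def discordance(pattern_a, pattern_b):
    """Fraction of cell lines in which two STSs differ in retention.

    Each pattern is a string of '1' (retained) and '0' (not retained),
    one character per hybrid cell line, in the same panel order.
    """
    if len(pattern_a) != len(pattern_b):
        raise ValueError("patterns must cover the same panel")
    if not pattern_a:
        raise ValueError("patterns must not be empty")
    differing = sum(1 for a, b in zip(pattern_a, pattern_b) if a != b)
    return differing / len(pattern_a)


# Invented retention patterns for three STSs on a ten-line panel.
sts_1 = "1100101101"
sts_2 = "1100111101"  # differs from sts_1 in one cell line
sts_3 = "0011010010"  # differs from sts_1 in every cell line

print(discordance(sts_1, sts_2))
print(discordance(sts_1, sts_3))
```
         </p>
         <p>STSs with very similar patterns, such as the first two here, are likely to lie close together, because radiation-induced breaks have rarely separated them.</p>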
         <p>An important new source of mapping data has become available with the release of the draft genome: the fingerprint analysis of BAC clones for the genome project at Washington University Genome Sequencing Center (WUGSC). The human genome BAC map (Table <tblr tid="T1">1</tblr>) provides the highest resolution human mapping data yet made available and is likely to do so until publication of the full human genome. The overlaps between clones are calculated using the program FPC (Table <tblr tid="T1">1</tblr>) on the basis of clone restriction fragment patterns or fingerprints. The resulting contigs are estimated to cover 97-98% of the genome. The fingerprint analysis has also been extended to show the sequence accession numbers for those clones that have been sequenced, forming a Human Accession Map (Table <tblr tid="T1">1</tblr>).</p>
      </sec>
      <sec>
         <st>
            <p>Genome sequence annotation</p>
         </st>
         <p>Once a region of the genome has been sequenced, the immediate concern is to identify the genes, if any, that are present. Broadly speaking, the computational annotation of genomic sequence proceeds by two methods: <it>ab initio</it> gene prediction, and detection of similarity. Strictly defined, <it>ab initio</it> prediction of genes relies on the presence of compositional biases in genomic sequence that are characteristic of exons. Similarity to known transcribed or protein sequences can be used as further evidence of the accuracy of an <it>ab initio</it> prediction. Many gene prediction programs combine these types of evidence and show considerable success in detecting genes [<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>]. Computational predictions must be treated with caution, however, until they have been confirmed at the bench.</p>
         <p>The Ensembl (Table <tblr tid="T1">1</tblr>) database aims to provide a basic level of computational annotation for the draft genome. It localises BAC sequences in the genome according to a combination of mapping data and runs the sequences themselves through an 'analysis pipeline'. This pipeline consists of repeat masking the sequence, processing it with a gene prediction program called Genscan and then searching the predicted genes against sequence databases. Predicted genes that match known genes become Ensembl (Table <tblr tid="T1">1</tblr>) genes and are stored in the searchable Ensembl database. The Genome Channel (Table <tblr tid="T1">1</tblr>) is an analogous pipeline system that gives more detailed annotation, including CpG islands (regions of DNA with a relatively high GC content and a high frequency of CpG dinucleotides), poly-adenylation sites and gene predictions from more than one gene prediction program. With rather more effort, it is possible to get very detailed annotation for a genomic sequence of interest through the NIX (Table <tblr tid="T1">1</tblr>) interface at the Human Genome Mapping Project Resource Centre (HGMP). Sequences submitted to NIX (Table <tblr tid="T1">1</tblr>) are processed by a variety of programs that detect repetitive regions, exons, tRNA genes, promoters, CpG islands, poly-adenylation sites and similarity to known proteins or transcribed sequences. The NIX interface is only available to registered HGMP users but it is possible for academic scientists to register without charge.</p>
      </sec>
      <sec>
         <st>
            <p>Putting the data to work</p>
         </st>
         <p>Although I am a computational biologist, most of my work involves collaboration with molecular biologists generating real data at the bench. I find that, after a hard day in their labs, people very rarely ask me to discuss the available draft genome resources. Their problems are invariably specific to a small number of genes or genomic regions. Where in the genome is gene X? What is in genomic region Y? These are the commonest questions, and I suggest generic approaches to answering them below. Even in the best-case scenario, however, where gene X is well characterized and already mapped, there will often be additional information to be extracted. What are the neighbouring genes and what is their relative order and orientation? What non-coding features (regulatory elements, pseudogenes and repetitive regions) lie in the vicinity?</p>
         <sec>
            <st>
               <p>Where in the genome is gene X?</p>
            </st>
         </sec>
         <p>As the draft sequence is estimated to cover more than 90% of the genome, the chances of finding part or all of gene X in unfinished BAC sequence are high. If the available sequence of gene X contains any non-coding DNA, it should first be masked using RepeatMasker (Table <tblr tid="T1">1</tblr>). A BLAST (Table <tblr tid="T1">1</tblr>) search of the sequence of gene X against the section of the database that contains the draft sequence is all that is necessary to find the relevant BACs. Using the NCBI Advanced BLAST (Table <tblr tid="T1">1</tblr>) site it is possible to limit the search to human draft genome sequence by selecting the 'htgs' database and 'Homo sapiens' in 'Advanced options' (Figure <figr fid="F1">1</figr>). Assuming the sequence quality is good for gene X, the BLAST (Table <tblr tid="T1">1</tblr>) output should show at least one segment of BAC sequence that is almost identical (greater than or equal to 98% identical is a reasonable rule of thumb) over a reasonable stretch of gene X. In the absence of any good match to a BAC sequence, the best option is to BLAST (Table <tblr tid="T1">1</tblr>) search gene X against human EST sequences (the 'human ests' database at NCBI) and search UniGene (Table <tblr tid="T1">1</tblr>) with matching EST accession numbers, because many UniGene (Table <tblr tid="T1">1</tblr>) clusters contain mapped ESTs. If gene X is found within a BAC sequence, the BAC should be repeat masked and submitted to e-PCR (Table <tblr tid="T1">1</tblr>) at NCBI, which will often provide one or more STSs. These STSs may be localized to a genetic or RH map using their accession numbers to search either the GDB (Table <tblr tid="T1">1</tblr>) or the Stanford RH maps. If the BAC that contains gene X does not contain any STSs, it can be used, after masking repeats, to search the htgs database again to discover overlapping BACs. 
Again, the intention is to find identical sequences, allowing for sequencing errors, and a reasonable rule-of-thumb measure of 'identical' is a stretch of greater than 1 kb showing greater than or equal to 98% identity in the BLAST output. Overlapping BACs may be annotated as coming from the same chromosome as gene X or the first BAC, and they can be submitted to e-PCR (Table <tblr tid="T1">1</tblr>) and assigned a location. Supporting evidence for a collection of overlapping BACs can be obtained from the Human Accession Map (Table <tblr tid="T1">1</tblr>) and the Working Draft Sequence (Table <tblr tid="T1">1</tblr>) site at UCSC (Figure <figr fid="F2">2</figr>). If these two resources include the BACs but do not show them to overlap, one should be suspicious.</p>
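         <p>The rule of thumb for overlapping BACs (a stretch of more than 1 kb at 98% identity or better) is easy to apply to a list of BLAST matches once they have been copied into a simple table. The sketch below assumes each match has already been reduced to a record with a subject name, an aligned length in bases and a percent identity; the field names, the function name and the example values are hypothetical and do not correspond to any particular BLAST output format.</p>
         <p>
```python
def is_probable_overlap(hit, min_length=1000, min_identity=98.0):
    """Apply the rule of thumb: more than 1 kb at 98% identity or better."""
    return hit["length"] > min_length and hit["identity"] >= min_identity


# Invented matches; the subject names are placeholders, not real accessions.
hits = [
    {"subject": "BAC_A", "length": 4200, "identity": 99.1},
    {"subject": "BAC_B", "length": 650, "identity": 99.8},   # too short
    {"subject": "BAC_C", "length": 3100, "identity": 91.5},  # too divergent
]

overlapping = [hit["subject"] for hit in hits if is_probable_overlap(hit)]
print(overlapping)
```
         </p>
         <p>Short but near-identical matches, and long but divergent ones, are both rejected; these are the typical signatures of shared repeats and of related but distinct genomic regions, respectively.</p>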
         <sec>
            <st>
               <p>What is in genomic region Y?</p>
            </st>
         </sec>
         <p>Determining what is in genomic region Y involves a process similar to mapping gene X. The more sequence we start with from region Y the better. In the worst case, we might only have one STS sequence that is known to be from region Y but is not part of any known transcript. As in the section 'Where in the genome is gene X', after repeat-masking the STS we can use it to search the htgs database using NCBI Advanced BLAST (Table <tblr tid="T1">1</tblr>) for matching genomic sequence. This first stage is the most troublesome; because STSs are only a few hundred bases long, it is desirable to have some corroborating evidence to back up any apparent match to a BAC. This can come in the form of mapping data. Other markers or genes near the STS on genetic or RH maps would be expected to appear in the sequence of the apparently matching BAC or in BACs that overlap with it. Once we have reliably placed the starting STS sequence in a BAC, the task is to build up a contig of overlapping BACs around it, as in the section 'Where in the genome is gene X'. The Human Accession Map (Table <tblr tid="T1">1</tblr>) at WUGSC and Working Draft Sequence (Table <tblr tid="T1">1</tblr>) site at UCSC can be used as guides to choosing overlapping BACs that extend your contig furthest. Many BAC sequences, particularly those in earlier stages of sequencing, contain cloning vector sequences that can generate spurious BLAST (Table <tblr tid="T1">1</tblr>) matches. It is possible to use RepeatMasker (Table <tblr tid="T1">1</tblr>) to also mask vector sequences but this is not an option offered on the RepeatMasker (Table <tblr tid="T1">1</tblr>) web server. If a BAC sequence generates a large number of BLAST (Table <tblr tid="T1">1</tblr>) matches then the sequence should be searched against the entire sequence database ('nr' at NCBI BLAST - Table <tblr tid="T1">1</tblr>) to look for the presence of bacterial sequence. 
It is important to remember that the word contig is actually rather inappropriate here, because most BAC sequences are fragmented and incomplete. Generally BAC sequences are 150-200 kb long when complete, so it is possible to estimate roughly the amount of missing sequence. This process should eventually result in a list of BAC accession numbers that represents most of the sequence in region Y. The accession numbers that result can be used to search Ensembl (Table <tblr tid="T1">1</tblr>) to give the minimum number and identities if genes in region Y. More detailed analysis, including the identification of non-coding sequence features, can be carried out using NIX (Table <tblr tid="T1">1</tblr>). This strategy is probably only practical for investigating modestly sized regions of, say, less than 1 Mb; for larger regions, for example, a chromosomal band, it is easier to approach the process as an automated task - ask your friendly local bioinformaticist. </p>
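         <p>The rough estimate of missing sequence mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the clone labels and sequenced lengths are hypothetical, the function names are invented here, and the 150-200 kb range is the figure quoted in the text. Overlaps between neighbouring BACs are ignored, so the totals describe the clones themselves rather than region Y as a whole.</p>

```python
# Rough estimate of how much sequence is still missing from draft BACs.
# Assumption: a complete BAC sequence is 150-200 kb (the range quoted in the text).
COMPLETE_BAC_KB = (150, 200)


def missing_kb(sequenced_kb, complete_range=COMPLETE_BAC_KB):
    """Return (low, high) estimates of missing sequence, in kb, for one BAC."""
    low, high = complete_range
    # A draft may already exceed a bound, so clamp each estimate at zero.
    return max(0, low - sequenced_kb), max(0, high - sequenced_kb)


def summarise(bacs):
    """Sum the per-BAC estimates; `bacs` maps a clone label to sequenced kb."""
    estimates = [missing_kb(kb) for kb in bacs.values()]
    total_low = sum(low for low, _ in estimates)
    total_high = sum(high for _, high in estimates)
    return total_low, total_high


# Hypothetical clone labels and lengths, for illustration only.
draft_bacs = {"BAC_A": 95, "BAC_B": 160, "BAC_C": 40}
print(summarise(draft_bacs))
```

         <p>For the three hypothetical clones, the estimate is between 165 and 305 kb of missing sequence. A clone that is already longer than the lower bound contributes nothing to the lower estimate.</p>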
         <p>The web resources described here are those I find most useful, and are a 'snapshot' as of September 2000. They will, of course, be subject to change or updates as more information becomes available. Other people will have their own opinions and methods on how to use web resources for accessing the draft human genome - there is no substitute for experience. </p>
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Advanced BLAST</p>
            </caption>
            <text>
               <p>Advanced BLAST. The htgs database can be selected (near the top of the page) and 'Homo sapiens' can be selected in the advanced options (near the bottom).</p>
            </text>
            <graphic file="gb-2000-1-4-reviews2001-1"/>
         </fig>
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Working draft sequence</p>
            </caption>
            <text>
               <p>Working draft sequence</p>
            </text>
            <graphic file="gb-2000-1-4-reviews2001-2"/>
         </fig>
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>Gaps in the human genome project.</p>
            </title>
            <aug>
               <au>
                  <snm>Roach</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Siegel</snm>
                  <fnm>AF</fnm>
               </au>
               <au>
                  <snm>van den Engh</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Trask</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Hood</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>1999</pubdate>
            <volume>401</volume>
            <fpage>843</fpage>
            <lpage>845</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">10553897</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Human BAC ends quality assessment and sequence analyses.</p>
            </title>
            <aug>
               <au>
                  <snm>Zhao</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Malek</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mahairas</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Fu</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Nierman</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
            </aug>
            <source>Genomics</source>
            <pubdate>2000</pubdate>
            <volume>63</volume>
            <fpage>321</fpage>
            <lpage>332</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/geno.1999.6082</pubid>
                  <pubid idtype="pmpid" link="fulltext">10704280</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Interspersed repeats and other mementos of transposable elements in mammalian genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Smit</snm>
                  <fnm>AF</fnm>
               </au>
            </aug>
            <source>Curr Opin Genet Dev</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>657</fpage>
            <lpage>663</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-437X(99)00031-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">10607616</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Comparison of gene indexing databases.</p>
            </title>
            <aug>
               <au>
                  <snm>Bouck</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Yu</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Gibbs</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Worley</snm>
                  <fnm>K</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>159</fpage>
            <lpage>162</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(99)01709-6</pubid>
                  <pubid idtype="pmpid" link="fulltext">10203827</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Radiation hybrid mapping information.</p>
            </title>
            <url>http://compgen.rutgers.edu/rhmap/</url>
         </bibl>
         <bibl id="B6">
            <title>
               <p>A fast and scalable radiation hybrid map construction and integration strategy.</p>
            </title>
            <aug>
               <au>
                  <snm>Agarwala</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Applegate</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Maglott</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>350</fpage>
            <lpage>364</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.10.3.350</pubid>
                  <pubid idtype="pmpid" link="fulltext">10720576</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Finding the genes in genomic DNA.</p>
            </title>
            <aug>
               <au>
                  <snm>Burge</snm>
                  <fnm>CB</fnm>
               </au>
               <au>
                  <snm>Karlin</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Opin Struct Biol</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <fpage>346</fpage>
            <lpage>354</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-440X(98)80069-9</pubid>
                  <pubid idtype="pmpid">9666331</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Gene prediction: the end of the beginning.</p>
            </title>
            <aug>
               <au>
                  <snm>Semple</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Genome Biology</source>
            <pubdate>2000</pubdate>
            <volume>1</volume>
            <fpage>reports4012.1</fpage>
            <lpage>reports4012.3</lpage>
            <url>http://genomebiology.com/2000/1/2/reports/4012</url>
            <xrefbib>
               <pubid idtype="doi">10.1186/gb-2000-1-2-reports4012</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
