<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
	<ui>1471-2105-8-S2-S9</ui>
	<ji>1471-2105</ji>
	<fm>
		<dochead>Research</dochead>
		<bibl>
			<title>
				<p>Constrained hidden Markov models for population-based haplotyping</p>
			</title>
			<aug>
				<au id="A1" ca="yes">
					<snm>Landwehr</snm>
					<fnm>Niels</fnm>
					<insr iid="I1"/>
					<email>landwehr@informatik.uni-freiburg.de</email>
				</au>
				<au id="A2">
					<snm>Mielik&#228;inen</snm>
					<fnm>Taneli</fnm>
					<insr iid="I2"/>
					<email>taneli.mielikainen@iki.fi</email>
				</au>
				<au id="A3">
					<snm>Eronen</snm>
					<fnm>Lauri</fnm>
					<insr iid="I2"/>
					<email>lauri.eronen@cs.helsinki.fi</email>
				</au>
				<au id="A4">
					<snm>Toivonen</snm>
					<fnm>Hannu</fnm>
					<insr iid="I1"/>
					<insr iid="I2"/>
					<email>hannu.toivonen@cs.helsinki.fi</email>
				</au>
				<au id="A5">
					<snm>Mannila</snm>
					<fnm>Heikki</fnm>
					<insr iid="I2"/>
					<email>mannila@cs.helsinki.fi</email>
				</au>
			</aug>
			<insg>
				<ins id="I1">
					<p>Machine Learning Lab, Department of Computer Science, Albert-Ludwigs-University Freiburg, Germany</p>
				</ins>
				<ins id="I2">
					<p>HIIT Basic Research Unit, Department of Computer Science, University of Helsinki, Finland</p>
				</ins>
			</insg>
			<source>BMC Bioinformatics</source>
			<supplement>
				<title>
					<p>Probabilistic Modeling and Machine Learning in Structural and Systems Biology</p>
				</title>
				<editor>Samuel Kaski, Juho Rousu, Esko Ukkonen</editor>
				<note>Research</note>
			</supplement>
			<conference>
				<title>
					<p>Probabilistic Modeling and Machine Learning in Structural and Systems Biology</p>
				</title>
				<location>Tuusula, Finland</location>
				<date-range>17&#8211;18 June 2006</date-range>
				<url>http://www.cs.helsinki.fi/group/bioinfo/events/pmsb06/</url>
			</conference>
			<issn>1471-2105</issn>
			<pubdate>2007</pubdate>
			<volume>8</volume>
			<issue>Suppl 2</issue>
			<fpage>S9</fpage>
			<url>http://www.biomedcentral.com/1471-2105/8/S2/S9</url>
			<xrefbib>
				<pubidlist><pubid idtype="pmpid">17493258</pubid><pubid idtype="doi">10.1186/1471-2105-8-S2-S9</pubid>
				</pubidlist></xrefbib>
		</bibl>
		<history>
			<pub>
				<date>
					<day>03</day>
					<month>5</month>
					<year>2007</year>
				</date>
			</pub>
		</history>
		<cpyrt>
			<year>2007</year>
			<collab>Landwehr et al; licensee BioMed Central Ltd.</collab>
			<note>This is an open access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
		</cpyrt>
		<abs>
			<sec>
				<st>
					<p>Abstract</p>
				</st>
				<sec>
					<st>
						<p>Background</p>
					</st>
					<p><it>Haplotype Reconstruction </it>is the problem of resolving the hidden phase information in genotype data obtained from laboratory measurements. Solving this problem is an important intermediate step in gene association studies, which seek to uncover the genetic basis of complex diseases. We propose a novel approach for haplotype reconstruction based on constrained hidden Markov models. Models are constructed by incrementally refining and regularizing the structure of a simple generative model for genotype data under Hardy-Weinberg equilibrium.</p>
				</sec>
				<sec>
					<st>
						<p>Results</p>
					</st>
					<p>The proposed method is evaluated on real-world and simulated population data. Results show that it is competitive with other recently proposed methods in terms of reconstruction accuracy, while offering a particularly good trade-off between computational costs and quality of results for large datasets.</p>
				</sec>
				<sec>
					<st>
						<p>Conclusion</p>
					</st>
					<p>Relatively simple probabilistic approaches for haplotype reconstruction based on structured hidden Markov models are competitive with more complex, well-established techniques in this field.</p>
				</sec>
			</sec>
		</abs>
	</fm>
	<bdy>
		<sec>
			<st>
				<p>Background</p>
			</st>
			<p>Analysis of genetic variation in human populations is critical to understanding the genetic basis of complex diseases. The most commonly studied differences in DNA are single-nucleotide variations at particular positions in the genome, called <it>single nucleotide polymorphisms </it>(SNPs). These positions are also called <it>markers </it>and their possible values <it>alleles</it>. A <it>haplotype </it>is a sequence of SNP alleles along a region of a chromosome, and concisely represents the (variable) genetic information in that region. In the search for DNA sequence variants that are related to common diseases (so-called <it>gene mapping </it>studies), haplotype-based approaches have become a central theme <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>.</p>
			<p>In diploid organisms such as humans there are two <it>homologous </it>(i.e., almost identical) copies of each chromosome. Current practical laboratory measurement techniques produce a <it>genotype </it>&#8211; for <it>m </it>markers, a sequence of <it>m </it>unordered pairs of alleles. The genotype reveals which two alleles are present at each marker, but not their respective chromosomal origin. To obtain haplotypes from genotype data, this hidden phase information needs to be reconstructed. There are two alternative approaches. If family trios are available, most of the phase ambiguity can be resolved analytically. If not, population-based computational methods have to be used to estimate the haplotype pair for each genotype. Because trios are more difficult to recruit and more expensive to genotype, population-based approaches are often the only cost-effective method for large-scale studies. Consequently, the study of such techniques has received much attention recently <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>. In this paper, we propose and evaluate a novel approach for population-based haplotyping based on constrained hidden Markov models.</p>
			<sec>
				<st>
					<p>Population-based haplotype reconstruction</p>
				</st>
				<p>A haplotype <it>h </it>is a sequence of alleles <it>h</it>[<it>i</it>] in markers <it>i </it>= 1,...,<it>m</it>. In most cases, only two alternative alleles occur at an SNP marker, so we can assume that <it>h </it>&#8712; {0, 1}<sup><it>m</it></sup>. A genotype <it>g </it>is a sequence of unordered pairs <it>g</it>[<it>i</it>] = {<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>[<it>i</it>], <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>[<it>i</it>]} of alleles in markers <it>i </it>= 1,...,<it>m</it>. Hence, <it>g </it>&#8712; {{0, 0}, {1, 1}, {0, 1}}<sup><it>m</it></sup>. A marker with alleles {0, 0} or {1, 1} is <it>homozygous </it>whereas a marker with alleles {0, 1} is <it>heterozygous</it>.</p>
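				<p>The definitions above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of any published software; alleles are assumed to be encoded as the integers 0 and 1, and marker indices are 0-based here, whereas the text counts markers from 1.</p>
				<p><![CDATA[
```python
def genotype_of(h1, h2):
    """Return the genotype of two haplotypes as a list of unordered allele pairs.

    Each unordered pair is a frozenset, so {0, 0} collapses to frozenset({0}).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same markers")
    return [frozenset((a, b)) for a, b in zip(h1, h2)]


def heterozygous_markers(g):
    """Return the 0-based indices of heterozygous markers in genotype g."""
    return [i for i, pair in enumerate(g) if len(pair) == 2]


g = genotype_of([0, 1, 0, 0], [1, 1, 1, 0])
print(heterozygous_markers(g))
```
]]></p>
				<p>Swapping the two haplotypes yields the same genotype, which is exactly the phase information that is lost in the measurement.</p>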
				<sec>
					<st>
						<p>Problem 1 (haplotype reconstruction)</p>
					</st>
					<p><it>Given a multiset </it><m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@ </m:annotation></m:semantics></m:math><it>of genotypes, find for each g </it>&#8712; <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math><it> the most likely haplotypes </it><m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@ </m:annotation></m:semantics></m:math><it>and </it><m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math><it> which are a </it>consistent <it>reconstruction of g, i.e., g</it>[<it>i</it>] = {<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>[<it>i</it>], <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>[<it>i</it>]} <it>for each i </it>= 1,...,<it>m</it>.</p>
					<p>If <m:math name="1471-2105-8-S2-S9-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>&#8459;</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFlecsaaa@3763@</m:annotation></m:semantics></m:math> denotes a mapping <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math> &#8594; {0, 1}<sup><it>m </it></sup>&#215; {0, 1}<sup><it>m</it></sup>, associating each genotype <it>g </it>&#8712; <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math> with a pair &#10216;<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>&#10217; of haplotypes, the goal is to find the <m:math name="1471-2105-8-S2-S9-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>&#8459;</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFlecsaaa@3763@</m:annotation></m:semantics></m:math> that maximizes <b>P</b>(<m:math name="1471-2105-8-S2-S9-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>&#8459;</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFlecsaaa@3763@</m:annotation></m:semantics></m:math> | <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math>). It is usually assumed that the sample <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math> is in Hardy-Weinberg equilibrium, i.e., that <b>P</b>(&#10216;<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>&#10217;) = <b>P</b>(<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>)<b>P</b>(<m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>) for all <it>g </it>&#8712; <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math>, and that genotypes are independently sampled from the same distribution. With such assumptions, the likelihood <b>P</b>(<m:math name="1471-2105-8-S2-S9-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>&#8459;</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFlecsaaa@3763@</m:annotation></m:semantics></m:math> | <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math>) of the reconstruction <m:math name="1471-2105-8-S2-S9-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>&#8459;</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFlecsaaa@3763@</m:annotation></m:semantics></m:math> given <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math> is proportional to <m:math name="1471-2105-8-S2-S9-i5" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mstyle displaystyle="true"><m:msub><m:mo>&#8719;</m:mo><m:mrow><m:mi>g</m:mi><m:mo>&#8712;</m:mo><m:mi mathvariant="script">G</m:mi></m:mrow></m:msub><m:mrow><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup><m:mo stretchy="false">)</m:mo><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaadaqeqaqaamrr1ngBPrwtHrhAYaqeguuDJXwAKbstHrhAGq1DVbaceaGae8xgHaLaeiikaGIaemiAaG2aa0baaSqaaiabdEgaNbqaaiabigdaXaaakiabcMcaPiab=LriqjabcIcaOiabdIgaOnaaDaaaleaacqWGNbWzaeaacqaIYaGmaaGccqGGPaqkaSqaaiabdEgaNjabgIGioprtHrhAL1wy0L2yHvtyaeXbnfgDOvwBHrxAJfwnaGqbaiab+zq8hbqab0Gaey4dIunaaaa@53AE@</m:annotation></m:semantics></m:math> if the reconstruction is consistent for all <it>g </it>&#8712; <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math>, and zero otherwise. In population-based haplotyping, a probabilistic model <it>&#955; </it>for the distribution over haplotypes is estimated from the available genotype information <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math>. The distribution estimate <b>P</b>(<it>h </it>| <it>&#955;</it>) is then used to find the most likely reconstruction <m:math name="1471-2105-8-S2-S9-i4" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi>&#8459;</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFlecsaaa@3763@</m:annotation></m:semantics></m:math> for <m:math name="1471-2105-8-S2-S9-i3" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mi mathvariant="script">G</m:mi><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFge=raaa@382D@</m:annotation></m:semantics></m:math> under Hardy-Weinberg equilibrium.</p>
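					<p>As a hedged illustration of Problem 1, the following Python sketch enumerates all consistent reconstructions of a single genotype and selects the pair that maximizes the product of haplotype probabilities under Hardy-Weinberg equilibrium. The names <it>consistent_pairs</it>, <it>most_likely_pair</it> and <it>hap_prob</it> are hypothetical, and the haplotype distribution is assumed to be given. The enumeration grows exponentially with the number of heterozygous markers, so this is a brute-force reference for small examples only, not the method proposed in this paper.</p>
					<p><![CDATA[
```python
from itertools import product


def consistent_pairs(g):
    """Yield every ordered haplotype pair (h1, h2) consistent with genotype g.

    g is a list of 2-tuples of alleles; the order inside each tuple is irrelevant.
    """
    options = []
    for a, b in g:
        if a == b:
            options.append([(a, b)])          # homozygous: one assignment
        else:
            options.append([(a, b), (b, a)])  # heterozygous: two phases
    for choice in product(*options):
        h1 = tuple(c[0] for c in choice)
        h2 = tuple(c[1] for c in choice)
        yield h1, h2


def most_likely_pair(g, hap_prob):
    """Return the consistent pair maximizing hap_prob(h1) * hap_prob(h2)."""
    return max(consistent_pairs(g),
               key=lambda pair: hap_prob(pair[0]) * hap_prob(pair[1]))
```
]]></p>
					<p>For a genotype with <it>k </it>heterozygous markers the generator yields 2<sup><it>k</it></sup> ordered pairs, each unordered reconstruction appearing twice when <it>k </it>&gt; 0.</p>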
					<p>The genetic variation in SNPs is mostly due to two causes: <it>mutation </it>and <it>recombination</it>. Mutations are relatively rare: they occur with a frequency of about 10<sup>-8</sup>. Although SNPs are themselves the result of ancient mutations, new mutations are usually ignored in statistical haplotype models because of their rarity.</p>
					<p>Recombination introduces variability by breaking up the chromosomes of the two parents and reconnecting the resulting segments to form a new and different chromosome for the offspring. Because the probability of a recombination event between two markers is lower if they are close to each other, there is a statistical correlation (so-called <it>linkage disequilibrium</it>) between markers, which decreases with increasing marker distance. Statistical approaches to haplotype modeling are based on exploiting such patterns of correlation.</p>
				</sec>
			</sec>
		</sec>
		<sec>
			<st>
				<p>Methods</p>
			</st>
			<p>This section presents the proposed method for haplotype reconstruction. We discuss the statistical model employed and present an incremental algorithm for efficiently learning the model structure from genotype data. Finally, datasets and systems used in the experimental evaluation are described.</p>
			<sec>
				<st>
					<p>(Hidden) Markov models for haplotyping</p>
				</st>
				<p>We model the probability distribution on haplotypes by a left-right Markov model <it>&#955; </it>with 2&#183;<it>m </it>states, with a state space as shown in Figure <figr fid="F1">1</figr>. A haplotype (of length 4 in the example) is sampled by traversing a path through the model from left to right. The Markov assumption <m:math name="1471-2105-8-S2-S9-i6" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:mi>P</m:mi><m:mo stretchy="false">(</m:mo><m:mi>h</m:mi><m:mo stretchy="false">)</m:mo><m:mo>=</m:mo><m:mstyle displaystyle="true"><m:msubsup><m:mo>&#8719;</m:mo><m:mrow><m:mi>t</m:mi><m:mo>=</m:mo><m:mn>1</m:mn></m:mrow><m:mi>m</m:mi></m:msubsup><m:mrow><m:msub><m:mi>P</m:mi><m:mi>t</m:mi></m:msub><m:mo stretchy="false">(</m:mo><m:mi>h</m:mi><m:mo stretchy="false">[</m:mo><m:mi>t</m:mi><m:mo stretchy="false">]</m:mo><m:mo>|</m:mo><m:mi>h</m:mi><m:mo stretchy="false">[</m:mo><m:mi>t</m:mi><m:mo>&#8722;</m:mo><m:mn>1</m:mn><m:mo stretchy="false">]</m:mo><m:mo>,</m:mo><m:mi>&#955;</m:mi><m:mo stretchy="false">)</m:mo></m:mrow></m:mstyle></m:mrow><m:annotation encoding="MathType-MTEF">
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiab=LriqjabcIcaOiabdIgaOjabcMcaPiabg2da9maaradabaGae8xgHa1aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGObaAcqGGBbWwcqWG0baDcqGGDbqxcqGG8baFcqWGObaAcqGGBbWwcqWG0baDcqGHsislcqaIXaqmcqGGDbqxcqGGSaaliiGacqGF7oaBcqGGPaqkaSqaaiabdsha0jabg2da9iabigdaXaqaaiabd2gaTbqdcqGHpis1aaaa@571C@</m:annotation></m:semantics></m:math> is motivated by the observation that linkage disequilibrium decreases with increasing marker distance.</p>
				<fig id="F1">
					<title>
						<p>Figure 1</p>
					</title>
					<caption>
						<p>A Markov model over haplotypes</p>
					</caption>
					<text>
						<p><b>A Markov model over haplotypes</b>. The highlighted path encodes the haplotype "0110".</p>
					</text>
					<graphic file="1471-2105-8-S2-S9-1"/>
				</fig>
				<p>Parameters are of the form <b>P</b><sub><it>t</it></sub>(<it>h</it>[<it>t</it>] | <it>h</it>[<it>t </it>- 1], <it>&#955;</it>), the probability of sampling the new allele <it>h</it>[<it>t</it>] at position <it>t </it>after observing the allele <it>h</it>[<it>t </it>- 1] at position <it>t </it>- 1. Note that separate (conditional) allele distributions <b>P</b><sub><it>t </it></sub>are defined for every sequence position <it>t </it>&#8712; {1,...,<it>m</it>}, as linkage disequilibrium patterns will vary for different markers. This also means that the allele encoding at a given marker position, i.e., which allele is represented as '0' and which as '1', does not affect the distributions that can be represented.</p>
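				<p>A minimal Python sketch of this first-order model follows. It assumes, for illustration only, that the distribution at the first marker is stored separately as <it>initial</it> and that the conditional distributions for later markers are stored as a list of 2-by-2 tables named <it>transitions</it>; both names and the data layout are hypothetical.</p>
				<p><![CDATA[
```python
def haplotype_probability(h, initial, transitions):
    """Probability of haplotype h under a position-specific first-order Markov model.

    initial[a]                 -- probability of allele a at the first marker
    transitions[t - 1][p][a]   -- probability of allele a at 0-based position t,
                                  given allele p at position t - 1
    """
    p = initial[h[0]]
    for t in range(1, len(h)):
        p *= transitions[t - 1][h[t - 1]][h[t]]
    return p


initial = [0.6, 0.4]
transitions = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
    [[0.5, 0.5], [0.1, 0.9]],
]
print(haplotype_probability((0, 1, 1, 0), initial, transitions))
```
]]></p>
				<p>Because each marker has its own table, relabeling the alleles at one marker only permutes the entries of the tables that involve that marker, which is why the allele encoding does not restrict the representable distributions.</p>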
				<p>This model is not directly applicable in haplotype reconstruction, because in reality only genotypes are observed whereas the phase information is hidden. The hidden phase information can be modeled by a hidden Markov model <it>&#955;' </it>as shown in Figure <figr fid="F2">2</figr>. A path through this model corresponds to sampling a pair of haplotypes (ordered allele pairs, in angle brackets), while the corresponding genotype (unordered pairs, in curly brackets) is emitted.</p>
				<fig id="F2">
					<title>
						<p>Figure 2</p>
					</title>
					<caption>
						<p>A hidden Markov model over genotypes</p>
					</caption>
					<text>
						<p><b>A hidden Markov model over genotypes</b>. Possible paths for genotype observation '{0, 1}', '{1, 1}', '{0, 1}', '{0, 0}' are highlighted. The corresponding haplotype pairs are {(0100, 1110), (0110, 1100), (1100, 0110), (1110, 0100)}.</p>
					</text>
					<graphic file="1471-2105-8-S2-S9-2"/>
				</fig>
				<p>To reflect the Hardy-Weinberg equilibrium assumption, constraints have to be placed on transition probabilities. A transition in this model corresponds to independently sampling two new alleles <it>h</it><sup>1</sup>[<it>t</it>] and <it>h</it><sup>2</sup>[<it>t</it>] at marker <it>t </it>based on their respective histories <it>h</it><sup>1</sup>[<it>t </it>- 1] and <it>h</it><sup>2</sup>[<it>t </it>- 1]. Therefore, the corresponding probability is actually the product of probabilities for sampling <it>h</it><sup><it>i</it></sup>[<it>t</it>] after <it>h</it><sup><it>i</it></sup>[<it>t </it>- 1]:</p>
				<p><b>P</b><sub><it>t</it></sub>(<it>h</it><sup>1</sup>[<it>t</it>], <it>h</it><sup>2</sup>[<it>t</it>] | <it>h</it><sup>1</sup>[<it>t </it>- 1], <it>h</it><sup>2</sup>[<it>t </it>- 1], <it>&#955;'</it>) = <b>P</b><sub><it>t</it></sub>(<it>h</it><sup>1</sup>[<it>t</it>] | <it>h</it><sup>1</sup>[<it>t </it>- 1], <it>&#955;</it>)<b>P</b><sub><it>t</it></sub>(<it>h</it><sup>2</sup>[<it>t</it>] | <it>h</it><sup>2</sup>[<it>t </it>- 1], <it>&#955;</it>).</p>
				<p>In this way, all parameters of <it>&#955;' </it>can be re-expressed as products of parameters of the model <it>&#955; </it>on haplotypes outlined above. Furthermore, <it>&#955;' </it>can be transformed into an equivalent HMM in which these constraints involving products of parameters are replaced with standard parameter tying constraints, which tie parameters in <it>&#955;' </it>to those in <it>&#955;</it>.</p>
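				<p>The product constraint can be sketched in Python as follows. The function name <it>pair_transition</it> and the table layout <it>P_t[prev][new]</it> are assumptions made for illustration; the sketch only shows that every transition of the genotype-level model is determined by the haplotype-level parameters.</p>
				<p><![CDATA[
```python
def pair_transition(P_t, prev_pair, new_pair):
    """Transition probability between ordered allele pairs under Hardy-Weinberg equilibrium.

    P_t[prev][new] is the haplotype-level conditional allele distribution at one marker.
    The two haplotypes evolve independently, so their probabilities multiply.
    """
    (a1, a2), (b1, b2) = prev_pair, new_pair
    return P_t[a1][b1] * P_t[a2][b2]


P_t = [[0.9, 0.1], [0.2, 0.8]]
# Four pair-level parameters out of a given state, all derived from the two rows of P_t.
row = {new: pair_transition(P_t, (0, 1), new)
       for new in [(0, 0), (0, 1), (1, 0), (1, 1)]}
print(row)
```
]]></p>
				<p>Each row of the resulting pair-level table sums to one whenever the rows of the haplotype-level table do, so the constrained model remains a proper HMM without adding free parameters.</p>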
				<p>An advantage of this approach is that the model <it>&#955;' </it>can be trained directly from genotype data using the Baum-Welch algorithm <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, while implicitly estimating the distribution over haplotypes encoded in <it>&#955;</it>. Furthermore, the most likely reconstruction of a genotype can be directly obtained by the Viterbi algorithm <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. The presented idea of embedding a model on haplotypes into a model on genotypes in which the genotype phase is the hidden state information, and learning this model using EM, is related to the approaches used in the HIT <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and fastPHASE <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> systems. In HIT, haplotypes are modeled as recombinations of a set of founder haplotypes, and an instance of the EM algorithm is derived to directly estimate the founders from genotype observations. In fastPHASE, haplotypes are modeled using local clusters, and cluster membership of a haplotype is determined by a hidden Markov model. Again, an instance of the EM algorithm for estimating the clusters directly from genotype data can be derived.</p>
				<sec>
					<st>
						<p>Higher-order models and sparse distributions</p>
					</st>
					<p>The main limitation of the model presented so far is that it only takes into account dependencies between adjacent markers. Expressivity can be increased by using a Markov model of order <it>k </it>&gt; 1 for the underlying haplotype distribution <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>:</p>
					<p>
						<m:math name="1471-2105-8-S2-S9-i7" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>P</m:mi>
									<m:mo stretchy="false">(</m:mo>
									<m:mi>h</m:mi>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>=</m:mo>
									<m:mstyle displaystyle="true">
										<m:munderover>
											<m:mo>&#8719;</m:mo>
											<m:mrow>
												<m:mi>t</m:mi>
												<m:mo>=</m:mo>
												<m:mn>1</m:mn>
											</m:mrow>
											<m:mi>m</m:mi>
										</m:munderover>
										<m:mrow>
											<m:msub>
												<m:mi>P</m:mi>
												<m:mi>t</m:mi>
											</m:msub>
											<m:mo stretchy="false">(</m:mo>
											<m:mi>h</m:mi>
											<m:mo stretchy="false">[</m:mo>
											<m:mi>t</m:mi>
											<m:mo stretchy="false">]</m:mo>
											<m:mo>|</m:mo>
											<m:mi>h</m:mi>
											<m:mo stretchy="false">[</m:mo>
											<m:mi>t</m:mi>
											<m:mo>&#8722;</m:mo>
											<m:mi>k</m:mi>
											<m:mo>,</m:mo>
											<m:mi>t</m:mi>
											<m:mo>&#8722;</m:mo>
											<m:mn>1</m:mn>
											<m:mo stretchy="false">]</m:mo>
											<m:mo>,</m:mo>
											<m:mi>&#955;</m:mi>
											<m:mo stretchy="false">)</m:mo>
											<m:mo>,</m:mo>
										</m:mrow>
									</m:mstyle>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiab=LriqjabcIcaOiabdIgaOjabcMcaPiabg2da9maarahabaGae8xgHa1aaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGObaAcqGGBbWwcqWG0baDcqGGDbqxcqGG8baFcqWGObaAcqGGBbWwcqWG0baDcqGHsislcqWGRbWAcqGGSaalcqWG0baDcqGHsislcqaIXaqmcqGGDbqxcqGGSaaliiGacqGF7oaBcqGGPaqkcqGGSaalaSqaaiabdsha0jabg2da9iabigdaXaqaaiabd2gaTbqdcqGHpis1aaaa@5CD9@</m:annotation>
							</m:semantics>
						</m:math>
					</p>
					<p>where <it>h</it>[<it>j</it>, <it>i</it>] is a shorthand for <it>h</it>[max{1, <it>j</it>}]...<it>h</it>[<it>i</it>]. Unfortunately, the number of parameters in such a model increases exponentially with the history length <it>k</it>. Fortunately, observations on real-world data (e.g., <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>) show that only a few conserved haplotype fragments from the set of 2<sup><it>k </it></sup>possible binary strings of length <it>k </it>actually occur. This can be exploited by modeling sparse distributions, where fragment probabilities that are estimated to be very low are set to zero. More precisely, let <it>p </it>= <b>P</b><sub><it>t</it></sub>(<it>h</it>[<it>t</it>] | <it>h</it>[<it>t </it>- <it>k</it>, <it>t </it>- 1]) and define for some small <it>&#949; </it>&gt; 0 a regularized distribution</p>
					<p>
						<m:math name="1471-2105-8-S2-S9-i8" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:msub>
										<m:mover accent="true">
											<m:mi>P</m:mi>
											<m:mo>^</m:mo>
										</m:mover>
										<m:mi>t</m:mi>
									</m:msub>
									<m:mo stretchy="false">(</m:mo>
									<m:mi>h</m:mi>
									<m:mo stretchy="false">[</m:mo>
									<m:mi>t</m:mi>
									<m:mo stretchy="false">]</m:mo>
									<m:mo>|</m:mo>
									<m:mi>h</m:mi>
									<m:mo stretchy="false">[</m:mo>
									<m:mi>t</m:mi>
									<m:mo>&#8722;</m:mo>
									<m:mi>k</m:mi>
									<m:mo>,</m:mo>
									<m:mi>t</m:mi>
									<m:mo>&#8722;</m:mo>
									<m:mn>1</m:mn>
									<m:mo stretchy="false">]</m:mo>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>=</m:mo>
									<m:mrow>
										<m:mo>{</m:mo>
										<m:mrow>
											<m:mtable columnalign="left">
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mn>0</m:mn>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mi>f</m:mi>
															<m:mtext>&#160;</m:mtext>
															<m:mi>p</m:mi>
															<m:mo>&#8804;</m:mo>
															<m:mi>&#949;</m:mi>
															<m:mo>;</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mn>1</m:mn>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mi>f</m:mi>
															<m:mtext>&#160;</m:mtext>
															<m:mi>p</m:mi>
															<m:mo>&gt;</m:mo>
															<m:mn>1</m:mn>
															<m:mo>&#8722;</m:mo>
															<m:mi>&#949;</m:mi>
															<m:mo>;</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mi>p</m:mi>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>o</m:mi>
															<m:mi>t</m:mi>
															<m:mi>h</m:mi>
															<m:mi>e</m:mi>
															<m:mi>r</m:mi>
															<m:mi>w</m:mi>
															<m:mi>i</m:mi>
															<m:mi>s</m:mi>
															<m:mi>e</m:mi>
															<m:mo>.</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
											</m:mtable>
										</m:mrow>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiqb=LriqzaajaWaaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGObaAcqGGBbWwcqWG0baDcqGGDbqxcqGG8baFcqWGObaAcqGGBbWwcqWG0baDcqGHsislcqWGRbWAcqGGSaalcqWG0baDcqGHsislcqaIXaqmcqGGDbqxcqGGPaqkcqGH9aqpdaGabaqaauaabaaadiaaaeaacqaIWaamaeaaieaacqGFPbqAcqGFMbGzcqqGGaaicqWGWbaCcqGHKjYOiiGacqqF1oqzcqGG7aWoaeaacqaIXaqmaeaacqGFPbqAcqGFMbGztCvAUfeBSjuyZL2yd9gzLbvyNv2CaeXbuLwBLnhiov2DGi1BTfMBaGqbaiaa8bcacqWGWbaCcqGH+aGpcqaIXaqmcqGHsislcqqF1oqzcqGG7aWoaeaacqWGWbaCaeaacqGFVbWBcqGF0baDcqGFObaAcqGFLbqzcqGFYbGCcqGF3bWDcqGFPbqAcqGFZbWCcqGFLbqzcqGGUaGlaaaacaGL7baaaaa@80AB@</m:annotation>
							</m:semantics>
						</m:math>
					</p>
					<p>If the underlying distribution is sufficiently sparse, <m:math name="1471-2105-8-S2-S9-i9" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>P</m:mi><m:mo>^</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiqb=Lriqzaajaaaaa@376D@</m:annotation></m:semantics></m:math> can be represented using a relatively small number of parameters. The corresponding sparse Markov model structure (in which transitions with probability 0 are removed) will reflect the pattern of conserved haplotype fragments present in the population. How such a sparse model structure can be learned without ever constructing the prohibitively complex distribution <b>P</b> will be discussed in the next section.</p>
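					<p>As a small illustration of the regularized distribution, the following Python sketch applies the three cases to a handful of conditional probabilities. The function name regularize, the threshold value 0.01 and the example probabilities are assumptions made for this sketch and do not come from the paper.</p>

```python
def regularize(p, eps=0.01):
    """Snap near-deterministic probabilities to 0 or 1; keep the rest unchanged."""
    if eps >= p:          # p is at most eps: treat the transition as impossible
        return 0.0
    if p > 1.0 - eps:     # p exceeds 1 - eps: treat the transition as certain
        return 1.0
    return p

# Hypothetical values of P_t(h[t] = 1 | history) for four length-2 histories.
cond = {"00": 0.004, "01": 0.62, "10": 0.997, "11": 0.35}
sparse = {hist: regularize(p) for hist, p in cond.items()}

# Histories whose distribution became deterministic need no free parameter.
free_params = [hist for hist, p in sparse.items() if p not in (0.0, 1.0)]
```

					<p>In this toy case only two of the four histories keep a free parameter, which is the kind of saving the sparse representation relies on.</p>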
				</sec>
				<sec>
					<st>
						<p>SpaMM: a level-wise learning algorithm</p>
					</st>
					<p><b>Algorithm 1 </b>The level-wise SpaMM learning algorithm.</p>
					<p>&#160;&#160;&#160;Initialize <it>k </it>:= 1</p>
					<p>&#160;&#160;&#160;<it>&#955;</it><sub>1 </sub>:= INITIAL-MODEL()</p>
					<p>&#160;&#160;&#160;<it>&#955;</it><sub>1 </sub>:= EM-TRAINING(<it>&#955;</it><sub>1</sub>)</p>
					<p>&#160;&#160;&#160;<b>repeat</b></p>
					<p>&#160;&#160;&#160;&#160;&#160;&#160;<it>k </it>:= <it>k </it>+ 1</p>
					<p>&#160;&#160;&#160;&#160;&#160;&#160;<it>&#955;</it><sub><it>k </it></sub>:= EXTEND-AND-REGULARIZE(<it>&#955;</it><sub><it>k</it>-1</sub>)</p>
					<p>&#160;&#160;&#160;&#160;&#160;&#160;<it>&#955;</it><sub><it>k </it></sub>:= EM-TRAINING(<it>&#955;</it><sub><it>k</it></sub>)</p>
					<p>&#160;&#160;&#160;<b>until </b><it>k </it>= <it>k</it><sub><it>max</it></sub></p>
					<p>To construct the sparse order-<it>k </it>hidden Markov model, we propose a learning algorithm &#8211; called <b>SpaMM </b>for <b>Spa</b>rse <b>M</b>arkov <b>M</b>odeling &#8211; that iteratively refines hidden Markov models of increasing order (Algorithm 1). More specifically, the idea of SpaMM is to identify conserved fragments using a level-wise search, i.e., by extending short fragments (in low-order models) to longer ones (in high-order models), and is inspired by the well-known Apriori data mining algorithm <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. The algorithm starts with a first-order Markov model <it>&#955;</it><sub>1 </sub>on haplotypes where initial transition probabilities are set to <m:math name="1471-2105-8-S2-S9-i10" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mover accent="true"><m:mi>P</m:mi><m:mo>&#729;</m:mo></m:mover><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiqb=Lriqzaacaaaaa@3766@</m:annotation></m:semantics></m:math><sub><it>t</it></sub>(<it>h</it>[<it>t</it>] | <it>h</it>[<it>t </it>- 1], <it>&#955;</it><sub>1</sub>) = 0.5 for all <it>t </it>&#8712; {1,...,<it>m</it>}, <it>h</it>[<it>t</it>], <it>h</it>[<it>t </it>- 1] &#8712; {0, 1}. This model can be embedded into a hidden Markov model <m:math name="1471-2105-8-S2-S9-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>&#955;</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF7oaBgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F8F@</m:annotation></m:semantics></m:math> on genotypes as explained above, and <m:math name="1471-2105-8-S2-S9-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>&#955;</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF7oaBgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F8F@</m:annotation></m:semantics></m:math> can be trained from the available genotype data using the standard EM algorithm. As parameters in <m:math name="1471-2105-8-S2-S9-i11" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msub><m:msup><m:mi>&#955;</m:mi><m:mo>&#8242;</m:mo></m:msup><m:mn>1</m:mn></m:msub></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaaiiGacuWF7oaBgaqbamaaBaaaleaacqaIXaqmaeqaaaaa@2F8F@</m:annotation></m:semantics></m:math> are tied to those in <it>&#955;</it><sub>1</sub>, this yields new estimates for the parameters <b>P</b><sub><it>t</it></sub>(<it>h</it>[<it>t</it>] | <it>h</it>[<it>t </it>- 1], <it>&#955;</it><sub>1</sub>) in <it>&#955;</it><sub>1</sub>. This training procedure is summarized in the function EM-TRAINING(<it>&#955;</it><sub>1</sub>).</p>
					<p>The function EXTEND-AND-REGULARIZE(<it>&#955;</it><sub><it>k</it>-1</sub>) takes as input a model of order <it>k </it>- 1 and returns a model <it>&#955;</it><sub><it>k </it></sub>of order <it>k</it>. In <it>&#955;</it><sub><it>k</it></sub>, initial transition probabilities are set to</p>
					<p>
						<m:math name="1471-2105-8-S2-S9-i12" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:msub>
										<m:mover accent="true">
											<m:mi>P</m:mi>
											<m:mo>&#729;</m:mo>
										</m:mover>
										<m:mi>t</m:mi>
									</m:msub>
									<m:mo stretchy="false">(</m:mo>
									<m:mi>h</m:mi>
									<m:mo stretchy="false">[</m:mo>
									<m:mi>t</m:mi>
									<m:mo stretchy="false">]</m:mo>
									<m:mo>|</m:mo>
									<m:mi>h</m:mi>
									<m:mo stretchy="false">[</m:mo>
									<m:mi>t</m:mi>
									<m:mo>&#8722;</m:mo>
									<m:mi>k</m:mi>
									<m:mo>,</m:mo>
									<m:mi>t</m:mi>
									<m:mo>&#8722;</m:mo>
									<m:mn>1</m:mn>
									<m:mo stretchy="false">]</m:mo>
									<m:mo>,</m:mo>
									<m:msub>
										<m:mi>&#955;</m:mi>
										<m:mi>k</m:mi>
									</m:msub>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>=</m:mo>
									<m:mrow>
										<m:mo>{</m:mo>
										<m:mrow>
											<m:mtable columnalign="left">
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mn>0</m:mn>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mi>f</m:mi>
															<m:mtext>&#8201;</m:mtext>
															<m:msub>
																<m:mi>P</m:mi>
																<m:mi>t</m:mi>
															</m:msub>
															<m:mo stretchy="false">(</m:mo>
															<m:mi>h</m:mi>
															<m:mo stretchy="false">[</m:mo>
															<m:mi>t</m:mi>
															<m:mo stretchy="false">]</m:mo>
															<m:mo>|</m:mo>
															<m:mi>h</m:mi>
															<m:mo stretchy="false">[</m:mo>
															<m:mi>t</m:mi>
															<m:mo>&#8722;</m:mo>
															<m:mi>k</m:mi>
															<m:mo>+</m:mo>
															<m:mn>1</m:mn>
															<m:mo>,</m:mo>
															<m:mi>t</m:mi>
															<m:mo>&#8722;</m:mo>
															<m:mn>1</m:mn>
															<m:mo stretchy="false">]</m:mo>
															<m:mo>,</m:mo>
															<m:msub>
																<m:mi>&#955;</m:mi>
																<m:mrow>
																	<m:mi>k</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
															</m:msub>
															<m:mo stretchy="false">)</m:mo>
															<m:mo>&#8804;</m:mo>
															<m:mi>&#949;</m:mi>
															<m:mo>;</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mn>1</m:mn>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>i</m:mi>
															<m:mi>f</m:mi>
															<m:mtext>&#8201;</m:mtext>
															<m:msub>
																<m:mi>P</m:mi>
																<m:mi>t</m:mi>
															</m:msub>
															<m:mo stretchy="false">(</m:mo>
															<m:mi>h</m:mi>
															<m:mo stretchy="false">[</m:mo>
															<m:mi>t</m:mi>
															<m:mo stretchy="false">]</m:mo>
															<m:mo>|</m:mo>
															<m:mi>h</m:mi>
															<m:mo stretchy="false">[</m:mo>
															<m:mi>t</m:mi>
															<m:mo>&#8722;</m:mo>
															<m:mi>k</m:mi>
															<m:mo>+</m:mo>
															<m:mn>1</m:mn>
															<m:mo>,</m:mo>
															<m:mi>t</m:mi>
															<m:mo>&#8722;</m:mo>
															<m:mn>1</m:mn>
															<m:mo stretchy="false">]</m:mo>
															<m:mo>,</m:mo>
															<m:msub>
																<m:mi>&#955;</m:mi>
																<m:mrow>
																	<m:mi>k</m:mi>
																	<m:mo>&#8722;</m:mo>
																	<m:mn>1</m:mn>
																</m:mrow>
															</m:msub>
															<m:mo stretchy="false">)</m:mo>
															<m:mo>&gt;</m:mo>
															<m:mn>1</m:mn>
															<m:mo>&#8722;</m:mo>
															<m:mi>&#949;</m:mi>
															<m:mo>;</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
												<m:mtr columnalign="left">
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mn>0.5</m:mn>
														</m:mrow>
													</m:mtd>
													<m:mtd columnalign="left">
														<m:mrow>
															<m:mi>o</m:mi>
															<m:mi>t</m:mi>
															<m:mi>h</m:mi>
															<m:mi>e</m:mi>
															<m:mi>r</m:mi>
															<m:mi>w</m:mi>
															<m:mi>i</m:mi>
															<m:mi>s</m:mi>
															<m:mi>e</m:mi>
															<m:mo>,</m:mo>
														</m:mrow>
													</m:mtd>
												</m:mtr>
											</m:mtable>
										</m:mrow>
									</m:mrow>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaatuuDJXwAK1uy0HMmaeHbfv3ySLgzG0uy0HgiuD3BaGabaiqb=LriqzaacaWaaSbaaSqaaiabdsha0bqabaGccqGGOaakcqWGObaAcqGGBbWwcqWG0baDcqGGDbqxcqGG8baFcqWGObaAcqGGBbWwcqWG0baDcqGHsislcqWGRbWAcqGGSaalcqWG0baDcqGHsislcqaIXaqmcqGGDbqxcqGGSaaliiGacqGF7oaBdaWgaaWcbaGaem4AaSMaey4kaSIaeGymaedabeaakiabg2da9maaceaabaqbaeaaamGaaaqaaiabicdaWaqaaGqaaiab9LgaPjab9zgaMjaaykW7cqWFzecudaWgaaWcbaGaemiDaqhabeaakiabcIcaOiabdIgaOjabcUfaBjabdsha0jabc2faDjabcYha8jabdIgaOjabcUfaBjabdsha0jabgkHiTiabdUgaRjabgUcaRiabigdaXiabcYcaSiabdsha0jabgkHiTiabigdaXiabc2faDjabcYcaSiab+T7aSnaaBaaaleaacqWGRbWAaeqaaOGaeiykaKIaeyizImQae4xTduMaei4oaSdabaGaeGymaedabaGae0xAaKMae0NzayMaaGPaVlab=LriqnaaBaaaleaacqWG0baDaeqaaOGaeiikaGIaemiAaGMaei4waSLaemiDaqNaeiyxa0LaeiiFaWNaemiAaGMaei4waSLaemiDaqNaeyOeI0Iaem4AaSMaey4kaSIaeGymaeJaeiilaWIaemiDaqNaeyOeI0IaeGymaeJaeiyxa0LaeiilaWIae43UdW2aaSbaaSqaaiabdUgaRbqabaGccqGGPaqkcqGH+aGpcqaIXaqmcqGHsislcqGF1oqzcqGG7aWoaeaacqaIWaamcqGGUaGlcqaI1aqnaeaacqqFVbWBcqqF0baDcqqFObaAcqqFLbqzcqqFYbGCcqqF3bWDcqqFPbqAcqqFZbWCcqqFLbqzcqGGSaalaaaacaGL7baaaaa@B291@</m:annotation>
							</m:semantics>
						</m:math>
					</p>
					<p>i.e., a transition is removed if its probability conditioned on the shorter history is at most <it>&#949;</it>, fixed to 1 if that probability exceeds 1 - <it>&#949;</it>, and initialized to 0.5 otherwise. This procedure of iteratively training, extending and regularizing Markov models of increasing order is repeated up to a maximum order <it>k</it><sub><it>max</it></sub>.</p>
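					<p>The extension step can be sketched in Python for a single marker as follows. This is a simplified illustration under several assumptions: the function name extend_and_initialize, the dictionary layout, the threshold 0.01 and the example probabilities are all hypothetical, marker positions near the start of the map (where the history is shorter than <it>k</it>) are not handled, and all 2<sup><it>k</it></sup> histories are enumerated, whereas a practical implementation would presumably only extend histories that are still reachable.</p>

```python
from itertools import product

def extend_and_initialize(prev, k, eps=0.01):
    """Initial order-k transition probabilities at one marker.

    prev maps (history, allele) to the trained order-(k-1) probability,
    where history is a tuple of the k-1 preceding alleles, oldest first.
    """
    init = {}
    for history in product((0, 1), repeat=k):
        for allele in (0, 1):
            p_short = prev[(history[1:], allele)]  # drop the oldest allele
            if eps >= p_short:
                init[(history, allele)] = 0.0      # transition is pruned
            elif p_short > 1.0 - eps:
                init[(history, allele)] = 1.0      # transition is certain
            else:
                init[(history, allele)] = 0.5      # left for EM to estimate
    return init

# Hypothetical trained first-order probabilities at some marker t.
lambda_1 = {((0,), 0): 0.995, ((0,), 1): 0.005,
            ((1,), 0): 0.40, ((1,), 1): 0.60}
lambda_2_init = extend_and_initialize(lambda_1, k=2)
```

					<p>Transitions initialized to 0 or 1 stay fixed during EM training, since EM never moves a zero transition probability away from zero; only the entries set to 0.5 are re-estimated.</p>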
					<p>Figure <figr fid="F3">3</figr> shows the models learned in the first 4 iterations of the SpaMM algorithm on a real-world dataset. Note how some of the possible transitions are pruned, conserved fragments are isolated and the number of states in the final model is significantly smaller than for a full model of that order. Furthermore, the set of paths through the structure is a concise representation of all haplotypes that have non-zero probability according to the model.</p>
					<fig id="F3">
						<title>
							<p>Figure 3</p>
						</title>
						<caption>
							<p>Visualization of the SpaMM structure learning algorithm</p>
						</caption>
						<text>
							<p><b>Visualization of the SpaMM structure learning algorithm</b>. Sparse models <it>&#955;</it><sub>1</sub>,...,<it>&#955;</it><sub>4 </sub>of increasing order learned on the Daly dataset are shown. Black/white nodes encode more frequent/less frequent allele in population. Conserved fragments identified in <it>&#955;</it><sub>4 </sub>are highlighted.</p>
						</text>
						<graphic file="1471-2105-8-S2-S9-3"/>
					</fig>
					<p>For a given genotype <it>g</it>, a reconstructed haplotype pair &#10216;<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>&#10217;<sub><it>k </it></sub>can be obtained from every model <it>&#955;</it><sub><it>k</it></sub>. At the same time, the Viterbi algorithm computes <b>P</b>(&#10216;<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>&#10217;<sub><it>k </it></sub>| <it>g</it>, <it>&#955;</it><sub><it>k</it></sub>), an estimate of the confidence of the reconstruction. In SpaMM, the reconstruction &#10216;<m:math name="1471-2105-8-S2-S9-i1" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>1</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaaaa@3079@</m:annotation></m:semantics></m:math>, <m:math name="1471-2105-8-S2-S9-i2" xmlns:m="http://www.w3.org/1998/Math/MathML"><m:semantics><m:mrow><m:msubsup><m:mi>h</m:mi><m:mi>g</m:mi><m:mn>2</m:mn></m:msubsup></m:mrow><m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGOmaidaaaaa@307B@</m:annotation></m:semantics></m:math>&#10217;<sub><it>k* </it></sub>with the highest confidence is returned as the final solution:</p>
					<p>
						<m:math name="1471-2105-8-S2-S9-i13" xmlns:m="http://www.w3.org/1998/Math/MathML">
							<m:semantics>
								<m:mrow>
									<m:mi>k</m:mi>
									<m:mo>&#8727;</m:mo>
									<m:mo>=</m:mo>
									<m:munder>
										<m:mrow>
											<m:mi>arg</m:mi>
											<m:mo>&#8289;</m:mo>
											<m:mi>max</m:mi>
											<m:mo>&#8289;</m:mo>
										</m:mrow>
										<m:mrow>
											<m:mi>k</m:mi>
											<m:mo>&#8712;</m:mo>
											<m:mo>{</m:mo>
											<m:mn>1</m:mn>
											<m:mo>,</m:mo>
											<m:mn>...</m:mn>
											<m:mo>,</m:mo>
											<m:msub>
												<m:mi>k</m:mi>
												<m:mrow>
													<m:mi>m</m:mi>
													<m:mi>a</m:mi>
													<m:mi>x</m:mi>
												</m:mrow>
											</m:msub>
											<m:mo>}</m:mo>
										</m:mrow>
									</m:munder>
									<m:mi>P</m:mi>
									<m:mo stretchy="false">(</m:mo>
									<m:msub>
										<m:mrow>
											<m:mo>&#9001;</m:mo>
											<m:msubsup>
												<m:mi>h</m:mi>
												<m:mi>g</m:mi>
												<m:mn>1</m:mn>
											</m:msubsup>
											<m:mo>,</m:mo>
											<m:msubsup>
												<m:mi>h</m:mi>
												<m:mi>g</m:mi>
												<m:mn>2</m:mn>
											</m:msubsup>
											<m:mo>&#9002;</m:mo>
										</m:mrow>
										<m:mi>k</m:mi>
									</m:msub>
									<m:mo>|</m:mo>
									<m:mi>g</m:mi>
									<m:mo>,</m:mo>
									<m:msub>
										<m:mi>&#955;</m:mi>
										<m:mi>k</m:mi>
									</m:msub>
									<m:mo stretchy="false">)</m:mo>
									<m:mo>.</m:mo>
								</m:mrow>
								<m:annotation encoding="MathType-MTEF">
 MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaqabeGadaaakeaacqWGRbWAcqGHxiIkcqGH9aqpdaWfqaqaaiGbcggaHjabckhaYjabcEgaNjGbc2gaTjabcggaHjabcIha4bWcbaGaem4AaSMaeyicI4Saei4EaSNaeGymaeJaeiilaWIaeiOla4IaeiOla4IaeiOla4IaeiilaWIaem4AaS2aaSbaaWqaaGqaciab=1gaTjab=fgaHjab=Hha4bqabaWccqGG9bqFaeqaamrr1ngBPrwtHrhAYaqeguuDJXwAKbstHrhAGq1DVbaceaGccqGFzecucqGGOaakcqGHPms4cqWGObaAdaqhaaWcbaGaem4zaCgabaGaeGymaedaaOGaeiilaWIaemiAaG2aa0baaSqaaiabdEgaNbqaaiabikdaYaaakiabgQYiXpaaBaaaleaacqWGRbWAaeqaaOGaeiiFaWNaem4zaCMaeiilaWccciGae03UdW2aaSbaaSqaaiabdUgaRbqabaGccqGGPaqkcqGGUaGlaaa@6B66@</m:annotation>
							</m:semantics>
						</m:math>
					</p>
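					<p>The selection of the final reconstruction amounts to a simple arg max over model orders, as the following Python sketch shows. The haplotype pairs and confidence values are made up for illustration, and the dictionary reconstructions is a hypothetical stand-in for the Viterbi output of each model.</p>

```python
# Hypothetical Viterbi output per model order k: (haplotype pair, confidence).
reconstructions = {
    1: (("0100", "1110"), 0.41),
    2: (("0110", "1100"), 0.58),
    3: (("0110", "1100"), 0.55),
}

# Pick the order whose reconstruction has the highest posterior confidence.
k_star = max(reconstructions, key=lambda k: reconstructions[k][1])
best_pair, best_conf = reconstructions[k_star]
```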
					<p>The idea of using frequent fragments to build Markov models for haplotypes has also been used in the HaploRec method <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. In HaploRec, a set of fragments (of any length) that are frequent according to the current model is kept, and updated after each iteration of the EM algorithm.</p>
				</sec>
			</sec>
			<sec>
				<st>
					<p>Experimental methodology and evaluation</p>
				</st>
				<p>The proposed method was implemented in the SpaMM haplotyping system <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. We compared its accuracy and computational performance to several other state-of-the-art haplotype reconstruction systems: PHASE version 2.1.1 <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>, fastPHASE version 1.1 <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>, GERBIL as included in GEVALT version 1.0 <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>, HIT <abbrgrp><abbr bid="B5">5</abbr></abbrgrp> and HaploRec (variable order Markov model) version 2.0 <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>. All methods were run using their default parameters. The fastPHASE system, which also employs EM for learning a probabilistic model, uses a strategy of averaging results over several random restarts of EM from different initial parameter values. This reduces the variance component of the reconstruction error and alleviates the problem of local optima in EM search. As this is a general technique applicable also to our method, we list results for fastPHASE with averaging (fastPHASE) and without averaging (fastPHASE-NA).</p>
				<p>The methods were compared using publicly available real-world datasets, and larger datasets simulated with the Hudson coalescence simulator <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. As real-world data, we used a collection of datasets from the Yoruba population in Ibadan, Nigeria <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, and the well-known dataset of Daly et al. <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>, which contains data from a European-derived population. For these datasets, family trios are available, and thus true haplotypes can be inferred analytically. Non-transmitted parental chromosomes of each trio were combined to form additional artificial haplotype pairs. Markers with minor allele frequency of less than 5% and genotypes with more than 15% missing values were removed. Note that if all trio members are heterozygous at a marker, the phase of the child at that marker cannot be inferred. In this case, the genotype at this marker position is still observed, but the marker is ignored when computing the accuracy of the method.</p>
				<p>For the Yoruba population, information on 3.8 million SNPs spread over the whole genome is available. We sampled 100 sets of 500 markers each from distinct regions on chromosome 1 (<b>Yoruba-500</b>), and derived smaller datasets from these by taking only the first 20 (<b>Yoruba-20</b>) or 100 (<b>Yoruba-100</b>) markers for every individual. There are 60 individuals in the dataset after preprocessing, with an average fraction of missing values of 3.6%. For the <b>Daly </b>dataset, there is information on 103 markers and 174 individuals available after data preprocessing, and the average fraction of missing values is 8%. Although results on a single dataset are not very meaningful, the Daly dataset was included because it has been used frequently in the literature.</p>
				<p>The number of genotyped individuals in these real-world datasets is rather small. For most disease association studies, sample sizes of at least several hundred individuals are needed <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, and we are ultimately interested in haplotyping such larger datasets. Unfortunately, we are not aware of any publicly available real-world datasets of this size, so we have to resort to simulated data. We used the well-known Hudson coalescent simulator <abbrgrp><abbr bid="B14">14</abbr></abbrgrp> to generate 50 artificial datasets, each containing 800 individuals (<b>Hudson</b> datasets). The simulator uses the standard Wright-Fisher neutral model of genetic variation with recombination. A chromosomal region of 150 kb was simulated. The probability of mutation in each base pair was set to 10<sup>-8</sup> per generation, and the probability of cross-over between adjacent base pairs was set to 10<sup>-8</sup>. These values result in a mutation probability for the entire chromosomal region of <it>&#956;</it> = 0.0015 and a cross-over probability of <it>&#961;</it> = 0.0015. The diploid population size, <it>N</it><sub>0</sub>, was set to the standard 10000, yielding the mutation parameter <it>&#952;</it> = 4<it>N</it><sub>0</sub><it>&#956;</it> = 60 and the recombination parameter <it>r</it> = 60. For each dataset, a sample of 1600 chromosomes was generated, and these were paired to form 800 genotypes. On average, one simulation produced approximately 493 segregating sites. For each dataset, 50 markers were chosen from the segregating sites with a minor allele frequency of at least 5%, such that marker spacing was as uniform as possible. The resulting average marker spacing was 3.0 kb. To come as close to the characteristics of real-world data as possible, some alleles were masked (marked as missing) after simulation. More specifically, the missing allele pattern found in the Yoruba datasets was superimposed onto the simulated data, shortening patterns to the size of the target marker map and repeating them as needed for additional individuals.</p>
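				<p>The region-wide parameter values quoted for the simulation follow from simple arithmetic, which the following minimal Python sketch reproduces. The variable names are illustrative and are not part of the simulator's interface. The relation <it>r</it> = 4<it>N</it><sub>0</sub><it>&#961;</it> is an assumption: the text states only the resulting value <it>r</it> = 60, which this common population-scaled form happens to reproduce.</p>

```python
# Values taken from the text.
region_length_bp = 150_000        # simulated region of 150 kb
mutation_rate_per_bp = 1e-8       # per base pair and generation
crossover_rate_per_bp = 1e-8      # between adjacent base pairs
diploid_population_size = 10_000  # N0

# Region-wide probabilities: per-base-pair rate times region length.
mu = mutation_rate_per_bp * region_length_bp
rho = crossover_rate_per_bp * region_length_bp

# Population-scaled parameters.
theta = 4 * diploid_population_size * mu  # stated in the text as 4 * N0 * mu
r = 4 * diploid_population_size * rho     # assumed form, see the note above

print(round(mu, 6), round(rho, 6), round(theta, 3), round(r, 3))
```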
				<p>The accuracy of the reconstructed haplotypes produced by the different methods was measured by normalized switch error. The switch error of a reconstruction is the minimum number of recombinations needed to transform the reconstructed haplotype pair into the true haplotype pair. To normalize, switch errors are summed over all individuals in the dataset and divided by the total number of switch errors that could have been made.</p>
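				<p>As an illustration of this measure, the following Python sketch shows one common way to compute a normalized switch error. It is not the evaluation code used in the study. It assumes that haplotypes are given as equal-length strings of alleles, that there are no missing values, and that the inferred pair is consistent with the true genotype. The function names are hypothetical.</p>

```python
def switch_errors(true_pair, inferred_pair):
    """Return (switches, opportunities) for one individual.

    Only heterozygous sites carry phase information. A switch is counted
    whenever the phase relative to the truth changes between two
    consecutive heterozygous sites.
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    # True where inferred haplotype 1 agrees with true haplotype 1.
    phase = [i1[k] == t1[k] for k in range(len(t1)) if t1[k] != t2[k]]
    switches = sum(1 for a, b in zip(phase, phase[1:]) if a != b)
    opportunities = max(len(phase) - 1, 0)
    return switches, opportunities


def normalized_switch_error(individuals):
    """Sum switches over individuals and divide by the possible switches."""
    total_switches = 0
    total_opportunities = 0
    for true_pair, inferred_pair in individuals:
        s, o = switch_errors(true_pair, inferred_pair)
        total_switches += s
        total_opportunities += o
    if total_opportunities == 0:
        return 0.0
    return total_switches / total_opportunities


example = [
    (("0101", "1010"), ("0110", "1001")),  # one switch among three possible
    (("0011", "0110"), ("0011", "0110")),  # no switch, one possible
]
print(normalized_switch_error(example))  # prints 0.25
```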
			</sec>
		</sec>
		<sec>
			<st>
				<p>Results</p>
			</st>
			<p>Table <tblr tid="T1">1</tblr> shows normalized switch error for all methods on the real-world datasets Yoruba and Daly. For the dataset collections Yoruba-20, Yoruba-100 and Yoruba-500, errors are averaged over the 100 datasets. PHASE and Gerbil did not complete on Yoruba-500 within two weeks (all experiments were run on standard PC hardware with a 3.2 GHz processor and 2 GB of main memory). Overall, the PHASE system achieves the highest reconstruction accuracy, although it is not the most accurate method on the single Daly dataset. After PHASE, fastPHASE with averaging is most accurate, then SpaMM, and then HaploRec. Figure <figr fid="F4">4</figr> shows the average runtime of the methods for marker maps of different lengths. The most accurate method, PHASE, is also clearly the slowest. fastPHASE and SpaMM are substantially faster, and HaploRec and HIT are very fast. Gerbil is fast for small marker maps but slow for larger ones. For fastPHASE, fastPHASE-NA, HaploRec, SpaMM and HIT, computational costs scale linearly with the length of the marker map, while the increase is superlinear for PHASE and Gerbil, so their computational costs quickly become prohibitive for longer maps.</p>
			<tbl id="T1">
				<title>
					<p>Table 1</p>
				</title>
				<caption>
					<p>Reconstruction accuracy on Yoruba and Daly data.</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c ca="left">
							<p>Method</p>
						</c>
						<c ca="center">
							<p>Yoruba-20</p>
						</c>
						<c ca="center">
							<p>Yoruba-100</p>
						</c>
						<c ca="center">
							<p>Yoruba-500</p>
						</c>
						<c ca="center">
							<p>
								Daly
							</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>PHASE</p>
						</c>
						<c ca="center">
							<p>
								<b>0.027</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.025</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<it>n.a.</it>
							</p>
						</c>
						<c ca="center">
							<p>0.038</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fastPHASE</p>
						</c>
						<c ca="center">
							<p>0.033</p>
						</c>
						<c ca="center">
							<p>0.031</p>
						</c>
						<c ca="center">
							<p>
								<b>0.034</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.027</b>
							</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>SpaMM</p>
						</c>
						<c ca="center">
							<p>0.034</p>
						</c>
						<c ca="center">
							<p>0.037</p>
						</c>
						<c ca="center">
							<p>0.040</p>
						</c>
						<c ca="center">
							<p>0.033</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>HaploRec</p>
						</c>
						<c ca="center">
							<p>0.036</p>
						</c>
						<c ca="center">
							<p>0.038</p>
						</c>
						<c ca="center">
							<p>0.046</p>
						</c>
						<c ca="center">
							<p>0.034</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fastPHASE-NA</p>
						</c>
						<c ca="center">
							<p>0.041</p>
						</c>
						<c ca="center">
							<p>0.060</p>
						</c>
						<c ca="center">
							<p>0.069</p>
						</c>
						<c ca="center">
							<p>0.045</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>HIT</p>
						</c>
						<c ca="center">
							<p>0.042</p>
						</c>
						<c ca="center">
							<p>0.050</p>
						</c>
						<c ca="center">
							<p>0.055</p>
						</c>
						<c ca="center">
							<p>0.031</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>GERBIL</p>
						</c>
						<c ca="center">
							<p>0.044</p>
						</c>
						<c ca="center">
							<p>0.051</p>
						</c>
						<c ca="center">
							<p>
								<it>n.a.</it>
							</p>
						</c>
						<c ca="center">
							<p>0.034</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
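					<p>Normalized switch error is shown for the Daly dataset, and average normalized switch error over the 100 datasets in the Yoruba-20, Yoruba-100 and Yoruba-500 dataset collections. The lowest error in each column is shown in bold; n.a. marks methods that did not complete within two weeks.</p>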
				</tblfn>
			</tbl>
			<fig id="F4">
				<title>
					<p>Figure 4</p>
				</title>
				<caption>
					<p>Runtime as a function of the number of markers</p>
				</caption>
				<text>
					<p><b>Runtime as a function of the number of markers</b>. Average runtime per dataset on the Yoruba datasets is shown for marker maps of length 25 to 500 for SpaMM, fastPHASE, fastPHASE-NA, PHASE, Gerbil, HaploRec, and HIT (logarithmic scale). Results are averaged over 10 of the 100 datasets in the Yoruba collection.</p>
				</text>
				<graphic file="1471-2105-8-S2-S9-4"/>
			</fig>
			<p>Performance of the systems on larger datasets with up to 800 individuals was evaluated on the 50 simulated Hudson datasets. As for the real-world data, the most accurate methods were PHASE, fastPHASE, SpaMM and HaploRec. Figure <figr fid="F5">5</figr> shows the normalized switch error of these four methods as a function of the number of individuals (results of Gerbil, fastPHASE-NA, and HIT were significantly worse and are not shown). PHASE was again the most accurate method in this setting, but the relative accuracy of the other three systems depended on the number of individuals in the datasets. For relatively small numbers of individuals (50&#8211;100), fastPHASE outperformed SpaMM and HaploRec, whereas for 200 or more individuals the order was reversed.</p>
			<fig id="F5">
				<title>
					<p>Figure 5</p>
				</title>
				<caption>
					<p>Reconstruction accuracy as a function of the number of samples available</p>
				</caption>
				<text>
					<p><b>Reconstruction accuracy as a function of the number of samples available</b>. Average normalized switch error on the Hudson datasets as a function of the number of individuals for SpaMM, fastPHASE, PHASE and HaploRec is shown. Results are averaged over 50 datasets.</p>
				</text>
				<graphic file="1471-2105-8-S2-S9-5"/>
			</fig>
			<p>A problem closely related to haplotype reconstruction is that of genotype imputation. Here, the task is to infer the most likely genotype values (unordered allele pairs) at marker positions where genotype information is missing, based on the observed genotype information. With the exception of HaploRec, all haplotyping systems included in this study can also impute missing genotypes. To test imputation accuracy, between 10% and 40% of all genotypes were masked randomly, and the values inferred by the systems were then compared to the known true values. Table <tblr tid="T2">2</tblr> shows the error of the inferred genotypes for different fractions of masked data on the Yoruba-100 datasets, and Table <tblr tid="T3">3</tblr> on the simulated Hudson datasets with 400 individuals per dataset. PHASE was too slow to run on this task, as its runtime increases significantly in the presence of many missing markers. Evidence from the literature <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> suggests that for this task, fastPHASE outperforms PHASE and is indeed the best method available. In our experiments on Yoruba-100, fastPHASE is most accurate; SpaMM is slightly less accurate than fastPHASE, but more accurate than any other method (including fastPHASE-NA). On the larger Hudson datasets, SpaMM is significantly more accurate than any other method.</p>
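			<p>The masking protocol can be sketched as follows. This is a minimal Python illustration, not the code used in the experiments. It assumes that genotypes are stored as a list of rows (one per individual) of sorted allele pairs, that masked entries are represented by <it>None</it>, and that the reported error is the fraction of masked genotypes inferred incorrectly. The exact error definition is not given in the text, and the function names are hypothetical.</p>

```python
import random


def mask_genotypes(genotypes, fraction, seed=0):
    """Return a masked copy and the list of masked (row, column) positions."""
    rng = random.Random(seed)
    positions = [(i, j) for i, row in enumerate(genotypes)
                 for j in range(len(row))]
    n_masked = round(fraction * len(positions))
    masked_positions = rng.sample(positions, n_masked)
    masked = [list(row) for row in genotypes]  # leave the input untouched
    for i, j in masked_positions:
        masked[i][j] = None
    return masked, masked_positions


def imputation_error(true_genotypes, imputed_genotypes, masked_positions):
    """Fraction of masked genotypes that were imputed incorrectly."""
    if not masked_positions:
        return 0.0
    wrong = sum(1 for i, j in masked_positions
                if imputed_genotypes[i][j] != true_genotypes[i][j])
    return wrong / len(masked_positions)
```

			<p>In such a setup, the masked matrix would be passed to the imputation system under test, and its output compared with the unmasked matrix at the masked positions only.</p>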
			<tbl id="T2">
				<title>
					<p>Table 2</p>
				</title>
				<caption>
					<p>Average error for reconstructing masked genotypes on Yoruba-100.</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c ca="left">
							<p>Method</p>
						</c>
						<c ca="center">
							<p>10%</p>
						</c>
						<c ca="center">
							<p>20%</p>
						</c>
						<c ca="center">
							<p>30%</p>
						</c>
						<c ca="center">
							<p>40%</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fastPHASE</p>
						</c>
						<c ca="center">
							<p>
								<b>0.045</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.052</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.062</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.075</b>
							</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>SpaMM</p>
						</c>
						<c ca="center">
							<p>0.058</p>
						</c>
						<c ca="center">
							<p>0.066</p>
						</c>
						<c ca="center">
							<p>0.078</p>
						</c>
						<c ca="center">
							<p>0.096</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fastPHASE-NA</p>
						</c>
						<c ca="center">
							<p>0.067</p>
						</c>
						<c ca="center">
							<p>0.075</p>
						</c>
						<c ca="center">
							<p>0.089</p>
						</c>
						<c ca="center">
							<p>0.126</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>HIT</p>
						</c>
						<c ca="center">
							<p>0.070</p>
						</c>
						<c ca="center">
							<p>0.079</p>
						</c>
						<c ca="center">
							<p>0.087</p>
						</c>
						<c ca="center">
							<p>0.098</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>GERBIL</p>
						</c>
						<c ca="center">
							<p>0.073</p>
						</c>
						<c ca="center">
							<p>0.091</p>
						</c>
						<c ca="center">
							<p>0.110</p>
						</c>
						<c ca="center">
							<p>0.136</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>From 10% to 40% of all genotypes were masked randomly. Results are averaged over 100 datasets.</p>
				</tblfn>
			</tbl>
			<tbl id="T3">
				<title>
					<p>Table 3</p>
				</title>
				<caption>
					<p>Average error for reconstructing masked genotypes on Hudson.</p>
				</caption>
				<tblbdy cols="5">
					<r>
						<c ca="left">
							<p>Method</p>
						</c>
						<c ca="center">
							<p>10%</p>
						</c>
						<c ca="center">
							<p>20%</p>
						</c>
						<c ca="center">
							<p>30%</p>
						</c>
						<c ca="center">
							<p>40%</p>
						</c>
					</r>
					<r>
						<c cspan="5">
							<hr/>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fastPHASE</p>
						</c>
						<c ca="center">
							<p>0.035</p>
						</c>
						<c ca="center">
							<p>0.041</p>
						</c>
						<c ca="center">
							<p>0.051</p>
						</c>
						<c ca="center">
							<p>0.063</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>SpaMM</p>
						</c>
						<c ca="center">
							<p>
								<b>0.017</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.023</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.034</b>
							</p>
						</c>
						<c ca="center">
							<p>
								<b>0.052</b>
							</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>fastPHASE-NA</p>
						</c>
						<c ca="center">
							<p>0.056</p>
						</c>
						<c ca="center">
							<p>0.062</p>
						</c>
						<c ca="center">
							<p>0.074</p>
						</c>
						<c ca="center">
							<p>0.087</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>HIT</p>
						</c>
						<c ca="center">
							<p>0.081</p>
						</c>
						<c ca="center">
							<p>0.093</p>
						</c>
						<c ca="center">
							<p>0.108</p>
						</c>
						<c ca="center">
							<p>0.127</p>
						</c>
					</r>
					<r>
						<c ca="left">
							<p>GERBIL</p>
						</c>
						<c ca="center">
							<p>0.102</p>
						</c>
						<c ca="center">
							<p>0.122</p>
						</c>
						<c ca="center">
							<p>0.148</p>
						</c>
						<c ca="center">
							<p>0.169</p>
						</c>
					</r>
				</tblbdy>
				<tblfn>
					<p>From 10% to 40% of all genotypes were masked randomly. Results are averaged over 50 datasets.</p>
				</tblfn>
			</tbl>
			<p>Our experimental results confirm PHASE as the most accurate but also computationally most expensive haplotype reconstruction system <abbrgrp><abbr bid="B6">6</abbr><abbr bid="B11">11</abbr></abbrgrp>. If more computational efficiency is required, fastPHASE yields the most accurate reconstructions on small datasets, and SpaMM is preferable for larger datasets. SpaMM also infers missing genotype values with high accuracy. For small datasets, it is second only to fastPHASE; for large datasets, it is substantially more accurate than any other method in our experiments.</p>
			<p>The presented method is quite basic: it does not use fine-tuned priors for EM, multiple EM restarts or averaging techniques <abbrgrp><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, nor does it cross-validate model parameters <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. Moreover, most statistical models employed in haplotyping are specifically tailored to this problem, and reflect certain assumptions about haplotype structure. For example, the HIT method assumes that there is a limited number of founder haplotypes for a population, and GERBIL assumes block-like haplotype patterns. These systems are only effective if the underlying assumptions are valid. HIT, for instance, was less accurate than PHASE in our study, but has been shown to be competitive with PHASE on population samples from Finland <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, a population isolate for which the assumption of a small number of founders is particularly realistic <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Similarly, the performance of GERBIL will suffer if haplotypes do not exhibit a block-like structure. In contrast, the sparse higher-order Markov chains used in SpaMM are a general sequence modeling technique. Detailed assumptions about the haplotype structure are replaced by the structure-learning component of the algorithm. The resulting model is rather flexible, and subsumes block-like or mosaic-like haplotype structures (cf. Figure <figr fid="F3">3</figr>). In fact, the proposed approach is not limited to haplotype analysis, and an interesting direction for future work is to apply it to other sequence modeling tasks as well.</p>
		</sec>
		<sec>
			<st>
				<p>Conclusion</p>
			</st>
			<p>We proposed a simple haplotype reconstruction method that is based on iterative refinement and regularization of constrained hidden Markov models (SpaMM). The method was compared against several other state-of-the-art haplotyping systems on real-world genotype datasets with 60&#8211;174 individuals and larger simulated datasets with up to 800 individuals. In the experimental study, PHASE was the most accurate, but also computationally most demanding haplotype reconstruction system. fastPHASE and SpaMM are slightly less accurate but much faster, and scale well to long marker maps. The relative performance of these two systems depends on the number of samples available: while fastPHASE is slightly more accurate for small datasets, SpaMM is superior for datasets with several hundred genotype samples. As large datasets are ultimately needed for successful disease association studies, the presented method is a promising alternative to existing approaches.</p>
		</sec>
		<sec>
			<st>
				<p>Authors' contributions</p>
			</st>
			<p>TM, NL and HM developed the haplotyping method. NL implemented the method and carried out the experiments. LE contributed data to the experimental evaluation. HM and HT coordinated the research. All authors contributed to the preparation of the manuscript.</p>
		</sec>
	</bdy>
	<bm>
		<ack>
			<sec>
				<st>
					<p>Acknowledgements</p>
				</st>
				<p>The authors would like to thank Luc De Raedt and Kristian Kersting for helpful discussions and comments. This work was supported by the European Union IST programme, contract no. FP6-508861, <it>Application of Probabilistic Inductive Logic Programming II</it>, and by the Finnish Funding Agency for Technology and Innovation (Tekes). Hannu Toivonen has been supported by the Alexander von Humboldt Foundation.</p>
				<p>This article has been published as part of <it>BMC Bioinformatics </it>Volume 8, Supplement 2, 2007: Probabilistic Modeling and Machine Learning in Structural and Systems Biology. The full contents of the supplement are available online at <url>http://www.biomedcentral.com/1471-2105/8?issue=S2</url>.</p>
			</sec>
		</ack>
		<refgrp>
			<bibl id="B1">
				<title>
					<p>A haplotype map of the human genome</p>
				</title>
				<aug>
					<au>
						<cnm>The International HapMap Consortium</cnm>
					</au>
				</aug>
				<source>Nature</source>
				<pubdate>2005</pubdate>
				<volume>437</volume>
				<fpage>1299</fpage>
				<lpage>1320</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nature04226</pubid>
						<pubid idtype="pmpid" link="fulltext">16255080</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B2">
				<title>
					<p>A comprehensive literature review of haplotyping software and methods for use with unrelated individuals</p>
				</title>
				<aug>
					<au>
						<snm>Salem</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Wessel</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Schork</snm>
						<fnm>N</fnm>
					</au>
				</aug>
				<source>Human Genomics</source>
				<pubdate>2005</pubdate>
				<volume>2</volume>
				<fpage>39</fpage>
				<lpage>66</lpage>
				<xrefbib>
					<pubid idtype="pmpid" link="fulltext">15814067</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B3">
				<title>
					<p>A survey of computational methods for determining haplotypes</p>
				</title>
				<aug>
					<au>
						<snm>Halld&#243;rsson</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Bafna</snm>
						<fnm>V</fnm>
					</au>
					<au>
						<snm>Edwards</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Lippert</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Yooseph</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Istrail</snm>
						<fnm>S</fnm>
					</au>
				</aug>
				<source>Computational Methods for SNPs and Haplotype Inference, Volume 2983 of Lecture Notes in Computer Science</source>
				<pubdate>2004</pubdate>
				<fpage>26</fpage>
				<lpage>47</lpage>
			</bibl>
			<bibl id="B4">
				<title>
					<p>A tutorial on hidden Markov models and selected applications in speech recognition</p>
				</title>
				<aug>
					<au>
						<snm>Rabiner</snm>
						<fnm>L</fnm>
					</au>
				</aug>
				<source>Proceedings of the IEEE</source>
				<pubdate>1989</pubdate>
				<volume>77</volume>
				<issue>2</issue>
				<fpage>257</fpage>
				<lpage>286</lpage>
				<xrefbib>
					<pubid idtype="doi">10.1109/5.18626</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B5">
				<title>
					<p>A hidden Markov technique for haplotype reconstruction</p>
				</title>
				<aug>
					<au>
						<snm>Rastas</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Koivisto</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Mannila</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Ukkonen</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>WABI, Volume 3692 of Lecture Notes in Computer Science</source>
				<publisher>Springer</publisher>
				<editor>Casadio R, Myers G</editor>
				<pubdate>2005</pubdate>
				<fpage>140</fpage>
				<lpage>151</lpage>
			</bibl>
			<bibl id="B6">
				<title>
					<p>A fast and flexible statistical model for large-scale population genotype data: applications to inferring missing genotypes and haplotypic phase</p>
				</title>
				<aug>
					<au>
						<snm>Scheet</snm>
						<fnm>P</fnm>
					</au>
					<au>
						<snm>Stephens</snm>
						<fnm>M</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2006</pubdate>
				<volume>78</volume>
				<fpage>629</fpage>
				<lpage>644</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1086/502802</pubid>
						<pubid idtype="pmpid" link="fulltext">16532393</pubid>
						<pubid idtype="pmcid">1424677</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B7">
				<title>
					<p>A Markov chain approach to reconstruction of long haplotypes</p>
				</title>
				<aug>
					<au>
						<snm>Eronen</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Geerts</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Toivonen</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>Pacific Symposium on Biocomputing</source>
				<publisher>World Scientific</publisher>
				<editor>Altman RB, Dunker AK, Hunter L, Jung TA, Klein TE</editor>
				<pubdate>2004</pubdate>
				<fpage>104</fpage>
				<lpage>115</lpage>
				<xrefbib>
					<pubid idtype="pmpid">14992496</pubid>
				</xrefbib>
			</bibl>
			<bibl id="B8">
				<title>
					<p>High-resolution haplotype structure in the human genome</p>
				</title>
				<aug>
					<au>
						<snm>Daly</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Rioux</snm>
						<fnm>J</fnm>
					</au>
					<au>
						<snm>Schaffner</snm>
						<fnm>S</fnm>
					</au>
					<au>
						<snm>Hudson</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Lander</snm>
						<fnm>E</fnm>
					</au>
				</aug>
				<source>Nature Genetics</source>
				<pubdate>2001</pubdate>
				<volume>29</volume>
				<fpage>229</fpage>
				<lpage>232</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/ng1001-229</pubid>
						<pubid idtype="pmpid" link="fulltext">11586305</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B9">
				<title>
					<p>Fast discovery of association rules</p>
				</title>
				<aug>
					<au>
						<snm>Agrawal</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Mannila</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Srikant</snm>
						<fnm>R</fnm>
					</au>
					<au>
						<snm>Toivonen</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Verkamo</snm>
						<fnm>A</fnm>
					</au>
				</aug>
				<source>Advances in Knowledge Discovery and Data Mining</source>
				<publisher>AAAI/MIT Press</publisher>
				<editor>Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R</editor>
				<pubdate>1996</pubdate>
				<fpage>307</fpage>
				<lpage>328</lpage>
			</bibl>
			<bibl id="B10">
				<title>
					<p>SpaMM &#8211; a haplotype reconstruction method</p>
				</title>
				<aug>
					<au>
						<snm>Landwehr</snm>
						<fnm>N</fnm>
					</au>
					<au>
						<snm>Mielik&#228;inen</snm>
						<fnm>T</fnm>
					</au>
					<au>
						<snm>Eronen</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Toivonen</snm>
						<fnm>H</fnm>
					</au>
					<au>
						<snm>Mannila</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<url>http://www.informatik.uni-freiburg.de/~landwehr/haplotyping.html</url>
			</bibl>
			<bibl id="B11">
				<title>
					<p>Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation</p>
				</title>
				<aug>
					<au>
						<snm>Stephens</snm>
						<fnm>M</fnm>
					</au>
					<au>
						<snm>Scheet</snm>
						<fnm>P</fnm>
					</au>
				</aug>
				<source>Am J Hum Genet</source>
				<pubdate>2005</pubdate>
				<volume>76</volume>
				<fpage>449</fpage>
				<lpage>462</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1086/428594</pubid>
						<pubid idtype="pmpid" link="fulltext">15700229</pubid>
						<pubid idtype="pmcid">1196397</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B12">
				<title>
					<p>A block-free hidden Markov model for genotypes and its applications to disease association</p>
				</title>
				<aug>
					<au>
						<snm>Kimmel</snm>
						<fnm>G</fnm>
					</au>
					<au>
						<snm>Shamir</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Journal of Computational Biology</source>
				<pubdate>2005</pubdate>
				<volume>12</volume>
				<issue>10</issue>
				<fpage>1243</fpage>
				<lpage>1259</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1089/cmb.2005.12.1243</pubid>
						<pubid idtype="pmpid" link="fulltext">16379532</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B13">
				<title>
					<p>HaploRec: efficient and accurate large-scale reconstruction of haplotypes</p>
				</title>
				<aug>
					<au>
						<snm>Eronen</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Geerts</snm>
						<fnm>F</fnm>
					</au>
					<au>
						<snm>Toivonen</snm>
						<fnm>H</fnm>
					</au>
				</aug>
				<source>BMC Bioinformatics</source>
				<pubdate>2006</pubdate>
				<volume>7</volume>
				<fpage>542</fpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="pmcid">1766938</pubid>
						<pubid idtype="pmpid" link="fulltext">17187677</pubid>
						<pubid idtype="doi">10.1186/1471-2105-7-542</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B14">
				<title>
					<p>Generating samples under a Wright-Fisher neutral model of genetic variation</p>
				</title>
				<aug>
					<au>
						<snm>Hudson</snm>
						<fnm>R</fnm>
					</au>
				</aug>
				<source>Bioinformatics</source>
				<pubdate>2002</pubdate>
				<volume>18</volume>
				<fpage>337</fpage>
				<lpage>338</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/bioinformatics/18.2.337</pubid>
						<pubid idtype="pmpid" link="fulltext">11847089</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B15">
				<title>
					<p>Genome-wide association studies: theoretical and practical concerns</p>
				</title>
				<aug>
					<au>
						<snm>Wang</snm>
						<fnm>W</fnm>
					</au>
					<au>
						<snm>Barratt</snm>
						<fnm>B</fnm>
					</au>
					<au>
						<snm>Clayton</snm>
						<fnm>D</fnm>
					</au>
					<au>
						<snm>Todd</snm>
						<fnm>J</fnm>
					</au>
				</aug>
				<source>Nature Reviews Genetics</source>
				<pubdate>2005</pubdate>
				<volume>6</volume>
				<fpage>109</fpage>
				<lpage>118</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1038/nrg1522</pubid>
						<pubid idtype="pmpid" link="fulltext">15716907</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
			<bibl id="B16">
				<title>
					<p>Molecular genetics of the Finnish disease heritage</p>
				</title>
				<aug>
					<au>
						<snm>Peltonen</snm>
						<fnm>L</fnm>
					</au>
					<au>
						<snm>Jalanko</snm>
						<fnm>A</fnm>
					</au>
					<au>
						<snm>Varilo</snm>
						<fnm>T</fnm>
					</au>
				</aug>
				<source>Human Molecular Genetics</source>
				<pubdate>1999</pubdate>
				<volume>8</volume>
				<fpage>1913</fpage>
				<lpage>1923</lpage>
				<xrefbib>
					<pubidlist>
						<pubid idtype="doi">10.1093/hmg/8.10.1913</pubid>
						<pubid idtype="pmpid" link="fulltext">10469845</pubid>
					</pubidlist>
				</xrefbib>
			</bibl>
		</refgrp>
	</bm>
</art>
