School of Science and Engineering, Reykjavik University, Reykjavík, 101, Iceland

Abstract

Background

Alu polymorphisms are some of the most common polymorphisms in the genome, yet few methods have been developed for their detection.

Methods

We present algorithms to discover Alu polymorphisms using paired-end high throughput sequencing data from multiple individuals. We consider the problem of identifying sites containing polymorphic Alu insertions.

Results

We give efficient and practical algorithms that detect polymorphic Alus, both those that are inserted with respect to the reference genome and those that are deleted. The algorithms have a linear time complexity and can be run on a standard desktop machine in a very short amount of time on top of the output of tools standard for sequencing analysis.

Conclusions

In our simulated dataset we are able to locate 98.1% of Alus inserted with respect to the reference and 97.7% of Alus deleted. Our simulations also show an excellent correlation between the deletions detected in parents and children. We further run our algorithms on publicly available data from the 1000 Genomes Project and find several thousand Alu polymorphisms in each individual.

Introduction

We consider the problem of detecting polymorphic Alu insertions from DNA sequence reads using high throughput paired-end sequencing data.

Genomewide association studies (GWAS) proceed by identifying a number of individuals carrying a disease or a trait and comparing these individuals to those that do not or are not known to carry the disease/trait. Both sets of individuals are then genotyped for a large number of Single Nucleotide Polymorphism (SNP) genetic variants, which are then tested for association with the disease/trait. GWAS have successfully identified a very large number of polymorphisms associated with disease (e.g.

Whole genome resequencing using next generation sequencers is rapidly becoming the sledgehammer of genomewide association studies. Increasingly, GWAS are done in conjunction with the sequencing of a number of individuals

Copy number variations have been shown to be influential factors in many diseases

An Alu sequence is an approximately 300 basepair long sequence derived from the 7SL RNA gene

Almost all of the recently integrated human Alu elements belong to one of several small and closely related

The current rate of Alu insertion is estimated to be of the order of one Alu insertion in every 200 births

We give an algorithm targeted at finding Alu polymorphisms from next generation paired-end sequencing data. In what follows we first give our problem framework, then describe our algorithms and finally show some experimental results.

Methods

Problem framework

The input to our problem is a reference genome and a set of paired-end sequence reads from a set of individuals. The genome sequence of the reference individual is known and will be highly similar, but not identical, to the genome of the individual(s) being sequenced. Paired-end sequencing reads consist of a read of a fixed length, followed by a short spacing, followed by another read. The spacing between the two reads follows a probability distribution,

**Supplementary material contains a more detailed description of our methods, additional simulation results and results on the 1000 genomes data**.


When the polymorphic Alu is not contained in the reference, we consider the Alu to be inserted with respect to the reference. When the polymorphic Alu sequence is contained in the reference genome and in some of the sequenced individuals, we consider the Alu sequence to be deleted with respect to the reference, even though evolutionarily the sequence has most likely been inserted.

The output of our algorithm is a set of locations in the genome where an Alu sequence is inserted in some individual(s) as well as the sequence reads of the individuals being studied for these insertions. As each individual contains two haplotypes a polymorphic Alu may be inserted on one, both or neither of these haplotypes.

We formulate four versions of the problem of identifying Alus: the Alu sequences may be inserted or deleted with respect to the reference genome, and in each case the polymorphisms may be identified in a single individual or in multiple individuals.

Problem 1

Single Individual Deleted Alu identification problem

**Input** A set of paired-end sequence reads from a single individual and a reference genome.

**Output** A list of locations in the genome where an Alu is deleted with respect to the reference genome.

Problem 2

Multiple Individual Deleted Alu identification problem

**Input** A set of paired-end sequence reads from multiple individuals and a reference genome.

**Output** A list of locations in the genome where there exists an individual with an Alu deleted with respect to the reference genome.

Problem 3

Single Individual Inserted Alu identification problem

**Input** A set of paired-end sequence reads from a single individual and a reference genome.

**Output** A list of locations in the genome where an Alu is inserted with respect to the reference genome.

Problem 4

Multiple Individual Inserted Alu identification problem

**Input** A set of paired-end sequence reads from multiple individuals and a reference genome.

**Output** A list of locations in the genome where there exists an individual with an Alu inserted with respect to the reference genome.

Following the identification of polymorphic regions we need to determine which individuals are polymorphic for each polymorphism.

Problem 5

Alu genotyping problem

**Input** A single location in the reference genome known to contain a polymorphic Alu. A set of individuals and a set of sequence reads for each individual.

**Output** For each individual, a genotype call, assigning the individual 0, 1 or 2 copies of the given Alu, representing an Alu on neither, one or both haplotypes.

We start by giving the common algorithmic framework for our algorithms and then give algorithms for each of the problems in turn. We first describe our approach for detecting deleted regions in a single individual and then extend it to recognizing deletions in multiple individuals simultaneously. Finally, we show how these ideas can be extended to identifying inserted Alus, first in a single individual and then in multiple individuals simultaneously.

Algorithm framework

Our algorithms start by mapping the sequence reads to the reference genome and analyzing the output of such a mapping.

Alu Mate

We start by preprocessing the sequence reads to make them easier to manipulate. The initial step of our algorithm is to map the sequencing reads to the human reference genome build 37 (hg19) using the Burrows Wheeler Aligner (BWA)

A read pair is defined as

Lemma 1

Algorithm Alu Mate runs in time linear in the number of reads.

Analysis of mapped reads

The output of Alu mate is a mapping of sequence reads to the reference genome and an assignment of

Figure

**Example of an Alu deletion**. Arrows show read directions. Black arrows show normally mapping reads, red lines show the insert between them. The leftmost figure shows a normal individual, the center a heterozygote and the rightmost an individual homozygous for an Alu deletion. The location of the Alu is shown with a thick red line at the bottom of each figure.

Figure

**Example of an Alu insertion**. Arrows show read direction, black arrows show reads mapping normally, red lines show the insert between them, green arrows show

Detection of deleted Alus

We consider an Alu sequence deleted when it occurs in the reference assembly, but not in the individual(s) being sequenced. There are two primary signs of a deletion. The first is that some of the reads will be split, containing one part from each side of the deletion. The second is that there are read pairs with one end mapping to each of the two sides of the Alu being considered and a corresponding increase in their insert length. The distance between these reads, as measured with respect to the reference genome, will in expectation exceed the length of the deleted Alu. Detecting deleted Alus is considerably simpler than detecting inserted Alus, as the location of the Alu is known. For detecting Alu deletions we hence only need to consider locations that have already been annotated to contain Alus.
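The second signal lends itself to a short illustration. The Python sketch below assumes read pairs have already been mapped and reduced to two reference coordinates (end of the left read, start of the right read). The function names, the tuple layout and the `max_normal_insert` cutoff are hypothetical choices for this example and are not part of the method described here.

```python
from typing import List, Tuple

# (end of left read, start of right read), both on the reference; assumed layout
ReadPair = Tuple[int, int]


def spanning_pairs(pairs: List[ReadPair], alu_start: int, alu_end: int) -> List[ReadPair]:
    """Keep read pairs whose two ends map on opposite sides of the annotated Alu."""
    return [(left, right) for (left, right) in pairs
            if left <= alu_start and right >= alu_end]


def supports_deletion(pair: ReadPair, max_normal_insert: int) -> bool:
    """A spanning pair hints at a deletion if its reference insert is unusually long."""
    left_end, right_start = pair
    return (right_start - left_end) > max_normal_insert


# Toy data: an Alu annotated at [200, 500) and five mapped read pairs.
pairs = [(100, 700), (150, 320), (380, 900), (90, 650), (180, 520)]
alu_start, alu_end = 200, 500

spanning = spanning_pairs(pairs, alu_start, alu_end)
support = [p for p in spanning if supports_deletion(p, max_normal_insert=400)]
```

Here the pair `(180, 520)` spans the Alu with an ordinary insert length, which is what one expects from a haplotype that still carries the Alu, while the two longer pairs are the kind of evidence a deletion would produce.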

Genotyping deleted Alus

For each Alu annotated in the reference genome we determine the genotypes of the polymorphism of the individual. We let _{ϵ }be the _{1-ϵ }be the 1 - _{1-ϵ }to the left and right of the estimated Alu.

We construct a set _{1-ϵ }to the left and right of the Alu. Here

We then compute the probability of observing the insert lengths in _{Alu }is _{Alu }is the length of the Alu.
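As an illustration of how insert lengths can discriminate between genotypes, the following Python sketch scores the three genotypes for one annotated Alu. It is a simplified model, not necessarily the computation used in this paper. It assumes insert lengths are normally distributed with a known mean and standard deviation, that a read pair from a haplotype lacking the Alu appears longer by the Alu length when measured on the reference, and that both haplotypes are sampled equally. All names and numbers are hypothetical.

```python
import math
from typing import Dict, List


def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of a normal distribution (assumed insert length model)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def genotype_log_likelihoods(inserts: List[float], mean: float, sd: float,
                             alu_len: float) -> Dict[int, float]:
    """Log-likelihood of the spanning insert lengths for 0, 1 or 2 Alu copies."""
    loglik = {}
    for copies in (0, 1, 2):
        with_alu = copies / 2.0  # chance a read pair comes from an Alu haplotype
        total = 0.0
        for d in inserts:
            p = (with_alu * normal_pdf(d, mean, sd)
                 + (1.0 - with_alu) * normal_pdf(d, mean + alu_len, sd))
            total += math.log(max(p, 1e-300))  # floor avoids log(0)
        loglik[copies] = total
    return loglik


# Toy data: half the pairs look normal, half look about 300 bp too long.
observed = [310.0, 295.0, 605.0, 590.0, 300.0, 612.0]
ll = genotype_log_likelihoods(observed, mean=300.0, sd=20.0, alu_len=300.0)
best = max(ll, key=ll.get)  # genotype with the highest log-likelihood
```

With a mixture of normal and elongated inserts the heterozygous genotype (one copy) scores highest in this toy model.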

Deleted Alus in Multiple Individuals

When considering multiple individuals simultaneously we can construct a likelihood ratio statistic for the occurrence of the deletion. We let the individuals be labeled from 1 through $n$ and let $R_i$ be the set of sequence reads belonging to individual $i$. We let $f_0$ be the frequency of the homozygote Alu carriers in the population, $f_1$ be the frequency of the heterozygote and $f_2$ be the frequency of the homozygote non-Alu. Then the joint likelihood of the data given an Alu deletion is:
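One way to write such a joint likelihood is sketched here under stated assumptions; it is not necessarily the exact expression used in this paper. Individuals are taken to be independent, $n$ is the number of individuals, $R_i$ denotes the reads of individual $i$, $f_g$ the frequency of genotype class $g$ (0 for homozygote Alu carriers, 1 for heterozygotes, 2 for homozygote non-Alu) and $P(R_i \mid g)$ the probability of the reads given that genotype. The symbol names are chosen for this illustration.

```latex
L(f_0, f_1, f_2) \;=\; \prod_{i=1}^{n} \, \sum_{g=0}^{2} f_g \, P(R_i \mid g),
\qquad f_0 + f_1 + f_2 = 1 .
```

Because the three frequencies sum to one, this form has two free parameters, which agrees with a two degree of freedom test against a model without the deletion.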

We apply a likelihood ratio test to test whether a deletion is significantly more likely than the model on the statistic

Under the null this statistic obeys a chi square distribution with two degrees of freedom

If we assume Hardy-Weinberg equilibrium

The corresponding likelihood ratio test will then obey a chi square distribution with one degree of freedom. We use the one degree of freedom test in the remainder of the paper.
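A one degree of freedom test of this kind can be sketched in Python using only the standard library. The sketch assumes per-individual genotype likelihoods are already available, parameterises Hardy-Weinberg proportions by a single deletion allele frequency `q`, and uses a plain grid search in place of a proper optimiser. The function names and the toy likelihoods are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def chi2_sf_1df(x: float) -> float:
    """Upper tail probability of a chi square distribution with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def hwe_log_likelihood(q: float, geno_liks: List[Sequence[float]]) -> float:
    """Log-likelihood with genotype frequencies (1-q)^2, 2q(1-q), q^2.

    geno_liks[i][g] is P(reads of individual i | g deleted copies).
    """
    freqs = ((1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2)
    total = 0.0
    for liks in geno_liks:
        mix = sum(f * l for f, l in zip(freqs, liks))
        total += math.log(max(mix, 1e-300))  # floor avoids log(0)
    return total


def lrt_one_df(geno_liks: List[Sequence[float]], grid: int = 1000) -> Tuple[float, float]:
    """Likelihood ratio statistic and p-value against the no-deletion model q = 0."""
    null_ll = hwe_log_likelihood(0.0, geno_liks)
    best_ll = max(hwe_log_likelihood(k / grid, geno_liks) for k in range(grid + 1))
    stat = 2.0 * (best_ll - null_ll)  # never negative: q = 0 is on the grid
    return stat, chi2_sf_1df(stat)


# Toy genotype likelihoods for four individuals (0, 1, 2 deleted copies).
geno_liks = [(0.9, 0.1, 0.01), (0.05, 0.8, 0.05), (0.02, 0.1, 0.9), (0.9, 0.1, 0.01)]
stat, p_value = lrt_one_df(geno_liks)
```

The grid search keeps the example dependency-free; an expectation-maximisation or numerical optimiser would be the more usual choice for real data.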

Inserted Alu identification

One of the main complications in detecting Alu polymorphisms is the fact that members of the Alu family are all highly similar. The Alu insertions which we are looking for will be similar to sequences already inserted and other sequences that also may have been inserted.

The mapping of reads not mapping to Alu regions is generally more reliable, however a number of problems may occur; the region being considered may be duplicated, or the read may be chimeric, where due to artifacts in the sequencing process two parts of the read come from different parts of the genome. This implies that not all

Identifying potential inserts

As described earlier, we label Alu mates as either _{r }+ _{r }+ _{r }is the right endpoint of the _{l }- _{l }- _{l }is the left endpoint of the

We say that an Alu position, _{r }+ _{1-ϵ }≥ _{r }+ _{ϵ }≤ _{ϵ }and _{1-ϵ }are defined as before. Similarly an Alu position, _{l }- _{1-ϵ }≤ _{l }- _{ϵ }≥

Problem 6

Alu genotyping problem

**Input** A set

**Output** A set

**Objective** min |

**Constraints** Each

We note that the most general version of this problem reduces to a set covering problem, which can be shown to be hard to even approximate

For our empirical evaluations we set

Optimal algorithm

To search for regions likely to contain an Alu sequence we make a single pass through the genome. For each position, _{1-ϵ }to the left _{1-ϵ }to the right of

The time complexity of the algorithm is _{1-ϵ}), where
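A single left-to-right pass of this kind can be sketched with two pointers. The example assumes the anchors (mapped positions of reads whose mate falls in an Alu) are already sorted. The parameter `window` stands in for the distance bound used in the text and `min_support` for a minimum read count; both names are hypothetical. This is an illustration of a linear scan, not the optimal algorithm itself.

```python
from typing import List, Tuple


def candidate_regions(positions: List[int], window: int,
                      min_support: int) -> List[Tuple[int, int]]:
    """Report regions where at least min_support anchors fall within one window.

    positions must be sorted in ascending order. Each pointer only moves
    forward, so the scan is linear in the number of anchors.
    """
    regions: List[Tuple[int, int]] = []
    left = 0
    for right, pos in enumerate(positions):
        # Shrink the window from the left until it is at most `window` wide.
        while pos - positions[left] > window:
            left += 1
        if right - left + 1 >= min_support:
            start = positions[left]
            if regions and start <= regions[-1][1]:
                # Overlaps the previous region: extend it instead of adding a new one.
                regions[-1] = (regions[-1][0], pos)
            else:
                regions.append((start, pos))
    return regions


anchors = [100, 120, 150, 900, 2000, 2050, 2080, 2100]
regions = candidate_regions(anchors, window=200, min_support=3)
```

The isolated anchor at 900 is ignored, while the two clusters are reported as one region each.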

Covering multiple individuals

One way to detect Alu insertions in multiple individuals is to pool the data into a single dataset and ignore the fact that there are multiple individuals being sequenced. This simple idea will however lack power to find infrequent Alus. A region containing multiple _{1 }and _{2 }be constants, representing the cost of introducing an Alu insertion to the population and the cost of introducing an Alu insertion to each individual. We let _{j }

Problem 7

**Input** A set _{i }of _{i }of

**Output** A set

**Objective** min | _{1 }| _{2 }Σ_{j }| _{j }|

**Constraints** Each _{i }and _{i }is either in _{a}.

We have not been able to determine the computational complexity of this problem and leave open whether or not the problem is NP-hard.

Heuristic algorithm

When tuning these parameters we set both cost constants to 2, representing that we require two sequence reads in each individual to warrant introducing an Alu insert in the population and two sequence reads to warrant introducing the Alu to the individual.

We solve this problem using a heuristic. To prune the number of regions that we need to consider we start by considering each individual at a time. In each individual we search for regions where there are at least a small number of _{1-ϵ}. We then merge the insert locations of two individuals if they appear to be very close to each other.
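The merging step can be illustrated with a small Python sketch. It assumes each individual already has a list of candidate insert positions and merges positions from any individuals when consecutive positions lie within a fixed distance. The `tolerance` parameter, the function name and the use of the mean as the merged position are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def merge_candidates(per_individual: Dict[str, List[int]],
                     tolerance: int) -> List[Tuple[int, List[str]]]:
    """Merge nearby candidate positions across individuals.

    Returns (merged position, sorted carrier names) per cluster. Positions are
    chained: a position joins the current cluster if it lies within
    `tolerance` of the previous position in sorted order.
    """
    events = sorted((pos, name)
                    for name, positions in per_individual.items()
                    for pos in positions)
    clusters: List[Tuple[List[int], set]] = []
    for pos, name in events:
        if clusters and pos - clusters[-1][0][-1] <= tolerance:
            clusters[-1][0].append(pos)
            clusters[-1][1].add(name)
        else:
            clusters.append(([pos], {name}))
    # Integer mean of the member positions serves as the merged location.
    return [(sum(ps) // len(ps), sorted(names)) for ps, names in clusters]


candidates = {"A": [1000, 5000], "B": [1010], "C": [5030, 9000]}
merged = merge_candidates(candidates, tolerance=50)
```

Chaining on the previous position keeps the pass linear after sorting, at the cost of letting long runs of closely spaced candidates grow into one cluster.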

Genotyping of inserted Alus

Given the location of potential Alu insertions we run an algorithm similar to the one that we ran for Alus that are deleted with respect to the reference.

Until convergence

Estimate length of Alu insertion

Re-estimate positions

Insert the Alu insertion in silico in the position determined.

Apply the algorithm for deleted from reference for genotype calling.

Alu insertion length estimation

We assume that there is a single insertion event that occurred in all of the individuals simultaneously. For each read pair, _{t}, a position within the Alu of the Alu read _{t}, mean distance between the two, _{t }and standard deviation in distance between the two, _{t}. The means and the standard deviation are estimated from the reads of each individual independently.

Assume we know a position _{Alu }where there is an insertion. Now consider all Alu read pairs in the interval [_{Alu }- _{1-ϵ}, _{Alu }+ _{l-ϵ}]. Now assume that we have aligned all Alu read pairs in this interval to the same Alu, of length _{Alu}. Our model of the true length of the Alu is that it is _{Alu }+

We now estimate _{Alu }- _{1-ϵ}, _{Alu}] and use these to estimate _{t }= _{Alu }- _{t}, then the estimate of ^{t }= _{t }- _{t }- _{t}, with standard deviation _{t}. When considering multiple reads the maximum likelihood estimate of

Similarly we get an estimate for _{Alu}, _{Alu }+ _{l-ϵ}] and use these to get an estimate of _{t }= _{t }- _{Alu}, then the estimate of _{t}. When considering multiple reads the maximum likelihood estimate of
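When every read pair yields its own estimate with its own standard deviation, and the errors are assumed independent and normally distributed, the maximum likelihood estimate of the shared quantity is the inverse-variance weighted mean. The Python sketch below shows that combination under this normality assumption. The function name and the toy values are hypothetical.

```python
import math
from typing import List, Tuple


def combine_estimates(estimates: List[float], sds: List[float]) -> Tuple[float, float]:
    """Inverse-variance weighted mean and its standard deviation.

    Assumes independent normal errors with known standard deviations, in which
    case the weighted mean is the maximum likelihood estimate.
    """
    weights = [1.0 / (sd * sd) for sd in sds]
    total = sum(weights)
    mean = sum(w * e for w, e in zip(weights, estimates)) / total
    return mean, math.sqrt(1.0 / total)


# Two read pairs with equal uncertainty: the estimate is their plain average.
mean_equal, sd_equal = combine_estimates([150.0, 170.0], [10.0, 10.0])

# Unequal uncertainty: the more precise read pair dominates.
mean_weighted, sd_weighted = combine_estimates([100.0, 200.0], [10.0, 30.0])
```

The same combination applies whether the quantity being estimated is a length or a position, as long as each read contributes one estimate and one standard deviation.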

Alu insert position reestimation

Each read gives an estimate of the location of the inserted Alu. A joint estimate is determined from all of the reads in a given region. This is done in the same manner as described above, where we isolate _{Alu }from the equations instead of

In silico insertion and deleted algorithm

Once the location of the Alu insertion and the length of the Alu is determined a new sequence is constructed containing the Alu at the inserted location. Following the construction of this new sequence a graph, identical to the one described for Alus deleted with respect to the reference, containing the location of the reads in the interval is constructed as before.

The in silico constructed genomic sequence now contains the Alu that we previously considered to be inserted. The Alu sequence is therefore deleted with respect to this sequence and we can apply the same algorithm as before.

Results

We run our experiments on simulated data and on data from the 1000 genomes project.

Simulated data

We benchmark our algorithms on simulated data. We downloaded chromosome 22 of build 37 of the human genome, as well as the RepeatMasker track to identify Alu sequences in the build. We downloaded a database of Alu sequences from RepBase

The 100 chromosomes were then paired to construct 50 diploid individuals, with each individual containing on average 50 Alu insertions. The Alu insert locations were chosen randomly on the chromosome, with the constraint that no Alu was added within _{1-ϵ }basepairs of another Alu and no more than 1% of basepairs are annotated N in a 2_{1-ϵ }basepair window surrounding the introduced Alu. This allows us to focus our results only on Alu insertions that are distant from other Alus and is not meant to be representative of the process by which Alus are inserted. Reads were simulated using the program SimSeq
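The spacing constraint on simulated insert locations can be illustrated with simple rejection sampling. The Python sketch below is not the simulation code used for these experiments. It enforces only the minimum distance to other Alus and omits the check on N-annotated basepairs. The function name, `min_gap` and the toy chromosome length are hypothetical.

```python
import random
from typing import List, Sequence


def sample_insert_sites(chrom_len: int, n_sites: int, min_gap: int,
                        existing: Sequence[int], rng: random.Random,
                        max_tries: int = 100_000) -> List[int]:
    """Draw random positions that keep at least min_gap to every other Alu.

    `existing` holds positions of Alus already annotated on the chromosome.
    Candidates that violate the spacing rule are rejected and redrawn.
    """
    chosen: List[int] = []
    tries = 0
    while len(chosen) < n_sites and tries < max_tries:
        tries += 1
        pos = rng.randrange(chrom_len)
        others = list(existing) + chosen
        if all(abs(pos - other) >= min_gap for other in others):
            chosen.append(pos)
    return sorted(chosen)


rng = random.Random(0)  # fixed seed so the example is reproducible
annotated = [1_000, 50_000]
sites = sample_insert_sites(1_000_000, n_sites=50, min_gap=500,
                            existing=annotated, rng=rng)
```

Rejection sampling is adequate here because the requested sites cover only a small fraction of the chromosome; a denser design would need a different sampling scheme.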

Alu insertion

The individuals were selected to have similar coverage and to be genotyped under similar conditions. We benchmark our Alu insertion identification algorithm by considering the mapping of the reads of the simulated individuals to the reference genome; results are shown in Table

Alus inserted with respect to the reference

| | **Expected** | **Found (%)** |
|---|---|---|
| Error free | 1512 | 1483 (98.1%) |
| 2% error | 1512 | 1446 (95.6%) |

Number of Alus found inserted with respect to the reference in simulated genotype data.

We ran our insertion algorithm on each individual independently. When tuning our algorithms to find no false positives we find 96.4% of all Alus inserted. The false negatives are mostly from individuals that are heterozygous for the insertion and mostly occur where there is other surrounding variation.

Alu deletion

We benchmark our Alu deletion identification algorithm by considering the mapping of the reads of the simulated individuals to a simulated individual that contains all the Alus; results are shown in Table

Alus deleted with respect to the reference

| | **Expected** | **Found (%)** |
|---|---|---|
| Error free | 1422 | 1390 (97.7%) |
| 2% error | 1422 | 1385 (97.4%) |

Number of Alus found deleted with respect to the reference in simulated genotype data.

In Additional file

Verification on triad data

We investigated whether the deletions that we detected were transmitted to the children. We simulated fifty trios where we independently simulated two chromosomes with randomly inserted Alus for each parent. We then randomly selected one chromosome from each parent to use for the child. We found very high concordance between the parents and the child, as shown in Table

Trio results

| | **Found in child** | **Matches parents** |
|---|---|---|
| Homozygote deleted | 997 | 997 (100%) |
| Heterozygote | 368 | 362 (98.4%) |

The number of deletions found in the child that were also found in a consistent manner in its parents. The first line shows the results when the child is homozygote for the deletion. The second line shows the results when the child carries only a single copy of the deletion.
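The trio construction and the consistency check can be sketched in a few lines of Python. The sketch represents each chromosome as the set of sites carrying the variant, builds a child from one randomly chosen chromosome per parent and tests whether a child genotype can be explained by the parental genotypes. The data layout and function names are assumptions made for this illustration, not the code used for the experiments.

```python
import random
from typing import FrozenSet, Tuple

Haplotype = FrozenSet[int]                # sites carrying the variant
Individual = Tuple[Haplotype, Haplotype]  # two chromosomes


def make_child(mother: Individual, father: Individual, rng: random.Random) -> Individual:
    """Pick one chromosome at random from each parent."""
    return (rng.choice(mother), rng.choice(father))


def copies(individual: Individual, site: int) -> int:
    """Number of chromosomes (0, 1 or 2) carrying the variant at this site."""
    return sum(1 for hap in individual if site in hap)


def mendelian_consistent(child_g: int, mother_g: int, father_g: int) -> bool:
    """True if the child genotype can arise from the two parental genotypes."""
    transmit = {0: {0}, 1: {0, 1}, 2: {1}}  # copies a parent can pass on
    return any(m + f == child_g
               for m in transmit[mother_g] for f in transmit[father_g])


rng = random.Random(1)
mother = (frozenset({10, 20}), frozenset({20}))
father = (frozenset({30}), frozenset())
child = make_child(mother, father, rng)

checks = [mendelian_consistent(copies(child, s), copies(mother, s), copies(father, s))
          for s in (10, 20, 30)]
```

With true genotypes every site is consistent by construction, so any inconsistency observed on called genotypes points to a genotyping error in the child or a parent.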

1000 genomes

We run our experiments on twenty individuals from the LWK (Luhya in Webuye, Kenya) population of the 1000 Genomes Project

We find an average of 1418 Alus that are deleted with respect to the reference. This corresponds to a rate of approximately

We find an average of 5990 Alus that are inserted with respect to the reference. A table showing the number of inserted Alus in each individual is shown in Additional file

dbRIP

Stewart et al.

When we compare the deleted Alus of two individuals we find that 61.5% of the deletions found in one individual are also found in another individual. For inserted Alus this number is 15.6%. The reason for this difference is that Alus generally have a low frequency: the deleted Alus are generally the ones that have been inserted into the reference genome and hence will not be present in a large number of the other individuals, while the inserted Alus have only been inserted into a subset of the population.

Timing

We ran our computations on a desktop machine using a single 3.06 GHz Intel i5 processor. On average each individual of the 1000 Genomes data took 1 hour and 44 minutes to analyze for regions that are deleted with respect to the reference and 2 hours and 1 minute to analyze for regions that are inserted with respect to the reference.

Alu families

We investigate which Alu families are deleted. We estimate the Alu family from the RepeatMasker annotations (cf. Table

Estimated Alu families

| | **Total** |
|---|---|
| AluY | 22660 (82.15%) |
| AluS | 3167 (11.48%) |
| AluJ | 1758 (6.38%) |

Estimated Alu families of Alus deleted with respect to the reference genome using 1000 Genomes data.

Conclusions

A number of improvements can be made to the algorithm that we have presented. Broken reads, those where one part maps to the reference genome and one part maps to an insertion, or where one part maps to one side of a deletion and one part to the other, can be used to improve the algorithms described here. In our algorithm we study only the single best mapping of each sequence read. An alternative would be to study multiple mappings of reads to the reference genome. We will attempt to explore such solutions; however, our experimental results suggest that this will provide little gain for most regions of the genome while adding considerable algorithmic complexity. Our future goals are to extend the methods developed here to find other types of structural variations.

List of abbreviations

SNP: single nucleotide polymorphism; DNA: deoxyribonucleic acid; RNA: ribonucleic acid; LINE: long interspersed elements; SINE: short interspersed elements; GWAS: genomewide association studies; LWK: Luhya in Webuye, Kenya.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

JIS implemented the software. BVH and JIS ran the experiments. BVH and JIS developed the algorithms. BVH wrote the first draft of the paper. BVH and JIS contributed to writing the final version of the paper.

Acknowledgements

JIS was supported by the Icelandic Research Fund for Graduate Students (grant nr. R-10-0008).

This article has been published as part of