Table 2

Objectives and rationales for each step of the data processing procedure to validate variants and genotypes.

Steps

Objectives

Rationales


1

Remove reads with incomplete barcode information.

Such reads cannot be assigned to any individual.

Remove reads that display an imperfect match with the primers.

An imperfect match with primers may cause a shift in the reading frame and/or errors in the barcode.

Remove sequences that are unique within the pool.

Unique sequences probably result from sequencing errors (first assumption). This step reduces the dataset and thereby facilitates subsequent data analyses. Note that this step may be relaxed for small datasets (unique sequences will also be removed during Step 3).

Remove reads that display indels that are not multiples of three base pairs.

Such indels probably result from sequencing errors (second assumption). Note that this step may be relaxed when focusing on non-functional genes.


2

Remove samples with a low number of sequences.

A low number of sequences may lead to incomplete genotyping (second assumption).

The minimum number of sequences required to obtain a reliable genotype is estimated taking into account the number of copies amplified for the gene studied (threshold 1).


3

Remove variants with a low number of sequences for a given sample.

Variants represented only rarely within samples probably result from sequencing errors (first assumption).

The minimum number of sequences required to validate a given variant of an individual genotype is estimated from the distribution of variant frequencies for the given sample (threshold 2).


4

Remove variants that do not correspond to the gene studied, using sequence alignment.

Some inconsistencies may still exist in the dataset, such as recombinant chimeric sequences originating from a mixture of the sequences of two different alleles [42], pseudogenes or paralogs, which can occur at high frequencies within individual samples.


See text for the definitions of assumptions and thresholds.
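
Steps 1 to 3 of the table can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `Read` record, the function names, and the parameters `ref_len`, `min_reads_per_sample` and `min_variant_freq` are hypothetical; the two thresholds are passed in as fixed numbers, whereas the table states that they are estimated from the data. The net length difference from an expected amplicon length is used here as a simple proxy for indels, which is also an assumption. Step 4 (alignment-based removal of chimeras, pseudogenes and paralogs) is not covered.

```python
from collections import Counter, defaultdict
from typing import NamedTuple, Optional


class Read(NamedTuple):
    """Hypothetical record for one sequencing read."""
    sample: Optional[str]  # individual decoded from the barcodes; None if incomplete
    primers_ok: bool       # True if both primers match perfectly
    seq: str               # sequence between the primers


def filter_reads(reads, ref_len):
    """Step 1: read-level filters, applied in the order given in the table."""
    # Drop reads with incomplete barcodes or an imperfect primer match.
    kept = [r for r in reads if r.sample is not None and r.primers_ok]
    # Drop sequences that occur only once within the whole pool.
    pool_counts = Counter(r.seq for r in kept)
    kept = [r for r in kept if pool_counts[r.seq] > 1]
    # Drop reads whose net length change is not a multiple of three
    # (assumption: length difference from ref_len stands in for indels).
    return [r for r in kept if (len(r.seq) - ref_len) % 3 == 0]


def call_genotypes(reads, min_reads_per_sample, min_variant_freq):
    """Steps 2 and 3: sample-level and variant-level filters."""
    per_sample = defaultdict(Counter)
    for r in reads:
        per_sample[r.sample][r.seq] += 1

    genotypes = {}
    for sample, counts in per_sample.items():
        total = sum(counts.values())
        # Step 2: skip samples with too few sequences (stand-in for threshold 1).
        if total < min_reads_per_sample:
            continue
        # Step 3: keep variants frequent enough within the sample
        # (stand-in for threshold 2).
        genotypes[sample] = sorted(
            seq for seq, n in counts.items() if n / total >= min_variant_freq
        )
    return genotypes
```

A call such as `call_genotypes(filter_reads(reads, ref_len=9), 10, 0.2)` returns a dictionary that maps each retained sample to its sorted list of validated variants. Keeping the read-level filters separate from the sample-level ones mirrors the table, where Step 1 acts on the pool and Steps 2 and 3 act on individual samples.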

Galan et al. BMC Genomics 2010 11:296   doi:10.1186/1471-2164-11-296
