Unit of Statistical Genetics, Center for Genomic Medicine, Graduate School of Medicine, Kyoto University, 53 Shogoin Kawahara-cho, Sakyo-Ku, Kyoto, Japan

Department of Forensic Medicine and Molecular Pathology, Graduate School of Medicine, Kyoto University, Yoshida-Konoe-Cho, Sakyo-Ku, Kyoto, Japan

Abstract

Background

DNA profiling is essential for individual identification. In forensic medicine, the likelihood ratio (LR) is commonly used to identify individuals. The LR is calculated by comparing two hypotheses for the sample DNA: that the sample DNA is identical or related to a reference DNA, and that it is randomly sampled from a population. For multiple-fatality cases, however, identification should be considered as an assignment problem, and a particular sample and reference pair should therefore be compared with other possibilities conditional on the entire dataset.

Results

We developed a new method that computes the probability of a sample and reference pair, conditional on the entire dataset, via permanents of square matrices with nonnegative entries. Because exact computation of the permanent is #P-complete, we applied the Huber–Law algorithm to approximate the permanents. We performed a computer simulation to compare our method with the LR method by receiver operating characteristic (ROC) curve analysis under the assumption of a closed incident. Differences between the two methods were clearly demonstrated when references provided neither obligate alleles nor impossible alleles. The new method exhibited higher sensitivity (0.188 vs. 0.055) at a threshold value of 0.999, at which specificity was 1, and a higher area under the ROC curve (0.990 vs. 0.959, P = 9.6 × 10^{–15}).

Conclusions

Our method therefore offers a solution for a computationally intensive assignment problem and may be a viable alternative to LR-based identification for closed-incident multiple-fatality cases.

Background

DNA profiling is crucial in the identification of human remains, particularly when other physical clues are absent. Currently, the genetic status of an individual is commonly profiled using short tandem repeat (STR) loci. There are two approaches to DNA-based identification of an unidentified body: 1) direct matching, and 2) kinship testing. Direct matching is performed when the reference DNA of a victim can be obtained from his/her belongings (direct reference). When a direct reference is not available, an indirect reference is obtained from the victim’s relatives, and a probability distribution of genotypes is inferred for the victim. In either approach, identification is based on the likelihood ratio (LR) between two hypotheses: that the DNA of an unidentified body is that of a particular missing person (MP) (hypothesis H_{1}), and that it is randomly sampled from a population (hypothesis H_{2}). STR-based identification of this kind has been in routine use since the mid-1990s. The first application to mass disaster identification was reported in 1997.

To illustrate the problems, suppose that there are two victims and that the DNA of both bodies and reference DNA for both MPs are available. We can then assume with a high degree of confidence that each body is either MP 1 or MP 2. The conventional LR-based method (LR method) independently compares body 1 to MP 1, body 1 to MP 2, body 2 to MP 1, and body 2 to MP 2. However, these four comparisons are not independent: if the probability of body 1 corresponding to MP 1 is very high, the probability of body 1 corresponding to MP 2 should be very small, because only two possibilities are considered for the identity of body 1 and the two probabilities must add up to 1. Another concern with the LR method is as follows. If we consider an LR of 1000 to be sufficiently high to accept the identity of a body, what if the LRs for both of our MPs exceed 1000? In this case, we might decide that the MP with the higher LR is body 1. But what if the LR for MP 1 is 1001 and that for MP 2 is 1002? Should we choose MP 2? When the number of victims is small, for example fewer than 10, the “best” assignment might be successfully determined by “intelligence”, but this approach would fail if the number of victims were 1000 or more. The fact that the LR of a pair is always independent of the probabilities of other pairs in the data raises another problem: unavailable data cannot be taken into account. Here, “unavailable data” means unrecovered bodies and MPs with no available reference. As unavailable data are a source of uncertainty, the amount of unavailable data should affect the test result. Note that the LR is down-weighted by a prior probability of one divided by the total number of victims, but the test result is the same whether the data are complete or mostly absent.

To address these problems, many-to-many approaches need to be developed. Lin et al. _{1}, that the identity of body ^{’}_{2}, instead of _{2}, that the identity of body

We considered identification in multiple-fatality cases as an assignment problem and developed a method to compute the probability that body

To handle this problem, the data must contain the same number of bodies and MPs, for both theoretical and technical reasons. Theoretically, the number of bodies (whether or not they were recovered) must be the same as the number of MPs (whether or not they were reported); technically, the algorithm used to solve this problem requires a square matrix, as described below. Therefore, when some of the bodies are not recovered, they are assumed to exist, but with a genotype probability distribution that is the same as that of the population. Similarly, when some of the victims are not reported as missing or have no available reference DNA, they are assumed to exist, but with a genotype probability distribution that is the same as that of the population. The key idea in computing the probability is to find the sum of the weights of all perfect matchings in a bipartite graph, which is known as the permanent of a square matrix. Exact computation of a permanent is #P-complete

In this paper, we first describe the logic of our method. We then present the results of a computer simulation performed to evaluate the performance of our method compared with that of the conventional LR-based method, along with results showing the influence of the presence of unavailable data, using receiver operating characteristic (ROC) curve analysis, and the results of assessment of the accuracy and processing time of the approximation algorithm in the case of DNA identification.

Methods

Notation

Assume that a fatality incident has a total of _{w} victims. Assume that their bodies (some of which may not have been recovered) are denoted by _{i}), _{w}, and the corresponding MPs (some of whom may not have been reported) are denoted by _{j}), _{w}. Because all bodies may not have been recovered, let _{b} ≤ _{w} be the number of recovered bodies. Similarly, because some victims of the fatality may not have been reported, let _{m} ≤ _{w} be the number of reported MPs. A victim is considered “reported” if either a direct or an indirect reference is available for genotyping. There are four possible situations: 1) _{w} = _{b} = _{m}, 2) _{w} > _{b} and _{w} = _{m}, 3) _{w} = _{b} and _{w} > _{m}, and 4) _{w} > _{b} and _{w} > _{m}. In cases 2), 3), and 4), unrecovered bodies and/or unreported MPs exist. _{i}), _{w} denotes the probability distributions of the genotypes of _{w} bodies with the following properties of _{i}; 1) the number of elements equals to the number of possible genotypes, 2) each element represents the probability of the genotype being consistent with the genotype of _{i}, 3) the sum of all the elements equals 1, and 4) when the genotype can be determined, one element is 1 and the other elements are 0. The elements of _{i} take either one of the two possible forms:

where _{b,i} denotes the probability distribution of the genotype of _{i}, given that the DNA from the body is available for genotyping, and _{*} is the probability distribution of genotypes in a population, whereby genotypes of _{w} – _{b} unrecovered bodies are assumed to follow _{*}.

Similarly, _{j}), _{w} denotes the probability distributions of the genotypes of _{w} MPs, and _{j} has the same properties as _{i}. The elements of _{j} are also given as follows:

Here, _{m,j} is the probability distribution of the genotype of _{j}, given its reference DNA, and the probability distribution for _{w} – _{m} unreported MPs is assumed to follow _{*}; _{i} and _{j} can be expressed as a vector in which the value of each element is the probability of each genotype.

We define an _{w} × _{w} matrix _{i,j}), where _{i,j} is the probability that _{i} and _{j} have an identical genotype, and is calculated as the inner product of _{i} and _{j}. We also define an _{w} × _{w} matrix _{i,j}), where _{i,j} is the probability that _{i} and _{j} are identical conditional on a matrix

To calculate a matrix _{i,j}). The set of permutations on {1, 2, ⋯, _{n} and is needed to define perm(_{n}, _{-i,-j} to denote the (

For the criterion for assignment based on _{i} and _{j} pair are “assigned” when _{i,j} exceeds _{i} to be _{j}. Otherwise, the identity of _{i} is “suspended”, i.e., not approved to be _{j}.

Matrix

Probability _{i,j} is calculated as the inner product of _{i} and _{j}. A matrix

Here, _{*,j}, _{i,*}, and _{*,*} indicate the inner products of _{*} and _{m,j}, _{b,i} and _{*}, and two _{*}, respectively. Moreover, _{1} is an _{b} × _{m} matrix; _{2} is an (_{w} – _{b}) × _{m} matrix consisting of identical rows, _{3} is an _{b} × (_{w} – _{m}) matrix consisting of identical columns, _{4} is an (_{w} – _{b}) × (_{w} – _{m}) matrix consisting of identical values, _{*,*}. Each submatrix corresponds to the collection of each of the following four cases:

Case 1. _{i} is recovered and _{j} is reported.

Case 2. _{i} is not recovered and _{j} is reported.

Case 3. _{i} is recovered and _{j} is not reported.

Case 4. _{i} is not recovered and _{j} is not reported.
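The four cases above can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the function names, the variable names, and the toy three-genotype distributions are all assumptions made for this example. Each entry is the inner product of a body distribution and an MP distribution, and unrecovered bodies and unreported MPs are filled in with the population distribution so that the matrix is square.

```python
def inner(u, v):
    """Inner product of two genotype probability vectors of equal length."""
    return sum(x * y for x, y in zip(u, v))


def build_match_matrix(body_dists, mp_dists, population, n_total):
    """Build the n_total x n_total matrix of genotype-identity probabilities.

    body_dists: distributions for recovered bodies (hypothetical input format)
    mp_dists:   distributions for reported MPs, inferred from reference DNA
    population: population genotype distribution, used for missing data
    """
    if n_total < max(len(body_dists), len(mp_dists)):
        raise ValueError("n_total must be at least the number of bodies and MPs")
    # Pad unrecovered bodies (rows) and unreported MPs (columns)
    # with the population distribution.
    rows = list(body_dists) + [population] * (n_total - len(body_dists))
    cols = list(mp_dists) + [population] * (n_total - len(mp_dists))
    return [[inner(r, c) for c in cols] for r in rows]


# Toy example: 3 possible genotypes, 3 victims, 2 recovered bodies, 1 reported MP.
population = [0.5, 0.3, 0.2]
body_dists = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
mp_dists = [[0.6, 0.3, 0.1]]
p_matrix = build_match_matrix(body_dists, mp_dists, population, n_total=3)
```

In this toy example the last row corresponds to the unrecovered body and the last two columns to unreported MPs, so the padded columns are identical to each other, as described for the submatrices above.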

Matrix _{i}, the probability of an observed genotype is 0.95, and the total probability of other genotypes is 0.05. By this, any Mendelian errors do not cause zero in matrix
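The effect of reserving probability for genotypes other than the observed one can be sketched as below. The text specifies only the 0.95 and 0.05 totals; how the 0.05 is spread over the other genotypes is not stated here, so this sketch assumes, purely for illustration, that it is distributed in proportion to population frequencies. The function name and arguments are hypothetical.

```python
def soften_observed_genotype(observed_index, population, p_observed=0.95):
    """Return a genotype distribution with p_observed on the observed genotype.

    Assumption of this sketch: the remaining mass (1 - p_observed) is shared
    among the other genotypes in proportion to their population frequencies.
    """
    rest = 1.0 - population[observed_index]
    if rest <= 0:
        raise ValueError("population must contain other genotypes")
    dist = [(1.0 - p_observed) * f / rest for f in population]
    dist[observed_index] = p_observed
    return dist


population = [0.5, 0.3, 0.2]
softened = soften_observed_genotype(0, population)
```

Because every genotype with nonzero population frequency keeps a nonzero probability, an inner product with another such distribution cannot be exactly zero.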

Permanent

In this section, we define the permanent and describe the equations that are important for explaining our method.
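As a point of reference, the standard definition of the permanent (the sum, over all permutations, of the product of one entry from each row and each column) can be written directly in code. The sketch below is a brute-force illustration only; the function name is an assumption, and the cost grows factorially with the matrix size, so it is usable only for very small matrices.

```python
from itertools import permutations
from math import prod


def permanent_bruteforce(a):
    """Permanent of a square matrix given as a list of rows.

    Sums, over every permutation s of the column indices, the product
    a[0][s[0]] * a[1][s[1]] * ... (no sign factor, unlike the determinant).
    """
    n = len(a)
    return sum(
        prod(a[i][s[i]] for i in range(n))
        for s in permutations(range(n))
    )


example = permanent_bruteforce([[1, 2], [3, 4]])  # 1*4 + 2*3
```

Each term in the sum corresponds to one way of pairing every row with a distinct column, which is why the permanent equals the total weight of all perfect matchings in the corresponding bipartite graph.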

The permanent of a matrix

Each term expresses the likelihood, or weight, of a permutation. The permanent of _{–i,–j} can be expressed by _{i,j} as follows:

The numerator is the sum of weights of permutations in which _{i,j} is included. Here, perm(_{–i,–j}) is given by canceling _{i,j} in all terms, and perm(_{–i,–j} and _{i,j} as follows:

Matrix

Assigning each of _{i} and _{σ (i)} are paired with each other. That is, if a permutation _{i} and _{j} are always paired with each other. Here, we refine the definition of _{i,j} as the ratio of the sum of the likelihoods of _{i,j} as follows:

A matrix _{i,j} for all pairs of

Because both the numerator and the denominator of Eq. (4) include the permanent, we apply the following strategy to minimize estimation error. Let _{i,j}) denote an _{w} × _{w} matrix consisting of the numerator of Eq. (4). We avoid using perm(

Here,

Determining assignment

We determine assignment based on a matrix _{i} and _{j} to be identical if _{i,j} > _{i} and _{j} are _{i} and _{j} are _{i,j} > _{i,j} ≤

When the value of _{i}, when _{i,j} > _{w} × _{w} matrix consisting of the numerator of Eq. (4), are equal to perm(_{j}.

In other words, a matrix

The appropriate cutoff value depends on the situation and on the criteria conventionally accepted in a society. Typically, an LR of 10^{6} is used in the LR method, which is equivalent to a probability of 0.999999. However, the LR method does not give the true probability of making a mistake, because it considers only two possibilities: 1) the body is that of the MP, and 2) the body is randomly drawn from the population. By contrast, one minus the probability given by the permanent method is the probability of making a wrong assignment, conditional on the entire dataset. We therefore consider that the permanent method does not need to follow the conventional cutoff. After close discussion with a forensic expert, we judged that expecting one mistake in 1000 assignments is stringent enough to discuss the utility of the permanent method.
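The assignment rule can be sketched as a simple threshold on the conditional probability matrix. The function name, the list-of-rows representation, and the example values below are assumptions for illustration; the default cutoff of 0.999 is the threshold value used in this paper.

```python
def assign_pairs(q, cutoff=0.999):
    """Return (body index, MP index) pairs whose conditional probability
    exceeds the cutoff; all other pairs are treated as suspended."""
    return [
        (i, j)
        for i, row in enumerate(q)
        for j, value in enumerate(row)
        if value > cutoff
    ]


# Hypothetical conditional probability matrices (rows: bodies, columns: MPs).
confident = [[0.9999, 0.0001], [0.0001, 0.9999]]
ambiguous = [[0.6, 0.4], [0.4, 0.6]]

assigned_confident = assign_pairs(confident)
assigned_ambiguous = assign_pairs(ambiguous)
```

Because the conditional probabilities for one body sum to 1 over all MPs, any cutoff of 0.5 or higher guarantees that at most one MP is assigned to each body.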

The permanent method is illustrated by Figure


**An example of output of the permanent method.** The matrix on the left represents a matrix _{1} being _{1}, or _{1,1}, is 0.9999, and therefore, _{1} is assigned to _{1}. The identity of _{1} cannot be _{2} or _{3}, and it can be _{4} but not probable. Note that values in a column and row in the matrix

We summarize the permanent method here so that the advantage of the permanent method would be clear. Assuming that we are interested in the probability that the identity of body 1 (_{1}) is MP 1 (_{1}). _{1,1} is the probability that _{1} has the genotype of _{1} given the reference DNA for _{1}, and relationship between _{1} and the reference person. Therefore, _{1,1} is independent of genotypes of other bodies and reference DNAs of other MPs. A likelihood of a possible hypothesis of pairing _{w} bodies and _{w} MPs is calculated by multiplying elements in matrix _{1,1} is given by the proportion of the sum of the likelihoods of the hypotheses in which _{1} and _{1} are paired to the sum of the likelihoods of all possible hypotheses. By using matrix _{1} corresponding to _{1} influences the probabilities of _{1} corresponding other MPs. For example, in matrix _{1,1} = 0.95, and _{1,2} = 0.98; _{1,2} is only slightly higher than _{1,1}, so we would not determine the identity of _{1}. Matrix _{1,1} = 0.0001 and _{1,2} = 0.9999 depending on other elements in matrix _{2,1} is much higher than _{1,1}, for example _{2,1} = 0.9999999999, and all the other elements in the first column is 0 or very small. In this case, matrix _{2} is assigned to _{1}, and therefore, _{1} must not be _{1}, and therefore, _{1} is assigned to _{2}.
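The summary above can be illustrated with a small numerical sketch. The names `p`, `q`, `minor`, and `conditional_matrix` are assumptions of this example, the permanent is computed by brute force rather than by the approximation used in the paper, and the 2 × 2 matrix is hypothetical: it is loosely modeled on the values in the paragraph above, with the remaining entry set to an arbitrary small value. Each conditional probability is computed as the share of the total likelihood carried by the pairing hypotheses that include the pair in question.

```python
from itertools import permutations
from math import prod


def permanent(a):
    """Brute-force permanent; adequate only for very small matrices."""
    n = len(a)
    return sum(prod(a[i][s[i]] for i in range(n)) for s in permutations(range(n)))


def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [
        [v for c, v in enumerate(row) if c != j]
        for r, row in enumerate(a)
        if r != i
    ]


def conditional_matrix(p):
    """Entry (i, j): summed likelihood of all pairings that match body i
    with MP j, divided by the summed likelihood of all pairings."""
    total = permanent(p)
    if total == 0:
        raise ValueError("no pairing hypothesis has nonzero likelihood")
    n = len(p)
    return [
        [p[i][j] * permanent(minor(p, i, j)) / total for j in range(n)]
        for i in range(n)
    ]


# Hypothetical input: body 1 fits MP 1 (0.95) and MP 2 (0.98) about equally,
# but body 2 fits MP 1 almost perfectly and MP 2 hardly at all (assumed 1e-6).
p = [[0.95, 0.98], [0.9999999999, 1e-6]]
q = conditional_matrix(p)
```

In this example the conditional probability for body 1 and MP 2 is above 0.999 even though the two pairwise values for body 1 are nearly equal, because almost all of the total likelihood comes from the hypothesis in which body 2 is MP 1. Every row and every column of the resulting matrix sums to 1.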

Likelihood ratio and posterior probability

The LR of _{i} corresponding to _{j} is defined as _{i,j} divided by the genotype frequency of _{i} in a population,

The numerator of Eq. (6) corresponds to the likelihood that hypothesis _{1} is true (the DNA of an unidentified body is that of a particular MP), and the denominator corresponds to the likelihood that _{2} is true (the DNA is randomly sampled from a population). Therefore, the numerator is equal to _{i,j} defined in the permanent method. The difference between the two methods becomes apparent after we calculate matrix _{1,j}, _{w} are divided by the genotype frequency of _{1}.

In multiple-fatality cases, the prior odds are commonly taken to be 1/(_{w} – 1), and according to Bayes' theorem, the posterior odds are given by the product of LR and the prior odds, _{w} – 1). We use the posterior odds,
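The relationships among LR, prior odds, posterior odds, and probability can be sketched as follows. The function and argument names are assumptions for illustration; the formulas are the ones stated in the text (LR as the identity probability divided by the genotype frequency, and prior odds of one over the number of victims minus one), together with the usual conversion between odds and probability.

```python
def likelihood_ratio(p_identity, genotype_frequency):
    """LR of a body-MP pair: identity probability over population frequency."""
    return p_identity / genotype_frequency


def posterior_odds(lr, n_victims):
    """Posterior odds = LR x prior odds, with prior odds 1 / (n_victims - 1)."""
    return lr / (n_victims - 1)


def odds_to_probability(odds):
    return odds / (1.0 + odds)


def probability_to_odds(probability):
    return probability / (1.0 - probability)


# Hypothetical numbers: an LR of 1000 in an incident with 21 victims.
example_odds = posterior_odds(1000.0, 21)
# An odds value of 10^6 corresponds to a probability of about 0.999999.
example_probability = odds_to_probability(1e6)
```

Converting both test values to the same scale in this way is what allows a single cutoff to be applied to the two methods.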

Assessment of performance

Probability and odds are interchangeable, and therefore we can use the same assignment criterion, _{i,j} and

In our case, sensitivity is defined as the ratio of the number of assigned truly identical pairs to the total number of truly identical pairs, and specificity is defined as the ratio of the number of suspended truly non-identical pairs to the total number of truly non-identical pairs. AUC is calculated by summing the areas of trapezoids formed between two adjacent cut-off points (trapezoidal method), and its confidence interval (CI) is estimated by the DeLong method
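The definitions above can be sketched in code. This is an illustrative implementation of sensitivity, specificity, and the trapezoidal AUC only; the function names and the toy scores are assumptions, and the DeLong confidence interval is not included.

```python
def sensitivity_specificity(scores, labels, cutoff):
    """labels[k] is True for a truly identical pair; a pair is assigned
    when its score exceeds the cutoff and suspended otherwise."""
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    assigned_identical = sum(1 for s, y in zip(scores, labels) if y and s > cutoff)
    suspended_nonidentical = sum(
        1 for s, y in zip(scores, labels) if not y and s <= cutoff
    )
    return assigned_identical / n_pos, suspended_nonidentical / n_neg


def auc_trapezoid(scores, labels):
    """Area under the ROC curve by summing trapezoids between adjacent cutoffs."""
    cutoffs = sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for c in cutoffs:
        se, sp = sensitivity_specificity(scores, labels, c)
        points.append((1.0 - sp, se))  # (false positive rate, sensitivity)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data: two identical pairs and two non-identical pairs.
scores = [0.9, 0.4, 0.6, 0.2]
labels = [True, True, False, False]
auc = auc_trapezoid(scores, labels)
se, sp = sensitivity_specificity(scores, labels, 0.5)
```

In the toy data, three of the four (identical, non-identical) score comparisons are ordered correctly, which matches the AUC of 0.75 obtained from the trapezoids.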

In the LR method, two or more MPs can be assigned to a body by any cutoff

Simulating a population

We performed a computer-based simulation to assess the performance of the permanent method for DNA profiles of the 15 STR loci available in the ABI AmpFlSTR Identifiler® PCR Amplification Kit (Applied Biosystems, Foster City, CA, USA). Theoretically, the performance of both the permanent and the LR methods increases with the number of markers. However, both methods assume that markers are independent, which limits the number of markers that can be used, and identification based on a large number of loci has not yet been applied in practice. The amelogenin locus was excluded from the study because it is used for sex determination. Although two loci are located on chromosome 2 and another two on chromosome 5, we assumed that all 15 loci are independent, because the influence of the recombination rate is not an issue when assessing the performance of the methods. We used previously reported allele frequencies in a Japanese population

DNA profiles of 12,500 families were simulated for the 15 STR loci. Simulated pedigree trees are shown in Figure


**An example of simulated family pedigrees.** All simulated families have this form of tree. Gender patterns in generations II and III may differ.

Generating families with status patterns

For each individual, we set one of three statuses, _{i} is 1 with all the others 0. A vector _{j}, which is to be estimated from the reference DNA, was computed by an algorithm and program implemented by the authors

We calculated a matrix ^{17}

| Family type | M | T | Permanent: AUC (95% CI)^{a} | Permanent: Mean (SD)^{b} | Permanent: Se/sp^{c} | LR: AUC (95% CI)^{a} | LR: Mean (SD)^{b} | LR: Se/sp^{c} | P |
|---|---|---|---|---|---|---|---|---|---|
| 1 | III-1 | III-2 | 1.000 (1.000-1.000) | 1.000 (0.000) | 0.990/1 | 0.999 (0.998-0.999) | 0.999 (0.001) | 0.445/1 | 1.6E-05 |
| 2 | I-1 | II-3,III-1 | 0.999 (0.998-0.999) | 0.988 (0.005) | 0.285/1 | 0.976 (0.969-0.984) | 0.976 (0.021) | 0.045/1 | 1.2E-10 |
| 3 | I-1 | III-1 | 0.982 (0.977-0.987) | 0.982 (0.020) | 0.038/1 | 0.944 (0.932-0.956) | 0.945 (0.030) | 0.005/1 | 1.9E-13 |
| 4 | III-1 | I-1 | 0.982 (0.977-0.987) | 0.982 (0.020) | 0.035/1 | 0.944 (0.932-0.956) | 0.945 (0.030) | 0.005/1 | 2.1E-13 |
| 5 | II-1 | III-1 | 0.975 (0.969-0.980) | 0.974 (0.020) | 0.030/1 | 0.933 (0.920-0.947) | 0.933 (0.029) | 0.010/1 | 1.2E-16 |
| 6 | III-1 | II-1 | 0.974 (0.969-0.980) | 0.974 (0.021) | 0.032/1 | 0.933 (0.920-0.947) | 0.933 (0.029) | 0.010/1 | 1.3E-16 |
| 7 | III-1 | I-1,I-3 | 1.000 (1.000-1.000) | 0.999 (0.002) | 0.673/1 | 0.988 (0.984-0.992) | 0.988 (0.011) | 0.118/1 | 2.1E-08 |
| 8 | I-1 | I-3,I-4,III-1 | 0.995 (0.993-0.997) | 0.994 (0.009) | 0.168/1 | 0.962 (0.953-0.972) | 0.962 (0.020) | 0.033/1 | 2.8E-15 |
| 9 | I-1 | I-3,III-1 | 0.990 (0.987-0.993) | 0.989 (0.014) | 0.078/1 | 0.951 (0.940-0.961) | 0.951 (0.026) | 0.015/1 | 9.6E-20 |
| 10 | III-1 | III-3 | 0.821 (0.799-0.843) | 0.818 (0.074) | 0.000/1 | 0.788 (0.764-0.812) | 0.788 (0.068) | 0.000/1 | 2.5E-08 |
| Mix | | | 0.990 (0.987-0.993) | 0.988 (0.013) | 0.188/1 | 0.959 (0.950-0.968) | 0.958 (0.022) | 0.055/1 | 9.6E-15 |

^{a}Area under curve (AUC) of pooled results of 20 datasets for each family type (estimated 95% confidence interval (CI)).

^{b}Mean of AUC (standard deviation (SD)) of 20 datasets for each family type.

^{c}Sensitivity and specificity at threshold 0.999.

Data for ROC analyses

For comparison with the LR method, we assumed a complete dataset, i.e., _{w} = _{b} = _{m}. Complete data are required to compare two methods under the same condition because the LR method does not take unavailable data into account. We generated 20 datasets, each of which consisted of 20 families randomly drawn from the pooled families. Uniform data were generated by assigning the same status pattern to all 400 families, 20 datasets × 20 families. Because we used 10 types of pedigrees, there were 200 datasets, 10 types × 20 sets, in total. We used the same 400 families for all family types, so that performance could be compared using a set of families derived from the same set of founders. For comparison in a more realistic situation, 20 mixed datasets were generated by assigning status patterns randomly chosen from the 10 types with an equal probability to the same 400 families. The same sets of families were used again for the same reason. To summarize the results of 20 datasets, we pooled 8,000 (20 datasets × 20 bodies × 20 families) values of matrix

For assessing the influence of unavailable data, 20 families were randomly drawn from the pooled families. The bodies of MPs of these families were assumed to have been recovered, and we call this data _{b} < _{m} = _{w}, 2) _{m} < _{b} = _{w}, and 3) _{b}, _{m} < _{w}. For situation 1, family data were added to the complete part, and body data that were unavailable were completed using _{*}. For situation 2, body data were added, and family data that were unavailable were completed via _{*}. For situation 3, equal numbers of families and bodies were added, and the remainder were completed with _{*}. A matrix

Approximation of the permanent of the square matrix

We employed an algorithm to approximate a permanent of a nonnegative square matrix, as described by Huber and Law ^{–2}ln(2/

Assessment of approximation accuracy and computation time

To assess accuracy and computation time in our case, test matrices were generated using values obtained from a matrix

Mean computation time was obtained by approximating three times for each matrix size and for the same combinations of parameters.

Computation environment and software

Genotype simulation, computation of conditional probabilities and LRs, evaluation of performance, and assessment of accuracy and processing time were performed with R v2.13.1 _{j}, a matrix

Results

Comparison of permanent method and LR-based method

First, we assessed performance on uniform datasets to compare performance for each family type. Figure


**Distributions of conditional probabilities of permanent method and posterior odds of LR method.** Distributions of identical pairs and non-identical pairs are shown in red and blue, respectively. Minimal and maximal values of the distributions are shown for non-identical pairs (left) and for identical pairs (right). Probabilities obtained with permanent method are shown as odds. Values are obtained from 20 uniform datasets of either type 1 **(A)**, type 4 **(B)**, type 10 **(C)**, or mixed **(D)**.

The histograms indicate that the distributions of identical pairs and non-identical pairs overlap less when family types 1 and 4 were tested with the permanent method as compared with the LR method. Notably, when tested with the permanent method, the distributions of identical pairs and non-identical pairs show no overlap for family type 1, or sibship tests. Therefore, the two distributions are more clearly separated when the permanent method is used. For family type 10, or between-cousins test, however, the distributions do not differ much between the two methods. Figure


**ROC curves of pooled results obtained from 20 datasets for each family type.** **(A)** Uniform data of type 1. **(B)** Uniform data of type 4. **(C)** Uniform data of type 10. **(D)** Mixed data. Discriminant performance was compared between the permanent method (solid line) and the LR method (dashed line). AUC (95% confidence interval (CI)) and

**Distributions of conditional probabilities of permanent method and posterior odds of LR method.** Distributions of identical pairs and non-identical pairs are shown in red and blue, respectively. Probabilities obtained with the permanent method are shown as odds. Values are obtained from 20 uniform datasets of family types 2, 3, 5, 6, 7, 8, or 9. Family types are defined in Table


**ROC curves of pooled results obtained from 20 datasets for each family type.** ROC curves of test results for family types 2, 3, 5, 6, 7, 8, and 9 are shown. Discriminant performance was compared between the permanent method (solid line) and the LR method (dashed line). AUC (95% confidence interval (CI)) and


**Supplementary tables (Table S1 – S6).** **Table S1.** Results of the ROC analysis for each dataset for the uniform-pedigree analysis. **Table S2.** Counts of family types that were randomly sampled for the mixed-pedigree analysis. **Table S3.** Results of the ROC analysis for each dataset for the mixed-pedigree analysis. **Table S4.** Counts of family types in the complete part and additional parts that were randomly sampled for the analysis of the incomplete datasets. **Table S5.** Results of the ROC analysis for each dataset for the analysis of the incomplete datasets. **Table S6.** Acceptance numbers corresponding to values of


Next, we assessed performance on mixed datasets for comparison in a more practical situation. Counts of simulated family types included in each dataset are listed in Table S2 (see Additional file


**Specificity and sensitivity resulting from the mixed datasets.** **(A)** Specificity is plotted against threshold values. **(B)** Sensitivity is plotted against -log_{10}(1-

ROC analysis indicated some difference between the two methods in terms of overall performance. We also compared differences in the judgment of each pair. Figure

• top right: assigned by both methods,

• top left: assigned by LR method and suspended by permanent method,

• bottom left: suspended by both methods, and

• bottom right: assigned by permanent method but suspended by LR method.


**Posterior odds of LR method plotted against conditional probabilities of permanent method for mixed datasets.** Distribution of test values of the permanent method and the LR method resulting from the mixed pedigree datasets is shown for identical pairs (black), and non-identical pairs (gray). Probabilities obtained with the permanent method are shown as odds. Dashed lines indicate the level equivalent to 0.999. Top-right and bottom-left regions indicate that the judgments of the two methods were consistent, and top-left and bottom-right regions indicate that the judgments between the two methods differed.

Top-right and bottom-left regions indicate that the judgments of the two methods were consistent, while the top-left and bottom-right regions indicate that the judgments between the two methods differed. Table

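| Judgment: Permanent | Judgment: LR | Non-identical (first cutoff) | Identical (first cutoff) | Non-identical (second cutoff) | Identical (second cutoff) |
|---|---|---|---|---|---|
| Suspend | Suspend | 7600 | 325 | 7600 | 357 |
| Suspend | Assign | 0 | 0 | 0 | 0 |
| Assign | Suspend | 0 | 53 | 0 | 31 |
| Assign | Assign | 0 | 22 | 0 | 12 |

Permanent: permanent method,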

Demonstration of how the permanent method performs for incomplete data

Here we demonstrate the performance of the permanent method on incomplete data compared with complete data. Table S4 lists the counts of family types drawn by random sampling in each dataset (see Additional file

| Data | Situation | Method | Additional data | AUC (SD)^{b} | Decrease from complete data (SD)^{c} | Se/sp^{b} |
|---|---|---|---|---|---|---|
| Complete^{d} | | Perm. | 0 | 0.9945 | | 0.20/1 |
| | | LR | 0 | 0.9651 | | 0.05/1 |
| Incomplete^{a} | 1 | Perm. | 4 | 0.9937 (0.0005) | 0.0008 (0.0005) | 0.06/1 |
| | | Perm. | 10 | 0.9913 (0.0015) | 0.0032 (0.0015) | 0.05/1 |
| | 2 | Perm. | 4 | 0.9928 (0.0018) | 0.0017 (0.0018) | 0.14/1 |
| | | Perm. | 10 | 0.9908 (0.0017) | 0.0036 (0.0017) | 0.09/1 |
| | 3 | Perm. | 4 | 0.9920 (0.0016) | 0.0025 (0.0016) | 0.05/1 |
| | | Perm. | 10 | 0.9882 (0.0048) | 0.0063 (0.0048) | 0.05/1 |

^{a}Data shown for incomplete cases were results of permanent method.

^{b}For incomplete datasets, the complete parts with the same 20 victims were used to calculate AUC. Sensitivity and specificity were calculated from all entries in both the complete part and the incomplete part. Mean values of 10 datasets for each situation and each number of additional data are shown for AUC.

^{c}Mean decrease from AUC of complete data measured with the permanent method.

^{d}Data shown for complete cases were obtained using only a complete part with the permanent method (Perm.) and LR method.

Accuracy parameters of approximation and computation time

We investigated the distribution of the estimates of the permanent in the setting of DNA identification. Figure ^{0.09} = 1.23 and 10^{0.08} = 1.20, respectively (Figure


**Accuracy and computation time of approximation of permanent.** **(A)** Distribution of 100 estimates of a permanent for each matrix size 20 (top) and 30 (bottom). Acceptance number **(B)** Computation time and acceptance number. CPU time (seconds) is plotted against the acceptance number for matrices of size 10 (circle), 20 (triangle), 30 (square), and 40 (cross). Dashed lines indicate the acceptance number,

**Conditional probabilities obtained with permanent method for assumed exact results and worst-scenario results.** Conditional probabilities obtained from worst-effect approximation errors are plotted against those obtained from assumed exact computation of permanent for mixed datasets (shown in a log_{10} scale).


Discussion

The identification of unidentified bodies is an important issue. Developments in DNA-based techniques have provided strong evidence for personal identification. Despite progress in genotyping, challenging cases are still encountered in which LR fails to provide strong evidence because the kinship between the sample and the reference DNA is not close. If this occurs in a single-fatality case, DNA would not be the first-line evidence. In multiple-fatality cases, approaches optimized for such cases may overcome this problem to some extent. LR-based identification is optimal when comparison is possible only with a random person from a population, i.e., in a single-fatality case. In multiple-fatality cases, however, comparison can also be performed with other DNA obtained from the same incident. Currently, identification in multiple-fatality cases is based on a conventional LR weighted by prior odds. Although the LR method can take the total number of victims into account via the prior odds, it still compares each pair only with the population, even though the possible identity of a body can be limited to one of the MPs. The new approach we have described in this paper considers the identification problem as an assignment problem and provides the probability for a particular pair of sample and reference DNA conditional on the entire observed data. Because the permanent method considers all assignment hypotheses, it can reasonably be expected that, given pairs assigned with high confidence, conditional probabilities of non-identical pairs decrease and those of other identical pairs increase. Thus, we expect the permanent method to be capable of discriminating between identical and non-identical pairs that cannot be clearly discriminated with LR.

In a separate study, the distribution of combined sibship index (CSI) with 15 STR loci was investigated, and it was found that the distribution of CSI for siblings overlapped with that for random pairs; it was also found that 1.3–1.6% of siblings had CSI less than 1, and 1.4–1.9% of random pairs had CSI greater than 1

The results of our computer simulation show that the performance of the permanent method is significantly higher than that of the LR method. In individual identification, high specificity is required even at the cost of sensitivity, because an incorrect identification is a serious error. Thus, we also focused on sensitivity and specificity at

In this paper, we used only cases in which references provide neither obligate alleles nor impossible alleles; that is, any genotypes are consistent without assuming mutation. This is because, when references do provide such alleles, the current genotyping system provides discriminating power strong enough to overcome the theoretical problem in the LR method, and both methods perform highly with almost no difference between them (data not shown). In the cases we examined, the advantage of the permanent method was maintained to some extent as kinship became more distant, but not as far as between cousins (although the ROC analysis indicated that the permanent method performed better in terms of AUC, the sensitivity and specificity at the practical threshold did not differ between the two methods). In practice, it is clear that we should attempt to collect reference DNA from relatives who are as closely related as possible, in order to obtain obligate alleles. If this is not possible, it is important to consider whether references provide impossible alleles. Consider a situation in which the DNA of the parents of an MP is not available and that of all four grandparents is available, but for financial reasons only two persons can be typed. Two grandparents on the same side may provide impossible alleles, but two grandparents from different sides do not. In our simulation, these two situations led to a substantial difference in results; when two grandparents on the same side were used, both the permanent and the LR method were more powerful in determining the identities (data not shown).

Importantly, the permanent method can take uncertainty into account. Two situations lead to uncertainty: the presence of unavailable data, and an unknown number of victims. As simulated in this paper, the former can be handled by the permanent method. Although the latter is not demonstrated, the same strategy can be applied: we would set an expected total number of victims and complete the unavailable data with _{*}. As we do not know the exact number of victims, the expected number would be arbitrary to some extent. Judgment in the LR method does not change no matter how much data are unavailable, whereas assignment by the permanent method becomes more difficult as more data are unavailable. Therefore, the test value of the permanent method appears to reflect our assignment confidence better than that of the LR method.

Although the theoretical basis of our approach is reasonable and adequate, the approach is computationally challenging, because the permanent method requires estimation of the permanents _{w} × _{w} times. Computation time is likely to be the primary limitation. The most efficient algorithm for the exact permanent requires Θ(n2^{n}) arithmetic operations. It has been proven that exact computation of the permanent is a #P-complete problem, even for 0,1 matrices, and thus computation in polynomial time is not considered possible. Approximation algorithms with running times of O(n^{4} log n), O(n^{10} (log n)^{3}), and O(n^{7} (log n)^{4}) have been reported
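The exponential cost of the exact permanent can be illustrated with a minimal sketch of Ryser's inclusion-exclusion formula. This plain version needs on the order of n^2 2^n operations and is practical only for small matrices. The second function shows how a pairing probability could be obtained from two permanents, under the assumptions that each matrix entry is the likelihood of one sample-reference pairing and that all complete assignments are equally likely a priori. Both function names are hypothetical and are not taken from our program.

```python
from itertools import combinations


def permanent_ryser(a):
    """Exact permanent of a square matrix (list of rows) by Ryser's formula."""
    n = len(a)
    if n == 0:
        return 1.0  # the permanent of the empty matrix is 1 by convention
    total = 0.0
    for k in range(1, n + 1):
        sign = (-1) ** (n - k)
        for cols in combinations(range(n), k):
            prod = 1.0
            for row in a:
                prod *= sum(row[j] for j in cols)  # row sum restricted to the column subset
            total += sign * prod
    return total


def pair_probability(a, i, j):
    """Probability that row i is assigned to column j, given the whole matrix.

    Assumes a[i][j] is a pairing likelihood and a uniform prior over assignments.
    """
    total = permanent_ryser(a)
    if total == 0:
        raise ValueError("no complete assignment has nonzero likelihood")
    # Delete row i and column j to obtain the minor.
    minor = [[v for c, v in enumerate(row) if c != j]
             for r, row in enumerate(a) if r != i]
    return a[i][j] * permanent_ryser(minor) / total


toy = [[1.0, 2.0], [3.0, 4.0]]
per_toy = permanent_ryser(toy)       # 1*4 + 2*3
p00 = pair_probability(toy, 0, 0)    # 1*4 / per_toy
p01 = pair_probability(toy, 0, 1)    # 2*3 / per_toy
```

Because every complete assignment places row 0 in exactly one column, the probabilities along a row sum to 1, which provides a simple consistency check.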

Because the permanent method depends on an algorithm that approximates the permanent of a matrix, we must describe algorithm-specific considerations. The Huber–Law algorithm defines two accuracy parameters

To reduce computation time, a Monte Carlo method can be applied where the total number of iterations is set instead of
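As an illustration of a Monte Carlo estimate with a fixed number of iterations, the following minimal sketch uses a simple sequential importance sampling estimator of the permanent. It is not the Huber–Law algorithm and offers no accuracy guarantee for a given number of iterations; it is shown only because it is short and unbiased for matrices of nonnegative entries. The function name and parameters are hypothetical.

```python
import random


def permanent_sis(a, n_iter, seed=None):
    """Estimate the permanent of a nonnegative square matrix from n_iter random assignments."""
    rng = random.Random(seed)
    n = len(a)
    total = 0.0
    for _ in range(n_iter):
        unused = list(range(n))  # columns not yet assigned
        weight = 1.0
        for row in a:
            weights = [row[j] for j in unused]
            s = sum(weights)
            if s == 0:
                weight = 0.0  # dead end: this draw contributes nothing
                break
            weight *= s
            # Choose a column with probability proportional to its entry.
            j = rng.choices(unused, weights=weights, k=1)[0]
            unused.remove(j)
        total += weight
    return total / n_iter


estimate = permanent_sis([[1.0, 2.0], [3.0, 4.0]], n_iter=20000, seed=1)
```

Each draw selects a permutation with probability equal to its product of entries divided by the draw's weight, so the expected weight equals the permanent. The variance of the weights, and hence the number of iterations needed, depends strongly on the matrix.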

Conclusions

The permanent method provides further evidence for identification in terms of conditional probability. We have shown that this method is capable of detecting identical pairs with low LR and is highly robust in terms of specificity. With the two methods used in combination, DNA-based identification may exhibit higher performance. It is also important that the permanent method is capable of weighting the presence of unavailable data, unlike the LR method. Currently, the permanent method is computationally limited to relatively small datasets obtained from closed incidents.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

MN participated in the design of the study, wrote the program, carried out the computer simulation and the statistical analysis, and drafted the manuscript. KT conceived of the study and supervised the forensic aspect of the study. RY developed the algorithm, wrote the program, designed the study, and helped to draft the manuscript. All authors read and approved the final manuscript.

Acknowledgements

We began this study in response to a historic disaster, the 2011 earthquake off the Pacific coast of Tōhoku and the ensuing tsunami. We pray for the victims and for recovery from the destruction.