Graduate School of Informatics, Kyoto University, Yoshida, Kyoto 606-8501, Japan

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho, Uji, Kyoto 611-0011, Japan

Abstract

Background

Enumeration of chemical graphs satisfying given constraints is a fundamental problem in chemoinformatics and bioinformatics because it leads to a variety of useful applications, including structure determination of novel chemical compounds and drug design.

Results

In this paper, we consider the problem of enumerating all tree-like chemical graphs from a given set of feature vectors, which is specified by a pair of upper and lower feature vectors, where a feature vector represents the frequency of prescribed paths in a chemical compound to be constructed. This problem can be solved by applying the algorithm proposed by Ishida

Conclusions

Our proposed algorithm is useful for enumerating tree-like chemical graphs with given upper and lower bounds on path frequencies.

Introduction

Development of novel drugs is one of the major goals in chemoinformatics and bioinformatics. To achieve this purpose, it is important not only to investigate common chemical properties over chemical compounds having common structural patterns

In the field of machine learning, the

To enumerate tree-like chemical graphs, Fujiwara

In this paper, we are given a set of feature vectors, which is specified by a pair of upper and lower feature vectors, and enumerate all tree-like chemical graphs satisfying one of the vectors. It seems that this can be done by simply applying the algorithm proposed by Ishida

Methods

Preliminaries and problem formulation

A graph is called a _{0}, _{1}, _{1}, _{2}, _{2}, …, _{k}_{k}_{i}_{j}_{j – 1} and _{j}_{0}, _{1}, …, _{k}_{1},_{2}, …,_{s}_{+}. A multigraph _{0}, _{1}, …, _{k}_{0}), _{1}), …, _{k}_{K}_{+}) of _{K}


**A chemical compound and its feature vector**. An illustration of a (Σ, _{1}(

Let

Enumeration of Tree-like chemical graphs with given Path Frequency (ETPF)

Given a set Σ of labels, a valence function _{+} and a feature vector _{K}
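As an informal illustration of what a path-frequency feature vector records, the following Python sketch counts the label sequences of all paths with at most K edges in a small tree-like graph. The function name `path_frequencies`, the dictionary-based input format, and the convention that a path and its reverse count as two occurrences are assumptions made for this example, not details taken from the formulation above; bond multiplicities are ignored.

```python
from collections import Counter


def path_frequencies(labels, edges, K):
    """Count label sequences of simple paths with at most K edges.

    labels: dict mapping vertex id -> atom label (e.g. "C")
    edges:  list of (u, v) pairs; the graph is assumed to be a tree
    K:      maximum number of edges in a counted path
    Assumption: each direction of traversal is counted separately.
    """
    adj = {v: [] for v in labels}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    freq = Counter()

    def extend(path):
        # Record the label sequence of the current path.
        freq[tuple(labels[v] for v in path)] += 1
        if len(path) - 1 == K:
            return
        for w in adj[path[-1]]:
            if w not in path:
                extend(path + [w])

    for v in labels:
        extend([v])
    return freq
```

For a hydrogen-suppressed C-C-O skeleton and K = 2, the sketch reports two single-vertex paths labeled C, one labeled O, and one occurrence each of the sequences (C, C, O) and (O, C, C).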

Observe that a large number of chemical compounds contain a high proportion of hydrogens. Based on this fact, another model can be considered in the problem ETPF by removing all hydrogen atoms. These two different models were proposed by Fujiwara

In this paper, we consider the problem of enumerating all tree-like chemical graphs based on given upper and lower feature vectors because we want to relax the feature vector constraint in the problem ETPF. For feature vectors _{1} and _{2} of level _{1} ≤ _{2} to be _{1}[_{2}[


**An instance of ETULF.** An instance of ETULF with upper and lower feature vectors, which admits two different solutions.

Enumeration of Tree-like chemical graphs with given Upper and Lower bounds on path Frequencies (ETULF)

Given a set Σ of labels, a valence function _{+} and feature vectors _{U}_{L}_{L}_{U}_{L}_{K}_{U}
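The acceptance test implied by this problem statement can be sketched as an entry-wise comparison: a candidate is a solution only if each path frequency lies between the corresponding lower and upper bounds. The helper name `within_bounds` and the sparse dictionary representation (missing entries are read as 0) are assumptions of this sketch.

```python
def within_bounds(f, f_lower, f_upper):
    """Return True if f_lower <= f <= f_upper holds for every label sequence.

    All three arguments map label sequences (tuples) to non-negative counts.
    Assumption: a sequence missing from a dictionary has frequency 0.
    """
    keys = set(f) | set(f_lower) | set(f_upper)
    return all(
        f_lower.get(t, 0) <= f.get(t, 0) <= f_upper.get(t, 0)
        for t in keys
    )
```

Under this reading, a candidate that contains a label sequence absent from the upper feature vector is rejected, because the implicit upper bound for that sequence is 0.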

For the problem ETULF, we assume that _{L}_{U}_{L}_{U}

Note that the number _{ℓ∈Σ}

Canonical representation of trees and the branching operation

In this section, we explain a canonical representation of trees introduced by Fujiwara

First of all, we introduce a root of a tree based on the following theorem.

**Theorem 1 (Jordan)**

Such a vertex
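For reference, Jordan's theorem is usually stated as follows: a tree with n vertices has either a unique vertex whose removal leaves components of at most ⌊(n − 1)/2⌋ vertices each (a unicentroid), or a unique edge whose removal leaves two components of exactly n/2 vertices each (a bicentroid). The Python sketch below locates such a root; the helper name `centroids` and the 0-based vertex numbering are assumptions of this example, which is not the authors' implementation.

```python
def centroids(n, edges):
    """Return the unicentroid [v] or the two ends [u, v] of the bicentroid.

    n:     number of vertices, numbered 0 .. n-1
    edges: list of (u, v) pairs forming a tree
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    # Iterative DFS from vertex 0 to obtain parents and a visiting order.
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    order, stack = [], [0]
    while stack:
        v = stack.pop()
        order.append(v)
        for w in adj[v]:
            if not seen[w]:
                seen[w] = True
                parent[w] = v
                stack.append(w)

    # Subtree sizes, accumulated from the leaves upward.
    size = [1] * n
    for v in reversed(order):
        if parent[v] != -1:
            size[parent[v]] += size[v]

    result = []
    for v in range(n):
        # Largest component left after deleting v.
        largest = n - size[v]
        for w in adj[v]:
            if w != parent[v]:
                largest = max(largest, size[w])
        if 2 * largest <= n:
            result.append(v)
    return result
```

A path on three vertices yields its middle vertex, whereas a path on four vertices yields the two ends of its middle edge.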

Next we introduce a canonical representation of trees that must be unique up to isomorphism. Let _{0} (which is not necessarily its unicentroid). Suppose that it is embedded in the plane as an ordered tree, where _{0} is located at the top part. Without loss of generality, let _{0}, _{1}, …, _{n – 1} be indexed by the depth-first search (DFS) that starts from _{0} and visits vertices from the left to the right. Define the _{0} to

Given an arbitrary order of labels, we define the order of depth-label sequences as follows. For any _{1} and _{2}, we denote _{1}) >_{2}) if _{1}) is _{2}). Then the

Thus our branching task is to list all centroid-rooted left-heavy trees with

Therefore we only need to enumerate the (leaf) nodes of
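The canonical representation discussed in this section can be illustrated with a short Python sketch. It assumes that the depth-label sequence lists (depth, label) pairs in depth-first, left-to-right order, and that a tree is left-heavy when the subtrees of every vertex are arranged in non-increasing lexicographic order of their sequences. The nested-tuple tree format and the names `depth_label_sequence`, `left_heavy` and `rank` are hypothetical and serve only this example.

```python
def depth_label_sequence(tree, depth=0):
    """List (depth, label) pairs in DFS order.

    tree: a pair (label, children), where children is a list of trees.
    """
    label, children = tree
    seq = [(depth, label)]
    for child in children:
        seq += depth_label_sequence(child, depth + 1)
    return seq


def left_heavy(tree, rank):
    """Reorder children so that larger subtree sequences come first.

    rank: dict giving the assumed total order of labels (label -> int).
    """
    label, children = tree
    kids = [left_heavy(child, rank) for child in children]
    # Siblings share the same depth, so comparing their sequences
    # measured from depth 0 gives the same relative order.
    kids.sort(
        key=lambda c: [(d, rank[l]) for d, l in depth_label_sequence(c)],
        reverse=True,
    )
    return (label, kids)
```

Two rooted trees that differ only in the order of their children are mapped to the same left-heavy tree, which is the property a canonical representation needs.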

Bounding operations

In this section, we explain how to check the validity of the current tree

**(C1)** The root of

**(C2) **

**(C3) **_{K}_{U}_{L}_{K}

**(C4) **

**(C5) **

(C1) and (C2) are the same as the work by Fujiwara

Feature-vector-cut procedure

In the problem ETULF, we cannot use the bounding operation proposed by Fujiwara

Let _{K}_{u}_{L}

If

In addition, if |

If
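A simplified reading of the upper-bound part of this cut is sketched below in Python. It relies on the observation that appending a vertex to a tree never removes an existing path, so a partial tree whose frequency already exceeds the upper feature vector in some entry cannot be extended to a solution. The function name `violates_upper` and the sparse dictionary representation are assumptions, and the sketch does not reproduce the full procedure of this subsection.

```python
def violates_upper(f_current, f_upper):
    """Return True if the partial tree can be pruned.

    f_current: path frequencies of the current (partial) tree
    f_upper:   upper feature vector; missing entries are read as 0
    Path frequencies only grow as vertices are added, so any entry
    that already exceeds its upper bound can never be repaired.
    """
    return any(count > f_upper.get(t, 0) for t, count in f_current.items())
```

A lower bound cannot be used in the same direct way, because an entry that is currently too small may still grow as the tree is extended.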

Detachment-cut procedure

This subsection describes the definition of detachment _{+}, an _{v}^{1}, ^{2} …, ^{r}^{(}^{v}^{)}}, so that each edge {^{i}_{u}^{j}_{v}^{i}^{i}^{i}^{j}_{u}_{v}

To obtain a chemical graph _{+}, an

which is necessary for all the edges incident to vertex ^{i}_{v}

which is a requirement that each vertex _{i}


**A multigraph and a ρ-detachment**. A multigraph

**Theorem 2 (Nagamochi)** _{+}

_{v}_{∈ }_{X}r

Ishida _{K}_{U}_{L}

Let _{1}, _{2}, …, _{s}_{U}_{L}^{≤ K + 1} → ℤ_{+} be feature vectors. Let _{0}, …, _{h}_{j}_{j}_{i}_{U}_{L}

We next introduce a vertex with a new label _{s+1} of valence _{U}_{U}_{U}_{U}_{1}, …, _{s}_{s}_{+1} | _{i}_{i}_{L}_{L}_{L}_{L}_{1}, …, _{s}_{s}_{+1} | _{i}_{i}_{i}_{j}_{i}_{j}


**Detachment-cut**. Bounding operation by detachment-cut, where vectors _{U}_{L}

Using _{U}_{L}

(a)

(b) _{U} – X_{U}_{U}_{U}

In the first condition, we check whether the number of remaining bonds is large enough to satisfy the lower feature vector constraint. In the second condition, we check whether _{U}

Multiplicity-cut procedure

This subsection describes a new bounding operation based on multiplicity for the problem ETULF. Let _{m}

On the other hand, if we treat a multiple edge as a simple one, the number of edges _{s}

which means that only
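As we read it, the counting argument behind this procedure compares the number of bonds counted with multiplicity against the number of edges of the underlying simple tree. The Python sketch below works this out for a given molecular formula under two assumptions that are ours: every valence is fully used, and the target graph is a tree on all atoms, so that it has exactly n − 1 simple edges. The function name `multiplicity_budget` is hypothetical.

```python
def multiplicity_budget(atom_counts, valence):
    """Return the number of extra bonds beyond one per tree edge.

    atom_counts: dict label -> number of atoms with that label
    valence:     dict label -> valence of that label
    Assumptions: all valences are saturated and the graph is a tree.
    """
    n = sum(atom_counts.values())
    total_valence = sum(valence[a] * k for a, k in atom_counts.items())
    if total_valence % 2 != 0:
        raise ValueError("the sum of valences must be even")
    bonds_with_multiplicity = total_valence // 2
    simple_edges = n - 1
    return bonds_with_multiplicity - simple_edges
```

For example, C2H6 leaves no extra bond, C2H4 leaves one (a single double bond), and C2H2 leaves two (one triple bond or two double bonds).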

Let _{e}

Now we describe the

Let _{0}, _{1}, …, _{k}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{0}, _{1},…, _{i}_{j}_{j – 1}), _{i}_{j}_{j – 1}) in _{i}_{j}_{j – 1} | _{i}_{j}_{j – 1}), _{i}

By the definition of _{i}

If _{i}_{i}


**Multiplicity-cut**. An illustration of the multiplicity-cut procedure, where

Results

This section reports the experimental results of our algorithm. First of all, we mention that the problem ETULF can be solved by applying the algorithm proposed by Ishida

We now compare the performance of the two algorithms SimEnum and RepEnum, and we also compare SimEnum with and without the multiplicity-cut. In addition, we tested SimEnum for several widths between the upper and lower feature vectors. Tests were carried out on a PC with an AMD Athlon Dual Core Processor 5050e, using instances based on chemical compounds selected from the KEGG LIGAND database

We define _{+} to be a _{U}_{L}_{U}_{U}_{L}_{L}
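To give a feeling for the instance sizes involved, the Python sketch below derives upper and lower feature vectors from a base vector and a width, and counts how many individual feature vectors lie between them. The construction (add and subtract the width on every entry of the base vector, clipping the lower bound at 0) and the names `bounds_from_width` and `count_feature_vectors` are assumptions of this example and may differ from the exact definition used in the experiments.

```python
from math import prod


def bounds_from_width(f, s):
    """Build (f_lower, f_upper) by widening every entry of f by s.

    Assumption: only entries present in f are widened, and lower
    bounds are clipped at 0.
    """
    f_lower = {t: max(count - s, 0) for t, count in f.items()}
    f_upper = {t: count + s for t, count in f.items()}
    return f_lower, f_upper


def count_feature_vectors(f_lower, f_upper):
    """Number of integer vectors f with f_lower <= f <= f_upper."""
    return prod(f_upper[t] - f_lower[t] + 1 for t in f_upper)
```

With width 1 and m positive entries, the count is 3^m; for m = 6 this is already 3^6 = 729, which indicates why running a single-vector algorithm once per feature vector quickly becomes impractical.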

Table

Comparison of previous method and our method

| Entry | Formula | | Level | Width | No. of feature vectors | SimEnum time (s) | SimEnum nodes | SimEnum solutions | RepEnum time (s) | RepEnum nodes | RepEnum solutions | RepEnum solved |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C00062 | C_{6}H_{14}N_{2}O_{4} | 26 | 1 | 1 | 3^{6} | 1037.04 | 177,074,686 | 414,890 | 163.32 | 44,340,488 | 414,890 | 729 |
| | | | 2 | 1 | 3^{18} | 2.97 | 392,246 | 44 | T.O. | 2,381,360,000 | N.F. | 65,909,572 |
| | | | 3 | 1 | 3^{34} | 1.22 | 145,213 | 2 | T.O. | 3,293,260,000 | N.F. | 96,860,588 |
| | | | 4 | 1 | 3^{53} | 0.33 | 34,539 | 1 | T.O. | 2,780,050,000 | N.F. | 81,766,176 |
| | | | 5 | 1 | 3^{71} | 0.24 | 20,361 | 1 | T.O. | 1,561,230,000 | N.F. | 45,918,529 |
| | | | 6 | 1 | 3^{85} | 0.25 | 15,166 | 1 | T.O. | 569,590,000 | N.F. | 16,752,647 |
| | | | 7 | 1 | 3^{96} | 0.18 | 14,547 | 1 | T.O. | 79,870,000 | N.F. | 2,349,117 |
| C03343 | C_{16}H_{22}O_{4} | 37 | 1 | 1 | 3^{6} | T.O. | 377,260,000 | N.F. | T.O. | 413,000,000 | N.F. | 460 |
| | | | 2 | 1 | 3^{18} | 7.24 | 845,760 | 25 | T.O. | 1,442,760,000 | N.F. | 70,175,902 |
| | | | 3 | 1 | 3^{31} | 2.81 | 307,151 | 7 | T.O. | 3,316,970,000 | N.F. | 195,115,882 |
| | | | 4 | 1 | 3^{47} | 1.03 | 99,945 | 1 | T.O. | 2,494,780,000 | N.F. | 146,751,764 |
| | | | 5 | 1 | 3^{64} | 0.98 | 87,600 | 1 | T.O. | 1,050,480,000 | N.F. | 61,792,941 |
| | | | 6 | 1 | 3^{82} | 0.76 | 60,194 | 1 | T.O. | 315,820,000 | N.F. | 18,577,647 |
| | | | 7 | 1 | 3^{99} | 0.57 | 42,538 | 1 | T.O. | 41,450,000 | N.F. | 2,438,235 |
| C07178 | C_{21}H_{28}N_{2}O_{5} | 46 | 1 | 1 | 3^{8} | T.O. | 157,320,000 | N.F. | T.O. | 200,490,000 | N.F. | 1,388 |
| | | | 2 | 1 | 3^{26} | 37.59 | 1,940,295 | 238 | T.O. | 2,911,390,000 | N.F. | 66,167,954 |
| | | | 3 | 1 | 3^{48} | 1.71 | 60,792 | 3 | T.O. | 2,673,940,000 | N.F. | 60,771,363 |
| | | | 4 | 1 | 3^{71} | 0.35 | 14,248 | 1 | T.O. | 1,925,490,000 | N.F. | 43,761,136 |
| | | | 5 | 1 | 3^{92} | 0.27 | 10,866 | 1 | T.O. | 743,940,000 | N.F. | 16,907,727 |
| | | | 6 | 1 | 3^{110} | 0.27 | 10,680 | 1 | T.O. | 93,880,000 | N.F. | 2,133,636 |
| | | | 7 | 1 | 3^{125} | 0.24 | 9,276 | 1 | T.O. | 19,270,000 | N.F. | 437,954 |
| C03690 | C_{24}H_{38}O_{4} | 61 | 1 | 1 | 3^{5} | T.O. | 382,470,000 | N.F. | T.O. | 552,290,000 | N.F. | 61 |
| | | | 2 | 1 | 3^{16} | T.O. | 211,800,000 | N.F. | T.O. | 530,930,000 | N.F. | 10,451,912 |
| | | | 3 | 1 | 3^{27} | 1395.13 | 144,244,042 | 206 | T.O. | 3,314,260,000 | N.F. | 194,956,470 |
| | | | 4 | 1 | 3^{41} | 121.36 | 11,332,363 | 4 | T.O. | 2,392,530,000 | N.F. | 140,737,058 |
| | | | 5 | 1 | 3^{57} | 83.70 | 6,978,557 | 2 | T.O. | 958,650,000 | N.F. | 56,391,176 |
| | | | 6 | 1 | 3^{75} | 40.11 | 2,923,819 | 1 | T.O. | 298,600,000 | N.F. | 17,564,705 |
| | | | 7 | 1 | 3^{92} | 16.50 | 1,096,128 | 1 | T.O. | 38,670,000 | N.F. | 2,274,705 |

Comparison of SimEnum and RepEnum for the problem ETULF.

Note:

(1) C00062, C03343, C07178, and C03690 are chemical compounds in the KEGG LIGAND database;

(2)

(3)

(4)

(5) _{v}

(6) “time (s)” is the CPU time in seconds;

(7) T.O. means “time over” (the time limit is set to be 1,800 seconds);

(8) “nodes” is (the sum of) the number of nodes of family trees that are traversed;

(9) “solutions” is the number of all possible solutions;

(10) “solved” is the number of feature vectors that the algorithm RepEnum solved within the time limit; and

(11) N.F. means “not found.”

**Comparison of multiplicity-cut** Comparison of SimEnum including multiplicity-cut and SimEnum not including multiplicity-cut for the problem ETULF. Note: (1) “add multiplicity-cut” is the algorithm SimEnum including multiplicity-cut; and (2) “no multiplicity-cut” is the algorithm SimEnum not including multiplicity-cut.


Table

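Comparison of varying width

| Entry | Formula | | Level | Width | SimEnum time (s) | SimEnum nodes | SimEnum solutions |
|---|---|---|---|---|---|---|---|
| C00062 | C_{6}H_{14}N_{2}O_{4} | 26 | 2 | 0 | 0.51 | 55,196 | 6 |
| | | | 2 | 1 | 3.58 | 400,501 | 44 |
| | | | 2 | 2 | 7.58 | 835,509 | 503 |
| | | | 2 | 3 | 10.84 | 1,163,548 | 2,351 |
| | | | 2 | 4 | 12.55 | 1,349,057 | 5,430 |
| | | | 2 | 5 | 13.29 | 1,431,075 | 9,852 |
| | | | 2 | 50 | 14.31 | 1,537,496 | 25,425 |
| C03343 | C_{16}H_{22}O_{4} | 37 | 2 | 0 | 0.34 | 35,952 | 9 |
| | | | 2 | 1 | 8.39 | 845,760 | 25 |
| | | | 2 | 2 | 48.27 | 4,815,369 | 41 |
| | | | 2 | 3 | 149.83 | 14,781,738 | 305 |
| | | | 2 | 4 | 377.01 | 37,435,878 | 40,732 |
| | | | 2 | 5 | 639.68 | 63,459,180 | 106,870 |
| | | | 2 | 50 | 1118.75 | 110,703,034 | 510,079 |
| C07178 | C_{21}H_{28}N_{2}O_{5} | 46 | 2 | 0 | 2.33 | 111,781 | 16 |
| | | | 2 | 1 | 46.81 | 2,246,578 | 238 |
| | | | 2 | 2 | 96.52 | 4,715,072 | 1,375 |
| | | | 2 | 3 | 152.18 | 7,420,060 | 6,824 |
| | | | 2 | 4 | 179.42 | 8,744,563 | 19,180 |
| | | | 2 | 5 | 199.66 | 9,677,513 | 29,891 |
| | | | 2 | 50 | 255.01 | 12,292,587 | 54,861 |
| C03690 | C_{24}H_{38}O_{4} | 61 | 5 | 0 | 19.50 | 1,482,017 | 2 |
| | | | 5 | 1 | 220.14 | 16,063,569 | 5 |
| | | | 5 | 2 | 439.12 | 33,037,741 | 32 |
| | | | 5 | 3 | 684.88 | 52,207,745 | 178 |
| | | | 5 | 4 | 1024.96 | 78,509,554 | 349 |
| | | | 5 | 5 | 1285.55 | 98,762,291 | 615 |
| | | | 5 | 50 | T.O. | 136,835,134 | N.F. |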

Comparison of the performance for varying

Here, we briefly discuss practical values on

Conclusions

We considered the problem of enumerating all tree-like chemical graphs from a given set of feature vectors, which is specified by upper and lower feature vectors based on frequencies of paths, and proposed a new exact branch-and-bound algorithm. Our experimental results show that our algorithm outperforms the naive algorithm based on a previous method. In comparison to the algorithm based on Ishida

However, the search space of the problem ETULF is much larger than that of the problem ETPF because of the upper and lower constraints, and our algorithm still visits many search nodes when solving ETULF. One direction for future work is to improve the bounding operations or to introduce a new one. Actually, in the feature-vector-cut procedure described above, information of a lower feature vector _{L}

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

HN gave the basic idea based on discussions with TA and MS. MS developed and implemented the algorithms, and carried out the experiments. MS, HN, and TA authored and approved the manuscript.

Acknowledgements

This work was partially supported by Grant-in-Aid #22240009 from MEXT, Japan.

This article has been published as part of