Department of Computer Science, San Francisco State University, San Francisco, CA 94132, USA

Open University Program, San Francisco State University, San Francisco, CA 94132, USA

Sandler Center for Drug Discovery, University of California, San Francisco, CA 94158, USA

Department of Chemistry and Biochemistry, San Francisco State University, San Francisco, CA 94132, USA

Small Molecule Discovery Center, University of California, San Francisco, CA 94158, USA

Department of Pathology, University of California, San Francisco, CA 94158, USA

Abstract

Background

Neglected tropical diseases, especially those caused by helminths, constitute some of the most common infections of the world's poorest people. Development of techniques for automated, high-throughput drug screening against these diseases, especially in whole-organism settings, constitutes one of the great challenges of modern drug discovery.

Method

We present a method for enabling high-throughput phenotypic drug screening against diseases caused by helminths with a focus on schistosomiasis. The proposed method allows for a quantitative analysis of the systemic impact of a drug molecule on the pathogen as exhibited by the complex continuum of its phenotypic responses. This method consists of two key parts: first, biological image analysis is employed to automatically monitor and quantify shape-, appearance-, and motion-based phenotypes of the parasites. Next, we represent these phenotypes as time-series and show how to compare, cluster, and quantitatively reason about them using techniques of time-series analysis.

Results

We present results on a number of algorithmic issues pertinent to the time-series representation of phenotypes. These include results on appropriate representation of phenotypic time-series, analysis of different time-series similarity measures for comparing phenotypic responses over time, and techniques for clustering such responses by similarity. Finally, we show how these algorithmic techniques can be used for quantifying the complex continuum of phenotypic responses of parasites. An important corollary is the ability of our method to recognize and rigorously group parasites based on the variability of their phenotypic response to different drugs.

Conclusions

The methods and results presented in this paper enable automatic and quantitative scoring of high-throughput phenotypic screens focused on helminthic diseases. Furthermore, these methods allow us to analyze and stratify parasites based on their phenotypic response to drugs. Together, these advancements represent a significant breakthrough for the process of drug discovery against schistosomiasis in particular and can be extended to other helminthic diseases, which together afflict a large part of humankind.

Background

Neglected tropical diseases (NTDs) constitute the most common infections of the world's poorest people. This class of diseases encompasses a number of infection categories including helminth infections (schistosomiasis, lymphatic filariasis, onchocerciasis), protozoan infections (leishmaniasis, Chagas' disease, African trypanosomiasis), bacterial infections (cholera, leprosy, bovine tuberculosis), viral infections (dengue fever, rabies, yellow fever), fungal infections (

Schistosomiasis in humans is caused by three major species of trematodes,

Modern drug discovery conventionally begins by identifying a molecular target (typically a protein or an enzyme) associated with a disease. Next, a large number of putative drug molecules are screened for activity against the target in in-vitro high-throughput screens (HTS) to identify "hits", which are passed on to later stages of the drug discovery pipeline for chemical optimization, optimization of the drug pharmacokinetics and pharmacodynamics, and ultimately clinical trials. The initial screening stage typically involves a very large number of molecules (hundreds of thousands to millions), since even small variations in structure can significantly influence activity against the target. Given this context, we note that HTS platforms for

The whole-organism screening approach differs from the conventional HTS-based strategy. HTS is built around in-vitro single-enzyme activity-based screens and single read-out cell-based assays, and involves a very large number of molecules that are tested in parallel using 96-, 384- or 1536-well plates. The differences between whole-organism screening and HTS lead to both advantages and disadvantages. A crucial advantage is that the effect of a drug molecule can be studied in terms of the cumulative systemic effects it introduces in the parasite, rather than just in terms of how it interacts with a specific protein or enzyme in isolation. That is, the effects of the drug on the totality of targets and pathways can be explored in whole-organism screens. This can be expected to reduce the possibility of late-stage attrition of hits found through such screens. On the other hand, whole-organism screens tend to be low-throughput and are not easily extended to HTS settings. This constrains, in terms of both diversity and density, the exploration of the chemical space during lead identification. Finally, as multi-cellular organisms, schistosomes display multiple and changing phenotypes in response to how compounds interfere with their normal biochemical functioning (see Figure

**Examples of phenotypes exhibited by the schistosomula;** (A) control (B) when exposed to the drug Lovastatin, and (C) when exposed to the drug Praziquantel (PZQ).

Problem characteristics and proposed solutions

An important long-term goal in developing drugs against NTDs in general, and schistosomiasis in particular, is the development of high-throughput whole-organism screening methods. In the following, we enumerate some of the key challenges in solving this problem and summarize the contribution of this paper towards addressing each of them:

_{50 }value) is over-simplistic when dealing with multi-cellular, complex macro-parasites that can manifest a variety of temporally varying phenotypes. The need to screen compound libraries based on quantification of complex phenotypic responses of pathogens is also underlined by the fact that a drug may not necessarily lead to immediate death yet nonetheless perturb the parasite's ability to survive, e.g., through disruption of the larval migration program, tegumental perturbations releasing antigens targeted by the immune system, or loss of the ability of adult worms to maintain position within the predilection site. As an example, the drug PZQ produces both tetanic paralysis of the musculature, resulting in loss of position, and tegumental damage, exposing surface proteins that then contribute to an immune system-mediated attack on the parasites. We propose an image analysis-based approach for automatic segmentation and tracking of parasites and computation of descriptors that capture phenotypic responses in terms of changes in parasite shape, appearance, and motion. These descriptors are represented as time-series and provide a multi-dimensional time-varying representation of parasite phenotypes.

Distinctions from prior research

The use of quantitative phenotyping in biology and drug discovery has occurred along two directions. The first of these involves the study of phenotypic variations in model organisms such as

The investigations and results presented in this paper extend the framework proposed by Singh et al. for automated phenotypic screening

Method

Parasite identification by image segmentation

In contrast to cellular segmentation, a topic that has received considerable attention in bio-image analysis, the problem of segmenting schistosomula in drug screens presents certain specific challenges.

In order to distinguish the background from the parasite, we modify and extend the region-based voting approach from _{1 }

In Eq.(1), _{max }_{min }

In the final step of the segmentation, morphological processing is employed to separate touching parasites. It begins by detecting the edges of the original image. For this, the Canny edge operator

**Illustration of the segmentation results and comparison with other methods.** **Top row**, from left to right: the original image (note that the bottom right region has a shadow), results of the region-based distributing function showing oversegmentation, relevant edges, and the image after filtering of debris and small regions. **Bottom row**, from left to right: final results with the proposed method after closing and filling holes in regions and separation of touching parasites, results obtained by mean-shift segmentation

Parasite tracking

The ability to analyze time-varying phenotypic response of parasites requires tracking each parasite across the entire video sequence. Given an initial segmentation, for each parasite, this involves establishing a correspondence between its positions in successive frames. Once the parasites are tracked across the video, their appearance, shape, and motion can be described quantitatively. In designing a tracking system for the parasites in HTS, the following challenges have to be addressed:

1. Robust handling of the erroneous or ambiguous segmentation of the parasites. Specifically, due to their tendency to mingle, errors in segmentation can result in clusters of parasites merging and splitting in a variety of ambiguous combinations (see Figure

(a) Parasites can be located in close proximity to each other in manners that lead to segmentation errors. (b) Bipartite graph describing the splitting of a blob (containing two parasites) into two blobs containing a single parasite each. (c) Group of four parasites, erroneously assigned to a single blob after segmentation. Analysis of the various combinations of intensity-boxes leads to the recovery of one of the four parasites from this blob.

2. Precision in defining individual parasites, so that the phenotypes can be accurately measured over the entire duration of observation. This is especially important since we plan to use the entire phenotypic response of the parasite in our analysis.

3. Accounting for the unique motion characteristics of the parasites; unlike many problems in vision-based tracking where the object being tracked moves rapidly, schistosomula can exhibit significant movement due to twists and turns of their bodies, without appreciable translation of their body positions.

We design our tracking approach to consist of three conceptual levels: the

In the blob level, each distinct foreground region (putatively representing a parasite) is represented using its bounding box in the ** G**(

The cost function used to rank the graphs is shown in Eq.(6). In this equation, _{i }

In the parasite level, our approach takes a different strategy than that proposed in

Quantitative description of phenotypes

In Table

Quantitative phenotype descriptors and their descriptions

**Descriptor name**

**Formula**

**Description**

**Size**

Area

See description

The total number of pixels identified during segmentation.

Change in area

Area(

The area of the parasite in the current frame at time

**Shape**

End point length/Skeleton length

See description

Ratio of the Euclidean length of the shortest line between the two endpoints of the skeleton to the length of the skeleton. The skeleton is created by thinning the segmented region until it is represented by a line corresponding to the curve of the body. Branching of the skeleton is handled by iteratively applying the MATLAB spur operator that identifies and removes isolated edge points until only two edges remain

**Movement**

Image difference

Image(

The number of pixels that moved from time

Perimeter (also for description of size)

See description

The number of pixels representing the boundary of the segmented region.

Axis ratio (also for shape description)

MinorAxisLength/MajorAxisLength

Ratio of the minor axis length to the major axis length. The major and minor axes are computed for an ellipse with the same normalized second central moments as the region.

**Texture**

Entropy

-_{2}

Statistical measure of randomness related to the texture of an image where

Contrast

Σ|^{2}

The intensity contrast between a pixel and its neighbors throughout the region.

Correlation

The intensity correlation between a pixel and its neighbors.

Energy

The sum of the squared elements in the GLCM (gray-level co-occurrence matrix). The GLCM measures how often two intensities occur side by side.

Homogeneity

Measures the closeness of the distribution of the elements in the GLCM to the GLCM diagonal.

**Color**

Average grayscale

See description

The mean intensity value and standard deviation found in the region.

Average red

Average green

Average blue

Grayscale histogram

See description

A histogram with bins 0-255 representing the count of each intensity value present in the region.

Red histogram

Green histogram

Blue histogram
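As a rough illustration of how the texture descriptors in the table above might be computed, the following Python sketch builds a normalized gray-level co-occurrence matrix (GLCM) for horizontally adjacent pixels and derives contrast, energy, and homogeneity from it, together with the histogram entropy. The function names, the single horizontal offset, and the toy images are assumptions made for this example; the formulas follow the commonly used definitions of these quantities and may differ in detail from the implementation used in this work.

```python
import math

def glcm(image, levels):
    """Normalized co-occurrence matrix for horizontally adjacent pixel pairs.

    `image` is a list of rows of integer gray levels in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    total = 0
    for row in image:
        for left, right in zip(row, row[1:]):
            counts[left][right] += 1
            total += 1
    if total == 0:
        raise ValueError("image must be at least two pixels wide")
    return [[c / total for c in r] for r in counts]

def texture_features(p):
    """Contrast, energy, and homogeneity of a normalized GLCM `p`."""
    contrast = energy = homogeneity = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            contrast += (i - j) ** 2 * pij        # weight by squared level gap
            energy += pij ** 2                    # sum of squared entries
            homogeneity += pij / (1 + abs(i - j)) # closeness to the diagonal
    return {"contrast": contrast, "energy": energy, "homogeneity": homogeneity}

def entropy(pixels, levels):
    """Shannon entropy (base 2) of the gray-level histogram of `pixels`."""
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    n = len(pixels)
    return -sum((h / n) * math.log2(h / n) for h in hist if h)

# Toy two-level images: vertical stripes versus horizontal stripes.
vertical = texture_features(glcm([[0, 1], [0, 1]], levels=2))
horizontal = texture_features(glcm([[0, 0], [1, 1]], levels=2))
```

In this toy case the vertically striped image has every horizontal pair crossing a level boundary, so its contrast is high and its homogeneity low, while the horizontally striped image shows the opposite pattern.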

Time series analysis of phenotypes

In analyzing the phenotypic responses of individual parasites, our goal is to identify groups of similar phenotypic patterns. Conceptually, this problem requires clustering the phenotypes over time. However, when the time dimension is involved, the clustering problem becomes harder because each data point is not an individual instance but a sequence of data (collected over time). This implies that we are dealing with very high-dimensional data. Furthermore, given that our solution needs to work in high-throughput settings over large data sets, efficiency becomes a paramount consideration. Representing a time series symbolically is one well-known way of reducing this complexity. One such representation, SAX (Symbolic Aggregate approXimation), has also been shown to improve clustering due to the smoothing effect of dimensionality reduction

Symbolic representation of time series (SAX)

SAX

**(a) Original HTS data showing the effect of the drug Imipramine in terms of "Area" for a parasite and its PAA representation, (b) the symbolic representation.** The sequence of length 120 (

A lookup table that contains the breakpoints

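| β_i \ a | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|
| $\beta_1$ | -0.43 | -0.67 | -0.84 | -0.97 | -1.07 | -1.15 | -1.22 | -1.28 |
| $\beta_2$ | 0.43 | 0 | -0.25 | -0.43 | -0.57 | -0.67 | -0.76 | -0.84 |
| $\beta_3$ | − | 0.67 | 0.25 | 0 | -0.18 | -0.32 | -0.43 | -0.52 |
| $\beta_4$ | − | − | 0.84 | 0.43 | 0.18 | 0 | -0.14 | -0.25 |
| $\beta_5$ | − | − | − | 0.97 | 0.57 | 0.32 | 0.14 | 0 |
| $\beta_6$ | − | − | − | − | 1.07 | 0.67 | 0.43 | 0.25 |
| $\beta_7$ | − | − | − | − | − | 1.15 | 0.76 | 0.52 |
| $\beta_8$ | − | − | − | − | − | − | 1.22 | 0.84 |
| $\beta_9$ | − | − | − | − | − | − | − | 1.28 |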
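To make the symbolic representation concrete, the following Python sketch shows one way the standard SAX pipeline can be written: z-normalize the series, reduce it with PAA over equal-length segments, and map each segment mean to a letter using breakpoints that cut the standard normal distribution into equiprobable regions. The function names are hypothetical, the use of the population standard deviation is an assumption, and the equal-length segmentation shown here is the uniform SAX variant rather than the automatic segmentation proposed in this paper.

```python
from statistics import NormalDist, mean, pstdev

def breakpoints(alphabet_size):
    """Cut points dividing N(0, 1) into `alphabet_size` equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(k / alphabet_size) for k in range(1, alphabet_size)]

def paa(series, n_segments):
    """Piecewise aggregate approximation with equal-length segments."""
    if len(series) % n_segments:
        raise ValueError("series length must be divisible by n_segments")
    w = len(series) // n_segments
    return [mean(series[k * w:(k + 1) * w]) for k in range(n_segments)]

def sax(series, n_segments, alphabet_size):
    """Return the SAX word of `series` as a string of lowercase letters."""
    mu = mean(series)
    sigma = pstdev(series) or 1.0  # assumption: leave constant series unscaled
    z = [(x - mu) / sigma for x in series]
    cuts = breakpoints(alphabet_size)
    word = ""
    for value in paa(z, n_segments):
        index = sum(value >= c for c in cuts)  # breakpoints at or below value
        word += chr(ord("a") + index)
    return word

# A rising step signal maps to three increasing symbols.
word = sax([1, 1, 1, 1, 5, 5, 5, 5, 9, 9, 9, 9], n_segments=3, alphabet_size=3)
```

Rounded to two decimals, `breakpoints(4)` reproduces the a = 4 column of the lookup table.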

Automatic determination of piecewise aggregate approximation

Directly applying SAX to large and varied data common to HTS, requires properly estimating the two control parameters _{1 }norm to find the set of longest line segments, which fit the data with the minimum sum of absolute errors along each of the line segments. Given

where

and

The goal of the objective function is to maximize the length of a line segment _{j }_{j }_{j}_{j}_{j-1 }_{j}

**Optimal segmentation and symbolic representation of the time-series from Figure 4(a).** The frame number is shown on the X-axis and the parasite body size (in pixels) is shown on the Y-axis.

Algorithm 1.

**1. i = 1**

**2. breakPoints ←{}**

**3. while i <n**

**4. j = i + 1**

**5. if j == n**

**6. break;**

**7. end if**

**8. compute F**

**9. while j <n**

**10. j = j + 1**

**11. compute **

**12. if **

**13. **

**14. else**

**15. j←j**-1

**16. breakPoints ← breakPoints ∪ j**

**17. break;**

**18. end if**

**19. end while**

**20. i=j**

**21. end while**
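The control flow of Algorithm 1 (grow the current segment one sample at a time and record a breakpoint when the extension is no longer acceptable) can be sketched in Python as follows. Because the objective function is not reproduced in the listing above, this sketch substitutes a hypothetical stand-in criterion: a segment may grow while its mean absolute (L1) deviation from its median stays within a user-chosen `tolerance`. Both the criterion and the `tolerance` parameter are assumptions made for illustration and are not the objective function of this paper.

```python
from statistics import median

def mean_abs_error(segment):
    """Mean absolute (L1) deviation of a segment from its median."""
    m = median(segment)
    return sum(abs(x - m) for x in segment) / len(segment)

def greedy_breakpoints(series, tolerance):
    """Greedily grow segments; return the start indices of segments 2, 3, ...

    Stand-in criterion (an assumption): keep extending the current segment
    while its mean absolute error does not exceed `tolerance`.
    """
    n = len(series)
    breakpoints = []
    start = 0
    while start < n:
        end = start + 1  # current segment is series[start:end]
        while end < n and mean_abs_error(series[start:end + 1]) <= tolerance:
            end += 1
        if end < n:
            breakpoints.append(end)
        start = end
    return breakpoints

def segment_means(series, breakpoints):
    """Variable-length PAA: the mean of each segment between breakpoints."""
    edges = [0, *breakpoints, len(series)]
    return [sum(series[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]

signal = [1, 1, 1, 1, 5, 5, 5, 5, 9, 9]
cuts = greedy_breakpoints(signal, tolerance=0.5)
means = segment_means(signal, cuts)
```

The resulting segment means can then be discretized with the breakpoint lookup table in the same way as for uniform PAA.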

Algorithm 2.

**1. seeds ← getNeighbors( D, p, ε**)

**2. if size(seeds) < MinPts**

**3. p.clusterId ← noise**

**4. return False;**

**5. else**

**6. update p.clusterId**

**7. seeds ← delete(seeds, p)**

**8. while ~isempty(seeds)**

**9. currentP←getFirstSeed(seeds)**

**10. result ←getNeighbors( D, currentP, ε)**

**11. if size(result) >= MinPts**

**12. for i from 1 to size(result)**

**13. q←get(result, i)**

**14. if q.clusterId is unclassified or noise**

**15. if q.clusterId is unclassified**

**16. seeds ←append(seeds, q)**

**17. end if**

**18. update q.clusterId**

**19. end if**

**20. end for**

**21. end if**

**22. seeds ←delete(seeds, currentP)**

**23. end while**

**24. return True**

**25. end if**
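For readers who prefer executable code, the cluster-expansion step of Algorithm 2 can be sketched in Python as below, operating on a precomputed pairwise distance matrix so that any of the time-series distances can be plugged in. The names `dbscan`, `expand_cluster`, and `get_neighbors` and the label encoding are choices made for this example. The sketch follows the widely used formulation of DBSCAN, in which all initial seeds receive the cluster label and a point's neighborhood includes the point itself; it is not necessarily identical to the implementation used in this work.

```python
UNCLASSIFIED, NOISE = None, -1

def get_neighbors(dist, p, eps):
    """Indices of all points within `eps` of point `p` (including `p`)."""
    return [q for q in range(len(dist)) if dist[p][q] <= eps]

def expand_cluster(dist, labels, p, cluster_id, eps, min_pts):
    """Try to grow a cluster from `p`; return False if `p` is not a core point."""
    seeds = get_neighbors(dist, p, eps)
    if len(seeds) < min_pts:
        labels[p] = NOISE
        return False
    for q in seeds:
        labels[q] = cluster_id
    seeds = [q for q in seeds if q != p]
    while seeds:
        current = seeds.pop(0)
        result = get_neighbors(dist, current, eps)
        if len(result) >= min_pts:  # `current` is a core point
            for q in result:
                if labels[q] in (UNCLASSIFIED, NOISE):
                    if labels[q] is UNCLASSIFIED:
                        seeds.append(q)  # explore its neighborhood later
                    labels[q] = cluster_id
    return True

def dbscan(dist, eps, min_pts):
    """Label every point with a cluster id (0, 1, ...) or NOISE."""
    labels = [UNCLASSIFIED] * len(dist)
    cluster_id = 0
    for p in range(len(dist)):
        if labels[p] is UNCLASSIFIED:
            if expand_cluster(dist, labels, p, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Toy data: two tight groups on a line plus one isolated point.
points = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(a - b) for b in points] for a in points]
labels = dbscan(dist, eps=1.5, min_pts=2)
```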

Definition of a similarity measure between time-series

Given two time series

The

A lookup table used by the MINDIST function, (

|   | a | b | c | d |
|---|---|---|---|---|
| **a** | 0 | 0 | 0.67 | 1.34 |
| **b** | 0 | 0 | 0 | 0.67 |
| **c** | 0.67 | 0 | 0 | 0 |
| **d** | 1.34 | 0.67 | 0 | 0 |
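The entries of this lookup table can be derived from the breakpoints, and the following Python sketch shows how, using the standard SAX definitions: the distance between two symbols is zero when they are identical or adjacent, and otherwise equals the gap between the breakpoints that separate them. MINDIST then combines the per-symbol distances of two equal-length words, scaled by the ratio of the original series length n to the word length w. The function names are hypothetical, and the breakpoints are the rounded a = 4 values from the breakpoint table.

```python
import math

BREAKPOINTS_A4 = [-0.67, 0.0, 0.67]  # rounded breakpoints for alphabet size 4

def symbol_distance_table(cuts):
    """Lookup table of distances between symbols, given the breakpoints."""
    a = len(cuts) + 1
    table = [[0.0] * a for _ in range(a)]
    for r in range(a):
        for c in range(a):
            if abs(r - c) > 1:  # identical or adjacent symbols stay at 0
                table[r][c] = cuts[max(r, c) - 1] - cuts[min(r, c)]
    return table

def mindist(word_q, word_c, n, table):
    """MINDIST between two equal-length SAX words of series of length n."""
    w = len(word_q)
    if len(word_c) != w:
        raise ValueError("MINDIST requires words of equal length")
    total = sum(
        table[ord(x) - ord("a")][ord(y) - ord("a")] ** 2
        for x, y in zip(word_q, word_c)
    )
    return math.sqrt(n / w) * math.sqrt(total)

table = symbol_distance_table(BREAKPOINTS_A4)
d_same = mindist("abcd", "abcd", n=8, table=table)
d_far = mindist("aa", "cd", n=8, table=table)
```

Because adjacent symbols are treated as distance zero, MINDIST can report zero for words that differ only by neighboring letters, which is one motivation for also considering other distance measures.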

**(a) Two original time series C and Q (b) PAA representations of the two original sequences using the uniform segmentation (c) The symbolic representations of the two PAAs from (b).** X-axis denotes the frame number and the Y-axis denotes the body size (in pixels).

For comparing time-series as represented by symbolic representations of varying lengths, we investigate the following two distance measures:

where

where
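Later sections of this paper name the edit (Levenshtein) distance and dynamic time warping (DTW) as distance functions applied to symbolic representations. Assuming these are the two measures intended here, the following Python sketch shows textbook versions of both that accept strings of different lengths. The per-symbol cost used for DTW (the absolute difference between letter positions in the alphabet) is an assumption made for illustration and may differ from the cost used in the experiments.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # substitute (free when equal)
            ))
        prev = cur
    return prev[-1]

def symbol_gap(a, b):
    """Assumed local cost: distance between letter positions."""
    return abs(ord(a) - ord(b))

def dtw(s, t, cost=symbol_gap):
    """Dynamic time warping distance between two symbol sequences."""
    inf = float("inf")
    d = [[inf] * (len(t) + 1) for _ in range(len(s) + 1)]
    d[0][0] = 0.0
    for i, a in enumerate(s, 1):
        for j, b in enumerate(t, 1):
            d[i][j] = cost(a, b) + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(s)][len(t)]
```

For example, `dtw("aabbcc", "abc")` is zero because warping absorbs repeated symbols, whereas the edit distance between the same two strings is three.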

Clustering time series representation of phenotypes

We investigate two different clustering methods for clustering the time-series based description of phenotypes:

In DBSCAN, given an object _{1},..., _{n}_{1 }= _{n }_{i}_{+1 }is directly density-reachable from _{i}

Our use of DBSCAN starts with an arbitrary time series

We use two data sets to investigate the applicability of hierarchical clustering and DBSCAN. The first data set is the synthetic control chart time series from the UCI machine learning repository. Because the class label of every sequence is known for this set, we use it for studying the validity of the resulting clusters. Three sequences of length 60 are selected from each of three classes: normal, decreasing trend, and upward shift. Figure

**Optimal segmentation and symbolic representation of the sequences from the UCI database with a = 6.** The X-axis denotes the time step and the sequence value is shown on the Y-axis.

**Dendrograms constructed by the agglomerative hierarchical clustering for the UCI dataset.** The X-axis denotes the sequence number and the Y-axis denotes values obtained using various distance measures.

The second data set is a control group. For our experiment, we use the image difference descriptor and study 22 parasites. This descriptor corresponds to the motion exhibited by a parasite: the greater the motion, the larger the image difference. Figure

**Dendrograms constructed by agglomerative hierarchical clustering using values of the image difference descriptor: (a) Control group (b) Phenotypes measured on the 7th day after exposure to the drug Imipramine.** The X-axis denotes the sequence number and the Y-axis denotes distance.

For this data set, no significant difference was found in the results from DBSCAN and agglomerative hierarchical clustering, indicating the validity of the clustering results of both methods. A manual inspection of the data confirmed this conclusion and the clusters that were obtained.

The third data set captures the effect of the drug Imipramine after parasites had been exposed to it for seven days. DBSCAN found two clusters for this data set, cluster-1 = (1, 7, 9, 11, 13, 14) and cluster-2 = (2, 3, 4, 5, 6, 8, 10, 11, 12, 15, 16) with

Identifying representative time series

Finding a representative time series for a given cluster requires identifying one of the constituent time series which best characterizes the phenotypic diversity of the cluster. Different principles can be used to identify such a representative. In this paper, we propose two methods that approach this question from different perspectives. In the first method, we define the representative to be a time series that has the minimum sum-of-distances (MSD) with all the other time series in that cluster. We use DTW defined over the symbolic time-series representations to determine the representative.
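The minimum sum-of-distances rule amounts to selecting the medoid of the cluster. A minimal Python sketch, with a hypothetical function name and a simple placeholder distance standing in for DTW over symbolic representations, is:

```python
def msd_representative(series_list, distance):
    """Index of the series with the minimum sum of distances to all others."""
    sums = [sum(distance(s, t) for t in series_list) for s in series_list]
    return min(range(len(series_list)), key=sums.__getitem__)

def symbol_gap_distance(s, t):
    """Placeholder distance for equal-length words (stand-in for DTW)."""
    return sum(abs(ord(a) - ord(b)) for a, b in zip(s, t))

cluster = ["abc", "abd", "abz"]
representative = cluster[msd_representative(cluster, symbol_gap_distance)]
```

Any pairwise distance function can be passed in place of the placeholder, so the same routine applies unchanged when DTW is used.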

The second method is based on the notion of a low-dimensional vector called a sketch, which has been used for discovering approximately repeating subsequence

In Eq.(13), _{2 }norm is applied to identify a sketch with the least sum of distances to all other sketches and the time series corresponding to this sketch is declared as the cluster representative.
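A sketch-based selection can be illustrated as follows. Since the defining equation is not reproduced above, this Python example assumes one common construction: each series is projected onto a small number of random ±1 vectors, and the series whose sketch has the least sum of L2 distances to all other sketches is returned. The sketch dimension, the random seed, the function names, and the requirement that all series have equal length are assumptions made for illustration.

```python
import math
import random

def make_sketch(series, projections):
    """Dot products of the series with each random projection vector."""
    return [sum(x * r for x, r in zip(series, vec)) for vec in projections]

def sketch_representative(series_list, sketch_dim=8, seed=0):
    """Index of the series whose sketch is closest (L2) to all other sketches.

    Assumes all series have the same length.
    """
    rng = random.Random(seed)
    length = len(series_list[0])
    projections = [
        [rng.choice((-1.0, 1.0)) for _ in range(length)]
        for _ in range(sketch_dim)
    ]
    sketches = [make_sketch(s, projections) for s in series_list]
    sums = [sum(math.dist(a, b) for b in sketches) for a in sketches]
    return min(range(len(sketches)), key=sums.__getitem__)

# Two identical flat series and one distant series.
index = sketch_representative([[0, 0, 0, 0], [0, 0, 0, 0], [10, 10, 10, 10]])
```

Because the sketches are much shorter than the original series, the pairwise comparison is cheaper than computing distances on the raw data, at the cost of an approximation.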

In Figure ^{th }frame in the observation period is depicted). Since the movement occurred over the same time-duration, the reader can see greater motion exhibited by parasites from the first cluster. Since the parasite identified using the sketch shows greater mobility than the one obtained using DTW+MSD, we postulate that sketching may be a better approach for finding representative time series. Figure ^{th }day after exposed to the drug Imipramine. Figure

**Representative time series of the two clusters.** Note that the magnitudes of (a) and (c) are greater than those of (b) and (d). The frame number is depicted on the X-axis while the Y-axis denotes the change in area.

**Shape change in the parasites corresponding to the representative time series for each cluster.** The snapshots depict parasites at the first frame and at every 45th frame thereafter. As can be seen, based on the rate of shape change, the parasites in (a) and (c) are more active than those in (b) and (d).

**Representative time series of the two clusters.** The data was collected on the seventh day after exposure to the drug Imipramine. Note that the magnitude of (a) is greater than that of (b) and the magnitude of (c) is also greater than that of (d). In this figure the frame number is depicted on the X-axis and the change in the size of the parasites is shown on the Y-axis.

**Snapshots depicting the shape change in the parasites corresponding to the representative time series for each cluster from Figure 12 at the first frame and every 45th frame thereafter.**

Experiments

In the following, we present a number of experiments and case studies to validate the proposed method and apply it to analyze data from phenotypic screens. The data used in this experiment was generated by screening six compounds which are shown in Figure

**The structures of the six compounds used in the experiments.**

We begin by presenting results that quantify the accuracy of the image segmentation and tracking. Next, experiments related to time series clustering and the identification of representative time series are presented. Included in this section are results from a case study which was conducted to compare the phenotypic response within two groups of parasites. The control constituted the first group while the second group was exposed to the current gold standard drug PZQ. An important result from this study was that the effect of PZQ could be stratified in terms of three distinct phenotype clusters. Following this, in the section "Cluster identification using phenotypes from control and multiple compounds", the ability of the proposed method to automatically segregate phenotypes arising from the effect of different compounds is analyzed. A key goal of this analysis was to find the best combination of the alphabet size, the segment length and the distance function for use in subsequent experiments. Results of using these parameters for classifying the phenotypes elicited by the six compounds are presented in Section "Clustering of phenotypes elicited from all compounds". These results constitute proof of concept for the proposed method. Finally, in the section "Identification of representative time series" we present a case study that involves determining the representative phenotype models for the controls as well as for parasites exposed to different compounds.

Analysis of the accuracy of image segmentation and tracking

To determine the effectiveness of the segmentation and tracking, we manually counted the number of parasites in five videos (See Table

Image segmentation and tracking accuracy of five groups

| Compound | Total parasites | Segmented parasites | Tracked parasites | False positives | Segmentation accuracy | Tracking accuracy |
|---|---|---|---|---|---|---|
| Pravastatin | 50 | 33 | 41 | 3 | 66% | 82% |
| Simvastatin | 67 | 58 | 62 | 5 | 87% | 93% |
| Imipramine | 53 | 43 | 39 | 4 | 81% | 74% |
| PZQ | 42 | 27 | 31 | 8 | 64% | 74% |
| Control | 60 | 45 | 51 | 9 | 75% | 85% |
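The percentages in the table are consistent with dividing the number of segmented (or tracked) parasites by the manually counted total and rounding to the nearest percent. The short Python check below reproduces them under that assumption; the dictionary layout and function name are choices made for this example.

```python
# (total, segmented, tracked) counts per video, copied from the table.
counts = {
    "Pravastatin": (50, 33, 41),
    "Simvastatin": (67, 58, 62),
    "Imipramine": (53, 43, 39),
    "PZQ": (42, 27, 31),
    "Control": (60, 45, 51),
}

def percent(part, total):
    """Percentage of `part` in `total`, rounded to the nearest integer."""
    return round(100 * part / total)

# Maps each compound to (segmentation accuracy, tracking accuracy).
accuracy = {
    name: (percent(segmented, total), percent(tracked, total))
    for name, (total, segmented, tracked) in counts.items()
}
```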

Data pre-processing and parameters employed in time-series analysis

The video data was collected and analyzed using the methods described in Sections "Parasite identification by image segmentation" and "Parasite tracking". As is well known, real-world data tends to be noisy. To reduce the noise, each value of the descriptor in the time series was replaced by the mean of neighboring values within a sliding window and then outliers were smoothed out by a density-based local outlier detection method
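The smoothing step described above can be sketched in Python as a centered moving average. The window size and the shrinking of the window at the boundaries are assumptions made for illustration, and the subsequent density-based outlier handling is not shown.

```python
def moving_average(series, window):
    """Replace each value by the mean of its neighbors in a centered window.

    Near the boundaries the window shrinks to the available samples.
    """
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed

# A single spike is spread over its neighbors.
smoothed = moving_average([1, 1, 10, 1, 1], window=3)
```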

Case study: analysis of phenotypes of control vs PZQ

In this case study, we analyzed the phenotypic differences exhibited by control parasites and those exposed to PZQ. Two clusters were identified for the control group. Cluster1 was a singleton (^{th }frame were found to be nearly uniform. Figure

**Representative time series and standard deviation at every 10th frame, a = 6**

**Shapes of the representative parasites over time.** Starting from the first frame, the snapshots are taken every 45th frame. (a) Representative of the dominant cluster in the control data. (b - d) The representative parasites from each of the three clusters found in the set that was exposed to PZQ. The reader may note the distinct differences in the phenotypic response of the parasites in each of the clusters.

Cluster identification using phenotypes from control and multiple compounds

The goal of this experiment was to find the best combination of the alphabet size, the segment length and the distance function so as to separate the phenotypic response of parasites that were exposed to different compounds. We used the shape descriptor defined as the ratio of the end-point distance to the length of the skeleton. This descriptor is especially well suited to distinguish parasites having normal shape from those that are straight. Four samples were selected from each of the three groups: control, PZQ and Lovastatin. The sampling was carried out as follows. For each group, all of the time series were optimally segmented and then symbolically represented using our proposed method. Given the strings, clusters were found by agglomerative hierarchical clustering and then four samples were selected from each one of the clusters.

Four distance functions were tried with agglomerative hierarchical clustering: MinDist of the original SAX method, edit distance, Euclidean distance and DTW. The distance for every pair of symbolic representations of the time series was computed by each of the three distance functions defined on symbolic representations (MinDist, edit distance and DTW), and the clusters of those time series were then identified based on the distances. When Euclidean distance was used, the raw data were used instead of the symbolic representation. A data set was formed from the total of 12 time series from the three groups: Control = {1, 2, 3, 4}, PZQ = {5, 6, 7, 8}, Lovastatin = {9, 10, 11, 12}. Note that the parasites from the control group were slim and long, while those treated with PZQ and Lovastatin had curved and oval shapes, respectively. By doing so, the data set was clearly separated and the ability of the method to distinguish the phenotypes could be tested unambiguously. Three distance functions were tried to investigate the clustering accuracy. In Figure

**Dendrograms constructed by various distance functions, alphabet sizes and segment lengths.** Control group = {1, 2, 3, 4}, PZQ = {5, 6, 7, 8}, Lovastatin = {9, 10, 11, 12}. Dendrograms constructed using SAX and MinDist with varying alphabet sizes and varying numbers of segments are shown in parts (1-6). For each case, the six segments (

Next, the Levenshtein distance was applied to the symbolic representations of optimally segmented time series. The parasites from the control group were found to be well separated but the other two groups were not properly distinguished (Figure

Clustering of phenotypes elicited from all compounds

This experiment extended the one carried out in the previous section by choosing the parasites at random. Here, phenotypic responses of parasites exposed to each of the compounds were collected from video observations made on the 7th day. Four parasites were randomly selected from each of the seven groups (the six compounds and the control) as follows: Control = {1 - 4}, Chlorpromazine = {5 - 8}, Imipramine = {9 - 12}, Lovastatin = {13 - 16}, Pravastatin = {17 - 20}, PZQ = {21 - 24}, and Simvastatin = {25 - 28}. Perceptually, the appearance of the parasites constituted the most significant phenotype. Consequently, in this experiment the average grayscale intensity was used as the descriptor. The agglomerative hierarchical clustering algorithm was employed using DTW distance. The results of the clustering along with the ground truth and a manual description of the parasite appearance are shown in Table

Clusters obtained for phenotypes elicited from all compounds

**Compound**

**Cluster 1**

**Cluster 2**

**Manual phenotype assignment**

** (based on parasite appearance)**

Control

1, 2, 3, 4

Light

Chlorpromazine

5, 7

Dark

6, 8

Light

Imipramine

9, 11, 12

Dark

10

Light

Lovastatin

13, 14, 15, 16

Dark

Pravastatin

17, 18, 19, 20

Light

PZQ

21, 22, 23, 24

Light

Simvastatin

25, 26, 27, 28

Dark

Identification of representative time series

In this experiment, we further analyzed the data from Section "Cluster identification using phenotypes from control and multiple compounds", where a shape descriptor was used to cluster the phenotypes arising as a response to Lovastatin and PZQ in comparison to the control. The representative time series of each cluster (Figure

**Representative time series and standard deviation of each cluster, a = 4**

**Representative parasites for each of the three clusters.** The figure shows snapshots that were captured starting at the first frame and at every 45th frame thereafter.

Conclusions

The research presented in this paper represents a significant breakthrough towards quantitative phenotypic drug screening against neglected diseases, such as schistosomiasis, where the effect of a drug on the target pathogen is manifested through a continuum of complex phenotypes. The proposed method lies at the interface of disciplines. From the algorithmic perspective, a key contribution of this work has been the adaptation and extension of techniques from time-series data analysis for representing and reasoning about phenotypes exhibited by schistosomula. An important result from this perspective has been the development of a rigorous approach to automatically quantify and characterize the phenotypic responses of different parasites to a drug. Consequently, the proposed method can be crucial for the development of high-throughput phenotypic screens where a much larger fraction of the chemical space can be explored during lead discovery. Another important result lies in the ability of the method to detect and represent variability in the response of different parasites when they are exposed to the same drug in identical environmental conditions. Recognizing such stratifications in the parasite population may be significant in more ways than one. Among others, detection of such variability can play a major role in driving exploration of the pathogen's biology and in understanding the development of resistance to drugs. Furthermore, the existence of such variability also underlines the need for developing new computational and statistical methods that can robustly analyze highly variable data from high-throughput phenotypic screens.

Competing interests

The interpretation of the data, findings and conclusions contained within this paper were not influenced by personal, financial and non-financial relationships with the funders.

Authors' contributions

RS formulated the problem and proposed the time-series analysis-based solution framework. The methods for image segmentation and image feature extraction were designed and implemented by AMD and RS. US and RS designed and implemented the tracking algorithm. The time-series analysis methods were designed and implemented by HL and RS. BMS, CRC, SC, and MA were involved in assay design. The video data was captured by BMS and CRC. DA was responsible for storage and management of the video data. The computational experiments were conducted and analyzed by HL, RS, AMD, and US. The paper was written by RS and HL. All authors read and approved the final manuscript.

Acknowledgements and funding

The authors would like to thank Ai Sasho, who participated in the design of the video storage system and Laurent Mennillo, who was involved in developing an early version of the segmentation algorithm. Thanks are due to Eamonn Keogh for the SAX code and to the anonymous reviewers for their comments. RS would also like to thank Jim McKerrow, who introduced him to the study of neglected tropical diseases. This research was funded in part by the NIH-NIAID through grant 1R01A1089896-01, the NSF through grant IIS-0644418 (CAREER), the Bill & Melinda Gates Foundation through grant OPP1017584, and a Joint Venture Grant from the California State University Program for Education and Research in Biotechnology (CSUPERB) and the Sandler Center for Drug Discovery. The findings and conclusions contained within are those of the authors and do not necessarily reflect positions or policies of the funders.

This article has been published as part of