Few genetic factors predisposing to the sporadic form of amyotrophic lateral sclerosis (ALS) have been identified, but the pathology itself seems to be a true multifactorial disease in which complex interactions between environmental and genetic susceptibility factors take place. The purpose of this study was to approach genetic data with an innovative statistical method such as artificial neural networks to identify a possible genetic background predisposing to the disease. A DNA multiarray panel was applied to genotype more than 60 polymorphisms within 35 genes selected from pathways of lipid and homocysteine metabolism, regulation of blood pressure, coagulation, inflammation, cellular adhesion and matrix integrity, in 54 sporadic ALS patients and 208 controls.
Advanced intelligent systems based on novel coupling of artificial neural networks and evolutionary algorithms have been applied. The results obtained have been compared with those derived from the use of standard neural networks and classical statistical analysis. Unexpectedly, a strong genetic background in sporadic ALS was found using a DNA multiarray panel and analytical processing of the data with advanced artificial neural networks. The predictive accuracy obtained with Linear Discriminant Analysis and Standard Artificial Neural Networks ranged from 70% to 79% (average 75.31%) and from 69.1% to 86.2% (average 76.6%), respectively. The corresponding value obtained with Advanced Intelligent Systems reached an average of 96.0% (range 94.4% to 97.6%). This latter approach allowed the identification of seven genetic variants essential to differentiate cases from controls: apolipoprotein E arg158cys; hepatic lipase -480 C/T; endothelial nitric oxide synthase 690 C/T and glu298asp; vitamin K-dependent coagulation factor seven arg353glu; glycoprotein Ia/IIa 873 G/A; and E-selectin ser128arg.
This study provides an alternative and reliable method to approach complex diseases. Indeed, the application of a novel artificial intelligence-based method offers new insight into genetic markers of sporadic ALS, pointing to the existence of a strong genetic background.
Amyotrophic lateral sclerosis (ALS), the most common form of motor neuron disease, is a relatively rare (incidence: 1–3/100,000 per year), progressive and fatal disease characterised by neurodegeneration involving primarily motor neurons of the cerebral cortex, brain stem and spinal cord. To date, most studies have focused upon the familial form of the disease, which accounts for less than 10% of cases and is usually inherited in an autosomal dominant manner. The gene coding for copper/zinc superoxide dismutase 1 (SOD1) appears to be mutated in 10–20% of familial cases.
Genetic risk factors for ALS have been extensively studied and some "major genes", in addition to SOD1, have been recognised as being responsible for the monogenic inheritance pattern. There are now at least six dominantly inherited adult-onset ALS genes, of which only three have been identified so far. However, most ALS cases seem to represent a typical multifactorial disease deriving from the interaction between a number of genes and environmental factors, some of which are not yet established as causes of the disease, including brain and spinal cord trauma, strenuous physical activity and exposure to radiation.
Current hypotheses suggest a complex interplay between multiple mechanisms including genetic risk factors, oxidative stress, neuroexcitatory toxicity, mitochondrial dysfunction, intermediate neurofilament disorganization, failure of intracellular mineral homeostasis involving zinc, copper, or calcium, disrupted axonal transport, abnormal protein aggregation or folding, and neuroinflammation [3,4]. Recently there has been growing interest in the role played by non-neuronal neighbourhood cells in the pathogenesis of motor neuron injury and in the dysfunction of specific molecular signal pathways [5,6].
Among the genetic factors that may predispose to sporadic ALS, neurofilaments, apolipoprotein epsilon 4 genotype, excitotoxicity genes, ciliary neurotrophic factor (CNTF), cytochrome P450 debrisoquine hydroxylase CYP2D6, apurinic apyrimidinic endonuclease (APEX), mitochondrial manganese superoxide dismutase SOD2, monoamine oxidase allele B and paraoxonases have been reported in different studies, partly with contradictory results [2,4,7-9]. Not all the published studies have been replicated, probably because of the different populations analysed as well as insufficient sample size. On the other hand, different studies have employed either tissue microdissection or microarray technologies to search for other "low penetrant" or "susceptibility" genes that are more common in the population and often polymorphic; the combination and interaction of these genes with environmental factors may contribute to modulating individual risk [10-12]. Recently, several genome-wide association studies have been performed with innovative approaches, i.e. the Illumina platform, and the authors have identified SNPs (single nucleotide polymorphisms) potentially associated with ALS [13-16]. However, most genome-wide association studies have not confidently identified risk genes that are replicated in every study. The most likely causes are disease heterogeneity, allelic heterogeneity, small effect sizes and, probably, insufficient sample size. Moreover, so far no microarray panel has been specifically developed for ALS and the aetiology of the disease still remains to be defined.
Some years ago our group had the opportunity of working on another complex multifactorial disease, venous thrombosis, and of analysing the results by an innovative statistical approach, Artificial Neural Networks (ANNs). Indeed, ANNs promise to improve the predictive value of traditional statistical data analysis. Initially, a known set of data, from a given problem with a known solution, is learned by the ANNs; subsequently the networks can reconstruct the fuzzy rules that may underlie a complex set of data. ANNs have been successfully used in many areas of medicine, as recently illustrated in an extensive review by Lisboa, as well as by Ritchie et al., where neural networks were used for supervised pattern recognition in genetic epidemiology, and also in SNP association studies [20-22]. Much effort has been spent on adapting ANN architectures and ensembles to the specific problems to be solved. More specifically, many novel computational approaches have been developed and applied with special attention to complex gene-gene and gene-environment interactions and ANNs [19-23].
The literature data, together with the impressive results we obtained with ANNs, by which we were able to identify a subset of polymorphisms related to the disease, prompted us to employ the same approach in ALS, hoping to discover specific genetic patterns underlying the sporadic form of this disease. We applied a multiarray approach including allelic variations in genes that could be involved in the pathogenesis of ALS, since it has been demonstrated that inflammation, cellular adhesion and lipid pathways are linked to the disease [10,11]. In contrast, it has never been demonstrated that regulation of blood pressure, coagulation, homocysteine metabolism and matrix integrity pathways are directly linked to ALS, although they could be indirectly involved.
Genotyping of ALS cases and controls was performed. We applied advanced intelligent systems based on novel coupling of artificial neural networks and evolutionary algorithms and compared the results with those obtained by linear discriminant analysis and a simple back propagation approach.
Surprisingly, we discovered a novel, strong genetic background allowing correct classification of cases and controls with an accuracy higher than 90%.
The study population consisted of Caucasian subjects of Italian ancestry: 54 sporadic ALS (SALS) patients and 208 control subjects.
Diagnostic criteria for ALS were based on the World Federation of Neurology El Escorial Revisited document. All patients diagnosed as having Definite, Probable or Probable laboratory-supported ALS, who gave their informed consent, were included in the study. The diagnosis of Possible ALS was also accepted. According to common clinical practice, our cases were subdivided into bulbar and spinal onset on the basis of the first symptoms reported by each patient. All patients were referred to the Department of Neurology of Niguarda Hospital, Milan, from 2001 to 2005, and were defined as sporadic when the disease was present in a single member of the family and when no mutations were present in the SOD1 gene.
Control subjects were selected from a healthy control population, randomly collected from healthy blood donors admitted to the "Healthy Blood Donor Service" of Niguarda Ca' Granda Hospital. We checked the absence of personal and familial history of ALS in this group through direct interview.
This study was approved by the local ethics committee.
DNA was extracted using a salting-out procedure. We applied a multilocus assay, as previously described [17,26], to genotype 60 biallelic polymorphisms within 35 genes that were selected from pathways of lipid and homocysteine metabolism, regulation of blood pressure, coagulation, inflammation, cellular adhesion and matrix integrity. The following polymorphisms (SNPs) were genotyped: LPA 93C/T, 121 G/A, APOA4 thr347ser, glu360his, APOB thr71ile, APOC3 641C/A, 482C/T, 455 T/C, 1100 C/T, 3175 C/G, 3206 T/G, APOE cys112arg, arg158cys, ADRB3 trp64arg, PPARγ pro12ala, LIPC 480C/T, LPL 93 T/G, asp9asn, asn291ser, ser447term, PON1 met55leu, gln192arg, PON2 ser311cys, LDLR NcoI+/-, CETP -631C/A, -629 C/A, ile405val, TNF beta thr26asn, MTHFR 677 C/T, NOS3 -922 A/G, -690 C/T, glu298asp, DCP1 IVS16 ins/del, AGTR1 1166A/C, AGT met235thr, NPPA 664 G/A, NPPA 2238 T/C, ADD1 gly460trp, SCNN1A trp493arg, ala663thr, GNB3 825 C/T, ADRB2 arg16gly, ADRB2 gln27glu, MMP3 (-1171) 5A/6A, FII 20210 G/A, FV arg506gln, FVII -230 10 bp del/ins, arg353glu, PAI -675 G5/G4, 11053 G/T, FGB -455 G/A, ITGA2 873 G/A, ITGB3 leu33pro, SELE ser128arg, leu554phe, ICAM gly214arg, TNF alpha -376 G/A, -308G/A, -244 G/A, -238 G/A.
The marker TNF beta thr26asn is present twice in the arrays as a control for the multiplex PCR and the hybridization procedures.
All ALS subjects were screened for SOD1 mutations through PCR amplification and direct sequencing according to standard procedures.
Each record related to a known clinical condition or to a sample population, and comprised 62 variables corresponding to the 60 SNPs plus case and control. We eliminated from the database those markers for which only one genotype was present (APOB Arg3500Gln, CBS Ile278Thr, CETP Asp442Gly, 14G(+1) A and 14(+3) T ins) both in cases and controls. All the analysed polymorphisms may have three genotype classes: wild type, heterozygous and homozygous status. The association of these variables with ALS status was tested by ANNs and the results were compared with those obtained by a linear discriminant analysis. The models we used aimed at correct classification of the subjects in two classes:
1) SALS patients (cases),
2) healthy subjects (controls).
No other specific genetic model potentially linked to the analysed SNPs was evaluated; ANNs are able to build a model with a strong genetic basis simply by collecting all the information contained in the SNPs, without any a priori definition. The mathematical approach of ANNs consists in measuring the general dependence of random variables related to a group of subjects without making any assumption about the nature of their underlying relationships.
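The data preparation described above (discarding markers that show a single genotype, then presenting three genotype classes per SNP to the classifier) can be illustrated with a minimal sketch. The one-hot coding, the function names and the toy records are assumptions made for illustration; the article does not state how genotypes were numerically encoded.

```python
from typing import Dict, List

GENOTYPES = ("WT", "Hetero", "Homo")  # the three genotype classes per SNP


def drop_monomorphic(records: List[Dict[str, str]]) -> List[str]:
    """Return the marker names that show more than one genotype in the data."""
    markers = list(records[0].keys())
    return [m for m in markers if len({r[m] for r in records}) > 1]


def one_hot(records: List[Dict[str, str]], markers: List[str]) -> List[List[int]]:
    """Encode each kept marker as three 0/1 inputs (assumed coding, not the paper's)."""
    rows: List[List[int]] = []
    for r in records:
        row: List[int] = []
        for m in markers:
            row.extend(1 if r[m] == g else 0 for g in GENOTYPES)
        rows.append(row)
    return rows


# Hypothetical toy records: marker names come from the panel, genotypes are invented.
toy = [
    {"APOE arg158cys": "WT", "MTHFR 677 C/T": "Hetero", "CBS Ile278Thr": "WT"},
    {"APOE arg158cys": "Hetero", "MTHFR 677 C/T": "Homo", "CBS Ile278Thr": "WT"},
]
kept = drop_monomorphic(toy)    # the monomorphic CBS marker is dropped
encoded = one_hot(toy, kept)    # two rows of 2 markers x 3 indicator inputs
```

With this coding each subject becomes a fixed-length 0/1 vector, so no additive, dominant or recessive model has to be assumed in advance.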
Artificial neural networks analysis
In this study we applied supervised ANNs in order to develop a model able to predict the diagnostic class with a high degree of accuracy, starting from genotype data alone.
Supervised ANNs are networks that learn by examples, calculating an error function during the training phase and adjusting the connection strengths in order to minimize the error function. The learning constraint of supervised ANNs is that their output must coincide with the predefined target. The general form of these ANNs is: y = f(x,w*), where w* constitutes the set of parameters which best approximate the function.
We employed Back Propagation (BP) ANNs. This type of ANN belongs to a very large family of ANNs that normally use a specific kind of signal flow named Feed Forward (FF). In the FF ANNs the signal proceeds from the input to the output of the ANN, crossing all of the nodes once only. The architecture of these networks is characterized by different layers of interconnected nodes (input, hidden and output nodes), which process the input signal according to a non-linear function (generally of sigmoid type). The fundamental equation that characterizes the activation of a single node and, therefore, the signal transfer from one layer to another is:
Learning, i.e. the modelling of the input-output relation represented by the data, occurs through minimization of the output error and back-propagation of this error to the internal nodes (the hidden units), using the gradient descent algorithm in the majority of cases. In particular, each weight is corrected by the formula:
for the last layer and:
for all the other layers.
In theory, a Back Propagation network with a sufficient number of hidden units is able to reconstruct any y = f(x) function.
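In common textbook notation (an assumption here; the symbols are not taken from the article), the node activation and the gradient-descent weight corrections of a feed-forward back-propagation network can be sketched as follows. Here x_j^{[s]} is the output of node j in layer s, w_{ji}^{[s]} is the weight from node i of layer s-1 to node j of layer s, t_j is the target, E is a squared-error function and η is the learning rate.

```latex
\begin{align}
x_j^{[s]} &= f\big(a_j^{[s]}\big), \qquad
a_j^{[s]} = \sum_i w_{ji}^{[s]}\, x_i^{[s-1]}, \qquad
f(a) = \frac{1}{1 + e^{-a}} \\
E &= \tfrac{1}{2} \sum_j \big(t_j - x_j^{[\mathrm{out}]}\big)^2 \\
\Delta w_{ji}^{[s]} &= -\eta\, \frac{\partial E}{\partial w_{ji}^{[s]}}
                     = \eta\, \delta_j^{[s]}\, x_i^{[s-1]} \\
\delta_j^{[\mathrm{out}]} &= \big(t_j - x_j^{[\mathrm{out}]}\big)\, f'\big(a_j^{[\mathrm{out}]}\big)
   && \text{(last layer)} \\
\delta_j^{[s]} &= f'\big(a_j^{[s]}\big) \sum_k \delta_k^{[s+1]}\, w_{kj}^{[s+1]}
   && \text{(all other layers)}
\end{align}
```

For the sigmoid, f'(a) = f(a)(1 - f(a)), so each correction can be computed from quantities already available after the forward pass.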
The BP used in this work was intentionally improved through the use of the SoftMax equation, specific for classification problems:
and through the use of the Selfmomentum equation, which appears as follows:
where the learning cycle is indicated by n.
From a practical point of view, the Selfmomentum equation allows the solution of all of the problems solved by the Momentum, in a much faster way, maintaining the unitary learning coefficient (Rate = 1).
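For orientation, the standard SoftMax output and the classical Momentum update can be written as below. This is a sketch in generic notation, all of which is assumed: a_k is the net input of output node k, η the learning rate, α the momentum coefficient, n the learning cycle, and δ_j and x_i the error signal and input of the connection being updated. The Selfmomentum variant used in this work is not reproduced here.

```latex
\begin{align}
y_k &= \frac{e^{a_k}}{\sum_j e^{a_j}}, \qquad \sum_k y_k = 1 \\
\Delta w_{ji}(n) &= \eta\, \delta_j\, x_i + \alpha\, \Delta w_{ji}(n-1)
\end{align}
```

SoftMax makes the outputs sum to one, so they can be read as class-membership scores for the two targets; the momentum term reuses the previous correction to damp oscillations of plain gradient descent.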
The architecture of the BP-FF ANN consists of an input layer sized according to the number of selected variables, one hidden layer sized according to the input layer (minimum 2 nodes, maximum 12 nodes), and an output layer representing the two prediction targets (SALS cases; controls).
As a benchmark we employed linear discriminant analysis (LDA), applied to the same training and testing data sets used for the ANNs. LDA was performed with SAS version 6.04 (SAS Institute, Cary, NC, USA) using a forward stepwise procedure.
Preprocessing methods and experimental protocols
Data preprocessing was performed using two different re-sampling criteria of the global dataset.
- Random criterion
We employed the so-called 5 × 2 cross-validation protocol. In this procedure the study sample is randomly divided five times into two sub-samples, always different but containing a similar distribution of cases and controls: the training one (containing the dependent variable) and the testing one. During the training phase the ANNs learn a model of the data distribution and then, on the basis of this model, classify the subjects in the testing set in a blind way. Training and testing sets are then reversed, so that 10 analyses are conducted for every model employed.
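The 5 × 2 protocol can be sketched as a generator of ten training/testing index pairs. This is a minimal illustration, assuming simple per-class shuffling to keep the case/control proportions similar in both halves; the function name and the seed are hypothetical and the article does not describe its own implementation.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def five_by_two_splits(labels: Sequence[int], seed: int = 0
                       ) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield 10 (train, test) index pairs: 5 stratified halvings, each used both ways."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    for _ in range(5):
        half_a: List[int] = []
        half_b: List[int] = []
        for c in classes:
            idx = [i for i, y in enumerate(labels) if y == c]
            rng.shuffle(idx)
            cut = len(idx) // 2
            half_a.extend(idx[:cut])    # half of this class goes to sub-sample A
            half_b.extend(idx[cut:])    # the rest goes to sub-sample B
        yield half_a, half_b            # train on A, blind test on B
        yield half_b, half_a            # roles reversed


# 54 cases (label 1) and 208 controls (label 0), as in the study population.
labels = [1] * 54 + [0] * 208
splits = list(five_by_two_splits(labels))
```

Each classifier (LDA or ANN) would then be fitted on the first list of each pair and scored on the second, giving ten accuracy values per model.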
- Optimized criterion: TWIST system
The TWIST system is an ensemble of two algorithms: the "Training and Testing" (T&T) algorithm and the "Input Selection" (I.S.) algorithm.
The T&T system is a robust data resampling technique that is able to arrange the source sample into sub-samples that all possess a similar probability density function. In this way, the data are split into two or more sub-samples in order to train, test and validate the ANN models more effectively. The T&T is based on a population of n ANNs managed by an evolutionary system. In its simplest form, this algorithm reproduces several distribution models of the complete dataset DΓ (one for every ANN of the population) in two subsets (the Training Set and the Testing Set). During the learning process each ANN, according to its own data distribution model, is trained on the training subsample and blind-validated on the testing subsample.
The performance score reached by each ANN in the testing phase represents its "fitness" value (i.e., the individual probability of evolution). The genome of each "network-individual" thus codifies a data distribution model with an associated validation strategy. The n data distribution models are combined according to their fitness criteria using an evolutionary algorithm. The selection of "network-individuals" based on fitness determines the evolution of the population; that is, the progressive improvement of performance of each network until the optimal performance is reached, which is equivalent to the best division of the global dataset into subsets. The evolutionary algorithm mastering this process, named "Genetic Doping Algorithm" (GenD) [33], has similar characteristics to a genetic algorithm but is able to maintain an inner instability during the evolution, carrying out a natural increase of biodiversity and a continuous "evolution of the evolution" in the population. The elaboration of T&T is articulated in two phases:
- Preliminary phase: in this phase the parameters of the fitness function that will be used on the global dataset are evaluated. During this phase an inductor is configured, which consists of an ANN with a standard Back Propagation algorithm (A). For this inductor the optimal configuration to reach convergence is established at the end of different training trials on the global dataset DΓ; in this way the configuration that best "suits" the available dataset is determined: the number of layers and hidden units and some possible generalizations of the standard learning law. The parameters thus determined define the configuration and the initialization of all the individual-networks of the population and then stay fixed in the following computational phase. Basically, during this preliminary phase there is a fine-tuning of the inductor that defines the fitness values of the population's individuals during evolution.
The accuracy of the ANN performance with the testing set will be the fitness of that individual (that is, of that hypothesis of distribution into two halves of the whole dataset).
- Computational phase: the system extracts from the global dataset the best training and testing sets. During this phase the individual-networks of the population are run according to the established configuration and initialization parameters. From the evolution of the population, managed by the GenD algorithm, the best distribution of the global dataset DΓ into two subsets is generated, starting from the initial population of possible solutions. Preliminary experimental sessions are performed using several different initializations and configurations of the network in order to achieve the best partition of the global dataset.
I.S. runs in parallel with T&T. The I.S. system is an adaptive system, also based on the evolutionary algorithm GenD, which is able to evaluate the relevance of the different variables of the dataset in an intelligent way. It can therefore be considered a feature selection technique. From a formal point of view, I.S. is an artificial organism based on the GenD algorithm and consists of a population of ANNs, each of which carries out a selection of the independent variables in the available database. The elaboration of I.S., as for T&T, is developed in two phases:
- Preliminary phase: during this phase an inductor is configured to evaluate the parameters of the fitness function. This inductor is a standard Back-Propagation ANN. The parameter configuration and the initialization of the ANNs are carried out with particular care to avoid possible over-fitting problems that can arise when the database is characterized by a high number of variables describing a small number of records. The number of epochs E0 necessary to train the inductor is determined through preliminary experimental tests.
- Computational phase: the inductor is active, according to the established configuration and the fixed initialization parameters, to extract the most relevant variables of the training and testing subsets. Each individual-network of the population is trained on the training set and tested on the testing set.
The evolution of the individual-networks of the population is based on the GenD algorithm. In the I.S. approach the GenD genome is built from n binary values, where n is the cardinality of the original input space. Every gene indicates whether an input variable is to be used or not during the evaluation of the population fitness. Through the evolutionary algorithm, the different "hypotheses" of variable selection, generated by each ANN of the population, change over time, at each generation: this leads to the selection of the best combination of input variables. As in the T&T system, the genetic operators crossover and mutation are applied to the ANN population; the rates of occurrence for both operators are self-determined by the system in an adaptive way at each generation.
When the evolutionary algorithm no longer improves its performance, the process stops, and the best selection of the input variables is employed on the testing subset.
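The idea of a binary genome with one gene per input variable can be illustrated with a plain generational genetic algorithm. This sketch is a stand-in and not GenD: the truncation selection, one-point crossover and fixed mutation rate are assumptions (GenD adapts its rates itself), and `toy_fitness` is a hypothetical placeholder for the blind-test accuracy of an ANN trained on the masked inputs.

```python
import random
from typing import Callable, List

Genome = List[int]  # one 0/1 gene per input variable: 1 = variable is used


def evolve_mask(n_inputs: int, fitness: Callable[[Genome], float],
                pop_size: int = 20, generations: int = 30,
                p_mut: float = 0.05, seed: int = 0) -> Genome:
    """Plain generational GA over binary masks (a stand-in for GenD, not GenD itself)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_inputs)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]           # truncation selection
        children: List[Genome] = []
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_inputs)        # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]
            children.append(child)
        pop = children
        champion = max(pop, key=fitness)
        if fitness(champion) > fitness(best):       # keep the best mask seen so far
            best = champion
    return best


INFORMATIVE = {0, 3, 5}  # hypothetical "truly relevant" inputs for the toy fitness


def toy_fitness(mask: Genome) -> float:
    """Placeholder for ANN test accuracy: reward relevant inputs, penalise extras."""
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * (sum(mask) - hits)


best_mask = evolve_mask(8, toy_fitness)
```

In a real run the fitness call is the expensive step, since every mask requires training and testing a network on the selected variables.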
In order to improve the speed and the quality of the solutions that have to be optimized, the GenD algorithm makes the evolutionary process of the artificial populations more natural and less centered on the individual liberalism culture.
The combined action of the T&T and I.S. systems allows us to solve two frequent problems in managing ANNs. Both systems are based on a Genetic Algorithm, the Genetic Doping Algorithm (GenD), developed at Semeion Research Centre.
GenD was provided with 100 individuals, generated randomly. Each individual represents a possible distribution of the whole dataset into two subsets. Two independent Multilayer Perceptrons (MLPs) with 4 hidden units are trained for 100 epochs and tested in a blinded manner on the two subsets. A function of the testing results of the two independent MLPs defines the fitness of each individual.
A crossover function is applied to the population of 100 individuals and new individuals are generated. A mutation operator is applied to the new individuals and to the individuals whose fitness is weakest. In the GenD algorithm the rate of crossover and the rate of mutation are self-determined by the system in an adaptive way at each generation. This loop is applied for at least 300 generations, or stopped when the system has not shown any significant improvement for at least 50 generations. The individual whose distribution of the whole dataset into two subsets is the best one according to the blind testing results is saved and then used as the optimal distribution to train and test more sophisticated ANNs.
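How one T&T individual could be scored is sketched below. Each genome assigns every record to subset A (gene 0) or B (gene 1); a model is trained on one subset and blind-tested on the other, in both directions. Two points are assumptions: a nearest-centroid classifier replaces the 4-hidden-unit MLPs to keep the example short, and the fitness is taken as the mean of the two test accuracies, whereas the text only says "a function of the testing results".

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Record = Tuple[Vector, int]  # (input vector, class label 0 or 1)


def centroid(rows: List[Vector]) -> List[float]:
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def nearest_centroid_accuracy(train: List[Record], test: List[Record]) -> float:
    """Stand-in for the MLP: assign each test record to the closer class centroid."""
    cents: Dict[int, List[float]] = {
        c: centroid([x for x, y in train if y == c]) for c in (0, 1)
    }

    def predict(x: Vector) -> int:
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

    return sum(predict(x) == y for x, y in test) / len(test)


def split_fitness(genome: Sequence[int], data: List[Record]) -> float:
    """Mean of the two blind-test accuracies (train A/test B and train B/test A)."""
    a = [d for g, d in zip(genome, data) if g == 0]
    b = [d for g, d in zip(genome, data) if g == 1]
    return 0.5 * (nearest_centroid_accuracy(a, b) + nearest_centroid_accuracy(b, a))


# Hypothetical, well-separated toy data: four records per class.
data: List[Record] = [
    ((0.0, 0.1), 0), ((0.2, 0.0), 0), ((0.1, 0.2), 0), ((0.0, 0.0), 0),
    ((1.0, 0.9), 1), ((0.9, 1.0), 1), ((1.1, 1.0), 1), ((1.0, 1.1), 1),
]
genome = [0, 1, 0, 1, 0, 1, 0, 1]   # alternate records between subsets A and B
score = split_fitness(genome, data)
```

An evolutionary loop would evaluate this score for every individual in the population and keep the split hypotheses with the highest values.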
We implemented both algorithms in C and used a Pentium III CPU to run the system on real data. Each run took around 48 hours. Note that the T&T and I.S. algorithms have to be used only once to train the system. Once trained, the system can respond online to any new pattern.
After this processing, the features that were most significant for the classification were selected and at the same time the training set and the testing set were created with a function of probability distribution similar to the one that provided the best results in the classification.
A supervised Multi Layer Perceptron, with four hidden units, was then used for the classification task.
We enrolled 54 patients (mean age at onset of disease 59.62 years; range 53.7 – 65.5 years): 28 males (56.4 years; 46.9 – 65.8) and 26 females (62.9 years; 57.8 – 67.9), with a male/female ratio of 1.1:1. The site of clinical onset was spinal in 61.1% (33/54) and bulbar in 38.9% (21/54) of cases. The mean disease duration at the time of observation was 3.2 years (range 1–10 years). The frequency of bulbar onset was greater in females (16/29) than in males (5/28).
All patients were previously screened for SOD1 gene mutation by sequence analysis and no genetic variations were found.
Control subjects were 144 males and 67 females, with an age range of 21 to 75 years (average 38.94).
Table 1 summarizes the distribution of the SNPs in the two groups of patients and controls. The reliability of the whole molecular procedure (multiplex and hybridization steps) was checked by the TNF beta thr26asn polymorphism, which gave the same results in both strips A and B for the same subject analyzed (see [17,26] for details).
Table 1. Genotype distribution at each marker locus: wild type (WT); heterozygous type (Hetero); homozygous type (Homo).
Classification performances with ANNs
Table 2. Results obtained applying the random criterion and classifying with the linear discriminant analysis.
Table 3. Results obtained applying the random criterion and classifying with a simple back propagation.
In these experiments we applied the random criterion to divide the dataset five times into training and testing sub-sets, following the 5 × 2 cross-validation protocol.
The predictive accuracy obtained with Linear Discriminant Analysis and standard artificial neural networks ranged from 70% to 79% (average 75.31%) and from 69.1 to 86.2% (average 76.6%) respectively.
With the TWIST approach, every experiment was conducted in a blind and independent manner in two directions: training with sub-sample A and blind testing with sub-sample B vs training with sub-sample B and blind testing with sub-sample A. The results from the best five applications of TWIST procedures are reported in Table 4. This advanced intelligent system, through the final selection of a subgroup of 25–27 variables across ten independent applications, provided the highest predictive performance, with a sensitivity ranging from 92.0% to 100% (average 96.75%), a specificity ranging from 91.67% to 98.81% (average 95.78%) and an overall accuracy ranging from 94.4% to 97.6% (average 96.0%). In all the TWIST system experiments the 90% overall accuracy threshold was exceeded, whereas Back Propagation and Linear Discriminant Analysis never exceeded the 80% threshold.
Table 4. Results of ten experiments obtained applying the TWIST procedure in an independent manner to the whole dataset.
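The three performance figures can be computed from the confusion counts of a blind test, as in the sketch below. The counts are hypothetical, and overall accuracy is assumed to be the fraction of correctly classified subjects; the article does not spell out its formula.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Sensitivity, specificity and overall accuracy from confusion counts.

    tp/fn: cases classified correctly/incorrectly;
    tn/fp: controls classified correctly/incorrectly.
    """
    return {
        "sensitivity": tp / (tp + fn),                 # fraction of cases recognised
        "specificity": tn / (tn + fp),                 # fraction of controls recognised
        "accuracy": (tp + tn) / (tp + fn + tn + fp),   # assumed definition
    }


# Hypothetical counts for one testing sub-sample (27 cases, 104 controls).
m = classification_metrics(tp=26, fn=1, tn=100, fp=4)
```

Because controls outnumber cases roughly four to one, sensitivity and specificity are worth reporting alongside accuracy: a classifier that labelled everyone a control would still reach about 79% accuracy on this sample.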
Genetic variants selected by the five TWIST procedures
Seven genetic variants were always independently selected by the five TWIST procedures: apolipoprotein E (APOE) (chromosome 19q13.2) arg158cys; hepatic lipase (LIPC) (chromosome 15q21-23) -480 C/T; endothelial nitric oxide synthase (NOS3) (chromosome 7q36) 690 C/T and glu298asp; vitamin K-dependent coagulation factor seven (F7) (chromosome 13q34) arg353glu; glycoprotein Ia/IIa (ITGA2) (chromosome 5q23-q31) 873 G/A; and E-selectin (SELE) (chromosome 1q22-q25) ser128arg.
Table 5 gives the results obtained with ANNs using only the input data derived from these variants.
Table 5. Results obtained with ANNs using only the seven genetic variants selected by TWIST procedure.
Genetic variants independently selected by four TWIST procedures
The genetic variants selected in four of the five experiments were: peroxisome proliferator activated receptor gamma (PPARG) pro12ala (chromosome 3p25), lipoprotein lipase (LPL) asp9asn (chromosome 8p22), paraoxonase 1 (PON1) met55leu and paraoxonase 2 (PON2) ser311cys (chromosome 7q21.3), tumor necrosis factor beta (TNF beta) thr26asn (chromosome 6p21.3), methylenetetrahydrofolate reductase (MTHFR) 677 C/T (chromosome 1p36.3), angiotensin II receptor type 1 (AGTR1) 1166 A/C (chromosome 3q21-25), atrial natriuretic peptide (NPPA) 664 G/A (chromosome 1p36-21), epithelial Na channel subunit (SCNN1A) trp493arg (chromosome 12p13), FVII -232 ins/del, SELE leu554phe, and tumor necrosis factor alpha (TNF alpha) -376 G/A and -308 G/A (chromosome 6p21.3).
The TNF beta thr26asn was used as further control. First it was selected by four TWIST systems and later, since the information linked to such a variation was already recruited, none of the TWIST systems selected this SNP.
Genetic variants never selected by any TWIST procedures
The following gene/genetic variants were never selected by the five TWIST procedures: apolipoprotein A4 (APOA4) (chromosome 11q23) thr347ser; apolipoprotein C3 (APOC3) (chromosome 11q23.1-q23.2) -641 C/A and 482 C/T; beta 3 adrenergic receptor (ADRB3) trp64arg (8p12-p11.2); LPL ser447term; PON1 gln192arg; low density lipoprotein receptor (LDLR) (chromosome 19p13.3) exon 18 NcoI +/-; cholesteryl ester transfer protein (CETP) -631 C/A and -629 C/A (chromosome 16q21); NOS3 922 A/G; G-protein beta 3 subunit (GNB3) 825 C/T (chromosome 12p13); beta 2 adrenergic receptor (ADRB2) arg16gly (chromosome 5q31-q32); beta fibrinogen (FGB) -455 G/A (chromosome 4q28); TNF alpha -238 G/A and TNF beta thr26asn.
The mechanism of neurodegeneration in ALS remains an enigma. The major problem is that little is known about the disease mechanism, making candidate gene selection difficult and haphazard. It follows that an unconventional approach of making no a priori assumptions about the location of the variants of interest might be appropriate, provided that a similarly unconventional statistical approach is available to manage the data complexity.
Comparison of results obtained using three different analytical approaches (classical statistics, standard neural networks and advanced artificial neural networks), points out the need to employ systems that are really able of handling the disease complexity instead of treating the data with reductionist approaches unable to detect multiple genes of smaller effect in predisposing to the disease. The possibility of obtaining high diagnostic accuracy from limited and selected genetic information using these new analytical tools, shows that even in so-called sporadic ALS the genetic background plays a fundamental role.
Another important obstacle in approaching the molecular basis of a rare disease like ALS in a conventional manner, is related to the difficulty of finding a homogeneous sample population large enough to be analysed for a wide number of genetic variants. Artificial neural networks, at variance with the classical statistical tests, can manage complexity even with relatively small samples and the subsequent unbalanced ratio between variables and records. In this connection, it is important to note that adaptive learning algorithms of inference, based on the principle of a functional estimation like artificial neural networks, overcome the problem of dimensionality.
Internal validation of the prediction accuracy is one of the most important problems in neural networks analysis. In fact, the restriction of training procedures to only a part of the dataset, generally half of it, causes a potential loss of power to recognize hidden patterns. In this study optimization of the training and testing procedures were addressed using the evolutionary training and testing algorithm, which ensured that the two halves of the dataset contained the same amount of relevant information. Thus, the best division of the whole dataset into a training and a testing set was reached after a finite number of generations. Finally ANNs were able to identify gene combinations (allelic variants) that are likely to produce accurate predictions of ALS for a single individual, regardless of some possible limitations such as Male/Female ratio and age differences among the case and control groups. This study enrolled more than 50 medical cases with an accurate diagnosis of ALS and we were able to test them for 69 SNPs in 35 genes. Although the SALS patients analyzed represent a small cohort, it is nevertheless really representative from an epidemiological point of view (e.g. male/female ratio, bulbar/spinal ratio).
In addition, all ALS patients had previously been screened for SOD1 gene mutations with negative results, thus confirming the sporadic nature of the disease. However, the sample size of 54 cases analysed for more than 60 SNPs prompted us to look for valid, powerful and efficient statistical tools to approach and evaluate our data.
On the basis of the observed results, some inferences can be drawn about the methodological approaches used. The multiarray approach was previously validated by ourselves and others, and contains TNF beta as the internal control.
Indeed, ApoE arg158cys was selected by all five TWISTs, while ApoE cys112arg was selected only once. Among the NOS variants, position -922 in the promoter region was never selected, whereas the -690 variant, also in the promoter region, and the non-synonymous glu298asp variant were both selected by all five TWISTs. For Factor VII and Selectin (SELE), both genetic variants of each gene carry information needed to assign subjects correctly to disease or healthy status: FVII arg353glu and SELE ser128arg were selected five times, and FVII del/ins and SELE leu554phe four times. The role of paraoxonase in predisposing to ALS appears to be confirmed: PON1 met55leu and PON2 ser311cys were chosen four times, whereas PON1 gln192arg was never chosen. PPARγ pro12ala was chosen four times: we can assume a generic role of this receptor in ALS, since PPARγ is at the crossroads between lipid metabolism and innate immune response. In addition, we noticed, for example, that the same TNF locus, 6p21.3, also contains the HFE gene for hemochromatosis and the peripherin gene, both previously implicated in ALS.
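The selection counts reported above can be tabulated with a short Python sketch. The counts are copied from the text; the variable labels, the helper name `selected_at_least` and the choice of thresholds are assumptions made for illustration, and the non-synonymous NOS variant is labelled glu298asp, following the abstract. Variants not mentioned in this paragraph (for example LIPC and glycoprotein Ia/IIa) are left out because their counts are not stated here.

```python
N_RUNS = 5  # number of TWIST runs reported in the text

# Number of runs (out of five) in which each variant was selected,
# as stated in the text. Labels are informal shorthand.
selection_counts = {
    "ApoE arg158cys": 5,
    "ApoE cys112arg": 1,
    "NOS -922": 0,
    "NOS -690": 5,
    "NOS glu298asp": 5,
    "FVII arg353glu": 5,
    "FVII del/ins": 4,
    "SELE ser128arg": 5,
    "SELE leu554phe": 4,
    "PON1 met55leu": 4,
    "PON1 gln192arg": 0,
    "PON2 ser311cys": 4,
    "PPARgamma pro12ala": 4,
}


def selected_at_least(counts, k):
    """Return the variants selected in at least k runs, sorted by name."""
    return sorted(v for v, c in counts.items() if c >= k)


always = selected_at_least(selection_counts, N_RUNS)
never = sorted(v for v, c in selection_counts.items() if c == 0)

print("Selected in every run:", always)
print("Selected in at least 4 runs:", selected_at_least(selection_counts, 4))
print("Never selected:", never)
```

Reading the counts as a consensus across runs, rather than relying on a single run, is one simple way to separate variants that are picked consistently from those picked only occasionally.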
A few genetic variants were never selected by any of the TWIST procedures. One possible reason is that the information they carry had already been captured by the systems through other variants, e.g. for PON1, NOS and TNF. Moreover, the APOA4 and APOC3 variants lie on chromosome 11, which may not be involved in the disease at all. Indeed, a very recent paper on genome-wide genotyping in ALS found no SNPs associated with the disease on chromosome 11.
From a biological point of view, the identified gene variations confirm some already known results (ApoE and PON, for example) and point to new genes and genetic variations not previously known to be involved in the disease. Our results strengthen the case for the involvement of oxidative stress as well as the angiogenesis (NOS) and immune response (TNF) pathways. In addition, our results shed light on the involvement of lipid pathways (LIPC, PPARγ). Indeed, a role for polyunsaturated fatty acids has been postulated in the aggregation of misfolded proteins in several neurodegenerative diseases, including familial ALS. Furthermore, polyunsaturated fatty acids can be enzymatically converted into various lipid mediators, such as leukotrienes and prostaglandins, which have strong biological activity in several signalling pathways.
Our study focuses on disentangling the effect of multiple interacting low-penetrance alleles on complex diseases. We analysed genetic variables within genes possibly involved in ALS and, thanks to artificial intelligence agents such as those employed in this study, we were able to differentiate ALS cases from control subjects on the basis of only a subset of the genetic data. We still do not know which specific variation within the subset of SNPs is linked to the disease; however, ANNs are able to discriminate between cases and controls with only seven SNPs.
We are aware that this is an exploratory study and that it should be replicated in an independent and much larger sample. Nevertheless, this study offers new insight into genetic markers of sporadic ALS, pointing to the existence of a strong genetic background. The data provide useful information to direct future research on the complexity of the genetic profile of ALS subjects.
SP carried out the molecular genetic study, participated in the design of the study, and coordinated and drafted the manuscript. EG participated in the design of the study and in the statistical analysis, and coordinated and drafted the manuscript. MCP participated in genotyping. AM participated in conceiving the study. MB performed the statistical analysis and developed the intelligent systems. All authors read and approved the final manuscript.
We are indebted to the patients for their cooperation. We wish to thank Dr. S. Cheng and the Dept of Human Genetics, Roche Molecular Systems, Inc., Alameda, CA, USA, for supporting us with the arrays. We thank Drs. R. Causarano, S. Jann and R. Sterzi from the Neurology Unit of Niguarda Ca' Granda Hospital for the clinical selection of the patients.
Jiang Y-M, Yamamoto M, Kobayashi Y, Yoshihara T, Liang Y, Terao S, Takeuchi H, Ishigaki S, Katsuno M, Adachi H, Niwa J-I, Tanaka F, Doyu M, Yoshida M, Hashizume Y, Sobue G: Gene expression profile of spinal motor neurons in sporadic amyotrophic lateral sclerosis.
Schymick JC, Scholz SW, Fung H-C, Britton A, Arepalli S, Gibbs JR, Lombardo F, Matarin M, Kasperaviciute D, Hernandez DG, Crews C, Bruijn L, Rothstein J, Mora G, Restagno G, Chiò A, Singleton A, Hardy J, Traynor BJ: Genome wide genotyping in amyotrophic lateral sclerosis and neurologically normal controls: first stage analysis and public release of data. The Lancet Neurology 2007, 6:322-8.
Dunckley T, Huentelman MJ, Craig DW, Pearson JV, Szelinger S, Joshipura K, Halperin RF, Stamper C, Jensen KR, Letizia D, Hesterlee SE, Pestronk A, Levine T, Bertorini T, Graves MC, Mozaffar T, Jackson CE, Bosch P, McVey A, Dick A, Barohn R, Lomen-Hoerth C, Rosenfeld J, O'Connor DT, Zhang K, Crook R, Ryberg H, Hutton M, Katz J, Simpson EP, Mitsumoto H, Bowser R, Miller RG, Appel SH, Stephan DA: Whole-genome analysis of sporadic amyotrophic lateral sclerosis.
Kasperaviciute D, Weale ME, Shianna KV, Banks GT, Simpson CL, Hansen VK, Turner MR, Shaw CE, Al-Chalabi A, Pall HS, Goodall EF, Morrison KE, Orrell RW, Beck M, Jablonka S, Sendtner M, Brockington A, Ince PG, Hartley J, Nixon H, Shaw PJ, Schiavo G, Wood NW, Goldstein DB, Fisher EM: Large-scale pathways-based association study in amyotrophic lateral sclerosis.
Blauw HM, Veldink JH, van Es MA, van Vught PW, Saris CG, Zwaag B, Franke L, Burbach JP, Wokke JH, Ophoff RA, Berg LH: Copy-number variation in sporadic amyotrophic lateral sclerosis: a genome-wide screen. The Lancet Neurology 2008, 7:319-26.
Ritchie MD, White BC, Parker JS, Hahn LW, Moore JH: Optimization of neural network architecture using genetic programming improves detection and modeling of gene-gene interactions in studies of human diseases.
North BV, Curtis D, Cassell PG, Hitman GA, Sham PC: Assessing optimal Neural Network Architecture for identifying disease-associated multi-marker genotypes using permutation test, and application to Calpain 10 polymorphisms associated with diabetes.
J Neurology Sci 1999, 169:2-12.
Battistini S, Giannini F, Greco G, Bibbò G, Ferrera L, Marini V, Causarano R, Casula M, Lando G, Patrosso MC, Caponnetto C, Origine P, Marocchi A, Del Corona A, Siciliano G, Carrera P, Mascia V, Giagheddu M, Carcassi C, Orrù S, Garrè C, Penco S: SOD1 mutations in amyotrophic lateral sclerosis: results from a multicenter Italian study.
Bridle JS: Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. Edited by Fogelman-Soulié F, Hérault J. Neuro-computing: Algorithms, Architectures. Springer-Verlag, NY; 1989.
Expert Systems 2000, 17:149-170.
Neural Computation 1998, 10:1885-1924.
Buscema M, Grossi E, Intraligi M, Garbagna N, Andriulli A, Breda M: An Optimized Experimental Protocol Based on Neuro-Evolutionary Algorithms. Application to the Classification of Dyspeptic Patients and to the Prediction of the Effectiveness of Their Treatment.
Expert Systems 2004, 21:63-79.