The rise of machine learning: how to avoid the pitfalls in data analysis

Posted by Biome on 13th August 2014


The phrase artificial intelligence often brings to mind futuristic visions of human-like machines; however, the ability of a machine to learn is a concept that is already in play today, and one that is proving particularly popular in biomedical circles. Taking a machine learning approach enables a computer program to discern the key features of one dataset and then apply what it has learnt to make predictions about another dataset. The biomedical applications are numerous, with researchers looking into everything from the detection of brain cancers to the diagnosis of heart valve disease. However, as with any approach to data analysis, there are limitations that must be considered in order to judge how reliable the outcomes are. In a recent review article in BioMedical Engineering OnLine, Editor-in-Chief Kenneth Foster, together with Robert Koprowski from the Institute of Computer Science, Poland, and Joseph Skufca from Clarkson University, USA, explores some of the pitfalls and potential solutions. Here Foster explains more about the breadth of applications that machine learning lends itself to, and what researchers looking to employ this method should watch out for.

 

What is machine learning?

Machine learning refers generically to software that can learn from data. Familiar examples include optical character recognition, spam filtering, automatic face recognition, and various data mining applications. This technology is broadly related to artificial intelligence and statistical pattern recognition, and relies on computationally intensive statistical methods that have been developed over the past few decades. Machine learning has grown tremendously in popularity over the past decade, and is now appearing in widely sold statistical packages.

 

What applications does machine learning lend itself to?

Machine learning allows you to study datasets to identify trends that may not be apparent from more superficial analysis. Machine learning techniques are being employed by bench scientists for genomic analysis, and by drug companies to develop new drugs. ‘Infoveillance’ uses machine learning to detect disease outbreaks through statistical analysis of Google searches or postings on social media, at far lower cost than traditional methods. And of course the security agencies use machine learning and data mining to search for possible terrorist risks. Economically and scientifically, machine learning is a very big deal.

 

What are the pitfalls and/or challenges of using the machine learning approach?

There are always pitfalls when you retrospectively examine data and try to find patterns. There are the usual problems that arise from post hoc data analysis, compounded by the fact that machine learning techniques are complex and not at all transparent. How do you know whether the patterns that you find are real? Many investigators are using machine learning techniques but are not necessarily inclined or trained to assess the reliability of the results that they are getting.

My particular concern, as Editor-in-Chief of BioMedical Engineering OnLine, is that I see many studies that apply machine learning techniques to biomedical data such as electrocardiograms or mammograms and train ‘classifiers’ that are supposed to be able to distinguish between data from healthy subjects and data from those with disease. The authors say that the classifier can be used to detect or diagnose disease. There are many studies of this sort – just search on Google Scholar for ‘machine learning’ and ‘medical diagnosis’. But designing a medically useful diagnostic technique is a big project and the human consequences of being wrong can be very high.

I see two main problems with many of the machine learning studies I encounter as an editor. First, the studies are typically small, which opens the possibility of over-fitting. This means that you can easily train a classifier that separates the training data into ‘healthy’ and ‘diseased’ but does not generalise well when applied to new data. To design a successful classifier may require a larger data set than is usually available for the kinds of biomedical engineering studies that I have in mind.
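
As a rough illustration of how easily this happens (a sketch with purely synthetic data, not taken from our review; scikit-learn is assumed), the nearest-neighbour classifier below scores perfectly on a small training set whose labels are pure noise, and then performs no better than chance on held-out data:

```python
# Illustrative sketch only: synthetic data with no real signal, scikit-learn assumed.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))        # 40 "subjects", 500 noise features
y = rng.integers(0, 2, size=40)       # labels unrelated to the features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# A 1-nearest-neighbour classifier memorises the training data...
clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("training accuracy:", clf.score(X_train, y_train))  # 1.0 by construction
# ...but cannot generalise, because there was never anything to learn.
print("test accuracy:", clf.score(X_test, y_test))        # around 0.5, i.e. chance
```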

Second, to show that a classifier is medically useful would require clinical studies that are very different from the biomedical engineering studies that were done to develop the classifier in the first place. By proposing that a classifier can be used for diagnosis, an investigator might raise hopes in the medical community that would be very expensive and time consuming to follow up on. I want to sensitise readers of our journal to some of the pitfalls they might encounter with this approach, and help increase the chances that the technology they develop will eventually prove useful in the real world of medicine.

 

What key considerations should researchers keep in mind when deciding whether to adopt a machine learning approach and developing classifiers?

Machine learning techniques require access to a sufficiently large data set to succeed. This may not be possible in a small study. Moreover, the classifier needs to be developed in a medically informed way. Investigators need to treat machine learning techniques and the training of classifiers not as a simple add-on to a routine statistical analysis, but as part and parcel of the whole experimental process.

From my perspective, the main need is to develop classifiers that generalise well, and that work well with data from patients who are representative of the entire population for which the classifiers will be used.
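
A minimal sketch of what that means in practice (assumptions: scikit-learn, and synthetic stand-ins for real patient measurements): estimate performance only on data that played no part in fitting the classifier, for example with stratified cross-validation.

```python
# Illustrative sketch only: synthetic stand-in data, scikit-learn assumed.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))                        # 100 subjects, 20 features
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)  # label weakly tied to one feature

# Accuracy is measured on held-out folds, never on the data used for fitting.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

Even then, cross-validation within a single dataset says nothing about patients who were never sampled, which is why the representativeness of the data matters as much as the validation procedure.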

 

In your review in BioMedical Engineering OnLine you raise the issue of ‘transportability’. Can you explain what this is, and how it should be addressed?

If a classifier is developed by one group of investigators as a diagnostic method, it must work well in other medical centres, in the hands of people who might not be as committed to the method as the original investigators were. The same requirement, of course, applies to all diagnostic tests, not just those developed using machine learning.

While a small bioengineering group cannot undertake a large multicentre clinical trial of a new diagnostic technique, it can at least develop and validate classifiers using scientifically correct methods and a medically informed approach. Because it is so expensive and time consuming to develop medically useful diagnostic tests, it is important to establish at an early stage of development whether a new technique has little chance of succeeding, if in fact that is the case. Investigators are not inclined to look for reasons why their approaches might fail, but they should.
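
One way to make that concrete (a sketch under assumed conditions; the file names and the ‘diagnosis’ column are hypothetical) is external validation: fit the classifier on data collected at the original centre and report its performance on data collected independently elsewhere.

```python
# Illustrative sketch only: hypothetical files "centre_a.csv" and "centre_b.csv",
# each with the same feature columns and a binary "diagnosis" column;
# pandas and scikit-learn assumed.
import pandas as pd
from sklearn.linear_model import LogisticRegression

train = pd.read_csv("centre_a.csv")   # development data (original centre)
test = pd.read_csv("centre_b.csv")    # independent data from another centre

features = [c for c in train.columns if c != "diagnosis"]
clf = LogisticRegression(max_iter=1000).fit(train[features], train["diagnosis"])

# Performance on centre B, not centre A, is the relevant indication of how the
# classifier might behave in other hands and other patient populations.
print("external accuracy:", clf.score(test[features], test["diagnosis"]))
```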

 

With an increase in the number of studies using machine learning, what role does the publishing community have to play?

Reviewers and editors need to be able to assess the validity of the studies that report the use of these techniques, including the validation tests that the authors used. Authors need to report their methods in sufficient detail to allow editors and reviewers (and, eventually, readers) to assess the validity of the study design. Often they do not.

The problem is that many editors and reviewers simply do not have enough knowledge of the subtleties of the methodology to allow them to properly assess such studies. It may be time to consider referring papers to machine learning consultants, in the way that editors sometimes refer papers to statistics consultants. But of course experts in machine learning and related technologies are already swamped with projects of their own, and finding such consultants might not be easy for an editor. Perhaps we should use machine learning to find good reviewers!

 

More about the author(s)

Kenneth Foster, Professor, University of Pennsylvania, USA.

Kenneth Foster is a professor in the Department of Bioengineering at the University of Pennsylvania, USA. He received his PhD in physics from Indiana University, USA, after which he embarked on a career in research focused on the interaction between nonionising radiation and biological systems. His current research interests centre on the biomedical applications of nonionising radiation from audio through to microwave frequency ranges, as well as the health and safety aspects of electromagnetic fields. Foster is also a Fellow of the Institute of Electrical and Electronics Engineers and the American Institute for Medical and Biological Engineering.

Review

Machine learning, medical diagnosis, and biomedical engineering research - commentary

Foster KR, Koprowski R and Skufca JD
BioMedical Engineering OnLine 2014, 13:94


  • Ged Ridgway

    Thanks for an interesting and helpful article.

    I think the STARD statement (STAndards for the Reporting of Diagnostic accuracy studies, http://www.stard-statement.org) is useful and currently under-appreciated in machine learning, and I don’t think it’s mentioned in the article or blog post.

    STARD comes from a fairly traditional statistical background aimed at assessing simple scalar measures, so it doesn’t address over-fitting issues specific to high-dimensional machine learning, but those issues are (at least usually) well appreciated in the machine learning literature.

    For example, the STARD checklist makes clear that accuracy (and/or other measures such as sensitivity and specificity) should be accompanied by confidence intervals. This helps to address the issue of adequate (test) sample sizes that you discuss, since small test samples lead to wide confidence intervals. Small training sets with adequate test sets usually just lead to poor performance.
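
    As a quick sketch of what that looks like (the counts below are made up), a Wilson score interval shows how much wider the uncertainty is when a sensitivity of 90% is estimated from 20 diseased test cases rather than 200:

    ```python
    # Illustrative sketch only: made-up counts, 95% Wilson score interval.
    from math import sqrt

    def wilson_ci(successes, n, z=1.96):
        """Approximate 95% Wilson score interval for a binomial proportion."""
        p = successes / n
        denom = 1 + z**2 / n
        centre = (p + z**2 / (2 * n)) / denom
        half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
        return centre - half, centre + half

    # 90% sensitivity estimated from 20 diseased test cases...
    print("n=20 :", wilson_ci(18, 20))    # roughly (0.70, 0.97) -- very wide
    # ...versus the same 90% sensitivity estimated from 200 cases
    print("n=200:", wilson_ci(180, 200))  # roughly (0.85, 0.93) -- much tighter
    ```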

    Following up on that last point, I think your Figure 2 is perhaps a little too pessimistically interpreted. The balanced accuracy (average of sensitivity and specificity) is actually not bad above about 25 cases; it would be interesting to know the area under the ROC curve, rather than just the single pair of sensitivity and specificity values, as I think the latter perhaps exaggerates the instability of small training data-sets (i.e. some of the variation is just due to variation of the automatically determined cut-point, rather than true variation in performance). To be clear, I agree that more data is helpful, but I wouldn’t be quite so dismissive of smaller studies.
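
    For what it’s worth, here is a minimal sketch of that comparison (scikit-learn assumed; the labels, scores and cut-point are made up), reporting the threshold-free AUC alongside the balanced accuracy at a single arbitrary cut-point:

    ```python
    # Illustrative sketch only: made-up labels and scores, scikit-learn assumed.
    import numpy as np
    from sklearn.metrics import roc_auc_score, balanced_accuracy_score

    rng = np.random.default_rng(2)
    y_true = rng.integers(0, 2, size=50)               # hypothetical test labels
    scores = y_true + rng.normal(scale=1.5, size=50)   # hypothetical classifier scores

    auc = roc_auc_score(y_true, scores)                # summarises the whole ROC curve
    y_pred = (scores > 0.5).astype(int)                # one arbitrary cut-point
    bal_acc = balanced_accuracy_score(y_true, y_pred)  # mean of sensitivity and specificity
    print("AUC = %.2f, balanced accuracy at this cut-point = %.2f" % (auc, bal_acc))
    ```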