Big Data in Chemistry

Edited by Igor V. Tetko, Helmholtz Zentrum München, Germany

The increasing volume of biomedical data in chemistry and life sciences requires the development of new methodologies and approaches for their analysis. Artificial Intelligence (AI) and machine learning, especially neural networks, are increasingly used in the chemical industry, in particular with respect to Big Data.

The goal of this special collection in the Journal of Cheminformatics is to show progress and exemplify the current needs, trends and requirements for machine learning in chemical data analysis. In particular, it focuses on the use of chemical informatics and machine learning methodologies to analyse chemical Big Data, e.g. to predict biological activities and physico-chemical properties, facilitate property-oriented data mining, predict biological targets for compounds on a large scale, design new chemical compounds, and analyse large virtual chemical spaces.

The collection mainly contains a selection of articles to be presented during the BIGCHEM special session of the International Conference on Artificial Neural Networks (ICANN 2019), which is co-organized by the European Neural Network Society and the Horizon 2020 Marie Skłodowska-Curie Innovative Training Networks European Industrial Doctorate "Big Data in Chemistry" project.

From Big Data to Artificial Intelligence: chemoinformatics meets new challenges

The increasing volume of biomedical data in chemistry and life sciences requires development of new methods and approaches for their analysis. Artificial Intelligence and machine learning, especially neural ne...

Authors: Igor V. Tetko and Ola Engkvist

Citation: Journal of Cheminformatics 2020 12:74

Content type: Editorial Published on: 18 December 2020
AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning

We present the open-source AiZynthFinder software that can be readily used in retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable...

Authors: Samuel Genheden, Amol Thakkar, Veronika Chadimová, Jean-Louis Reymond, Ola Engkvist and Esben Bjerrum

Citation: Journal of Cheminformatics 2020 12:70

Content type: Software Published on: 17 November 2020
Memory-assisted reinforcement learning for diverse molecular de novo design

In de novo molecular design, recurrent neural networks (RNN) have been shown to be effective methods for sampling and generating novel chemical structures. Using a technique called reinforcement learning (RL),...

Authors: Thomas Blaschke, Ola Engkvist, Jürgen Bajorath and Hongming Chen

Citation: Journal of Cheminformatics 2020 12:68

Content type: Research article Published on: 10 November 2020
Molecular representations in AI-driven drug discovery: a review and practical guide

The technological advances of the past century, marked by the computer revolution and the advent of high-throughput screening technologies in drug discovery, opened the path to the computational analysis and v...

Authors: Laurianne David, Amol Thakkar, Rocío Mercado and Ola Engkvist

Citation: Journal of Cheminformatics 2020 12:56

Content type: Review Published on: 17 September 2020
One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome

Molecular fingerprints are essential cheminformatics tools for virtual screening and mapping chemical space. Among the different types of fingerprints, substructure fingerprints perform best for small molecule...

Authors: Alice Capecchi, Daniel Probst and Jean-Louis Reymond

Citation: Journal of Cheminformatics 2020 12:43

Content type: Research article Published on: 12 June 2020
QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction

Affinity fingerprints report the activity of small molecules across a set of assays, and thus permit to gather information about the bioactivities of structurally dissimilar compounds, where models based on ch...

Authors: Isidro Cortés-Ciriano, Ctibor Škuta, Andreas Bender and Daniel Svozil

Citation: Journal of Cheminformatics 2020 12:41

Content type: Research article Published on: 5 June 2020
QSAR-derived affinity fingerprints (part 1): fingerprint construction and modeling performance for similarity searching, bioactivity classification and scaffold hopping

An affinity fingerprint is the vector consisting of compound’s affinity or potency against the reference panel of protein targets. Here, we present the QAFFP fingerprint, 440 elements long in silico QSAR-based...

Authors: C. Škuta, I. Cortés-Ciriano, W. Dehaen, P. Kříž, G. J. P. van Westen, I. V. Tetko, A. Bender and D. Svozil

Citation: Journal of Cheminformatics 2020 12:39

Content type: Research article Published on: 29 May 2020
SMILES-based deep generative scaffold decorator for de-novo drug design

Molecular generative models trained with small sets of molecules represented as SMILES strings can generate large regions of the chemical space. Unfortunately, due to the sequential nature of SMILES strings, t...

Authors: Josep Arús-Pous, Atanas Patronov, Esben Jannik Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen and Ola Engkvist

Citation: Journal of Cheminformatics 2020 12:38

Content type: Research article Published on: 29 May 2020
Assessing the information content of structural and protein–ligand interaction representations for the classification of kinase inhibitor binding modes via machine learning and active learning

For kinase inhibitors, X-ray crystallography has revealed different types of binding modes. Currently, more than 2000 kinase inhibitors with known binding modes are available, which makes it possible to derive...

Authors: Raquel Rodríguez-Pérez, Filip Miljković and Jürgen Bajorath

Citation: Journal of Cheminformatics 2020 12:36

Content type: Research article Published on: 24 May 2020
Activity landscape image analysis using convolutional neural networks

Activity landscapes (ALs) are graphical representations that combine compound similarity and activity data. ALs are constructed for visualizing local and global structure–activity relationships (SARs) containe...

Authors: Javed Iqbal, Martin Vogt and Jürgen Bajorath

Citation: Journal of Cheminformatics 2020 12:34

Content type: Research article Published on: 18 May 2020
GEN: highly efficient SMILES explorer using autodidactic generative examination networks

Recurrent neural networks have been widely used to generate millions of de novo molecules in defined chemical spaces. Reported deep generative models are exclusively based on LSTM and/or GRU units and frequent...

Authors: Ruud van Deursen, Peter Ertl, Igor V. Tetko and Guillaume Godin

Citation: Journal of Cheminformatics 2020 12:22

Content type: Research article Published on: 10 April 2020
COVER: conformational oversampling as data augmentation for molecules

Training neural networks with small and imbalanced datasets often leads to overfitting and disregard of the minority class. For predictive toxicology, however, models with a good balance between sensitivity an...

Authors: Jennifer Hemmerich, Ece Asilar and Gerhard F. Ecker

Citation: Journal of Cheminformatics 2020 12:18

Content type: Research article Published on: 18 March 2020
Transformer-CNN: Swiss knife for QSAR modeling and interpretation

We present SMILES-embeddings derived from the internal encoder state of a Transformer [1] model trained to canonize SMILES as a Seq2Seq problem. Using a CharNN [2] architecture upon the embeddings results in high...

Authors: Pavel Karpov, Guillaume Godin and Igor V. Tetko

Citation: Journal of Cheminformatics 2020 12:17

Content type: Research article Published on: 18 March 2020
Mol-CycleGAN: a generative model for molecular optimization

Designing a molecule with desired properties is one of the biggest challenges in drug development, as it requires optimization of chemical compound structures with respect to many complex properties. To improv...

Authors: Łukasz Maziarka, Agnieszka Pocha, Jan Kaczmarczyk, Krzysztof Rataj, Tomasz Danel and Michał Warchoł

Citation: Journal of Cheminformatics 2020 12:2

Content type: Research article Published on: 8 January 2020
Building attention and edge message passing neural networks for bioactivity and physical–chemical property prediction

Neural Message Passing for graphs is a promising and relatively recent approach for applying Machine Learning to networked data. As molecules can be described intrinsically as a molecular graph, it makes sense...

Authors: M. Withnall, E. Lindelöf, O. Engkvist and H. Chen

Citation: Journal of Cheminformatics 2020 12:1

Content type: Research article Published on: 8 January 2020
A de novo molecular generation method using latent vector based generative adversarial network

Deep learning methods applied to drug discovery have been used to generate novel structures. In this study, we propose a new deep learning architecture, LatentGAN, which combines an autoencoder and a generativ...

Authors: Oleksii Prykhodko, Simon Viet Johansson, Panagiotis-Christos Kotsias, Josep Arús-Pous, Esben Jannik Bjerrum, Ola Engkvist and Hongming Chen

Citation: Journal of Cheminformatics 2019 11:74

Content type: Research article Published on: 3 December 2019
Randomized SMILES strings improve the quality of molecular generative models

Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings, have shown the capacity to create large chemical spaces of valid and meaningful structures. He...

Authors: Josep Arús-Pous, Simon Viet Johansson, Oleksii Prykhodko, Esben Jannik Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen and Ola Engkvist

Citation: Journal of Cheminformatics 2019 11:71

Content type: Research article Published on: 21 November 2019
Combining structural and bioactivity-based fingerprints improves prediction performance and scaffold hopping capability

This study aims at improving upon existing activity predictions methods by augmenting chemical structure fingerprints with bio-activity based fingerprints derived from high-throughput screening (HTS) data (HTS...

Authors: Oliver Laufkötter, Noé Sturm, Jürgen Bajorath, Hongming Chen and Ola Engkvist

Citation: Journal of Cheminformatics 2019 11:54

Content type: Research article Published on: 8 August 2019