Big Data in Chemistry

Edited by Igor V. Tetko, Helmholtz Zentrum München, Germany

The increasing volume of biomedical data in chemistry and the life sciences requires new methodologies and approaches for its analysis. Artificial Intelligence (AI) and machine learning, especially neural networks, are increasingly used in the chemical industry, in particular for Big Data.

The goal of this special collection in Journal of Cheminformatics is to show progress and exemplify the current needs, trends and requirements for machine learning in chemical data analysis. In particular, it focuses on the use of chemical informatics and machine learning methodologies to analyse chemical Big Data, e.g. to predict biological activities and physico-chemical properties, facilitate property-oriented data mining, predict biological targets for compounds on a large scale, design new chemical compounds, and analyse large virtual chemical spaces.

The collection mainly contains a selection of articles to be presented during the BIGCHEM special session of the International Conference on Artificial Neural Networks (ICANN2019), which is co-organized by the European Neural Network Society and the Horizon 2020 Marie Skłodowska-Curie Innovative Training Networks European Industrial Doctorate "Big Data in Chemistry" project.


  1. We present the open-source AiZynthFinder software that can be readily used in retrosynthetic planning. The algorithm is based on a Monte Carlo tree search that recursively breaks down a molecule to purchasable...

    Authors: Samuel Genheden, Amol Thakkar, Veronika Chadimová, Jean-Louis Reymond, Ola Engkvist and Esben Bjerrum
    Citation: Journal of Cheminformatics 2020 12:70
  2. Affinity fingerprints report the activity of small molecules across a set of assays, and thus make it possible to gather information about the bioactivities of structurally dissimilar compounds, where models based on ch...

    Authors: Isidro Cortés-Ciriano, Ctibor Škuta, Andreas Bender and Daniel Svozil
    Citation: Journal of Cheminformatics 2020 12:41
  3. An affinity fingerprint is a vector consisting of a compound’s affinity or potency against a reference panel of protein targets. Here, we present the QAFFP fingerprint, 440 elements long in silico QSAR-based...

    Authors: C. Škuta, I. Cortés-Ciriano, W. Dehaen, P. Kříž, G. J. P. van Westen, I. V. Tetko, A. Bender and D. Svozil
    Citation: Journal of Cheminformatics 2020 12:39
  4. Molecular generative models trained with small sets of molecules represented as SMILES strings can generate large regions of the chemical space. Unfortunately, due to the sequential nature of SMILES strings, t...

    Authors: Josep Arús-Pous, Atanas Patronov, Esben Jannik Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen and Ola Engkvist
    Citation: Journal of Cheminformatics 2020 12:38
  5. For kinase inhibitors, X-ray crystallography has revealed different types of binding modes. Currently, more than 2000 kinase inhibitors with known binding modes are available, which makes it possible to derive...

    Authors: Raquel Rodríguez-Pérez, Filip Miljković and Jürgen Bajorath
    Citation: Journal of Cheminformatics 2020 12:36
  6. Recurrent neural networks have been widely used to generate millions of de novo molecules in defined chemical spaces. Reported deep generative models are exclusively based on LSTM and/or GRU units and frequent...

    Authors: Ruud van Deursen, Peter Ertl, Igor V. Tetko and Guillaume Godin
    Citation: Journal of Cheminformatics 2020 12:22
  7. Training neural networks with small and imbalanced datasets often leads to overfitting and disregard of the minority class. For predictive toxicology, however, models with a good balance between sensitivity an...

    Authors: Jennifer Hemmerich, Ece Asilar and Gerhard F. Ecker
    Citation: Journal of Cheminformatics 2020 12:18
  8. We present SMILES-embeddings derived from the internal encoder state of a Transformer [1] model trained to canonize SMILES as a Seq2Seq problem. Using a CharNN [2] architecture upon the embeddings results in high...

    Authors: Pavel Karpov, Guillaume Godin and Igor V. Tetko
    Citation: Journal of Cheminformatics 2020 12:17
  9. Designing a molecule with desired properties is one of the biggest challenges in drug development, as it requires optimization of chemical compound structures with respect to many complex properties. To improv...

    Authors: Łukasz Maziarka, Agnieszka Pocha, Jan Kaczmarczyk, Krzysztof Rataj, Tomasz Danel and Michał Warchoł
    Citation: Journal of Cheminformatics 2020 12:2
  10. Neural Message Passing for graphs is a promising and relatively recent approach for applying Machine Learning to networked data. As molecules can be described intrinsically as a molecular graph, it makes sense...

    Authors: M. Withnall, E. Lindelöf, O. Engkvist and H. Chen
    Citation: Journal of Cheminformatics 2020 12:1
  11. Deep learning methods applied to drug discovery have been used to generate novel structures. In this study, we propose a new deep learning architecture, LatentGAN, which combines an autoencoder and a generativ...

    Authors: Oleksii Prykhodko, Simon Viet Johansson, Panagiotis-Christos Kotsias, Josep Arús-Pous, Esben Jannik Bjerrum, Ola Engkvist and Hongming Chen
    Citation: Journal of Cheminformatics 2019 11:74
  12. Recurrent Neural Networks (RNNs) trained with a set of molecules represented as unique (canonical) SMILES strings have shown the capacity to create large chemical spaces of valid and meaningful structures. He...

    Authors: Josep Arús-Pous, Simon Viet Johansson, Oleksii Prykhodko, Esben Jannik Bjerrum, Christian Tyrchan, Jean-Louis Reymond, Hongming Chen and Ola Engkvist
    Citation: Journal of Cheminformatics 2019 11:71
  13. This study aims at improving upon existing activity prediction methods by augmenting chemical structure fingerprints with bio-activity based fingerprints derived from high-throughput screening (HTS) data (HTS...

    Authors: Oliver Laufkötter, Noé Sturm, Jürgen Bajorath, Hongming Chen and Ola Engkvist
    Citation: Journal of Cheminformatics 2019 11:54