Data Science To Empower
Life Science Innovation

Empowering drug discovery with Big Data and Artificial Intelligence

“Big Data” is any data that requires dedicated technological and infrastructural investment before meaningful insights can be drawn from it. Two factors drive it: the exponential growth of data through increased usage, and the need to integrate these datasets to gain valuable insights. The data generated in drug discovery processes is a good example (1).

This blog provides insights into the various types of Big Data in drug discovery and highlights how machine learning (ML) approaches applied to this data can fast-track the drug discovery process.

What is Big Data in drug discovery?

Big Data in drug discovery refers to the data collected from the biological, chemical, pharmacological and clinical domains (2). These datasets are characterized by rapid generation, large size, complexity, heterogeneity and high value with commercial opportunities. Some of the large datasets used in drug discovery processes are highlighted below:

Biology datasets:

Biological data supports several steps of drug discovery: understanding the mechanisms underlying a disease state, predicting and validating potential target proteins for therapeutics, developing new bioassay techniques to identify treatment modalities for those targets, predicting how treatments will interact with the body when given to a patient, and finally designing effective clinical trials (2).

The data types that define biological data include: drug target data, OMICS data (genomic, transcriptomic, proteomic and metabolomic data), exome data, GWAS data, gene expression data, disease-relevant animal and cellular model data, and gene knockout or knockdown data.

Chemistry datasets:

Chemistry datasets are useful in the design of high-throughput screening libraries, which assist in identifying and validating therapeutic targets in silico. These datasets help predict the molecular properties required of drug compounds and provide insight into how those molecules interact with biological macromolecules (3).

The data types that define chemistry data are: chemical structural representations, chemical line notations or identifiers (SMILES and InChI), molecular property descriptors, topological descriptors, topographical descriptors, structure-activity relationship (SAR) data and compound-specific biological data.
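To make the link between line notations and molecular property descriptors concrete, here is a minimal Python sketch. It reads the molecular formula layer out of a standard InChI string and computes molecular weight, one of the simplest property descriptors. The function names and the small atomic weight table are assumptions made for illustration; production workflows would normally rely on a cheminformatics toolkit rather than hand-rolled parsing, and this sketch only handles simple, uncharged, single-component formulas.

```python
import re

# Rounded standard atomic weights for a few common elements (illustrative subset).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def formula_from_inchi(inchi: str) -> str:
    """Return the molecular formula layer of a standard InChI string."""
    layers = inchi.split("/")
    if len(layers) < 2 or not layers[0].startswith("InChI="):
        raise ValueError("not a recognizable InChI string: %r" % inchi)
    # The first layer after the version prefix is the molecular formula.
    return layers[1]


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C2H6O'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means one atom; unknown elements raise KeyError.
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


# Ethanol: SMILES "CCO", shown here with its standard InChI.
ethanol_inchi = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
formula = formula_from_inchi(ethanol_inchi)
print(formula, round(molecular_weight(formula), 2))
```

For ethanol this gives the formula C2H6O and a molecular weight of about 46.07 g/mol. Topological and topographical descriptors need the full connectivity or 3D structure, which is beyond what the formula layer alone can provide.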

Pharmacology datasets:

Pharmacological data in drug discovery combines information about compounds or drugs tested in animal models with assay data on protein targets in cell- or tissue-based models. Together these allow the effects of compounds to be investigated at different levels of biological complexity (4).

The data types that define pharmacological data are: absorption, distribution, metabolism, elimination and toxicity (ADMET) data, and functional in vitro and in vivo assay properties.

Clinical datasets:

Clinical datasets in drug discovery provide valuable patient-related information (5).

The data types that define the clinical datasets are: safety and efficacy data, treatment response and side-effect profiles, patient stratification data, competitive landscape and trial design data.

The information contained in all these large and complex datasets offers opportunities to explore and understand the mechanisms associated with a disease state, and makes it possible to prevent and treat such conditions.

What is artificial intelligence and what are its applications in drug discovery?

Scientists working in drug discovery research around the world generate voluminous pharmaceutical Big Data that is, by nature, multisource and multidimensional. It is becoming increasingly difficult not only to stay informed on all the available literature, but also to properly parse and integrate this Big Data into one’s own workflows across research projects. To overcome these hurdles, pharmaceutical and information technology companies have adopted artificial intelligence (AI) technologies to provide robust solutions that could fast-track the drug discovery process.

A machine is said to exhibit artificial intelligence (AI) when it displays human cognitive skills such as the ability to learn and solve problems (6). AI comprises technologies such as Machine Learning and Deep Learning methods. Machine Learning methods are well established for learning and predicting novel properties, while Deep Learning methods show great promise in drug design owing to their powerful generalization and feature extraction capabilities. Both have made remarkable progress in usefulness and applicability, and offer opportunities across all stages of drug discovery (7,8).

Some of the applications of artificial intelligence in drug discovery include:

  • Protein design and function
    • Prediction of protein folding
    • Prediction of protein-protein interactions
  • Hit discovery
    • Generation of chemical libraries or new molecule fingerprints
    • Virtual screening
    • Drug repurposing
  • Hit to lead optimization
    • Generating models for de novo design of drugs
    • QSAR model prediction
    • Prediction of molecular descriptors
    • Prediction of topological & topographical descriptors
  • Prediction of ADMET properties
    • Prediction of pharmacokinetic parameters like ADME properties
    • Prediction of toxicity properties
    • Pharmacodynamics modeling
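
As an illustration of how fingerprints feed into virtual screening, the sketch below ranks a small library against a query compound using the Tanimoto coefficient, one commonly used similarity measure for binary fingerprints. The fingerprints here are toy sets of "on" bit positions invented for the example, and the compound names, the `screen` helper and the 0.5 threshold are all assumptions rather than recommended settings.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def screen(query: Set[int], library: Dict[str, Set[int]],
           threshold: float = 0.5) -> List[Tuple[str, float]]:
    """Return library compounds at or above the threshold, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [item for item in scored if item[1] >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical fingerprints, each a set of on-bit positions.
query_fp = {1, 4, 7, 9}
toy_library = {
    "cmpd_A": {1, 4, 7, 9, 12},
    "cmpd_B": {1, 4, 20},
    "cmpd_C": {30, 31},
}

for name, score in screen(query_fp, toy_library):
    print(name, score)
```

With these toy inputs only cmpd_A passes the threshold. In practice the fingerprints would come from a cheminformatics toolkit and the similarity ranking would be one filter among several, alongside the QSAR and ADMET predictions listed above.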

Challenges and limitations associated with Big Data & AI in drug discovery

Some of the major challenges associated with Big Data in drug discovery include data generation, data integration, data quality, and data storage and management (2). Other critical challenges are errors in reproducibility and standardization of data, format difficulties for chemical structure representations, missing original data, lack of contextual information, insufficient availability of disease-relevant human data in some disease areas, the curse of dimensionality, bias in data, gaps in the fundamental understanding of many diseases, issues in clinical translation for target discovery and validation, and complexities in managing entity namespaces and ontologies. In addition, protecting patient data and de-identifying personal data are legitimate concerns with respect to data storage and management.

Although artificial intelligence technologies are promising, studies applying them still have limitations (8). Processing and analyzing large amounts of data can affect the performance and reliability of the resulting models. Interpreting complex data, such as data on biological mechanisms, is another limitation of models generated with these methods.

How Excelra can support your AI-based drug discovery programs

Standardized, high-quality datasets are essential for AI/ML-based drug discovery programs. Excelra’s GOSTAR is the world’s largest medicinal chemistry intelligence database, providing comprehensive and structured SAR data for more than 8 million compounds. Available as a ‘one-stop data source’ for in silico drug discovery, GOSTAR captures a variety of small molecule activities, encompassing SAR, physicochemical, metabolic, ADME and toxicological profiles, in a relational database format.

GOSTAR datasets are created with industry-accepted ontologies and can be delivered in flexible file formats such as:

  • Flat files
  • Hierarchical files
  • Databases (Oracle, MySQL, etc.)
  • Semantic format

“10 of the Top 20 pharma companies utilize GOSTAR to support their drug discovery programs”

To know more, visit and download a brochure.

Try GOSTAR today. To schedule a free demo, write to us on:


  1. Benke K, Benke G. Artificial Intelligence and Big Data in Public Health. Int J Environ Res Public Health. 2018;15(12).
  2. Brown N, Cambruzzi J, Cox PJ, Davies M, Dunbar J, Plumbley D, et al. Big Data in Drug Discovery. Prog Med Chem. 2018;57(1):277–356.
  3. Petitjean M, Camproux A-C. In Silico Medicinal Chemistry: Computational Methods to Support Drug Design. Edited by Nathan Brown. ChemMedChem. 2016;11(13):1480–1.
  4. Xie L, Draizen EJ, Bourne PE. Harnessing Big Data for Systems Pharmacology. Annu Rev Pharmacol Toxicol. 2017;57:245–62.
  5. Singh G, Schulthess D, Hughes N, Vannieuwenhuyse B, Kalra D. Real world big data for clinical research and drug development. Drug Discov Today. 2018;23(3):652–60.
  6. Christina Aguis. Evolution of AI: Past, Present, Future [Internet]. 2019. Available from:
  7. Zhong F, Xing J, Li X, Liu X, Fu Z, Xiong Z, et al. Artificial intelligence in drug design. Sci China Life Sci. 2018;61(10):1191–204.
  8. Zhang L, Tan J, Han D, Zhu H. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today. 2017;22(11):1680–5.