The Drug-Target Interaction Heatmap

A heatmap is a two-dimensional data visualization that displays the magnitude of a phenomenon as color. The color may vary by hue or intensity, giving the reader clear visual cues about how the phenomenon clusters or varies over space. Heatmaps fall into two types: cluster heatmaps and spatial heatmaps. In a cluster heatmap, magnitudes are laid out in a matrix of fixed cell size whose rows and columns are discrete phenomena and categories. The sorting of rows and columns is intentional and somewhat arbitrary, chosen to suggest clusters or to portray clusters discovered through statistical analysis. The cell size is also arbitrary, but it must be large enough to be clearly visible. In a spatial heatmap, by contrast, the position of a magnitude is determined by its location in the space being mapped, and there is no concept of cells; the phenomenon is assumed to vary continuously.

Whether working with small or large datasets, data scientists and analysts look for the essential relationships among data points and the characteristics of those points. Heatmaps depict these data points and their interactions in a high-dimensional context without becoming overly compressed or visually cluttered. In data analysis, a heatmap plots selected variables along its rows and/or columns.

The drug-target interaction heatmap supports decisions about the potential off-target activity of drug candidates early in new drug development and drug repurposing workflows. It gives a clear picture of the interactions between drugs and their targets, so that the most important interactions can be identified quickly.
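To make the idea concrete, here is a minimal sketch of a drug-target heatmap rendered as text. The drug names, target names, pIC50 values, the 4 to 9 color scale, and the 7.0 cut-off are all made up for illustration; this is not how GOSTAR draws its heatmap.

```python
# Hypothetical pIC50 values (higher = stronger interaction); not real measurements.
DRUGS = ["drug_A", "drug_B", "drug_C"]
TARGETS = ["T1", "T2", "T3", "T4"]
PIC50 = [
    [8.2, 5.1, 4.9, 6.8],
    [5.0, 7.9, 5.2, 5.1],
    [6.1, 6.0, 8.5, 7.4],
]
SHADES = " .:*#"  # intensity ramp, from low to high magnitude


def shade(value, lo=4.0, hi=9.0):
    """Map a magnitude onto one of the intensity characters."""
    frac = min(max((value - lo) / (hi - lo), 0.0), 1.0)
    return SHADES[min(int(frac * len(SHADES)), len(SHADES) - 1)]


def render(matrix, rows, cols):
    """Return the heatmap as text, one fixed-size cell per drug-target pair."""
    lines = [" " * 8 + " ".join(f"{c:>3}" for c in cols)]
    for name, values in zip(rows, matrix):
        lines.append(f"{name:<8}" + " ".join(f"{shade(v):>3}" for v in values))
    return "\n".join(lines)


def strong_interactions(matrix, rows, cols, threshold=7.0):
    """List (drug, target, value) cells at or above the threshold, strongest first."""
    hits = [
        (r, c, v)
        for r, values in zip(rows, matrix)
        for c, v in zip(cols, values)
        if v >= threshold
    ]
    return sorted(hits, key=lambda hit: -hit[2])


print(render(PIC50, DRUGS, TARGETS))
for drug, target, value in strong_interactions(PIC50, DRUGS, TARGETS):
    print(drug, target, value)
```

In this toy matrix, drug_C shows a second strong cell besides its strongest one, which is the kind of pattern that would prompt a closer look at off-target activity.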

GOSTAR presents these findings in a visually intuitive manner, allowing end-users to easily interpret the data and draw conclusions.


Matched Molecular Pair Analysis

The difficulty in molecular design lies in deciding what to do next based on existing data, medicinal chemistry knowledge, experience, and intuition. In small compound sets, a skilled chemist can discern trends and correlations by eye. As the number of molecules increases, more methodical procedures are required.

Matched Molecular Pair (MMP) analysis, which compares closely related chemical structures pairwise across a large dataset, is one method in the medicinal chemist’s toolbox for accomplishing this. Because the two structures in a pair differ only slightly, any change in a physical or biological property between them can be interpreted more easily.

In 2004, Kenny and Sadowski coined the term matched molecular pair (MMP) for a subset of QSAR; it is now a widely used concept in drug design [1]. Matched molecular pairs differ only by small single-point alterations, which are referred to as chemical transformations. Because the structural difference between the two molecules is minimal, any difference in physical properties or observed biological effects can be attributed to it more directly. In 2010, Hussain and Rea published an algorithm that finds matched molecular pairs and relates each transformation to the distribution of property differences it produces [2]; it has since become a popular tool for analyzing large chemistry datasets.

An MMP is typically defined as a pair of compounds that differ structurally at a single site through a well-defined transformation, accompanied by a change in a property value. The relationship between the structural change and the property change is used to rationalize observed structure-property relationships (SPR) and to guide compound optimization. Besides supporting hypothesis generation and testing, MMP analysis can also reveal outliers, such as a pair of compounds with an abrupt change in a property, known as an activity cliff. Such pairs are often the most interesting to investigate when optimizing compounds for the property that exhibits the change.
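The pairing logic can be sketched in a few lines. This is a simplified illustration, not the published Hussain and Rea algorithm or GOSTAR's implementation. It assumes each compound has already been split into a core and a single substituent. The compound ids, core labels, pIC50 values, and the 2.0 log-unit cliff threshold are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, pre-fragmented records: (compound id, core, substituent, pIC50).
# A real workflow would derive core and substituent by cutting bonds in each structure.
RECORDS = [
    ("cpd1", "core_X", "H", 5.1),
    ("cpd2", "core_X", "F", 5.4),
    ("cpd3", "core_X", "OMe", 7.6),
    ("cpd4", "core_Y", "H", 6.0),
    ("cpd5", "core_Y", "F", 6.2),
]


def matched_pairs(records):
    """Pair compounds that share a core and differ only in the substituent."""
    by_core = defaultdict(list)
    for cid, core, sub, value in records:
        by_core[core].append((cid, sub, value))
    pairs = []
    for members in by_core.values():
        for (id_a, sub_a, val_a), (id_b, sub_b, val_b) in combinations(members, 2):
            if sub_a != sub_b:
                pairs.append((id_a, id_b, f"{sub_a}>>{sub_b}", val_b - val_a))
    return pairs


def mean_delta_by_transformation(pairs):
    """Average the property change observed for each transformation."""
    deltas = defaultdict(list)
    for _, _, transformation, delta in pairs:
        deltas[transformation].append(delta)
    return {t: sum(d) / len(d) for t, d in deltas.items()}


def activity_cliffs(pairs, min_jump=2.0):
    """Flag pairs whose property changes by at least min_jump log units (assumed cut-off)."""
    return [p for p in pairs if abs(p[3]) >= min_jump]


PAIRS = matched_pairs(RECORDS)
print(mean_delta_by_transformation(PAIRS))
print(activity_cliffs(PAIRS))
```

Indexing compounds by their shared core avoids comparing every compound with every other one, which is what makes this style of analysis practical on large datasets.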

GOSTAR provides tools for determining the matched molecular pairs and analyzing activity landscapes across compound datasets.

References

  1. Kenny PW, Sadowski J. Structure modification in chemical databases. In: Oprea T, editor. Chemoinformatics in Drug Discovery. Weinheim, Germany: Wiley-VCH; 2004, 271.
  2. Hussain J, Rea C. Computationally efficient algorithm to identify matched molecular pairs (MMPs) in large data sets. J Chem Inf Model. 2010, 50(3), 339-348.

Interactive Property Space Exploration

Lipophilicity plays a significant role in small-molecule drug design and discovery. The lipophilicity of an organic compound can be described by its partition coefficient, P: the ratio of the un-ionized compound’s concentration in the organic phase (typically n-octanol) to its concentration in the aqueous phase at equilibrium. It is usually reported as its logarithm, logP. For compounds containing ionizable groups, the distribution of species depends on pH, and the lipophilicity of a molecule is affected by its ionization state. The distribution coefficient (logD) therefore accounts for the dissociation of weak acids and bases at a given pH. Highly lipophilic substances are often poorly soluble in aqueous conditions. They may, however, dissolve well in oils and lipids, making them good candidates for lipid-based formulations.
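The link between logP and logD can be written down for the simplest case. The sketch below assumes a compound with a single ionizable group and assumes that only the neutral form enters the organic phase; the function name and the example values are illustrative.

```python
import math


def log_d(log_p, pka, ph, kind):
    """Estimate logD for a monoprotic acid or base from logP, pKa, and pH.

    Assumes only the neutral (un-ionized) form partitions into the organic phase.
    """
    if kind == "acid":
        ionized_ratio = 10 ** (ph - pka)  # [A-] / [HA]
    elif kind == "base":
        ionized_ratio = 10 ** (pka - ph)  # [BH+] / [B]
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_p - math.log10(1 + ionized_ratio)


# Hypothetical weak acid: logP 3.0, pKa 4.4, evaluated at two pH values.
print(round(log_d(3.0, 4.4, 7.4, "acid"), 2))  # mostly ionized: logD falls far below logP
print(round(log_d(3.0, 4.4, 1.4, "acid"), 2))  # mostly neutral: logD stays close to logP
```

For an acid three pH units above its pKa, roughly 99.9% of the compound is ionized, so logD drops about three log units below logP.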

Lipophilicity influences potency, selectivity, and permeability, as well as absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. High lipophilicity (logP greater than five) is associated with limited solubility, increased clearance, and poor oral absorption. Highly lipophilic drugs also tend to interact with hydrophobic targets other than the primary target, which increases promiscuity and toxicity. Low lipophilicity can reduce permeability and potency, resulting in lower bioavailability and overall efficacy. For oral drugs, compounds with logP between one and four are thought to have better physicochemical and ADME properties.
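These rules of thumb can be expressed as a simple triage function. This sketch takes the preferred window to be logP strictly between one and four. The band names and the handling of values between four and five are assumptions made for illustration.

```python
def lipophilicity_band(log_p):
    """Sort a logP value into the rough bands described above."""
    if log_p > 5:
        return "high"          # limited solubility, clearance and absorption risk
    if 1 < log_p < 4:
        return "preferred"     # window associated with better oral-drug properties
    if log_p <= 1:
        return "low"           # permeability and potency may suffer
    return "intermediate"      # 4 to 5: outside the preferred window, not yet "high"


for value in (0.5, 2.7, 4.5, 5.8):  # hypothetical logP values
    print(value, lipophilicity_band(value))
```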

Lipophilicity is often regarded as a key indicator of potential promiscuity, with many property–promiscuity studies indicating that drug promiscuity rises as lipophilicity increases. This trend is a concern because increasing a molecule’s lipophilicity can improve its potency at the primary target, yet that gain can be offset by an increase in off-target promiscuity [2]. Lipophilicity is a key factor in determining a drug’s affinity for protein targets and in modulating ADMET characteristics. As a result, the combination of high target potency and high lipophilicity may increase the likelihood of ADMET-related attrition.

Therefore, medicinal chemistry optimization needs to be balanced and multidimensional. GOSTAR empowers medicinal chemists to efficiently explore the property space against a variety of bioactivity endpoints.

References

  1. Gao Y, Gesenberg C, Zheng W. Oral Formulations for Preclinical Studies: Principle, Design, and Development Considerations, Developing Solid Oral Dosage Forms (Second Edition), Academic Press. 2017, 455-495.
  2. Armstrong D, Li S, Frieauff W, Martus H.J, Reilly J, Mikhailov D, Whitebread S, Urban L. Predictive Toxicology: Latest Scientific Developments and Their Application in Safety Assessment, Comprehensive Medicinal Chemistry III, Elsevier. 2017, 94-115.

GOSTAR- The largest online medicinal chemistry intelligence database

Data Science To Empower
Life Science Innovation

In medicinal chemistry, the relationship between the molecular structure of a compound and its biological activity is referred to as the structure-activity relationship (SAR). Medicinal chemists modify biologically active molecules by introducing new chemical groups and then test the modified compounds for their biological effects. Identifying SARs is key to many stages of the drug discovery process, from hit identification to lead optimization.

Although information on millions of compounds and their bioactivities (e.g., reactivity, solubility, and target activity) is freely available to the public, it is very challenging to infer a meaningful and novel SAR from that information. The underlying problem is the unstructured and heterogeneous nature of these datasets, which the scientific and research community contributes through journals, scientific articles, patents, regulatory documents, and various secondary sources. The increasing structural diversity among hit compounds, and the spread of their potencies, makes SAR information still harder to analyze. If these relationships are properly extracted, associated, and analyzed, they provide valuable information that supports drug discovery and development. There has therefore been a growing need for, and interest in, mining and structuring SAR information from bioactivity data available in the public domain.

Global Online Structure Activity Relationship Database (GOSTAR)

Excelra, a leading global biopharma data and analytics company, has responded to this pertinent need by developing a knowledge repository, Global Online Structure Activity Relationship Database (GOSTAR), which provides a 360-degree view of millions of compounds linking their chemical structure to the biological, pharmacological and therapeutic information. GOSTAR contains high-quality, manually annotated and very well-structured SAR data captured from various primary sources (patents and top journals of medicinal chemistry) and secondary sources (conference meetings & abstracts, company drug development pipelines, company annual reports, clinical registries and drug approval reports).

Who can use GOSTAR and how?

GOSTAR was created to assist medicinal chemists, computational chemists, and cheminformaticians in identifying small molecules that have a useful biological effect and could serve a specific therapeutic purpose. GOSTAR enables users to quickly visualize, explore, analyze, and evaluate SAR data according to their project requirements. Users can explore SAR associations by searching on identifiers such as drug names, chemical structures, bibliography, compound development stage, and activity endpoints.

What are the applications of GOSTAR?

A better understanding of SAR data enables users to make well-informed decisions when exploring chemical space during drug design.

The applications of GOSTAR include:

  • Target profiling – GOSTAR enables a holistic exploration of the chemical space around a target of interest and helps users understand the pathways and indications in which a given target is implicated
  • Structure-based drug design – GOSTAR can be used as a compound library for virtual screening and hit identification in traditional structure-based drug design methodologies
  • Lead optimization – GOSTAR supports lead optimization by surfacing structure-activity relationships associated with improved potency, reduced off-target activity, and better physicochemical/metabolic properties
  • Assay validation – GOSTAR suggests suitable functional assays for secondary validation of the chemical modifications made while tuning a hit molecule
  • Drug repurposing and translational science – GOSTAR data can be mined to interrogate diverse targets with a compound of interest, to assess the feasibility and viability of drug rescue or label expansion
  • Competitive intelligence and novelty analysis – GOSTAR captures drug lifecycle information such as indication, phase of development, sponsor, and recruitment/approval status, including suspended trials along with the reason for discontinuation, which can be used to build the competitive landscape around a drug, target, or indication

Why GOSTAR?

There are currently thousands of chemical classes, and identifying potential candidates for therapeutic use is often a daunting task. Knowledge repositories like GOSTAR make it possible to rapidly characterize the data points that capture and encode a specific SAR. The key features below show why GOSTAR is a simple and effective solution for the complex task of gathering SAR data.

  • Reachability – Easy content accessibility to a wide and diverse user community
  • Utility – Maximize the utilization of content to create insights/concepts
  • Applicability – Selective utilization of content in diverse early discovery programs targeting unmet medical needs
  • Reliability – Standardized and normalized content to support traditional as well as AI/ML driven discovery programs

Try GOSTAR today. To schedule a free demo, write to us at: marketing@excelra.com

For more information on GOSTAR, visit: https://www.gostardb.com/gostar/

G-Protein Coupled Receptors: Structures, Research Landscape and Trends

A brief review of GPCR family: The largest family of druggable targets

G protein-coupled receptors (GPCRs) have become a frontier in both basic life-science research and translational therapeutic discovery, and are widely pursued by academic and industrial researchers for drug discovery. They represent an important opportunity for both small-molecule and antibody-based therapeutics and are the largest family of targets for approved drugs. A diverse set of molecules targeting this family could become a valuable asset, opening unexplored areas such as establishing the biological functions and disease relevance of individual targets.

GPCR structures and families

GPCRs are the largest family of proteins involved in membrane signal transduction and are also the most intensively studied drug targets, largely due to their substantial involvement in human pathophysiology. The pharmacological modulation of GPCRs provides leverage for treatment of diseases of central nervous system (CNS), cancer, viral infections, inflammatory disorders, metabolic disorders, etc.

The superfamily is classified into six classes based on amino acid sequence similarities namely, Class A (rhodopsin-like family); Class B (secretin receptor family); Class C (glutamate receptor family); Class D (fungal mating pheromone receptors); Class E (cAMP receptors) and Class F (frizzled or smoothened receptors), of which only four (A, B, C and F) are found in humans.

GPCRs are involved in various biological processes and disease indications, which makes them excellent drug targets (Fig 2). Some GPCRs have been linked to cancer development and progression through their overexpression and/or up-regulation by diverse factors. Higher expression of GPR49 was found to be involved in the formation and proliferation of basal cell carcinoma, the N-arachidonyl glycine receptor GPR18 was found to be associated with melanoma metastases, and high levels of GPR87 were found in squamous cell carcinomas of the lung, cervix, skin, urinary bladder, testis, and head and neck.

Recently, orphan GPCRs have emerged as potential novel targets for a diverse set of indications, such as GPR119 for diabetes, leucine-rich repeat-containing G protein-coupled receptors 4 and 5 (LGR4/5) for gastrointestinal disease, GPR35 for allergic inflammatory conditions, GPR55 as an antispasmodic target, the proto-oncogene Mas for thrombocytopenia, and GPR84 for ulcerative colitis.

Landscape of GPCR research and drug development

GPCRs are the largest target class of the ‘druggable genome’, representing approximately 19% of currently available drug targets. In humans, the GPCR superfamily consists of 827 distinct members, of which 406 are non-olfactory. However, current human therapeutics target only about 25% of potentially druggable GPCRs: at least one marketed drug exists for 103 of a possible 403 GPCR targets.
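As a quick check of the proportions quoted above, the short calculation below uses only the counts given in the text (827, 406, 103, and 403). The variable and function names are illustrative.

```python
TOTAL_GPCRS = 827        # distinct human GPCRs, as quoted above
NON_OLFACTORY = 406      # non-olfactory members, as quoted above
DRUGGABLE_TARGETS = 403  # potentially druggable GPCR targets, as quoted above
TARGETED = 103           # targets with at least one marketed drug, as quoted above


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


print(f"non-olfactory share of the superfamily: {share(NON_OLFACTORY, TOTAL_GPCRS):.1f}%")
print(f"druggable GPCRs with a marketed drug:   {share(TARGETED, DRUGGABLE_TARGETS):.1f}%")
print(f"druggable GPCRs still untargeted:       {DRUGGABLE_TARGETS - TARGETED}")
```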

Current literature analysis shows that GPCRs have traditionally been regarded as the domain for small-molecule drugs and very few targets are well studied. More than 30% of the US Food and Drug Administration (FDA) approved drugs target GPCRs, which makes them the largest druggable class of biomolecules (Fig 4).

Enormous effort has been expended to find relevant and potent GPCR ligands as lead compounds. Non-olfactory GPCRs constitute more than half of the human genome-encoded targets that are not yet exploited for any therapeutic use, and the knowledge in the scientific literature is disproportionately focused on a small number of receptors. Preliminary studies indicate that these unexploited receptors have functions in genetic and immune system disorders.

While the drugs that currently target GPCRs are primarily small molecules and peptides, GPCRs also recognize diverse ligands, including inorganic ions, amino acids, proteins, steroids, lipids, nucleosides, nucleotides, and small molecules (Fig 5).

The latest trends in GPCR research indicate that modalities other than small molecules are becoming more popular as GPCR-targeting agents, with monoclonal antibodies, peptide drugs, and allosteric modulators entering early-stage clinical trials. For instance, GLP-1 receptor-targeting biologics such as exenatide, liraglutide, and dulaglutide have been approved for type 2 diabetes, the CGRP receptor-targeting antibody erenumab is used in the treatment of chronic migraine, and many other peptide drugs targeting various GPCRs are in development.

Current trends in GPCR research

In recent years, breakthroughs in X-ray crystallography and cryo-electron microscopy (cryo-EM) have significantly increased the information available about the sequences, structures, and signaling networks of GPCRs and G proteins, leading to a much better understanding of GPCR-G protein interactions. This information is being explored with several bioinformatics resources and software tools, including the Protein Data Bank, GPCRdb, gpDB, human-gpDB, and many more.

Because experimental studies are costly and limited in spatial resolution, computational modeling techniques such as bioinformatics, protein-protein docking, and molecular dynamics simulations play an important role in exploring GPCR-G protein interactions. Determining the three-dimensional structures of unexplored orphan receptors and their ligand complexes has become an exciting avenue in GPCR research: it improves understanding of molecular recognition and activation mechanisms and supports the pharmaceutical investigation of new diseases in a variety of therapeutic areas.

Because current human therapeutics cover only about 25% of potentially druggable GPCRs, a relatively large share of GPCRs remain ‘orphan’ and therapeutically unexploited. Predicting and identifying ligands for these orphan receptors is an active area of research and of interest to the pharmaceutical industry.

References

  • Hutchings CJ. A review of antibody-based therapeutics targeting G protein-coupled receptors: an update. Expert Opin Biol Ther. 2020 Aug;20(8):925-935
  • Ellaithy A, Gonzalez-Maeso J, Logothetis DA, Levitz J. Structural and Biophysical Mechanisms of Class C G Protein-Coupled Receptor Function. Trends Biochem Sci. 2020 Dec;45(12):1049-1064
  • Sriram K, Insel PA. G Protein-Coupled Receptors as Targets for Approved Drugs: How Many Targets and How Many Drugs? Mol Pharmacol. 2018 Apr;93(4):251-258
  • Hauser AS, Attwood MM, Rask-Andersen M, Schioth HB, Gloriam DE. Trends in GPCR drug discovery: new agents, targets and indications. Nat Rev Drug Discov. 2017 Dec;16(12):829-842
  • Rask-Andersen M, Masuram S, Schioth HB. The druggable genome: evaluation of drug targets in clinical trials suggests major shifts in molecular class and indication. Annu Rev Pharmacol Toxicol. 2014;54:9-26.
  • Lu S, Zhang J. Small molecule allosteric modulators of G-protein-coupled receptors: drug-target interactions. J Med Chem. 2018
  • Gugger M, White R, Song S, Waser B, Cescato R, Rivière P, Reubi JC. GPR87 is an overexpressed G-protein coupled receptor in squamous cell carcinoma of the lung. Dis Markers. 2008;24(1):41-50
  • Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Research. 2000;28:235-242
  • Theodoropoulou MC, Bagos PG, Spyropoulos IC, Hamodrakas SJ. gpDB: a database of GPCRs, G-proteins, effectors and their interactions. Bioinformatics. 2008 Jun 15;24(12):1471-2
  • Satagopam, V.P., Theodoropoulou, M.C., Stampolakis, C.K., Pavlopoulos, G.A., Papandreou, N.C., Bagos, P.G., Schneider, R. & Hamodrakas, S.J. GPCRs, G-proteins, effectors and their interactions: human-gpDB, a database employing visualization tools and data integration techniques. Database (Oxford) 2010;baq019
  • Kooistra AJ, Mordalski S, Pándy-Szekeres G, Esguerra M, Mamyrbekov A, Munk C, Keserű GM, Gloriam DE. GPCRdb in 2021: integrating GPCR sequence, structure and function. Nucleic Acids Research, 2020;49:D335-D343

GOSTAR Updates in 2020

GOSTAR is the largest manually annotated structure-activity relationship (SAR) database of small molecules published in leading medicinal chemistry journals and patents. Compounds from both discovery and development stages targeting all target families are covered. Along with SAR, key properties like ADME and toxicity are captured. This relational database enables users to navigate and analyze massive content of small molecules to derive insightful decisions in design and discovery of novel compounds.

Content coverage

The GOSTAR database content is compiled from various sources, including:

  • MedChem Journals
  • Patents
  • FDA/EMEA/PMDA Reports
  • Clinical Trial Registries
  • Scientific Reviews
  • Company Websites
  • Books
  • Conferences
  • Public Sources

Patents covered in 2020

Patent coverage in the GOSTAR database is comprehensive. Content was indexed from more than 2900 patents in 2020. GOSTAR avoids duplication and redundancy by not capturing equivalent patents, i.e., the same patent published in multiple patent offices.
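One common way to avoid capturing the same invention twice is to keep a single record per patent family. The sketch below illustrates that idea only; the `family_id` field and the publication numbers are hypothetical, and GOSTAR's actual curation rules are not described here.

```python
# Hypothetical patent records; family_id groups publications of the same invention.
PATENTS = [
    {"publication": "US-EXAMPLE-1", "family_id": "FAM-001"},
    {"publication": "EP-EXAMPLE-1", "family_id": "FAM-001"},  # same family, other office
    {"publication": "WO-EXAMPLE-2", "family_id": "FAM-002"},
]


def one_per_family(patents):
    """Keep the first record seen for each patent family."""
    kept = {}
    for record in patents:
        kept.setdefault(record["family_id"], record)
    return list(kept.values())


for record in one_per_family(PATENTS):
    print(record["publication"], record["family_id"])
```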

Preclinical candidates covered in 2020

In the year 2020, the GOSTAR database was enriched with 1500+ preclinical compounds acting against various indications like COVID-19, Non-alcoholic steatohepatitis (NASH), Hepatitis virus infections, HIV infections, Cardiovascular diseases, and various cancers.

A few significant drug inclusions in 2020 were:

  • EPV-COV19
  • FT-8225
  • VNRX-9945
  • CARG-201
  • S-540956
  • BMS-818251
  • BRII-732
  • CR-13626
  • NAB815
  • CV730
  • GLPG-4124
  • IDG-16177

Target space covered in 2020 updates

New content was added for more than 2500 protein targets in 2020. Content for EGFR was updated from 200+ references, the adenosine A2A receptor from 86 references, and KRAS from 54 references, while NOTCH made it into the top 20 with around 4.7K compounds covered from a single reference (Table 2).

Distribution of SAR content

Of the 1.2 million SAR rows added to GOSTAR, functional in-vitro and in-vivo data contribute 41.25%, binding data constitute 32.28%, and ADME properties make up 6.69%.

Approximately 2% of the content covers toxicity properties of the compounds added in 2020, and the remaining 17% represents other property types, including physicochemical properties.
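The percentages above can be turned into approximate row counts. The sketch below treats "1.2 million" as exactly 1,200,000 rows, which is an assumption, and the toxicity and "other" shares are already rounded in the text, so the resulting counts are indicative only.

```python
TOTAL_ROWS = 1_200_000  # "1.2 million SAR rows", treated as an exact figure here

# Percentage shares as quoted in the text (the last two are rounded in the source).
SHARES = {
    "functional (in-vitro and in-vivo)": 41.25,
    "binding": 32.28,
    "ADME": 6.69,
    "toxicity": 2.0,
    "other (incl. physicochemical)": 17.0,
}


def approximate_rows(total, shares):
    """Convert percentage shares into rounded row counts."""
    return {name: round(total * pct / 100) for name, pct in shares.items()}


ROWS = approximate_rows(TOTAL_ROWS, SHARES)
for name, count in ROWS.items():
    print(f"{name:<36}{count:>10,}")
print(f"shares sum to {sum(SHARES.values()):.2f}%")
```

The quoted shares do not add up to exactly 100% because two of them are given only approximately.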

 

Drug Approvals in 2020

In 2020, despite several challenges due to the COVID-19 pandemic, the FDA approved many novel products that served previously unmet medical needs and significantly helped to advance patients’ quality of life. The broad indication-wise distribution (Figure 1) of all of CDER’s 2020 drug approvals indicates notable advances in drug discovery [1,2].

CDER approved 53 novel drugs, either as New Molecular Entities (NMEs) under New Drug Applications (NDAs: 74%) or as new therapeutic biologics under Biologics License Applications (BLAs: 26%).

New Drug Approvals (FDA) in 2020

Significant drug launches of 2020

Many of the novel entities approved in 2020 are notable for their potential positive impact and unique contributions towards medical care.

First-in-class novel drugs

About 40% of novel drugs (21 of 53) were approved as ‘first-in-class’. A few notable approvals include:

  • Rukobia (Fostemsavir, ViiV Healthcare, 07/02/2020)
    A new type of antiretroviral medication that treats HIV-1 by blocking the gp120:CD4 cellular interaction.
  • Koselugo (Selumetinib, AstraZeneca LP, 04/10/2020)
    A MEK1/2 (RAF-MEK-ERK pathway) inhibitor for the treatment of certain pediatric patients with neurofibromatosis type 1 (NF1) who have plexiform neurofibromas (PN).

Orphan novel drugs

About 58% of novel drugs (31 of 53) received orphan designation for the treatment of rare diseases. Notable examples include:

  • Evrysdi (Risdiplam, Genentech Inc, 08/07/2020)
    mRNA splicing modifier for SMN2, used as a treatment for spinal muscular atrophy (SMA).
  • Lampit (Nifurtimox, Bayer Healthcare Pharms, 08/06/2020)
    A therapy approved by the FDA to treat pediatric patients with Chagas disease.
  • Orladeyo (Berotralstat, BioCryst Pharmaceuticals Inc, 12/03/2020)
    A plasma kallikrein inhibitor, to treat patients with hereditary angioedema (HAE).

Other notable drug approvals

  • Artesunate (Amivas LLC, 05/26/2020)
    It helps in the treatment of severe malaria in adult and pediatric patients by inhibiting EXP1, a glutathione S-transferase.
  • Imcivree (Setmelanotide, Rhythm Pharmaceuticals Inc, 11/25/2020)
    An MC4 receptor agonist for the treatment of obesity and the control of hunger associated with pro-opiomelanocortin (POMC) deficiency.
  • Isturisa (Osilodrostat, Novartis Pharms Corp, 03/06/2020)
    Used for adults with Cushing’s disease by blocking the enzyme known as 11-β-hydroxylase (CYP11B1) and preventing cortisol production.
  • Orgovyx (Relugolix, Myovant Sciences GmbH, 12/18/2020)
    A GnRH receptor antagonist used for the treatment of certain patients with advanced prostate cancer.
  • Qinlock (Ripretinib, Deciphera Pharmaceuticals LLC, 05/15/2020)
    Potent pan-KIT and PDGFRα kinase inhibitor. It is the first new drug specifically approved as a fourth-line treatment for advanced gastrointestinal stromal tumor (GIST).
  • Veklury (Remdesivir, Gilead Sciences Inc, 10/22/2020)
    Inhibits the SARS-CoV-2 RNA-dependent RNA polymerase (RdRp). It was the first medication approved in the U.S. for the treatment of patients with COVID-19 (hospitalized adults and adolescents).
  • Zokinvy (Lonafarnib, Eiger BioPharmaceuticals Inc, 11/20/2020)
    A farnesyltransferase inhibitor used to treat certain patients with Hutchinson-Gilford Progeria Syndrome and progeroid laminopathies (rare conditions, caused by certain genetic mutations, that lead to premature aging).

Empowering drug discovery with Big Data and Artificial Intelligence

“Big Data” can be defined as any data that requires technological and infrastructural investment in order to yield meaningful insights. The main drivers of Big Data are the exponential growth of data through increased usage and the need to integrate these datasets to gain valuable insights. The data generated in drug discovery is a good example (1).

This blog aims to provide insights into various types of Big Data in drug discovery, and highlights the applications of Big Data in fast-tracking the drug discovery process by using machine learning (ML) approaches.

What is Big Data in drug discovery?

Big Data in drug discovery refers to the data collected from the biological, chemical, pharmacological, and clinical domains (2). These datasets are characterized by rapid generation, large size, complexity, heterogeneity, and high value with commercial potential. Some of the large datasets used in drug discovery are highlighted below:

Biology datasets:

Biological data helps to explain the mechanisms underlying a disease state, to predict and validate potential target proteins for therapeutics, to develop new bioassay techniques for identifying treatment modalities associated with those targets, to predict how treatments will interact with the body when given to a patient, and finally to design effective clinical trials (2).

The data types that define biological data are: drug target data, OMICS data (genomic, transcriptomic, proteomic and metabolomic data), exome data, GWAS data, gene expression data, disease-relevant animal and cellular models data, gene knockout or knockdown data etc.

Chemistry datasets:

Chemistry datasets are useful in the design of high-throughput screening libraries which assist in identifying and validating therapeutic targets in silico. These datasets assist in the prediction of molecular properties required for drug compounds and help provide insights in understanding how those molecules interact with biological macromolecules (3).

The data types that define chemistry data are: chemical structural representations, chemical line notations or identifiers (SMILES & InChI), molecular property descriptors, topological descriptors, topographical descriptors, structure-activity-relationship (SAR) and compound specific biological data.
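As a small illustration of chemical line notations, the sketch below stores the SMILES and standard InChI strings for ethanol in a simple record and reads the element counts from the InChI formula layer. The record type and helper names are illustrative, and the formula parser handles only simple single-component formulas.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundRecord:
    """Illustrative record holding two line notations for one compound."""
    name: str
    smiles: str
    inchi: str


ETHANOL = CompoundRecord(
    name="ethanol",
    smiles="CCO",
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
)


def inchi_layers(inchi):
    """Split an InChI string into its slash-separated layers (formula, connectivity, ...)."""
    prefix, _, rest = inchi.partition("/")
    if not prefix.startswith("InChI="):
        raise ValueError("not an InChI string")
    return rest.split("/")


def formula_counts(inchi):
    """Count atoms per element from the formula layer (simple formulas only)."""
    formula = inchi_layers(inchi)[0]
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + int(number or 1)
    return counts


print(inchi_layers(ETHANOL.inchi))
print(formula_counts(ETHANOL.inchi))
```

SMILES is compact and easy to read, while InChI is layered and canonical, which is why databases often store both for the same structure.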

Pharmacology datasets:

Pharmacological data in drug discovery provides information about the compounds or drugs tested in animal models in combination with assay data on protein targets in cell- or tissue- based models that allows the investigation of the effects of compounds at different levels of biological complexity (4).

The data types that define the pharmacological data are: absorption, distribution, metabolism, elimination, toxicity (ADMET) data, functional in-vitro assay and in-vivo assay properties.

Clinical datasets:

Clinical datasets in drug discovery provide valuable information derived from patient data (5).

The data types that define the clinical datasets are: safety and efficacy data, treatment response and side-effect profiles, patient stratification data, competitive landscape and trial design data.

The information contained in these large and complex datasets offers opportunities to explore and understand the mechanisms associated with a disease state, and makes it possible to prevent and treat such conditions.

What is artificial intelligence and what are its applications in drug discovery?

Scientists working in drug discovery research around the world generate voluminous pharmaceutical Big Data that is by nature multisource and multidimensional. It is becoming increasingly difficult not only to stay informed about all the available literature but also to properly parse and integrate this Big Data into one’s own workflows within various research projects. To overcome these hurdles, pharmaceutical and information technology companies have adopted artificial intelligence (AI) technologies to provide robust solutions that could fast-track the drug discovery process.

When a machine exhibits human cognitive skills, such as the ability to learn and solve problems, its behavior is described as artificial intelligence (AI) (6). AI comprises technologies such as Machine Learning and Deep Learning. Machine Learning methods are well established for learning and predicting novel properties, while Deep Learning methods show great promise in drug design owing to their powerful generalization and feature extraction capabilities. Both have made remarkable progress in usefulness and applicability, and offer opportunities across all stages of drug discovery (7,8).

Some of the applications of artificial intelligence in drug discovery include:

  • Protein design and function
    • Prediction of protein folding
    • Prediction of protein-protein interactions
  • Hit discovery
    • Generation of chemical libraries or new molecule fingerprints
    • Virtual screening
    • Drug repurposing
  • Hit to lead optimization
    • Generating models for de novo design of drugs
    • Prediction using QSAR models
    • Prediction of molecular descriptors
    • Prediction of topological & topographical descriptors
  • Prediction of ADMET properties
    • Prediction of pharmacokinetic parameters like ADME properties
    • Prediction of toxicity properties
    • Pharmacodynamics modeling

Challenges and limitations associated with Big Data & AI in drug discovery

Some of the major challenges associated with Big Data in drug discovery include data generation, data integration, data quality, and data storage and management (2). Other critical challenges include errors in the reproducibility and standardization of data, format difficulties for chemical structure representations, missing original data, lack of contextual information, insufficient disease-relevant human data in some disease areas, the curse of dimensionality, bias in data, gaps in the fundamental understanding of many diseases, issues in clinical translation for target discovery and validation, and the complexity of managing entity namespaces and ontologies. In addition, protecting patient data and de-identifying personal data are legitimate concerns with respect to data storage and management.
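
The curse of dimensionality mentioned above can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: it treats each compound as a random point in a unit hypercube (a stand-in for a descriptor vector, not real chemistry data) and measures how the gap between the nearest and farthest neighbour of a query point shrinks as the number of descriptors grows. The function names `relative_contrast` and `contrast_by_dimension` are hypothetical.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, rng: random.Random) -> float:
    """Return (d_max - d_min) / d_min for the distances from one random
    query point to n_points random points in the unit hypercube."""
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    d_min, d_max = min(distances), max(distances)
    return (d_max - d_min) / d_min


def contrast_by_dimension(dims=(2, 20, 200), n_points=300, seed=0):
    """Map each dimensionality to the relative contrast observed for it."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    return {d: relative_contrast(n_points, d, rng) for d in dims}


if __name__ == "__main__":
    for n_dims, contrast in contrast_by_dimension().items():
        print(f"{n_dims:>4} descriptors: relative contrast = {contrast:.2f}")
```

As the descriptor count rises, the relative contrast falls, meaning that the nearest and farthest compounds become hard to tell apart by distance alone. This is one reason high-dimensional descriptor sets usually call for feature selection or dimensionality reduction before modeling.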

Although artificial intelligence technologies are promising, studies that use them still have some limitations (8). Processing and analyzing large amounts of data can affect the performance and reliability of the resulting models. Interpreting complex data, such as data associated with biological mechanisms, is a further limitation of models generated using these methods.

How Excelra can support your AI-based drug discovery programs

Standardized, high-quality datasets are essential for AI/ML-based drug discovery programs. Excelra’s GOSTAR is the world’s largest medicinal chemistry intelligence database, providing comprehensive and structured SAR data for more than 8 million compounds. Available as a ‘one-stop data source’ for in silico drug discovery, GOSTAR captures a variety of small molecule activities, encompassing SAR, physicochemical, metabolic, ADME and toxicological profiles, in a relational database format.

GOSTAR datasets are created with industry-accepted ontologies and can be delivered in flexible file formats such as:

  • Flat files
  • Hierarchical files
  • Databases (Oracle, MySQL, etc.)
  • Semantic format
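
To show how these delivery styles differ in practice, the sketch below renders the same two activity records as a flat file (CSV), a hierarchical file (nested JSON) and semantic-style subject-predicate-object triples. Everything in it is an assumption made for illustration: the field names, the `ex:` prefixes, the identifiers and the values are invented placeholders and do not reflect the actual GOSTAR schema or export format.

```python
import csv
import io
import json

FIELDS = ["compound_id", "smiles", "target", "activity_type", "value", "unit"]

# Placeholder records: identifiers and values are invented for illustration only.
RECORDS = [
    {"compound_id": "CMPD-0001", "smiles": "CCO", "target": "TARGET_A",
     "activity_type": "IC50", "value": 120.0, "unit": "nM"},
    {"compound_id": "CMPD-0001", "smiles": "CCO", "target": "TARGET_B",
     "activity_type": "IC50", "value": 4500.0, "unit": "nM"},
]


def to_flat(records):
    """Flat file: one row per measurement, compound fields repeated."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_hierarchical(records):
    """Hierarchical file: measurements nested under their compound."""
    compounds = {}
    for rec in records:
        entry = compounds.setdefault(
            rec["compound_id"],
            {"compound_id": rec["compound_id"], "smiles": rec["smiles"],
             "activities": []},
        )
        entry["activities"].append(
            {key: rec[key] for key in ("target", "activity_type", "value", "unit")}
        )
    return json.dumps(list(compounds.values()), indent=2)


def to_triples(records):
    """Semantic style: each measurement becomes a node with four statements."""
    triples = []
    for index, rec in enumerate(records):
        node = f"ex:measurement/{index}"
        triples.append((node, "ex:compound", f"ex:compound/{rec['compound_id']}"))
        triples.append((node, "ex:target", f"ex:target/{rec['target']}"))
        triples.append((node, "ex:activityType", rec["activity_type"]))
        triples.append((node, "ex:value", f"{rec['value']} {rec['unit']}"))
    return triples


if __name__ == "__main__":
    print(to_flat(RECORDS))
    print(to_hierarchical(RECORDS))
    for triple in to_triples(RECORDS):
        print(triple)
```

The flat form is the simplest to load into a spreadsheet or data frame, the hierarchical form avoids repeating compound-level fields, and the triple form is the easiest to link against external ontologies.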

“10 of the Top 20 pharma companies utilize GOSTAR to support their drug discovery programs”

To know more, visit https://www.excelra.com/discovery/#gostar and download a brochure.

Try GOSTAR today. To schedule a free demo, write to us at marketing@excelra.com

References

  1. Benke K, Benke G. Artificial Intelligence and Big Data in Public Health. Int J Environ Res Public Health. 2018;15(12).
  2. Brown N, Cambruzzi J, Cox PJ, Davies M, Dunbar J, Plumbley D, et al. Big Data in Drug Discovery. Prog Med Chem. 2018;57(1):277–356.
  3. Petitjean M, Camproux A-C. In Silico Medicinal Chemistry: Computational Methods to Support Drug Design. Edited by Nathan Brown. ChemMedChem. 2016;11(13):1480–1.
  4. Xie L, Draizen EJ, Bourne PE. Harnessing Big Data for Systems Pharmacology. Annu Rev Pharmacol Toxicol. 2017 Jan 6;57:245–62.
  5. Singh G, Schulthess D, Hughes N, Vannieuwenhuyse B, Kalra D. Real world big data for clinical research and drug development. Drug Discov Today. 2018;23(3):652–60.
  6. Christina Aguis. Evolution of AI: Past, Present, Future [Internet]. 2019. Available from: https://medium.com/datadriveninvestor/evolution-of-ai-past-present-future-6f995d5f964a
  7. Zhong F, Xing J, Li X, Liu X, Fu Z, Xiong Z, et al. Artificial intelligence in drug design. Sci China Life Sci. 2018 Oct;61(10):1191–204.
  8. Zhang L, Tan J, Han D, Zhu H. From machine learning to deep learning: progress in machine intelligence for rational drug discovery. Drug Discov Today. 2017;22(11):1680–5.