Chemoinformatic Resources for Organometallic Drug Discovery

Medicinal Organometallic Chemistry continues to contribute to drug discovery, including the development of diagnostic compounds. Despite limitations of metal-based molecules such as toxicity, several drugs are approved for clinical use and others are under clinical and pre-clinical development. Indeed, several research groups continue working on organometallic compounds with potential therapeutic applications. For arguably historical reasons, chemoinformatic methods in drug discovery have thus far been applied mostly to organic compounds. Typically, metal-based molecules are excluded from compound data sets for analysis. Indeed, most software and algorithms for drug discovery applications are focused on and parametrized for organic molecules. However, considering the emerging field of materials informatics, the objective of this Commentary is to emphasize the need to develop chemoinformatic applications to further advance metallodrugs. For instance, one starting point would be developing a compound database of organometallic molecules annotated with biological activity. It is concluded that chemoinformatic methods can boost the research area of Medicinal Organometallic Chemistry.


Introduction
Organometallic and inorganic compounds attract considerable interest because of their broad range of biological activities [1]. Metal-containing compounds in clinical use and under clinical development, as well as those relevant in diagnostic applications, have been discussed extensively in the literature. Indeed, metallodrugs and some inorganic molecules offer large benefits as diagnostic tools [2] and are associated with the field of metalloimaging. Metallodrugs have diverse mechanisms of action [3] and therapeutic applications, of which the most explored is anticancer activity, followed by antibacterial activity. Another important point is that therapeutic applications cover not only human but also veterinary use, which expands the field of application and the opportunities for success.
Organometallic compounds are also attractive because they can explore novel molecular targets not addressed by the currently available chemical space (defined mostly by organic compounds). Similarly, emerging molecular targets such as epigenetic targets could be conveniently addressed by novel compounds located outside the traditional drug-like space. In addition, metallodrugs offer distinct features that could be useful for complex diseases best addressed by multi-target approaches [4].
Although organometallic compounds raise general concerns such as toxicity and cost (particularly for "large" scale production), organometallic drugs have attractive and distinct structural features with the ability to augment the relevant medicinal chemical space, ideally balancing novelty with relevance [5].
One of the distinct structural features of organometallic drugs is molecular complexity that is well-known to have a major impact on drug discovery [6].
While molecular docking of metal complexes and other modeling approaches are commonly used to explore compound-target interactions [7] [8], chemoinformatic studies addressing aspects such as chemical diversity, visual representation of the chemical space, and similarity searching, to name a few, have been done on a more limited basis. However, they represent major areas of opportunity. Chemoinformatics arises from the combination of scientific and technological tools with the 3D understanding and manipulation of chemistry applied to therapeutic research. Although it has been applied to a larger extent to organic compounds, it is proposed that using these tools in organometallic chemistry can provide good outcomes and generate significant advances in therapeutics.
The main goal of this Commentary is to highlight major areas where, in the author's opinion, chemoinformatic methods used to explore organic-based molecules can be extended to address the needs of organometallic-based drug discovery. After this short Introduction, several areas of opportunity or application are discussed. They are not arranged in strict order of priority.

Areas of Application
Chemoinformatic methods are broadly used across several stages of the drug discovery process [9]. For historical reasons these approaches, boosted by the needs of pharmaceutical companies, have been developed for and applied to organic compounds. Compared with organic compounds, organometallic compounds have had far fewer therapeutic applications, and the gap is even more noticeable in the application of chemoinformatics to their development. In fact, in several chemoinformatic methods and protocols metal-based compounds are excluded from the analysis. This filtering is typically done when working with medium-to-large chemical databases, for example to analyze chemical diversity or to develop predictive models. As mentioned earlier, these types of analyses have therefore had little or no application to organometallic molecules. One reason to remove metal-based compounds is their overall low frequency in most major chemical databases currently used in drug discovery. A second and perhaps stronger reason to exclude metal-containing molecules is the lack of appropriate parameters to address the presence of metal atoms in the chemical structures.
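The curation step described above, in which metal-containing entries are set aside before analysis, can be sketched as follows. This is a minimal illustration only: it assumes structures are stored as SMILES strings, relies on the SMILES convention that metals are written as bracket atoms, and uses a deliberately short, hypothetical metal list (`METALS`). The function names are illustrative and do not come from any existing package.

```python
import re

# Assumption: a small illustrative subset of metal symbols, not a complete list.
METALS = {"Li", "Na", "K", "Mg", "Ca", "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni",
          "Cu", "Zn", "Ga", "Tc", "Ru", "Rh", "Pd", "Ag", "Gd", "Re", "Os",
          "Ir", "Pt", "Au", "Hg"}

# In SMILES, metals appear as bracket atoms such as [Pt] or [Na+].
# The pattern captures the element symbol that follows "[" and an optional isotope.
BRACKET_ATOM = re.compile(r"\[\d*([A-Z][a-z]?)")


def metals_in(smiles):
    """Return the set of metal symbols found in bracket atoms of a SMILES string."""
    return {sym for sym in BRACKET_ATOM.findall(smiles) if sym in METALS}


def split_by_metal(records):
    """Split (name, smiles) records into organic and metal-containing name lists."""
    organic, metal_containing = [], []
    for name, smiles in records:
        target = metal_containing if metals_in(smiles) else organic
        target.append(name)
    return organic, metal_containing


records = [
    ("aspirin", "CC(=O)Oc1ccccc1C(=O)O"),
    ("cisplatin", "N[Pt](N)(Cl)Cl"),
    ("sodium acetate", "CC(=O)[O-].[Na+]"),
]
organic, metal_containing = split_by_metal(records)
```

Keeping the metal-containing subset as a separate list, instead of discarding it, is the design choice that would allow those entries to be analyzed once suitable descriptors exist.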
In this section of the Commentary we outline several areas where chemoinformatic methods commonly used in current drug discovery can be applied to study organometallic compounds.

Database of Organometallic Compounds for Drug Discovery
Perhaps one of the most relevant and straightforward applications of chemoinformatics to the study of organometallic compounds is related to compound databases. Indeed, compound databases play a significant role in drug discovery [10]. Whether public, in-house (mostly private), virtual, or on-demand [11], they are key repositories to store, organize, and mine chemical and biological information. Major compound databases used in drug discovery have been reviewed extensively elsewhere [12]. Although these large compound databases include some organometallic molecules, the vast majority of entries are small organic molecules, and the few organometallic molecules included are annotated only with common parameters rather than with information useful for developing new molecules. In fact, as commented above, a common practice when characterizing the chemical diversity of such databases is to filter out metal-containing molecules.
To the best of our knowledge, there are no large compound databases that store and organize the information of organometallic molecules annotated with biological activity. Therefore, building, curating and maintaining a database of this kind is a major area of opportunity to integrate informatics methods into organometallic drug discovery. A proof-of-principle database addressing this need is D-InoDB [13]. Although it still holds a limited number of compounds, this database contains information on molecules approved for clinical use and under clinical development.
Compound databases with organometallic compounds can be further developed into web-based applications (vide infra). Similar to other databases of organic compounds, a database of organometallic compounds annotated with biological activity can facilitate a large number of analyses such as structure-property (activity) relationships (QSP(A)R), including activity landscape modeling [14], data mining, and virtual screening, to name a few. The compound database can be

Molecular Representation
A cornerstone of chemoinformatics is molecular representation [16]. The appropriate description of the molecules is the most important first step towards virtually any qualitative or quantitative analysis. This can be clearly seen in studies such as QSP(A)R, where the selection of the descriptors is key to obtaining a predictive model. In some instances, "simple" 2D descriptors can be enough to obtain a useful and predictive model. In other instances, more accurate descriptors are required to capture the molecular shape and 3D information for explaining and/or predicting biological activity or assessing metal-binding sites in proteins [17].
The accuracy and speed of computing 1D, 2D or 3D descriptors are among the most sensitive points, considering that as the overall accuracy of the descriptors increases, the calculation speed decreases. This is particularly relevant when selecting the descriptors to be computed for metal-based molecules. Therefore, it is essential to keep in mind the intended application of the description in order to optimize resources. While there are several methods based on quantum mechanics to describe metal-based compounds accurately, such methods are still not suited to handling large numbers of structures efficiently. In addition, the existing (including calculated) information and descriptors on these molecules are not uniform and are not readily available for use.

Molecular Fingerprints
In cheminformatics, molecular fingerprints are common representations of organic compounds and several different types have been developed. In general, such fingerprints are computed very rapidly and are appropriate to analyze even thousands or millions of compounds efficiently. In turn, such representations are the basis to perform several analyses such as similarity searching, diversity, and clustering analysis (including qualitative, quantitative and visual analysis).
A general approach to generate appropriate molecular fingerprints for organometallic compounds is developing a typical (dictionary or topological) fingerprint for the organic portion of the molecule and then adding a fingerprint developed for the metal portion. A bottleneck of this approach is the speed of the calculations to compute the metal portion. A workaround to address this issue can be to generate large compound databases with the values pre-calculated for different metals.
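The two-part fingerprint outlined above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: molecules are given as a list of element symbols plus a list of bonded index pairs, the organic block is a simple hashed atom and atom-pair fingerprint, and the metal block encodes only metal identity and coordination number from a short hypothetical vocabulary (`METALS`). In practice the metal block could be filled from a table of pre-calculated values for each metal, as suggested above.

```python
import zlib

ORGANIC_BITS = 64                      # size of the hashed organic block (assumed)
METALS = ["Fe", "Ru", "Pt", "Au"]      # hypothetical metal vocabulary
MAX_COORD = 6                          # coordination numbers above this share one bit
METAL_BITS = len(METALS) + MAX_COORD + 1


def _hash_bit(text):
    # crc32 is stable between runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(text.encode("utf-8")) % ORGANIC_BITS


def organic_part(atoms, bonds):
    """Hashed bits for non-metal atoms and for bonds between non-metal atoms."""
    on = {_hash_bit(a) for a in atoms if a not in METALS}
    for i, j in bonds:
        a, b = sorted((atoms[i], atoms[j]))
        if a not in METALS and b not in METALS:
            on.add(_hash_bit(a + "-" + b))
    return on


def metal_part(atoms, bonds):
    """One bit per metal identity and one bit per coordination number."""
    on = set()
    for idx, sym in enumerate(atoms):
        if sym in METALS:
            on.add(METALS.index(sym))
            coord = sum(1 for i, j in bonds if idx in (i, j))
            on.add(len(METALS) + min(coord, MAX_COORD))
    return on


def composite_fingerprint(atoms, bonds):
    """Organic block followed by the metal block, as a set of on-bit indices."""
    shifted = {ORGANIC_BITS + b for b in metal_part(atoms, bonds)}
    return organic_part(atoms, bonds) | shifted


# Heavy-atom graph of cisplatin: Pt bonded to two N and two Cl atoms.
cisplatin = composite_fingerprint(["Pt", "N", "N", "Cl", "Cl"],
                                  [(0, 1), (0, 2), (0, 3), (0, 4)])
```

Because the metal block occupies its own index range, a purely organic molecule sets no bits there, so organic and organometallic compounds can share one fingerprint length.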

Diversity Analysis
In drug discovery, a common and useful chemoinformatic analysis is the quantification of the molecular diversity of compound databases [18]. For instance, to identify novel hits it is generally desirable to screen compound data sets with high molecular diversity. To address the need for generating diverse libraries, organic chemists have developed "diversity-oriented-libraries" [19]. Another approach is the "libraries-from-libraries" [20]. In lead optimization, in contrast, a general approach is screening compound data sets with lower molecular diversity, e.g., high structural similarity to the active lead molecule. In other words, in lead optimization it is more common to explore focused regions in chemical space. Examples of data sets aimed at this need are the "focused" and "targeted" libraries. In all these cases, i.e., when selecting diverse libraries or focused and less diverse ones, experimental chemists (medicinal, organic or inorganic) can readily identify and select compounds that meet the desired criteria of diversity. However, when dealing with medium-to-large compound databases it becomes more difficult to assess molecular diversity accurately. This is clear when purchasing data sets available from third parties (commercial vendors, for instance). Therefore, diversity analysis is standard practice when analyzing organic small molecules.
The methods available for this type of molecule can be readily extended to analyze the diversity of organometallic compounds. To this end, the development of molecular fingerprints appropriate for organometallic molecules (vide supra) can be the basis to measure the diversity. Such molecular fingerprint representations, or other appropriate representations based on continuous values, can be used as the basis to apply diversity metrics such as the Tanimoto coefficient, Euclidean distance or other diversity metrics available [16]. It would remain to assess the most suitable fingerprint representations and diversity metrics tuned for organometallic compounds.
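The two metrics named above can be written down compactly. The sketch below assumes fingerprints are stored as sets of on-bit indices and continuous descriptors as equal-length numeric sequences. The summary statistic, the mean pairwise Tanimoto distance, is one common choice and is used here as an assumption, not as the metric recommended in this Commentary.

```python
import math
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def euclidean(x, y):
    """Euclidean distance between two equal-length continuous descriptor vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def mean_pairwise_diversity(fingerprints):
    """Average of (1 - Tanimoto) over all pairs; higher values mean more diverse."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        raise ValueError("at least two fingerprints are required")
    return sum(1.0 - tanimoto(a, b) for a, b in pairs) / len(pairs)


# Two fingerprints sharing 2 of 4 distinct on-bits have a Tanimoto coefficient of 0.5.
example = tanimoto({1, 2, 3}, {2, 3, 4})
```

Note that the pairwise calculation grows quadratically with the number of compounds, so for large databases a random sample of pairs may be preferable.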

Chemical Space
The concept of chemical space [21] is also quite relevant in drug discovery for several purposes. Although there is no single, correct definition, one concept is a multi-dimensional space for a set of (ideally all chemically possible) compounds [22]. The concept of chemical space is the basis to perform studies that include but are not limited to QSP(A)R studies (e.g., it is used as a matrix that contains the descriptors and biological activity); diversity analysis; clustering and visual assessment of diversity; and comparative studies assessing the similarities or differences among compound data sets. A suitable chemical space can be used as a standard for profiling most structural sets of interest [23].
Since the chemical space depends on chemical representation, there are no "unique" or "invariant" chemical spaces. Despite the strong dependence of the chemical space on structural representation, quantitative and qualitative analysis of the chemical space of organic molecules is now relatively straightforward. Indeed, it is fairly common to find visual representations of the chemical space of compound data sets from different sources such as synthetic molecules (e.g., from diverse designs) and natural products (e.g., from different geographical sources or natural origin) [24] [25]. This is largely because "standard" representations and descriptors are available for organic compounds.
DOI: 10.4236/cmb.2020.101001, Computational Molecular Bioscience
However, there is no visual representation of the chemical space comparing bioactive organic vs. organometallic compounds. This is due in large part to the lack of appropriate molecular descriptors suited to represent a large number of organometallic molecules efficiently. As commented above, fingerprint representations of organometallic molecules will boost the qualitative and quantitative analysis of their chemical space.

Virtual Screening
In silico screening, also called virtual screening, of compound databases has been a very useful approach to identify hit compounds [30]. Compound databases from different sources such as synthetic libraries, natural product data sets [31], and even virtual libraries (from which compounds are cherry-picked for synthesis and experimental testing) are screened regularly. To this end, two general approaches are employed: structure-based and ligand-based methods. As discussed in detail elsewhere, the method of choice will depend on the experimental information available for the system (e.g., whether the 3D structure of the molecular target is known); options include docking, pharmacophore-based screening, similarity searching, and combinations of these. In the latter case, also named cascade or sequential approaches, fast (but less accurate) methods are applied first to rapidly filter large numbers of compounds, followed by more accurate but slower methods to select molecules for experimental testing. At the end of the process, practical factors such as the availability of physical samples are considered (e.g., commercial availability and cost).
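The cascade (sequential) strategy described above can be sketched generically. The two scoring functions are placeholders supplied by the caller: `fast_score` stands in for a cheap method such as fingerprint similarity and `slow_score` for a costlier one such as docking. The function name, the `keep_fraction` and `n_select` parameters, and their defaults are assumptions made for this illustration.

```python
def cascade_screen(library, fast_score, slow_score, keep_fraction=0.1, n_select=5):
    """Two-stage screen: cheap ranking first, costly re-ranking of the survivors.

    Higher scores are treated as better in both stages.
    """
    # Stage 1: rank the whole library with the fast, less accurate method.
    ranked = sorted(library, key=fast_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    survivors = ranked[:n_keep]
    # Stage 2: apply the slow, more accurate method only to the survivors.
    return sorted(survivors, key=slow_score, reverse=True)[:n_select]


# Toy stand-ins for real scoring functions, used only to show the call pattern.
toy_library = list(range(100))
selected = cascade_screen(toy_library,
                          fast_score=lambda x: x % 10,
                          slow_score=lambda x: -x,
                          keep_fraction=0.1,
                          n_select=3)
```

The slow method is evaluated on only a fraction of the library, which is the point of the cascade; the price is that any compound the fast stage misses is never seen by the accurate stage.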

Similarity Searching
One of the ligand-based techniques frequently used in virtual screening is similarity searching. The rationale of this approach is that similar compounds have similar activity (provided there are no activity cliffs [27], that is, molecules with similar chemical structures but large and unexpected differences in activity).
In this case, the chemical structures of all the molecules in a compound library are compared systematically with the chemical structure of one or several active molecules that are used as references or queries. Two key components of similarity searching are the molecular representation and a similarity measure [16]. For molecular representation it is common to employ molecular fingerprints because they are quite fast to compute (vide supra). Thus, by applying chemoinformatic tools together, such as virtual screening, fingerprints, and molecular similarity searching, we can expect to obtain a result that, although not definitive, gives an overview of or a guide to what is being sought.
Thus far, similarity searching has not been reported for organometallic compounds, but it can be easily performed once an efficient molecular representation is developed. To this end, molecular fingerprints suited for organometallic molecules can boost the application of this technique to identify novel hit compounds. In addition to molecular fingerprints, other molecular representations can be employed.
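Once a fingerprint is available, the search itself is short. The sketch below assumes fingerprints are sets of on-bit indices (for organometallic compounds, whatever suitable representation is eventually developed), uses the Tanimoto coefficient as the similarity measure, and, when several queries are given, keeps the highest similarity to any of them. That fusion rule and the `threshold` default are assumptions chosen for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0  # convention assumed here for two empty fingerprints
    return len(a & b) / len(a | b)


def similarity_search(queries, library, threshold=0.5):
    """Rank library compounds by their best Tanimoto similarity to any query.

    queries: list of fingerprints of known active (reference) molecules.
    library: dict mapping compound name to fingerprint.
    Returns (name, score) pairs with score >= threshold, best first.
    """
    hits = []
    for name, fingerprint in library.items():
        score = max(tanimoto(q, fingerprint) for q in queries)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical fingerprints, used only to show the call pattern.
toy_library = {"a": {1, 2, 3, 4}, "b": {1, 2, 3, 5}, "c": {7, 8}}
hits = similarity_search([{1, 2, 3, 4}], toy_library, threshold=0.5)
```

Highly ranked compounds are candidates for follow-up, not confirmed actives, since activity cliffs can break the similarity assumption.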

Webservers
A broad number of chemoinformatic resources and methods are now available to the scientific community through webservers. A considerable number of such servers are publicly accessible [33] [34]. Chemoinformatic servers focused on organic small molecules include, but are not limited to, the generation and analysis of molecular descriptors, visual representation of the chemical space, diversity analysis, and servers to predict the ADMETox profile of compound data sets. Other servers are dedicated to hosting compounds and predictive models.
As discussed above, these servers are focused on organic molecules, such that a common preparation or curation step when analyzing compound data sets is to remove compounds containing metals. Therefore, we consider that a significant area of opportunity to advance Medicinal Organometallic Chemistry is to develop, maintain and update webservers able to handle organometallic molecules. Such servers can be used for hosting, maintaining and mining compound databases, calculating descriptors including molecular fingerprints, and predicting properties, including ADMETox, using validated QSP(A)R models, to name a few.

Conclusion
Organometallic- and inorganic-based compounds are promising resources to address novel and emerging molecular targets. Similarly, metal-based medicinal agents can represent new alternatives to tackle difficult targets poorly addressed by the current traditional chemical space typically defined by small organic molecules. In addition, organometallic-based compounds can be part of multi-target approaches used in combination with other biologics or organic small molecules.
While molecular modeling and docking of metal complexes are performed regularly to explore compound-target interactions, chemoinformatic approaches aimed at organizing and managing the information of organometallic compound databases annotated with biological activity are still limited. Similarly, chemoinformatic approaches to study systematically the chemical diversity, visual representation of the chemical space, similarity searching, and SAR involving organometallic compounds are not fully developed. One of the key starting points to extend chemoinformatics to organometallic compounds is developing appropriate and efficient molecular representations to describe, as accurately as possible, the structure of the compounds. Such representations will largely depend on the intended purpose of the analysis, for instance, data mining, exploration of the chemical space, diversity analysis, molecular interactions drug/compound-molecular