Machine Learning Based Virtual Screening for Biodegradable Polyesters
1. Introduction
Textile development for clothing is a primary and growing human need, and the current demand is met by producing 110 million metric tons of industrial polyester per year [1]. The most commonly used polyester, which accounts for over half of all consumption, is polyethylene terephthalate (PET). The usage of fibers has doubled over the last 20 years due to the increased popularity of fast fashion companies [2]-[4]. These brands produce new styles and products weekly, optimizing for low cost rather than fabric biosafety. The rise of these industries has led to an increase in textile waste, amounting to 97 million tons in 2023 [5]. In anaerobic environments like landfills, these materials additionally produce 1.9 billion tons of CO2 annually, causing air pollution and damage to human health [6]. Since polyesters take over 200 years to break down, these problems greatly worsen over time, increasing waste accumulation rates and reducing landfill space availability.
70% of all clothing is derived from polyester fibers, and the dyes used to color these pieces introduce toxicants and heavy metals into the water supply. These pollutants can adversely impact animal and human health when untreated wastewater enters local and residential water systems [5]. In the past 20 years, the global population has grown by 25%, leading to an increased demand for inexpensive clothing [7]. Thus, to support the population’s growing needs while accounting for environmental challenges, sustainable alternatives to polyesters need to be developed at scale.
Over the past few years, researchers have developed innovative ways of generating molecules [8]. Initial approaches focused on atom-by-atom construction or substructure-based (ring or bond) assembly. These techniques perform best for small molecules, with significant declines in performance for large polymers [9]. In 2020, a method of molecular generation specifically tuned to large polymers, the Junction Tree Variational Auto-Encoder (JTVAE), was developed by Jin et al. [10]. This scalable method generates polymers hierarchically, incorporating structural motifs, a key feature of large polymers, into the generation process. The generative model follows an encoder-decoder architecture in which the encoder takes a molecular input in the form of a graph and transforms it into a vector representation, capturing both the coarse-grained motifs and the fine-grained atom connectivity. The decoder takes this embedding (a lower-dimensional representation of the molecule) and constructs a new molecular graph step by step, adding one node or edge at a time, conditioned on the information learned by the encoder. The decoder can output thousands of novel graph representations of chemically valid molecules. However, these molecules are general-purpose and vary widely in their properties, so they must be screened for the target application.
Previously, molecular screening has been done using two main methods: high-throughput screening (HTS) and knowledge-based filtering [11]. High-throughput screening is an experimental technique that rapidly evaluates the biochemical properties of millions of samples. Its high sensitivity and speed make it an optimal technique for simple and well-studied mechanisms. However, this method lacks support for more diverse systems and is therefore limited in its capacity for novel polymer discovery. It also has a high propensity for false positives and potentially lower hit rates due to incompatibility with other technology. Nevertheless, HTS has been used effectively to develop general high-volume datasets for further virtual and ML-based screening [12]. Fransen et al. conducted a biodegradation assay using the clear zone technique, in which suspended polymers were exposed to the bacterium Pseudomonas lemoignei. These observations were used to experimentally derive the effects of chemical structure on biodegradability across 600 polymers.
Virtual screening is a computational technique in which diverse sets of compounds are assessed for target properties. Current techniques in virtual and computational screening include ligand-based modeling and knowledge-based rules [13]. Ligand-based approaches predict the activity of molecules by mapping them to similar known structures, which works well for well-studied molecules but tends to be biased against novelty. Knowledge-based rules define specific chemical criteria to select or exclude certain molecules. This targeted selection increases the likelihood of identifying optimal molecules and reliably eliminates clear outliers. Knowledge-based filtering can also be tuned to a specific application area, with rules written to select molecules suited to, for example, sustainability or drug discovery. However, these rules may be so restrictive that they preemptively remove target molecules. For example, Lipinski's Rule of Five sets specific quantitative boundaries for features such as molecular weight, hydrogen-bond donor and acceptor counts, and LogP. These restrictions lead to higher rates of false negatives because this form of filtering applies the criteria regardless of biochemical complexity [14]. In addition, a rule set written for one application typically does not transfer to a broader field such as sustainability.
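The rigidity of such rules can be illustrated with a minimal sketch of a Rule of Five filter. The thresholds are the commonly cited ones (molecular weight at most 500 Da, at most 5 hydrogen-bond donors, at most 10 hydrogen-bond acceptors, LogP at most 5); the `Descriptors` class, the function names, and the example values are hypothetical and are not taken from this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (illustrative values only)."""
    mol_weight: float   # molecular weight in daltons
    h_donors: int       # hydrogen-bond donor count
    h_acceptors: int    # hydrogen-bond acceptor count
    log_p: float        # octanol-water partition coefficient


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule of Five criteria are violated."""
    checks = [
        d.mol_weight > 500,
        d.h_donors > 5,
        d.h_acceptors > 10,
        d.log_p > 5,
    ]
    return sum(checks)


def passes_filter(d: Descriptors, max_violations: int = 0) -> bool:
    """Hard cut-off: reject any molecule with more violations than allowed."""
    return lipinski_violations(d) <= max_violations


# Two hypothetical molecules that differ only slightly in molecular weight.
small = Descriptors(mol_weight=180.2, h_donors=1, h_acceptors=4, log_p=1.2)
borderline = Descriptors(mol_weight=501.0, h_donors=1, h_acceptors=4, log_p=1.2)

print(passes_filter(small))       # kept
print(passes_filter(borderline))  # rejected, although only 1 Da over the limit
```

The borderline molecule is discarded for exceeding a single threshold by a small margin, which is the kind of false negative described above. A learned classifier that outputs a continuous score avoids this all-or-nothing behavior.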
This work proposes a hybrid machine learning-based approach in which JTVAE-generated polymers are filtered down to biodegradable polyesters. For the filtering, a machine learning-based biodegradability classifier scores each generated molecule, capturing the biochemical complexity of biodegradability. The classifier is trained on Big Simplified Molecular Input Line Entry System (BigSMILES) strings of polyesters and polycarbonates characterized by HTS. The correlations of chemical rules with biodegradability are also computed to increase the interpretability of the final molecular design. Three properties were chosen to evaluate the structures from an atomic, bonding, and environmental perspective: molecular structure, bond types, and interaction with water. For molecular structure, aromatic rings are expected to reduce biodegradability because of their high rigidity and absence of enzyme target groups, such as hydroxyls. Ester linkages are expected to increase biodegradability due to their high susceptibility to hydrolysis. Lastly, hydrophobicity is expected to decrease biodegradability due to low solubility in water and other polar solvents.
2. Methods
2.1. Data Description and Preliminary Analysis
The methods used in this work can be split into three parts: polymer generation with JTVAE, biodegradable polyester filtering with a predictive model, and chemical and synthesizability analysis of the final molecules. The biodegradability predictor developed in this work uses gradient boosted trees, a machine learning model that retains performance even with small amounts of structured data.
We focus on molecules created by Jin et al.’s hierarchical encoder-decoder architecture for polymer graph generation. They trained their model on the dataset of 86,000 polymers from St. John et al. and evaluated it by comparing distributional statistics between the original and generated compounds, in addition to chemical validity and diversity [15]. In the current work, we generate 10,000 polymer molecules using model checkpoint 19 with the polymer vocabulary and the library's other default settings. Polyesters were then selected from the 10,000 polymers by identifying molecules with a repeated ester linkage. Ester linkages form through the reaction between an organic alcohol and a carboxylic acid, and a polyester is composed of repeating units of this group.
Several chemical properties were computed for the molecular structures using the cheminformatics library RDKit, including LogP, molecular weight, heavy atom count, and bonding information [16]. These were computed to investigate the general characteristics of the polyesters, broadly understanding their molecular frameworks before applying sustainability-specific filtering.
2.2. Biodegradability Predictor
Biodegradability cannot be inferred from the above properties alone, because structural motifs, interactions with solvents, and bonding information all contribute to this vital characteristic. A gradient boosted tree model was trained and tuned to quantify this property. The model was trained on a publicly available dataset of molecules with binary biodegradability labels from Fransen et al.’s work on the experimental discovery of biodegradable polymers.
The data was in the form of BigSMILES, string representations of polymer structure. The tokens “{[][<]”, “[<],[>]”, and “[>][]}” were removed from the BigSMILES strings so that they could be encoded as ordinary SMILES. These tokens mark the positions of repeating units within a polymer but do not affect the molecular formula, so the repetition of the subunit is left for the model to learn implicitly. Molecules were created from the SMILES, and each molecule was verified to be kekulizable. In RDKit, kekulization converts aromatic bonds into explicit alternating single and double bonds, which contributes to standardization across molecules. Each molecule was then converted into a 128-bit binary Morgan fingerprint vector, giving a modeling matrix with 549 rows and 128 columns.
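The token-removal step can be sketched with plain string replacement. The three tokens are the ones quoted above; the function name and the example BigSMILES string are hypothetical and chosen only for illustration.

```python
# BigSMILES tokens that mark repeat-unit boundaries, as listed in the text.
BIGSMILES_TOKENS = ("{[][<]", "[<],[>]", "[>][]}")


def strip_bigsmiles_tokens(bigsmiles: str) -> str:
    """Remove the repeat-unit markers so the string can be parsed as SMILES."""
    smiles = bigsmiles
    for token in BIGSMILES_TOKENS:
        smiles = smiles.replace(token, "")
    return smiles


# Hypothetical polyester built from a diacid unit and a diol unit.
example = "{[][<]C(=O)CCCCC(=O)[<],[>]OCCO[>][]}"
print(strip_bigsmiles_tokens(example))  # C(=O)CCCCC(=O)OCCO
```

After stripping, the two repeat units are simply concatenated, so the resulting string describes a single copy of each unit joined by an ester linkage. The parsing, kekulization check, and fingerprinting that follow in the pipeline are done with RDKit and are not reproduced here.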
The data was divided using a 70/10/20 split for training, validation, and testing, and the initial model was trained. The validation dataset was used for early stopping: training was halted when the validation binary cross-entropy loss had not improved for 10 iterations. Hyperparameter tuning was performed on boosted tree parameters such as learning rate, maximum depth, minimum data in leaf, and the regularization parameters lambda_l1 and lambda_l2. Several evaluation metrics were computed for the model, including the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC). The predictor was then used to compute a biodegradability score for each generated molecule, and the ten highest-scoring molecules were analyzed further.
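The split and the early-stopping criterion can be sketched as follows. This is an illustration under stated assumptions, not the code used in this work: the random seed, the unstratified shuffle, the rounding of split sizes, and both function names are hypothetical.

```python
import random


def split_indices(n_rows, fractions=(0.7, 0.1, 0.2), seed=0):
    """Shuffle row indices and cut them into train/validation/test parts.

    The test part receives whatever remains after the first two cuts,
    so the three parts always cover every row exactly once.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_rows)
    n_val = int(fractions[1] * n_rows)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def best_round_with_patience(val_losses, patience=10):
    """Return the index of the best round, stopping once the validation
    loss has failed to improve for `patience` consecutive rounds."""
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            break
    return best_round


train, val, test = split_indices(549)
print(len(train), len(val), len(test))  # 384 54 111

# Hypothetical loss curve: improves for three rounds, then plateaus.
losses = [0.70, 0.60, 0.50] + [0.55] * 20
print(best_round_with_patience(losses))  # 2
```

With 549 rows the sketch yields 384 training, 54 validation, and 111 test molecules. Gradient boosting libraries typically provide early stopping as a built-in option, so the second function only makes the stopping rule explicit.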
2.3. Synthesizability Analysis
Gao et al.’s SynNet was used to create synthesis pathways to show whether the polymers could be made in a laboratory using feasible reactions and purchasable compounds [17]. SynNet was chosen because it builds pathways from preexisting monomers rather than generating unrealistic molecular structures.
SynNet formulates synthesis planning as a Markov decision process conditioned on a target molecular embedding. Its neural networks build synthetic trees according to reaction rules drawn from a fixed set of reaction templates. The networks are trained on many pathways created from a database of compounds commercially available in the United States. The method was validated in three ways: first, by ensuring that the network could recover new molecules using conditional generation; second, through the identification of synthesizable structural analogs; and third, through optimization for specific applications, such as drug discovery. In this work, SynNet was used to produce the SMILES strings of the monomers for each target polyester. The National Institutes of Health’s Chemical Identifier Resolver was then used to find the compound names and make the results more interpretable [18].
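The name lookup can be sketched as a URL request to the resolver. The sketch assumes the resolver's public URL scheme of the form `/chemical/structure/<identifier>/<representation>` hosted at cactus.nci.nih.gov, with `iupac_name` as the requested representation; the function names are hypothetical, and how the lookup was actually performed in this work is not stated.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the NCI/CADD Chemical Identifier Resolver.
CIR_BASE = "https://cactus.nci.nih.gov/chemical/structure"


def build_cir_url(smiles: str, representation: str = "iupac_name") -> str:
    """Build the lookup URL for one SMILES string.

    SMILES may contain characters such as '#', '=', and parentheses that
    are not safe in a URL path, so the string is percent-encoded first.
    """
    return f"{CIR_BASE}/{quote(smiles, safe='')}/{representation}"


def resolve_name(smiles: str, timeout: float = 10.0) -> str:
    """Query the resolver over the network and return its text response."""
    with urlopen(build_cir_url(smiles), timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


print(build_cir_url("OCCO"))
print(build_cir_url("OC(=O)CCCCC(=O)O"))
```

Only `build_cir_url` runs offline. `resolve_name` needs network access and raises an error if the service cannot resolve the structure, so a real pipeline would wrap it in error handling and rate limiting.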
3. Results and Discussion
3.1. Results for the Biodegradability Predictor
Figure 1 shows the Receiver Operating Characteristic (ROC) curves of the biodegradability binary classification model for the training, validation, and test data splits. An AUROC of 83.59% on the holdout test dataset indicates that the model distinguishes effectively between biodegradable and nonbiodegradable molecules. The black dotted line shows the chance-level baseline (an AUROC of 50%), which corresponds to a model that has learned nothing and predicts at random. Figure 2 shows the Precision-Recall curves for the same model and data splits. An AUPRC of 87.24% on the test dataset indicates that the model is reliable in identifying true positives, that is, polyesters labeled as biodegradable.
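For readers unfamiliar with the metric, AUROC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy labels and scores are hypothetical and unrelated to the data in this work.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is quadratic in the number of examples, which is
    fine for illustration but slow for large datasets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive-negative pairs are ordered correctly.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

# A model that gives every molecule the same score sits on the chance line.
print(auroc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

The second call shows why the diagonal in a ROC plot corresponds to an AUROC of 0.5: an uninformative scorer orders positive and negative examples correctly only half of the time.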
3.2. Effect of Descriptors on Biodegradability
The chemically valid polymers generated by Jin et al.’s model were filtered as described previously and scored using the biodegradability model. The properties of the ten highest-scoring polymers were analyzed further, and their synthesis pathways were generated.
Figure 1. Receiver Operating Characteristics (ROC) Curves for train, validation and test data splits using the Biodegradability Prediction Model.
Figure 2. Precision Recall (PR) Curves for train, validation and test data splits using the Biodegradability Prediction Model.
The effects of various chemical properties on biodegradability are discussed below and are consistent with the experimentally derived findings in Fransen et al.’s work. The most biodegradable molecules contained a carbon backbone of seven or fewer connections, whereas molecules with more than fifteen showed largely inhibited biodegradability. This observation is supported by the analysis of molecular weight as a chemical descriptor, which is negatively correlated with biodegradability. Polar heteroatoms (atoms other than C or O) also contribute to biodegradability, likely through their availability for interactions with enzymes. Figure 3 and Figure 4 illustrate the trend in molecular weight: the best-performing molecule, shown in Figure 3 with a score greater than 0.8, contains a much shorter carbon backbone than the molecule in Figure 4, which scored very poorly.
Figure 3. Visualization of a molecule scoring greater than 0.8.
Figure 4. Visualization of a molecule scoring less than 0.8.
3.3. Effect of Molecular Structure on Biodegradability
The presence of aromatic rings had a weak negative correlation with biodegradability. This may be due to the higher rigidity that their alternating single and double bonds confer, which hinders degradation. In terms of material properties, the presence of these rings may preserve features such as thermal and mechanical strength, which are vital for polyesters used in textile production.
3.4. Effect of Bond Types on Biodegradability
Among bond types, ester linkages were weakly positively correlated with biodegradability, with a correlation coefficient of 0.062. This correlation may arise from the following chemical properties. Ester bonds form between carbon and oxygen atoms, which have a substantial electronegativity difference. This results in a very polar bond, making structures soluble in polar solvents and cleavable by hydrolytic enzymes. Consistent with this, the presence of polar heteroatoms was also positively correlated with biodegradability, as illustrated by the prevalence of sulfur atoms in Figure 3. Additionally, repeating C=O bonds are commonly found in many types of organic matter. This prevalence has resulted in the evolution of metabolic degradation pathways for similar molecules across several microorganisms. Polyesters with these characteristics generally degrade in a wide variety of environments and conditions.
Figure 5. Correlation coefficients of chemical descriptors.
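The text does not state which correlation coefficient was used for Figure 5. As one plausible reading, the sketch below computes the Pearson coefficient between a descriptor and the predicted biodegradability score. The function name and the toy descriptor values and scores are hypothetical and are not the data behind Figure 5.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data: 1 if the molecule contains an aromatic ring, else 0,
# paired with a made-up biodegradability score for each molecule.
has_aromatic_ring = [1, 0, 1, 0, 0, 1]
scores = [0.2, 0.9, 0.4, 0.7, 0.8, 0.3]

print(pearson(has_aromatic_ring, scores))  # negative in this toy example
```

With a binary descriptor such as ring presence, the Pearson coefficient is equivalent to the point-biserial correlation. Coefficients close to zero, such as the 0.062 reported for ester linkages, indicate only a weak linear association.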
3.5. Effect of Solvent Interactions on Biodegradability
Finally, there was a weak positive correlation between hydrophobicity and biodegradability, contrary to the expectation stated in the introduction. Hydrophobicity was quantified using the partition coefficient, LogP, which is computed as a chemical descriptor in RDKit. A positive LogP value indicates lipophilicity, while a negative value indicates a higher affinity for the aqueous phase. The correlation coefficient for LogP was 0.16, suggesting that hydrophobicity was mildly favored, possibly because it facilitates entry through the cell membrane, increasing uptake and subsequent degradation. Additionally, when polymerized, more hydrophobic molecules will retain qualities such as durability, breathability, and stain resistance, which are necessary properties in textiles. All analysis of chemical descriptors is supported by the correlations shown in Figure 5.
3.6. SynNet Analysis
SynNet produced synthesis pathways for all of the top-scoring molecules, indicating that they are chemically synthesizable. The building blocks had simple structures, and each pathway required fewer than ten monomers, suggesting that the polyesters are accessible and not complex to produce. SynNet does not consider the commercial solvents and other materials required to produce the polymers, only the basic components themselves. Thus, additional materials and costs may be necessary to make these new polymers.
4. Conclusions
This study proposes a novel machine learning-based virtual screening method for biodegradable polyesters. To achieve this screening, a tree-based model captures the biodegradability property and scores a set of AI-generated molecules. The final molecules are predicted to be biodegradable and have feasible synthesis pathways. With a test AUROC of 83.59% and a test AUPRC of 87.24%, the classifier provides a reliable basis for this screening.
Molecular structure, bonding, and interactions with water were evaluated for their effect on biodegradability to aid the interpretability of the final molecules. The presence of aromatic rings showed only a weak negative correlation, suggesting that these rings in small quantities may be retained to support final polymer properties. The presence of ester linkages was weakly positively correlated with biodegradability. Finally, hydrophobicity, as measured by the partition coefficient, was mildly positively correlated with biodegradability.
Further analysis can include building prediction models that capture other properties that contribute to sustainability, such as solubility, and using them in conjunction with the biodegradability model. Additionally, the structures designed should undergo in silico simulations under various environmental conditions as well as in vitro testing. In the virtual screening space, generative models can be leveraged to produce novel compounds with desired properties directly, rather than filtering a set of general structures, which opens several applications in the sustainability field.