Automation of Database Generation, Statistical Analysis, and Supervised Training of Graph Neural Networks: Application to the Spectroscopy of Large CH Clusters of Astrophysical Interest
Xuewen XIAO
ISMO
Accurate prediction and interpretation of infrared (IR) spectra for large hydrocarbon clusters remain challenging because quantum-chemical methods are computationally expensive and experimental spectra are complex. The same computational cost, together with the limited availability of experimental data, makes it difficult to generate large, diverse, and high-quality training datasets for machine learning. To address these obstacles, this thesis introduces Mech‑AMK, an automated workflow that couples systematic database generation with a partition-based fragmentation procedure to produce synthetic training data that retain the essential spectral features of large molecular clusters.
Mech‑AMK was applied to representative large CH aggregates, C60H10 and C120H10, to produce two complementary collections: a high‑level dataset of optimized structures and computed IR spectra obtained at the B97‑D3/def2‑SVP level of theory, and a corresponding synthetic‑fragments dataset generated via partition-based fragmentation and recombination rules that emulates the spectral contributions of the full systems. Both collections include three‑dimensional geometries and computed vibrational spectra suitable for supervised learning.
The high‑level dataset was used to train graph neural networks (GNNs) for IR‑spectrum prediction. Model development and benchmarking show that an AttentiveFP architecture achieves higher accuracy and spectral fidelity than a standard message‑passing neural network (MPNN) baseline, improving the reproduction of peak positions and intensities across the spectrum. The simulated spectra were also compared directly with experimental measurements: good agreement is observed in the C–C stretching and bending region (~850–1700 cm⁻¹) and in the C–H stretching region (~700–3100 cm⁻¹), indicating that the combined high‑level and synthetic datasets capture the dominant spectroscopic signatures of large hydrocarbon aggregates.
Overall, this work demonstrates that automated database generation paired with fragmentation‑derived synthetic data enables scalable training of GNNs for reliable vibrational‑spectra prediction and facilitates quantitative simulation–experiment comparisons for large molecular systems. The approach provides a practical route to extending spectroscopic machine learning to progressively larger and more diverse aggregates and suggests future extensions toward experimental observables beyond IR spectra.
Supervised by Daniel Pelaez Ruiz