Efficient implementation of Deep Learning techniques for the spectroscopical identification of molecules using Tensor Decomposition methods

Zixing QIU

ISMO

This research investigates the problem of molecular identification from spectral data: the inverse problem of inferring molecular structure from a given spectrum, as opposed to predicting spectra from known molecular representations. We propose a novel deep learning framework inspired by large language models (LLMs), treating the mapping between spectra and molecular representations as a translation task. To overcome the prohibitive memory and energy costs associated with LLM architectures, we introduce the Hierarchical Tucker Finite Basis Representation (HT-FBR), a tensor decomposition approach based on Hierarchical Tucker Decomposition (HTD) and successive Singular Value Decompositions. By interpolating the leaf factors with piecewise polynomials, we obtain a functional representation that can be efficiently reconstructed from a coarser reference tensor with the same number of modes and the same domain of definition. Furthermore, we develop a chain-of-operators form that enables single-entry evaluation of tensor elements, avoiding the explicit construction of the full Tucker frame tree. This form facilitates embarrassingly parallel computation and allows direct optimization of the HT-FBR parameters from scattered reference data using a backpropagation-like scheme. Additionally, the HT-FBR framework is demonstrated in the fitting of potential energy surfaces for two systems: (i) the benchmark cis-trans isomerization of HONO (6D) and (ii) the coupled 12D diabatic potential energy surfaces of ethene. For the latter, we use a database generated through a series of Direct Dynamics variational Multiconfigurational Gaussian (DD-vMCG) trajectories. Our HT-FBR representation is fully compatible with both trajectory-based and wavepacket-based methods such as (ML-)MCTDH.
We further integrate the HT-FBR into an LLM-based architecture by embedding the chain-of-operators form within the computational graph, allowing end-to-end optimization of both the HT-FBR factors and the model weights. This structured design reduces parameter redundancy and computational cost, yielding a significant reduction in the total number of parameters without loss of predictive accuracy. To evaluate the proposed model, we apply the decomposed LLM architecture to the identification of polycyclic aromatic hydrocarbon (PAH)-like molecules from infrared spectra. A comprehensive molecular database was constructed by combining the NASA Ames PAH and NIST datasets. Training was accelerated using multi-GPU parallelization via the Distributed Data Parallel (DDP) and Distributed Model Parallelization (DMP) frameworks. Experimental results demonstrate that the proposed architecture achieves a low cross-entropy loss in learning the molecular feature space, validating its efficiency and predictive capability.
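The single-entry evaluation mentioned above can be illustrated with a minimal sketch. It assumes a four-mode tensor in a balanced hierarchical Tucker format; the function names (`ht_entry`, `ht_full`), the tree layout, the dimensions, and the ranks are all hypothetical choices made for illustration and are not taken from the thesis. The piecewise-polynomial interpolation of the leaf factors is omitted: the leaves here are plain matrices sampled on a grid.

```python
import numpy as np

def ht_entry(leaves, b12, b34, root, index):
    """Evaluate one tensor entry by contracting up the tree.

    Only one row of each leaf matrix is touched, so the full tensor
    is never formed (illustrative layout: ((1,2),(3,4)) binary tree).
    """
    u1, u2, u3, u4 = (U[i] for U, i in zip(leaves, index))
    # Combine the two children of each interior node into one vector.
    v12 = np.einsum("a,b,abc->c", u1, u2, b12)
    v34 = np.einsum("a,b,abc->c", u3, u4, b34)
    # The root transfer matrix couples the two branches.
    return float(v12 @ root @ v34)

def ht_full(leaves, b12, b34, root):
    """Reference: build the full tensor explicitly (for checking only)."""
    U1, U2, U3, U4 = leaves
    return np.einsum("ia,jb,abp,kc,ld,cdq,pq->ijkl",
                     U1, U2, b12, U3, U4, b34, root)

# Hypothetical sizes: mode dimensions and leaf ranks.
rng = np.random.default_rng(0)
dims, ranks = (5, 6, 4, 3), (2, 3, 2, 2)
leaves = [rng.standard_normal((n, r)) for n, r in zip(dims, ranks)]
b12 = rng.standard_normal((ranks[0], ranks[1], 3))   # node (1,2), rank 3
b34 = rng.standard_normal((ranks[2], ranks[3], 2))   # node (3,4), rank 2
root = rng.standard_normal((3, 2))                   # root transfer matrix

full = ht_full(leaves, b12, b34, root)
idx = (4, 1, 2, 0)
entry = ht_entry(leaves, b12, b34, root, idx)
print(np.isclose(entry, full[idx]))
```

Because each call to `ht_entry` depends only on its own index tuple, evaluations at different indices are independent of one another, which is the property that makes this kind of scheme embarrassingly parallel and suitable for fitting to scattered reference points.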