New project takes aim at theory-experiment gap in materials data
Condensed-matter physics and materials science have a silo problem. Although researchers in these fields have access to vast amounts of data – from experimental records of crystal structures and conditions for synthesizing specific materials to theoretical calculations of electron band structures and topological properties – these datasets are often fragmented. Integrating experimental and theoretical data is a particularly significant challenge.
Researchers at the Beijing National Laboratory for Condensed Matter Physics and the Institute of Physics (IOP) of the Chinese Academy of Sciences (CAS) recently decided to address this challenge. Their new platform, MaterialsGalaxy, unifies data from experiment, computation and scientific literature, making it easier for scientists to identify previously hidden relationships between a material’s structure and its properties. In the longer term, their goal is to establish a “closed loop” in which experimental results validate theory and theoretical calculations guide experiments, accelerating the discovery of new materials by leveraging modern artificial intelligence (AI) techniques.
Physics World spoke to team co-leader Quansheng Wu to learn more about this new tool and how it can benefit the materials research community.
How does MaterialsGalaxy work?
The platform works by taking the atomic structure of materials and mathematically mapping it into a vast, multidimensional vector space. To do this, every material – regardless of whether its structure is known from experiment, from a theoretical calculation or from simulation – must first be converted into a unique structural vector that acts like a “fingerprint” for the material.
Then, when a MaterialsGalaxy user focuses on a material, the system automatically identifies its nearest neighbours in this vector space. This allows users to align heterogeneous data – for example, linking a synthesized crystal in one database with its calculated topological properties in another – even when different data sources define the material slightly differently.
The vector-based approach also enables the system to recommend “nearest neighbour” materials (analogs) to fill knowledge gaps, effectively guiding researchers from known data into unexplored territories. It does this by performing real-time vector similarity searches to dynamically link relevant experimental records, theoretical calculations and literature information. The result is a comprehensive profile for the material.
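As a rough illustration of the nearest-neighbour idea described above, the following minimal Python sketch ranks records by cosine similarity between fingerprint vectors. It is not MaterialsGalaxy's implementation: the record names, the four-component vectors and the choice of cosine similarity are all assumptions made for the example, and real structural fingerprints would have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero fingerprint as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbours(query_id, fingerprints, k=2):
    """Return the k records most similar to query_id, best match first."""
    query = fingerprints[query_id]
    scored = [
        (other_id, cosine_similarity(query, vector))
        for other_id, vector in fingerprints.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical records: an experimental entry, two calculated entries
# and one AI-generated candidate. The numbers are invented for illustration.
FINGERPRINTS = {
    "exp:crystal-A": [0.90, 0.10, 0.00, 0.40],
    "calc:crystal-A": [0.88, 0.12, 0.01, 0.41],
    "calc:crystal-B": [0.10, 0.95, 0.30, 0.00],
    "gen:candidate-C": [0.80, 0.20, 0.05, 0.35],
}

neighbours = nearest_neighbours("exp:crystal-A", FINGERPRINTS, k=2)
for record_id, score in neighbours:
    print(f"{record_id}: {score:.4f}")
```

In this toy data the experimental record is matched first to the calculated entry for the same crystal, even though the two vectors differ slightly, and then to the generated candidate as an analogue. A production system would typically replace the linear scan with an approximate nearest-neighbour index.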
Where does data for MaterialsGalaxy come from?
We aggregated data from three primary channels: public databases; our institute’s own high-quality internal experimental records (known as the MatElab platform); and the scientific literature. All data underwent rigorous standardization using tools such as the pymatgen (Python Materials Genomics) materials analysis code and the spglib crystal structure library to ensure consistent definitions for crystal structures and physical properties.
Who were your collaborators on this project?
This project is a multi-disciplinary effort involving a close-knit collaboration among several research groups at the IOP, CAS and other leading institutions. My colleague Hongming Weng and I supervised the core development and design under the strategic guidance of Zhong Fang, while Tiannian Zhu (the lead author of our Chinese Physics B paper about MaterialsGalaxy) led the development of the platform’s architecture and core algorithms, as well as its technical implementation.
We enhanced the platform’s capabilities by integrating several previously published AI-driven tools developed by other team members. For example, Caiyuan Ye contributed the Con-CDVAE model for advanced crystal structure generation, while Jiaxuan Liu contributed VASPilot, which automates and streamlines first-principles calculations. Meanwhile, Qi Li contributed PXRDGen, a tool for simulating and generating powder X-ray diffraction patterns.
Finally, much of the richness of MaterialsGalaxy stems from the high-quality data it contains. This came from numerous collaborators, including Weng (who contributed the comprehensive topological materials database, Materiae), Youguo Shi (single-crystal growth), Shifeng Jin (crystal structure and diffraction), Jinbo Pan (layered materials), Qingbo Yan (2D ferroelectric materials), Yong Xu (nonlinear optical materials), and Xingqiu Chen (topological phonons). My own contribution was a library of AI-generated crystal structures produced by the Con-CDVAE model.
What does MaterialsGalaxy enable scientists to do that they couldn’t do before?
One major benefit is that it prevents researchers from becoming stalled when data for a specific material is missing. By leveraging the tool’s “structural analogs” feature, they can look to the properties or growth paths of similar materials for insights – a capability not available in traditional, isolated databases.
We also hope that MaterialsGalaxy will offer a bridge between theory and experiment. Traditionally, experimentalists tend to consult the Inorganic Crystal Structure Database while theorists check the Materials Project. Now, they can view the entire lifecycle of a material – from how to grow a single crystal (experiment) to its topological invariants (theory) – on a single platform.
Beyond querying known materials, MaterialsGalaxy also allows researchers to use integrated generative AI models to create new structures. These can be immediately compared against the known database to assess synthesis feasibility and potential performance through a “vertical comparison” workflow.
What do you plan to do next?
We’re focusing on enhancing the depth and breadth of the tool’s data fusion. For example, we plan to develop representations based on graph neural networks (GNNs) to better handle experimental data that may contain defects or disorder, thereby improving matching accuracy.
We’re also interested in moving beyond crystal structure by introducing multi-modal anchors such as electronic band structures, X-ray diffraction (XRD) patterns and spectroscopic data. To do this, we plan to utilize technologies derived from contrastive language-image pre-training (CLIP) to enable cross-modal retrieval, for example searching for theoretical band data by uploading an experimental XRD pattern.
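To make the cross-modal idea concrete, here is a small Python sketch of a CLIP-style symmetric contrastive loss and the retrieval step it enables. It is only a sketch under stated assumptions, since the planned MaterialsGalaxy feature has not been described in detail: the two-dimensional embeddings stand in for the outputs of hypothetical XRD and band-structure encoders, and the function names and temperature value are invented for the example.

```python
import math


def normalise(vector):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(xrd_embs, band_embs, temperature=0.1):
    """Cosine similarities between every XRD/band pair, divided by temperature."""
    xs = [normalise(v) for v in xrd_embs]
    bs = [normalise(v) for v in band_embs]
    return [[sum(p * q for p, q in zip(x, b)) / temperature for b in bs] for x in xs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(xrd_embs, band_embs, temperature=0.1):
    """Symmetric contrastive loss: XRD-to-band plus band-to-XRD, averaged."""
    logits = similarity_matrix(xrd_embs, band_embs, temperature)
    transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def retrieve(query_xrd_emb, band_embs):
    """Index of the band-structure embedding closest to an XRD query."""
    scores = similarity_matrix([query_xrd_emb], band_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (invented): entry i of each list describes the same material.
XRD_EMBS = [[1.0, 0.0], [0.0, 1.0]]
BAND_EMBS = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = clip_style_loss(XRD_EMBS, BAND_EMBS)
shuffled_loss = clip_style_loss(XRD_EMBS, list(reversed(BAND_EMBS)))
best_index = retrieve([1.0, 0.0], BAND_EMBS)
print(aligned_loss < shuffled_loss, best_index)
```

Training would adjust the two encoders to lower this loss, pulling embeddings of the same material together across modalities. Once that is done, retrieval reduces to the same nearest-neighbour search used for structural fingerprints.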
Separately, we want to continue to expand our experimental data coverage, specifically targeting synthesis recipes and “failed” experimental records, which are crucial for training the next generation of “AI-enabled” scientists. Ultimately, we plan to connect an even wider array of databases, establishing robust links between them to realize a true Materials Galaxy of interconnected knowledge.
The post New project takes aim at theory-experiment gap in materials data appeared first on Physics World.
