Introduction

Data-driven solutions in materials science have become powerful tools for many applications, such as predicting potentially new materials1,2,3, targeted property and composition modification4,5,6,7, exploring vast composition/configuration spaces (CCSs)8,9,10,11, studying processes12,13,14, etc. The undeniable advantage of such approaches is the ability to speed up predictions compared to direct application of density functional theory (DFT) calculations to entire search spaces. In direct structure-to-property predictions, any machine learning model can be considered as a surrogate model approximating computationally expensive solvers. For large chemical spaces, this definitely may be extremely helpful due to the enormous number possible atomic arrangements in three-dimensional space and the complexity of the energy surface under consideration15,16. Within more complex heuristics, such solutions can be easily extended to molecular dynamics applications through approximation of interatomic potentials17. Whatever the method used, there is always data behind the scenes. In turn, data sampling for training purposes may become a non-straightforward task for a separate research in some cases, e.g., for disordered systems18,19,20,21,22.

Classical machine learning algorithms can provide useful and reliable results even being trained on small or noisy datasets prepared in tabular form23,24,25,26. For such models, the successful application of active learning for materials discovery should be also noted27,28. For such solutions, specific geometrical/topological descriptors of the search space entries specially developed for a certain domain29,30,31 or degenerated automatically32,33,34 are required. On the one hand, this peculiarity requires expert knowledge in the field of certain class of functional materials, and on the other hand, reduces transferability, universality, and extendibility of the model developed for a certain task. Considering CCSs of disordered crystals, one should also keep in mind a limited descriptive power of many automatically generated and even hand-crafted descriptor sets in such spaces11,35. In contrast to the descriptor-based approaches, neural networks have the ability to use the structure representation directly, searching for the most essential features under the hood of a specific problem solution, such as predicting various properties of materials36,37. By handling atoms as nodes/vertices, chemical structures can be naturally represented as a graph and hence be processed by graph neural networks (GNNs)38,39. Thus, based on the above comparison of the classical machine learning approaches and GNN-based solutions, it can be assumed that the latter mentioned group has much more bright perspectives in materials discoveries, where descriptor-based approaches may be limited.

Among the families of materials for energy conversion and storage applications, reliable tools for thermodynamic property assessments are of practical importance for chemically modified (doped or alloyed) perovskites. Solar cells utilizing lead halide perovskites have already achieved a power conversion efficiency exceeding 25%40,41,42. Among the lead halide perovskites, CsPbI3 appears to be the most promising due to its low band gap of 1.7 eV allowing for efficient absorption of light. The γ-phase of CsPbI3 retains favorable optoelectronic characteristics, such as direct bandgap and high charge-carrier mobility, making it a potential alternative to the highly symmetrical cubic perovskite phase43. However, any large-scale applications of CsPbI3 face difficulties caused by its polymorphic transitions44,45,46 into the undesirable δ-CsPbI3 phase possessing no useful properties. γ- and δ-CsPbI3 powders are black- and yellow-colored. Thus, we use corresponding designations bellow.

One of many methods for stabilization of the γ-phase is increasing tolerance factor by means of partial Pb2+ ion substitution at B-site46,47,48,49,50,51,52,53. It is well-known that Cd and Zn cations leads to significant improvement of electro- and photoluminescence of CsPbBr3 and CsPbCl3 nanoparticles54,55,56,57,58,59,60. Additionally, substitution of Pb2+ at B-site by Cd2+/Zn2+ within the CsPbI3 structure can be seen as either isoelectronic doping at lower concentrations, or alloying at higher concentrations. Therefore, these dopants do not introduce extra charge carriers in the systems. For CsPbI3, we recently studied the effects of such chemical modifications61 from computational perspectives. This work showed that CCS complexity drastically increases when substituent content increases and can hardly be studied by means of DFT. From domain specifics point of view, partial substitution of I by Br also contributes to γ-phase stabilization with only a small increase in the band gap62,63.

Doping and chemical modifications not only seems promising for the γ-phase stability enhancement, but definitely lead to a dramatic increase in the complexity of the corresponding CCS from computational/predictive perspectives. The aim of the current contribution is to shed light on answering the question «How should one subsample entire range of disorder realization within particular crystal structures to train the most precise machine learning models?” On this track, simultaneous substitution of Pb2+ at B-site by Cd2+/Zn2+ and I by Br represent perfect ability for the development and verifications of efficient domain-inspired data sampling strategies for the subsequent training GNN models.

Methodology

Complete composition/configuration space

To address the aforementioned issue, a set of γ- and δ-CsPbI3 structures with partial substitutions of Pb and I positions by Cd/Zn and Br, respectively, was used. The pristine crystal structures were taken from the experimental data64. For the complexity reasons, we set limits of Pb and I site substitutions for both phases studied to 25% and 33.3%, espectively. To do so, the conventional crystal structures of γ- and δ-CsPbI3 were enlarged up to the 1 × 2 × 1 supercells followed by substitution of up to 2 Pb atoms and up to 8 I atoms by Cd and Br atoms, respectively. Technically, the described procedure was performed by writing corresponding partial occupancies of Wyckoff site 4b (4с for δ-phase) filled with a mix of Pb, Cd and Wyckoff sites 4c, 8d (4c, 4c, 4c for δ-phase) filled with a mix of I, Br to CIF file of γ- or δ-phase for the Supercell program65. In this way, two complete CCSs of unique symmetrically inequivalent structures were obtained with Cd content range from 0 to 25% with step 12.5% and Br content range from 0 to 33.3% with step 4.167%. Within the built CCSs, each entry has a weight, reflecting the number of symmetrically equivalent structures, obtained as a result of the full CCSs generation in a purely combinatorial way.

DFT calculations

For the structures chosen in this study for the training, validation, or test purposes, target properties used in GNN fitting were obtained using the DFT calculations. Total energies of each structure of a certain dataset and all neat constituents in their respective ground states were calculated after relaxation using Vienna Ab initio Simulation Package (VASP) with Perdew-Burke-Ernzerhof exchange-correlation functional and projector-augmented wave pseudopotentials66,67. The energy cutoff of 600 eV was employed. The calculations were performed at the Г-point of reciprocal space to eliminate the interactions between defects and due to relatively large supercell size. Stopping tolerances for energy convergence within the self-consistency loop and maximum forces acting on atoms after the relaxation process were set to 10− 5 eV and 0.01 eV/Å, respectively. Structure formation energy was chosen as a target property for the GNN models.

GNN models

To stay consistent with the previously obtained results, we used the Allegro model68, provided state-of-the-art results on the benchmarks. Allegro was compared61 with another rather pioneer model – SchNet69 – and showed the best performance obtained for doped CsPbI3. Allegro represents a many-body potential using iterated tensor products of learned equivariant edge representations without atom-centered message passing in contrast to other popular in the field of machine learning potentials models, where message passing process between neighboring nodes implemented in. For example, NequIP model70, can significantly change node embeddings and reduce information about the actual arrangement of dopants. Instead, the Allegro model operates only on graph edges, and node embeddings remain unchanged.

Firstly, the Allegro model was pretrained on a subset of the AFLOWLIB database, representing a storage of DFT-derived properties of experimentally observed and hypothetical crystals71. This database is managed by the AFLOW software that uses VASP for relaxations and thermodynamic properties evaluations. For model pretraining, all structures containing at least two species out of Cs, Pb, Zn, I, Br were selected from AFLOWLIB and split into train and validation parts in ratio of 90:10. The obtained distributions of relaxed and formation energies for the structures within the built dataset are given in Figures S1-2 of the Supplementary Information. This step helps the model to catch information about interatomic interactions in the structures similar to the title compounds.

Results and discussion

Hints of additional data sampling criteria

In the aforementioned work61, we comprehensively analyzed impact of the number of substitutions in the training and validation structures on the quality of the GGN model predictions. Each of the introduced training/validation groups comprised the structures with the same numbers of defects. The higher the defect content, the higher the number of structure realizations in the CCS. To account for this, we considered four random subsets of the structures with 2 and 3 substituted sites within each of the introduced groups. For this reason, models trained and validated on the datasets identical in terms of defect contents – training/validation groups – might provide different predictions only due to the randomness of the structures constituted them. For instance, the models validated on the maximum available number of the structures with 2 substituted sites and, consequently, trained only on the structures with other defect contents – training/validation group 9 of ref.61 – demonstrated widely scattered root-mean-square errors (RMSEs) as shown in Fig. 1. This observation can be definitely interpreted as an inconsistence of the model predictions with respect to the random subsampling rather than chemical compositions of training/validation samples.

By design, the complete CCS included all possible symmetrically inequivalent structure realizations. Despite the same chemical composition, some CCS entries can thus possess different symmetries caused by the differences in defect arrangements. Thus, the composition differences by themselves cannot be the only reason for this behavior. Therefore, one can consider space groups of the built CCS entries (prior to any relaxation of them) as the next most obvious and easily accessible metainformation in addition to the composition. The test scores in the abovementioned work were checked for the group of training/validation splits providing the highest variation of them. Such a comparison demonstrated that the most pronounce differences in the model scores might be associated to the differences in the data splits concerning presence or absence of the structures of Pc (7) and Pm (8) space groups. This fact prompted us to study the symmetry effect in more detail.

Fig. 1
figure 1

For the 9th train/validation group in the work61, distributions of Cd- and Zn-doped CsPbI3 crystal structures by their space groups determined prior to DFT relaxations. The corresponding test RMSE scored of the models trained on the 9th group are provided in meV/atom.

CCS statistics

The complete CCSs obtained in the current study included 2,946,709 and 2,995,462 inequivalent structures for the γ- and δ-CsPbI3 phases, respectively. For each entry of the built CCSs, the corresponding space symmetry was additionally determined prior to any DFT evaluations using the pymatgen library72. Furthermore, defect contents in each entry, i.e. a set of Cd and Br atoms quantity at each Wyckoff site of 1 × 2 × 1-supercell were collected as metainformations. Expectedly, the introduced substitutions lead to symmetry reduction and to a structure diversity in terms of space groups as shown in Figs. 2 and 3. The readers are also referred to Figure S3 of the Supplementary Information section for more details on the corresponding group subgroup relations obtained within the built CCSs.

Regarding the structure weights it should be added that there was no strong correlation obtained between the space group of a structure and its weight. Nevertheless, the entries with the highest available space group, namely Pnma (62), had the lowest weights of 1, and vice versa all P1 (1) entries possessed weights of 16. Thus, any random subsample of the full CCSs is unlikely to provide high symmetry structures for statistical reasoning – low numbers of high symmetry structures and low weight each of them. The graphical representation of weight dependencies via symmetries over the entire CCSs is provided in Figure S4.

Fig. 2
figure 2

Heatmap of the number of complete CCS entries of the black γ-CsPbI3 via Pb and I substitution levels and symmetry space groups. The empty fields correspond to the lack of certain combination. The color map corresponds to the natural logarithm of the number of CCS entries (the values provided as notations).

Fig. 3
figure 3

Heatmap of the number of complete CCS entries of the yellow δ-CsPbI3 via Pb and I substitution levels and symmetry space groups.

The obtained space groups within the built CCSs turned out to be severely unbalanced. In the case of black γ-CsPbI3, space symmetry groups greater than 7 (Pc) corresponded to 176 structures (less than 0.006% of the built CCS). On the other hand, 99.6% of the structures possessed space group P1. In the case of yellow δ-CsPbI3, corresponding values equaled to 0.014% and 96.4%, respectively.

To explore possible effect of space symmetries in training and validation subsets on the resulting performance of the GNN model, we added space group numbers of the entries of the CCSs built in this study and subsequently used this metainformation for the required balancing of the training/validation and test subsets.

Training, validation, and test subsets

Based on the space groups statistics, we set a threshold for space group number to split the entire CCSs into high- and low-symmetry structures. To select the former mentioned group, 176 γ- and 405 δ-phase structures with space group number greater than 7 (Pc) were considered. To select low-symmetry entries, all γ-phase structures of the full CCS were ordered in ascending order by space group number and in descending order by weight. For each unique composition i in the high-symmetry structure set, first ni structures were collected from the ordered list obtained, where ni equals to the number of high-symmetry γ-phase structures with a certain defect content. If there were no structure with required defect composition among those possessing space groups lower than the introduced threshold, structures with a higher number were used. In the same way δ-phase structures were collected. This allowed preserving similar chemical compositions among the introduced groups and directly track symmetry impact on the model performance. In such way, we follow monothetic analysis, i.e. method of designing experiments involving testing factors one at a time instead of multiple ones simultaneously.

Thus, two structure sets, referred to hereafter as predominantly high-symmetry (PHS) and predominantly low-symmetry (PLS) datasets, were created. Both of them comprised 176 γ- and 405 δ-phase structures possessing the similar defect compositions not to add excessive uncertainty, but different space group distributions. Both datasets were further doubled by substitution all Cd atoms with Zn to the size of 1162 structures each and subdivided into train and validation parts with 80/20 ratio using γ-/δ-phase stratification.

For the test purposes, 100 random γ- and 100 random δ-phase structures with P1 space group and the highest weights were taken from the complete CCS. It is worth noting here that defect compositions of the selected structures was different compared to that of train/validation structures allowing to check generalization ability of the models additionally. The obtained dataset was doubled by substitution all Cd atoms with Zn and denoted as PLS test dataset (a total of 400 structures).

To create PHS test dataset, γ-phase structures of the full CCS with space group number less than or equal to 7 (Pc) were ordered in descending order by space group number and weight. After that first ni structures with defect composition i were picked, where ni equals to amount of γ-phase structures with defect composition i in PLS test dataset with Cd dopant only. In the same way δ-phase structures were picked. Thus, the dataset comprising 100 γ- and 100 δ-phase structures was made. Its structures have the same defect compositions as in the case of the PLS test, but high space group number. One cannot take structures with space group number more than 7 (Pc) to add them to the PHS test, because these structures were included in the PHS train/validation dataset.

Furthermore, 45 (5*9) additional train/validation datasets intermediate in a sense of space group contents were created by random mixing PLS and PHS train/validation datasets in a ratio of 10:90, 20:80, etc. Besides, 4 supplementary train/validation datasets were sampled using similar procedure as in the case of PLS dataset. In each case, the subsets were considered with different random seeds to track how the changes in train/validation dataset affect quality of the corresponding models and to obtain statistically significant results. It is worth noting that each Cd-substituted structure in each one of 49 abovementioned datasets had a pair structure with Zn as substituent. Detail information about the datasets obtained is given in Table 1 and Figures S5 – S10.

Table 1 Dataset statistics with respect to the corresponding CsPbI3 polymorph and space group numbers included.

The obtained distributions of DFT-derived formation energies within the built datasets are shown in Fig. 4. It can be noted that the introduced PLS and PHS datasets remained similar ion terms of the target properties, which may be associated with the similar sets of chemical compositions of the structures included. Strictly speaking, this observation has particular importance for additional checking in any other studies, since the obtained distribution of the target properties can directly impact GNN performance by themselves. It is also important to emphasize that since the PHS structures were included in our sets without randomness in their selection, while the PLS ones were chosen quasi-randomly, one can conclude that the high-symmetry structures are a reliable choice for assessing thermodynamic properties in complete CCS. In other words, one would expect that structures selected from a substantially larger set would obtain broader property distributions than that in the very limited set of high symmetry variants. However, this is not observed in any of the sets considered.

Fig. 4
figure 4

Distributions of DFT-derived formation energies for PHS and PLS structures within the introduced training/validation (left) and test (right) datasets.

In turn, for the purpose of this study, this fact allows one to conclude, that comparison of the models trained on different subsets is not influenced by any differences in target energy distributions. Thus, we were able to study symmetries impact independently on any other data peculiarities – chemical compositions and target energy ranges. Additionally, it is worth noting that formation energy distributions of training and validation data cover the entire range for that of the test dataset. Thus, we consider an interpolation task in terms of mode of GNN-based predictions.

Test scores of model predictions

For the models using random initialization of trainable parameters, the test scores of model predictions turned out to be almost identical, regardless of whether pretraining step was carried out. However, greater stability of the pretrained model could be expected, which means that model predictions were less dependent on some random factors. Thus, the pretrained Allegro model was fine-tuned on the introduced PLS and PHS datasets, 45 introduced mixtures of high- and low-symmetry structures and 4 additional PLS-like datasets resulting in a total of 51 fine-tuned model states.

All models were run on two test datasets. Then, RMSEs of predicted formation energies compared to those calculated using VASP were obtained for each model and for each test dataset. In turn, the RMSE values of the models, trained on the mixed train/validation datasets, were averaged over 5 values for each mixing ratio, and corresponding standard deviations were additionally calculated. The results obtained are given in Fig. 5 and Table 2.

Fig. 5
figure 5

Test RMSEs in formation energy assessments obtained for the PLS (left) and PHS (right) test structures using the models trained on various proportions of the PLS and PHS training/validation structures. Black lines denote RMSE values averaged over 5 independent models for each mixing ratio, blue translucent areas correspond to one standard deviation.

Table 2 Test RMSEs in formation energy assessments obtained for the PLS and PHS test structures using the models trained on various proportions of the PLS and PHS training/validation structures. For each fraction of the PHS entries in the training/validation data less than 1, there are 5 independent models considered for random sampling from the PLS subset.

From Fig. 5 and Table 2, it can be seen that purposeful sampling of high symmetry structures for training GNNs increase resulting errors of assessments and thus reduces quality of model predictions. Also, it should be noted, that the models, trained on the datasets, containing more than 50% high symmetry structures, demonstrate lower formation energy RMSE on PHS test dataset. On the contrary, the other models provide more precise results on the PLS test.

As is known, the larger the size of the training sample and the higher its diversity in terms of feature space, the higher the generalizing ability and accuracy of the machine learning model. The higher performance of models trained on low symmetry train/validation dataset can be explained by the fact that a low symmetry structure has more variety of unique environment of each atom. In other words, there are more quantity of unique bi-, tri- and so on interatomic interactions in the case of a crystal structure with low symmetry. As the space group number increases, these unique local interactions degenerate due to the appearance of new symmetry elements and a decrease in the number of unique subgraphs in the crystal structure. This observation in itself clearly shows the more quantity of training samples and their diversity are available, the better generalization ability and accuracy of a model can be obtained. Thus, in addition to defect content and chemical composition, the space group of structures in training data can be used as an additional criterion for sampling of training datasets. Moreover, such metainformation obtained for the structures before structural relaxation is physically based and can be collected without expensive computations since it does not require a DFT-based evaluation of the CCS entries.

Inference results over the entire CCS

To study impact of symmetry distributions in train/validation data on the model predictions over the built CCSs for both phases, we used Allegro models in inference mode. Keeping in mind monotonic increase of the test RMSEs (see Fig. 5) by increase of the fraction of high-symmetry structures (PHS fraction) in training data, we considered limiting cases of the Allegro models fine-tuned on the PLS and PHS datasets. Distributions of the predicted formation energy over the complete CCS are shown in Fig. 6.

Fig. 6
figure 6

Violin plots of GNN-based assessments of formation energies over the complete CCSs and accounting for weights of their entries as predicted by the models trained and validated on the PLS and PHS datasets.

By comparison of the model predictions obtained for Zn- and Cd-substituted structures, one can conclude that the obtained distributions are similar for the models fine-tuned on the PLS dataset from both shape and ranges perspectives. Secondly, the predictions of the models fine-tuned on the PHS dataset have sharper distributions despite the fact that for both substituents the predicted ranges of formation energies remain similar. In the case of Cd-substituted systems, the models fine-tuned on the PHS dataset result in a number of peaks in the obtained distributions, i.e. predominant levels in formation energy predictions, which might be caused by the limited distinguishability of the composition-structure-property relationships. Thirdly, usage of high-symmetry structures in the training dataset does not affect δ-phase predictions. At the same time the model fine-tuned on the PHS dataset predict higher formation energies for the Cd-substituted structures of the γ-phase.

The differences between the models’ inference discussed above directly affects the calculated parameters, such as the energy difference between the chemically modified phases, that can be important from a practical point of view. For the models trained on the PHS and PLS structures, Fig. 7 shows relative behavior of energy difference between the γ and δ phases via substituent contents. As can be seen, the PHS group tends to underestimate the energy difference between phases for both metals substituting Pb. Moreover, the difference between the PHS and PLS models is more related to metallic substituent content rather than the Br/I ratios for the compositions studied.

Fig. 7
figure 7

For the models trained and validated on the PLS and PHS datasets, the formation energy difference between the Cd- (left) and Zn-doped (right) black (γ) and yellow (δ) phases obtained using minimum energy configurations over the complete CCSs.

Conclusions

In the present work, we comprehensively studied the impact of space symmetries of training structures on the resulting performance of the GNN-based surrogate models for structure-to-property predictions. To do so, it was suggested to build complete composition/configuration spaces of two competing CsPbI3 phases with B-site substitutions by Cd or Pb and I substitutions by Br. During data selection for training and validation of a set of independents realizations of the Allegro model, we tried to avoid any composition differences between the training datasets and made comparison of the models trained on exactly the same amounts of DFT-derived datapoints (structures).

It was shown that in the case of similar distributions of target properties and compositions, a higher fraction of high-symmetry structures resulted in a lower quality of model predictions. As the resulting statistics of the built CCS entries shows, selection of predominantly low-symmetry structures accompanied with composition constrains can be carried out in a purely random manner. Nevertheless, possessing the same composition and similar target property distributions predominantly low symmetry structures can provide better performance compared to that of high-symmetry structures in similar applications. Keeping in mind nearly the same composition and target property distributions, this observation might be caused by more diverse subgraphs of the structure graphs in the case of low symmetries. As shown, accounting for different symmetries on the training data can also significantly affect the inference mode of the trained GNN models.

Obviously, the results of the approach developed in this study and applied to the particular family of functional materials can hardly be extrapolated to any other group of crystal structures without appropriate additional check of the conclusions made. However, the developed workflow clearly shows the following points:

  1. (1)

    entries of vast CCSs can be severely unbalanced with respect to the space symmetries reduced compared to the original structure due to the introduced point defects and determined prior to their DFT evaluation,

  2. (2)

    despite the significantly lower numbers of the high-symmetry structures in the built CCS their thermodynamic properties ranges remain comparable with those of random subsets of much more numerous low-symmetry entries of the built CCS in all the cases considered,

  3. (3)

    when such parameters as compositions and target property ranges coincide, symmetry of crystal structures used for training models can significantly affect their resulting performance and at least can be considered as an additional easily achievable criterion for constructing training datasets in a stratified manner.

As any other deep learning models (artificial neural networks), GNNs are sensitive to quality and quantity of data available for training. By design, GNNs implicitly generate inner features of input molecular or crystal graphs and aggregates them by a number of trainable layers to produce a final prediction. This, in turn, hinders interpretation of GNN prediction pipeline, getting scientific insights on certain data importance, and searching for enhanced data selection routines73,74. Keep in mind the challenge of rationalized selection of training samples from the entire CCS to reaching a trade-off between the size of the training dataset and the quality of the model predictions, the insights of this study may additionally support development of tools efficient in terms of data amounts required for training and validation purposes.