Introduction

Aromatic compounds especially BTX (i.e., benzene, toluene, and xylenes) and ethyl benzene are valuable raw materials in the petrochemical industries1. In addition, the reduction of sulfur, nitrogen, and aromatic compounds (especially Benzene) in the middle distillate products has considerable importance due to their environmental impacts2,3. Therefore, great attention has been focused on the separation of aromatic compounds from naphtha-cracking streams. The separation and purification of aromatic compounds are usually challenging due to the close boiling points of different hydrocarbons and the different combinations of azeotropes that may be formed4,5. Based on aromatic concentration, three separation techniques are categorized commercially: (I) liquid–liquid extraction is predominantly employed typically for low aromatic concentrations (in the range of 20 to 65 wt%); (II) extractive distillation is commonly chosen for moderate aromatic concentrations (in the range of 65 to 90 wt%), and (III) the azeotropic distillation is used for extremely high aromatic concentrations (> 90 wt%)6,7.

The liquid extraction process is regarded as the most favorable approach in the aromatic separation process for mixtures containing lower than 20 wt% aromatic contents. The notable advantages of this method include low energy consumption, preservation of both physical properties and chemical structures, ease of application, and mild operating conditions. In the extraction process, the selection of an appropriate solvent is important. In other words, an ideal solvent should be environmentally friendly, easily regenerated, sufficiently available at low cost, have a high level of aromatic compound selectivity, have a significant distribution ratio, and maintain a minimum solvent-to-feed ratio2,8. In addition, the physical and thermodynamic properties of the solvent such as density, surface tension, viscosity, and thermal stability should be applicable in the industry. Commonly, organic solvents such as furfuryl alcohol, N-formyl morpholine (NFM), sulfolane, ethylene glycols, dimethyl sulfoxide (DMSO), and N-methyl pyrrolidone (NMP) are employed in the aromatic/aliphatic separation process. Due to the high levels of toxicity, volatility, flammability, and regeneration costs of these organic solvents, alternative solvents are being investigated3,8.

Nowadays, ionic liquids (I.Ls) have been gradually noticed as an attractive and promising alternative to traditional organic solvents for the extraction of aromatic and aliphatic hydrocarbons. In general, I.Ls exhibit low vapor pressure, notable thermal and chemical stability, effective solubility for both organic and inorganic compounds, and low environmental impacts. Additionally, I.Ls are often recognized as “designer solvents” due to their potential for diverse combinations of anions and cations, resulting in numerous I.Ls with distinct specifications and a wide range of applications6,9. In the petroleum refining industries, numerous separation processes using I.Ls have been investigated. These processes include desulfurization, denitrogenation10, and dearomatization of fuels11.

Investigation of liquid–liquid equilibrium (LLE) data is needed to better understand the role of I.Ls in the liquid–liquid extraction process12. As a result, the number of experimental studies on the LLE of ternary systems containing I.L., aromatic hydrocarbon, and aliphatic hydrocarbon has increased over the recent decades. The literature predominantly discusses imidazolium-based I.Ls as the preferred solvents for extracting aromatic hydrocarbons from the mixture of aromatic/aliphatic components13,14. Furthermore, other categories of I.Ls based on pyridinium, ammonium, etc. have been investigated in some studies15,16.

The vast number of potential I.Ls that can be synthesized makes it impossible to select the optimal I.L. for a particular task or any separation/extraction process only using experimental measurements. Therefore, the development of predictive models can be significantly time-saving and cost-effective. In the literature, various models such as COSMO-RS17 and UNIFAC18 have been used to calculate LLE data, but limited ternary systems were studied in the development of each model. Therefore, the variety of anions, cations, and aliphatic hydrocarbons in the studied systems was very limited.

The QSPR method is a commonly used technique that utilizes molecular descriptors to establish quantitative relationships between a chemical structure and its corresponding property19,20,21. Therefore, it is possible to design and/or develop new solvents or chemical structures using the QSPR method for different applications. This modeling method not only offers quantitative information about the desired property but also provides valuable qualitative insights via descriptor interpretation. The QSPR method has been extensively applied in various studies related to ILs, demonstrating its effectiveness in predicting their thermodynamic and physicochemical properties22,23,24,25.

In our recent researches, we have utilized the QSPR approach to develop descriptive and predictive models for estimating the distribution of thiophene26 , a sulfur-containing compound, and pyridine27, a nitrogen-containing compound, between the hydrocarbon-rich and I.L.-rich phases. This study focuses on the separation of benzene from fuels. Benzene is a toxic aromatic cyclic hydrocarbon and exposure to it beyond the standard limit can cause serious problems for human health such as cancer. On the other hand, benzene is used as a raw material for the production of resins, drugs, dyes, and many other chemical products. Therefore, the separation of benzene from the fuels has received much attention in different studies. After reviewing the literature, it was found that despite the availability of LLE data for a considerable number of ternary systems including I.L., benzene (as an aromatic hydrocarbon), and an aliphatic hydrocarbon, there is no QSPR model to predict the distribution of benzene between the aliphatic hydrocarbon-rich and I.L.-rich phases. In order to fill this gap, a dataset containing LLE data for 112 ternary systems was collected. This dataset covers 17 anions, 20 cations, and 12 aliphatic hydrocarbons. By the combination of the involved anion, cation, and aliphatic hydrocarbon structures in the present dataset, 4080 possible ternary systems (i.e., 20 × 17 × 12 = 4080 systems) can be formed. It should be noted that there is only experimental data for 3% (112 ternary systems) of the 4080 systems, so far. As a result, developing a QSPR model can estimate the benzene distribution between the aliphatic hydrocarbon-rich and I.L.-rich phases of the remaining non-studied systems. Additionally, the impact of the variation of anion, cation, and aliphatic hydrocarbon structure on benzene separation has been investigated.

QSPR method

Dataset

Collecting an adequate and reliable dataset is the first step in the QSPR method. A dataset including 112 ternary systems was collected after a comprehensive review of the literature3,9,11,12,13,14,15,16,17,18,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61. The collected dataset includes 17 anions, 20 cations, and 12 aliphatic hydrocarbons. More details regarding the ternary systems for mentioned dataset are presented in Table 1. All LLE data points were measured at 298.15 K and 1 atm.

Table 1 The employed dataset in the present study.

The dataset included various aliphatic hydrocarbons such as straight-chain hydrocarbon32, cyclic hydrocarbon61, and branched hydrocarbon35. Also, imidazolium-, pyrrolidinium-, ammonium-, and pyridinium-based I.Ls were present in the dataset.

In order to estimate the benzene distribution between the aliphatic hydrocarbon-rich and I.L.-rich phases, it was tried to investigate the relationship between the benzene mole fraction in the aliphatic hydrocarbon-rich phase (X2) and the benzene mole fraction in the I.L.-rich phase (Y2) for all ternary systems included in the main dataset and predict Y2 in terms of X2 using a linear model.

Due to the variation of different structures (i.e., anion, cation, and aliphatic hydrocarbon) in each ternary system, it is necessary to apply the effect of the structure of each group (i.e., anions, cations, or aliphatic hydrocarbons) using molecular descriptors in the model. Therefore, the descriptors of each group were considered to develop the model.

Two subsets (i.e., train and test subsets) were used to validate the constructed QSPR model. Train subset was used for the development and internal validation of the QSPR model. Test subset was used for external validation of the constructed model.

Basic theory

All 112 ternary systems of the dataset including I.L. (1), benzene (2), and aliphatic hydrocarbon (3) were investigated. A relationship between the values of Y2 and X2 was identified in every ternary system. Consequently, Eq. (1) can be developed to predict Y2 values at any desired X2 values:

$${\text{Y}}_{{2}} {\text{ = f (X}}_{{2}} {)}$$
(1)

where f can be a linear or non-linear function of X2. According to Table 1, the structures of anions, cations, and aliphatic hydrocarbons are different in each of the ternary systems. In Eq. (1), the only independent variable in the model is X2, so the prediction of Y2 values is not possible with this model as it does not consider the impact of anion, cation, and aliphatic hydrocarbon structures. As a result, anion, cation, and aliphatic hydrocarbon descriptors should be considered as independent variables along with X2 (see Eq. (2)).

$${\text{Y}}_{{2}} {\text{ = f (X}}_{{2}} {\text{) + g (anion, cation, and aliphatic hydrocarbon descriptors)}}$$
(2)

Molecular descriptors calculation

Structural features of various compounds are encrypted in molecular descriptors such as constitutional, geometrical, topological, etc. In order to calculate different types of descriptors, it was necessary to use the optimal structure for each compound in the dataset. Therefore, the 3D structure of all anions, cations, and aliphatic hydrocarbons was drawn using Chembio3D Ultra software. Afterward, the structure optimization was carried out based on the Density Functional Theory (DFT). In this regard, the 6-31++G(d,p) basis set and B3LYP level were employed. In the next step, Dragon software was employed to calculate the molecular descriptors for the optimized structures. Descriptors with constant or almost constant for each group were eliminated. Finally, 1247 anion descriptors, 1282 cation descriptors, and 1079 aliphatic hydrocarbon descriptors were calculated.

Model construction

Descriptors selection

In order to identify the most important and effective descriptors among a large number of descriptors (nearly 3608 descriptors), variable selection is important. The Genetic Algorithm (GA) method was employed to select appropriate variables as one of the most effective and efficient variable selection methods. GA is known for its capability to explore the large search space efficiently by mimicking the process of natural selection62. It operates by generating a population of potential solutions (in this case, the subsets of descriptors) and iteratively improving them through operations such as selection, crossover, and mutation. The fitness function used in GA evaluates the predictive accuracy and complexity of the developed model to ensure that the selected descriptors not only improve the prediction performance but also prevent overfitting63. The fitness function employed in this study was based on the Leave-One-Out Cross-Validation (LOO-CV) coefficient of determination (R2). By optimizing the fitness function through LOO-CV, the best subset of descriptors was identified in such a way to improve the predictive performance of the developed model while maintaining the model simplicity.

The performance of the GA in selecting appropriate descriptors is highly dependent on some key parameters such as population size, mutation rate, and the number of iterations64. Population size controls the number of candidate solutions (i.e., subsets of descriptors) evaluated in each iteration. A carefully chosen population size of 500 allows for sufficient exploration of the descriptor space while maintaining the computational efficiency. Mutation rate defines the likelihood of the random changes in the solutions, which helps maintain diversity in the population and avoids early convergence to suboptimal solutions63,65. In the present study, a mutation rate of 20 was selected to balance the exploration and exploitation of the search space. Iterations refer to the number of generations through which the population evolves. In this study, 10,000 iterations were used ensuring that the GA converged towards an optimal subset of descriptors without overfitting risk.

Linear model

For discovering the correlation between the independent variables (i.e., GA-selected descriptors and X2) and the Y2, Multiple Linear Regression (MLR) was employed to construct a QSPR model. The GA-MLR model was developed using the QSARINS software66,67,68.

Non-linear models

Machine learning has received much attention in recent years as a powerful tool in the development of predictive models. Machine learning methods can develop non-linear models, which is more accurate compared to the linear models. In this study, Genetic Programming (GP) and Artificial Neural Network (ANN), as two common methods in the field of machine learning approaches, were used to develop non-linear models.

In order to develop non-linear GP based model which is biological science inspired, GPTIPS code was employed. The input data of GPTIPS was X2 and the final selected descriptors. In this method, the model was developed using selected mathematical operators (e.g. addition, subtraction, multiplication, and division) and adjusting the parameters of the code. After several iterations of modeling and comparing the statistical parameters of the developed models, the best model has been selected. More details on the model development procedure using GP can be found elsewhere27,69.

Multi-Layer Perceptron (MLP), which is one of the most widely used artificial neural networks, was chosen to develop the ANN model. MLP consists of an input layer, an output layer, and one or more hidden layers, each layer containing a number of neurons. In this study, X2 and the final selected descriptors were the input of the network and Y2 was the output of the network. More details on the model development procedure using MLP can be found elsewhere27,69.

Statistical parameters and models validation

Table 2 provides statistical parameters that can be utilized to assess the predictive capability of the QSPR model that has been developed. Where n is the number of data in the train, test, or main set, p is the number of independent variables, \({\text{Y}}_{{{2}_{{\text{i}}} }}^{{{\text{Exp}}}}\). is the experimental Y2 value, \({\text{Y}}_{{{2}_{{\text{i}}} }}^{{{\text{Pre}}}}\). is the predicted Y2, and \(\overline{{\text{Y}}}_{{2}}\) is the average of the experimental Y2 values.

Table 2 Statistical parameters employed in the present study.

Q2LOO-CV, which is defined as the LOO-CV coefficient of detertion, was initially used for internal validation of the constructed model69. Additionally, Leave-Many-Out Cross Validation (LMO-CV) was performed, where 25% of the data was left out in each iteration. The Q2LMO-CV was also calculated, further supporting the robustness and predictive power of the model. The calculation of Q2 for LMO-CV follows a similar procedure as that for LOO-CV (Eq. (7)). More detailed information about cross-validation techniques can be found elsewhere70,71. Additionally, external validation is commonly employed in the model validation process considering train and test sets.

Results

Linear model construction

In our recent research, we examined 84 ternary systems comprising of I.L., thiophene, and hydrocarbon solvent. The results revealed a correlation between the mole fraction of thiophene in the hydrocarbon-rich phase and the mole fraction of thiophene in the I.L.-rich phase. In another investigation, we examined 51 ternary systems comprising of I.L., pyridine, and hydrocarbon solvent. The results revealed a correlation between the mole fraction of pyridine in the hydrocarbon-rich phase and the mole fraction of pyridine in the I.L.-rich phase. Thiophene is an aromatic heterocycle compound containing a sulfur atom and pyridine is an aromatic heterocycle compound containing a nitrogen atom.

In this study, by reviewing the literature, 112 ternary systems containing I.L., aromatic hydrocarbon (benzene in the current study), and aliphatic hydrocarbon were gathered. In total, 17 anions, 20 cations, and 12 aliphatic hydrocarbons were included in the dataset. For each ternary system, the relation between the benzene mole fraction in the I.L.-rich phase (Y2) and the benzene mole fraction in the aliphatic hydrocarbon-rich phase (X2) was calculated. It should be noted that several polynomial forms were examined in the fitting process and it was found that the linear relation between Y2 and X2 is more accurate in spite of its simplicity. The results of this analysis are tabulated in Table 3. As observed in Table 3, the R2 for all correlations are greater than 0.85. Consequently, a strong correlation between X2 and Y2 was confirmed.

Table 3 The developed models using only X2 variable (i.e., \({\text{Y}}_{2}\text{ = a }{\text{X}}_{2}\text{ } + \, {\text{b}}\)) for each of the ternary systems.

In order to provide a model that predicts the benzene distribution between the aliphatic hydrocarbon-rich and I.L.-rich phases, Eq. (12) was developed using all the data points in the dataset (see Table 4). In other words, a linear model was developed using X2 to predict Y2 values. Y2 is the target variable and X2 is the independent variable. As Table 3 shows, in each ternary system, anion, cation, and aliphatic hydrocarbon are different. Therefore, it was necessary to add appropriate molecular descriptors to the model from each of the three mentioned groups. In this regard, Eq. (13) was constructed using one anion descriptor and X2, Eq. (14) was constructed using one cation descriptor and X2, and Eq. (15) was developed using one aliphatic hydrocarbon descriptor and X2. The statistical parameters of the developed models are presented in Table 4. As the statistical parameters of the developed models show, R2 value for the developed model using only X2 was 0.698. This value is equal to 0.861, 0.765, and 0.710 for the models that include X2 and one anion descriptor, one cation descriptor, or one aliphatic hydrocarbon descriptor, respectively. These results show that the structure of anion and cation has a considerable impact on determining the Y2 value; In contrast, the addition of the aliphatic hydrocarbon descriptor to the model had a negligible effect on increasing the accuracy of the constructed model. For further investigation, Eq. (16) was developed using X2, an anion descriptor, and an aliphatic hydrocarbon descriptor. As the comparison of the statistical parameters of Eqs. (13) and (16) shows, there is almost no progress in the model estimation capability. Also, Eq. (17) was developed using X2, a cation descriptor, and an aliphatic hydrocarbon descriptor. Similarly, the comparison of Eqs. (17) and (14) shows the insignificant effect of the addition of aliphatic hydrocarbon descriptor on the developed model. The reason can be due to the similarity of the structure of hydrocarbons in the present dataset. Therefore, the model was developed using only X2, an anion descriptor, and a cation descriptor (Eq. 18).

Table 4 The constructed models taking into account all ternary systems in the main dataset and their statistical parameters.

Plot (a) in Fig. 1 shows the Y2 values predicted using Eq. (12) versus Y2Exp and plot (b) shows the Y2 values predicted using Eq. (18) versus Y2Exp. The results clearly show that considering the anion and cation descriptors in the developed model increases the estimation capability of the constructed model.

Fig. 1
figure 1

The estimated Y2 values versus experimental Y2 values in the main set for the developed model using (a) only X2 variable (i.e., Eq. (12)), (b) X2 and GA-chosen descriptors regardless turning point (i.e., Eq. (18)), and (c) X2 and GA-chosen descriptors considering turning point (i.e., Eq. (21) or (22)).

In order to investigate the results of modeling using polynomials containing higher degrees of X2, the second to fifth power of X2 was also considered as independent variables along with X2 and anion, cation and aliphatic hydrocarbon descriptors. The mentioned variables were given as input to QSARINS software. The outcome of this analysis confirmed that the inclusion of higher powers of X2 do not improve the accuracy of the developed model. Therefore, the linear model seems to be the most appropriate model.

As plot (b) in Fig. 1 shows, the predicted Y2 values are negative for some data points. By considering the X2 = 0.15 as a turning point and constructing the model separately for X2 range larger than 0.15 and lower than 0.15, this problem was solved (Please check plot (c) in Fig. 1). In addition, the comparison of the X2 coefficient for the models developed using only the X2 variable shows that in the range of 0–0.15, the Y2 values have a stronger dependence on the X2 value (see Eqs. (19) and (20)) in Table 4). Therefore, developing two linear models with the same variables for the range of 0 to 0.15 and 0.15 to 1 can upgrade the estimation capability of the model. In this regard, the model was initially created for the X2 range of 0.15 to 1 by selecting suitable descriptors for both anions and cations. Subsequently, the determined variables were utilized to develop the model for the range of 0 to 0.15.

In the construction of QSPR models, identifying the appropriate number of descriptors used in the model is important. The breaking point plot is used as a valuable tool to determine the optimal number of descriptors. In this regard, models were developed using X2 and anion or cation descriptors for the range of 0.15–1. Also, R2 and Fisher function (F) values were determined for each constructed model. As a result, the effect of increasing the number of descriptors of each group on the estimation capability of the constructed model was determined. Considering the models developed using only X2 and anion descriptors (see Fig. 2), only one anion descriptor was needed to develop the model. Moreover, an increment in the number of anion descriptors not only had a negligible effect on the accuracy of the model but also caused a considerable decrease in the F-value. In addition, the breaking point plot confirmed that increasing the number of cation descriptors does not result in a significant improvement in the accuracy of the model. Therefore, due to the effectiveness of the presence of cation descriptors in determining Y2 values, using one cation descriptor in the model is sufficient. Finally, three variables (i.e., X2, one anion descriptor, and one cation descriptor) were selected to develop the final model. Taking into account the X2 = 0.15 as the turning point, the model was constructed for the range of 0.15 to 1 using three mentioned variables (Eq. (21)). Afterward, considering the points (0,0) and (0.15, Y2 predicted by Eq. (21), the model for the range 0 to 0.15 was also developed (Eq. (22)). As a result, the final model consists of two linear models developed for 0 to 0.15 and 0.15 to 1 ranges.

Fig. 2
figure 2

Effect of number of variables on the (a) R2 and (b) F-values (breaking point plot).

The plot (c) in Fig. 1 shows the Y2 values predicted using Eqs. (21) and (22) versus Y2Exp. A comparison between plots (b) and (c) in Fig. 1 reveals that the constructed model considering the turning point is not only more accurate compared to the model without the turning point, but also there is no negative predicted Y2 value in the model.

Evaluation and validation of the constructed linear QSPR model

In addition to internal validation, external validation is also very important for the evaluation of the constructed QSPR models. In the external validation, most of the data are classified into train set to be used for the model development. The rest of the data are classified in the test set to investigate the validation of the constructed model. In this regard, 26% of the total data (i.e., 30 ternary systems which are identified in Table 3) including [BF4] anion, [DCA] anion, [TfO] anion, [BMpyr] cation, [C10MIM] cation, [N4111] cation, n-decane and n-dodecane were involved in the test set. The external validation of the model is more reliable due to the inclusion of three anions, two cations, and two aliphatic hydrocarbons exclusively in the test set. In other words, the test set contains four ternary systems in which all anion, cation, and aliphatic hydrocarbon structures are different considering the involved structures in the train set (i.e., 61st, 62nd, 66th, and 67th systems).

To highlight the benefits of considering the X2 = 0.15 as a turning point, Eq. (23) was constructed utilizing the train set for X2 values higher than 0.15, at first (see Table 5). The linear model for the X2 range lower than 0.15 (Eq. (24)) was developed considering two points (i.e., (0,0) and (0.15, Y2 predicted by Eq. (23)). The Y2 values predicted using Eq. (23) or (24) for all data points can be found in the supporting information Excel file.

Table 5 Final developed QSPR models.

Figure 3 shows William’s plot, residual plot, and Y2 values predicted using Eq. (23) or (24) versus experimental Y2 values for test and train sets. According to the William’s plot, no outlier is observed in the employed dataset. The outcome of statistical evaluation of the final QSPR model after train and test classification, are reported in Table 6.

Table 6 Statistical evaluation of the final constructed QSPR models.
Fig. 3
figure 3

(a) Estimated Y2 values versus experimental Y2 values, (b) residuals versus experimental Y2, and (c) standard residuals versus leverage (c) using final QSPR models (i.e., Eq. (23) or (24)) for train and test sets.

Non-linear model construction

As mentioned, GP and ANN methods were chosen to develop non-linear models. First, Eq. (25) was developed using the four mathematical operators of addition, subtraction, multiplication and division.

$$\begin{aligned} {\text{Y}}_{2} ~ & = ~0.01807 \times {\text{X}}_{2} - \left( { - ~{\text{X}}_{2} \times {\text{BELe3}} \times {\text{HTm}}^{2} + ~{\text{X}}_{2} \times {\text{BELe3}} + 4.732} \right) \\ & \;\;\; \times \left( {{\text{HTm}} - 2 \times {\text{X}}_{2} \times {\text{HTm}} + 6.718} \right) \times 3.819 \times 10^{{ - 5}} - 0.01807 \times ~{\text{X}}_{2} \\ & \;\;\; \times {\text{HTm}} - 0.01585 \times ~{\text{X}}_{2} ~ \times {\text{BELe3}}^{3} + ~{\text{X}}_{2} ~ \times ~\left( {{\text{BELe3}}^{2} + 3 \times {\text{BELe3}} - 2 \times ~{\text{X}}_{2} } \right) \\ & \;\;\; \times ~0.1119 + 0.009033 \times {\text{BELe3}}^{{\text{2}}} + 0.02907 \times {\text{X}}_{2} ~\left( {{\text{BELe3}} + {\text{HTm}}~ - {\text{X}}_{2} \times {\text{BELe3}} + 1.519} \right) \\ & \;\;\; + 0.009033 \times {\text{X}}_{2} ^{2} \times {\text{BELe3}} \times {\text{HTm}} + ~{\text{X}}_{2} ^{2} \times {\text{BELe3}} \times {\text{HTm}}~\left( {2 \times {\text{X}}_{2} - {\text{BELe3}}} \right) \times 0.01585 - 0.02417 \\ \end{aligned}$$
(25)

Afterward, the MLP network was employed to develop a non-linear ANN model. The network inputs were X2 along with an anion descriptor and a cation descriptor selected by GA (see Fig. 4). For the hidden layers, the hyperbolic tangent sigmoid non-linear activation function was selected and for the output layer, the linear activation function was selected. Since the Scaled Conjugate Gradient learning algorithm resulted in the minimum MSE values, it was used to train the network. Finally, an MLP network containing one hidden layer with 5 neurons was selected as the most appropriate MLP network. The predicted Y2 values using the GP and MLP models as well as the weights and biases of the MLP network can be found in the supporting information Excel file.

Fig. 4
figure 4

The MLP network optimum structure.

Evaluation and validation of the constructed non-linear QSPR models

The train and test sets for the non-linear models were considered the same as the linear model. The values of the predicted Y2 versus experimental Y2 and also the residual values versus experimental Y2 for GP and ANN models are presented in Fig. 5. The statistical parameters of the final linear and non-linear constructed QSPR models for train and test sets are presented in Table 7.

Table 7 Statistical evaluation of the final linear and non-linear constructed QSPR models.
Fig. 5
figure 5

(a) Estimated Y2 values versus experimental Y2 values using GP-based model, (b) residuals versus experimental Y2 using GP-based model, (c) estimated Y2 values versus experimental Y2 values using MLP-based model, and (d) residuals versus experimental Y2 using MLP-based model (Train and test sets are considered).

In this study, the accuracy of the QSPR models was evaluated by computing and presenting the statistical parameters of the developed models which are tabulated in Table 8. A comparison between the statistical parameters of Eq. (12) and Eq. (18) reveals that considering the anion and cation descriptors in the model development improves the model accuracy significantly. Additionally, a comparison between the statistical parameters of Eq. (18) with Eqs. (21) and (22) confirms that considering the turning point in the model development increases the estimation capability of the model. On the other hand, the comparison of statistical parameters of non-linear models with the final linear model (i.e., Eq. (23) or (24)) shows that although the prediction accuracy of the model has increased slightly, this increase is not significant. One reason can be the variety of the anions and cations in the dataset. The more important reason that seems to be the almost linear subordination of Y2 to X2 and the selected descriptors. Therefore, compared to non-linear models, the developed linear model has a favorable prediction accuracy along with its simplicity. Therefore, the linear model was considered a more favorable model due to its less complexity and applicability in the interpretation of descriptors.

Table 8 Comparison of the statistical parameters of the developed QSPR models for the main set.

Discussion

In the development of the QSPR model, it was found that the effect of the aliphatic hydrocarbon structure on the estimation of Y2 value is insignificant. Therefore, the aliphatic hydrocarbon descriptor was not considered in the developed final model. The name and structure of all aliphatic hydrocarbons are presented in Table 9.

Table 9 The name, structure, and chemical formula for all aliphatic hydrocarbons involved in the dataset.

After developing the QSPR model, the analysis of the chosen descriptors within the model helps to find their possible importance in estimating the target variable (Y2). In this regard, anion and cation descriptors that appeared in the model have been introduced and investigated.

Interpretation of the anion descriptor

HTm descriptor was GA-selected anion descriptor in the final QSPR model. HTm is a GETAWAY (GEometry, Topology, and Atom-Weights AssemblY) descriptor. GETAWAY descriptors attempt to match the 3D molecular geometry provided by the molecular influence (or leverage) matrix and atom connectivity using chemical information, topology, and various atomic weighting schemes72. Considering the matrix type, GETAWAY descriptor can be classified into two categories: influence/distance matrix R (R-GETAWAY) and molecular influence matrix H (H-GETAWAY). The matrix type for calculating the HTm descriptor is the influence matrix H and it is weighted by atomic masses. As a symmetric matrix, the molecular influence matrix (H) is defined by Eq. (26)73:

$${\text{H}}~ = ~{\text{M}}~ \times ~({\text{M}}^{{\text{T}}} ~ \times ~{\text{M}})^{{ - 1}} ~ \times ~{\text{M}}^{{\text{T}}}$$
(26)

H is a symmetric A × A matrix, with the M matrix comprising three columns representing the central Cartesian atomic coordinates (x, y, z), and the A row representing the atoms of a molecule. The calculations are performed on the H-filled molecular graph. The superscript T is the transposed of the M matrix. Diagonal elements in the H matrix (i.e., hii) are called leverage and are determined as follows:

$${0 } \le {\text{ h}}_{{{\text{ii}}}} { } \le { 1,}\;\;{ } - {1 } \le {\text{ h}}_{{{\text{ij}}}} { } \le { 1 }$$
(27)
$$\mathop \sum \limits_{{\text{i = 1}}}^{{\text{A}}} {\text{h}}_{{{\text{ii}}}} {\text{ = D}}$$
(28)
$$\mathop \sum \limits_{{\text{j = 1}}}^{{\text{A}}} {\text{h}}_{{{\text{ij}}}} { = 0}$$
(29)
$$\overline{{\text{h }}} { = }\frac{{\text{D}}}{{\text{A}}}$$
(30)

where h is the element of the H matrix, the average value of the diagonal terms in the H matrix is represented by \(\overline{{\text{h}}}\), while the rank of the M matrix is denoted by D. For linear molecules, D is equal to 1, for planar molecules it is 2, and for 3D-molecules it is 3. It is important to note that the molecular influence matrix remains unchanged regardless of rotation.

Leverages (hii) represent the contribution or influence of each atom in determining the shape of the molecule. Mantle atoms in the structure of molecules always have higher leverage values compared to the atoms near the center of the molecule. In addition, the size and shape of the molecule are effective in determining the maximum leverage values. In spherical molecular structures, the leverage values are lower compared to the linear molecular structures. The maximum leverage decreases as the number of atoms (i.e., the molecular size) increases in the molecules that have almost the same shape. Consequently, the HTm descriptor is particularly useful for studying the molecular systems where atom connectivity and the spatial distribution of atomic masses have a significant effect on the molecular behavior, such as in the extraction processes. This molecular descriptor is capable to reflect the geometrical positioning of atoms, along with their influence on the molecular interactions, provides a more accurate representation of the overall molecular structure.

The off-diagonal elements (hij) reflect the ability of atom j to interact with atom i. The positive values of hij indicate that atoms i and j are located in the same molecular region and the probability of atomic interactions between them is higher. On the other hand, the negative values of hij show that the mentioned atoms are located in the molecular region opposite to the center of the molecule and the probability of atomic interactions between them is low.

The coefficient of HTm is positive in Eq. (23) or (24) which shows that increasing the HTm values of the anion structures improves the benzene extraction. The HTm descriptor values are presented in Table 10. The bond length and the number of bonds are effective in determining the leverage value. Therefore, between the atoms close to the surface of the anion, the larger atoms (which are chlorine, sulfur, fluorine, oxygen, and nitrogen atoms, respectively), have larger leverage values. Therefore, the presence of electronegative atoms near the anion surface can increase the interaction between the hydrogen atom in benzene and the electronegative atom in anion. Additionally, the size and shape of anion structure is effective in determining HTm descriptor values. Therefore, the HTm descriptor’s dependence on the atomic masses and the molecular topology makes it a key factor in evaluating the intermolecular interactions, especially in systems where the electron density and spatial arrangement play pivotal roles. In the benzene extraction process, these interactions are crucial because they directly affect the efficiency of the phase transfer and separation processes. Therefore, the HTm descriptor, through its leverage of the atomic masses and 3D structural information, offers deeper insight into the anion’s capacity to interact with benzene molecules, especially in I.L. systems where the complex molecular behavior is involved.

Table 10 Details of anions involved in the current dataset.

Interpretation of the cation descriptor

BELe3 descriptor was GA-chosen cation descriptor in the QSPR model. BELe3 is Burden eigenvalues descriptor that is weighted by atomic Sanderson electronegativities. In order to calculate the Burden eigenvalues, Burden matrix (B) is calculated. Burden matrix which is calculated based on the hydrogen-depleted graph is defined as Eq. (31)74.

$$\left[ B \right]_{{ij}} = \left\{ {\begin{array}{*{20}l} {\pi _{{ij}}^{*} ~.~~10^{{ - 1}} } \hfill & {~if} \hfill & {~\left( {i,j} \right) \in E} \hfill \\ {Z_{i} ~} \hfill & {~if} \hfill & {~i = j} \hfill \\ {0.001~} \hfill & {~if} \hfill & {\left( {i,j} \right) \notin E~} \hfill \\ \end{array} } \right.$$
(31)

where Zi is the atomic number of atoms i, and E is the edges of the graph. The conventional bond order, denoted as \({\pi }_{ij}^{*}\), represents different values for various types of bonds. For instance, it takes on values of 0.1, 0.15, 0.2, 0.3, and 0.01 for single, aromatic, double, triple, and terminal bonds, respectively.

The Burden eigenvalues reflect the molecular connectivity and help to predict the interaction strength in the solvent systems. Higher eigenvalues can suggest stronger interactions between the cation and benzene, enhancing the extraction efficiency. In addition, the BELe3 descriptor is particularly relevant for evaluating the interactions in I.L. systems due to its sensitivity to the electronic environment of the cation. Table 11 illustrates that the BELe3 descriptor values is increased by an increase in the length of the alkyl side chain in the cation structure. It may be due to the increase in the number of single bonds. As the alkyl side chain length increases, not only the number of single bonds increase, but also the steric effects of the cation play critical role in the molecular interactions. Eq. (23) or (24) show the positive effect of increasing BELe3 values on the benzene extraction. Therefore, a longer alkyl chain may enhance the benzene extraction. It should be noted that as the cation size increases, the Coulombic interaction between the anion and the cation decreases. Consequently, the CH-π interaction between the cation and benzene increases. This enhancement is particularly crucial in the I.L. systems, where these interactions can significantly influence the solubility and extraction outcomes. Moreover, the electronegativity of the atoms in the structure of the cations is effective in determining the Y2 value. It seems that the impact of electronegativity in this context underscores the significance of cation structure design in optimizing the extraction efficiency, as it affects the overall interaction energy and stability of the I.L. with benzene.

Table 11 Details of cations involved in the current dataset.

Conclusion

In the present study, after collecting an extensive dataset including 112 ternary systems containing I.L., benzene, and aliphatic hydrocarbon, new linear and non-linear QSPR models were developed to estimate the benzene distribution between the aliphatic hydrocarbon-rich and I.L.-rich phases. For this purpose, the existence of correlation between X2 and Y2 values was confirmed for each of the ternary systems, at first. Afterward, considering the anion, cation, and aliphatic hydrocarbon molecular descriptors in the model, it was found that the aliphatic hydrocarbon descriptor has a negligible effect in predicting Y2 values. Consequently, only cation and anion molecular descriptors were considered in the constructed model. It was also found that considering the X2 = 0.15 as a turning point is effective in increasing the estimation capability of the constructed model. Moreover, the breaking point plot showed that only one anion descriptor and one cation descriptor were sufficient for the development of the model. As a result, the final linear model was constructed using X2, one anion descriptor, and one cation descriptor considering the turning point. Internal and external validation confirmed that the linear constructed QSPR model can estimate Y2 with reasonable accuracy. GP and MLP machine learning method were employed for developing non-linear QSPR models. The results showed that the prediction accuracy for the non-linear models is only slightly better compared to the linear model. Therefore, the linear model was preferred in this study due to its simplicity and ease of interpretation.

Interpretation of the GA-chosen descriptors in the constructed model revealed that the size of the cation and anion, the electronegativity of the atoms in the cation structure, the shape of anion, and the presence of electronegative atoms near the surface of the anions are effective in predicting Y2 values. The QSPR-developed model is an applicable tool to predict Y2 values for unstudied ternary systems (only 112 ternary systems from 4080 synthesizable ternary systems using 17 anions, 20 cations, and 12 aliphatic hydrocarbons have been studied). The outcome of the present study is a step forward in the development of a computational approach for solvent screening in the extractive removal of benzene from fuel mixtures containing aromatic/aliphatic compounds using I.Ls as green solvents.