Introduction

Kimchi, a traditional fermented vegetable originating from Korea, is globally recognized as a health food. Baechu kimchi, the most popular and well-known type of kimchi, is made by fermenting brined kimchi cabbage (Brassica rapa subsp. pekinensis) processed with a seasoning mixture consisting of various ingredients, such as white radish, red pepper (Capsicum annuum L.) powder, fish sauce, garlic, sticky rice porridge, and other leaf and stem vegetables. In Korea, aside from these basic ingredients, manufacturers use a wide range of optional components as a seasoning base to enhance the flavor of kimchi (e.g., purees of fruits such as apple and pear, and extracts of ingredients such as sea tangle, dried Alaska pollock, and mushrooms). Moreover, starter cultures are selectively applied in kimchi fermentation to maintain uniform quality, extend shelf life, and enhance organoleptic or health-promoting properties1,2,3,4.

Kimchi products exhibit significant variation in quality and price based on their geographical origin. This price discrepancy between domestic and imported kimchi products has led to the dominance of imported products on restaurant tables in Korea. Furthermore, in Korea, false indications of the geographical origin of kimchi frequently occur, driven by restaurants seeking illicit profits. This phenomenon results in continuous hardships for consumers, as well as innocent businesses and manufacturers. For instance, according to the Korean Ministry of Agriculture, Food and Rural Affairs, there were 1183 cases of false indication of geographical origin reported for baechu kimchi over the past three years (2020–2022) in Korea.

Therefore, a technique capable of distinguishing between domestic and imported kimchi products is essential to enhance the protection of both consumers and producers against violations of the geographical origin of kimchi in Korea. Typically, the geographical origin of foods has been determined by characteristic features revealed in their biochemical composition5,6. However, successfully classifying the geographical origin of kimchi presents a challenge due to the complexity and similarity of the kimchi matrix, influenced by various factors such as raw materials used, conditions of crop production and processing, and fermentation parameters7.

Previous research on the classification of the geographical origin of kimchi has primarily relied on differences in elemental, metabolite, and mass spectral profiles. These studies employed various analytical techniques, including 1H-nuclear magnetic resonance7, electronic nose8, inductively coupled plasma atomic emission spectrometry and inductively coupled plasma mass spectrometry9, matrix-assisted laser desorption ionization time-of-flight mass spectrometry10, and ultra-performance liquid chromatography coupled to high-resolution mass spectrometry11. Although these analytical methods, when coupled with chemometrics such as principal component analysis (PCA), hierarchical cluster analysis, linear discriminant analysis (LDA), and partial least squares regression-discriminant analysis (PLS-DA), have demonstrated successful results, they are not without limitations.

Most of these methods involve expensive and cumbersome operations that must be conducted by skilled operators, and they are often time-consuming. Some of these methods involve sample extraction procedures that require heat treatment or use reagents and solvents. These procedures are inefficient for analyzing large quantities of samples and may be environmentally harmful. In addition, several chemometric methods used in previous studies may not be very efficient for discriminating the geographical origin of kimchi. PCA is unsupervised, so each data cluster must be matched manually to a sample group to discriminate between samples. LDA may not yield satisfactory classification results if the data are not normally distributed or if the decision boundary is nonlinear or very complex. Hence, it is crucial to develop a method that is faster, cheaper, and easier to operate and that serves as a reliable alternative to these conventional methods.

In this regard, Fourier transform near-infrared (FT-NIR) spectroscopic analysis, which possesses such advantages, has been widely applied for qualitative and quantitative analyses in food and foodstuffs. This non-destructive technology provides spectral data that swiftly reflect the chemical composition of samples, enabling rapid and straightforward classification of the geographical origin of samples when combined with a chemometric tool12,13,14. Furthermore, machine learning techniques have recently emerged as powerful chemometric tools for determining the geographical origin of various foods, including honey13, teas14,15, cocoa beans16, rice17, and sea cucumber18. Fortunately, machine learning techniques can directly interpret classification results without human intervention, and can efficiently deal with large amounts of complex data. This technique is expected to accurately classify the geographical origin of processed foods such as kimchi, as it can extract valuable information from raw datasets that may exhibit hard-to-characterize trends or patterns because of the high similarity or large size of the data.

This study aimed to explore the feasibility of FT-NIR spectroscopy coupled with machine learning techniques in determining the geographical origin of kimchi. The focus was on achieving accurate classification for the geographical origin of kimchi by utilizing an emerging chemometric tool such as machine learning, rather than relying on the selection of significant compounds capable of distinguishing the geographical origin of kimchi. To achieve this objective, FT-NIR spectral data from domestic and imported kimchi samples were acquired, and various spectral preprocessing methods and machine learning techniques were evaluated to establish an accurate and robust classification model.

Results and discussion

FT-NIR spectral interpretation

Figure 1a and S1 depict individual raw and preprocessed spectra acquired from domestic and imported kimchi samples. In Fig. 1a, individual raw spectra exhibit a similar pattern characterized by eight broad absorption bands, encompassing numerous adjacent and overlapping bands throughout the wavenumber range. The differences in absorbance intensity among these spectra were likely associated with multiplicative responses to variations in pathlength, as these differences became significantly smaller after preprocessing with pathlength correction methods such as multiplicative signal correction (MSC) and standard normal variate (SNV)19 (Fig. 1a and S1a,h). The individual spectral patterns of domestic and imported sample groups remained consistent regardless of the type of pathlength correction method used, but variations were observed based on the smoothing method employed; Norris derivative (ND) proved more effective than Savitzky–Golay filtering (SG) in reducing the noise level of spectra, thereby enhancing their appearance (Figure S1). This suggests that different preprocessing methods may have different effects on the model performance for the discrimination of kimchi samples according to their origin.
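The effect of pathlength correction described above can be illustrated with a minimal sketch. It is not the TQ Analyst implementation used in this study; the function names `snv` and `msc`, the use of the sample standard deviation, the choice of the set mean as the MSC reference, and the absorbance values are all assumptions made for illustration. Two hypothetical spectra that differ only by a multiplicative factor and an offset become identical after either correction.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample SD is an assumption; some implementations use population SD
    return [(x - m) / s for x in spectrum]


def msc(spectra):
    """Multiplicative signal correction against the mean spectrum of the set (assumed reference)."""
    n_points = len(spectra[0])
    ref = [mean(s[i] for s in spectra) for i in range(n_points)]
    ref_mean = mean(ref)
    ref_ss = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = mean(s)
        # Least-squares fit of the spectrum onto the reference: s = intercept + slope * ref
        slope = sum((r - ref_mean) * (x - s_mean) for r, x in zip(ref, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected


# Hypothetical absorbance values; `scaled` mimics the same sample with a longer effective pathlength
base = [0.20, 0.35, 0.50, 0.42, 0.30]
scaled = [1.5 * x + 0.1 for x in base]

snv_base, snv_scaled = snv(base), snv(scaled)
msc_base, msc_scaled = msc([base, scaled])
```

In this toy case, both corrections remove the multiplicative and additive differences completely, which is consistent with the similar behavior of MSC and SNV reported here.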

Fig. 1

Representative raw and preprocessed FT-NIR spectra of domestic and imported kimchi samples. (a) Individual raw spectra. (b) Average patterns of spectra preprocessed with MSC. (c) Average patterns of spectra preprocessed with MSC + D2 + ND. The ten peaks marked with numbers had significant differences (p < 0.05) in intensity between the two sample groups. Note: MSC, multiplicative signal correction; D2, the second derivative; ND, Norris derivative filtering.

Figure 1b illustrates representative average patterns of spectra preprocessed with one of the pathlength correction methods (MSC) for domestic and imported sample groups. As depicted in Fig. 1b, the two average spectra nearly overlap, making it challenging to differentiate them with the naked eye. This suggests that, on average, the chemical information obtained from these two sample groups is notably similar, despite the diversity and complexity of intrinsic/extrinsic factors influencing the quality of kimchi products. It underscores the need for a practical chemometric tool to extract information useful for distinguishing between them.

Figure 1c exhibits representative average patterns of spectra preprocessed with a combined method involving one of the pathlength correction methods + one of the derivative methods + one of the smoothing methods (MSC + D2 + ND) for domestic and imported sample groups. As depicted in Fig. 1c, the D2 treatment revealed some sharp peaks. Among the noteworthy regions, ten peaks marked with numbers exhibited significant differences (p < 0.05) in intensity between the two sample groups. The chemical compositions associated with these differences may contain valuable information for classifying the origin of kimchi. Although the complexity of constituents in kimchi products limits the interpretation of structural information because of the extensive overlapping of bands, some insights are summarized in Table 1.

Table 1 Speculative assignments of absorption bands with significant differences (p < 0.05) in peak intensity between the two sample groups in the average patterns of spectra preprocessed with MSC + D2 + ND, obtained from domestic and imported kimchi samples.

In general, a band around peak No.1 (7278 cm−1) and bands around peaks No. 7 (4401 cm−1), No. 8 (4327 cm−1), No. 9 (4262 cm−1), and No. 10 (4046 cm−1) correspond to combinations of C-H22. A band near peak No. 2 (6954 cm−1) corresponds to the first overtone of O-H and N-H22. Two bands around peak No. 3 (5789 cm−1 for domestic; 5785 cm−1 for imported) and peak No. 4 (5677 cm−1) are caused by the first overtone of C-H20. Another two bands around peak No. 5 (4852 cm−1 for domestic; 4844 cm−1 for imported) and peak No. 6 (4736 cm−1) correspond to combinations of O-H and N-H20. These FT-NIR absorption band regions are closely linked to carbohydrates (7278 cm−1, 6954 cm−1, 5789 and 5785 cm−1, 5677 cm−1, 4852 and 4844 cm−1, 4736 cm−1, 4401 cm−1, 4327 cm−1, and 4262 cm−1)19,17,20,21,22,23,24, proteins (5789 and 5785 cm−1, 4852 and 4844 cm−1, 4736 cm−1, 4401 cm−1, 4327 cm−1, 4262 cm−1, and 4046 cm−1)19,17,21, and lipids (5789 and 5785 cm−1, 5677 cm−1, 4736 cm−1, 4401 cm−1, 4327 cm−1, 4262 cm−1, and 4046 cm−1)21,25,26.

According to information from the food nutrition database by the Korean Ministry of Food and Drug Safety, kimchi contains approximately 90% moisture, 2% protein, less than 1% lipid, 6–7% carbohydrates, and 2–3% ash (Table 2). Carbohydrates are the predominant nutrients in kimchi. These compounds include dietary fibers, which are related to the integrity of the plant cell wall27, and sugars that contribute to the sweet taste of kimchi28. These are mainly derived from plant-based ingredients, and sucrose is often added to enhance the taste. Meanwhile, proteins are primarily derived from animal-based ingredients such as fish sauces and dried-fish extracts29,30. Some of these components in kimchi have been found to vary depending on the geographical origin. Differences in some metabolites, including amino acids, sugars, and proteins, were successfully used to classify the geographical origin of kimchi in previous studies21,31. However, some of these components undergo changes in content during fermentation; for example, sugars such as glucose and fructose are consumed by lactic acid bacteria during fermentation28. Hence, the chemical composition of kimchi can vary depending on both the ingredients used and the conditions of processing and fermentation. To identify the characteristics of kimchi that are specific to its origin, it is essential to overcome the complexity and diversity of the factors that determine the quality of kimchi. Therefore, it is critical to develop a classification model with good performance that can distinguish between the characteristics of kimchi according to its geographical origin.

Table 2 A representative proximate composition (g/100 g) of kimchi made from kimchi cabbage harvested in different seasons (available on the food nutrition database by Korean Ministry of Food and Drug Safety (MFDS, 2022)).

Classification of geographical origin of domestic and imported kimchi samples by PCA

Figure 2 and S2 present PCA score plots constructed from the top two PCs using preprocessed FT-NIR spectral data of domestic and imported samples. These two-dimensional PCA score plots explained 18.9% (SNV + D2 + SG)–75.2% (MSC) of the total variance. This indicates that the FT-NIR analytical tool has high dimensionality, enhancing its discrimination ability against similar samples32. As shown in Fig. 2a and b, when pathlength correction methods were applied, domestic and imported kimchi samples widely overlapped on the plots, and the distribution patterns on the PCA plots were very similar regardless of the type of pathlength correction method used. These results indicate a similarity in spectral transformation between MSC and SNV, also observed in Figure S1, projected onto the PCA plots.

Fig. 2

Principal component analysis (PCA) score plots constructed from the top two PCs (PC1 and PC2) using spectral data of domestic and imported kimchi samples that were preprocessed with (a) MSC, (b) SNV, (c) MSC + D1, (d) MSC + D1 + ND, (e) MSC + D2, and (f) MSC + D2 + ND. Note: MSC, multiplicative signal correction; SNV, standard normal variate; D1, the first derivative; ND, Norris derivative filtering; D2, the second derivative.

The variation of data points for each sample group on these plots (Fig. 2a,b) decreased as the degree of preprocessing increased; for example, the spread of data points increased in the order MSC + D2 < MSC + D1 < MSC (Fig. 2a,c,e). Meanwhile, when a combined method including either D1 (Fig. 2c,d and Figure S2a,c,d,e) or D2 + ND (Fig. 2f and S2g) was applied, the two sample groups could be differentiated to some extent by PC2. These results highlight the importance of selecting appropriate preprocessing methods to improve the classification of the geographical origin of kimchi. However, despite applying various combined preprocessing methods to the raw spectral data, PCA could not completely separate the two sample groups based on their geographical origin. These results indicate that the differences in FT-NIR data between domestic and imported kimchi samples were not sufficiently clear for a full distinction using PCA. A similarly unclear classification pattern was reported by Lee et al.31, who could not obtain a completely differentiated pattern between domestic and imported kimchi sample groups by PCA using proteomic data. Conversely, a clear separation between domestic and imported kimchi sample groups was achieved by PCA using 1H NMR data. Thus, our results suggest that an advanced chemometric method should be attempted to more explicitly differentiate domestic and imported kimchi samples based on their geographical origin.

Classification of geographical origin of domestic and imported kimchi samples by chemometric techniques

Comparison of preprocessing method

Various chemometric techniques, including k-nearest neighbors (KNN), classification and regression tree (CART), support vector machine (SVM), naive Bayes (NB), random forest (RF), and PLS-DA, were employed to construct a classification model for determining the geographical origin of kimchi. Table S1 and Table 3 summarize the cross-validation results with training sets and the performance evaluations with test sets, respectively, in terms of accuracy, recall, specificity, precision, and F1 score. Figure 3 presents confusion matrices obtained during model testing. These results reveal that the impact of applying preprocessing methods to the raw spectral data on the classification outcome depended on the type of chemometric algorithm used. This phenomenon aligns with observations in previous studies on the discrimination based on FT-NIR analysis of cocoa beans16, sea cucumbers18, and raw milk33. As indicated in Tables S1 and 3, when employing KNN and SVM algorithms, no misclassification occurred even without data preprocessing in both training and testing models, supported by the perfect values of all performance metrics. Specifically, when using KNN, minor misclassifications were observed in model training with spectral data preprocessed with MSC + D2 and SNV + D2, but all preprocessing methods yielded flawless results without classification errors in model testing. Similarly, when using SVM, minor classification errors were observed in model training with several datasets preprocessed with MSC + D2, MSC + D2 + SG, SNV + D2, and SNV + D2 + SG. However, all data preprocessing methods, with two exceptions (MSC + D2 and SNV + D2), produced error-free classification results in model testing.

Fig. 3

Confusion matrix plots obtained in model testing to discriminate the geographical origin of domestic and imported kimchi samples.

Conversely, for other algorithms such as CART, NB, RF, and PLS-DA, applying preprocessing methods to the raw spectral data generally improved the classification results in both training and testing models, with a few exceptions where the results slightly deteriorated (Tables S1 and 3). Among these four algorithms, RF and PLS-DA achieved complete classification results using some preprocessed datasets. When using RF, all performance metrics had a high value (0.97) in both training and testing models without data preprocessing. However, these values slightly decreased to 0.88–0.95 (accuracy), 0.90–0.96 (recall), 0.87–0.96 (specificity), 0.87–0.96 (precision), and 0.89–0.95 (F1 score) in both training and testing models when applying four data preprocessing methods (MSC + D2, MSC + D2 + SG, SNV + D2, and SNV + D2 + SG). This result shows that these four methods were not useful in enhancing the performance of the RF model for classifying the geographical origin of kimchi. However, one method (MSC + D1 + ND) in model training and three methods (MSC + D1 + ND, SNV, and SNV + D1 + ND) in model testing allowed for flawless model performance. Based on these results, MSC + D1 + ND was found to be the most effective of the preprocessing methods applied in this study in reinforcing the performance of RF, and the RF algorithm could establish a robust model with perfect values of all performance metrics (1.00) using the dataset preprocessed with MSC + D1 + ND. Meanwhile, when using PLS-DA, values of all performance metrics ranged from 0.87 to 0.93 in both training and testing models without data preprocessing. All data preprocessing methods except for six (MSC, MSC + D2, MSC + D2 + SG, SNV, SNV + D2, and SNV + D2 + SG) in model training and all methods except for two (MSC + D2 and SNV + D2) in model testing led to perfect model performance. These results imply that applying D1 following one of the scattering correction methods (MSC and SNV) was crucial for improving the performance of PLS-DA models, regardless of whether a smoothing technique was applied. Based on these results, six methods (MSC + D1, MSC + D1 + ND, MSC + D1 + SG, SNV + D1, SNV + D1 + ND, and SNV + D1 + SG) were found to be the most suitable ones for enhancing the performance of PLS-DA models among the preprocessing methods applied in our study.

However, CART and NB algorithms could not achieve perfect classification performance even with preprocessed spectral datasets in both training and testing models. In terms of performance metrics, the best preprocessing methods for improving the performance of models based on the CART and NB algorithms were found to be MSC + D2 + ND and SNV + D2 + SG, respectively. Although most data preprocessing methods in addition to the best methods were beneficial in enhancing the classification performance when using CART and NB algorithms, the recall values of CART and NB models decreased from 0.90–0.93 to 0.66–0.88 and from 0.77–0.81 to 0.00–0.73, respectively, in some cases. These cases involved datasets preprocessed with MSC + D2, MSC + D2 + SG, SNV + D2, and SNV + D2 + SG for CART, and with MSC, MSC + D2 + ND, MSC + D2 + SG, and SNV for NB. This result indicates that applying these preprocessing methods significantly reduced the predictive ability of the CART and NB models for domestic samples.

From these findings, it is evident that data preprocessing plays a less critical role in the performance of classification models based on KNN and SVM algorithms for identifying the geographical origin of kimchi. However, data preprocessing proves generally helpful in enhancing model performance when utilizing other algorithms, particularly PLS-DA (Table 3). These results underscore the importance of selecting an optimal data preprocessing method tailored to each classification algorithm to achieve improved model performance.

Table 3 The results of performance evaluation with test sets of classification models for determining the geographical origin of domestic and imported kimchi samples.

In this context, the functions of MSC or SNV were essential to build all types of classification models used in our study. In particular, the performance of a few models, such as the CART and NB models, was greatly improved with these methods. These results demonstrate that differences in sample pathlength and scattering effects seriously hindered the recognition of important features for classification, and that eliminating this hindrance considerably enhanced the prediction ability of the classification models used. In contrast, applying D1 or D2 following one of the scattering correction methods (MSC and SNV) did not always guarantee improved model performance. Unlike MSC and SNV, the impact of D1 and D2 on model performance varied depending on the classification algorithm. For instance, MSC + D2 and SNV + D2 had lower values of performance metrics than did MSC and SNV when using the CART, SVM, RF, and PLS-DA algorithms, whereas they had higher values when using the NB algorithm. These results suggest that useful information related to the feature of interest might be masked by the noise amplified by D2, which degraded model performance when using the CART, SVM, RF, and PLS-DA algorithms. On the other hand, despite the noise amplification, modeling with NB seemed to improve because D2 resolved the overlapped peaks and emphasized detailed structures. Nevertheless, preprocessing with either MSC + D2 or SNV + D2 is not recommended for the classification of the geographical origin of kimchi based on the chemometric algorithms used in this study. None of the classification algorithms, using datasets preprocessed with these methods, demonstrated a perfect classification result in model training (Table S1). This aligns with the PCA results, where datasets preprocessed with these methods exhibited patterns of kimchi samples that were challenging to classify into two categories according to their origin (Fig. 2e and S2f). These results suggest that when employing D2 for data preprocessing of raw spectra, it is essential to also apply a smoothing method for better classification results. Of the two smoothing methods used in this study, ND is recommended because it consistently achieved better results than SG for the classification of the geographical origin of kimchi based on the algorithms used in this study. Moreover, the PCA results were superior when ND was applied for data preprocessing compared to when SG was used. Specifically, the PCA using datasets preprocessed with either MSC + D2 + SG or SNV + D2 + SG failed to identify a distinguishable cluster of kimchi samples according to their geographical origin (Figure S2b,h).
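The interplay between the second derivative and smoothing discussed above can be sketched with the textbook 7-point Savitzky–Golay second-derivative coefficients for a quadratic or cubic fit. This is only an illustration under stated assumptions: it is not the TQ Analyst procedure, the helper names `apply_filter` and `naive_d2` are hypothetical, the signal is synthetic with unit spacing, and the Norris derivative filter is not reproduced here.

```python
# Textbook 7-point Savitzky-Golay second-derivative coefficients (quadratic/cubic fit, unit spacing)
SG7_D2 = [c / 42 for c in (5, 0, -3, -4, -3, 0, 5)]


def apply_filter(signal, coeffs):
    """Apply a symmetric filter; only positions with a full window are returned."""
    half = len(coeffs) // 2
    return [
        sum(c * signal[i + j - half] for j, c in enumerate(coeffs))
        for i in range(half, len(signal) - half)
    ]


def naive_d2(signal):
    """Unsmoothed second difference, shown for comparison."""
    return [signal[i - 1] - 2 * signal[i] + signal[i + 1] for i in range(1, len(signal) - 1)]


# Hypothetical signal: a parabola (true second derivative = 2) plus alternating noise
clean = [float(x * x) for x in range(15)]
noisy = [y + (0.5 if i % 2 else -0.5) for i, y in enumerate(clean)]

d2_clean = apply_filter(clean, SG7_D2)   # recovers the true value on noise-free data
d2_naive = naive_d2(noisy)               # noise is strongly amplified
d2_sg = apply_filter(noisy, SG7_D2)      # noise is largely suppressed

worst_naive = max(abs(d - 2.0) for d in d2_naive)
worst_sg = max(abs(d - 2.0) for d in d2_sg)
```

In this synthetic case the unsmoothed second difference deviates far more from the true value than the smoothed estimate does, which mirrors the degradation observed for MSC + D2 and SNV + D2 without smoothing.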

Comparison of chemometric algorithms

As indicated in Table 3 and Fig. 3, except for CART and NB, all supervised chemometric algorithms used in this study successfully established classification models with flawless performance. Fifteen KNN models, thirteen SVM models, three RF models, and twelve PLS-DA models showed no classification errors in model testing. However, in CART models, the best performance with minor misclassifications was observed when using datasets preprocessed with MSC + D1, MSC + D1 + ND, MSC + D2 + ND, SNV + D1, and SNV + D1 + ND (0.98 of accuracy, 1.00 of recall, 0.97 of specificity, 0.97 of precision, and 0.98 of F1 score). In these cases, only one imported kimchi sample was incorrectly discriminated as domestic. In contrast, in NB models, at least four misclassifications were observed.
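The metric values reported for the best CART models can be reproduced from the confusion matrix counts. The sketch below assumes that domestic is the positive class (consistent with recall describing the prediction of domestic samples) and that the test set holds 30 domestic and 30 imported samples; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, fp, tn):
    """Return the five metrics used in this study from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)          # share of domestic samples predicted as domestic
    specificity = tn / (tn + fp)     # share of imported samples predicted as imported
    precision = tp / (tp + fp)       # share of "domestic" predictions that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Best CART case: all 30 domestic samples correct, one imported sample predicted as domestic
best_cart = classification_metrics(tp=30, fn=0, fp=1, tn=29)
rounded = {name: round(value, 2) for name, value in best_cart.items()}
```

Rounded to two decimals, the values match those reported above: 0.98 accuracy, 1.00 recall, 0.97 specificity, 0.97 precision, and 0.98 F1 score.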

Among the four successful algorithms, KNN emerged as the optimal choice for determining the geographical origin of kimchi. It consistently and accurately classified all 30 domestic and 30 imported kimchi samples based on their geographical origin, irrespective of the data preprocessing method employed, as evident in the model testing results (Fig. 3). Furthermore, the classification of the test set was accomplished within a short execution time of 11 s (Table S2). This execution time was comparable to those obtained from SVM models, except for one based on the raw dataset, but significantly shorter than those from RF and PLS-DA models. These results highlight that KNN outperformed other algorithms used in this study in recognizing and extracting distinct features between domestic and imported kimchi samples, even without the aid of data preprocessing.

KNN has a successful track record in determining the geographical origin of foods. Similar to our study, KNN models exhibited superior performances compared to other models, such as SVM and RF, for identifying the geographical origin of white tea based on NIR in a study by Zhang et al.15. KNN models based on destructive analytical techniques, such as gas chromatography-mass spectrometry, demonstrated perfect results for discriminating the geographical origin of tea34,35,36,37 and liquors38,39. The simplicity of KNN’s mathematical approach, lack of assumptions regarding underlying data, and robustness against outliers contribute to its effectiveness in solving classification problems40,41,42. These advantages appear to have contributed to the excellent performance in the classification of kimchi according to geographical origin in our study.
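The simplicity of the KNN approach can be seen in a minimal sketch of majority voting among the nearest training points. The value k = 3, the Euclidean distance, the function name `knn_predict`, and the two-feature toy data are all assumptions made for illustration; the hyperparameters and software used for the models in this study are not restated here.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    neighbours = sorted(zip(train_x, train_y), key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data standing in for preprocessed spectral intensities
train_x = [
    [0.10, 0.80], [0.15, 0.75], [0.12, 0.82],
    [0.60, 0.30], [0.65, 0.25], [0.58, 0.33],
]
train_y = ["domestic"] * 3 + ["imported"] * 3

pred_a = knn_predict(train_x, train_y, [0.13, 0.79])
pred_b = knn_predict(train_x, train_y, [0.62, 0.28])
```

No model parameters are fitted and no distributional assumption is made; the prediction depends only on distances to stored training spectra.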

While KNN demonstrated the best performance, SVM and PLS-DA algorithms also showed high potential for discriminating kimchi samples based on their geographical origin (Table 3 and Fig. 3). In SVM models, only one imported kimchi sample was incorrectly classified as domestic, and this occurred only when spectral data were preprocessed with MSC + D2 and SNV + D2 in model testing (Fig. 3). All metric values obtained in model testing were also 1.00, except for when using datasets preprocessed with MSC + D2 and SNV + D2 (0.98 of accuracy, 1.00 of recall, 0.97 of specificity, 0.97 of precision, and 0.98 of F1 score). Similarly, in PLS-DA models, except for one using the raw dataset, only one imported kimchi sample was incorrectly discriminated as domestic, and again, this happened only when spectral data were preprocessed with MSC + D2 and SNV + D2 in model testing. All metric values obtained in model testing for PLS-DA were also 1.00, except for PLS-DA models constructed using raw, MSC + D2, and SNV + D2 datasets. Between these two algorithms, SVM performed better than PLS-DA because, similar to KNN, SVM effectively captured the differences between domestic and imported kimchi samples even without data preprocessing and was not significantly affected by data preprocessing. In contrast, when using PLS-DA, data preprocessing was essential to prevent misclassifications, and the execution time of PLS-DA was longer than that of SVM, except for one case based on the raw dataset (Table S2).

Numerous successful instances of determining geographical origin using SVM are evident in previously reported studies. For example, the SVM model exhibited impeccable performance in discriminating roast green tea based on geographical origin using FT-NIR data in a study by Chen et al.14, surpassing other methods such as LDA, KNN, and back propagation artificial neural networks. Another study by Gaiad et al.33 demonstrated that SVM, based on trace element profiles, outperformed other models including RF, KNN, LDA, and PLS-DA in identifying the geographical origin of lemon juices. SVM is known for its robustness against outliers and high generalization performance, preventing overfitting43. However, it requires data preprocessing, such as normalization, when the data scale varies, and it is not suitable for multi-classification problems. Nonetheless, these limitations do not pose issues for binary classification with consistent data scales, as demonstrated in discriminating the geographical origin of kimchi in this study. Therefore, alongside KNN, SVM emerges as a promising chemometric tool for determining the geographical origin of kimchi.

Conversely, a poor classification result emerged when the NB model was established using the dataset preprocessed with MSC + D2 + SG (Fig. 3). In this instance, while all 30 imported kimchi samples were correctly identified, all 30 domestic kimchi samples were misclassified as imported. In evaluating the classification models for the geographical origin of kimchi, both precision and recall are critical, as it is essential to simultaneously avoid misclassification of both domestic and imported kimchi samples. Therefore, the F1 score, the harmonic mean of precision and recall, may be more useful than accuracy for assessing model performance in this classification. Moreover, precision may be deemed more important than recall in the classification of the geographical origin of kimchi. If a model misclassifies a few domestic kimchi products as imported, the error could be rectified through additional testing, but if a model misclassifies imported kimchi products as domestic, the error might be easily overlooked. In this context, it can be asserted that the NB model built from the dataset preprocessed with MSC + D2 + SG exhibited the poorest performance in discriminating kimchi samples according to their geographical origin, as both precision and F1 score were indeterminate owing to the misclassification of all domestic samples.
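The indeterminate precision and F1 score of this NB model follow directly from the metric definitions. The worked equations below assume that domestic is the positive class, so the confusion matrix counts are TP = 0, FN = 30, FP = 0, and TN = 30; the accuracy value is derived from these counts rather than quoted from Table 3.

```latex
\begin{aligned}
\text{Precision} &= \frac{TP}{TP + FP} = \frac{0}{0 + 0} \quad (\text{undefined}),\\
\text{Recall} &= \frac{TP}{TP + FN} = \frac{0}{0 + 30} = 0,\\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \quad (\text{undefined, because precision is undefined}),\\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} = \frac{0 + 30}{60} = 0.50 .
\end{aligned}
```

Because no sample was predicted as domestic, the denominator of precision is zero, which is why accuracy alone would understate how badly this model fails on domestic samples.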

As a result, it was determined that a successful classification model could be established using KNN, SVM, and PLS-DA algorithms when the FT-NIR spectra were preprocessed with any of the methods used in this study except for MSC + D2 and SNV + D2. However, the collected samples may not represent the broader market variability. Thus, to ensure the generalizability of models for practical applications, the sample size should be progressively increased, and the feasibility of the proposed approach should be continuously re-evaluated as the sample size grows.

Conclusions

In this study, we demonstrated the feasibility of coupling FT-NIR spectroscopy with chemometric techniques for determining the geographical origin of kimchi. Various chemometric algorithms, including KNN, SVM, and PLS-DA, exhibited significant potential for discriminating the geographical origin of kimchi. KNN and SVM algorithms built a robust classification model even without any preprocessing. RF and PLS-DA algorithms could achieve perfect model performance when the respective appropriate preprocessing methods were applied, but CART and NB algorithms were not able to completely eliminate classification errors even with preprocessed datasets. Interestingly, PLS-DA was better than CART, NB, and RF in building a robust model using preprocessed spectral datasets. Despite the similarity in FT-NIR spectra between domestic and imported kimchi samples, these supervised algorithms, especially KNN, successfully differentiated domestic and imported kimchi samples based on their geographical origin. In addition, KNN was more efficient than SVM and PLS-DA in identifying the geographical origin of kimchi because of its ability to build a robust classification model without preprocessing and its better computational efficiency. Although this non-destructive approach did not provide information regarding specific components that differ between sample groups according to their geographical origin, it eliminates the need for time-consuming, labor-intensive, and environmentally unfriendly procedures associated with conventional analytical methods. Therefore, the proposed method holds promise as a valuable tool for determining the geographical origin of kimchi, and can be an ideal alternative to conventional methods in terms of its cost-effectiveness, speed, and accuracy. However, a larger dataset should be constructed to improve the generalizability of the models. Moreover, further research is required to cover various factors such as raw materials used, conditions of crop production and processing, and fermentation parameters to expand the applications of the proposed method. To this end, suitable preprocessing and modeling techniques need to be continuously explored.

Materials and methods

Kimchi samples

A total of 30 domestic and 30 imported kimchi products were purchased online. The samples were freeze-dried at -70 ℃, homogenized using a grinder (Model HR1673, VIDEOTON Elektro-PLAST Ltd., Hungary), and then used for FT-NIR spectroscopic analysis. Freeze-dried samples were stored at -18 ℃ until analysis. Production sites and altitudes of the collected samples are presented in Table S3.

FT-NIR spectroscopic analysis

Data acquisition

Spectral data were gathered employing an Antaris™ II FT-NIR analyzer (Thermo Scientific™ Co., USA). The sample was positioned in a powder sampling cup, which was then affixed to the sample cup spinner with the integrating sphere model. Spectral data for each sample were acquired in quintuplicate at 32 scans and a resolution of 8 wavenumbers (cm−1) in the range of 10,000–4,000 cm−1. All analyses were conducted at a room temperature of 25 ℃ and a relative humidity of 21–22%.

Data preprocessing

Before constructing classification models, spectral data preprocessing was conducted using TQ Analysis 9 software (Thermo Scientific™ Co., USA) to improve the robustness and prediction ability of each model by removing physical phenomena in the spectral data44. Three widely used categories of preprocessing methods, namely scattering correction, spectral derivatives, and data smoothing, were used14,18,44,45,46. For pathlength correction, multiplicative signal correction (MSC) and standard normal variate (SNV) were employed to compensate for differences in sample pathlength. Scattering correction is typically required prior to modeling based on spectral data to reduce the effects of unwanted spectral variation. Derivatives, specifically the first derivative (D1) and the second derivative (D2), were used to reveal peaks appearing as shoulders and to pinpoint the precise center of shoulders in the raw spectra. However, spectral derivatives can amplify unwanted noise. In this case, a smoothing technique is necessary to obtain meaningful spectral features by minimizing random noise that could lead to classification errors. Smoothing procedures involved SG with 7 data points and a polynomial order of 3, as well as ND filtering with a segment length of 5 and a gap of 5 between segments. To find an effective preprocessing method to improve the performance of each classification model, a total of 14 preprocessing methods, consisting of three types of combinations, were applied: (1) one of the pathlength correction methods (MSC or SNV); (2) one of the pathlength correction methods + one of the derivative methods (MSC + D1, MSC + D2, SNV + D1, SNV + D2); (3) one of the pathlength correction methods + one of the derivative methods + one of the smoothing methods (MSC + D1 + ND, MSC + D1 + SG, MSC + D2 + ND, MSC + D2 + SG, SNV + D1 + ND, SNV + D1 + SG, SNV + D2 + ND, SNV + D2 + SG).
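As a minimal sketch of the two correction steps named above, the following Python code applies SNV and MSC to spectra stored as plain lists. It is an illustration under stated assumptions only: the study used TQ Analysis 9 software, whose internal implementation is not described here. The function names `snv` and `msc` and the toy spectra are hypothetical, and the mean spectrum is assumed as the MSC reference, which is a common convention. Derivatives and smoothing are omitted from the sketch.

```python
from statistics import mean, stdev


def snv(spectrum):
    """Standard normal variate: center and scale one spectrum by its own mean and SD."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 denominator)
    return [(x - m) / s for x in spectrum]


def msc(spectra, reference=None):
    """Multiplicative signal correction against a reference spectrum.

    Each spectrum is regressed on the reference (x ~ a + b * ref) by least
    squares and corrected as (x - a) / b. The mean spectrum is assumed as
    the reference when none is given.
    """
    if reference is None:
        reference = [mean(col) for col in zip(*spectra)]
    ref_mean = mean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for spec in spectra:
        spec_mean = mean(spec)
        # Least-squares slope (b) and intercept (a) of spec against reference.
        b = sum((r - ref_mean) * (x - spec_mean) for r, x in zip(reference, spec)) / ref_ss
        a = spec_mean - b * ref_mean
        corrected.append([(x - a) / b for x in spec])
    return corrected


# Hypothetical toy data: one underlying shape with different offsets and gains.
base = [0.10, 0.40, 0.90, 0.50, 0.20]
spectra = [[0.05 + 1.2 * v for v in base], [-0.02 + 0.8 * v for v in base]]
snv_spectra = [snv(s) for s in spectra]
msc_spectra = msc(spectra)
```

In this toy case the two spectra differ only by an additive offset and a multiplicative gain, so both corrections map them onto the same curve, which is the kind of physical variation these methods are meant to remove.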

Data processing and classification model

Statistical analyses were performed using XLSTAT Version 2023. A two-sample t-test was conducted on peak intensities of notable peaks observed from average patterns of spectra preprocessed with MSC + D2 + ND. PCA was employed to ascertain whether kimchi samples could be grouped based on FT-NIR spectral differences according to their geographical origin. Supervised chemometric techniques, including k-nearest neighbors (KNN), classification and regression tree (CART), support vector machine (SVM), naïve Bayes (NB), random forest (RF), and PLS-DA, were used to construct a classification model for predicting the geographical origin of kimchi. This model utilized both raw and 14 processed spectral datasets that were non-zero.

KNN, a non-parametric supervised method, has been frequently used in applications of FT-NIR to identify geographical origin14,15,17. KNN operates on the assumption that data points with similar characteristics tend to fall into the same category. This algorithm calculates the distances between a given data point to be classified and all training data points, and then selects the nearest training points, where the parameter “k” refers to the number of nearest neighbors considered. KNN assigns the given data point to the class that is most frequently represented among these nearest neighbors40. In this study, Euclidean distance was used to calculate similarities, and the automatic determination of the number of neighbors was adopted.
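The KNN decision rule described above can be sketched in a few lines of Python. This is a hedged illustration, not the XLSTAT implementation used in the study: the function name `knn_predict`, the two-feature toy data, and the fixed value k = 3 are assumptions (the study determined the number of neighbors automatically).

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point, sorted ascending.
    distances = sorted(
        (dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most frequent class among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature toy data standing in for spectral variables.
train_X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25), (0.9, 0.8), (0.8, 0.9), (0.85, 0.75)]
train_y = ["domestic", "domestic", "domestic", "imported", "imported", "imported"]
prediction = knn_predict(train_X, train_y, (0.2, 0.2), k=3)
```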

CART is a machine learning method for both classification and regression of data. CART classifies a given data point based on decision rules inferred from the data features. This algorithm constructs a flowchart-like tree structure consisting of nodes, branches, and depth. For classification, it employs the Gini index, which measures the probability of a randomly chosen data point being misclassified. CART calculates the Gini index for all combinations and recursively splits branches of the tree toward the lowest Gini index until a stopping criterion is met40. In this study, the tree parameters of minimum parent size, minimum son size, and maximum depth were set to 5, 2, and 3, respectively, and the complexity parameter CP was set to 0.001.
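To make the Gini-based splitting concrete, the sketch below computes the Gini impurity of a node and searches for the threshold on a single variable that gives the lowest weighted impurity. It assumes the standard impurity definition (one minus the sum of squared class proportions). The helper names and toy values are hypothetical, and the stopping criteria and tree parameters used in the study are not modeled.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(values, labels, threshold):
    """Weighted Gini impurity of the two child nodes for the rule value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


def best_threshold(values, labels):
    """Try midpoints between sorted unique values; return the lowest-impurity one."""
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    if not candidates:
        return None  # a single unique value cannot be split
    return min(candidates, key=lambda t: split_gini(values, labels, t))


# Hypothetical intensities of one spectral variable for six samples.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
labels = ["domestic"] * 3 + ["imported"] * 3
root_impurity = gini_impurity(labels)
threshold = best_threshold(values, labels)
```

A full tree repeats this search over all variables at each node and recurses into the child nodes until a stopping criterion is met.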

SVM is one of the most commonly applied algorithms for the identification of geographical origin based on spectral data13,14,16,33. The aim of the SVM algorithm is to find an optimal hyperplane in the N-dimensional space to classify given data points. Among the candidate hyperplanes, SVM selects as the decision boundary the one that maximizes the margin separating the classes, and it classifies a given data point according to the side of this boundary on which the point falls. The margin represents the distance between the decision boundary and the support vectors, which are the data points closest to the decision boundary. When the classes cannot be separated by a linear boundary, the kernel method is used to map the data into a higher-dimensional space in which the classes can be separated linearly33,40. In this study, the sequential minimal optimization parameters of C and tolerance were set to 1.0 and 0.001, respectively. Standardization was adopted in the preprocessing, and a linear kernel was used.

NB is a supervised machine learning algorithm. NB assigns the class of a given data point based on Bayes’ rule, which operates on conditional probability. This algorithm calculates the probability that a given data point belongs to each class and then classifies it into the class with the highest probability. NB may provide promising results by reducing the complexity of high-dimensional spectral data because of its assumption of statistical independence47. In this study, the smoothing parameter was set to 1.

RF is a supervised ensemble learning method that constructs a large number of decision trees, usually trained with the bagging method. The bagging method is based on the idea that a combination of learning models improves the overall result. RF consists of multiple unrelated trees: each individual tree provides a classification result, and the majority vote of the trees in the forest is used to assign the class of the given data. This algorithm is superior to a single decision tree in terms of model accuracy. In addition, this method does not incur a high computational cost for large datasets and offers information on important variables33. In this study, the bagging method (random with replacement) was used to grow trees, with the number of trees set to 50. For stop conditions, the construction time was set to 300, and convergence was set to 50. Tree parameters of minimum node size, minimum son size, and maximum depth were set to 2, 1, and 20, respectively. The CP was set to 0.000.

PLS-DA, a variant of PLS regression, is used when Y is categorical. This technique is one of the most widely used classification methods and extends and integrates the features of PCA and multiple regression17,48. The aim of this algorithm is to maximize the covariance between variables X and Y by finding a linear subspace of the explanatory variables. PLS-DA projects the given data onto extracted latent variables that maximize the separation between sample classes and finds the most relevant variance and correlation structure for class separation. Namely, this algorithm searches for a linear transformation of the data into a lower-dimensional space so that different classes of data are well separated33,40. PLS-DA may be more suitable for classification using spectral data than other methods, such as PCA, which only maximizes variance without considering class labels, and LDA, which is prone to multicollinearity because spectral datasets are highly correlated between neighboring bands.

A total of 150 raw or preprocessed spectra (30 kimchi products × 5 replicates) for each of the domestic and imported product groups were divided into two subsets: a training set (30 kimchi products × 4 of 5 replicates = 120 spectra) and a test set (30 kimchi products × 1 of 5 replicates = 30 spectra). The performances of the training and test sets for each model were evaluated by k-fold cross-validation (k = 4). Model performance was evaluated in terms of accuracy, recall, specificity, precision, error rate, and F1 score, and these metrics were calculated using Eqs. (1)–(5).
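The replicate-based partition described above can be sketched as follows for one product group. This is only an illustration: the source does not state which replicate of each product was assigned to the test set, so the random choice, the seed, and the function name `split_by_replicate` are assumptions.

```python
import random


def split_by_replicate(n_products=30, n_replicates=5, seed=0):
    """Hold out one randomly chosen replicate per product as the test set.

    Returns two lists of (product_index, replicate_index) pairs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for product in range(n_products):
        held_out = rng.randrange(n_replicates)  # assumed: chosen at random
        for replicate in range(n_replicates):
            target = test if replicate == held_out else train
            target.append((product, replicate))
    return train, test


# One product group: 30 products x 5 replicate spectra = 150 spectra.
train_ids, test_ids = split_by_replicate()
```

With the defaults, the training list holds 120 index pairs and the test list holds 30, matching the subset sizes given above for each group.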

$$\text{Accuracy }= \frac{TP +TN}{TP+TN+FP+FN}$$
(1)
$$\text{Recall}= \frac{TP}{TP+FN}$$
(2)
$$\text{Specificity}= \frac{TN}{TN+FP}$$
(3)
$$\text{Precision}= \frac{TP}{TP+FP}$$
(4)
$$\text{F}1\text{ score}= \frac{2\times Precision\times Recall}{Precision+Recall}$$
(5)

where TP denotes true positive; FP denotes false positive; TN represents true negative; and FN denotes false negative. The range of these metric values is from 0 to 1.

Namely, accuracy is the proportion of correctly predicted observations (TP + TN) to the total observations (TP + TN + FP + FN). This is useful for quickly measuring how correct a model is. Recall is the proportion of correctly predicted positive observations (TP) to all actual positives (TP + FN). Specificity is the proportion of correctly predicted negative observations (TN) to all actual negatives (TN + FP). Recall and specificity reflect how well a model classifies the target class (positives in recall and negatives in specificity) without missing any of its members. Precision is the proportion of correctly predicted positive observations (TP) to the total predicted positive observations (TP + FP). This shows how reliable a model is when it classifies observations as positive. The F1 score is the harmonic mean of precision and recall. It reflects a balance between the two metrics and is useful when both FP and FN must be considered.
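As a worked illustration of Eqs. (1)–(5), the following Python sketch computes the metrics from confusion-matrix counts. The function name and the example counts are hypothetical and are not results from this study. The error rate has no equation in the text, so it is assumed here to be 1 minus accuracy. Zero denominators are not handled.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eqs. (1)-(5) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                          # Eq. (1)
    recall = tp / (tp + fn)                               # Eq. (2)
    specificity = tn / (tn + fp)                          # Eq. (3)
    precision = tp / (tp + fp)                            # Eq. (4)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (5)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "error_rate": 1 - accuracy,  # assumed definition, not given in the text
    }


# Hypothetical counts for a 60-sample test (not taken from the study).
example = classification_metrics(tp=28, tn=27, fp=3, fn=2)
```

For these counts, accuracy is 55/60, recall is 28/30, specificity is 27/30, precision is 28/31, and the F1 score is 56/61.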

The purpose of preprocessing was to obtain a dataset in which certain signals are reduced to enhance the information reflecting the characteristics of the samples, ultimately improving model performance. However, during preprocessing, useful information about features of interest can be either revealed or reduced, which may have different effects on model performance depending on the chemometric algorithm18. Thus, these performance metrics were used to determine whether the preprocessing methods have a beneficial effect on model performance and which preprocessing methods are best suited for building robust models with each algorithm.