Optimal Predictive Modeling of Nonlinear Transformations: Innovative Applied Mathematics in an Artificial Intelligence System
1. Introduction
The non-negative matrix factorisation (NMF) method, which was first used in the paper [1], is a technique that allows a non-negative matrix $X \in \mathbb{R}_{+}^{m \times n}$ to be decomposed into two non-negative factors $W$ and $H$ such that
$$X \approx WH,$$
where $W$ is an $m \times r$ matrix with elements in $\mathbb{R}_{+}$ and $H$ is an $r \times n$ matrix with nonnegative elements. Here, $m$ represents the number of features, $n$ represents the number of observations or samples, while $r$ denotes the rank or dimensionality of the feature subspace of $X$.
Let $X = [x_1, x_2, \ldots, x_n]$ be a set of data. For any column vector $x_j$ of $X$, we can express it as
$$x_j \approx W h_j,$$
indicating that $x_j$ can be approximated by a linear combination of the basis vectors (i.e., all column vectors of $W$), with the components of $h_j$ (the $j$-th column of $H$) acting as the weight coefficients. Thus, $H$ is also known as the coefficient matrix, allowing each feature vector to be computed as a linear combination of the basis vectors.
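As a minimal MATLAB illustration of this interpretation (the dimensions and values below are arbitrary toy choices, not data from the paper), each column of $X$ can be rebuilt as a weighted sum of the columns of $W$:

```matlab
% Toy illustration of X ~ W*H: each column of X is approximated by a
% nonnegative linear combination of the basis vectors (columns of W).
rng(1);
m = 5; n = 8; r = 2;          % number of features, samples, and the rank
W = rand(m, r);               % nonnegative basis matrix
H = rand(r, n);               % nonnegative coefficient matrix
X = W * H;                    % here X is built so the factorization is exact

j = 3;                        % pick one sample (e.g., one document)
x_hat = W * H(:, j);          % linear combination with weights H(:, j)
fprintf('Reconstruction error for column %d: %.2e\n', j, norm(X(:, j) - x_hat));
```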
Originally proposed for feature extraction, NMF, owing to its nonnegative nature, is suitable for various non-negative data processing tasks, underpinned by strong theoretical foundations and interpretability [2]. In pattern recognition, dimensionality reduction is crucial for efficiently managing high-dimensional data and improving recognition accuracy. While the K-means algorithm is frequently used for clustering [3], its performance often degrades in high-dimensional spaces, underscoring the necessity for effective dimensionality reduction techniques.
In addition to pattern recognition, NMF has applications across various fields, including dimension reduction [4], image processing [5], speech processing [6] [7], spectral analysis [8] [9], DNA expression analysis [10] [11], microRNA-disease analysis [12] [13], social network analysis [14], text analysis [15], hyperspectral signal unmixing [16], and blind spectral unmixing [17] [18]. As data size increases, so does the demand for processing and storage. This has led researchers to develop new algorithms for linear dimensionality reduction (LDR), a data analysis approach that reduces the dimensionality of the data with minimal loss of information before analysis. Transforming the data is an effective means to improve the linear representation. The idea is to transform the data via a function $\phi$ to be learned, and to apply the factorization to $\phi(X)$ instead of $X$, i.e., $\phi(X) \approx WH$.
We assume that each data point $x_j$ is associated with a label $y_j$. In the most basic case, the labels are binary, i.e., they can take two values representing two classes. For multi-class cases, the labels can be represented using one-hot encoding [19]. A prominent example of a non-linear transformation applied to count data in document classification is the term frequency-inverse document frequency (TF-IDF) transformation [20] [21]. In this context, $X_{ij}$ denotes the frequency of word $i$ appearing in document $j$, and the TF-IDF transformation maps $X_{ij}$ to
$$X_{ij} \cdot \log\!\left(\frac{n}{n_i}\right),$$
where $n$ is the number of documents and $n_i$ is the number of documents in which word $i$ appears.
In [22] [23], document classification is achieved through NMF and KL-NMF. Additionally, [24] demonstrates the application of non-linear transformations to enhance the effectiveness of these classification methods. This motivates our exploration of improved nonlinear transformations, which we apply to various document datasets to enhance the results obtained in [25].
Another effective method is the probabilistic IDF [26], which modifies the basic IDF calculation to incorporate probabilities, ensuring that terms that do not appear in a document are still assigned meaningful weights. Jelinek-Mercer smoothing [27] [28], initially developed for language modeling, can also be adapted for TF-IDF by interpolating maximum likelihood estimates with background distributions.
Moreover, Katz [29] and Witten-Bell [30] smoothing techniques provide further enhancements by allowing for interpolation between observed and unobserved frequencies, improving robustness in document classification and language analysis tasks. We emphasize the use of these smoothing techniques to improve the performance of TF-IDF, so that the weights assigned to terms reflect their general importance across the corpus.
Outline and Contribution of the Paper
This paper proposes a new method for document classification. As detailed in Section 2, KL-ONMF serves as the reference for our method. We will smooth the TF-IDF and incorporate parameters (see Section 3), which we will optimize using the fminsearch function in MATLAB with a machine learning approach. We apply non-linear transformations in order to capture non-linear characteristics better. We will show that this new approach is much better than KL-ONMF on the original data for document classification (see Section 4).
2. Alternating Optimization for ONMF with the KL Divergence
In this section, we focus on Alternating Optimization for Orthogonal Nonnegative Matrix Factorization (ONMF) with the Kullback-Leibler (KL) divergence. The goal is to minimize the KL divergence between the given matrix $X$ and the product of two matrices $W$ and $H$, while enforcing an orthogonality constraint on $H$.
The optimization problem can be formulated as follows:
$$\min_{W \geq 0,\; H \geq 0} \; D_{\mathrm{KL}}(X \,\|\, WH) \quad \text{such that} \quad HH^{\top} = I_r, \qquad \text{where} \quad D_{\mathrm{KL}}(X \,\|\, Y) = \sum_{i,j} \Big( X_{ij} \log \frac{X_{ij}}{Y_{ij}} - X_{ij} + Y_{ij} \Big).$$
For ONMF utilizing the KL divergence, it is crucial that $X$ is component-wise nonnegative, reflecting the inherent properties of the KL divergence, which is only defined for non-negative matrices. This non-negativity is essential, as it ensures that the divergence measurement remains valid and interpretable in the context of probability distributions.
In contrast, ONMF using the Frobenius norm does not impose such restrictions, allowing for broader applications but potentially leading to less interpretable results in probabilistic terms. The choice of KL Divergence in this context emphasizes our focus on modeling non-negative data, making it particularly suitable for applications like document classification or image processing, where the underlying data naturally adheres to non-negativity constraints.
Moreover, the constraint $HH^{\top} = I_r$ ensures that the matrix $H$ retains orthogonality, which is vital for preserving the interpretability of the factors derived from the decomposition. This orthogonality condition aids in distinctly separating the components, allowing for clearer insights into the latent structures represented by the factors in $W$ and $H$.
This formulation highlights the methodological rigor behind ONMF with the KL divergence, ensuring that the results are both mathematically sound and practically applicable to real-world datasets. The update rules for $W$ and $H$ in ONMF with the KL divergence highlight the different optimization strategies compared to Frobenius ONMF, making KL-ONMF less sensitive to outliers and more focused on data points with smaller norms.
Algorithm 1 summarizes the alternating optimization scheme, known as KL-ONMF.
It is a simple yet effective and highly scalable alternating optimization algorithm, running in $O(r \cdot \mathrm{nnz}(X))$ operations, where $\mathrm{nnz}(X)$ is the number of non-zero entries in the data matrix and $r$ is the number of clusters. Applied to documents and hyperspectral images, KL-ONMF outperforms ONMF with the Frobenius norm by providing better clustering results, and its convergence time is smaller on average.
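For reference, the KL divergence objective that the alternating scheme decreases can be evaluated as in the following MATLAB sketch; it only evaluates the objective and does not reproduce the update rules of Algorithm 1:

```matlab
% Sketch: evaluate D_KL(X || W*H), the KL-ONMF objective, for nonnegative X.
% A small epsilon guards the logarithm and the division against zeros.
function d = kl_divergence(X, W, H)
    epsi = 1e-12;
    Y = W * H + epsi;
    T = X .* log((X + epsi) ./ Y) - X + Y;   % entries with X(i,j) = 0 contribute Y(i,j)
    d = sum(T(:));
end
```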
3. Optimal Predictive Modeling of Nonlinear Transformations
In this section, we develop smoothing techniques to enhance the calculations of Term Frequency-Inverse Document Frequency (TF-IDF). These techniques address limitations associated with raw term frequency counts, ensuring a more balanced representation of term importance.
3.1. TF-IDF Modeling
In [31], the authors define TF-IDF as a combination of two statistics: term frequency (TF) and inverse document frequency (IDF). The term frequency $\mathrm{tf}(t,d)$ quantifies how often term $t$ appears in document $d$ [32], and is expressed as:
$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}}, \qquad (1)$$
where $f_{t,d}$ is the raw count of the term in the document, and $\sum_{t' \in d} f_{t',d}$ is the total number of terms in document $d$. Inverse document frequency measures the amount of information provided by a word, indicating whether it is common or rare across all documents, and is defined as the inverse fraction, on a logarithmic scale, of the documents that contain the word. It is given by:
$$\mathrm{idf}(t,D) = \log \frac{N}{\left|\{d \in D : t \in d\}\right|}, \qquad (2)$$
where $N$ is the total number of documents in the corpus, and $\left|\{d \in D : t \in d\}\right|$ is the number of documents that contain the term $t$. This approach effectively balances the frequency and rarity of terms, making it a powerful tool for information retrieval and text analysis. In the next section, we will introduce smoothing techniques to enhance the TF-IDF representation, aiming to improve the robustness and performance of our classification method.
3.2. Smoothing Technique for TF-IDF in Optimal Predictive Modeling
To address the limitations of raw term frequency counts and ensure a more balanced representation of term importance, we introduce smoothing techniques in the calculation of both term frequency (TF) and inverse document frequency (IDF). The term frequency is recalculated with a smoothing term to prevent zero counts from disproportionately affecting the results. Similarly, the inverse document frequency is adjusted to ensure that terms common across many documents do not dominate the weighting.
The term frequency is computed as:
$$\mathrm{tf}_{\mathrm{s}}(t,d) = \frac{f_{t,d} + \mathrm{smooth}}{\sum_{t' \in d} f_{t',d} + \mathrm{smooth}}, \qquad (3)$$
where $\mathrm{smooth} = 1$ corresponds to Laplace smoothing or add-1 smoothing [33]. The inverse document frequency is calculated to measure the importance of the term across the document corpus:
$$\mathrm{idf}_{\mathrm{s}}(t,D) = \log \frac{N}{1 + n_t}, \qquad (4)$$
where $n_t$ represents the number of documents containing the term $t$. This formulation helps to mitigate the impact of terms that appear in a large number of documents, ensuring that the IDF value remains informative. The TF-IDF weighting of a term is therefore the product of its term frequency, Equation (3), and its inverse document frequency, Equation (4). It is defined by Equation (5):
$$\mathrm{tfidf}_{\mathrm{s}}(t,d,D) = \mathrm{tf}_{\mathrm{s}}(t,d) \cdot \mathrm{idf}_{\mathrm{s}}(t,D). \qquad (5)$$
In the following, we start with a matrix $X$ representing word occurrences, where each element $X_{ij}$ indicates how many times word $i$ appears in document $j$. The algorithm computes the term frequency (TF) for each word in a document, reflecting its relative importance, and the inverse document frequency (IDF), which measures how common or rare a word is across all documents. The resulting TF-IDF scores highlight the significance of words in the context of the documents. The detailed algorithm is described below.
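A minimal MATLAB sketch of this computation is given below; it is an illustrative stand-in that assumes the smoothed forms of Equations (3) and (4) above, not a verbatim reproduction of the paper's Algorithm:

```matlab
% Sketch: smoothed TF-IDF from a word-count matrix X (m words x n documents).
function S = tfidf_smooth(X, smooth)
    N   = size(X, 2);                            % total number of documents
    TF  = (X + smooth) ./ (sum(X, 1) + smooth);  % smoothed term frequency, per document
    nt  = sum(X > 0, 2);                         % documents containing each word
    IDF = log(N ./ (1 + nt));                    % smoothed inverse document frequency
    S   = TF .* IDF;                             % TF-IDF scores (m x n)
end
```

For example, S = tfidf_smooth(X, 1) applies add-1 (Laplace) smoothing.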
3.3. Parameterization of TF-IDF Score with Custom Adjustments
To further refine the TF-IDF score, we introduce adjustable parameters $a$, $b$, and $c$. These parameters enable customization based on the dataset’s characteristics, thereby enhancing the overall performance of clustering algorithms. The TF-IDF score incorporating the parameters to be optimized can be expressed as follows:
(6)
We note that the classic TF-IDF score is recovered for particular settings of these parameters. In our new model, each parameter plays a crucial role in fine-tuning the model to capture the underlying patterns in the data: parameter $a$ scales the contribution of the transformed term frequencies. By adjusting $a$, we can amplify the significance of certain terms based on their frequency, ensuring that important terms are more prominently represented in the final feature set. Parameter $b$ serves as the exponent applied to the term frequency matrix $\mathrm{TF}$. This exponentiation can accentuate the differences among term frequencies, effectively emphasizing more frequent terms while diminishing the influence of less common ones. Finally, parameter $c$ adjusts the influence of the original TF-IDF scores.
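Since the exact expression of Equation (6) is not reproduced here, the following MATLAB fragment is only one plausible reading of the description above ($a$ scales the exponentiated term frequencies, $b$ is the exponent, $c$ weights the original TF-IDF scores); it is an assumption, not the paper's definitive formula:

```matlab
% Hypothetical parameterized score consistent with the textual description of
% Equation (6): 'a' scales TF.^b and 'c' weights the original TF-IDF scores.
function S = parameterized_score(TF, TFIDF, a, b, c)
    S = a .* (TF .^ b) + c .* TFIDF;
end
```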
3.4. Innovative Nonlinear Transformations: Boosting Predictive Performance in AI Applications
In this section, we explore three nonlinear transformations that enhance the model’s ability to capture complex relationships within data. These transformations are inspired by principles from neural networks and include the logarithm function, the square root function, and the hyperbolic tangent function. Each transformation serves a specific purpose in refining the feature representation of TF-IDF scores.
Logarithm Function
We prefer the logarithm function due to its effectiveness in reducing skewness in data distributions. Many real-world datasets exhibit right skewness; applying a logarithmic transformation can render the data more symmetric, thereby improving the performance of clustering algorithms. The logarithmic transformation can be expressed as:
$$T_{\log}(s) = \log(1 + s), \qquad (7)$$
where $s$ denotes a (parameterized) TF-IDF score.
Moreover, logarithmic scaling emphasizes smaller differences in low-value ranges, which is particularly beneficial for TF-IDF scores. This allows for a more nuanced understanding of infrequent terms that may still hold significant semantic weight. Additionally, by compressing the range of values, logarithmic transformations mitigate the influence of outliers, making clustering algorithms less sensitive to extreme values.
Square Root Function
The square root function is particularly advantageous for processing non-negative TF-IDF scores, as it emphasizes larger values while preserving the non-negativity of the data. This property is crucial for highlighting important terms that appear more frequently within the dataset, thereby amplifying their significance in the analysis. We use the expression given by:
$$T_{\mathrm{sqrt}}(s) = \sqrt{s}. \qquad (8)$$
Moreover, the square root transformation effectively mitigates the impact of extremely high values, enabling clustering algorithms to concentrate on the overall structure of the data rather than being skewed by a few dominant terms. Since many clustering methods rely on distance measures, such as Euclidean distance, the square root transformation helps to maintain the relative distances between data points. This results in more accurate and meaningful clustering outcomes, enhancing the model’s ability to identify distinct groups within the data.
Hyperbolic Tangent Function
The hyperbolic tangent function outputs values in the range of −1 to 1, which is particularly beneficial for clustering algorithms that assume centered data. This symmetry facilitates the model’s ability to learn balanced representations of both positive and negative influences within the dataset.
$$T_{\tanh}(s) = \tanh(s) = \frac{e^{s} - e^{-s}}{e^{s} + e^{-s}}. \qquad (9)$$
Additionally, the tanh function introduces strong nonlinearity, which aids in capturing complex relationships among data points. This capability is crucial in clustering, where relationships may not be linearly separable. In the context of neural networks, the tanh function can also lead to more efficient gradients during backpropagation, potentially improving convergence rates in learning models.
In summary, we have chosen these transformations because they introduce nonlinearity into the model, allowing it to better learn from the underlying patterns in the data. The effectiveness of these nonlinear transformations is further enhanced by the previously optimized parameters $a$, $b$, and $c$. By refining the representation of term importance through parameter tuning, we ensure that these transformations operate on the optimized TF-IDF scores, resulting in a more relevant and expressive feature space. This integrated approach—comprising smoothing, parameter optimization, and nonlinear transformations—aims to enhance clustering accuracy and improve document classification and retrieval outcomes.
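The three candidate transformations can be applied to a matrix of (parameterized) TF-IDF scores as in the short MATLAB sketch below; the $\log(1+s)$ form is an assumption made so that the transform stays defined at zero scores:

```matlab
% Apply the three candidate nonlinear transformations to a nonnegative score matrix S.
S = rand(5, 8);           % placeholder TF-IDF score matrix (illustrative values)
S_log  = log(1 + S);      % Equation (7): compresses large values, reduces skewness
S_sqrt = sqrt(S);         % Equation (8): preserves nonnegativity, dampens extremes
S_tanh = tanh(S);         % Equation (9): squashes nonnegative scores into [0, 1)
```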
3.5. Algorithm: Optimizing TF-IDF Score with Smoothing Techniques
Algorithm 2 summarizes our scheme for optimizing the TF-IDF scores, whose output is then clustered with KL-ONMF. The index $i$ refers to the terms in the TF-IDF vector (ranging from 1 to $m$), and $j$ is the index of the documents (for each term $i$, $j$ ranges over the documents).
Algorithm 2 is designed to enhance the traditional TF-IDF scoring method by incorporating smoothing and nonlinear transformations. It takes as input a vector of TF-IDF scores along with several parameters, including the smoothing value, the scaling coefficient $a$, the exponent $b$, and the adjustment factor $c$, with the goal of producing an optimized vector of TF-IDF scores and the corresponding cluster labels.
3.6. Parameter Optimization Using Regularization
To identify optimal values for the parameters $b$ and $c$, we leverage the fminsearch function in MATLAB, which performs unconstrained optimization. During this process, the parameter $a$ is fixed at a baseline value of 0.0001. Initial values for $b$ and $c$ are generated randomly within a defined range of 0 to 2, enabling a thorough exploration of the parameter space and facilitating the discovery of improved configurations.
In our optimization function, we incorporate a regularization term to mitigate the risk of overfitting. This term penalizes excessively large values of the parameters $b$ and $c$ by introducing a quadratic penalty, represented as $\lambda \sum_{i=1}^{p} \theta_i^2$, where $p$ denotes the number of parameters. Specifically, this is an L2 regularization [34] component that helps to prevent overfitting by penalizing large parameter values, with each $\theta_i$ representing a parameter being optimized. By doing so, we encourage the model to prioritize simpler configurations that generalize better to unseen data. The inclusion of this regularization term helps maintain a balance between fitting the training data and preserving the model’s ability to perform well on validation sets, ultimately leading to more robust and reliable results.
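A minimal MATLAB sketch of this setup is shown below; cluster_loss is a hypothetical placeholder for the data-fit term actually used in the paper, and lambda is an illustrative regularization strength:

```matlab
% Sketch: optimizing b and c with fminsearch and an L2 penalty (Section 3.6).
cluster_loss = @(a, b, c) (b - 1)^2 + (c - 1)^2;   % placeholder loss; replace with the real clustering error
a      = 1e-4;                                     % fixed baseline value
lambda = 0.1;                                      % illustrative regularization strength
theta0 = round(2 * rand(1, 2), 3);                 % random init of [b, c] in [0, 2], 3 decimals

objective = @(theta) cluster_loss(a, theta(1), theta(2)) ...
                     + lambda * sum(theta .^ 2);   % L2 regularization term
theta_opt = fminsearch(objective, theta0);         % Nelder-Mead search
b = theta_opt(1);  c = theta_opt(2);
```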
3.7. Data Separation
In order to evaluate the robustness of the optimized methodology, it is essential to separate the data into training and validation sets. This separation is crucial for assessing the model’s ability to generalize to unseen data and is performed by allocating 80% of the documents to the training set and 20% to the validation set.
The 80/20 split is strategically chosen as it strikes a balance between providing ample data for training the model and retaining a substantial portion for evaluation. This division ensures that the model can effectively learn from the training data while maintaining the capability to accurately assess its performance on independent data. Such a balance is crucial for developing a robust model that can generalize well to new, unseen instances.
In this context, “training” refers to the phase in which the model actively learns from the training set, adjusting its parameters to minimize prediction errors among terms and their corresponding clusters. Conversely, “validation” is the process of evaluating the model’s performance on unseen data, which is essential for understanding its generalization capabilities. This two-phase approach allows us to ensure that the model is not only fitting the training data well but is also capable of making accurate predictions in real-world scenarios.
Furthermore, the use of labels in this unsupervised task is limited strictly to evaluation purposes. During the training phase, labels are not utilized; instead, they serve as a benchmark for measuring the accuracy and effectiveness of the model’s clustering outcomes. This methodology emphasizes our commitment to an unbiased evaluation process, ensuring that the model’s performance is assessed based solely on its ability to identify and group similar documents without any prior knowledge of their labels.
To enhance the evaluation process, we also employ cross-validation. Specifically, we set the number of folds for cross-validation to num_folds and the number of repetitions to num_repetitions.
Cross-validation is a resampling technique that helps in assessing how the results of a statistical analysis will generalize to an independent dataset. By splitting the training set into several subsets (folds), we can train the model on a portion of the data while validating it on the remaining parts. This method is repeated for a specified number of folds, providing a more comprehensive evaluation of model performance.
1) Number of Folds: Setting num_folds = 5 means that the training data will be split into five subsets. In each fold, the model is trained on four of the subsets and validated on the remaining one, and this process is repeated so that every subset serves as the validation fold once.
2) Number of Repetitions: With num_repetitions = 5, the entire cross-validation process will be repeated five times, ensuring that the model’s performance is robust and less sensitive to the specific data split.
By using cross-validation, we can achieve a more reliable assessment of our model’s performance, reducing the risk of overfitting and ensuring that the enhancements made to the TF-IDF calculations yield tangible benefits in practical applications.
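The data separation and repeated k-fold indexing can be sketched in MATLAB as follows, using plain index arithmetic so that no toolbox is required (the sizes and variable names are illustrative):

```matlab
% Sketch: 80/20 hold-out split followed by repeated 5-fold cross-validation on
% the training portion. Only indices are produced; labels are used for evaluation only.
n_docs          = 200;                 % number of documents (use size(X, 2) in practice)
num_folds       = 5;
num_repetitions = 5;

perm      = randperm(n_docs);
n_train   = round(0.8 * n_docs);
train_idx = perm(1:n_train);           % 80% training
valid_idx = perm(n_train+1:end);       % 20% validation

for rep = 1:num_repetitions
    fold_of = mod(randperm(n_train), num_folds) + 1;  % assign each training doc to a fold
    for k = 1:num_folds
        cv_valid = train_idx(fold_of == k);           % held-out fold
        cv_train = train_idx(fold_of ~= k);           % remaining folds
        % ... fit on cv_train, evaluate on cv_valid ...
    end
end
```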
4. Numerical Experiments
In this section, we use nonlinear functions for document clustering, and we compare the performance of KL-ONMF (the algorithm developed in [25]) applied to the original data with that of KL-ONMF applied to the data transformed with Algorithm 2. All experiments were run on a laptop with an 11th Gen Intel® Core™ i5-1135G7 @ 2.40 GHz × 8 and 16.0 GB of RAM.
Initialization:
Similar to k-means, ONMF algorithms can be initialized in a variety of ways. We take the approach suggested in [35] to simplify the presentation. The successive nonnegative projection algorithm (SNPA) [36] is used in this method to initialize $W$. It selects a subset of $r$ columns of $X$ that represent well-spread data points of the dataset.
To refine the parameters during the optimization phase, we utilize the Nelder-Mead optimization algorithm [37]. For this algorithm, we initialize the parameters by generating random values. The Nelder-Mead algorithm iteratively refines these parameters based on the objective function (Equation (10)), which in our case is designed to minimize the difference between the transformed data and the original data, while also incorporating a regularization term. By combining the SNPA for the initialization of $W$ and the Nelder-Mead method for parameter optimization, we aim to achieve robust and efficient convergence in the ONMF framework.
Parameterization of Functions:
To minimize the error rate of the objective functions and enhance accuracy, we begin by setting a fixed value for the parameter $a$ at 0.0001. This choice ensures consistency across our optimization process, allowing us to focus on adjusting the other parameters, $b$ and $c$. The values for $b$ and $c$ are randomly initialized (see Section 3.6) and rounded to three decimal places to maintain precision. This randomness introduces variability in the initial conditions, which can help the optimization algorithm explore different configurations effectively. To find the optimal values of $b$ and $c$, we employ the fminsearch algorithm, which utilizes the Nelder-Mead method. This optimization technique is particularly beneficial for multidimensional problems where derivatives may not be easily computable. It iteratively adjusts the parameters based on the evaluation of the objective function, seeking to minimize the error rate. By systematically refining the values of $b$ and $c$, the algorithm converges towards a locally optimal parameterization that improves the accuracy of the model’s predictions. Overall, this approach ensures a robust framework for optimizing the model’s performance based on the given dataset.
Objective Function:
The objective function aims to minimize the sum of the squared deviations between the calculated and observed outflows. This approach ensures that the model’s predictions closely align with the actual data. Let $\theta = (b, c)$ be the set of parameters, where $b$ represents the first coefficient influencing the model’s predictions and $c$ represents the second coefficient. The parameters in $\theta$ are crucial for minimizing the objective while adhering to the constraints that require both $b$ and $c$ to be non-negative. This ensures that the model parameters remain within a physically meaningful range, ultimately leading to a more accurate representation of the relationship between the inputs and outputs.
Mathematically, the objective function can be expressed as:
$$F(\theta) = \left\| X - \hat{X}(\theta) \right\|_F^2 + \lambda \sum_{i} \theta_i^2, \qquad (10)$$
where $F(\theta)$ measures how well the model’s predictions (calculated outflows) match the actual observed outflows. The goal is to find the parameters $b$ and $c$ that minimize this function. The Frobenius norm $\|\cdot\|_F$ quantifies the difference between the predicted and actual values. Here, $X$ represents the observed outflows, while $\hat{X}(\theta)$ denotes the calculated outflows based on the model and the parameters $b$ and $c$. The calculated outflows can be expressed as:
$$\hat{X}(\theta) = W H(\theta), \qquad (11)$$
where $W$ is the matrix of features and $H(\theta)$ is the matrix of coefficients derived from the parameters $b$ and $c$. The Frobenius norm computes the squared difference between these two sets of values, aggregating the errors across all observations. To discourage invalid parameter values, a penalty term is added, ensuring that only non-negative parameters are considered in the optimization process. To prevent overfitting by penalizing large values of the parameters, we add a regularization coefficient $\lambda$ that controls the strength of this penalty, balancing the fit of the model to the data against the complexity of the model.
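A hedged MATLAB sketch of this objective is given below; H_of_theta is a hypothetical handle that builds the coefficient matrix from theta = [b, c], since the exact construction is not reproduced here:

```matlab
% Sketch of Equations (10)-(11): Frobenius data fit, L2 regularization, and a
% penalty that discourages negative parameter values.
function f = objective_fn(theta, X, W, H_of_theta, lambda)
    X_hat = W * H_of_theta(theta);              % Equation (11): calculated outflows
    fit   = norm(X - X_hat, 'fro')^2;           % squared Frobenius data-fit term
    reg   = lambda * sum(theta .^ 2);           % L2 regularization
    pen   = 1e6 * sum(max(0, -theta) .^ 2);     % penalty for negative parameters
    f     = fit + reg + pen;                    % Equation (10) plus the penalty term
end
```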
Document Data Sets:
We cluster the 14 document data sets from [38] using KL-ONMF. Table 1 provides not only the names of the data sets but also their dimensions (where $m$ represents the number of words and $n$ the number of documents) and the number of clusters $r$.
Table 1. Summary of text datasets (for each dataset, $n$ is the total number of documents, $m$ the total number of words, $r$ the number of classes, $n/r$ the average number of documents per class, and Balance the size ratio of the smallest class to the largest class).
| Data | $n$ (documents) | $m$ (words) | $r$ (classes) | $n/r$ | Balance |
| --- | --- | --- | --- | --- | --- |
| ng3sim | 2998 | 15,810 | 3 | | |
| classic | 7094 | 41,681 | 4 | | |
| ohscal | 11,162 | 11,465 | 10 | 1116 | 0.437 |
| k1b | 2340 | 21,839 | 6 | 390 | 0.043 |
| hitech | 2301 | 10,080 | 6 | 384 | 0.192 |
| reviews | 4069 | 18,483 | 5 | 814 | 0.098 |
| sports | 8580 | 14,870 | 7 | 1226 | 0.036 |
| la1 | 3204 | 31,472 | 6 | 534 | 0.290 |
| la12 | 6279 | 31,472 | 6 | 1047 | 0.282 |
| la2 | 3075 | 31,472 | 6 | 513 | 0.274 |
| tr11 | 414 | 6429 | 9 | 46 | 0.046 |
| tr23 | 204 | 5832 | 6 | 34 | 0.066 |
| tr41 | 878 | 7454 | 10 | 88 | 0.037 |
| tr45 | 690 | 8261 | 10 | 69 | 0.088 |
Accuracy:
Accuracy is defined as the proportion of correctly classified data points relative to the total number of data points. In the setting of document clustering, the accuracy of a computed disjoint clustering $\{C_1, \ldots, C_r\}$ compared to the true disjoint clusters $\{C_1^{*}, \ldots, C_r^{*}\}$ is defined as follows:
$$\mathrm{Accuracy} = \max_{\pi \in \Pi(r)} \; \frac{1}{n} \sum_{k=1}^{r} \left| C_k \cap C_{\pi(k)}^{*} \right|, \qquad (12)$$
where $\Pi(r)$ denotes the set of permutations of $\{1, \ldots, r\}$ and $n$ is the total number of documents.
The correctly classified data points are those assigned to the same cluster as in the ground-truth partition. The clustering process was carried out using the K-means algorithm.
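A MATLAB sketch of Equation (12) is shown below; it enumerates all label permutations explicitly, which is practical only for the small numbers of clusters considered here:

```matlab
% Sketch: clustering accuracy with the best one-to-one matching of cluster labels.
function acc = clustering_accuracy(pred, truth, r)
    n    = numel(truth);
    P    = perms(1:r);                 % all permutations of the r cluster labels
    best = 0;
    for p = 1:size(P, 1)
        mapped = P(p, pred);           % relabel the predicted clusters
        best   = max(best, sum(mapped(:) == truth(:)));
    end
    acc = best / n;
end
```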
Running time:
Table 2 summarizes the execution times associated with the objective function evaluated through the Nelder-Mead optimization algorithm across various datasets.
Table 2. Execution times of datasets.
| Dataset | Min Time (s) | Max Time (s) | Mean Time (s) |
| --- | --- | --- | --- |
| ng3sim | 1.40 | 2.57 | 1.84 |
| classic | 172.11 | 215.82 | 193.40 |
| ohscal | 4.45 | 31.01 | 13.46 |
| k1b | 30.78 | 31.28 | 31.05 |
| hitech | 1.58 | 1.80 | 1.65 |
| reviews | 4.60 | 48.19 | 19.15 |
| sports | 3.96 | 6.90 | 5.48 |
| la1 | 2.56 | 2.78 | 2.66 |
| la12 | 4.39 | 8.58 | 6.30 |
| la2 | 2.12 | 55.91 | 20.07 |
| tr11 | 0.91 | 1.18 | 1.01 |
| tr23 | 0.73 | 0.75 | 0.74 |
| tr41 | 1.14 | 1.50 | 1.31 |
| tr45 | 1.53 | 1.67 | 1.60 |
The results indicate substantial variability in processing times, with the “classic” dataset exhibiting the highest execution times, ranging from 172.11 to 215.82 seconds. This suggests that this dataset may present more complex optimization challenges or larger data sizes. Conversely, datasets like “ng3sim” and “hitech” show significantly lower mean execution times of 1.84 and 1.65 seconds, respectively, indicating efficient convergence during optimization. The “la2” dataset, with a maximum time of 55.91 seconds, points to potential outliers that may require further analysis to understand the underlying causes of increased processing time. Meanwhile, datasets such as “tr11,” “tr23,” “tr41,” and “tr45” demonstrate consistent execution times, which may reflect similar optimization landscapes. Overall, these execution times underscore the impact of dataset characteristics on the performance of the Nelder-Mead optimization algorithm, highlighting opportunities for further refinement in handling more computationally intensive datasets.
Optimal Parameters
In this section, we present the optimal parameters identified for each dataset through cross-validation. The table below summarizes these parameters, which play a crucial role in enhancing the performance of our model. The first column of the table lists the datasets. To ensure stability and prevent excessive fluctuations during optimization, we fixed the parameter $a$ at 0.0001 rather than allowing it to be optimized alongside $b$ and $c$. This choice is guided by the observation that large variations in $a$ can lead to significant distortions in the TF-IDF scores, potentially overshadowing the contributions of the other parameters. By maintaining $a$ at a low constant value, we provide a baseline influence that stabilizes the model while allowing $b$ and $c$ to be fine-tuned for better performance.
In a sensitivity analysis, we examined the effects of varying $a$ around its fixed value. The results indicated that small deviations from 0.0001 did not substantially alter the overall performance of the clustering algorithms. However, larger adjustments (e.g., setting $a$ to 0.01 or higher) resulted in noticeable degradation in clustering quality, highlighting the importance of keeping $a$ within a narrow range. This analysis supports our decision to fix $a$ at 0.0001, ensuring that the model remains robust while still allowing $b$ and $c$ the flexibility needed to adapt to specific dataset characteristics. The remaining columns represent the optimized parameters $b$ and $c$, which vary for each dataset. The parameter $b$ influences the model’s responsiveness to changes in the data, while $c$ affects the overall scaling of the model’s output. Finally, the “Best Smooth” (BS) column indicates the optimal smoothing value obtained during cross-validation, which reflects the model’s performance in terms of clustering accuracy. The varying values in this column highlight how different datasets require tailored parameter settings to achieve the best results.
Table 3. Optimal parameters for transformations.
| Dataset | Log: $b$ | Log: $c$ | Log: BS | Sq: $b$ | Sq: $c$ | Sq: BS | HTan: $b$ | HTan: $c$ | HTan: BS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ng3sim | 2.7809 | 9.1445 | 50.1700 | 2.7951 | 9.1290 | 7.0700 | 2.3249 | 9.1467 | 18.3700 |
| classic | 0.0682 | 1.0919 | 25.0000 | 0.0682 | 1.0919 | 15.0000 | 0.0682 | 1.0918 | 15.0000 |
| ohscal | 3.0457 | 16.4311 | 0.1700 | 2.4546 | 16.5128 | 25.0000 | 3.6363 | 16.3386 | 17.0300 |
| k1b | 2.2467 | 21.8494 | 16.7300 | 1.6804 | 22.9310 | 0.1000 | 2.7789 | 22.8170 | 0.8700 |
| hitech | 3.1803 | 23.7720 | 34.0000 | 3.1843 | 23.8005 | 8.5700 | 2.6590 | 23.7885 | 0.2000 |
| reviews | 3.0342 | 27.0628 | 66.6700 | 3.0326 | 27.0573 | 13.3700 | 3.0317 | 27.0648 | 83.3300 |
| sports | 2.5719 | 20.4878 | 8.5000 | 2.5922 | 20.4682 | 35.0000 | 3.0420 | 20.4616 | 66.6700 |
| la1 | 2.7167 | 17.7621 | 0.4000 | 3.2277 | 17.6894 | 46.6700 | 2.2066 | 17.8179 | 7.1700 |
| la12 | 2.2671 | 18.6777 | 16.7000 | 2.7673 | 18.6036 | 8.3700 | 2.7632 | 18.6087 | 41.6700 |
| la2 | 2.8408 | 17.7540 | 23.3400 | 3.3745 | 17.6619 | 1.7000 | 2.8439 | 17.7351 | 0.2300 |
| tr11 | 2.4381 | 5.7950 | 50.6700 | 1.2637 | 5.3720 | 40.0300 | 2.4959 | 6.1245 | 8.3700 |
| tr23 | 1.8971 | 3.3695 | 0.1000 | 2.2632 | 3.5174 | 0.1000 | 2.2537 | 3.5558 | 0.1000 |
| tr41 | 2.5543 | 16.7309 | 13.3700 | 2.2009 | 16.4067 | 0.7300 | 2.1465 | 15.7258 | 18.3700 |
| tr45 | 2.3656 | 9.2114 | 16.7300 | 2.3873 | 8.8119 | 0.1000 | 1.5483 | 8.2574 | 17.6700 |
Table 3 presents parameter values for the three nonlinear transformations: the logarithmic transformation (Equation (7)), the square root transformation (Equation (8)), and the hyperbolic tangent transformation (Equation (9)). Even though these methods use different mathematical functions, the values of the parameters $b$ and $c$ are very similar across all datasets. For the logarithmic transformation, $b$ ranges from 0.068 to 3.180 and $c$ from 1.091 to 27.062, with “Best Smooth” values varying from 0.10 to 66.67. For the square root transformation, $b$ remains within a similar range, from 0.068 to 3.374, $c$ ranges from 1.091 to 27.057, and the “Best Smooth” values lie between 0.10 and 46.67. For the hyperbolic tangent transformation, $b$ also falls within the same interval, from 0.068 to 3.636, $c$ ranges from 1.091 to 27.064, and the “Best Smooth” values vary from 0.10 to 83.33. These intervals demonstrate that, although the transformations employ different nonlinear functions, the parameters $b$ and $c$ exhibit significant similarities, suggesting that these transformations may be well-suited for similar types of data. The choice of a fixed value for $a$ allows for a clearer analysis of the impact of $b$ and $c$ on the results, minimizing the effects of variations in $a$.
Performance Evaluation:
We present the accuracy achieved by KL-ONMF on both the original and transformed data across different datasets. The accuracy metrics are denoted as follows: AcKL-ONMF for the accuracy of KL-ONMF on the original data, Best AcKL-ONMF (Train) for the accuracy on the training dataset, and Best AcKL-ONMF (Validation) for the accuracy on the validation dataset.
The distinction between “training” and “validation” phases is critical in our approach, particularly given the nature of the KL-ONMF method and the transformations applied to the data. During the training phase, the model focuses on learning complex patterns and relationships within the feature space generated from the transformed TF-IDF scores. This phase relies heavily on the training data, which plays a significant role in adjusting the model’s parameters. Conversely, the validation phase serves an essential purpose by preventing the model from overfitting to the training data. Evaluating the model on a separate validation set allows us to gauge its ability to generalize to new, unseen data—an aspect that is vital for practical applications. In our methodology, where TF-IDF scores are subjected to various transformations, validating performance on data that the model has not previously encountered becomes particularly important. This validation helps us assess the effectiveness of the smoothing and non-linear transformations applied, ensuring they genuinely enhance the model’s clustering and classification capabilities rather than simply memorizing the training data. Thus, the clear separation of training and validation phases in our approach not only bolsters the reliability of the model’s performance but also underscores its robustness in real-world scenarios.
Table 4. KL-ONMF vs. KL-ONMF Transformed for the clustering of 14 document data sets: Cross-Validation with Repetitions for Logarithm Transformation.
| Dataset | $n$ (docs) | $m$ (words) | $r$ (classes) | AcKL-ONMF | Best AcKL-ONMF (Train) | Best AcKL-ONMF (Validation) |
| --- | --- | --- | --- | --- | --- | --- |
| ng3sim | 2998 | 15,810 | 3 | 71.45 | 74.61 | 74.12 |
| classic | 7094 | 41,681 | 4 | 85.38 | 91.99 | 85.49 |
| ohscal | 11,162 | 11,465 | 10 | 47.74 | 51.69 | 51.20 |
| k1b | 2340 | 21,819 | 6 | 58.46 | 60.94 | 60.53 |
| hitech | 2301 | 10,080 | 6 | 38.55 | 37.81 | 37.27 |
| reviews | 4069 | 18,483 | 5 | 72.35 | 72.12 | 71.84 |
| sports | 8580 | 14,870 | 7 | 66.55 | 67.19 | 65.45 |
| la1 | 3204 | 31,472 | 6 | 61.20 | 63.10 | 61.66 |
| la12 | 6279 | 31,472 | 6 | 67.48 | 63.10 | 61.75 |
| la2 | 3075 | 31,472 | 6 | 59.90 | 63.91 | 61.95 |
| tr11 | 414 | 6424 | 9 | 54.11 | 47.83 | 44.85 |
| tr23 | 204 | 5831 | 6 | 34.31 | 43.14 | 41.83 |
| tr41 | 878 | 7453 | 10 | 48.63 | 50.72 | 49.24 |
| tr45 | 690 | 8261 | 10 | 59.57 | 60.39 | 58.89 |
| Averages | | | | 58.98 | 60.61 | 59.01 |
Table 4 compares the performance of the AcKL-ONMF method applied to both untransformed data and smoothed, non-linear transformations. The metrics include baseline accuracy, Best AcKL-ONMF accuracy for training, and validation sets across various datasets.
Notably, the Best AcKL-ONMF (Train) values outperform the baseline accuracy on most datasets, indicating that the method effectively learns features from the transformed data. For instance, ng3sim shows a significant increase from 71.45% to 74.61%, demonstrating enhanced model performance through the application of smoothing and non-linear transformations.
When examining the Best AcKL-ONMF (Validation) results, there is a clear trend of improvements in accuracy on unseen data. Datasets such as classic show a rise from 85.38% to 91.99% in training, with a validation accuracy of 85.49%, illustrating the model’s ability to generalize effectively.
The average accuracies further validate our approach. The training set achieves an average of 60.61%, while the validation set maintains competitive performance at 59.01%. This consistency across both training and validation sets suggests that the AcKL-ONMF method not only optimizes the model parameters well but also ensures robust performance on unseen data.
Table 5. KL-ONMF vs. KL-ONMF Transformed for the clustering of 14 document data sets: Cross-Validation with Repetitions for the Square Root Transformation.
| Dataset | $n$ (docs) | $m$ (words) | $r$ (classes) | AcKL-ONMF | Best AcKL-ONMF (Train) | Best AcKL-ONMF (Validation) |
| --- | --- | --- | --- | --- | --- | --- |
| ng3sim | 2998 | 15,810 | 3 | 71.45 | 73.85 | 72.98 |
| classic | 7094 | 41,681 | 4 | 85.38 | 92.92 | 92.46 |
| ohscal | 11,162 | 11,465 | 10 | 47.74 | 51.74 | 51.18 |
| k1b | 2340 | 21,819 | 6 | 58.46 | 56.17 | 55.98 |
| hitech | 2301 | 10,080 | 6 | 38.55 | 38.06 | 37.87 |
| reviews | 4069 | 18,483 | 5 | 72.35 | 69.14 | 68.15 |
| sports | 8580 | 14,870 | 7 | 66.55 | 68.56 | 68.24 |
| la1 | 3204 | 31,472 | 6 | 61.20 | 61.49 | 60.21 |
| la12 | 6279 | 31,472 | 6 | 67.48 | 60.05 | 57.47 |
| la2 | 3075 | 31,472 | 6 | 59.90 | 62.12 | 60.11 |
| tr11 | 414 | 6424 | 9 | 54.11 | 45.65 | 44.85 |
| tr23 | 204 | 5831 | 6 | 34.31 | 40.52 | 41.01 |
| tr41 | 878 | 7453 | 10 | 48.63 | 50.30 | 48.71 |
| tr45 | 690 | 8261 | 10 | 59.57 | 60.77 | 59.47 |
| Averages | | | | 58.98 | 59.38 | 58.48 |
Table 5 compares the performance of the AcKL-ONMF method applied to both untransformed data and smoothed, non-linear transformations. The metrics include baseline accuracy, Best AcKL-ONMF accuracy for training, and validation sets across various datasets.
Notably, the Best AcKL-ONMF (Train) values outperform the baseline accuracy on many datasets, indicating that the method effectively learns features from the transformed data. For instance, ng3sim shows an increase from 71.45% to 73.85%, demonstrating enhanced model performance through the application of smoothing and non-linear transformations.
When examining the Best AcKL-ONMF (Validation) results, there is a clear trend of improvements in accuracy on unseen data. Datasets such as classic show a rise from 85.38% to 92.92% in training, with a validation accuracy of 92.46%, illustrating the model’s ability to generalize effectively.
The average accuracies further validate our approach. The training set achieves an average of 59.38%, while the validation set maintains competitive performance at 58.48%. This consistency across both training and validation sets suggests that the AcKL-ONMF method not only optimizes the model parameters well but also ensures robust performance on unseen data.
Table 6. KL-ONMF vs. KL-ONMF Transformed for the clustering of 14 document data sets: Cross-Validation with Repetitions for Hyperbolic Tangent Transformation.
| Dataset | $n$ (docs) | $m$ (words) | $r$ (classes) | AcKL-ONMF | Best AcKL-ONMF (Train) | Best AcKL-ONMF (Validation) |
| --- | --- | --- | --- | --- | --- | --- |
| ng3sim | 2998 | 15,810 | 3 | 71.45 | 73.86 | 72.95 |
| classic | 7094 | 41,681 | 4 | 85.38 | 91.58 | 82.24 |
| ohscal | 11,162 | 11,465 | 10 | 47.74 | 51.57 | 51.15 |
| k1b | 2340 | 21,819 | 6 | 58.46 | 60.23 | 59.46 |
| hitech | 2301 | 10,080 | 6 | 38.55 | 38.78 | 38.17 |
| reviews | 4069 | 18,483 | 5 | 72.35 | 71.27 | 71.15 |
| sports | 8580 | 14,870 | 7 | 66.55 | 68.67 | 67.88 |
| la1 | 3204 | 31,472 | 6 | 61.20 | 66.46 | 63.77 |
| la12 | 6279 | 31,472 | 6 | 67.48 | 61.03 | 57.76 |
| la2 | 3075 | 31,472 | 6 | 59.90 | 63.12 | 62.35 |
| tr11 | 414 | 6424 | 9 | 54.11 | 50.32 | 46.62 |
| tr23 | 204 | 5831 | 6 | 34.31 | 45.10 | 39.54 |
| tr41 | 878 | 7453 | 10 | 48.63 | 51.78 | 49.77 |
| tr45 | 690 | 8261 | 10 | 59.57 | 65.75 | 64.06 |
| Averages | | | | 58.98 | 61.39 | 59.06 |
Table 6 compares the performance of the AcKL-ONMF method applied to both untransformed and smoothed, non-linear transformed data. The metrics include baseline accuracy, Best AcKL-ONMF accuracy for training, and validation sets across various datasets.
The Best AcKL-ONMF (Train) values exceed the baseline accuracy on most datasets, demonstrating the method’s effectiveness in extracting features from the transformed data. For example, ng3sim shows an increase from 71.45% to 73.86%, indicating a substantial improvement in model performance due to the application of smoothing and non-linear transformations.
In terms of Best AcKL-ONMF (Validation), the results reveal a positive trend in accuracy for unseen data. The classic dataset exemplifies this, with the accuracy rising from 85.38% to 91.58% in training, while validation accuracy stands at 82.24%. This illustrates the model’s capacity to generalize effectively to new data.
Average accuracies further support our findings, with the training set achieving an average of 61.39%, compared to a baseline of 58.98%. The validation set maintains a competitive average of 59.06%, reinforcing the consistency of the AcKL-ONMF method across both training and validation contexts.
5. Conclusion and Suggestion
In this paper, we developed an innovative clustering approach for non-negative data, named Optimal Predictive Modeling of Nonlinear Transformations using the Kullback-Leibler Divergence “OPMNT-KL-ONMF”. Our methodology begins by applying smoothing to TF-IDF scores, followed by the integration of parameters optimized using the Nelder-Mead method. We then employ nonlinear transformations to extract nonlinear features. Comparative results between KL-ONMF applied to original data and data transformed by OPMNT demonstrate a significant performance improvement. Our OPMNT-KL-ONMF model represents a promising advancement in the field of clustering non-negative data, such as document datasets, and paves the way for future research and applications across various domains of data analysis.
Acknowledgements
Sincere thanks to the members of OJAppS for their exemplary professionalism, and a special acknowledgment to Managing Editor Emily Lee for her exceptional commitment to high-quality standards.