Application of Machine Learning Techniques for Okra Shelf Life Prediction

The ability of machine learning techniques to make accurate predictions is increasing. The aim of this work is to apply machine learning techniques, namely Support Vector Machine, Naïve Bayes, Decision Tree, Logistic Regression, and K-Nearest Neighbour, to predict the shelf life of Okra. Predicting the shelf life of Okra is important because Okra becomes harmful to humans if consumed after its shelf life. Okra parameters such as weight loss, firmness, titratable acidity, Total Soluble Solids, Vitamin C/ascorbic acid content, and pH were used as inputs to these machine learning techniques. Support Vector Machine, Naïve Bayes and Decision Tree each predicted the shelf life of Okra with an accuracy of 100%, whereas Logistic Regression and K-Nearest Neighbour achieved accuracies of 88.89% and 83.33%, respectively. These results show that machine learning techniques, especially Support Vector Machine, Naïve Bayes and Decision Tree, can be effectively applied to the prediction of Okra shelf life.


Introduction
Okra (Hibiscus esculentus) is considered the second most important vegetable in West African markets after tomatoes [1] [2].
The vegetable crop is known for its mucilaginous properties that give the Okra soup a slimy texture, which has huge benefits as stabilizers, emulsifiers and thickeners [3]. Okra is a versatile crop known for its tender and tasty pods as well as its importance in preparing African soups and sauces. In Nigeria and most African regions, Okra soup is prepared from fresh or dried Okra pods.
Although it is believed that the nutritional value of Okra is altered when it is preserved, research indicates that such alteration is almost insignificant.
A comparison shows that dried Okra (Okra chips) retains the mucilage of the fresh Okra pods. Okra is considered an important fruit because it contains carbohydrates, proteins, total fat, cholesterol, dietary fibre, vitamins, electrolytes, minerals and phyto-nutrients [4]. Furthermore, Okra is reported to offer many health benefits: it is high in fibre, is a good source of folate, Vitamin K and antioxidants, has a low Glycemic Index, aids digestion, improves immunity and eyesight, supports ulcer healing, helps to control asthma, obesity and cholesterol levels, and helps to prevent diabetes, constipation, sunstroke, colon cancer, kidney diseases and skin pigmentation, amongst several other benefits [5].
With the huge health benefits that Okra provides, it is very important to extend its shelf life.
Most farmers and traders use synthetic chemicals such as actellic 25 EC (pirimiphos-methyl), heptachlor, and thiometon to extend the shelf life of Okra [6].
These synthetic chemicals are harmful to the environment and their chemical residues may be found in the Okra fruit. The chemicals are also dangerous to the health of the users [7]. It is worth noting that biological approaches in preserving general vegetables have proved to be very effective and suitable for mammals [8]. Even though there are methods developed to extend Okra shelf life, not much has been done in predicting the shelf life of Okra.
Because machine learning techniques can learn from existing data and make predictions, this research uses them to examine how safe Okra is for consumption in either the preserved or the unpreserved state. By preserved state, we mean Okra that is treated to extend its shelf life, whereas unpreserved state means Okra that is harvested and not treated.
In this paper, we propose the novel use of support vector machine (SVM), Naïve Bayes, Decision Tree, Logistic Regression and K-Nearest Neighbour (KNN) to predict how long Okra can stay (shelf life) before it is no longer safe for consumption based on some measured Okra parameters.

Related Work
Attempts to predict the shelf life of fruits are rare. However, several researchers have worked on the classification of fruits.
Al-Sulaiman [9] developed a framework for predicting the quality of Okra pods after drying in a domestic microwave oven using an artificial neural network (ANN) model. The evaluation results indicated that the ANN technique outperformed the multiple linear regression model for the prediction of Okra quality. The model achieved an R² value of 0.98.
Savakar [10] proposed a classification technique based on an artificial neural network (ANN) to identify diverse types of fruits. Five different types of fruits, i.e., Apple, Mango, Sweet Lemon, Chikoo and Orange, were considered and a total of 5000 images were used. Each of the fruit types contributed 1000 images that were examined using the ANN model. Experiments demonstrated that the model performed well with an accuracy of 94%.
Dubey and Jalal [11] presented a model for fruit and vegetable classification using the K-Means clustering method for segmentation. The classification of fruits and vegetables was done by a multi-class SVM, based on the improved sum and difference histogram (ISADH) texture feature, which is a modified version of the sum and difference histograms for texture classification proposed by Unser in 1986. Results showed that the model recorded an accuracy of 99%.
Yissah, Ikyo and Ige [12] proposed the extension of Okra shelf life using X-ray irradiation at varying doses. The technique extends the shelf life of fresh Okra to minimize post-harvest loss and increase availability. Experiments using an irradiation dosage of 0.051 Gy, as recommended by NAFDAC and USFD, demonstrated that the approach prolonged the shelf life of fresh Okra from 3 days to 14 days.
Yu, Li, Shen, Cao and Mao [13] used a Bayes classifier, a decision tree and an SVM to classify grain data and compared the results to determine the most efficient algorithm. Experiments showed that the SVM algorithm outperformed the other algorithms.
Khaing, Naung, and Htut [14] developed a control system for object recognition based on a convolutional neural network (CNN). The CNN was given the [...]
Sakib, Ashrafi, Siddique and Bakr [16] proposed a deep learning model using a convolutional neural network (CNN) for fruit identification or recognition. The fruits-360 dataset was used for testing the proposed model. Results showed that the model achieved an overall accuracy of 100% and a training accuracy of 99.79%.
Nosseir and Ahmed [17] developed a model that classified four different types of fruits and further separated the decayed ones from the fresh ones. The model used K-NN and SVM techniques for the prediction. Experiments demonstrated that the model achieved 95%, 96% and 98% accuracies using K-NN, linear SVM and quadratic SVM, respectively.
Ummapure and Hanchinal [18] proposed a multi-class framework in which three feature vectors were developed to identify fruit types. The multi-class framework achieved 99.98% classification accuracy.
Zeeshan, Prabhu, Arun and Rani [19] implemented a fruit classification system using computer vision and a support vector machine (SVM). The model used a Gaussian filter, which reduced noise while enhancing image quality for effective classification. The model was tested using 655 images and achieved an overall accuracy of 87.06%.
Behera, Rath, Mahapatra and Sethy [20] presented a comprehensive analysis of the existing methods for fruit classification with a focus on the development of state-of-the-art methods. The study compared different techniques for the identification, classification and grading of fruits.
Jaramillo-Acevedo, Choque-Valderrama, Guerrero-Álvarez and Meneses-Escobar [21] leveraged the RGB color model, which reflects the physical and chemical changes during the ripening process, to classify the consumption maturity of Hass avocado fruits using an artificial neural network. Experimental results yielded 88% ripeness estimate accuracy and a regression value of 0.819 during the post-harvest period.
Ogbaji and Iorliam [22] carried out a laboratory experiment and showed that the shelf life of the treated Okra fruits ranged from 1 to 15 days, while that of the control ranged from 1 to 7 days. The laboratory data generated from their research are used in this paper for the shelf life prediction of Okra.
To the best of the researchers' knowledge, the prediction of Okra shelf life has not been investigated using machine learning techniques. Hence, this research is motivated to apply machine learning techniques to predict Okra shelf life.

Methodology
In this paper, five algorithms (SVM, Naïve Bayes, Decision Tree, Logistic Regression and KNN) are employed to predict the shelf life of Okra, and their performances are compared. Three scenarios were considered: 1) Okra without preservation; 2) Okra preserved with Moringa; and 3) Okra preserved with Neem. Data generated from the three scenarios were fed into the five algorithms as input. In each case, the algorithms predicted whether the Okra is safe for consumption or not, based on the laboratory data generated. The performance of each of the five algorithms is presented in terms of Accuracy, F1-score, Recall, and Precision.

Dataset Used
The weight loss (%) of the Okra fruits was measured by placing the fruits on a digital weighing balance and recording each reading during the storage process.
The fruit firmness (N/cm) was measured as the maximum penetration force (N) reached during tissue breakage using a standard probe. The registered force at the penetration of a standard probe up to a certain depth (cm) was read as firmness. The firmness of the Okra fruit was determined using a penetrometer as described in Kumah, Olympio, and Tayviah [23].
The percentage titratable acidity (TA %) of the Okra fruits was determined as follows. After blending the Okra fruits, ten millilitres of the juice were filtered through filter paper into a beaker. Five millilitres of the filtrate were pipetted into a conical flask, and 10 millilitres of sterile distilled water were added to aid clear endpoint detection. Two drops of phenolphthalein indicator were then added, and 0.1 N sodium hydroxide (NaOH) was added dropwise, with thorough shaking, until a pink colour was obtained. The acid content of the Okra sample was then calculated from the volume of NaOH used.
The Total Soluble Solids (TSS) (˚Brix) of the Okra fruits was determined using a handheld refractometer. A homogenous sample was obtained after blending the Okra fruits, then two drops of the blended sample were applied to the refractometer using a plastic dropper, and the reading was obtained as a percentage of soluble solids concentration in ˚Brix [24].
Vitamin C/ascorbic acid content (mg/ml) was determined using the method described in AOAC [24]:
$$\text{Concentration of Vitamin C} = \frac{\text{Molarity of indophenol} \times \text{Volume of indophenol}}{\text{Volume of Vitamin C}}$$
The final volume was recorded and the concentration of Vitamin C in the sample was calculated and expressed in mg/ml using the above formula.
Okra fruit pH was determined by inserting the pH meter into the Okra paste obtained after blending and recording the readings [25].
The shelf life of Okra was evaluated by counting the number of days Okra fruits were still acceptable for marketing and consumption. This was decided based on the appearance and spoilage of the Okra fruits. The above process is reported vividly in Ogbaji and Iorliam [22].

Data Preprocessing
Data pre-processing in this study consisted of removing noise from the experimental data by sieving out unwanted parameters. The raw data were pre-processed to retain the core parameters, namely Firmness, TSS, TTA, Vitamin C and Weight loss of the Okra, used to predict the shelf life of Okra, as shown in Figure 2.
For the normalisation of data (which involves rescaling the dataset to eliminate the effects of differing variable ranges), the minimum value ($X_{\min}$) is subtracted from the Okra variable to be normalised ($X$), and the result is divided by the difference between the maximum value ($X_{\max}$) and the minimum value. Mathematically, normalisation was carried out using the equation below:
$$X_{\text{norm}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$
Variables like Treatment, Variety, Firmness, TSS, TTA, Vitamin C, and Weight loss were rearranged from the complex raw format to a less complex format (data cube aggregation). The resulting dataset became smaller in size (dimensionality reduction) while retaining all of the details required for prediction.
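The min-max normalisation described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the firmness readings are hypothetical and are not taken from the study's dataset, and mapping a constant column to zeros is an assumed convention.

```python
from typing import List


def min_max_normalise(values: List[float]) -> List[float]:
    """Scale values to [0, 1] via (X - X_min) / (X_max - X_min)."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        # A constant column has no variation; map it to zeros (assumed convention).
        return [0.0 for _ in values]
    return [(x - x_min) / span for x in values]


# Hypothetical firmness readings, not taken from the study's dataset.
firmness = [2.0, 3.5, 5.0, 8.0]
normalised = min_max_normalise(firmness)
print(normalised)  # [0.0, 0.25, 0.5, 1.0]
```

After this step every variable lies in the same range, so a variable measured on a large scale does not dominate distance-based algorithms such as KNN.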

The Proposed Model
The proposed model is obtained by applying five different algorithms to make predictions from the dataset. Two classes are considered: "class 1" and "class 0". Class 1 is the class of Okra that is safe for consumption, while class 0 is the class of Okra that is not safe for consumption. Using Okra parameters such as Firmness, TSS, TTA, Vitamin C and Weight loss, each algorithm makes independent predictions, from which the Precision, Recall, F1-score and Accuracy are computed. For all the algorithms compared, the dataset is split into 70% training and 30% testing. The proposed Okra shelf life prediction model is shown in Figure 1.

Model Description and Evaluation
For the SVM, the separating line $x_2 = a x_1 + b$ can be written in vector form. If we define $x = (x_1, x_2)$ and $w = (a, -1)$, we get
$$w \cdot x + b = 0.$$
The SVM classifier works with a hypothesis function $h$ defined as:
$$h(x_i) = \begin{cases} +1 & \text{if } w \cdot x_i + b \geq 0 \\ -1 & \text{if } w \cdot x_i + b < 0 \end{cases} \tag{2}$$
Equation (2) is the mathematical representation of the support vector machine algorithm used in this study.
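The linear decision rule behind the SVM hypothesis function can be sketched as follows. This is a minimal sketch, assuming the standard form in which h returns +1 when w · x + b ≥ 0 and −1 otherwise; the function name and the values of a and b are hypothetical, and the sketch does not show how w and b are learned from the training data.

```python
from typing import Sequence


def svm_hypothesis(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if w . x + b >= 0, otherwise -1 (linear decision rule)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical line x2 = a * x1 + b with a = 2 and b = 1, so that w = (a, -1).
a, b = 2.0, 1.0
w = (a, -1.0)

below = svm_hypothesis(w, b, (1.0, 0.0))  # point below the line: score = 3
above = svm_hypothesis(w, b, (1.0, 5.0))  # point above the line: score = -2
print(below, above)  # 1 -1
```

Which of the two outputs is mapped to class 1 (safe) and which to class 0 (unsafe) is a labelling choice and is not specified here.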

Naïve Bayes
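Given a feature vector $X = (x_1, x_2, \ldots, x_n)$ and a class variable $C_k$, Bayes' Theorem states that:
$$P(C_k \mid X) = \frac{P(X \mid C_k)\, P(C_k)}{P(X)}$$
Using the chain rule, the likelihood $P(X \mid C_k)$ can be decomposed as:
$$P(X \mid C_k) = P(x_1 \mid x_2, \ldots, x_n, C_k)\, P(x_2 \mid x_3, \ldots, x_n, C_k) \cdots P(x_{n-1} \mid x_n, C_k)\, P(x_n \mid C_k)$$
where $P(C_k \mid X)$ is the posterior probability of the class (target) given the predictors (attributes), $P(C_k)$ is the prior probability of the class, $P(X \mid C_k)$ is the likelihood, i.e., the probability of the predictors given the class, and $P(X)$ is the prior probability of the predictors.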
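A small numerical sketch may help to show how the posterior is obtained. It assumes the usual naïve independence assumption, under which the likelihood reduces to a product of per-feature terms P(x_i | C_k). The function name and all probabilities below are hypothetical and are not estimated from the Okra data.

```python
from math import prod
from typing import Dict, List


def naive_bayes_posterior(priors: Dict[str, float],
                          likelihoods: Dict[str, List[float]]) -> Dict[str, float]:
    """Posterior P(C_k | X) from priors P(C_k) and per-feature P(x_i | C_k)."""
    # Numerator of Bayes' theorem under the naive independence assumption.
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    evidence = sum(joint.values())  # P(X), the normalising constant
    return {c: joint[c] / evidence for c in joint}


# Hypothetical probabilities for two features; not estimated from the Okra data.
priors = {"safe": 0.75, "unsafe": 0.25}
likelihoods = {"safe": [0.8, 0.5], "unsafe": [0.2, 0.4]}

posterior = naive_bayes_posterior(priors, likelihoods)
predicted = max(posterior, key=posterior.get)
print(predicted, posterior)
```

The predicted class is simply the one with the largest posterior; because P(X) is the same for every class, it only rescales the scores and does not change the decision.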

Decision Tree
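The decision tree goes from observations about an item (represented in the branches) to conclusions about the item's target value. This algorithm uses two popular techniques for attribute selection: Information Gain and Gini Index. The Information Gain is given by:
$$\text{Gain}(S, A) = \text{Entropy}(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|}\, \text{Entropy}(S_v)$$
where $S$ is the set of samples at a node and $S_v$ is the subset of $S$ for which attribute $A$ takes the value $v$. Entropy, as a measure of impurity in a given attribute, specifies the randomness in the data. Entropy is calculated as:
$$\text{Entropy}(S) = -\sum_{i=1}^{j} p_i \log_2 p_i \tag{6}$$
where $j$ is the number of classes and $p_i$ is the proportion of samples in $S$ that belong to class $i$.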
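The two attribute-selection measures can be illustrated with a short sketch. It assumes the standard definitions of entropy, Information Gain (parent entropy minus the size-weighted entropy of the child subsets) and Gini Index (one minus the sum of squared class proportions); the function names, labels and split are hypothetical.

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy -sum(p_i * log2(p_i)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini_index(labels: Sequence[int]) -> float:
    """Gini impurity 1 - sum(p_i ** 2) over the class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(parent: Sequence[int], children: List[Sequence[int]]) -> float:
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical labels (1 = safe, 0 = unsafe) and a hypothetical perfect split.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
children = [[1, 1, 1, 1], [0, 0, 0, 0]]
print(entropy(parent), gini_index(parent), information_gain(parent, children))
```

A split that separates the classes perfectly, as in this toy case, removes all impurity, so the gain equals the parent entropy.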
If a decision tree is used for regression and the output is continuous in nature, a reduction in variance is often used. Using the variance formula, the algorithm chooses how to divide the population. The variance is expressed as
$$\text{Variance} = \frac{\sum (X - \bar{x})^2}{n}$$
where $\bar{x}$ is the mean of the values, $X$ is an actual value and $n$ is the number of values.
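For the regression case, a reduction in variance can be computed as in the sketch below. It assumes the population variance and a size-weighted average over the child subsets, which is the usual convention; the function names and the shelf-life values (in days) are hypothetical.

```python
from typing import List, Sequence


def variance(values: Sequence[float]) -> float:
    """Population variance: sum((X - mean) ** 2) / n."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def variance_reduction(parent: Sequence[float], children: List[Sequence[float]]) -> float:
    """Parent variance minus the size-weighted variance of the child subsets."""
    n = len(parent)
    weighted = sum(len(child) / n * variance(child) for child in children)
    return variance(parent) - weighted


# Hypothetical shelf-life values in days and a hypothetical split.
parent = [2.0, 4.0, 10.0, 12.0]
children = [[2.0, 4.0], [10.0, 12.0]]
print(variance(parent), variance_reduction(parent, children))  # 17.0 16.0
```

The split with the largest reduction is preferred, because it leaves the child nodes most homogeneous.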

Logistic Regression
Logistic regression models the probability of an outcome depending on individual characteristics. It is given by:
$$\ln\left(\frac{P}{1 - P}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$
where $P$ is the probability parameter of the Bernoulli-distributed variable under consideration (e.g., Okra shelf-life), the $\beta_i$ are the regression coefficients and the $x_i$ are the variables.
The intercept ($\beta_0$) corresponds to the reference group, i.e., the reference level of each and every variable.
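The way a fitted logistic regression turns the variables into a probability can be sketched as follows. This is a minimal sketch, assuming the standard logit link, so that P = 1 / (1 + exp(−(β0 + Σ βi xi))); the function name, the coefficients and the 0.5 decision threshold are hypothetical and are not taken from the study.

```python
from math import exp
from typing import Sequence


def predict_probability(beta0: float, betas: Sequence[float], x: Sequence[float]) -> float:
    """Invert the logit: P = 1 / (1 + exp(-(beta0 + sum(beta_i * x_i))))."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + exp(-z))


# Hypothetical coefficients and normalised inputs; not fitted to the Okra data.
beta0, betas = 0.0, [1.0, -1.0]

p_boundary = predict_probability(beta0, betas, [0.5, 0.5])  # z = 0
p_high = predict_probability(beta0, betas, [0.9, 0.1])      # z = 0.8
label = 1 if p_high >= 0.5 else 0                           # assumed 0.5 threshold
print(p_boundary, p_high, label)
```

When all variables are at their reference level (zero), the probability depends on the intercept alone.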

K-Nearest Neighbour (KNN)
KNN determines the closest points to the input point in the given data set. It stores the available cases (training data) and classifies new cases by a majority vote of their K nearest neighbours. Closeness is measured between the feature vectors of the points using the Euclidean distance formula:
$$d = \sqrt{(x - a)^2 + (y - b)^2}$$
where $(x, y)$ and $(a, b)$ are the coordinates of the two points.
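The KNN procedure can be sketched directly from this description. The sketch assumes Euclidean distance and an unweighted majority vote; the function name, the two-dimensional points and the choice K = 3 are hypothetical, since the value of K used in the study is not stated here.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], int]],
                query: Sequence[float], k: int = 3) -> int:
    """Majority vote among the k stored points closest to the query (Euclidean)."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical normalised feature pairs with labels (1 = safe, 0 = unsafe).
train = [
    ((0.10, 0.20), 1),
    ((0.20, 0.10), 1),
    ((0.15, 0.15), 1),
    ((0.90, 0.80), 0),
    ((0.80, 0.90), 0),
]

print(knn_predict(train, (0.20, 0.20)))  # 1
print(knn_predict(train, (0.85, 0.85)))  # 0
```

Because the vote depends on raw distances, the inputs should be on comparable scales, which is one reason for normalising the data beforehand.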

Evaluation Metrics
The evaluation metrics used in this paper are the Accuracy, F1-score, Precision, Recall and Confusion Matrix.
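These metrics can all be derived from the four counts of a binary confusion matrix. The sketch below assumes the standard definitions and treats class 1 (safe) as the positive class, which is an assumption. The function name is hypothetical, and the second set of counts is purely illustrative.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# A perfect result on 18 test samples: 14 safe and 4 unsafe, all classified correctly.
perfect = classification_metrics(tp=14, fp=0, fn=0, tn=4)

# Purely illustrative counts with some errors; not results from the study.
imperfect = classification_metrics(tp=8, fp=2, fn=2, tn=8)
print(perfect)
print(imperfect)
```

Accuracy alone can hide class-specific errors when the classes are unbalanced, which is why Precision, Recall and F1-score are reported alongside it.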

Results and Discussion
This study used the dataset generated in the laboratory by Ogbaji and Iorliam [22], using the procedure explained in section 3.2. It consists of 60 rows and 8 columns of data stored in an Excel (.XLSX) file and combines both text and numbers, as shown in Figure 2.
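With 60 rows and the 70%/30% split used for all the algorithms, the test set contains 18 rows. A minimal sketch of such a split is given below, assuming a simple random shuffle; the function name and the seed are hypothetical, and the study does not state whether the split was stratified or how it was randomised.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(rows: Sequence, test_fraction: float = 0.3,
                     seed: int = 0) -> Tuple[List, List]:
    """Shuffle rows reproducibly and hold out test_fraction of them for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(60))  # stand-ins for the 60 dataset rows
train, test = split_train_test(rows)
print(len(train), len(test))  # 42 18
```

Fixing the seed makes the split reproducible, so that all five algorithms can be compared on the same held-out rows.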

Support Vector Machine (SVM)
As shown in Table 1, the SVM algorithm recorded a precision of 1.00, a recall of 1.00, an F1 score of 1.00, and 100% accuracy. This result affirms that the SVM algorithm performs well and can be relied upon for the prediction of Okra shelf life. The confusion matrix for the SVM model is shown in Figure 3.
Journal of Data Analysis and Information Processing

Naïve Bayes (NB)
From Table 2, the Naïve Bayes model also achieved 100% accuracy, with precision, recall and F1 score all equal to 1.00. This shows that the Naïve Bayes algorithm also gave a very high accuracy for the prediction of the shelf life of Okra. Figure 4 shows the performance of the Naïve Bayes model as a confusion matrix.

Decision Tree (DT)
The same dataset fed into the SVM and Naïve Bayes models was used in the Decision Tree model. Table 3 shows that the model achieved a precision of 1.00, a recall of 1.00, an F1 score of 1.00, and an accuracy of 100%.
The performance of the Decision Tree model is shown in the confusion matrix in Figure 5.

Logistic Regression (LR)
Table 4 shows that the Logistic Regression algorithm achieved a precision of 0.88, a recall of 1.00, an F1-score of 0.80 and an accuracy of 88.89%.
Using the Logistic Regression algorithm, the model yielded the confusion matrix shown in Figure 6.

K-Nearest Neighbour (K-NN)
The KNN algorithm for predicting Okra shelf life achieved a precision of 0.87, a recall of 0.93, an F1-score of 0.73, and an accuracy of 83.33% (see Table 5). The confusion matrix for the K-NN model is shown in Figure 7.
Table 6 compares the accuracy of our proposed model with the accuracies reported by other researchers for their models and approaches. The comparison shows that our proposed model outperformed the existing state-of-the-art methods, especially when using the SVM, Naïve Bayes and Decision Tree. This indicates that our proposed method is effective and can be relied upon for the prediction of Okra shelf life.
From Figures 3-5, it can be observed that the three algorithms (SVM, Naïve Bayes and Decision Tree) all predicted the shelf life of Okra with 100% accuracy. The confusion matrices show that in every case, 4 of the Okra are predicted to be bad (unsafe for consumption) and 14 of the Okra are predicted to be good (safe for consumption).
The lower performance of the Logistic Regression and KNN algorithms shown in Figure 6 and Figure 7 further affirms the suitability of the SVM, Naïve Bayes and Decision Tree algorithms for the shelf-life prediction of Okra.
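The reported accuracies can be related to the size of the test set with simple arithmetic. The confusion matrices contain 4 + 14 = 18 test samples, so every accuracy must be a multiple of 1/18. In the sketch below, the counts of 16 and 15 correct predictions are inferred from the reported percentages and are not stated in the text; the function name is hypothetical.

```python
TEST_SIZE = 4 + 14  # totals of the confusion matrices: 4 unsafe and 14 safe


def accuracy_percent(correct: int, total: int = TEST_SIZE) -> float:
    """Accuracy as a percentage, rounded to two decimal places."""
    return round(100 * correct / total, 2)


print(accuracy_percent(18))  # 100.0 (SVM, Naive Bayes, Decision Tree)
print(accuracy_percent(16))  # 88.89 (matches the Logistic Regression accuracy)
print(accuracy_percent(15))  # 83.33 (matches the KNN accuracy)
```

On a test set this small, a single misclassified sample changes the accuracy by about 5.6 percentage points, which should be kept in mind when comparing the algorithms.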

Conclusions and Future Work
In this work, we used five machine learning techniques (Support Vector Machine, Naïve Bayes, Decision Tree, Logistic Regression, and K-Nearest Neighbour) to predict the shelf life of Okra. Three of the algorithms (Support Vector Machine, Naïve Bayes and Decision Tree) each achieved an accuracy of 100%, an F1 measure of 1.00, and a recall value of 1.00. The Logistic Regression and KNN algorithms achieved lower prediction results. This shows that machine learning techniques are reliable and can accurately predict the shelf life of Okra.
In future work, we hope to predict the shelf life of different varieties of fruits using other machine learning techniques.