Predicting the stress level of students using Supervised Machine Learning and Artificial Neural Network (ANN)

Nowadays, the concept of stress is universally acknowledged. Many of us face situations that contribute to daily hassles, affecting professionals such as teachers, doctors, lawyers, journalists, and parents, and university students encounter similar challenges. This study aims to identify the factors generating stress among students at Tribhuvan University Dharan in Nepal. By analyzing these stress factors, stress can be predicted and prevented at an early stage. This paper proposes various machine learning and deep learning models, including support vector machine (SVM), Random Forest, Gradient Boosting, AdaBoost, CatBoost, LightGBM, ExtraTree, XGBoost, logistic regression, K-nearest neighbor (KNN), Naive Bayes, decision tree, multi-layer perceptron (MLP), and artificial neural network (ANN). The Naive Bayes model achieved the highest accuracy at 90%, while SVM had the lowest test accuracy at 85.45%. The accuracy of these models improved with hyperparameter tuning. The key finding of this study is that the "academic period" is the most stressful time for students compared to other situations.


INTRODUCTION
Stress is a state of mind in which a person feels pressured while performing daily routine activities. This phenomenon is evident across various sectors, including but not limited to offices, universities, and hospitals. Sometimes it is obvious, but generally it results from high expectations and low motivation, unrealistic workloads, insecure jobs, community violence, and examinations. Professionals such as teachers, doctors, lawyers, journalists, and parents go through stressful situations, and even students are not spared. Students are the future of every country, so it is essential to analyze the factors responsible for stress among them; by detecting these factors early, stressful situations can be avoided or controlled. According to the World Health Organization (WHO), stress is defined as a state of everyday pressure caused by a difficult situation (American Psychological Association, 2023).
Stress is classified into two types: acute and chronic (Mind, 2022). Short-term stress is called acute stress; it can be seen in the examination hall, when joining a new job, during a speech, or when facing a work deadline. Chronic stress occurs when we feel pressure and tension for an extended period; it arises from financial difficulties, career-related problems, high-pressure jobs, and relationship challenges. Understanding these factors is essential for effective stress management (Medlineplus, 2022), and stress management strategies can be developed with the help of machine learning and deep learning techniques. Physiological, psychological, academic, environmental, and social (PPAES) factors are the building blocks of stress, as shown in Figure 1.
Stress factors can be integrated with machine learning (ML) and deep learning (DL) models to forecast student stress. In this paper, various ML models, such as support vector machines, decision trees, Random Forest, ExtraTree, and Naïve Bayes, are trained and tested, and stress levels are classified using labeled data. Applying hyperparameter tuning techniques substantially improves the performance of these algorithms, which are assessed using recall, precision, F1-score, and accuracy metrics.
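The train-and-compare loop described above can be sketched as follows. The data here is a synthetic stand-in for the real dataset (1100 rows, 20 features, three stress classes), and the model list is abbreviated.

```python
# Sketch of training several classifiers on a labeled stress dataset and
# comparing them on accuracy and weighted F1; synthetic data stands in for
# the real Kaggle dataset (1100 rows, 20 features, 3 stress classes).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=1100, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=42),
    "ExtraTree": ExtraTreesClassifier(random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    scores[name] = {"accuracy": accuracy_score(y_te, pred),
                    "f1": f1_score(y_te, pred, average="weighted")}
```

The same loop extends directly to the boosting models and KNN by adding entries to the `models` dictionary.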

Figure 1 Stress factor classification of stress dataset
This paper is structured into five sections: the second and third sections cover the related work and research methodology, while the experimental evaluation with results and discussion, and the conclusion, are detailed in the fourth and fifth sections, respectively.

Related Works
Stress prediction frameworks have been studied by several researchers using different machine learning approaches, in numerous attempts to achieve effective methods and high accuracy in identifying stress-related elements. Some studies compared three or four machine learning models, while others applied deep learning models. The existing works are discussed here. Onim et al. (2024) used a fully connected convolutional neural network (CNN) model, which validates the integration of digital stress biomarkers. Using sensor fusion with context, the CNN achieved 96.7525% accuracy and an F1-score of 0.9745, while Random Forest provided an accuracy of 92.48% and an F1-score of 0.937. In their investigation, Liao et al. (2024) examined elements of the SRQ and SCL-90 questionnaires and utilized machine learning to understand mental stress.
The SHAP model was used for interpreting the training results, and the ROC-AUC curve was evaluated for Random Forest. Ratul et al. (2023) examined the psychological and social levels of stress among 444 university students from various backgrounds; this stress was attributed to physical illness, mobile-phone dependency, the internet, lack of social activities, and other factors. Different machine learning algorithms were combined with feature reduction techniques, namely principal component analysis (PCA) and the Chi-squared test, while GridSearchCV, cross-validation, and a genetic algorithm (GA) were used to optimize hyperparameters. The result gave an accuracy of 80.5%, a recall of 0.826, and a precision of 1.00. Ding et al. (2023) used machine learning and deep learning approaches for stress prediction.
At present, stress increases health issues and puts human lives at risk, and 40% of young people face stress due to frustration, nervousness, and anxiety. Their work used a hybrid approach consisting of gradient boosting and Random Forest, which gave an accuracy of 100%, with 10-fold cross-validation used to find the mean and standard deviation. A statistical t-test was applied to determine the relevance of this method in comparison with other machine learning methods. Al-Atawi et al. (2023) employed machine learning, the Internet of Things (IoT), and wearable devices for managing stress. Stress was assessed using three main factors, namely body temperature, sweat, and movement during exercise, achieving an accuracy of 99.5% and improving people's psychological health and well-being. An online cross-sectional survey hosted on Microsoft Teams was used in a related study, whose CSSQ instrument was tested for structural validity, discriminant validity, convergent validity, and internal consistency. According to Nijhawan et al. (2022), natural language processing (NLP) and ML can detect stress in social interactions: the latent Dirichlet allocation (LDA) technique was used for exploratory analysis of user tweets, and a deep learning model, BERT, was used for sentiment classification, which had the best stress detection rate. According to Edwards et al. (2020), student stress was reduced in an academic library using robot animal companions; after contact with a robot animal, participants reported decreased stress and increased positive affect.
One hundred and three students from Midwestern US universities were selected for this study. Some libraries allow students to interact with pets such as dogs and cats, and the authors organized a robot petting zoo at Midwestern University during the final week. Participants were asked to rate their stress level before interacting with the robot pet (T1); after the interaction, they were asked to assess their present state of stress and how supportive they thought the robot animal was (T2). Pairwise t-tests were employed to analyze the pre-test and post-test scores. Table 1 shows the summary of existing studies. The dataset from Tribhuvan University Dharan, Nepal, is the original contribution and unique aspect of this study: it has not previously been analyzed using ML and DL models to predict stress factors among students. A total of twelve machine learning and two deep learning models were trained and tested for stress forecasting, multiclass classification was used to predict the stress level, and a comparative analysis was performed to find the most accurate model. Lastly, the stress level feature has not previously been analyzed as a target variable by any researcher.

METHODOLOGY
This paper's data source is Kaggle, an open data repository; the students' stress dataset belongs to Dharan, Tribhuvan University, Nepal. After exploratory analysis, a comprehensive analysis of the stress_level feature is conducted. The models under consideration are compared on several criteria, the results of all models are studied, and the best model is chosen. The GridSearchCV method is applied for hyperparameter adjustment to evaluate model performance with respect to accuracy.
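A minimal GridSearchCV sketch for the hyperparameter-adjustment step; the grid below is a small illustrative one, not the full grid used in the study, and the data is synthetic.

```python
# GridSearchCV sketch: exhaustive search over a small parameter grid with
# cross-validation, scored on accuracy. Grid and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=0)
grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=0),
    param_grid={"learning_rate": [0.1, 1.0], "max_depth": [3, 7]},
    cv=3, scoring="accuracy",
)
grid.fit(X, y)
best_params, best_acc = grid.best_params_, grid.best_score_
```

The `best_params_` attribute then reports the winning combination, which is how the per-model optimal parameters in Table 5 would be obtained.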

Dataset pre-processing
In the initial preprocessing stage, the first step involves importing the necessary libraries and reading the dataset. Subsequently, data cleaning is performed to identify duplicate, irrelevant, and missing values, and any identified missing or duplicate values are eliminated. The dataset, which pertains to Dharan Tribhuvan University, Nepal, contains zero null values, indicating its readiness for direct use. As depicted in Figure 2, all columns exhibit a uniform data type (object), an equal count, no duplicate rows, and no missing values, confirming the dataset's integrity; consequently, no further cleaning is necessary.
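These cleaning checks can be sketched on a tiny synthetic frame; the real dataset would be loaded with `pd.read_csv` (file name assumed), and the column names below only mirror features of the stress dataset.

```python
# Sketch of the duplicate/missing-value checks described above, on a tiny
# synthetic frame standing in for the real stress dataset.
import pandas as pd

df = pd.DataFrame({
    "anxiety_level": [14, 14, 7, None],
    "self_esteem":   [20, 20, 25, 18],
    "stress_level":  [2, 2, 0, 1],
})
n_duplicates = int(df.duplicated().sum())   # one duplicated row
n_missing = int(df.isnull().sum().sum())    # one missing value
clean = df.drop_duplicates().dropna().reset_index(drop=True)
```

On the actual dataset both counts come back as zero, which is why no rows need to be dropped.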
Cross-validation and hyperparameter tuning techniques in machine learning aim to enhance the accuracy of the models and determine the optimal parameters for these algorithms. LightGBM (Light Gradient Boosting Machine) applies classification techniques; it is a decision-tree-based learning algorithm that builds trees using a histogram method and works by combining weak decision trees into strong ones. ExtraTree is an ensemble supervised machine-learning classification algorithm that uses decision trees. Both the Random Forest approach and the ExtraTree algorithm produce multiple decision trees, but in ExtraTree the sampling for every tree is random and without replacement, so distinct samples are created from the dataset for every tree. For every tree, a certain number of features is also randomly chosen from the full set of features (Ampomah et al., 2020).
Extreme gradient boosting (XGBoost) is a supervised, decision-tree-based machine learning technique that combines decision trees to enhance model accuracy; it is scalable and constructed in parallel (Wade, 2020). The Naive Bayes algorithm assigns equal weight to each feature and assumes that no property has an effect on another, which makes it efficient. The Naive Bayes classifier (NBC) is a simple, proficient, and widely used classification algorithm for text categorization (Chen et al., 2021). Another well-known machine learning method is the decision tree, which classifies data according to predetermined criteria.
The tree comprises two entities: nodes, which denote decisions, and leaves, which denote outcomes. A multi-layer perceptron (MLP) is a neural network (NN) with multiple layers; all layers are fully connected, and data flows in the forward direction only. According to Desai and Shah (2021), MLP uses the backpropagation technique to enhance the model's performance. An artificial neural network is a system composed of several simple processing units operating in parallel; processing takes place in each node or computing element with low processing capability, and the network's function is defined by its structure and the weights of its connections (Guillod et al., 2020).
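A sketch of a fully connected multi-layer perceptron follows. The study's ANN was built in Keras, but scikit-learn's `MLPClassifier` illustrates the same idea of stacked dense layers trained by backpropagation; the layer sizes and data here are assumptions, not the paper's architecture.

```python
# MLP sketch: two hidden dense layers trained by backpropagation on
# synthetic 3-class data. Layer sizes (64, 32) are illustrative only.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=20, n_informative=6,
                           n_classes=3, random_state=1)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1)
mlp.fit(X, y)
train_acc = mlp.score(X, y)   # fraction of training samples classified correctly
```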

Datasets
The dataset used in this paper was collected from students studying in high schools and colleges of Tribhuvan University, Dharan, Nepal. It has twenty features that can be used to predict and measure stress levels, with twenty-one columns and 1100 rows in total. We take "stress_level" as the dependent variable out of the twenty-one columns; the other variables are independent. The target variable is a ternary attribute, which provides a stress level diagnosis.
Table 2 shows the students' stress levels: the value "two" represents a highly stressful situation, "one" is medium, and "zero" means no stress. Of the 1,100 rows, 880 are used to train the models and 220 are used for testing. The independent features vary in range: for example, anxiety_level ranges from 0 to 21, depression from 0 to 27, and self-esteem from 0 to 30, while all others lie between zero and five. In most cases, higher values of anxiety level and depression, and lower values of self-esteem, are associated with higher stress levels. A random sample of the dataset is given in Table 2, and the summary statistics of the features are given in Table 3.
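The 880/220 split described above can be sketched on a synthetic 1100-row, 20-feature matrix with a ternary stress_level target.

```python
# Holding out 20% of 1100 rows (220) for testing, as in the paper's split.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.random((1100, 20))
y = rng.integers(0, 3, size=1100)   # stress_level in {0, 1, 2}
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # 880 train / 220 test rows
```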

Exploratory Data Analysis
This section explains the stress factors and the correlations between them. "Stress_level" is a multiclass variable that can be categorized as low, medium, or high, so it is essential to investigate the impact of psychological, physiological, environmental, social, and academic factors on it. To this end, correlation matrices were derived.
The psychological factors correlation matrix shows a strong positive correlation of 0.74 between anxiety level and stress level, and a strong negative correlation of -0.76 between self-esteem and stress level. Self-esteem has a negative correlation with all other psychological factors, as shown in Figure 8. The physiological factors correlation matrix shows a strong positive correlation of 0.71 between headache and stress level, and a strong negative correlation of -0.64 between sleep quality and stress level.
In Figure 9, sleep quality has negative correlations of -0.75, -0.64, -0.30, and -0.54 with the other features in the matrix. The environmental correlation matrix shows that noise level is negatively correlated (-0.57) with basic needs and positively correlated (0.66) with stress level; Figure 10 shows the correlation matrix for the environmental factors. With a correlation coefficient of 0.75, as shown in Figure 11, the association between stress level and bullying is high, and among social factors, social support has a negative correlation with peer pressure, extracurricular activities, and bullying. Figure 12 displays the correlation matrix of academic factors, which shows a positive correlation (0.74) between future_career_concerns and stress level and a strong negative association (-0.72) with academic performance.
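Deriving such a correlation matrix with pandas can be sketched as below. The values are synthetic, constructed only so that anxiety rises and self-esteem falls with stress, mirroring the signs reported above rather than the exact coefficients.

```python
# Correlation-matrix sketch: synthetic features with the same sign pattern
# as the reported results (anxiety positive, self-esteem negative vs stress).
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
stress = rng.integers(0, 3, size=200).astype(float)
df = pd.DataFrame({
    "anxiety_level": stress * 7 + rng.normal(0, 2, 200),
    "self_esteem":   25 - stress * 8 + rng.normal(0, 2, 200),
    "stress_level":  stress,
})
corr = df.corr()
r_anxiety = corr.loc["anxiety_level", "stress_level"]   # strongly positive
r_esteem = corr.loc["self_esteem", "stress_level"]      # strongly negative
```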

Figure 9 Physiological Factors Correlation Matrix
Figure 13 illustrates the relationships between "stress_level" and all other factors. Figure 14 demonstrates the density distribution of student stress for anxiety level, self-esteem, and depression. The boxplot of anxiety_level lies between 0 and 14 for 'no stress (0)', between 6 and 17 for 'medium stress (1)', and between 9 and 21 for 'high stress (2)', as shown in Figure 14; this category also has a few outliers below the lower and above the upper margins. Most students' self-esteem lies between 0 and 25 for high stress levels, as shown in Figure 14. In the confusion matrix, FP denotes the case where the actual value is zero and the predicted value is one, and FN the case where the actual and predicted values are one and zero, respectively.

Definitions and formula
Accuracy is the sum of true positives and true negatives divided by the total number of classified instances:

Accuracy = (TP + TN) ÷ (TP + TN + FP + FN)

Precision is the true positive count of the confusion matrix divided by the sum of TP and FP:

Precision = TP ÷ (TP + FP)

Sensitivity, also known as recall, is obtained by dividing TP by the total number of TP and FN.

Sensitivity = TP ÷ (TP + FN)
The F1-score is the harmonic mean of recall and precision, which can be more informative than accuracy.
F1-score = 2 × ((precision × recall) ÷ (precision + recall))

The training dataset has 880 rows and 20 columns, while the test and validation sets contain 220 rows and 20 columns, as shown in Table 4. In this paper, ML techniques such as Gradient Boosting (GB), Adaptive Boosting (AdaBoost), LightGBM (LGBM), Random Forest (RF), CatBoost, SVM, ExtraTree (ET), XGBoost, logistic regression (LR), KNN, Naïve Bayes (NB), and decision tree (DT) are used for classification. DL models, namely ANN and MLP, are also used to predict stress levels.
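The four formulas above can be checked on worked example counts; the values TP=68, TN=61, FP=2, FN=8 are illustrative only, not the paper's full multiclass matrix.

```python
# Worked check of the accuracy/precision/recall/F1 formulas on assumed
# counts (illustrative, not the paper's confusion matrix).
TP, TN, FP, FN = 68, 61, 2, 8
accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)       # sensitivity
f1 = 2 * (precision * recall) / (precision + recall)
```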

Dataset Shape
Training Set: (880, 20); Test Set: (220, 20).

Employing the right machine-learning and deep-learning techniques is essential for developing a highly precise and reliable classifier. The results of hyperparameter optimization over ten folds for the ML models ExtraTree (ET), Random Forest (RF), AdaBoost, and Gradient Boosting (GB) are presented in Table 5; varied parameter grids were used for tuning. The Gradient Boosting model achieved an accuracy of 0.8897 with the optimal parameters (learning_rate: 1, max_depth: 7, and n_estimators: 500). For the AdaBoost model, the best parameters were learning_rate: 0.25 and n_estimators: 500, resulting in an accuracy of 0.8852.
Similarly, the Random Forest model attained its highest accuracy of 0.8988 with the following parameters: max_depth: None, max_features: 1.0, max_samples: 0.75, and n_estimators: 20. Lastly, the ExtraTree model demonstrated an accuracy of 0.8863 with the best parameters: criterion: gini, min_samples_leaf: 4, max_depth: 20, min_samples_split: 4, and n_estimators: 100. In each confusion matrix, the x-axis represents the model-predicted label and the y-axis represents the actual label. To compare different models, we assess how well they predict true positives (TP) and true negatives (TN), and select the model that performs better on these counts as our base model. For instance, the Naïve Bayes model has per-class TP values of 68, 61, and 69, with FN values of 0 and 8 and FP values of 1 and 2, as depicted in Figure 19B.
The ROC curve illustrates the model's performance at various thresholds.The y-axis represents sensitivity, and the x-axis represents the false positive rate.The AUC for the RF model with the highest stress level falls under Class 2, as shown in Figure 15.The loss and accuracy curve for the ANN model is depicted in Figure 17, covering ten epochs.Additionally, Figure 18 displays the accuracy curve for the MLP, showing both training and testing accuracy.
In the first row, there were 76 instances of class 0, of which 68 were correctly identified. In the second row, there were 73 instances of class 1, with the classifier correctly identifying 61. In the third row, there were 71 instances, of which 69 were correctly identified as class 2, as shown in Figure 19B. From these confusion matrices, we can calculate the precision, recall, and F1-score for the high (2), medium (1), and low (0) stress situations, as shown in Table 6.
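The per-class metrics can be recomputed from a confusion matrix with the diagonal counts quoted above (68, 61, and 69 correct out of 76, 73, and 71); how the misclassified instances split across the other classes is an assumption in this sketch.

```python
# Rebuilding a confusion matrix matching the quoted diagonal (68, 61, 69)
# and row totals (76, 73, 71); the off-diagonal split is assumed.
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

y_true = [0] * 76 + [1] * 73 + [2] * 71
y_pred = ([0] * 68 + [1] * 8 +      # class 0 rows: 68 correct, 8 missed
          [1] * 61 + [2] * 12 +     # class 1 rows: 61 correct, 12 missed
          [2] * 69 + [1] * 2)       # class 2 rows: 69 correct, 2 missed
cm = confusion_matrix(y_true, y_pred)
prec, rec, f1, support = precision_recall_fscore_support(y_true, y_pred)
```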

CONCLUSION
This section discusses the final findings of the stress factor identification models. The study uses a combination of machine learning (ML) and deep learning (DL) models to analyze the dataset; these models can identify the various factors responsible for causing stress.
Students facing stress can be identified using models such as SVM, Random Forest, GB, AdaBoost, CatBoost, LightGBM, ExtraTree, XGBoost, LR, KNN, NB, DT, multi-layer perceptron (MLP), and artificial neural network (ANN). The accuracy of the Naïve Bayes model is 90%, while SVM has the lowest test accuracy at 85.45%. The study reveals that the academic period is a critical source of stress for students.
Physiological and psychological factors show moderate stress levels among college and school students, while social and environmental factors are reported to cause the lowest stress. Based on these findings, a stress diagnosis system can be developed for students, ultimately enhancing their performance. This research can be extended in the future by incorporating IoT: wearable devices for stress detection among students may be developed and embedded with a mobile application to process real-time information.

Figure 2 Identifying null, duplicates, and missing values

Figure 3 Proposed Methodology

"
Google Colab" is used for implementation, a cloud-based platform provided by Google.Colab contains well-known data science libraries, such as Keras, TensorFlow, PyTorch, and Scikit-learn.The ANN model was trained for ten epochs and obtained an accuracy of 88.63 %.The final ANN model is shown in Figure 4.The ANN architecture follows the pattern of layers, and each layer contains neurons.The neural network's final model's visual representation is displayed in Figure 5.In this section, the following subsections discussed the student stress dataset.

Figure 4 ANN model at optimum performance

Figure 5 ANN architecture for student stress level

Figure 10 Environmental Factors Correlation Matrix

Figure 15 ROC curve of RF model

Figure 16 ROC curve of ANN model

"Humidity-Temperature-Step count-Stress levels" refers to a dataset with 2001 samples called Stress-Lysis.csv. That paper used machine learning techniques such as stacked ensemble methods (SEM), gradient boosting, and Random Forest, with various sensors (temperature, humidity, and accelerometer) used to measure stress levels. Mittal et al. (2022) focused on how stress is managed in the workplace and in education, identifying all feasible factors responsible for anxiety and depression. Various supervised and unsupervised machine learning approaches were used to detect stress effectively and efficiently in a large population, with the aim of detecting stress before it takes hold in life; deep learning was also applied to stress management in workplace and education settings. In a study conducted by Vallone et al. (2022), student stress levels were examined at four universities in Spain by surveying 331 students during the COVID-19 pandemic; the survey took place from 12 to 19 April 2021. Stress in students was attributed to institutional life, relationships, fear of contagion, and isolation. The COVID-19 Student Stress Questionnaire (CSSQ) used for this study had seven items, and the original three-factor solution of the tool was used.

Table 2
Sample of Student Stress Dataset

Figure 7 Distribution of student stress level

Table 3
Summary Statistics of features

Table 5
Hyperparameter optimization results for ML models

This study aimed to analyze the levels of the PPAES components of student stress. We developed machine learning and deep learning models to predict different stress states with the help of PPAES factors; various ML techniques were trained, and confusion matrices and classification reports were obtained. Figures 15 and 16 show the multiclass RF and ANN model performance using the ROC curve. For the high stress level, the ANN ROC-AUC value is 0.99, which indicates better accuracy. Table 6 displays the NB algorithm's recall, precision, accuracy, and support during the low, intermediate, and high stress phases.

Table 6
Classification Report of NB

The model's performance is evaluated by comparing the F1-score and precision. The Naïve Bayes model achieved 96% precision in the 'no stress' scenario and 78% in the 'high stress' scenario; overall, it has an accuracy of 90%.

Models demonstrated by Rescio et al. (2024) describe the stress of twenty workers through 1D-CNN, LSTM, and GRU deep learning models, obtaining an accuracy of 95.38% (1D-CNN); the dataset used in that paper is not publicly available. A stress detection framework based on ML and IoT was proposed by Bansal and Vyas (2024); stress detection using their MLIoT-ESD technique is more time-consuming than traditional approaches. The above-mentioned papers do not adopt the ML and DL models and the dataset implemented by the authors of this paper. The comparative results of various machine learning techniques are presented in Table 7, which illustrates the performance of SVM, XGBoost, AdaBoost, Random Forest, LightGBM, Gradient Boosting, decision tree, CatBoost, ExtraTree, KNN, Naïve Bayes, and logistic regression, based on the calculated train accuracy, test accuracy, recall, precision, and F1-score of each classifier.

Table 7
Results of machine learning models

Table 8
Results of deep learning Models