Global solar radiation forecast using an ensemble learning approach

ABSTRACT


INTRODUCTION
The primary objective of utilizing non-conventional energy systems is to mitigate global climate change, broaden access to energy, and improve energy security [1]. Sustainable development becomes possible when sustainable energy is used and all residents are guaranteed access to inexpensive, dependable, sustainable, and modern energy [2]-[4]. Given these pressures and needs, solar power can be an optimal choice among non-conventional energy sources [5]. Advancements in photovoltaic (PV) technologies have significantly reduced the challenges and limitations of solar power: they increase the efficiency of energy conversion and notably reduce panel installation and electricity costs. Solar power is the future, considering that it is an inexhaustible energy source and requires low installation costs [6]. However, PV panels often cannot provide stable electric power output because of variations in weather conditions, so grid stability is reduced significantly when photovoltaic power is integrated into the power grid [7]. As a stable power system is a necessity, a forecasting technique is required for the stable and safe integration of photovoltaic power into the power grid [8]. In addition to helping to stabilize the grid, a precise forecasting model is essential for managing storage, developing an energy road map for congestion management, and estimating reserves. The availability of data has made it possible to use deep learning and machine learning techniques for this purpose [9].
Predictive analysis is crucial for managing energy use optimally, operating power systems safely, and balancing consumption and production [10]. The goal is to introduce and use the most recent technology to create electrical networks that are more secure, effective, eco-friendly, and reliable [11]. Advances in the accurate forecasting of meteorological and hydrological variables such as precipitation, evaporation, temperature, and humidity have made it possible to predict solar energy generation more efficiently [11]. Consumers can use PV prediction models to coordinate their use with on-site power generation and, as a result, maximize their profitability. One of the key advantages of data-driven models is that they take less time to perform predictions and therefore to support decisions on power system planning [12]. Accurate prediction of the energy produced by PV systems has been identified as one of the major challenges, as it allows grid operators to manage electricity generation by making informed decisions, which in turn reduces uncertainty and cost. Several forecasting methods presented in the literature are described in Table 1. One of the most widely used machine learning techniques, the support vector machine, has been applied extensively to building energy and to predictions in non-conventional energy applications.
Even with a small dataset, the approach is highly effective for solving non-linear problems. To reduce the upper bound of the generalization error, which is the sum of the training error and a confidence term, the support vector machine uses the structural risk minimization (SRM) principle [22]. Applying the fundamental principles of the support vector machine to regression problems entails introducing a kernel function, which non-linearly maps the input space into a higher-dimensional feature space, and then performing a linear regression in this feature space [23], [24]. The technique has shown promise to researchers over the years on certain datasets, and it has been reported to outperform the ANN in terms of performance. Shi et al. [25] also suggested a PV forecasting approach based on support vector machines and weather classification. The outcomes demonstrated that the suggested prediction model for grid-connected systems was successful and promising. Yousif and Kazem [26] used support vector machines to model solar photovoltaic power output. The proposed model predicted photovoltaic current using inputs such as solar radiation and ambient air temperature. Kazem and Yousif [27] employed a support vector machine model and evaluated its performance in comparison to multi-layer perceptron (MLP) and generalized feed-forward (GFF) networks.
In artificial intelligence (AI) and machine learning, deep learning models are regarded as a new paradigm of learning. Recent years have seen a substantial increase in interest in deep learning due to its ability to handle complex data. Various artificial neural network (ANN) architectures, from simple networks to more complex models such as auto-encoders [28] and long short-term memory (LSTM) networks [29], have been used successfully to forecast renewable energy. A method based on artificial neural networks was developed by Hiyama and Kitabayashi [30] to predict the maximum power output from a PV system. The authors' input features included solar radiation, wind speed, and outside air temperature [31]. A hybrid multilayer feed-forward neural network was created by Sulaiman et al. [32] to estimate the output from a grid-connected PV system.
The published works show that many approaches have been used to predict solar power or solar radiation. Among the employed techniques, the artificial neural network is one of the most favored. However, an artificial neural network requires the user to provide several model parameters, for instance, the number of hidden layers, the number of neurons in each hidden layer, and the number of training epochs. Two of the most prominent machine learning techniques, support vector machines and artificial neural networks, have shown instability issues [33]. Because of this instability, even slight changes in the input data can cause significant variation in the predicted values. To overcome these instability issues, a more advanced machine learning approach, ensemble learning, was developed.
Ensemble learning is a machine learning technique that involves training many base learners and combining their outputs to address a single problem. The fundamental tenet is that the aggregate output of the weak learners, or base learners, should generally be more accurate than the output of any one learner. Very little research has been done on ensemble techniques for predicting solar radiation; ensemble-based methods such as random forest (RF), gradient boosting, and extreme gradient boosting (XGBoost) have not been extensively studied for this task. Because they are able to overcome the shortcomings of single models, ensemble-based strategies typically outperform the individual learners from which they are built [25]. Ensemble techniques have drawn a lot of interest and are now in demand across various industries. The use of ensemble-based methods with solar photovoltaic systems is motivated by two observations: the majority of earlier research works are centered on regression methods, support vector machines, and artificial neural networks and their variants, and ensemble-based algorithms are more computationally efficient than these widely used algorithms. This paper studies the performance of two major ensemble methods, RF and XGBoost, with hyperparameter tuning, in predicting solar radiation from future weather data produced by the meteorological station.

THEORY OF FORECASTING MODEL
Random forest (RF)
RF is an ensemble-based bagging machine learning algorithm comprising a large number of decision trees, which serve as the base (weak) learners. The working of an RF is depicted in Figure 1. In RF, the performance of the individual base learners, i.e., the decision trees, is improved by aggregating the individual tree results. The hallmark of RF is random row sampling and random feature sampling: each decision tree is trained on a randomly selected set of rows and features from the dataset. A separate cross-validation step is not strictly necessary when using RF, because out-of-bag error estimation can be done as part of the forest-building process.
ISSN: 2088-8694  Global solar radiation forecast using an ensemble learning approach (Debani Prasad Mishra) 499
RF first creates numerous additional training datasets by randomly selecting data from the initial training dataset with replacement. Each new training dataset is the same size as the original one; however, sampling with replacement may result in some observations being duplicated [34]. Each decision tree has high variance because it is trained on a particular sample, but as the final output depends on the aggregation of all the individual models' outputs, the final output has low variance.
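The row-sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bootstrap_sample`, the dataset size, and the seed are hypothetical, and only the standard library is used.

```python
import random

def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n_rows = 1000            # hypothetical training-set size
in_bag, out_of_bag = bootstrap_sample(n_rows, rng)

# The bootstrap sample has the original size, but some rows repeat and
# others are never drawn; the rows never drawn are the tree's out-of-bag set.
print(len(in_bag), len(set(in_bag)), len(out_of_bag))
```

For large datasets, roughly 37% of the rows are left out of any one bootstrap sample. Each tree can be scored on the rows it never saw, which is what makes out-of-bag error estimation possible without a separate hold-out set.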

XGBoost
XGBoost (extreme gradient boosting) is an optimized, distributed ensemble machine learning algorithm built on the gradient boosting framework. The working of XGBoost is depicted in Figure 2. The algorithm originated in a research project at the University of Washington and led to a major advancement in the machine learning domain. XGBoost also uses decision trees as its base (weak) learners. In XGBoost, the decision trees are built sequentially, so each weak learner depends on the ones trained before it [35]. Each new tree is fitted to the errors left by the current ensemble (for a squared-error loss, the residuals), so observations that earlier learners predicted poorly carry more influence when the next learner is trained. The outputs of these weak learners are then combined to produce a model that is more precise and accurate. Its high execution speed and out-of-core computation make XGBoost a favorite among data scientists [36].
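The sequential idea can be sketched with plain gradient boosting on one-split regression trees (stumps). This is a simplified illustration under a squared-error loss, not XGBoost itself, which adds regularization, second-order gradient information, and system-level optimizations. All function names, the toy data, and the settings below are assumptions.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:   # skip the maximum so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)                  # initial prediction: the mean target
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict, history

# Hypothetical toy data: a noiseless curved target.
x = [i / 10 for i in range(50)]
y = [xi ** 2 for xi in x]
predict, history = boost(x, y)
print(history[0], history[-1])   # training MSE after the first and last round
```

Each round only corrects what the previous rounds left unexplained, which is why the learners cannot be trained independently the way the trees of a random forest can.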

Data Description
The dataset consists of meteorological data recorded at the Hawaii Space Exploration Analog and Simulation (HI-SEAS) weather station between Mission IV and Mission V, on which the model is trained and tested. Weather parameters in the dataset include temperature, humidity, time of day, wind speed, sunrise and sunset times, wind direction, and barometric pressure. Consecutive instances in the input dataset are separated by a 15 min interval. The model predicts solar radiation as its output. Combining the model output with the characteristic parameters of the PV panel used gives the solar power output. The hourly values of radiation for each day are shown in Figure 3, and Figure 4 represents the combination of pressure and temperature for different values of radiation. Figure 5 represents the correlation plot between all the weather parameters. Each entry of the correlation matrix is the correlation coefficient between two variables, which describes the extent of the linear relationship between them. The diagonal elements of the correlation matrix have the value 1 because each is the correlation of a weather parameter with itself. Values near 0 indicate that the features are only weakly linearly related, while values near 1 and -1 indicate that the features are strongly related.
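As an illustration of how such a correlation matrix is computed, the following sketch implements the Pearson correlation coefficient with the standard library. The column names and readings are hypothetical and are not values from the HI-SEAS dataset.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(columns):
    """Nested dict holding the coefficient for every pair of columns."""
    names = list(columns)
    return {p: {q: pearson(columns[p], columns[q]) for q in names} for p in names}

# Hypothetical readings for illustration only.
columns = {
    "Temperature": [48.0, 50.0, 55.0, 61.0, 63.0],
    "Humidity": [90.0, 85.0, 70.0, 60.0, 55.0],
    "Radiation": [1.2, 35.0, 310.0, 720.0, 905.0],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

The diagonal entries equal 1 and the matrix is symmetric, so only the entries on one side of the diagonal carry new information.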

Evaluation indices
Root mean square error (RMSE)
The standard deviation of the residuals is known as the root mean square error (RMSE). A residual measures how far a data point lies from the regression line, and RMSE measures how spread out these residuals are. In other words, it indicates how concentrated the data are around the line of best fit. RMSE is commonly used to verify experimental results in climatology, forecasting, and regression analysis. The formula is:

$$\mathrm{RMSE} = \sqrt{\overline{(f - o)^2}}$$

where $f$ denotes the predicted values (model output) and $o$ denotes the actual values. The bar above the squared differences indicates the mean. The same formula can be written in a slightly different notation as follows:

$$\mathrm{RMSE}_{fo} = \left[ \frac{1}{N} \sum_{i=1}^{N} \left( z_{f_i} - z_{o_i} \right)^2 \right]^{1/2}$$

where $(z_{f_i} - z_{o_i})^2$ is the squared difference between the i-th predicted and actual values, and $N$ is the sample size.
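The RMSE definition translates directly into a few lines of code. The sketch below is a minimal standard-library version; the function name and the sample values are hypothetical and are not results from the study.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(f - o) ** 2 for f, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical radiation values for illustration only.
observed = [100.0, 250.0, 400.0]
predicted = [110.0, 240.0, 430.0]
print(rmse(predicted, observed))
```

RMSE is expressed in the same units as the target variable, so it should be read against the typical magnitude of the measured radiation.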

R² score
The coefficient of determination, also known as the R² score, is used to evaluate the accuracy of a regression model. It measures the proportion of the variation in the observed values that the model's predictions explain, and it is expressed as:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where $SS_{tot}$ denotes the total sum of squares and $SS_{res}$ denotes the sum of squares of the residual errors.
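A minimal sketch of the coefficient of determination, assuming the usual definitions of the residual and total sums of squares. The function name and the sample values are hypothetical.

```python
def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical radiation values for illustration only.
observed = [100.0, 250.0, 400.0]
predicted = [110.0, 240.0, 430.0]
print(r2_score(observed, predicted))
```

A score of 1 means the predictions match the observations exactly, while a model that always predicts the mean of the observations scores 0.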

Hyperparameter tuning
Hyperparameter tuning consists of finding a set of optimal hyperparameter values for a learning algorithm and then applying the tuned algorithm to a dataset. That combination of hyperparameters maximizes the model's performance by minimizing a predefined loss function, which yields better accuracy. Hyperparameters are specific to the algorithm itself, so their values cannot be calculated from the data; instead, they are set before training and govern how the model parameters are estimated.

Random search cross-validation
Random search is an effective method for discovering a good set of hyperparameters for a machine learning model. Using random draws from a specified set of hyperparameter distributions, the randomized search meta-estimator trains and assesses a number of models, scoring each by cross-validation. After training N distinct models with various randomly chosen hyperparameter combinations, the algorithm keeps the best-performing version of the model it has seen, yielding a model trained on a nearly ideal set of hyperparameters.
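The procedure can be sketched end to end with a deliberately simple model. The example below tunes a single hyperparameter, the number of neighbours `k` of a one-dimensional k-nearest-neighbour regressor, by random search with 5-fold cross-validation. The model, the data, the search range, and all names are assumptions chosen to keep the sketch self-contained; they are not the models tuned in this work.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def cv_rmse(data, k, n_folds=5):
    """Mean RMSE of the k-NN model over n_folds cross-validation folds."""
    scores = []
    for fold in range(n_folds):
        test = data[fold::n_folds]   # every n_folds-th row, starting at `fold`
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
        scores.append(math.sqrt(sum(errors) / len(errors)))
    return sum(scores) / len(scores)

def random_search(data, n_iter, rng):
    """Try n_iter randomly drawn values of k; return the best (k, score)."""
    best = None
    for _ in range(n_iter):
        k = rng.randint(1, 30)       # draw from the search distribution
        score = cv_rmse(data, k)
        if best is None or score < best[1]:
            best = (k, score)
    return best

rng = random.Random(42)
# Hypothetical noisy data standing in for a real training set.
data = [(x / 10, math.sin(x / 10) + rng.gauss(0, 0.2)) for x in range(200)]
best_k, best_score = random_search(data, n_iter=10, rng=rng)
print(best_k, best_score)
```

The same loop extends to several hyperparameters by drawing each one independently per iteration. Only `n_iter` configurations are evaluated, regardless of how large the search space is.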

Optuna
Optuna is a software framework for automating the optimization of hyperparameters. It searches for the best hyperparameter values using one of several samplers, including grid search, random search, Bayesian, and evolutionary algorithms. The choice of machine learning algorithm can itself be treated as a hyperparameter in Optuna, in which case the search returns the algorithm that gives the best result along with its hyperparameters.

Random forest (RF) regression model
Training and Hyperparameter tuning
The data initially has an x*y format, where x stands for the number of features and outputs and y for the overall number of instances. In this phase, the dataset is separated into training and testing sets. The effectiveness of the studied RF algorithm depends on the adjustment of its hyperparameters, i.e., the number of trees, the number of features to consider at every split, the minimum number of samples required to split a node, the maximum number of levels in the tree, the minimum number of samples required at each leaf node, and bootstrap (whether each tree's training data is randomly sampled with replacement, using the statistical resampling approach known as "bootstrapping"). Random search CV hyperparameter tuning gives the best combination of parameters found, for a more accurate model.
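The six hyperparameters listed above can be written down as a search space. In the sketch below, the keys follow the argument names of scikit-learn's `RandomForestRegressor`. This is an assumption, since the text does not state which library was used, and the candidate values are illustrative rather than the ones searched in this work.

```python
import random

# Hypothetical search space; the candidate values are illustrative only.
search_space = {
    "n_estimators": [100, 200, 400, 800],    # number of trees
    "max_features": ["sqrt", "log2", 1.0],   # features considered at every split
    "min_samples_split": [2, 5, 10],         # samples required to split a node
    "max_depth": [10, 20, 40, None],         # maximum number of levels per tree
    "min_samples_leaf": [1, 2, 4],           # samples required at each leaf node
    "bootstrap": [True, False],              # sample rows with replacement?
}

def draw_configuration(space, rng):
    """Pick one candidate value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}

rng = random.Random(7)
candidate = draw_configuration(search_space, rng)
print(candidate)
```

This grid contains 864 combinations, so a random search that evaluates a few dozen draws covers only a small fraction of it. A dictionary of this shape is also the form that scikit-learn's `RandomizedSearchCV` accepts as `param_distributions`.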

Prediction and result
The predictive performance of the RF regressor model is illustrated in Figure 6. The graph plots the radiation values predicted by the RF model at different Unix times against the measured (actual) values in the testing dataset. The outcomes illustrate the degree of linear relationship between predicted and measured values and demonstrate how accurately the model can forecast solar radiation. At some Unix times, larger discrepancies between actual and predicted values are seen due to higher variation in solar radiation. In spite of that, the built RF model showed strong non-linear mapping and generalization ability and can be efficient in the prediction of solar radiation. The model was evaluated using the R² score and RMSE evaluation indices, which came out to be 0.809 and 108.55, respectively.

XGBoost regression model
Training and Hyperparameter tuning
The data initially has an x*y format, where x stands for the number of features and outputs and y for the overall number of instances. In this phase, the dataset is split into a training set, a validation set, and a testing set. While model hyperparameters are fine-tuned, the validation set is used to provide an unbiased evaluation of a model fit on the training dataset. Adjusting the hyperparameters increases the accuracy of the XGBoost model. These include the learning rate (how quickly the model learns), early stopping (the validation metric must improve at least once within the given number of rounds for training to continue), evals (a list of validation sets for which metrics are evaluated during training), the depth of the tree, and 'num_boost_round' (the number of trees to build). Optuna hyperparameter tuning is used to optimize these values. During optimization, a new set of parameters is created at each iteration and its loss value is evaluated. The set of parameters with the lowest loss value is chosen as the best set, and the model is then built using it.
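The early-stopping rule described above can be sketched independently of any library. The function below is a simplified illustration of the idea, not XGBoost's internal code. The function name, the `patience` argument, and the loss values are hypothetical.

```python
def early_stopping_round(validation_losses, patience):
    """Return (best_round, stop_round) for a sequence of per-round losses.

    Training stops once `patience` rounds have passed without a new
    minimum of the validation loss.
    """
    best_round, best_loss = 0, float("inf")
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_round, best_loss = round_index, loss
        elif round_index - best_round >= patience:
            return best_round, round_index
    return best_round, len(validation_losses) - 1

# Hypothetical validation RMSE after each boosting round.
losses = [150.0, 131.0, 124.0, 122.5, 122.9, 123.4, 124.8, 126.0]
best_round, stop_round = early_stopping_round(losses, patience=3)
print(best_round, stop_round)
```

Keeping the model from the best round, rather than the last one, limits overfitting to the training set. This is the reason a separate validation set is held out in addition to the testing set.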

Prediction and result
The predictive performance of the XGBoost model is illustrated in Figure 7. The graph plots the radiation values predicted by the XGBoost model at different Unix times against the measured (actual) values in the testing dataset. The outcomes illustrate the degree of linear relationship between predicted and measured values and demonstrate how accurately the model can forecast solar radiation. The statistical evaluation indices, R² and RMSE, were used to appraise the model's performance, as represented in Table 2. The XGBoost model is observed to show higher fluctuation in predicting solar radiation than the RF model. The model's R² value came out to be 0.64 and its RMSE to be 122.12.

CONCLUSION
In this research work, the practicability of deploying tree-based ensemble methods (RF and XGBoost) to predict solar radiation, from which the photovoltaic system power output can in turn be evaluated, was investigated. The capability of ensemble methods for predicting solar radiation has been verified through the models' prediction performance. Ensemble algorithms were shown to marginally outperform other popular machine learning techniques. The work also aimed to use tree-based ensemble methods to explain the significance of the input attributes. Based on several weather parameters, the developed machine learning models can be used to forecast solar radiation. Both RF (through its internal out-of-bag estimation) and XGBoost support cross-validation and can be used to manage datasets with large dimensions. The modeling strategy is demonstrated to be a reliable one that can be used for real-time solar radiation prediction. There is still room for improvement, which can lead to more accurate models. One future direction is to extend the existing work to more generalized datasets.