The Surface Energy Balance Algorithm for Land (SEBAL) framework in Graphics Processing Units (GPU) using Cuda and OpenCL

Evapotranspiration (ET) is the second largest component of the terrestrial hydrological cycle on a global scale. The estimation of ET at regional scale using satellite imagery and algorithms allow the conversion of instantaneous measures in daily totals which represents a major contribution to the study of energy exchange between the biosphere and atmosphere. One of the most important remote sensing algorithms used to estimate the spatial partitioning of energy is the SEBAL Surface Energy Balance Algorithm for Land. The time processing of this algorithm can be very high depending on the size of the image. Thus, the objective of this study was to propose a framework of the use of graphic processing units (GPU) to run the SEBAL to increase the performance and reduce the time required to run it. For implementation of SEBAL’s framework was used local data collected in a micrometeorological tower installed in an area located in the Northern Pantanal (Brazil) and Landsat 5 – TM image. We used the Model Maker function of the ERDAS 9.2 Software to run SEBAL algorithm, and its implementation in Java, CUDA and OpenCL languages. A daily ET estimated by SEBAL in JAVA, CUDA and OpenCL languages were similar to the value performed by ERDAS software. The daily ET computed by the ERDAS software was 10% lower and the SEBAL computed in JAVA language, CUDA and OpenCL was 9% lower than the ET obtained by the Bowen ratio method. The results showed that CUDA was better due to the Speed Up performance up to 35,000 times.


Introduction
The water demand is growing rapidly in the world, suppressing their supply, which challenges its availability and food production (Vörösmarty et al., 2010;Burek et al., 2016). Agricultural activities depending directly on water consumption for its development affect population growth and its demands on food, which is competing with the water supply in industries, homes and environmental (Vörösmarty et al., 2010;FAO, 2018). It is estimated an average rate of 4.7% growth of the agricultural area in Brazil per year over the next decade, one of the largest in the world, and much of that growth is in the Cerrado, Pantanal and Amazon (Menke et al., 2009;Soares et al., 2020).
The evapotranspiration (ET) has been the best indicator to assess the water consumption and water use for human activities (Leivas et al., 2011;Angelini et al., 2017). The ET of land surface is an important component of energy and water balance.
It is the second largest component of the terrestrial hydrological cycle on a global scale, accounting for about 60% of the precipitation from the surface to the atmosphere (Mu et al., 2011). In addition, ET is an important energy flow, which uses more than half the total solar energy absorbed by the surface (Mu et al., 2011). Therefore, the correct characterization of ET is crucial in the study of terrestrial ecosystem in climate dynamics and the hydrological cycle (Allen et al., 2011;Danelichen et al., 2020).
Recent studies have been demonstrated the possibility of estimating ET at regional scale through observations of remote sensing data combination with surface atmospheric data (Tian et al., 2013;Angelini et al., 2017). The perimeters of the ET on a regional scale, based on satellite images and algorithms that allow the conversion of instantaneous measurements in total daily evapotranspiration represents a major contribution to studies of energy exchange between the  (Biudes et al., 2009;Allen et al., 2011;Bégué et al., 2018). One of the most useful remote sensing algorithms and widely used in studies of the partitioning of energy in soil-plant-atmosphere system is the Surface Energy Balance Algorithm for Land (SEBAL) (Bastiaanssen et al., 1998;Bastiaanssen, 2000;Glenn et al., 2007;Kalma et al., 2008;Allen et al., 2011;Bégué et al., 2018). This algorithm uses satellite images and a few surface data as input for estimating the net radiation and soil, sensible and latent heat fluxes. Once implemented and validated the use of these has the advantage of providing the energy balance at the effective and economically surface without the need of ground stations. Although remote sensing depends on the orbital sensor that feeds data channels, it provides a wide spatial coverage and good spatial resolution (Allen et al., 2011;Mu et al., 2011;Cardim et al., 2018). The SEBAL has been used to estimate the effective albedo, surface temperature, and surface emissivity as well as the energy balance components with Landsat -5 (Allen et al., 2011;Mendonça et al., 2012;Ferreira Junior et al., 2013;Danelichen et al., 2020) and the MODIS sensor aboard the Terra satellite/Aqua (Mu et al., 2011;Ruhoff, 2011;Tian et al., 2013).
The processing time of SEBAL can be high, especially if we use a complete satellite image due to the number of pixels that an image contains can spend around 100 million pixels (image 12000 by 8600 pixels), generating a file at least 700 MB in size, so the time of the calculations can be unsatisfactory (Oliveira et al., 2018). The use of parallel processing (with hundreds of cores) performing tasks at the same time in computing the energy balance by remote sensing can significantly reduce the processing time than central processing units.
In this sense the objective of this research was to implement the use of graphics processing units (GPU) to compute the SEBAL algorithm to increase performance and reduce the computation time required.

Surface Energy Balance Algorithm for Land (SEBAL)
The ET is obtained after conversion of the instantaneous value of the latent heat flux -LE (W m -2 ), in total ET (mm) in for 24 h in SEBAL (Bastiaanssen, 2000). The LE is obtained as the residue from the classical equation of energy balance at the surface, given by Equation (1). = + + (1) where Rn is the net radiation, LE is the density of latent heat flux, H is the density of sensible heat flux, and G is the density of heat flow in the soil, all in W m -2 .
The net radiation at the surface (Rn) is computed using the following Equation (2) of the radiation balance of surface radiation (Allen et al., 2002;Silva et al., 2005).
(2) where R s↓ is the incident shortwave radiation (W m -2 ), αsup is the albedo of each pixel (dimensionless), R L↓ is the long wave radiation emitted by the atmosphere in the direction of each pixel (W m -2 ), R L↑ is long wave radiation emitted by each pixel (W m -2 ) and ε 0 is the emissivity of each pixel.
The heat flux on the ground (G) is the rate of heat storage in the soil and vegetation due to conduction. The G is calculated by Equation (3) using the following empirical equation developed by Bastiaanssen (2000) representing values nearby midday.
where T s is the surface temperature (°C), α is the surface albedo and NDVI index is the normalized difference vegetation. For the purpose of correction of the values of heat flow to the ground water bodies (NDVI < 0) can be used the following expression: G = 0.3Rn used by Silva et al. (2006). The sensible heat flux (H) is the rate of heat loss to the air by convection and conduction due to a difference in temperature. The H is estimated based on the wind speed and the surface temperature by using an internal calibration of the difference in temperature of the air (2 m) and surface temperature. The H is a function of the temperature gradient, surface roughness, and velocity of wind (Equation 4). To facilitate the computation, we used two pixels "anchors" (as the values of H can be predicted and estimated dT) and the wind speed at given height (Bastiaanssen et al., 1998;Mendonça et al., 2012).
where ρ is air density (kg m -3 ), c p is the specific heat of air (1004 J Kg -1 K -1 ), dT (K) is the temperature difference, and r a is the aerodynamic resistance to heat transport (s m -1 ).
Data requirements of the SEBAL The SEBAL requires three input data sets: a) Estimated data using satellite images: Vegetation indices (NDVI, SAVI and LAI), surface albedo, emissivity, and surface temperature. b) Data obtained by the weather station: air temperature, atmospheric pressure, and wind speed. c) Remotely sensed data, such as long wave radiation and shortwave radiation.

Data for testing
In this study we used data from a tower of the Federal University of Mato Grosso (UFMT), installed in an area located in the private protected area of the Social Service of Commerce (RPPN SESC Pantanal) located in coordinates 16°39'50''S and 56º47'50''W, whose altitude is 120 m, in Barão de Melgaço-MT, distant 160 km from Cuiabá -MT, Brazil (Figure 7). We used a Landsat-5 TM image of study area on 4 September 2008 (presenting clear sky conditions), provided by the National Institute for Space Research -INPE (http://www.dgi.inpe.br / CDSR /). The Bands 1, 2, 3, 4, 5 and 7 of the Landsat-5 TM has spatial resolution of 30 x 30 m, and the band 6 (thermal) has spatial resolution of 120 x 120 m, the Landsat-5 TM has temporal resolution of 16 in 16 days, and their passage to 9:30 pm local time. The image had a size of 1,688 x 1,279 pixels. All pre-processes (stacking, georeferencing, and clipping) were all performed using the software ERDAS 9.2. The tests were run on the GPU using a NVIDIA 520m with 48 cores at 1.5 Ghz and NVIDIA GeForce GTX 560 with 384 cores at 1.5 Ghz.

Graphic processing unit
The architecture of the GPU's circulating around a vector scalable multi-threaded, streaming multiprocessor (SM) (NVIDIA, 2011), and each Streaming Processor (SP) manages memory allocation, synchronization, and communication between the SP's (Papakonstantinou et al., 2009), which facilitates access the parallelism of the GPU's. Performing instructions on the GPU pipeline uses, and each stage is performed by a specific hardware parallel. However, there is a possibility of the stages run on a single hardware using programmable unit (Owens et al., 2008).

OpenCL
OpenCL is an open standard for parallel programming of heterogeneous platforms, providing access to high-and low-level processes in parallel devices. OpenCL works on both the CPUs and the GPUs from various manufacturers (NVIDIA, 2009;Augusto and Barbosa, 2013). OpenCL is a framework that also includes language, libraries, and API (Khronos, 2011). Each device on the OpenCL (Figure 1) has p compute units (CU), which in turn is composed of q processing elements (PE) (Augusto and Barbosa, 2013). Figura 1. Concept of device architecture (Augusto and Barbosa, 2013).
The OpenCL performs parallelization on the GPU as shown in Figure 2, assuming that associated values for X and Y, then the multiplication of X*Y determines the number of work-items (processes or threads) that will exist on the card. By grouping a number p of work-items, determine if a workgroup, which is performed using the SIMD architecture. This division can be made between one and three dimensions and required that the result of multiplying the amount of workgroups for each dimension (Equation 8), shall not exceed the number of work-groups maximum board (Equation 9). Figura 2. Work-items and workgroup of the OpenCL (9) where WG T is the number of full workgroups supported by the video card and D i (i = 1, 2, 3) represents the number of workgroups that dimension. Another requirement is that the number of work-items in each dimension must be a multiple of the amount of workgroup of the size in question (Equation 10), or the division should result in an integer Z.

ℤ =
where WI i is the number of items in the workdimension (i = 1, …, 3). With this structure can divide tasks into smaller tasks that are executed simultaneously on different work-items, generating performance gain applied to solving the problem (NVIDIA, 2009).

CUDA
CUDA is an architecture and programming model for parallel plates NVIDIA GPU to solve complex computational problems in a more efficient than on a CPU. It has also been developed to support multiple languages such as CUDA C, CUDA Fortran, and OpenCL (NVIDIA, 2011).
The CUDA-works gathering threads created by the program block ( Figure 3) and each block is then executed independently of the other in each core (NVIDIA, 2011), as shown in figure 4. The blocks are also divided into grids. There are some restrictions in CUDA and OpenCL. One is that the total of multiplying the number of threads per block in each dimension may not exceed the total number of threads per block supported by the plate (Equation 11).
≥ 1 × 2 × 3 (11) where T T is the total number of threads per block supported by the plate and T i (I = 1, ..., 3), is the number of threads per block of size i. Another constraint is that the number of grids by size cannot exceed the number of grids that dimension borne by the plate (Equation 12). ≥ (12) where G i is the number of grid dimension (i = 1, ..., 3) supported by the plate, and g i is the number of grids defined in dimension i in developed program (NVIDIA, 2011).

Framework development for SEBAL
The framework was developed in order to provide three distinct resources for the developer to use for the calculation of SEBAL. One for the pure Java language, another using CUDA and another for OpenCL. All based on LandSat 5 satellite images. The framework was divided into 3 distinct parts: 1 -Pre-processing: it is the component where LandSat images are processed to generate the initial SEBAL information 2 -Calculation of pixels "hot" and "cold": this is the part where you can calculate the pixels that will serve as input for determining the energy balance.
3 -Processing: it is the component that receives all outputs from steps 1 and 2 and performs all SEBAL calculations with 3 Java, CUDA and OpenCL languages. Figure 4 shows the class diagram that represents the framework. The LandSat class is where all the pre-processing of images takes place with the output of NDVI, SAVI, LAI, albedo, balance of radiation, emissivity, surface temperature and long wave radiation. The Sebal abstract class has a method for calculating the coefficients of the "hot" and "cold" pixels. And the SebalImpl, SebalCUDA and SebalOpenCL classes represent the SEBAL calculation using Java, CUDA and OpenCL respectively. Along with these classes, a class called SebalRun was developed that has the function of executing all the necessary procedures for the execution of SEBAL receiving only the input images and the type of execution that the developer wants, for example, Java, CUDA or OpenCL. The Algorithm 1 shows how to use the framework to calculate SEBAL using all available languages. After the implementation of the entire framework, we performed some tests to verify its performance in relation to ERDAS, and for that we use a performance measure described below.  System.out.println("Java"); datas = sebal.calculate(arqs, ExecutionEnum.NORMAL); Utilities.exportTXT(System.getProperty("user.dir") + "/OutputSebal", datas);

Performance measurement
We used the speedup in this study for measures performance (Equation 13) (Staroba, 2004). The speedup of a parallel system can be defined as the ratio between the time spent by a single processor to solve a given problem and the time spent by a parallel System of n processors to solve the same problem (Hwang, 1993).
where T 1 is the execution time with a processor and T n is the execution time with n processors.

Statistical analysis
The data are normally distributed by the Kolmogorov-Smirnov tests. The evaluation of ET estimated by SEBAL via function ModelMaker software ERDAS 9.2 image processing algorithm implemented in the JAVA GPU was performed using some indicators: accuracy -Willmott index "d" (Equation 14); root mean square error "RMSE" (Equation 15) and the mean absolute error "EMA" (Equation 16). The accuracy is related to the remoteness of the estimated values from those observed (ERDAS software). Mathematically this approach is given by a designated index of agreement that can be widely applied to the comparison between models (Willmott et al., 1985). Their values range from zero, with no agreement, 1 with perfect agreement.
where P i is the estimated value, the observed value O i and O is the average of the observed values. The RMSE indicates how the model fails to estimate the variability of measurements around the mean and measures the change in the estimated values around the measured values (Willmott & Matsuura, 2005). The lower limit of RMSE is 0, which means that there is full compliance between the model estimates and measurements.
The EMA indicates the distance (deviation) of the mean absolute values estimated from the values measured. Ideally, the values of the EMA and NDE were near zero (Willmott & Matsuura, 2005).

Results and discussion
We have implemented the SEBAL in three languages: Java, CUDA, OpenCL which allows comparison of the time response between them. However, we automated the calculation of the A and B coefficients of SEBAL which facilitates the calculating process of the SEBAL algorithm by Java, CUDA and OpenCL languages. This automation refers to the calculation of the temperature variation of the hot and cold pixel, thereby reducing the possibility of errors that previously its calculation was performed manually in a spreadsheet (Allen et al., 2002;Danelichen et la., 2020). Another important automation was to obtain evapotranspiration in a single step, as it was obtained in several steps before automation (Santos et al, 2010;Lima et al., 2021).
The estimated time was only the execution time of each test, disregarding the time of reading and writing, not part of the scope of this work. Each test was run several times and the highest and lowest time were removed and made the average response time for analysis. The time response was reduced at least 400 times independently from the resource utilized ( Figure 5). The best performance for processing SEBAL in ERDAS software was obtained with the CUDA with fastmath (fm) option and 10 stream multiprocessors (SM), featuring an increase of approximately 36,000 times. The fastmath option had more effect on board with 48 cores than stream multiprocessor configuration. Moreover, the stream multiprocessor option had more effect on the board with 384 cores than fastmath configuration. It is noteworthy that the fastmath option in compiling CUDA makes GPU uses optimized implementations for some functions, such as log and divide (NVIDIA, 2011). The time response was reduced from approximately 7 to 90 times using Java code ( Figure 6). The best performance to processing SEBAL in relation to the implemented in Java was obtained by CUDA with fastmath (fm) option and 10 stream multiprocessors (SM). The fastmath option had more effect on the board with 48 cores than stream multiprocessor configuration. Moreover, the stream multiprocessor option had more effect on the board 384 cores than fastmath configuration.   The daily estimated evapotranspiration (ET) by SEBAL using JAVA language, CUDA and OpenCL and the algorithm executed by ERDAS software had errors near to zero and correlation coefficient and Willmot concordance index were close to 1 (Figure 7). The values obtained in this study showed a correlation of 0.99, EMA of 0.136, RMSE of 0.065 and d equal to 0.99. These values were due to the number of dT interactions to stabilize aerodynamic resistance (Mendonça et al., 2012;Danelichen et al., 2020). When computed for a spreadsheet the number of iterations required was 16, stabilizing an error of 2.2%. By the other hand, the running algorithm by JAVA, CUDA and OpenCL languages in which the interaction processes were automated, the number of interactions was 31 with an error less than 0.01%. Biudes (2009) obtained daily ET of 4,863 mm for the same day chosen in this research using the Bowen ratio method to data from the meteorological tower of RPPN SESC Pantanal. The daily ET computed in this study using the software ERDAS (ET = 4.3605 mm day -1 ) was 10% lower than the ET obtained by Biudes (2009) and 9% lower than the value obtained by SEBAL using the JAVA language, CUDA and OpenCL (ET = 4.383 mm day -1 ).

Conclusions
Three new implementations of SEBAL algorithm using Java, CUDA and OpenCL were introduced. These implementations were tested using a consistent data set and the codes were timed. At best case, it was possible to achieve about 36,000 (90) times speed up (going from 8 hours of work to 2 minutes including the time needed to read the input data from hard drive and store it back). This speed up was achieved by using only one CUDA-enable card. To further speed up the program one can use multiple NVIDIA card and in other platforms using OpenCL. The daily ET values estimated by Java, CUDA and OpenCL languages were like ET estimated by SEBAL algorithm in ERDAS software.

Software availability
• Software Name: JGPUSebal • Developers: This paper authors.