Review of fast square root calculation methods for fixed point microcontroller-based control systems of power electronics

Received Sep 23, 2019 Revised Jan 9, 2020 Accepted Feb 16, 2020 Square root calculation is a widely used task in real-time control systems especially in those, which control power electronics: motors drives, power converters, power factor correctors, etc. At the same time calculation of square roots is a bottle-neck in the optimization of code execution time. Taking into account that for many applications approximate calculation of a square root is enough, calculation time can be decreased with the price of precision of calculation. This paper analyses existing methods for fast square root calculation, which can be implemented for fixed point microcontrollers. It discusses algorithms’ pros and cons, analyses calculation errors and gives some recommendations on their use. The paper also proposes an original method for fast square root calculation, which does not use hardware acceleration and therefore, is suitable for implementation at a variety of modern Digital Signal Processors, which have high-speed hardware multipliers, but do not have effective dividers. The maximum relative error of the proposed method is 3.36% for calculation without division, and can be decreased to 0.055% using one division operation. Finally, the most promising methods are compared and results of their performance comparisons are depicted in tables.


INTRODUCTION
Modern control systems of power electronics have extended functionality and large code to be executed. At the same time the most of mass produced devices are subject for cost optimization, therefore developers frequently select cheap microcontrollers, which are not powerful. The bulk of the code including math functions and model of control objects are run every sampling period of the system, which is typically equal to modulation period. The modulation period of the majority of power electronics control systems lies in the interval of 4 -20 kHz, while operating frequency of the cost-effective microcontroller belong to the range of 40 -80 MHz, which gives 2,000-20,000 processor cycles per modulation period. High power devices operate at lower modulation frequencies, while lower power electronics and devices with lower inductances, which are widely used in home appliances, automotive and industry, operate at higher modulation frequencies. The price of high power electronics is also high, so increase of microcontroller's cost in 1 -2 $ does not impact significantly the total price of device, but the cost of the low power units is low, and the pressure form competitors make every cent significant. So developers of power electronics put cheap Digital Signal Processors (DSP) in their solutions and then try to optimize software to be able run in the selected DSP. One of the main milestones in the way of optimization is square root calculation routine.
Square root operation is frequently used in many control systems of power electronics and digital signal processing algorithms. Tasks with square roots can be divided into two groups. Tasks from the first group are executed every sampling period and process newly sampled data. Tasks from the second group are executed rarely (with a frequency less than 120 Hz), when measurement results are needed and process data from multiple samples. First group includes voltage and current vector transformations and normalizations, models of control objects, observers, etc. Second group includes calculation of Root Mean Squares (RMS) and Total Harmonic Distortion (THD), least squares based algorithms, etc. Square root operation is also intensively used in electrical drives, where it is involved into motor models calculation, Maximum Torque Per Ampere (MTPA) control [1], Maximum Torque Per Voltage (MTPV) control, field weakening [2], various observers, Power Factor Correctors (PFC), etc.
Authors of this paper faced with the same problem, which arose, when 80 MHz DSP was substituted with 64 MHz DSP and software operating at 16 kHz failed due to the MCU overload. Analysis of the software revealed several bottle-necks, one of which was square root calculation routine called several times per sampling period and two times in the 60 Hz interrupt. At the same time implementation of the square root calculation routine used digit-by-digit method, which was not perfect and took a lot of processor cycles. Therefore, authors concentrated their efforts on the optimization of time used for square root calculation.
Let's formulate a problem for the discussion. For a given number S, it is necessary to find the number X so that: Modern DSPs for cheap control systems of power electronics are typically 16 or 32-bit with 10 or 12-bit ADC. Therefore, majority of variables in control algorithms are 16-bit and intermediate results of calculations are 32-bit variables. Consequently, X is expected to be 16-bit and S is expected to be 32-bit.
The overwhelming majority of DSPs are designed for fast multiplication with addition, but has no hardware acceleration for the division operation. Consequently, implementation of the division operation is slow and typically takes as many processor cycles as number of bits of the result. For a 16-bit result, one division operation takes about 16 cycles and significantly slows down calculations. Thus, it is desired to minimize the number of divisions in the square root calculation algorithm.
In such a way our purpose is to accelerate the square root calculation with the price of decreased tolerance. As square root operation is typically used for the processing of measured signals corrupted by noise, calculation errors should be similar to the measuring errors, which are typically 0.5 -5%. However some cases demand higher precision, so calculation methods must be able to decrease errors by running one to two additional iterations. Therefore, the calculation method must quickly calculate the square root with errors in the range of 0.5 -5% and must have fast convergence to be able to decrease errors in one to two additional iterations.
Basically, the algorithms for square root evaluation can be divided into two main groups: iterative methods and approximation by real functions. Iterative methods comprise three classes: direct methods, algorithms based on the Newton-Raphson formula and normalization techniques [3].
Authors of [4][5][6][7][8][9][10][11][12][13] reported hardware implementation of non-iterative methods. These algorithms are mainly designed for low-resolution data (8-bit), but operate extremely fast and can be used for pre-processing of sampled data Such methods need additional hardware and cannot be used in general DSP-based algorithms. Authors of [14] used parallel calculation in reported speed increase in more than 38%.
The uncommon approach for square root calculation was reported in [16]. Authors proposed to use iterative execution of the nonlinear Infinite Impulse Response (IIR) filter to obtain the square root of the given number. This method does not involve division operation, but intensively uses multiplication with addition operations. This paper is divided into five sections. In Section II the authors explain existing methods for the fixed point square root calculation, and discuss their pros and cons. Section III proposes a new method and discusses its tolerance. Experimental results and performance of the most promising methods are given in the Section IV. Section V contains conclusions.

EXISTING METHODS
The authors of [3] give classification and extensive review of the general square root calculation methods, but not all of them are suitable for fixed point implementation. Paper [3] can be extended with [15] and [17], which review and analyse only fixed point algorithms.

1155
There are several square root calculations methods, which can be considered to be fast. Some of them utilize hardware features of processor core and can be implemented in the limited number of DSPs, while others are general methods, which do not depend on processor core. This paper reviews the most prospective methods and discusses their pros and cons.

Digit-by-digit calculation
This method for square root calculation at fixed point DSPs, is the most popular as it is easy to understand and implement. Despite relatively slow convergence, many engineers still pay attention to this method, often implementing it in hardware [18] and software [19]. This method can be considered as a basic method and can be used for the evaluation of other methods.
For better understanding the algorithm implementation in binary system, let's consider its operation with decimal numbers, which is often used for manual calculation of square roots with paper and pencil.
Suppose X k is the k-th digit of X, so: The most significant digit X k is the first approximation of the square root X and can be easily found as the highest integer which satisfies: The next digit X k-1 of square root X can be found using the following inequality: which transforms into: Similarly, the next digit can be calculated as: This approach can be used recursively during the next steps to obtain the necessary number of digits. Since every digit of the result X is multiplied by 102k, it's easier to separate S into pairs of digits and calculate Xk for every pair. If S contains odd number of digits, a 0 must be added to the left. Thereafter, every pair is combined with previous step remainder and new digit satisfying: must be found. Let's denote root calculated at the current step as "R" and remainder of the initial value as "Rem". R is initialized with zero, and every iteration of the algorithm calculates the next digit of the root. The flowchart of the algorithm is shown in Figure 1: Let's illustrate this algorithm with the following example: The square root of 54756 is: So answer is 234.
This method can be extended to binary numbers. A binary system is simpler for calculation because the greatest number can be found just by comparing numbers. The only difference is that condition (7) transforms into: This method does not use specific hardware and its execution time does not significantly depend on the processor type. The advantages of this method are precise results (LSB or half LSB if remainder is analysed), simple implementation and absence of division. It calculates root digit by digit, starting from the most significant, so calculations can be stopped before processing of full number S, if acceptable tolerance is reached.
However, the most significant disadvantage of the explained method is calculation time. Despite this, digit-by-digit method is frequently used in the control systems, where the processor is not highly loaded. It was also proposed for implementation in hardware and the authors of [19] enhanced this idea to calculation of other functions and implemented on FPGA.

Polynomial approximation
Square root function is difficult for approximation because its derivative rapidly changes close to zero, which results in high approximation errors. Therefore, square root can be successfully approximated only at the limited interval.
Since approximating polynomial gives appropriate results in the limited interval, the initial number must be scaled to fall inside it. After calculation of the square root of the scaled number, the resulting value must be multiplied by the square root of the scaling value to obtain the correct answer. The authors of [20] published assembler code of the proposed method and claimed maximum execution time of 75 cycles.
This essential scaling operation is not simple for many DSPs and takes time, however processors from Analog Devices have hardware units called an exponent detector, which calculates the amount of the redundant sign bits and shifts initial numbers, so calculation is accelerated. Moreover, sign operation of the exponent detector limits the range of input value by 32768, which is restricting in most cases. Evaluation of the high power polynomial increases calculation error and impacts tolerance of the result. So this method is not recommended for implementation.

Table approximation
Lookup table-based methods are common for function approximation. They are relatively fast, but use a lot of memory to increase tolerance. Bipartite and multipartite methods are excellent examples of table lookups and are described in [18]. Authors propose inserting two lookup tables into the data paths and use one table for the initial seeds, while another one should be used for adding a small correcting value.
The approximation error can also be reduced by interpolation between table points (e.g. linear), but it increases the calculation time and diminishes the main advantage of the method -speed. The flowchart of function approximation with variable segments table and linear interpolation between points is depicted in Figure 2. It indicates that calculation contains one division, which significantly slows down calculation.
The calculation time of the function approximated with table can be significantly decreased, if the table consists of equal segments and a number of them is to the power of two. A flowchart of the corresponding algorithm is depicted in Figure 3. It indicates that calculation was simplified and division had been excluded.
Table approximation perfectly works for the smooth function with smooth derivative and outputs higher errors for the function with rapidly changing derivatives. A typical example of functions, which can be approximated easily, is sine and cosine.
As stated above, square root is not easy for approximation, because its derivative rapidly changes close to zero. Therefore, variable step table approximation or pre-scaling must be used. For this specific reason, Figure 3 contains two scaling blocks.
On the one hand, the tolerance of this method mainly depends on the size of the table and in some cases the size is unacceptably large. On the other hand, this method needs pre-scaling before table is read and after, which diminishes its main advantage -calculation speed.

Newton's method
This method is very popular for numerical calculation because it is simple and has fast convergence. It can be successfully used for the approximation of various functions and calculation operations [21]. Let's consider its application to the calculation of square roots.
Finding X, which is square root of S is the same as solving: According to Newton's method, the next step approximation xi+1 of the function root is: where xi is root approximation at the current step. The calculation process can be illustrated by drawing series of secant lines to parabola as shown in Figure 4. The iterative calculation is stopped, when the necessary accuracy has been reached: However, for simpler calculation, the iterative process can be limited to fixed number of iterations. Thus, the function given in (10), using (11) transforms into: This equation is also known as the Babylonian method [22]. It has quadratic convergence and allows us to calculate square root with low number of operations. However, each iteration of the method contains division, which takes a significant amount of processor time Properly selected initial value can significantly accelerate calculation, so an initial guess x0 is very important. We suggest to locate square root by finding n, so that square root belongs to the interval: This step can be easily implemented in the DSPs and provides advantage of using the power of two, therefore division at the first step can be substituted with bit shift. Then the initial value can be selected during this interval, using any criteria. Typical initial values are middle point of the interval: which is the most probable number, and value, which gives symmetrical relative error at interval borders: where division can be substituted by the bit shift. However, the following steps involve division and are performed according to (13).   Let's calculate the error for every iteration in order to define the number of required steps. The relative error is higher, when X lies at the interval's lower border: X=2n. Relative error of initial approximation: First step approximation: (19) Relative error of the first step approximation: Second step approximation: This tolerance is typically sufficient, and square root calculation needs three iterations, which involve only two divisions.

A two-variable iterative method
This method was proposed by the authors of [23] for one of the first computers. It is applicable to the square root calculation of the number S satisfying 0 < S <3, so it also needs scaling of the initial number.
This method calculates two numbers a and c at every iteration, where a converges to square root of S and c converges to 0.
These numbers are initialized as: and at every iteration they are updated as: It should be noted that for faster calculation S can be scaled to satisfy 0.5 < S <2. The maximum relative error occurs when S = 2. Maximum relative errors for each calculation step are: This method, as Newton's method, has quadratic convergence and typically 3 -4 iterations of (26) is sufficient. This calculation method does not contain division operation and time delays inherent to it.
The problem of this method is multiplication of 32-bit numbers, which is typically slower than 16-bit number multiplication. It also truncates 64-bit numbers to 32-bit, producing calculation errors, which tend to be accumulated.

Goldschmidt's algorithm
One more interesting and prospective method is Goldschmidt's algorithm. Its usual description reflects a hardware orientation with difficulties of implementation in software, however the author of [24] suggests a software-friendly way of computing, which is described below.
Let y0 be a suitably good approximation to S 1 , such as: Initialize variables: And then iterate as follows: Calculation can be stopped, when ri is sufficiently low or after fixed number of iterations. Calculated values converge to: The author of [24] proves that Goldsmith's algorithm, as Newton's method has quadratic convergence, but has an advantage of absence of division operation. The feature of the algorithm is that it simultaneously calculates square root and reciprocal square root. However, if one of these values is not needed, the corresponding equation in (30) can be excluded.
Equations in (30) can be effectively implemented for large numbers of DSPs, which have hardware units for multiplication with addition. However, it should be noted that variables in (30) have significantly different orders, therefore a lot of attention must be paid to their representation.
The disadvantage of this method is the necessity of initial approximation (28), which is quite difficult. The paper [24] suggests using small lookup tables for this purpose, but it takes additional memory. Another drawback of this method is propagation of errors due to rounding. The author of [24] claims that high precision results are usually achieved by starting with Goldschmidt's algorithm and then switching to the self-correcting Newton's iterations. However, for our purpose the tolerance of original algorithm is sufficient.

PROPOSED METHOD
After analysis of advantages and disadvantages of the discussed methods, the authors proposed a combined method, which is far superior to other algorithms.
Authors propose to define an initial root interval as described in (14). Then, initial approximation can be made by drawing the secant between the ends of the root interval and calculation of secant intersection with abscissa axis, Figure 5. The equation of the secant line drawn over two points [x 1 , y 1 ] and [x 2 , y 2 ] is: Its root can be found as: Let's calculate the maximum error Δxmax. From Figure 6 error function for the approximation with secant is: It has maximum at the point: And maximum absolute error is: As can be seen from (40), the error is negative, so initial guess x 0 can be improved by adding shift.
And from (40) maximum relative error is: This error is almost twice less than the first step error in the Newton's method. If precision of (44) is not sufficient, one more calculation step should be performed. It could be performed according to (13), which is simple for implementation, but involves one division operation. Let's find the maximum error, which corresponds to X=2n. Initial approximation: Relative error of the first step approximation:

EXPERIMENTAL RESULTS
For the performance analysis, the most prospective algorithms and digit-by-digit method as a reference, have been selected. Due to the fact that the proposed algorithm is not iterative, it has been enhanced with one iteration of Newton's method in order to obtain higher precision.
Calculation time, operations and the number of processor cycles were calculated for the lower precision calculation (about 2%) and higher precision calculation (about 0.05%). The results were put into Table 1. and Table 2 for the lower and higher precision calculations, respectively. These tables contain main operations and do not include data initialization, data movement and branching, which strongly depend on the compiler optimization and conveyer mechanism of the microprocessor. The real calculation time of the selected methods is typically greater at 8-15 cycles depending on the optimization settings.
For the experimental results Cortex-M3 based microprocessor has been selected, because this core is widely used and does not have specific hardware acceleration. Timings of instruction execution were taken from the technical reference [25]. For calculation of execution time we considered the worst case scenarios, e.g. according to [25] division operation takes 2-12 cycles, depending on the data, so we used 12 for method speed evaluation.  All the multiplications were considered as 32-bit operations, which generally take 1 cycle. Every multiplication operation is supposed to involve one more shift operation for scaling of the result. Some algorithms operate with numbers of different ranges, so they may need 64-bit multiplication, which is executed in 3-7 cycles, but this issue was not considered here.
Initial scaling, which finds n satisfying (14), is performed by four stages of comparison and includes maximum 4 comparison, 3 shifts and 4 additions. Initial conditions for Goldsmith's algorithms are tougher, so it needs one more stage and takes 5 comparison, 3 shifts and 5 additions.
Experimental results prove feasibility of the proposed method and indicate that it is fastest among reviewed algorithm. Furthermore, its simplicity makes implementation in hardware possible, resulting in high speed with acceptable tolerance.

CONCLUSION
This paper analyses the existing methods for square root calculations at fixed points DSPs, and proposes an original method for the approximate calculus with acceptable tolerance. The maximum relative errors for several iterations are calculated, so the number of steps can be selected, depending on the demanded tolerance of the result. The complexity and performance of the proposed method and competitive methods were both checked and compared. The experimental results indicate significant leadership of the proposed method, thus it is recommended for implementation in new algorithms or hardware.