Thursday, November 13, 2014

0.1 Float > 0.1 Double ?

  • Introduction:

You need not worry about using any of the comparison operators with floating-point numbers (float, double, and decimal). The ==, <, >, <=, >=, and != operators work just fine with these numbers. But it is important to remember that they are floating-point numbers, rather than real numbers or rational numbers or any other such thing.

In pure (real) math, every decimal number has an exact binary equivalent, even if it needs infinitely many digits. In floating-point math, with its finite number of bits, this is just not true! Consider the following example. Let:
double d = 0.1;
float f = 0.1;
Should the expression f > d return true or false? The rest of this article works out the answer.
  • 0.1 in Binary:

Many new programmers become aware of binary floating-point after seeing their programs give odd results:
“Why does my program print 0.10000000000000001 when I enter 0.1?”
“Why does 0.3 + 0.6 = 0.89999999999999991?”
“Why does 6 * 0.1 not equal 0.6?”
The answer is that most decimals have infinite representations in binary. Take 0.1 for example. It’s one of the simplest decimals you can think of, and yet it looks so complicated in binary:

Decimal 0.1 In Binary ( To 1369 Places) - Photo by [2]
The bits go on forever; no matter how many of those bits you store in a computer, you will never end up with the binary equivalent of decimal 0.1.
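The odd results quoted above can be reproduced with a few lines of C. This is only a minimal sketch, assuming IEEE 754 64-bit doubles evaluated in double precision (the common case on current toolchains); the function names are made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helpers; volatile keeps the arithmetic at run time. */
double sum_03_06(void) {
    volatile double a = 0.3, b = 0.6;
    return a + b;
}

double six_times_tenth(void) {
    volatile double tenth = 0.1;
    return 6 * tenth;
}

/* Prints each value with 17 significant digits, enough to tell doubles apart. */
void print_odd_results(void) {
    printf("0.1       -> %.17g\n", 0.1);
    printf("0.3 + 0.6 -> %.17g\n", sum_03_06());
    printf("6 * 0.1   -> %.17g (equal to 0.6? %s)\n",
           six_times_tenth(), six_times_tenth() == 0.6 ? "yes" : "no");
}
```

Calling print_odd_results() from main shows that neither sum nor product lands exactly on the decimal value one would expect.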
0.1 is one-tenth, or 1/10. To show it in binary, divide binary 1 by binary 1010, using binary long division:
Computing One-Tenth In Binary - Photo by [2]

The division process would repeat forever because 100 reappears as the working portion of the dividend. Recognizing this, we can abort the division and write the answer in repeating bicimal notation as 0.0(0011), where the block 0011 in parentheses repeats forever.
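The long division above can be sketched in code: double the remainder, emit a 1 whenever the divisor fits, and carry on with what is left. The function name and signature below are assumptions chosen for this illustration.

```c
#include <stddef.h>

/* Writes the first n binary fraction digits of num/den (with num < den)
 * into out, followed by a terminating NUL. out must hold n + 1 chars. */
void binary_fraction_digits(unsigned num, unsigned den, char *out, size_t n) {
    unsigned rem = num;
    for (size_t i = 0; i < n; i++) {
        rem *= 2;                 /* bring down the next binary place */
        if (rem >= den) {         /* divisor fits: digit is 1 */
            out[i] = '1';
            rem -= den;
        } else {                  /* divisor does not fit: digit is 0 */
            out[i] = '0';
        }
    }
    out[n] = '\0';
}
```

For 1/10 the remainder cycles through the same values over and over, so the digits after the binary point come out as 000110011001 and so on, however large n is.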
When working with floating-point numbers, it is important to remember that they are floating-point numbers, rather than real numbers or rational numbers or any other such thing. You have to take into account their properties and not the properties everyone wants them to have. Do this and you automatically avoid most of the commonly-cited "pitfalls" of working with floating-point numbers.
  • Floating Binary Point Types :

Float and Double are floating binary point types. In other words, they represent a number like this: 10001.10010110011.
Decimal is a floating decimal point type. In other words, it represents a number like this: 12345.65789.

Precision is the main difference: Float has about 7 significant decimal digits (32 bits), Double has 15-16 (64 bits), and Decimal has 28-29 (128 bits).
Decimals have much higher precision and are usually used within financial applications that require a high degree of accuracy. Decimals are much slower (up to 20 times in some tests [4]) than a double/float. A Decimal cannot be compared with a Float or Double without a cast, whereas a Float can be compared with a Double directly.
  • Question Answer :

0.1 cannot be represented exactly in binary, so both variables hold approximations: double keeps 15 to 16 decimal digits of precision, and float only about 7. Which approximation is larger depends on how each one is rounded.
I'd say the answer depends on the rounding mode when converting the double to float. float has 24 binary bits of precision, and double has 53.
In binary, 0.1 is:
0.1₁₀ = 0.0001100110011001100110011001100110011001100110011…₂
             ^        ^         ^   ^
             1       10        20  24

The digits that follow the 24th significant digit are 1100…, which is more than half a unit in the last place, so round-to-nearest rounds up at the 24th digit and we get:
0.1₁₀ ≈ 0.000110011001100110011001101₂
             ^        ^         ^   ^
             1       10        20  24

which is greater than both the exact value and the more precise approximation at 53 digits.
So yes, the float 0.1 is greater than the double 0.1. The expression f > d returns true!
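The conclusion is easy to check in C. The sketch below assumes IEEE 754 float and double with the default round-to-nearest mode; the function names are hypothetical.

```c
#include <stdio.h>

/* Returns 1 when the float nearest 0.1 exceeds the double nearest 0.1. */
int float_tenth_exceeds_double_tenth(void) {
    volatile double d = 0.1;
    volatile float f = 0.1;   /* the double 0.1 is rounded again to 24 bits */
    return f > d;             /* f is converted to double exactly before comparing */
}

/* Prints both approximations with 20 decimal places for comparison. */
void print_both_tenths(void) {
    float f = 0.1;
    double d = 0.1;
    printf("float : %.20f\n", f);
    printf("double: %.20f\n", d);
}
```

Because every float value is also exactly representable as a double, the comparison itself introduces no further rounding; the difference comes entirely from where each type had to cut off the bits.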
  • Examples :

It’s important to note that some decimals with terminating bicimals don’t exist in floating-point either. This happens when there are more bits than the precision allows for. For example, a decimal whose exact binary form needs 54 significant bits does not fit in a double: it is rounded to 53 bits, and the value actually stored converts back to a different decimal than the one written.
Such precisely specified numbers are not likely to be used in real programs, so this is not an issue that’s likely to come up.
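As an illustration of a terminating bicimal that still does not fit, the sketch below uses 0.5 + 2^-53 + 2^-54, which needs exactly 54 significant bits. This particular value is picked for the example and is not necessarily the one the original figures showed; the code assumes an IEEE 754 double and a compiler that rounds hexadecimal constants to nearest, ties to even.

```c
#include <stdio.h>

/* Hypothetical helper: the constant has 54 significant bits
 * (0.5 + 2^-53 + 2^-54), one more than a double can hold. */
double store_54_bit_value(void) {
    return 0x1.00000000000018p-1;
}

/* Prints what was actually stored, in hex and in decimal. */
void print_stored_value(void) {
    double x = store_54_bit_value();
    printf("%a\n", x);
    printf("%.55f\n", x);
}
```

Under those assumptions the stored value is 0.5 + 2^-52, the neighbouring 53-bit number, rather than the value that was written.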

Interesting fact: 1/3 is a repeating decimal, 0.333333333333333333…
But in ternary (the base-3 numeral system) it is simply 0.1!



Wednesday, November 12, 2014

Pointers Declaration Syntax : Simplified

Sorry for the small font, but it is necessary for better viewing and hence better understanding.

T   *x[N]           // x is an N-element array of pointer to T
T  (*x)[N]          // x is a pointer to an N-element array of T
T   *f()            // f is a function returning a pointer to T
T  (*f)()           // f is a pointer to a function returning T
T  (*f())(int)      // f is a function returning a pointer to a function that takes an int parameter and returns T
T  (*f[N])(int)     // f is an N-element array of pointers to functions that take an int parameter and return T

Good. Now, how do we read this declaration?
void (*signal(int signo, void (*func)(int)))(int);

Simply apply the previous rules to it, and it breaks down as :

       signal                                      // signal
       signal(                            )        // is a function
       signal(    signo,                  )        // with a parameter named signo
       signal(int signo,                  )        //   of type int
       signal(int signo,        func      )        // and a parameter named func
       signal(int signo,       *func      )        //   of type pointer
       signal(int signo,      (*func)(   ))        //   to a function
       signal(int signo,      (*func)(int))        //   taking an int parameter
       signal(int signo, void (*func)(int))        //   and returning void
      *signal(int signo, void (*func)(int))        // returning a pointer
     (*signal(int signo, void (*func)(int)))(   )  // to a function
     (*signal(int signo, void (*func)(int)))(int)  // taking an int parameter
void (*signal(int signo, void (*func)(int)))(int); // and returning void

Now it's easy, isn't it ? ;-)

If the declaration is changed to:
void (*(*signal)(int signo, void (*func)(int)))(int);
then signal is a pointer to a function with a parameter named signo, and the rest of the breakdown reads exactly as before.
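A common way to tame such declarations is a typedef for the function-pointer type. The sketch below uses made-up names (handler_fn, my_signal, on_event) so it does not clash with the real signal from <signal.h>; it declares the function once in the raw form and then defines it through the typedef, and the compiler accepts both as the same type.

```c
#include <stddef.h>

/* Hypothetical alias: pointer to a function taking int and returning void. */
typedef void (*handler_fn)(int);

/* Raw form, same shape as the signal declaration discussed above. */
void (*my_signal(int signo, void (*func)(int)))(int);

handler_fn current_handler = NULL;

/* Typedef form of the very same function: installs func and
 * returns the handler that was installed before it. */
handler_fn my_signal(int signo, handler_fn func) {
    handler_fn previous = current_handler;
    (void)signo;              /* toy example: the signal number is ignored */
    current_handler = func;
    return previous;
}

/* A do-nothing handler used only to exercise my_signal. */
void on_event(int signo) { (void)signo; }
```

With the typedef in place, the declaration reads left to right: my_signal takes an int and a handler, and returns a handler.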

Detecting and Handling Endianness in Run-Time

A long time ago, on a very remote island known as Lilliput (a fictional island, thanks to my friend Nabil for the info), society was split into two factions: Big-Endians, who opened their soft-boiled eggs at the larger end ("the primitive way"), and Little-Endians, who broke their eggs at the smaller end. As the Emperor commanded all his subjects to break the smaller end, this resulted in a civil war with dramatic consequences: 11,000 people have, at several times, suffered death rather than submit to breaking their eggs at the smaller end [1]-[2].

Eventually, the 'Little-Endian' vs. 'Big-Endian' feud carried over into the world of computing as well, where it refers to the order in which bytes in multi-byte numbers should be stored, most-significant first (Big-Endian) or least-significant first (Little-Endian) to be more precise [2].

Endianness refers to how the bytes of a multibyte value are ordered within computer memory:
  • Big-Endian means that the most significant byte of any multibyte data field is stored at the lowest memory address, which is also the address of the larger field. 
  • Little-Endian means that the least significant byte of any multibyte data field is stored at the lowest memory address, which is also the address of the larger field.
For example, consider the 32-bit number, 0x16FAE50A. Following the Big-Endian convention, a computer will store it as follows:
  • Base_Address       : 16
  • Base_Address   + 1 : FA
  • Base_Address   + 2 : E5
  • Base_Address   + 3 : 0A
Whereas architectures that follow the Little-Endian rules will store it as follows:
  • Base_Address       : 0A
  • Base_Address   + 1 : E5
  • Base_Address   + 2 : FA
  • Base_Address   + 3 : 16
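To see which of the two layouts a given machine uses, the bytes of the value can be copied out in memory order. This is a small sketch with a hypothetical helper name; it assumes nothing beyond a 32-bit uint32_t.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copies the four bytes of value into out, lowest address first. */
void copy_bytes_in_memory_order(uint32_t value, unsigned char out[4]) {
    memcpy(out, &value, sizeof value);
}

/* Prints the bytes of 0x16FAE50A from Base_Address to Base_Address + 3. */
void print_layout(void) {
    unsigned char bytes[4];
    copy_bytes_in_memory_order(0x16FAE50Au, bytes);
    for (int k = 0; k < 4; k++)
        printf("Base_Address + %d : %02X\n", k, bytes[k]);
}
```

On a big-endian machine the first line shows 16, on a little-endian machine it shows 0A.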
Although there is no significant performance difference between the two byte orders, different microcontrollers follow different conventions: some are big-endian and others are little-endian.

As an example, consider an EEPROM handler that can cause serious bugs if endianness is not carefully considered. Check the following code:
typedef struct EEPhandler_tstrPROBE_CONFIG_GROUP{
    short as16X_BOARD_TEMP_TABLE[10];
} EEPhandler_tstrPROBE_CONFIG_GROUP;

void DEE_bEepromReadSync(char* pu8BufferToRead){
    //EEPROM data is read exactly as if we wrote the following code:
    pu8BufferToRead[0] = 0x00;   // first byte, at the lowest address
    pu8BufferToRead[1] = 0x1F;   // second byte, at the next address
}

    // Elsewhere, with pstrPROBE_CONFIG_GROUP pointing to the structure above:
    DEE_bEepromReadSync((char *)pstrPROBE_CONFIG_GROUP);
    // After the previous line of code, we have
    // (*pstrPROBE_CONFIG_GROUP).as16X_BOARD_TEMP_TABLE[0] = 0x001F on big-endian platforms
    // and = 0x1F00 on little-endian platforms!
After the execution of the line DEE_bEepromReadSync((char *)pstrPROBE_CONFIG_GROUP), the first element of as16X_BOARD_TEMP_TABLE is 0x001F on big-endian platforms and 0x1F00 on little-endian platforms.
This is a serious issue if the code is meant to run on several platforms or to handle communication between different microcontrollers.

Endianness is not a compiler issue, nor even an operating system issue, but a platform issue. There are generally no compiler options or workarounds for endianness.
There are, however, conversion routines so that you can normalize the endianness of stored data.
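One simple way to normalize stored data is to never reinterpret the raw buffer as a multibyte type, and instead assemble each value from its bytes with shifts. The helpers below are a hypothetical sketch (the names are not from any library); which of the two applies depends on the byte order the data was written in.

```c
#include <stdint.h>

/* Decodes two bytes stored most significant byte first. */
uint16_t read_u16_be(const unsigned char *p) {
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Decodes two bytes stored least significant byte first. */
uint16_t read_u16_le(const unsigned char *p) {
    return (uint16_t)(((unsigned)p[1] << 8) | p[0]);
}
```

Given the bytes 0x00 and 0x1F, read_u16_be returns 0x001F and read_u16_le returns 0x1F00 on every platform, because shifts operate on values rather than on memory layout.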

There are programmatic ways to detect at run-time whether you are on a big-endian or a little-endian architecture.
A union can be used to detect endianness at run-time as follows:
#include <stdint.h>   // for uint32_t

int is_big_endian(void){
    union {
        uint32_t i;
        char c[4];
    } bint = {0x01020304};
    return bint.c[0] == 1;   // most significant byte stored first: big-endian
}

Or use the pointer-casting trick as follows:
int is_big_endian (void){
    short int word = 0x0001;
    char *byte = (char *) &word;
    return byte[0] == 0;   // the low-order byte is not stored first: big-endian
}
Afterwards, routines can be used to convert the endianness at run-time. Below is a routine that swaps the byte order of a 32-bit unsigned integer:
// Needs <stdint.h> for uint32_t and <stddef.h> for size_t.
uint32_t swap_endian_u32(uint32_t u){
    union {
        uint32_t u;
        unsigned char u8[sizeof(uint32_t)];
    } source, dest;

    source.u = u;
    // Copy the bytes in reverse order.
    for (size_t k = 0; k < sizeof(uint32_t); k++)
        dest.u8[k] = source.u8[sizeof(uint32_t ) - k - 1];
    return dest.u;
}
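The same swap can also be written with shifts and masks, which avoids the union altogether. This is an alternative sketch under a hypothetical name, not a replacement for the routine above.

```c
#include <stdint.h>

/* Reverses the byte order of a 32-bit value using shifts and masks. */
uint32_t swap_endian_u32_shift(uint32_t u) {
    return (u >> 24)                    /* byte 3 moves to byte 0 */
         | ((u >> 8) & 0x0000FF00u)     /* byte 2 moves to byte 1 */
         | ((u << 8) & 0x00FF0000u)     /* byte 1 moves to byte 2 */
         | (u << 24);                   /* byte 0 moves to byte 3 */
}
```

For the earlier example value, swap_endian_u32_shift(0x16FAE50A) gives 0x0AE5FA16, and applying the function twice returns the original value.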