 
 Tutorial to Understand IEEE Floating-Point Errors - Q42980
 
 Floating-point mathematics is a complex topic that confuses many
 programmers. The tutorial below should help you recognize programming
 situations where floating-point errors are likely to occur and how to
 avoid them. It should also allow you to recognize cases that are
 caused by inherent floating-point math limitations as opposed to
 actual compiler bugs.
 
 The information in this article applies to:
 
 - The Standard and Professional Editions of Microsoft Visual Basic
   version 1.0 for MS-DOS
 - Microsoft Visual Basic version 1.0 for Windows
 - Microsoft QuickBasic versions 4.0, 4.0b, and 4.5 for MS-DOS
 - Microsoft Basic Professional Development System (PDS) versions 7.0
   and 7.1 for MS-DOS
 - Microsoft Basic Compiler versions 6.0 and 6.0b for MS-DOS
 
 More Information:
 
 Decimal and Binary Number Systems
 ---------------------------------
 
 Normally, we count things in base 10. The base is completely
 arbitrary. (The only reason that people have traditionally used base
 10 is that they have 10 fingers, which have made handy counting
 tools.)
 
 The number 532.25 in decimal (base 10) means the following:
 
    (5 * 10^2) + (3 * 10^1) + (2 * 10^0) + (2 * 10^-1) + (5 * 10^-2)
        500    +     30     +      2     +     2/10    +    5/100
    _________
    =  532.25
 
 In the binary number system (base 2), each column represents a power
 of 2 instead of 10. For example, the number 101.01 means the
 following:
 
    (1 * 2^2) + (0 * 2^1) + (1 * 2^0) + (0 * 2^-1) + (1 * 2^-2)
        4     +      0    +     1     +      0     +    1/4
    _________
    =  5.25  Decimal
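
 The two expansions above can be reproduced mechanically. The
 following is a minimal sketch in Python (used here only as a runnable
 stand-in; the function name positional_value is hypothetical and is
 not part of any BASIC product). It sums digit * base^column for each
 column of a digit string:

```python
from fractions import Fraction

def positional_value(digits: str, base: int) -> Fraction:
    """Evaluate a digit string such as '101.01' in the given base (2..10)."""
    whole, _, frac = digits.partition(".")
    total = Fraction(0)
    # Columns left of the point carry powers base^0, base^1, ...
    for power, ch in enumerate(reversed(whole)):
        total += int(ch) * Fraction(base) ** power
    # Columns right of the point carry powers base^-1, base^-2, ...
    for power, ch in enumerate(frac, start=1):
        total += int(ch) * Fraction(base) ** -power
    return total

print(float(positional_value("532.25", 10)))  # 532.25
print(float(positional_value("101.01", 2)))   # 5.25
```

 Exact fractions are used so that the sketch itself introduces no
 rounding.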
 
 How Integers Are Represented in PCs
 -----------------------------------
 
 Since there is no fractional part to an integer, its machine
 representation is much simpler than it is for floating-point values.
 Normal integers on PCs are 2 bytes (16 bits) long with the most
 significant bit indicating the sign. Long integers are 4 bytes long.
 Positive values are straightforward binary numbers. For example:
 
     1 Decimal = 1 Binary
     2 Decimal = 10 Binary
    22 Decimal = 10110 Binary  etc.
 
 However, negative integers are represented using the two's complement
 scheme. To get the two's complement representation for a negative
 number, take the binary representation for the number's absolute value
 and then flip all the bits and add 1. For example:
 
    4 Decimal = 0000 0000 0000 0100
                1111 1111 1111 1011     Flip the Bits
    -4        = 1111 1111 1111 1100     Add 1
 
 Note that -1 Decimal = 1111 1111 1111 1111 in Binary, which explains
 why BASIC treats -1 as logical true (All bits = 1). This is a
 consequence of not having separate operators for bitwise and logical
 comparisons. Often in BASIC, it is convenient to use the code fragment
 below when your program will be making many logical comparisons. This
 greatly aids readability.
 
    CONST TRUE = -1
    CONST FALSE = NOT TRUE
 
 Note that adding any combination of two's complement numbers together
 using ordinary binary arithmetic produces the correct result.
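
 The flip-and-add-1 rule and the addition property can be sketched as
 follows. This is an illustration in Python rather than BASIC, and the
 helper names (twos_complement, to_signed) are hypothetical. Python
 integers are unbounded, so the sketch masks every result to 16 bits
 to imitate a 2-byte integer:

```python
BITS = 16
MASK = (1 << BITS) - 1  # 0xFFFF, i.e. sixteen 1 bits

def twos_complement(value: int) -> int:
    """Return the 16-bit pattern for -value: flip the bits, then add 1."""
    return ((value ^ MASK) + 1) & MASK

def to_signed(pattern: int) -> int:
    """Interpret a 16-bit pattern as a signed two's complement integer."""
    if pattern & (1 << (BITS - 1)):      # sign bit set
        return pattern - (1 << BITS)
    return pattern

neg4 = twos_complement(4)
print(format(neg4, "016b"))                # 1111111111111100
print(format(twos_complement(1), "016b"))  # 1111111111111111

# Ordinary binary addition, keeping only 16 bits, gives the right answer.
total = (neg4 + 6) & MASK
print(to_signed(total))                    # 2
```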
 
 Floating-Point Complications
 ----------------------------
 
 Every decimal integer can be exactly represented by a binary integer;
 however, this is not true for fractional numbers. In fact, every
 fraction that has no finite expansion in base 10 also has no finite
 expansion in base 2, and many fractions that are finite in base 10
 repeat forever in base 2.
 
 For binary, in particular, only fractional numbers that can be
 represented in the form p/q, where q is an integer power of 2, can be
 expressed exactly, with a finite number of bits.
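
 The p/q rule can be checked mechanically. The sketch below is written
 in Python as a stand-in for BASIC, and the function name
 is_exact_in_binary is hypothetical. It reduces a decimal string to
 lowest terms and tests whether the denominator is a power of 2. It
 relies on the fact that Python floats are IEEE double-precision
 values, so Fraction(0.0001) shows the value that is actually stored:

```python
from fractions import Fraction

def is_exact_in_binary(decimal_text: str) -> bool:
    """True if the decimal value reduces to p/q with q a power of 2."""
    q = Fraction(decimal_text).denominator
    return (q & (q - 1)) == 0  # a power of 2 has exactly one bit set

for text in ("0.5", "5.25", "0.1", "0.0001"):
    print(text, is_exact_in_binary(text))

# The double nearest to 0.0001 is a slightly different number.
print(Fraction(0.0001) == Fraction("0.0001"))  # False
```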
 
 Even common decimal fractions, such as decimal 0.0001, cannot be
 represented exactly in binary. (0.0001 is a repeating binary fraction
 with a period of 500 bits!)
 
 This explains why a simple example, such as the following
 
    SUM = 0
    FOR I% = 1 TO 10000
       SUM = SUM + 0.0001
    NEXT I%
    PRINT SUM                   ' Theoretically = 1.0.
 
 will PRINT 1.000054 as output, because SUM defaults to single
 precision and the small error in representing 0.0001 in binary
 propagates to the sum.
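
 The same accumulation can be imitated outside BASIC. The following is
 a rough sketch in Python, not a reproduction of the QuickBASIC
 run-time: it assumes that rounding to IEEE single precision after
 every addition is a fair model of a single-precision SUM, and the
 helper name to_single is hypothetical.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

step = to_single(0.0001)
total = 0.0
for _ in range(10000):
    total = to_single(total + step)  # round after every addition

print(total)         # close to, but not exactly, 1.0
print(total == 1.0)  # False
```

 Accumulating in double precision, or multiplying once
 (10000 * 0.0001) instead of adding 10000 times, reduces the error but
 does not remove its cause.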
 
 For the same reason, you should always be very cautious when making
 comparisons on real numbers. The following example illustrates a
 common programming error:
 
    item1# = 69.82#
    item2# = 69.20# + 0.62#
    IF item1# = item2# THEN PRINT "Equality!"
 
 This will NOT PRINT "Equality!" because 69.82 cannot be represented
 exactly in binary, which causes the value that results from the
 assignment to be SLIGHTLY different (in binary) than the value that is
 generated from the expression. In practice, you should always code
 such comparisons in such a way as to allow for some tolerance, by
 testing whether the two values differ by less than a small amount
 chosen to suit their magnitude. For example:

    IF ABS(item1# - item2#) < .000001# THEN PRINT "Equal"

 This will PRINT "Equal".
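
 The comparison problem is a property of IEEE double precision, not
 of BASIC alone, so it can be demonstrated in any language that uses
 IEEE doubles. The sketch below uses Python for that purpose. The
 tolerance of 1e-9 is an arbitrary value picked for illustration and
 should be chosen to suit the data in a real program:

```python
import math

item1 = 69.82
item2 = 69.20 + 0.62

print(item1 == item2)                            # False
print(abs(item1 - item2) < 1e-9)                 # True
print(math.isclose(item1, item2, rel_tol=1e-9))  # True
```

 An absolute tolerance works well when the magnitude of the values is
 known in advance. A relative tolerance, as in the last line, scales
 with the values being compared.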
 
 IEEE Format Numbers
 -------------------
 
 QuickBASIC version 3.0 was shipped with an MBF (Microsoft Binary
 Floating Point) version and an IEEE (Institute of Electrical and
 Electronics Engineers) version for machines with a math coprocessor.
 QuickBASIC versions 4.0 and later use IEEE only. Microsoft chose the
 IEEE standard to represent floating-point values in current versions
 of BASIC for three primary reasons:
 
 1. To allow BASIC to use the Intel math coprocessors, which use IEEE
    format. The Intel 80x87 series coprocessors cannot work with
    Microsoft Binary Format numbers.
 
 2. To make interlanguage calling between BASIC, C, Pascal, FORTRAN,
    and MASM much easier. Otherwise, conversion routines would have to
    be used to send numeric values from one language to another.
 
 3. To achieve consistency. IEEE is the accepted industry standard for
    C and FORTRAN compilers.
 
 The following is a quick comparison of IEEE and MBF representations
 for a double-precision number:
 
                Sign Bits   Exponent Bits   Mantissa Bits
                ---------   -------------   -------------
 
    IEEE        1           11              52 + 1 (Implied)
 
     MBF        1            8              55 + 1 (Implied)
 
 Note that IEEE has more bits dedicated to the exponent, which allows
 it to represent a wider range of values. MBF has more mantissa bits,
 which allows it to be more precise within its narrower range.
 
 General Floating-Point Concepts
 -------------------------------
 
 It is very important to realize that any binary floating-point system
 can represent only a finite number of floating-point values in exact
 form. All other values must be approximated by the closest
 representable value. The IEEE standard specifies the method for
 rounding values to the "closest" representable value. QuickBASIC
 supports the standard and rounds according to the IEEE rules.
 
 Also, keep in mind that the numbers that can be represented in IEEE
 are spread out over a very wide range. You can imagine them on a
 number line. The representable numbers are densest near 0 and become
 sparser as you go towards infinity: the gap between adjacent values
 doubles at every power of 2.
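
 To see how the representable values are distributed along the number
 line, you can measure the gap between a single-precision value and
 its nearest larger neighbor. The following is a small sketch in
 Python (a stand-in for BASIC; the helper name next_single_up is
 hypothetical). It relies on the fact that, for positive finite
 values, adding 1 to the 32-bit pattern yields the next representable
 number:

```python
import struct

def next_single_up(x: float) -> float:
    """Return the next IEEE single above a positive, finite single x."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]

for x in (0.001, 1.0, 1000.0, 1000000.0):
    start = struct.unpack("<f", struct.pack("<f", x))[0]  # x as a single
    gap = next_single_up(start) - start
    print(start, gap)
```

 The gap printed for 1.0 is 2^-23, and the gap printed for 1000000.0
 is 0.0625.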
 
 The goal of the IEEE standard, which is designed for engineering
 calculations, is to maximize accuracy (to get as close as possible to
 the actual number). Precision refers to the number of digits that you
 can represent. The IEEE standard attempts to balance the number of
 bits dedicated to the exponent with the number of bits used for the
 fractional part of the number, to keep both accuracy and precision
 within acceptable limits.
 
 IEEE Details
 ------------
 
 Floating-point numbers are represented in the following form, where
 [exponent] is the binary exponent:
 
    X =  Fraction * 2^(exponent - bias)
 
 [Fraction] is the normalized fractional part of the number, normalized
 because the exponent is adjusted so that the leading bit is always a
 1. This way, it does not have to be stored, and you get one more bit
 of precision. This is why there is an implied bit. You can think of
 this like scientific notation, where you manipulate the exponent to
 have one digit to the left of the decimal point, except in binary, you
 can always manipulate the exponent so that the first bit is a 1, since
 there are only 1s and 0s.
 
 [bias] is the bias value used to avoid having to store negative
 exponents.
 
 The bias is 127 for single-precision numbers and 1023 for
 double-precision numbers (both decimal).

 Stored exponent values of all 0's and all 1's (binary) are reserved
 for representing special cases. All 0's is used for zero and for
 denormalized numbers. All 1's is used for infinities and for NaN
 (Not a Number) values, which indicate various error conditions.
 
 Single-Precision Examples
 -------------------------
 
  2  =  1  * 2^1  = 0100 0000 0000 0000 ... 0000 0000 = 4000 0000h
     Note the sign bit is zero, and the stored exponent is 128, or
     100 0000 0 in binary, which is 127 plus 1. The stored mantissa is
     (1.) 000 0000 ... 0000 0000, which has an implied leading 1 and
     binary point, so the actual mantissa is 1.
 
 -2  = -1  * 2^1  = 1100 0000 0000 0000 ... 0000 0000 = C000 0000h
     Same as +2 except that the sign bit is set. This is true for all
     IEEE format floating-point numbers.
 
  4  =  1  * 2^2  = 0100 0000 1000 0000 ... 0000 0000 = 4080 0000h
     Same mantissa, exponent increases by one (biased value is 129, or
     100 0000 1 in binary).
 
  6  = 1.5 * 2^2  = 0100 0000 1100 0000 ... 0000 0000 = 40C0 0000h
     Same exponent, mantissa is larger by half -- it's
     (1.) 100 0000 ... 0000 0000, which, since this is a binary
     fraction, is 1-1/2 (the values of the fractional digits are 1/2,
     1/4, 1/8, etc.).
 
  1  = 1   * 2^0  = 0011 1111 1000 0000 ... 0000 0000 = 3F80 0000h
     Same mantissa as other powers of 2, biased exponent is one less
     than that of 2, at 127, or 011 1111 1 in binary.
 
 .75 = 1.5 * 2^-1 = 0011 1111 0100 0000 ... 0000 0000 = 3F40 0000h
     The biased exponent is 126, 011 1111 0 in binary, and the mantissa
     is (1.) 100 0000 ... 0000 0000, which is 1-1/2.
 
 2.5 = 1.25 * 2^1 = 0100 0000 0010 0000 ... 0000 0000 = 4020 0000h
     Exactly the same as 2 except that the bit which represents 1/4 is
     set in the mantissa.
 
 0.1 = 1.6 * 2^-4 = 0011 1101 1100 1100 ... 1100 1101 = 3DCC CCCDh
     1/10 is a repeating fraction in binary. The mantissa is very close
     to 1.6, and the biased exponent says that 1.6 is to be divided by
     16 (it is 011 1101 1 in binary, which is 123 in decimal). The true
     exponent is 123 - 127 = -4, which means that the factor by which
     to multiply is 2^-4 = 1/16. Note that the stored mantissa is
     rounded up in the last bit, so it is very slightly larger than
     1.6. This is an attempt to represent the unrepresentable number
     as accurately as possible. (The reason that 1/10 and 1/100 are
     not exactly representable in binary is similar to the way that
     1/3 is not exactly representable in decimal.)
 
 0   = all zeros -- a special case. (A stored exponent of zero is
     reserved, so the pattern is not read as 1.0 * 2^-127.)
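
 The bit patterns in the examples above can be checked by splitting a
 single-precision value into its three fields. The sketch below does
 this in Python (a stand-in for BASIC; the function name decode_single
 is hypothetical). It handles normalized numbers only, so the special
 cases for zero, denormalized numbers, infinities, and NaN values are
 deliberately left out:

```python
import struct

BIAS = 127  # single-precision exponent bias

def decode_single(x: float):
    """Return (hex pattern, sign bit, true exponent, mantissa) for x
    after rounding it to IEEE single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                        # 1 bit
    stored_exponent = (bits >> 23) & 0xFF    # 8 bits
    fraction_bits = bits & 0x7FFFFF          # 23 stored mantissa bits
    # Normalized numbers have an implied leading 1 before the binary point.
    mantissa = 1 + fraction_bits / 2 ** 23
    return format(bits, "08X"), sign, stored_exponent - BIAS, mantissa

for value in (2.0, -2.0, 4.0, 6.0, 1.0, 0.75, 2.5, 0.1):
    print(value, decode_single(value))
```

 For 6, for example, the sketch reports the pattern 40C00000, a sign
 bit of 0, a true exponent of 2, and a mantissa of 1.5.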
 
 Other Common Floating-Point Errors
 ----------------------------------
 
 The following are common floating-point errors:
 
 1. Round-off error
 
    This error results when all of the bits in a binary number cannot
    be used in a calculation.
 
    Example: Adding 0.0001 to 0.9900 (Single Precision)
 
    Decimal 0.0001 will be represented as:
 
       (1.)10100011011011100010111 * 2^(-14+Bias)  (13 Leading 0s in
       Binary!)
 
    0.9900 will be:
 
       (1.)11111010111000010100100 * 2^(-1+Bias)

    Now to actually add these numbers, the decimal (binary) points must
    be aligned. For this they must be unnormalized. Listed below is the
    resulting addition:

        .000000000000011010001101 * 2^0  <- Only 11 of 24 Bits retained
       +.111111010111000010100100 * 2^0
       ________________________________
        .111111010111011100110001 * 2^0
 
    This is called a round-off error because some computers round when
    shifting for addition. Others simply truncate. Round-off errors are
    important to consider whenever you are adding or multiplying two
    very different values.
 
 2. Subtracting two almost equal values
 
        .1235
       -.1234
        _____
        .0001
 
    This will be normalized. Note that although the original numbers
    each had four significant digits, the result has only one
    significant digit.
 
 3. Overflow and underflow
 
    This occurs when the result is too large or too small to be
    represented by the data type.
 
 4. Quantizing error
 
    This occurs with those numbers that cannot be represented in exact
    form by the floating-point standard.
 
 5. Division by a very small number
 
    This can trigger a "divide by zero" error or can produce bad
    results, as in the following example:
 
       A = 112000000
       B = 100000
       C = 0.0009
       X = A - B / C
 
    In QuickBASIC, X now has the value 888887, instead of the correct
    answer, which is approximately 888888.9.
 
 6. Output error
 
    This type of error occurs when the output functions alter the
    values they are working with.
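
 Errors 1, 2, and 5 in the list above can be reproduced with a short
 sketch. It is written in Python as a stand-in for BASIC and rests on
 several assumptions: the hypothetical helper to_single models
 single-precision storage by rounding after each operation, the
 four-digit decimal context models the four significant digits in the
 subtraction example, and the inputs 0.12346 and 0.12344 are invented
 for illustration. The values it prints need not match QuickBASIC
 output digit for digit.

```python
import struct
from decimal import Decimal, localcontext
from fractions import Fraction

def to_single(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# 1. Round-off: adding 0.0001 to 0.9900 in single precision.
a = to_single(0.99)
b = to_single(0.0001)
stored_sum = to_single(a + b)
# Exact sum of the two stored operands minus what the result keeps.
lost = Fraction(a) + Fraction(b) - Fraction(stored_sum)
print("round-off loss:", float(lost))

# 2. Subtracting two almost equal values held to four digits.
with localcontext() as ctx:
    ctx.prec = 4
    p = +Decimal("0.12346")   # stored as 0.1235
    q = +Decimal("0.12344")   # stored as 0.1234
    computed = p - q          # 0.0001, one significant digit
true_diff = Decimal("0.12346") - Decimal("0.12344")  # 0.00002
print("cancellation:", computed, "vs", true_diff)

# 5. Division by a very small number.
exact = Fraction(112000000) - Fraction(100000) / Fraction("0.0009")
c = to_single(0.0009)
quotient = to_single(100000.0 / c)
x_single = to_single(112000000.0 - quotient)
print("division:", x_single, "vs", float(exact))
```

 In the third case the quotient is near 111 million, where adjacent
 single-precision values are 8 apart, so the subtraction cannot
 recover the fractional part of the answer.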