23 March 2012

Floating-point

Single-precision: sign 1b, exponent 8b, fraction 23+1b implied (= 6 ~ 9 decimal sigfigs)
Double-precision: sign 1b, exponent 11b, fraction 52+1b implied (= 15 ~ 17 decimal sigfigs)
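
A minimal sketch of pulling those three fields out of a value, assuming Python's standard struct module and big-endian packing; the function name split_fields is made up for illustration. The exponent it returns is the raw stored (biased) value: the bias is 127 for single and 1023 for double.

```python
import struct

def split_fields(x, double=True):
    """Return (sign, stored exponent, explicit fraction bits) of x.

    Hypothetical helper: the implied leading bit is not stored,
    so it does not appear in the returned fraction.
    """
    if double:
        bits = struct.unpack(">Q", struct.pack(">d", x))[0]
        exp_bits, frac_bits = 11, 52
    else:
        bits = struct.unpack(">I", struct.pack(">f", x))[0]
        exp_bits, frac_bits = 8, 23
    sign = bits >> (exp_bits + frac_bits)                    # top bit
    exponent = (bits >> frac_bits) & ((1 << exp_bits) - 1)   # biased exponent
    fraction = bits & ((1 << frac_bits) - 1)                 # explicit fraction
    return sign, exponent, fraction
```

For example, split_fields(1.0) gives (0, 1023, 0): sign clear, exponent equal to the bias, and an all-zero fraction because the leading 1 is implied.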

Special cases:
  • Exponent = 0
    • fraction = 0: (+/-) zero
    • fraction != 0: "subnormal" number with implied bit set to 0 instead
  • Exponent = maximum value in allocated bits (FF for single, 7FF for double)
    • fraction = 0: (+/-) infinity
    • fraction != 0: NaN (sign ignored)
      • top explicit fraction bit = 1: "quiet NaN"
      • top explicit fraction bit = 0 (and rest != 0): "signaling NaN"
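
The special cases above can be sketched as a classifier over a raw 64-bit double-precision pattern. This is an illustration under assumptions, not a reference implementation: the names classify_bits and float_bits are hypothetical, and the quiet/signaling test follows the top-fraction-bit convention listed above. It works on the integer pattern rather than on a float, because some platforms quietly convert a signaling NaN when it is loaded as a floating-point value.

```python
import struct

EXP_MAX = 0x7FF              # all-ones 11-bit exponent
FRAC_MASK = (1 << 52) - 1    # 52 explicit fraction bits

def float_bits(x):
    """Raw 64-bit pattern of a Python float (hypothetical helper)."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def classify_bits(bits):
    """Name the category of a 64-bit double-precision pattern."""
    sign = "-" if bits >> 63 else "+"
    exponent = (bits >> 52) & EXP_MAX
    fraction = bits & FRAC_MASK
    if exponent == 0:
        # Zero or subnormal: the implied bit is 0 here.
        return sign + ("zero" if fraction == 0 else "subnormal")
    if exponent == EXP_MAX:
        if fraction == 0:
            return sign + "infinity"
        # NaN: the sign is ignored, the top explicit fraction bit decides.
        return "quiet NaN" if fraction >> 51 else "signaling NaN"
    return sign + "normal"
```

For instance, classify_bits(float_bits(-0.0)) returns "-zero", and classify_bits(0x7FF0000000000001) returns "signaling NaN".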
