betaveros.stash: Floating-point

23 March 2012

Floating-point

Single-precision: sign 1b, exponent 8b, fraction 23b + 1 implied bit (about 6 to 9 decimal sigfigs)
Double-precision: sign 1b, exponent 11b, fraction 52b + 1 implied bit (about 15 to 17 decimal sigfigs)
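
A minimal Python sketch of pulling those fields out of a value, assuming the standard `struct` module and IEEE 754 encodings for its `f` and `d` formats. The function names `single_fields` and `double_fields` are made up for illustration. The exponent returned is the raw stored field (which carries a bias of 127 or 1023), not the true power of two.

```python
import struct

def single_fields(x):
    """Return (sign, exponent, fraction) of x rounded to single precision."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, raw (biased) field
    fraction = bits & ((1 << 23) - 1)  # 23 explicit bits
    return sign, exponent, fraction

def double_fields(x):
    """Return (sign, exponent, fraction) of a double-precision value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, raw (biased) field
    fraction = bits & ((1 << 52) - 1)  # 52 explicit bits
    return sign, exponent, fraction
```

For example, `double_fields(1.0)` gives a sign of 0, a stored exponent of 1023, and a fraction of 0: the leading 1 is the implied bit and is not stored.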

Special cases:

Exponent = 0
- fraction = 0: (+/-) zero
- fraction != 0: "subnormal" number with implied bit set to 0 instead
Exponent = all ones (FF for single, 7FF for double: the maximum value in the allocated bits)
- fraction = 0: (+/-) infinity
- fraction != 0: NaN (sign ignored)
  - top explicit fraction bit = 1: "quiet NaN"
  - top explicit fraction bit = 0 (and rest != 0): "signaling NaN"
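
The special cases above can be sketched as a small classifier for doubles. This is an illustration only, with hypothetical names (`classify_double_bits`, `classify_double`); it works on the raw 64-bit pattern, where 0x7FF is the all-ones 11-bit exponent, and it ignores the sign bit.

```python
import struct

FRACTION_BITS = 52
EXPONENT_MAX = 0x7FF  # all ones in the 11-bit exponent field

def classify_double_bits(bits):
    """Classify a 64-bit pattern interpreted as a double."""
    exponent = (bits >> FRACTION_BITS) & EXPONENT_MAX
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    if exponent == 0:
        return "zero" if fraction == 0 else "subnormal"
    if exponent == EXPONENT_MAX:
        if fraction == 0:
            return "infinity"
        # The top explicit fraction bit separates quiet from signaling NaNs.
        top_bit = fraction >> (FRACTION_BITS - 1)
        return "quiet NaN" if top_bit else "signaling NaN"
    return "normal"

def classify_double(x):
    """Classify a Python float by reinterpreting its bytes as an integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return classify_double_bits(bits)
```

Taking the bit pattern directly makes it possible to test signaling NaNs without passing them through float operations, which may quiet them on some platforms.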
