Storage of Floating-Point Variables, the Float Type, and Float Variables

Any numeric value with a decimal point will be interpreted by the compiler as a floating-point number, also known as a real number.

1. Storage of Floating-Point Variables

To store a single-precision floating-point number (float in C), a computer typically allocates 4 bytes (32 bits) of memory, divided as follows in the common IEEE 754 format:

1 bit is used for the sign
8 bits are used for the exponent
23 bits are used for the significand (mantissa)

Let’s go through the steps of how a computer stores a floating-point number:

1. Convert the number to binary. For example, 10.75 in binary is (1010.11)₂, because 10 is 1010 and 0.75 = 0.5 + 0.25 is .11.
2. Normalize the binary number, that is, write it in the form 1.significand bits × 2^exponent.
- The binary number 1010.11 is written in normalized form as 1.01011 × 2³.

[Figure: normalization of a real number for storage]

3. Add an offset value (bias) to the exponent.

The exponent of a floating-point number can be negative, but it is not stored in two’s complement. Instead, a fixed bias is added to the actual exponent so that the stored value is always a non-negative integer. The same bias is added whether the exponent is negative or positive, which keeps the implementation simple, for example when the magnitudes of two numbers are compared.

The formula for the bias, where n is the number of bits in the exponent field:

bias = 2^(n-1) - 1

For an 8-bit exponent field:

bias = 2^(8-1) - 1 = 2⁷ - 1 = 127

The (n-1) in the formula places the bias at roughly the midpoint of the values that n bits can hold (0 to 255 for 8 bits), so that about half of the stored values stand for negative exponents and half for positive ones. It is not related to the sign bit of the number, which is stored separately.

Therefore, the stored (biased) exponent is the actual exponent plus the bias: 3 + 127 = 130, whose binary form is 10000010.

The binary representation of 10.75 in memory is therefore made up of three parts:

The sign bit is 0, indicating that the number is positive.

The exponent value is 130 (binary 10000010).

The significand value is 1.01011. The leading 1 and the binary point (.) do not need to be stored, because a normalized number always has the form 1.something.

Only the bits after the point, 01011, are stored. They are padded on the right with zeros to fill all 23 bits: 01011000000000000000000. The complete 32-bit pattern is 0 10000010 01011000000000000000000.

[Figure: storage of a real number in the floating-point format]

2. Floating-Point Type (Real Number Type)

In computer programming, any value with a decimal point is interpreted as a floating-point number, which is stored in the form m * b^e, where m is the significand (mantissa), b is the base (usually 2), and e is the exponent. This representation trades off precision against range, so it can represent both very large and very small numbers.

C has three floating-point types: single precision (float), double precision (double), and extended precision (long double).

On typical IEEE 754 platforms, a single-precision floating-point number (float) occupies 4 bytes (32 bits) of memory, represents normalized magnitudes between about 1.2E-38 and 3.4E+38, and provides approximately 7 significant decimal digits. A double-precision floating-point number (double) occupies 8 bytes (64 bits) of memory, represents normalized magnitudes between about 2.2E-308 and 1.8E+308, and provides approximately 15 to 16 significant decimal digits. These limits follow from the storage structure of floating-point numbers.

A floating-point variable is declared with the same syntax as an integer variable.

float f = 10.5f;  // float type
double a, b, c;  // a, b, c - double type

The float data type occupies 4 bytes (32 bits) on most platforms: 1 bit stores the sign, 8 bits store the biased exponent, and the remaining 23 bits store the fraction. The C standard guarantees that float provides at least 6 significant decimal digits and can represent magnitudes from at least 10^-37 to 10^37.

Sometimes the precision or range of a 32-bit floating-point number is not sufficient, so C provides two larger floating-point types.

double: occupies 8 bytes (64 bits) on most platforms.
long double: size depends on the platform and compiler; commonly 8, 12, or 16 bytes.

Note that, because of their limited precision, floating-point numbers are approximations, and calculations with them are not exact. For example, in C, 0.1 + 0.2 is not exactly equal to 0.3; the result differs by a small error.

The following comparison therefore usually evaluates to false:

if (0.1 + 0.2 == 0.3) // false

In C, scientific notation can be used to write floating-point numbers: the letter “e” (or “E”) separates the significand from the decimal exponent.

double x = 123.456e+3; // 123.456 x 10^3 
// is equivalent to
double x = 123.456e3;

In the above example, if there is a plus sign “+” after “e”, the plus sign can be omitted. Note that there should be no space before or after the “e” in scientific notation.

In addition, if the decimal part of the scientific notation is in the form of 0.x or x.0, the leading zero before the decimal point or the trailing zero after the decimal point can be omitted.

0.3E6
// is equivalent to
.3E6
3.0E6
// is equivalent to
3.E6

Example 2.1 Sizes of Variables of Different Data Types

#include <stdio.h>
int main()
{
   int a = 1;
   char b = 'G';
   double c = 3.14;
   printf("Hello World!\n");

   // printing the variables defined
   // above along with their sizes
   // sizeof yields a size_t, which is printed with %zu
   printf("Hello! I am a character. My value is %c and "
   "my size is %zu byte.\n",
   b, sizeof(char));
   // can use sizeof(b) above as well

   printf("Hello! I am an integer. My value is %d and "
   "my size is %zu bytes.\n",
   a, sizeof(int));
   // can use sizeof(a) above as well

   printf("Hello! I am a double floating point variable."
   " My value is %f and my size is %zu bytes.\n",
   c, sizeof(double));
   // can use sizeof(c) above as well

   printf("Bye! See you soon. :)\n");

   return 0;
}

Results:

Hello World!
Hello! I am a character. My value is G and my size is 1 byte.
Hello! I am an integer. My value is 1 and my size is 4 bytes.
Hello! I am a double floating point variable. My value is 3.140000 and my size is 8 bytes.
Bye! See you soon. :)


Example 2.2 Precision of Floating-Point Variables

#include <stdio.h>
int main(void)
{
   float a;
   double b;
   a=33333.33333;
   b=33333.33333333333333;
   printf("a=%f\nb=%f\n",a,b);
   return 0;
}

Results:

a=33333.332031
b=33333.333333

The float variable a keeps only about 7 significant digits, so its value is already inexact from the third decimal place onward. The double variable b is accurate to all six decimal places that %f prints.

