Double vs. Float: Which Is Better?

Neither C++ Coding Standards nor Effective C++ addresses the question of which floating point type is best to use, and in what situations.

There are three floating point types in C and C++:

  1. float
  2. double
  3. long double

What the Standard Has to Say

There are exactly two guarantees provided by the standard:

  1. precision(float) <= precision(double) <= precision(long double). float has no more precision than double, and double has no more precision than long double. All three may be the same data type in a given implementation.
  2. The default floating point type is double: an unsuffixed literal such as 3.3 has type double.
     f suffix: a float constant must be written with the f suffix, as in 3.3f.
     L suffix: a long double constant must be written with the L suffix, as in 3.3L.
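
Both guarantees can be checked at compile time. The following is a minimal sketch, assuming a C++11 compiler. The helper name mantissa_digits is made up for this illustration and is not part of the standard library.

```cpp
#include <limits>
#include <type_traits>

// An unsuffixed floating point literal has type double.
static_assert(std::is_same<decltype(3.3), double>::value, "3.3 is a double");
// The f suffix yields float and the L suffix yields long double.
static_assert(std::is_same<decltype(3.3f), float>::value, "3.3f is a float");
static_assert(std::is_same<decltype(3.3L), long double>::value, "3.3L is a long double");

// Hypothetical helper: number of significand digits (in the type's radix)
// that T can represent.
template <typename T>
constexpr int mantissa_digits()
{
    return std::numeric_limits<T>::digits;
}

// Precision never decreases from float to double to long double.
// Equality is allowed, so all three may share one representation.
static_assert(mantissa_digits<float>() <= mantissa_digits<double>(),
              "float has no more precision than double");
static_assert(mantissa_digits<double>() <= mantissa_digits<long double>(),
              "double has no more precision than long double");
```

If any of these assertions failed, the translation unit would not compile, so a successful build is the result.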

Guidance Provided by Stroustrup

In all of my C++ resources, the only guidance I can find is in The C++ Programming Language by Bjarne Stroustrup (the creator of C++):

"The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don't have that understanding, get advice, take the time to learn, or use double and hope for the best." [emphasis added]

Standard Library Implementation

Where there is only one version of a standard library floating point operation, the library works with double. This includes atof, and in C89 it also included strtod, which gained its strtof and strtold variants only in C99. In C89, double was likewise the only data type supported by all math.h functions.
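
The library's preference for double shows up in the return types of these functions. The sketch below assumes a C++11 standard library. parse_double is a hypothetical wrapper added only to show typical use.

```cpp
#include <cmath>
#include <cstdlib>
#include <type_traits>

// atof and strtod both return double.
static_assert(std::is_same<decltype(std::atof("3.3")), double>::value,
              "atof returns double");
static_assert(std::is_same<decltype(std::strtod("3.3", nullptr)), double>::value,
              "strtod returns double");

// C++ overloads the math functions per type, but an integer argument
// is still computed and returned as double.
static_assert(std::is_same<decltype(std::sqrt(2)), double>::value,
              "sqrt of an int returns double");
static_assert(std::is_same<decltype(std::sqrt(2.0f)), float>::value,
              "sqrt of a float returns float");

// Hypothetical wrapper: parse text as a double using strtod.
inline double parse_double(const char* text)
{
    return std::strtod(text, nullptr);
}
```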

Performance

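On modern hardware double is at least as fast as float for scalar arithmetic like the test below. At the higher optimization levels even long double came out marginally ahead of float. The three timings are within about 2 percent of one another.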

The test code was compiled with the command line:

g++ floatdouble.cpp -std=c++0x -O3 -march=native

It produced the following output for the three floating point types:

Type name: f Size in bytes: 4 Summation time in s: 2.82 summed value: 6.71089e+07 // float
Type name: d Size in bytes: 8 Summation time in s: 2.78585 summed value: 6.6e+09 // double
Type name: e Size in bytes: 16 Summation time in s: 2.76812 summed value: 6.6e+09 // long double

The test code was:

#include <chrono>
#include <iostream>
#include <typeinfo>

// Adds value to an accumulator num_times times, prints the type name, its
// size, and the elapsed time, then returns the accumulated total.
template<typename T>
T sum(int num_times, T value)
{
  typedef std::chrono::high_resolution_clock clock_type;

  T val = 0;

  clock_type::time_point t1 = clock_type::now();
  for (int i = 0; i < num_times; ++i)
  {
    val += value;
  }
  clock_type::duration d = clock_type::now() - t1;

  std::cout << "Type name: " << typeid(T).name() << " Size in bytes: " << sizeof(T) << " Summation time in s: " << std::chrono::duration_cast<std::chrono::duration<double>>(d).count();

  return val;
}

// Runs the summation for T and appends the result to the line sum() started.
// The total is stored first so the timing text is always printed before
// " summed value: ", whatever order the compiler evaluates operands in.
template<typename T>
void report()
{
  T total = sum<T>(2000000000, 3.3);
  std::cout << " summed value: " << total << std::endl;
}

int main()
{
  // Floating point types: these produce the three result lines shown above.
  report<float>();
  report<double>();
  report<long double>();
  // Integer types for comparison: 3.3 is truncated to 3 on conversion to T.
  report<unsigned char>();
  report<short>();
  report<long>();
  report<long long>();
}

Conclusion

double should be your preferred floating point type in nearly every situation.

  1. It's at least as fast in the test above.
  2. It's the default in C and C++.
  3. It's more portable and the default across the C and C++ standard library functions.
  4. It has significantly higher precision. The float answer above is outright wrong (off by 2 orders of magnitude), because the simple 3.3 summation quickly loses precision. Even in a smaller case, summing the value 3.3 2000 times gives 6599.89 with float instead of the correct answer of 6600.
  5. Stroustrup recommends it.
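
The precision loss in point 4 can be reproduced in isolation. This is a minimal sketch rather than the benchmark itself, and repeated_sum, is_absorbed, and abs_error are hypothetical helper names. It assumes float is an IEEE 754 single with a 24-bit significand. Under that assumption, representable values at 2^26 = 67108864 are 8 apart, so adding 3.3 rounds back to the same value, which would explain why the float total above stalls at 6.71089e+07.

```cpp
#include <cassert>

// Adds value to an accumulator count times, entirely in type T.
template <typename T>
T repeated_sum(int count, T value)
{
    assert(count >= 0);
    T total = 0;
    for (int i = 0; i < count; ++i)
    {
        total += value;
    }
    return total;
}

// Returns true when adding increment leaves total unchanged after rounding.
// volatile forces the sum to be stored as a T, so extra register precision
// cannot hide the rounding.
template <typename T>
bool is_absorbed(T total, T increment)
{
    volatile T next = total + increment;
    return next == total;
}

// Absolute difference between a measured total and the expected value.
template <typename T>
double abs_error(T measured, double expected)
{
    const double diff = static_cast<double>(measured) - expected;
    return diff < 0 ? -diff : diff;
}
```

Under the stated assumption, is_absorbed<float>(67108864.0f, 3.3f) should return true while is_absorbed<double>(67108864.0, 3.3) returns false, and abs_error(repeated_sum<float>(2000, 3.3f), 6600.0) should be far larger than the same measurement for double.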

There is exactly one reason to use float instead of double:

  1. It's smaller. On 64-bit hardware with a modern gcc, double is 8 bytes and float is 4 bytes.
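
To make the size argument concrete, the sketch below computes the storage needed for a buffer of a given element count. buffer_bytes is a hypothetical helper, and the 4-byte and 8-byte sizes are what gcc reports on the hardware described above, not something the standard guarantees.

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to store element_count values of type T.
template <typename T>
constexpr std::size_t buffer_bytes(std::size_t element_count)
{
    return element_count * sizeof(T);
}
```

With those sizes, buffer_bytes<float>(1000000) is 4000000 and buffer_bytes<double>(1000000) is 8000000, so a large float array takes half the memory of the same array in double.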

Similarly, there is exactly one reason to use long double:

  1. It can offer more precision. You might have an application that needs the most floating point precision you can get today. Keep in mind that, as noted above, long double is only guaranteed to be no less precise than double.

Finally, and perhaps most importantly: 64-bit floating point has been supported in hardware by Intel-compatible CPUs with the SSE2 instruction set since 2001. On most platforms double is this same 64-bit IEEE 754 type.
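
Whether double really is the 64-bit IEEE type on a given platform can be checked through std::numeric_limits. This is a minimal sketch, assuming C++11, and the function name double_is_ieee754_binary64 is hypothetical.

```cpp
#include <climits>
#include <limits>

// Hypothetical check: true when double looks like IEEE 754 binary64,
// meaning IEC 559 conformance, radix 2, a 53-bit significand, and 64 bits
// of storage.
constexpr bool double_is_ieee754_binary64()
{
    return std::numeric_limits<double>::is_iec559
        && std::numeric_limits<double>::radix == 2
        && std::numeric_limits<double>::digits == 53
        && sizeof(double) * CHAR_BIT == 64;
}
```

Code that depends on this layout could guard itself with static_assert(double_is_ieee754_binary64(), "unexpected double format") so that an unusual platform fails at compile time instead of producing different results.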

Comments

For more details about floating point representation in computers, see the article "What Every Computer Scientist Should Know About Floating-Point Arithmetic."