Good choice of name and domain name for the website. They are used like the plague, and the negative connotation is most welcome these days, when people do things the way they are supposed to rather than having a specific reason for it.

Why do people use floats? Honest question. I don’t know any reason for using floats in any situation.

Numeric processing with high dynamic range is simpler with floating-point numbers than fixed-point numbers. In particular, they have the ability to temporarily exceed range limitations with a fair amount of headroom and only a modest loss of precision.

I’m not sure that “any science, physics or simulation anywhere, ever” is a very specific use case. Just not one that overlaps much with current hip new computing tech.

They require less memory and are adequate for some kinds of programming where higher precision isn’t necessary. For instance, https://gioui.org uses them for pixel offsets because fractional pixels don’t matter beyond a point. 0.00000000001 pixels isn’t usually worth worrying about in an application’s layout.

I also think that there are some processors on which float32 operations are faster than float64, but I don’t think that’s true of conventional x86_64 processors.


It’s true that there are lots of cases where you won’t see a difference at all because you’re limited by something else (e.g. the cost and latency of arithmetic can be hidden by memory latency sometimes), but I would not state this with confidence.

When you’re cache or memory bandwidth limited, you can fit twice as many float32 numbers into each cache line.

Vector operations on float32s typically have twice the throughput. All the vector operations in SSE and SSE2 for example come in versions that work on float32 or float64 numbers packed into 128 bit registers. The 32 bit versions operate on twice as many numbers with the same or better latency and clocks-per-instruction (according to Intel’s documentation, at least).

A few operations (such as division) have slightly worse latency noted in Intel’s docs for float64 versions.
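To put rough numbers on the cache-line and SSE points above, here is a minimal sketch. It assumes a 64-byte cache line (common on x86_64, but an assumption here) and uses Python's standard `array` module only to read off the element sizes of C `float` and `double`.

```python
from array import array

CACHE_LINE_BYTES = 64  # assumption: typical x86_64 cache line size
SSE_REGISTER_BYTES = 16  # 128-bit SSE register

f32_size = array("f").itemsize  # size of a C float in bytes
f64_size = array("d").itemsize  # size of a C double in bytes

# How many values fit in one cache line
per_line_f32 = CACHE_LINE_BYTES // f32_size
per_line_f64 = CACHE_LINE_BYTES // f64_size

# How many values one packed SSE instruction operates on
per_sse_f32 = SSE_REGISTER_BYTES // f32_size
per_sse_f64 = SSE_REGISTER_BYTES // f64_size

print(per_line_f32, per_line_f64, per_sse_f32, per_sse_f64)
```

This only counts elements; whether the doubled count turns into doubled throughput depends on the workload and the specific CPU.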

In order to have an insignificant error like the example you give, you are using up more memory, not less.

Having deltas orders of magnitude smaller than the precision you need is an argument against floats, not for them. There is nothing positive in brute-forcing the maximal error by throwing useless bytes at it.

They do have high precision around the range people use them in. What they don’t have, and I suppose this is what people mean by precision, is exactness. In most programming languages they are created by constructors accepting decimal notation, yet most common round decimal numbers are not representable with such data types. And that is why I don’t understand why they are so ubiquitous.
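The exactness point can be checked directly. A small sketch using only the standard library: `Decimal(float)` and `Fraction(float)` convert the stored binary value exactly, so they show what the float really holds rather than what was typed.

```python
from decimal import Decimal
from fractions import Fraction

# Decimal(float) converts the stored binary value exactly, with no rounding.
stored = Decimal(0.3)
written = Decimal("0.3")
print(stored)             # the exact value the float holds
print(stored == written)  # the typed decimal and the stored value differ

# A value whose denominator is a power of two is stored exactly.
half = Fraction(0.5)
# 0.3 is stored as some fraction with a power-of-two denominator instead.
three_tenths = Fraction(0.3)
print(half, three_tenths)
```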

I don’t think most floats are created to represent decimal numbers. Some are, like when representing currency or in a calculator, but most floats are representing light or sound levels, other sensor readings, internal weights in neural networks, etc.

I’m guessing you may work in a domain like finance where decimal numbers seem ubiquitous, but you’re not considering the wider use cases.

Yes, I do work in domains where decimal numbers are ubiquitous, floats are the plague. I see them even for representing natural numbers “in case we want to use a smaller unit”, and other such nonsense.

Even when used to store sensor readings (like image or sound), the only valid reason to use them is if dividing your scale exponentially serves you better than dividing it linearly. Which I would argue is perhaps half the time or less.

In machine learning, it’s common to optimize your parameters for space, since in those cases you typically don’t care about the precision loss compared to doubles and it lets you halve your parameter size, but you don’t want to use fixed point because your parameter range can be large. There are some approaches that involve 8-bit or 16-bit fixed point, but it’s not a universal thing at all.

In general, though, a lot of times they’re just Good Enough, and they save you from having to think about scaling constants or writing your own multiplication algorithms due to hardware support.

Are you talking about the C float type, i.e. 32-bit IEEE floating-point, or all floating point types? If the latter, what commonly available data type should people use instead? Last I checked, few languages offer fixed-point types.

32-bit float is often used internally in audio code (for example Apple’s CoreAudio) because it has as much precision as a 24-bit integer but (a) gives you a lot more dynamic range at low volume, and (b) doesn’t turn into garbage if a calculation overflows. (I don’t know if you’ve ever heard garbage played as PCM audio, but it’s the kind of harsh noise that can literally damage speakers or people’s hearing, or at least really startle the shit out of someone.)
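A rough sketch of the "as much precision as a 24-bit integer, but more dynamic range at low volume" claim. The helper `to_f32` is hypothetical (not from any audio library); it round-trips a value through a 32-bit float with the standard `struct` module. The 24-bit PCM scaling (full scale = 2**23) is also an assumption for illustration.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

# float32 has a 24-bit significand: integers up to 2**24 are exact...
exact = to_f32(float(2**24))
# ...but 2**24 + 1 needs a 25th bit and gets rounded.
rounded = to_f32(float(2**24 + 1))

# A very quiet sample, relative to a full scale of 1.0
quiet = 1e-6

# As float32, the relative error stays tiny.
f32_rel_err = abs(to_f32(quiet) - quiet) / quiet

# As a 24-bit signed integer sample (assumed full scale 2**23),
# only a handful of quantization steps are left at this level.
FULL_SCALE = 2**23
int_sample = round(quiet * FULL_SCALE)
int_rel_err = abs(int_sample / FULL_SCALE - quiet) / quiet

print(exact, rounded, f32_rel_err, int_sample, int_rel_err)
```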

A general reason for using floats is because a general purpose system — like the JavaScript language, or the SQLite database —doesn’t know the details of every possible use case, so providing FP math means it’s good enough for most use cases, and people with specialized needs can layer their own custom types, like BCD or fixed-point, on top of strings or integers.

JavaScript is a typical case where floating point is a bad default. Typical use cases for numerics are user-facing values such as prices, not 3D graphics.

I haven’t heard anyone say what should be used instead. Are you saying JavaScript should have provided a BCD DecimalNumber type instead of floating point? How would people doing any sort of numerics in JS have felt about this? Doing trigonometry or logarithms in BCD must be fun.

I’ve gone through a personal rollercoaster in my relationship with IEEE floating-point, and my current sense is that:

a) I’d love to have computers support a better representation like Unums or Posits or something else.

b) What we have available in mainstream hardware is fairly decent and certainly worth using while it’s the only option. Overflow and underflow in floating-point isn’t that different from overflow in integers, and a whole lot less likely to be encountered by most of us given the far larger domain of floating-point numbers.

c) The real problem lies in high-level languages that hide processor flags from programmers. C and C++ have some bolted-on support with intrinsics that nobody remembers to use. Rust, for all its advances in other areas, sadly hasn’t improved things here. Debug mode is a ghastly atavism, and having release builds silently wrap is a step back from gcc’s (also sub-optimal) -ftrapv and -fwrapv flags.

Haha as the implementor of unums and posits, I’d say unums are too much of a pain in the ass. Posits might have better performance, though if you need error analysis, it might be strictly worse. Posits had a fighting chance with the ML stuff going on but I think that ship has sailed.

As for ignored processor flags. I think zig is making an effort to make those sorts of intrinsics easily accessible as special case functions in the language, and hopefully they take on a strategy of making polyfilling easy for platforms that have partial support.

I use floats for GPU based computer graphics. I’ve read “Beating Floating Point at its Own Game: Posit Arithmetic”, and posits sound amazing: better numerical properties and more throughput for a given amount of silicon. But I’ve not used them, and I will never use them unless they are adopted by GPU manufacturers. Which I guess won’t happen unless some new company disrupts the existing GPU ecosystem with superior new GPU tech based on posits. Something like Apple with the M1, but more analogous to Space-X with the Falcon and Starship. I don’t see any reason for the large entrenched incumbents to gamble on new float technology that is incompatible with existing graphics standards.

Why do people use floats? Honest question. I don’t know any reason for using floats in any situation.

They’re used to represent real numbers. It’s easy and convenient to have types like float that natively represent real numbers. It’s also nice to have statically allocated, roughly word-sized representation (as opposed to arbitrary precision).

I don’t understand what you mean. Integers have uniform precision throughout the scale. Choose the base unit as you see fit for the precision you want and that is what you get.

It always “drops the same range of precision”. If you need the precision of a float around zero, then set your base unit to that and there you have it: that’s your maximum error. Unlike with floats.

When are integers slower and why? You always have to at least perform the same operation on the mantissa of your floats..?

the problem with fixed point is that you have to choose one range of precision, otherwise you’re just inventing what is likely to be a suboptimal software version of floating point. While there are (were?) cases where fixed point is acceptable, in general floating point can do better, and is faster.

The reason fixed point is slower boils down to the lack of hardware support for fixed point, but there are a few other reasons: efficiently and accurately computing a number of real functions often requires converting fixed point to some variant of floating point anyway.

In general integer operations are faster for basic arithmetic (and I really mean the basics: +,-,*), complex functions are typically made “fast” in fixed point arithmetic by having lookup tables that approximate the results, because fixed point arithmetic is typically used in places where accuracy is less important.

Multiplication, addition, subtraction of floating point is only marginally slower than integer arithmetic, and once you add in the shifts required for fixed point arithmetic floating point actually outperforms it.

It’s not a matter of just doing integer operations, because as you say everything is fundamentally integers in a CPU. The question is how many integer operations you have to do.

If you’re doing fixed point arithmetic you have to do almost everything floating point logic requires only without hardware support. Fixed point arithmetic isn’t simply integer arithmetic, it’s integer arithmetic plus large integer work, plus shifts. Because there isn’t hardware support, which there isn’t because if you’re adding hardware you may as well do floating point which is more generally useful.

Not to be stubborn, but I am still not getting your point.

The question is how many integer operations you have to do.

Fewer than half as many as if you use floats, obviously. Whatever operations your CPU does for integers, it needs to do for the mantissa of your floats, plus handle the exponents, plus move stuff out of the way and back in place.

Fixed point arithmetic isn’t simply integer arithmetic

I am not sure what you think I am suggesting, but to be clear it is: reduce all your variables to integers and do only integer arithmetic. It is, in the end, everything a processor is capable of doing. Integer arithmetic. Everything builds on it.

I think the confusion here is the notion of “point”. A computer is capable of representing a finite number of states. A point is useful for us humans to make things more readable. But for a computer, a number is always an element in a finite set. You suggest I need to mess around with fixed point arithmetic because I reject floats. But what I mean is: unless you hit scale limitations, there is no reason for using anything other than integers.

If the confusion is how the result is presented to the user… That is a non problem. Just format your number to whatever is most human readable.

Not to be stubborn, but I am still not getting your point.

no worries

Ok, the first problem here is that you can’t reduce everything to integer arithmetic, if I am doing anything that requires fractional values I need to adopt either fixed point or floating point arithmetic. Fixed point is inherently too inflexible to be worth creating a hardware back end for in a general purpose CPU, so has to be done in software, that gives you multiple instructions for each operation. If you are comparing fixed point to floating point in software fixed point generally wins, but the reality is the floating point is in hardware, so the number of instructions you are dispatching (which for basic arithmetic is the bottleneck) is lower, and floating point wins.

In this case point has nothing to do with what the human visible representation is. The point means how many fractional bits are available. It doesn’t matter what your representation is, floating vs fixed, the way you perform arithmetic is dependent on that decision. Fixed point arithmetic simplifies some of this logic which is why in software implementations it can beat floating point, but it does that by sacrificing range and precision.

To help clarify things, let’s use concrete examples: how are you proposing 1.5 gets represented, and how do you perform 1.5 * 0.5 and represent the result? I need to understand what you are proposing :D
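For reference, this is roughly what 1.5 * 0.5 looks like in binary fixed point. A minimal sketch, assuming a Q16.16 format (16 fractional bits); the names `to_fixed`, `fixed_mul` and `to_float` are hypothetical helpers, not from any library.

```python
FRAC_BITS = 16          # assumption: Q16.16 format
ONE = 1 << FRAC_BITS    # the fixed-point representation of 1.0

def to_fixed(x: float) -> int:
    """Scale a real value to an integer with FRAC_BITS fractional bits."""
    return round(x * ONE)

def fixed_mul(a: int, b: int) -> int:
    # The raw product carries 2 * FRAC_BITS fractional bits,
    # so shift back down to FRAC_BITS after multiplying.
    return (a * b) >> FRAC_BITS

def to_float(a: int) -> float:
    """Convert back to a float, for display only."""
    return a / ONE

a = to_fixed(1.5)
b = to_fixed(0.5)
product = fixed_mul(a, b)
print(a, b, product, to_float(product))
```

Note that the intermediate `a * b` needs twice the width of the operands before the shift. Python integers hide that; on 32-bit registers it is the "large integer work" mentioned earlier in the thread.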

I think the claim that precision and range are sacrificed doesn’t really hold. There is no silver bullet. The range of floats is larger because it has less precision as you get closer to the limits. Arguably, it has more precision where it is most useful, but this can be very deceiving. Include a large number in your computation and the end result might have less precision than what most people would think. They look at the decimal representation with a zillion decimal places and assume a great deal of precision. But you might have polluted your result with a huge error and it won’t show. This doesn’t happen with ints. You reach range limitations faster of course… But this isn’t very common with 64 bit ints.
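The "include a large number and the error won't show" effect is easy to reproduce. A small sketch with 64-bit floats against plain integers; the specific magnitudes are chosen for illustration only.

```python
big = 1e16
small = 1.0

# At 1e16 adjacent 64-bit floats are 2 apart, so adding 1.0 is lost.
float_result = (big + small) - big

# The same computation in integers (the values fit in a 64-bit int).
int_result = (10**16 + 1) - 10**16

print(float_result, int_result)
```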

But your final question perfectly illustrates the problem. As a programmer, you need to decide what should happen ahead of time. If you mean those values as exact values then you pretty much need a CAS to handle fractions, roots and so on. Which obviously has no use for floats.
If you mean approximate values, you need to be explicit and be in charge of the precision you intend.
1.5*0.5 is 0.7 or 0.8. It doesn’t make sense to include more decimal places if you are not doing exact calculations.

We learn this in school and my pocket TI calculator does this. If you set precision to automatic and insert 1/3, the result is zero. But if you insert 1/3.0, the result is 0.3. Why would you want more decimal places if the number cannot possibly be stored with its exact value and is derived from numbers with less precision?

If you write 1.000 kg, it doesn’t mean the same as 1kg. If you mean the first it means a precision to the gram, and the easiest when writing a computer program is to just reduce to grams and proceed with integer arithmetic.

the claim that precision and range are sacrificed doesn’t really hold

This is well studied. For example, I’ve seen the results of computational fluid dynamics simulations where, taking f128 to be “ground truth”, f64 gets far closer to the correct answer than any fixed64 representation.

Consider something like 1 / x^2, where x >> 1. You have to calculate x², which will be a very large number, and then take the reciprocal, which will be a very small number. You can’t pick a single fixed-point format to cover both, and there’s no opportunity in that one calculation to switch between two formats.

Situations like that are common in many scientific applications, where intermediate stages of computation are much bigger and smaller than both your inputs and final output.
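A sketch of the 1 / x^2 case, assuming a signed 64-bit Q32.32 fixed-point format and x = 10**6 (both are assumptions picked for illustration). Python integers do not overflow, so the code checks the 64-bit limit by hand.

```python
FRAC_BITS = 32               # assumption: Q32.32 in a signed 64-bit word
ONE = 1 << FRAC_BITS
INT64_MAX = 2**63 - 1

x = 10**6
x_fixed = x * ONE            # fits comfortably in 64 bits

# Fixed-point square: multiply, then shift the extra fractional bits away.
x_squared_fixed = (x_fixed * x_fixed) >> FRAC_BITS
overflows = x_squared_fixed > INT64_MAX   # too big for the format

# Fixed-point reciprocal: (1.0 / value) keeps FRAC_BITS fractional bits.
reciprocal_fixed = (ONE * ONE) // x_squared_fixed
underflows = reciprocal_fixed == 0        # smaller than one step, becomes 0

# The same computation in 64-bit floating point.
float_result = 1.0 / float(x) ** 2

print(overflows, underflows, float_result)
```

Moving the binary point helps one end only at the expense of the other: more integer bits fix the overflow and make the underflow worse, and vice versa.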

That is when one would use floats, yes. But let’s be clear. They are common in some scientific applications, specifically chemistry. The maxint of a 32 bit integer is plenty for most uses.

64 bit processors have been the standard for over a decade. Even those situations you mention hardly need a range larger than a 64 bit integer.

That is when one would use floats, yes. But let’s be clear. They are common in some scientific applications, specifically chemistry. The maxint of a 32 bit integer is plenty for most uses.

I can’t think of a scientific field which wouldn’t prefer floats to 32 bit integers. What happens when you need to find a definite integral, or RK4 a PDE, or take the determinant of a matrix?

64 bit processors have been the standard for over a decade. Even those situations you mention hardly need a range larger than a 64 bit integer.

Regarding your first paragraph: I don’t think you are getting that I am suggesting adjusting the base unit to whatever precision delta you intend. Otherwise I don’t understand your question. Could you be clear about what exactly happens if you use floats that wouldn’t happen otherwise?
They are both data types made of a discrete set representing points on the real number axis. What limitations exactly are you suggesting integers have other than their range?

As for your second paragraph, isn’t it the other way around? Isn’t the point of floats to overcome integer range and precision limits and strike a balance between both? Why would you need to do that if you don’t have such limitations anymore?
Floats were used all the time on 8 bit processors, even for things you would use integers for, because of range limitations. We don’t need to do that on our 32 and 64 bit processors.

I think there is this wrong idea that ints are meant to be used for natural numbers and such only. Which is of course a misconception.

Regarding your first paragraph: I don’t think you are getting that I am suggesting adjusting the base unit to whatever precision delta you intend. Otherwise I don’t understand your question. Could you be clear about what exactly happens if you use floats that wouldn’t happen otherwise? They are both data types made of a discrete set representing points on the real number axis. What limitations exactly are you suggesting integers have other than their range?

My point is that all three of those things involve working with both very large and very small numbers simultaneously. You can’t “just set the precision delta”. Or if you can, you’d have to provide a working demonstration, because I believe it’s much harder than you’re claiming it is.

Also, lots of science involves multiplying very small by very large numbers directly, such as with gravitational force.

As for your second paragraph, isn’t it the other way around? Isn’t the point of floats to overcome integer range and precision limits and strike a balance between both? Why would you need to do that if you don’t have such limitations anymore? Floats were used all the time on 8 bit processors, even for things you would use integers for, because of range limitations. We don’t need to do that on our 32 and 64 bit processors.

I think we use them for lots of reasons, and one is that you don’t need to pick a basis in advance of computation, like you do with fixed width.

# literal reals in python are IEEE floats
>>> 0.2 + 0.1
0.30000000000000004

vs

# Decimal is arbitrary-precision decimal arithmetic (CPython builds it on libmpdec, not GMP) - ie proper numbers
>>> from decimal import Decimal
>>> Decimal("0.2") + Decimal("0.1")
Decimal('0.3')

Extra syntax and extra library (even though it’s in the stdlib!) is a huge barrier. I have seen a number of real world systems be written to use floats - and suffer constant minor bugs - simply because it was easier.

Once or twice I have ripped out floats for decimals. It’s not too hard but you do need a typechecker to keep things straight.

Precision degrades much more gracefully with floating point operations (which round to approximate values or saturate to 0 or inf) than with integer or fixed width operations (which truncate or overflow).

If you have to do work with real numbers then floats are usually best of those three options.
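The difference between "round or saturate" and "truncate or overflow" can be shown side by side. A sketch using `ctypes.c_int32` to imitate 32-bit integer wraparound, since Python's own integers never overflow; the specific values are illustrative.

```python
import ctypes
import math

INT32_MAX = 2**31 - 1

# ctypes.c_int32 does no overflow checking, so it wraps like 32-bit hardware.
wrapped = ctypes.c_int32(INT32_MAX + 1).value

# Floats saturate to infinity on overflow...
saturated = 1e308 * 10
# ...and round to zero on underflow.
vanished = 5e-324 / 2

# Integer division drops the fractional part entirely,
# while float division rounds to the nearest representable value.
truncated = 7 // 2
approximated = 7 / 2

print(wrapped, saturated, vanished, truncated, approximated)
```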

bah, floats. when I worked in embedded systems we didn’t have such luxuries! fractional values were represented by scaling integers and shifting left/right to save time. yer all so spoiled! ;P

Now I have to unlearn all the bitshifting and scaling tricks that I used for so long. They are actually slower.

I’m curious that you were still using them now. Compilers have been able to do more clever tricks than I can remember to apply uniformly for ages (this kind of local peephole replacement is one of the easiest things to automate) and generally have a better cost model for my CPU than I can keep in my head. What compiler were you using that couldn’t do these transforms and for what target?

Scaling. Without floating point, it is advantageous to scale to arbitrary units so that full scale is full scale of the processor’s register size. E.g., brads instead of degrees. https://en.wikipedia.org/wiki/Binary_scaling

Filtering. I wrote a script to choose FIR coefficients that have a maximum of one shift and one addition or subtraction. I had to implement the shifts and arithmetic manually; the compiler wanted to insert calls to its soft-multiply algorithms.

There are other “Hacker’s Delight” tricks you can do. These are the ones I have used in earnest. All obviated by single cycle floating-point multiplies.
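As an illustration of the scaling point above, here is a sketch of brads (binary angles), assuming 8-bit angles with 256 brads per full turn. The helper names are hypothetical.

```python
BRADS_PER_TURN = 256  # assumption: 8-bit binary angles

def degrees_to_brads(degrees: float) -> int:
    """Scale degrees so that a full turn uses the whole 8-bit range."""
    return round(degrees * BRADS_PER_TURN / 360) & 0xFF

def add_angles(a: int, b: int) -> int:
    # Masking to 8 bits makes wrap-around past a full turn automatic,
    # which is what an 8-bit register does for free.
    return (a + b) & 0xFF

quarter = degrees_to_brads(90)
full = degrees_to_brads(360)
wrapped = add_angles(degrees_to_brads(270), degrees_to_brads(180))

print(quarter, full, wrapped)
```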

Can’t speak for mark, but I wasn’t using a compiler, I was writing assembly (this was in the late 1990s) for disk drive controllers that needed to compute their control output in the order of microseconds. We’d need to do ridiculous low-level stuff like optimize away pipeline stalls. It was for a fairly boutique microprocessor so there wasn’t a ton of compiler support.

I think they rewrote the servo controls in C shortly after I left because the cost/benefit of hand-rolling our own assembly code had finally shifted in the compiler’s favor.

What compiler were you using that couldn’t do these transforms and for what target?

I think he’s referring to implementing fixed-point arithmetic; shifting and scaling are required after some operations. Here’s a SO example with a brief introduction.

Yeah exactly. So to represent, e.g., one-third, you could approximate it as 11/32. So to divide by 3 (i.e. multiply by one third), you’d actually multiply by 11, and do a right shift by 5 bits (dividing by 2^5). If you want rounding, you’d add 16 before shifting. Note here that the denominator needs to be a 2^n number for the shift to work.

Why 11/32 and not something even more precise like 341/1024? Because the more precise you get, the bigger the numerator gets, and the more you are in danger of saturating your accumulator by making the result too big. So it’s a careful dance of maximizing the precision but taking care to not let the result get “too big” first.

And AFAIK compilers can’t really optimize for that.
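The multiply-and-shift trick and its precision/headroom trade-off can be sketched like this. The 16-bit unsigned accumulator is an assumption for illustration, and `scaled_divide` and `max_safe_input` are hypothetical helper names.

```python
ACC_MAX = 2**16 - 1  # assumption: 16-bit unsigned accumulator

def scaled_divide(x: int, numerator: int, shift: int) -> int:
    """Approximate x * numerator / 2**shift, rounding to nearest."""
    half = 1 << (shift - 1)          # e.g. 16 when shifting by 5
    return (x * numerator + half) >> shift

def max_safe_input(numerator: int, shift: int) -> int:
    """Largest x whose intermediate sum still fits in the accumulator."""
    half = 1 << (shift - 1)
    return (ACC_MAX - half) // numerator

# Dividing 96 by 3 (exact answer: 32)
coarse = scaled_divide(96, 11, 5)     # 11/32
fine = scaled_divide(96, 341, 10)     # 341/1024

print(coarse, fine)
print(max_safe_input(11, 5), max_safe_input(341, 10))
```

The finer approximation gives the better answer here but leaves far less input range before the accumulator overflows, which is the "careful dance" described above.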

This lets you see that everyone’s favorite floating point 0.3 doesn’t exist:

Good choice of name and domain name for the website. They are used like the plague, and the negative connotation is most welcome these days, when people do things the way they are supposed to rather than having a specific reason for it.

Why do people use floats? Honest question. I don’t know any reason for using floats in any situation.

Numeric processing with high dynamic range is simpler with floating-point numbers than fixed-point numbers. In particular, they have the ability to temporarily exceed range limitations with a fair amount of headroom and only a modest loss of precision.

I agree this is the kind of thing they are appropriate for. A rather specific use case.

I’m not sure that “any science, physics or simulation anywhere, ever” is a very specific use case. Just not one that overlaps much with current hip new computing tech.

High dynamic range = most graphics, so it’s not actually very specific

They require less memory and are adequate for some kinds of programming where higher precision isn’t necessary. For instance, https://gioui.org uses them for pixel offsets because fractional pixels don’t matter beyond a point. 0.00000000001 pixels isn’t usually worth worrying about in an application’s layout.

I also think that there are some processors on which float32 operations are faster than float64, but I don’t think that’s true of conventional x86_64 processors.

It’s true that there are lots of cases where you won’t see a difference at all because you’re limited by something else (e.g. the cost and latency of arithmetic can be hidden by memory latency sometimes), but I would not state this with confidence.

When you’re cache or memory bandwidth limited, you can fit twice as many float32 numbers into each cache line.

Vector operations on float32s typically have twice the throughput. All the vector operations in SSE and SSE2 for example come in versions that work on float32 or float64 numbers packed into 128 bit registers. The 32 bit versions operate on twice as many numbers with the same or better latency and clocks-per-instruction (according to Intel’s documentation, at least).

A few operations (such as division) have slightly worse latency noted in Intel’s docs for float64 versions.

In order to have an insignificant error like the example you give, you are using up more memory, not less.

Having deltas orders of magnitude smaller than the precision you need is an argument against floats, not for them. There is nothing positive in brute-forcing the maximal error by throwing useless bytes at it.

They do have high precision around the range people use them in. What they don’t have, and I suppose this is what people mean by precision, is exactness. In most programming languages they are created by constructors accepting decimal notation, yet most common round decimal numbers are not representable with such data types. And that is why I don’t understand why they are so ubiquitous.

I don’t think most floats are created to represent decimal numbers. Some are, like when representing currency or in a calculator, but most floats are representing light or sound levels, other sensor readings, internal weights in neural networks, etc.

I’m guessing you may work in a domain like finance where decimal numbers seem ubiquitous, but you’re not considering the wider use cases.

Yes, I do work in domains where decimal numbers are ubiquitous, floats are the plague. I see them even for representing natural numbers “in case we want to use a smaller unit”, and other such nonsense.

Even when used to store sensor readings (like image or sound), the only valid reason to use them is if dividing your scale exponentially serves you better than dividing it linearly. Which I would argue is perhaps half the time or less.

In machine learning, it’s common to optimize your parameters for space, since in those cases you typically don’t care about the precision loss compared to doubles and it lets you halve your parameter size, but you don’t want to use fixed point because your parameter range can be large. There are some approaches that involve 8-bit or 16-bit fixed point, but it’s not a universal thing at all.

In general, though, a lot of times they’re just Good Enough, and they save you from having to think about scaling constants or writing your own multiplication algorithms due to hardware support.

Are you talking about the C `float` type, i.e. 32-bit IEEE floating-point, or all floating point types? If the latter, what commonly available data type should people use instead? Last I checked, few languages offer fixed-point types.

32-bit float is often used internally in audio code (for example Apple’s CoreAudio) because it has as much precision as a 24-bit integer but (a) gives you a lot more dynamic range at low volume, and (b) doesn’t turn into garbage if a calculation overflows. (I don’t know if you’ve ever heard garbage played as PCM audio, but it’s the kind of harsh noise that can literally damage speakers or people’s hearing, or at least really startle the shit out of someone.)

A general reason for using floats is because a general purpose system — like the JavaScript language, or the SQLite database —doesn’t know the details of every possible use case, so providing FP math means it’s good enough for most use cases, and people with specialized needs can layer their own custom types, like BCD or fixed-point, on top of strings or integers.

JavaScript is a typical case where floating point is a bad default. Typical use cases for numerics are user-facing values such as prices, not 3D graphics.

I haven’t heard anyone say what should be used instead. Are you saying JavaScript should have provided a BCD DecimalNumber type instead of floating point? How would people doing any sort of numerics in JS have felt about this? Doing trigonometry or logarithms in BCD must be fun.

I’ve gone through a personal rollercoaster in my relationship with IEEE floating-point, and my current sense is that:

a) I’d love to have computers support a better representation like Unums or Posits or something else.

b) What we have available in mainstream hardware is fairly decent and certainly worth using while it’s the only option. Overflow and underflow in floating-point isn’t that different from overflow in integers, and a whole lot less likely to be encountered by most of us given the far larger domain of floating-point numbers.

c) The real problem lies in high-level languages that hide processor flags from programmers. C and C++ have some bolted-on support with intrinsics that nobody remembers to use. Rust, for all its advances in other areas, sadly hasn’t improved things here. Debug mode is a ghastly atavism, and having release builds silently wrap is a step back from gcc’s (also sub-optimal) `-ftrapv` and `-fwrapv` flags.

Haha as the implementor of unums and posits, I’d say unums are too much of a pain in the ass. Posits might have better performance, though if you need error analysis, it might be strictly worse. Posits had a fighting chance with the ML stuff going on but I think that ship has sailed.

As for ignored processor flags. I think zig is making an effort to make those sorts of intrinsics easily accessible as special case functions in the language, and hopefully they take on a strategy of making polyfilling easy for platforms that have partial support.

I use floats for GPU based computer graphics. I’ve read “Beating Floating Point at its Own Game: Posit Arithmetic”, and posits sound amazing: better numerical properties and more throughput for a given amount of silicon. But I’ve not used them, and I will never use them unless they are adopted by GPU manufacturers. Which I guess won’t happen unless some new company disrupts the existing GPU ecosystem with superior new GPU tech based on posits. Something like Apple with the M1, but more analogous to Space-X with the Falcon and Starship. I don’t see any reason for the large entrenched incumbents to gamble on new float technology that is incompatible with existing graphics standards.

Yeap. Sorry it didn’t work out. We tried though (I even have some verilog models for posit circuits).

Swift’s default integer arithmetic operators panic on overflow. (There are alternate ones that ignore overflow, for performance hot spots.)

Or when you actually need that behaviour, such as in hashing functions. But you don’t want your customer ids to actually wrap around silently.

They’re used to represent real numbers. It’s easy and convenient to have types like `float` that natively represent real numbers. It’s also nice to have statically allocated, roughly word-sized representation (as opposed to arbitrary precision).

Why? What makes them more suited than integers for representing real numbers?

Fractions, sqrt, etc. Fixed point arithmetic drops a huge range of precision at either the high or the low end, and is also slower for many operations.

I don’t understand what you mean. Integers have uniform precision throughout the scale. Choose the base unit as you see fit for the precision you want and that is what you get.

It always “drops the same range of precision”. If you need the precision of a float around zero, then set your base unit to that and there you have it: that’s your maximum error. Unlike with floats.

When are integers slower and why? You always have to at least perform the same operation on the mantissa of your floats..?

The problem with fixed point is that you have to choose *one* range of precision, otherwise you’re just inventing what is likely to be a suboptimal software version of floating point. While there are (were?) cases where fixed point is acceptable, in general floating point can do better, and is faster.

The reason fixed point is slower boils down to the lack of hardware support for fixed point, but there are a few other reasons: efficiently and accurately computing a number of real functions often requires converting fixed point to some variant of floating point anyway.

In general integer operations are faster for basic arithmetic (and I really mean the basics: +,-,*), complex functions are typically made “fast” in fixed point arithmetic by having lookup tables that approximate the results, because fixed point arithmetic is typically used in places where accuracy is less important.

Multiplication, addition, subtraction of floating point is only marginally slower than integer arithmetic, and once you add in the shifts required for fixed point arithmetic floating point actually outperforms it.

I have no idea what you mean by “lack of hardware support”. Manipulating integers is literally everything a processor does at a low level.

What are you referring to?

It’s not a matter of just doing integer operations, because as you say everything is fundamentally integers in a CPU. The question is how many integer operations you have to do.

If you’re doing fixed point arithmetic you have to do almost everything floating point logic requires only without hardware support. Fixed point arithmetic isn’t simply integer arithmetic, it’s integer arithmetic plus large integer work, plus shifts. Because there isn’t hardware support, which there isn’t because if you’re adding hardware you may as well do floating point which is more generally useful.
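A minimal sketch of what “integer arithmetic plus large integer work, plus shifts” looks like, assuming a Q16.16 format (16 integer bits, 16 fractional bits). The format choice and the helper names are assumptions for illustration only.

```python
FRAC_BITS = 16          # assumed Q16.16 layout
ONE = 1 << FRAC_BITS    # the fixed-point representation of 1.0


def to_fixed(value: float) -> int:
    """Convert a real value to Q16.16 (rounding to the nearest step)."""
    return round(value * ONE)


def to_float(fixed: int) -> float:
    """Convert Q16.16 back to a float for display."""
    return fixed / ONE


def fixed_add(a: int, b: int) -> int:
    """Addition really is plain integer addition."""
    return a + b


def fixed_mul(a: int, b: int) -> int:
    """Multiplication is not: the raw product has 32 fractional bits and can
    need up to 64 bits, so it has to be widened and then shifted back down."""
    wide_product = a * b            # on a 32-bit CPU this is the "large integer work"
    return wide_product >> FRAC_BITS


a = to_fixed(2.25)
b = to_fixed(1.5)
product = fixed_mul(a, b)
print(to_float(product))            # 3.375
print(a * b > (1 << 32) - 1)        # True: the intermediate overflows 32 bits
```

Every multiply costs a widening multiply plus a shift, and division needs the mirror-image pre-shift. That is the per-operation overhead being argued about here.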

Not to be stubborn but I am still not getting your point.

Less than half as many as if you use floats, obviously. Whatever operations your CPU does for integers, it needs to do for the mantissa of your floats, plus handle the exponents, plus move stuff out of the way and back in place.

I am not sure what you think I am suggesting, but to be clear it is: reduce all your variables to integers and do only integer arithmetic. It is, in the end, everything a processor is capable of doing. Integer arithmetic. Everything builds on it.

I think the confusion here is the notion of “point”. A computer is capable of representing a finite number of states. A point is useful for us humans to make things more readable. But for a computer, a number is always an element in a finite set. You suggest I need to mess around with fixed point arithmetic because I reject floats. But what I mean is: unless you hit scale limitations, there is no reason for using anything other than integers.

If the confusion is how the result is presented to the user… That is a non problem. Just format your number to whatever is most human readable.

no worries

Ok, the first problem here is that you can’t reduce everything to integer arithmetic: if I am doing anything that requires fractional values I need to adopt either fixed point or floating point arithmetic. Fixed point is inherently too inflexible to be worth creating a hardware back end for in a general purpose CPU, so it has to be done in software, and that gives you multiple instructions for each operation. If you are comparing fixed point to floating point in *software*, fixed point generally wins, but the reality is that floating point is in hardware, so the number of instructions you are dispatching (which for basic arithmetic is the bottleneck) is lower, and floating point wins.

In this case “point” has nothing to do with what the human visible representation is. The point means how many fractional bits are available. It doesn’t matter what your representation is, floating vs fixed, the way you perform arithmetic is dependent on that decision. Fixed point arithmetic simplifies some of this logic, which is why in software implementations it can beat floating point, but it does that by sacrificing range and precision.

To help clarify things let’s use concrete examples: how are you proposing 1.5 gets represented, and how do you perform 1.5 * 0.5 and represent the result? I need to understand what you are proposing :D

I think the claim that precision and range are sacrificed doesn’t really hold. There is no silver bullet. The range of floats is larger because it has less precision as you get closer to the limits. Arguably, it has more precision where it is most useful, but this can be very deceiving. Include a large number in your computation and the end result might have less precision than most people would think. They look at the decimal representation with a zillion decimal places and assume a great deal of precision. But you might have polluted your result with a huge error and it won’t show. This doesn’t happen with ints. You reach range limitations faster of course… But this isn’t very common with 64 bit ints.
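A small Python sketch of the effect described above, assuming IEEE 754 doubles (which is what Python’s `float` is on common platforms). The specific values are chosen for illustration.

```python
# A large value and a small value that both fit easily in a 64-bit integer.
big = 10**16
small = 1

# With floats, the small value is absorbed: near 1e16 neighbouring doubles
# are 2 apart, so adding 1 rounds straight back to 1e16.
float_result = (float(big) + float(small)) - float(big)

# With integers, the same computation is exact.
int_result = (big + small) - big

print(float_result)  # 0.0
print(int_result)    # 1
```

The float result still prints as a tidy number, which is the deceptive part: nothing in the output hints that the contribution of `small` was lost.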

But your final question perfectly illustrates the problem. As a programmer, you need to decide what should happen ahead of time. If you mean those values as exact values then you pretty much need a CAS to handle fractions, roots and so on. Which obviously has no use for floats. If you mean approximate values, you need to be explicit and be in charge of the precision you intend. 1.5*0.5 is 0.7 or 0.8. It doesn’t make sense to include more decimal places if you are not doing exact calculations.

We learn this in school and my pocket TI calculator does this. If you set precision to automatic and insert 1/3, the result is zero. But if you insert 1/3.0, the result is 0.3. Why would you want more decimal places if the number cannot possibly be stored with its exact value and is derived from numbers with less precision?

If you write 1.000 kg, it doesn’t mean the same as 1kg. If you mean the first it means a precision to the gram, and the easiest when writing a computer program is to just reduce to grams and proceed with integer arithmetic.

This is well studied. For example, I’ve seen the results of computational fluid dynamics simulations where, taking f128 to be “ground truth”, f64 gets far closer to the correct answer than *any* fixed64 representation.

Consider something like `1 / x^2`, where `x >> 1`. You have to calculate x², which will be a very large number, and then take the reciprocal, which will be a very small number. You can’t pick a single fixed-point format to cover both, and there’s no opportunity in that one calculation to switch between two formats.

Situations like that are common in many scientific applications, where intermediate stages of computation are much bigger and smaller than both your inputs and final output.
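A rough sketch of the `1 / x^2` problem in Python, assuming a signed 64-bit word split as Q32.32 (32 integer bits, 32 fractional bits). The split and the value of `x` are assumptions picked for illustration; other splits move the problem around but do not remove it.

```python
FRAC_BITS = 32                 # assumed Q32.32 layout in a signed 64-bit word
ONE = 1 << FRAC_BITS
INT64_MAX = (1 << 63) - 1


def fixed_mul(a: int, b: int) -> int:
    """Q32.32 multiply that reports overflow instead of wrapping."""
    result = (a * b) >> FRAC_BITS
    if result > INT64_MAX:
        raise OverflowError("product does not fit in a signed 64-bit word")
    return result


x = 1_000_000
x_fixed = x << FRAC_BITS       # x itself fits comfortably

# Step 1: x squared is 1e12, far beyond the ~2.1e9 a Q32.32 value can hold.
try:
    fixed_mul(x_fixed, x_fixed)
    square_fits = True
except OverflowError:
    square_fits = False

# Step 2: the final answer, 1e-12, is smaller than the format's step size
# (2**-32, roughly 2.3e-10), so it would be stored as zero.
answer_fixed = round(1e-12 * ONE)

# The same computation with doubles just works.
answer_float = 1.0 / (float(x) * float(x))

print(square_fits)    # False
print(answer_fixed)   # 0
print(answer_float)   # about 1e-12
```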

That is when one would use floats, yes. But let’s be clear: they are common in *some* scientific applications, specifically chemistry. The maxint of a 32 bit integer is plentiful for most usages.

64 bit processors have been the standard for over a decade. Even those situations you mention hardly need a range larger than a 64 bit integer.

I can’t think of a scientific field which *wouldn’t* prefer floats to 32 bit integers. What happens when you need to find a definite integral, or RK4 a PDE, or take the determinant of a matrix?

If we’ve got 64 bits, then why not use a double?

Regarding your first paragraph: I don’t think you are getting that I am suggesting to adjust the base unit to whatever precision delta you intend. Otherwise I don’t understand your question. Could you be clear about what exactly happens if you use floats that wouldn’t happen otherwise? They are both a data type made of a discrete set representing points on the real number axis. What limitations exactly are you suggesting integers have other than their range?

As for your second paragraph, isn’t it the other way around? Isn’t the point of floats to overcome integer range and precision limits and strike a balance between both? Why would you need to do that if you don’t have such limitations anymore? Floats were used all the time on 8 bit processors, even for things you would use integers for, because of range limitations. We don’t need to do that on our 32 and 64 bit processors.

I think there is this wrong idea that ints are meant to be used for natural numbers and such only. Which is of course a misconception.

My point is that all three of those things involve working with both very large and very small numbers simultaneously. You can’t “just set the precision delta”. Or if you can, you’d have to provide a working demonstration, because I believe it’s much harder than you’re claiming it is.

Also, lots of science involves multiplying very small by very large numbers directly, such as with gravitational force.
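As a rough illustration of the gravitational force case, here is a Python sketch using rounded textbook values for the Earth and Moon. The constants are approximations supplied for this example, not figures from the thread.

```python
# Rounded textbook values (assumptions for illustration).
G = 6.674e-11            # gravitational constant, m^3 kg^-1 s^-2
EARTH_MASS = 5.972e24    # kg
MOON_MASS = 7.348e22     # kg
DISTANCE = 3.844e8       # m, mean Earth-Moon distance

# With doubles, a tiny constant times two huge masses is a one-liner.
force = G * EARTH_MASS * MOON_MASS / (DISTANCE * DISTANCE)

# With integers, even the product of the two masses in whole kilograms
# is far too large for a signed 64-bit word.
mass_product = (5972 * 10**21) * (7348 * 10**19)
fits_in_int64 = mass_product.bit_length() <= 63

print(f"{force:.3e} N")   # on the order of 2e20 N
print(fits_in_int64)      # False
```

An integer version would need a separate scale factor chosen by hand for `G`, for the masses, and for each intermediate result, which is the bookkeeping a float’s exponent does automatically.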

I think we use them for lots of reasons, and one is that you don’t need to pick a basis in advance of computation, like you do with fixed width.

Floating-point numbers can only represent (binary) fractions, but many real numbers need to be represented by computations which emit digits.

One of the most important reasons is that floats are invariably literals whereas “proper” decimals are usually not

How so?

eg in Python

vs

Extra syntax and extra library (even though it’s in the stdlib!) is a huge barrier. I have seen a number of real world systems be written to use floats - and suffer constant minor bugs - simply because it was easier.
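A minimal sketch of the kind of contrast being described, assuming the comparison is between a bare float literal and the standard library’s `decimal.Decimal`:

```python
from decimal import Decimal

# Floats are available as plain literals, with no import and no constructor.
float_total = 0.1 + 0.2

# Decimals need an import, a constructor call, and string arguments
# (passing a float literal would bake the binary rounding error in).
decimal_total = Decimal("0.1") + Decimal("0.2")

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.3"))   # True
```

The second form is the correct one for money-like values, but it is also visibly more typing, which is the barrier in question.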

Once or twice I have ripped out floats for decimals. It’s not too hard but you do need a typechecker to keep things straight.

Precision degrades much more gracefully with floating point operations (which round to approximate values or saturate to 0 or inf) than with integer or fixed width operations (which truncate or overflow).

If you have to do work with real numbers then floats are usually best of those three options.
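A short Python sketch of the difference in failure modes. Python integers are unbounded, so 32-bit wraparound is emulated with a hypothetical `wrap_int32` helper; the float behaviour assumes IEEE 754 doubles.

```python
def wrap_int32(n: int) -> int:
    """Emulate two's-complement wraparound of a signed 32-bit integer."""
    n &= 0xFFFFFFFF
    return n - (1 << 32) if n >= (1 << 31) else n


# Fixed-width integers: overflow wraps to something wildly wrong,
# and division throws the fractional part away.
int_overflow = wrap_int32(2_000_000_000 + 2_000_000_000)
int_truncation = 7 // 2

# Floats: overflow saturates to infinity, underflow saturates to zero,
# and division rounds to the nearest representable value.
float_overflow = 1e308 * 10.0
float_underflow = 1e-320 / 1e10
float_rounding = 7 / 2

print(int_overflow)     # -294967296
print(int_truncation)   # 3
print(float_overflow)   # inf
print(float_underflow)  # 0.0
print(float_rounding)   # 3.5
```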

[Comment removed by author]

Very useful for the dramatic demonstration of how sparse floating point numbers get as exponents increase.
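The sparseness can also be checked directly. This sketch assumes Python 3.9 or later, where `math.ulp` returns the gap between a float and the next representable one.

```python
import math

# Gap between neighbouring doubles at different magnitudes.
gap_near_one = math.ulp(1.0)       # 2**-52, about 2.2e-16
gap_near_1e16 = math.ulp(1e16)     # 2.0: odd integers are no longer representable
gap_near_1e300 = math.ulp(1e300)   # astronomically large

# Above 2**53, adding 1 can be a no-op.
lost_increment = (2.0**53 + 1.0) == 2.0**53

print(gap_near_one)
print(gap_near_1e16)
print(gap_near_1e300)
print(lost_increment)   # True
```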

bah, floats. when I worked in embedded systems we didn’t have such luxuries! fractional values were represented by scaling integers and shifting left/right to save time. yer all so spoiled! ;P

Recently switched from a processor without a multiply instruction to ARM Cortex with a 32-bit floating point processor.

One cycle multiplies! We’re all living in the future!

Now I have to unlearn all the bitshifting and scaling tricks that I used for so long. They are actually slower.

I’m curious that you were still using them now. Compilers have been able to do more clever tricks than I can remember to apply uniformly for ages (this kind of local peephole replacement is one of the easiest things to automate) and generally have a better cost model for my CPU than I can keep in my head. What compiler were you using that couldn’t do these transforms and for what target?

Compiler? Just msp430-gcc.

Here are some examples.

Scaling. Without floating point, it is advantageous to scale to arbitrary units so that full scale is full scale of the processor’s register size. E.g., brads instead of degrees. https://en.wikipedia.org/wiki/Binary_scaling

Filtering. I wrote a script to choose FIR coefficients that have a maximum of one shift and one addition or subtraction. I had to implement the shifts and arithmetic manually; the compiler wanted to insert calls to its soft-multiply algorithms.

Trig. I wrote my own implementation of atan2() using bit shifts and addition. (standard CORDIC algorithm, https://github.com/markrages/cordic_polar.)

There are other “Hacker’s Delight” tricks you can do. These are the ones I have used in earnest. All obviated by single cycle floating-point multiplies.
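For readers who have not seen the trig trick, here is a rough Python sketch of a CORDIC-style `atan2` that uses only shifts, additions and a small lookup table at run time. It is not the code from the linked repository; the 16-bit angle scaling, the iteration count and the name `cordic_atan2` are assumptions made for this example. The table is built with `math.atan` here, whereas on a microcontroller it would be a precomputed constant array.

```python
import math

ANGLE_BITS = 16    # assumed scaling: angles are radians * 2**16
ITERATIONS = 16    # assumed iteration count

# atan(2**-i) for each step, in fixed point (precomputed on a real target).
ATAN_TABLE = [round(math.atan(2.0 ** -i) * (1 << ANGLE_BITS))
              for i in range(ITERATIONS)]
PI_FIXED = round(math.pi * (1 << ANGLE_BITS))


def cordic_atan2(y: int, x: int) -> int:
    """Return atan2(y, x) in radians scaled by 2**ANGLE_BITS.

    Inputs are integers; scale them up (e.g. to around 2**14) for accuracy.
    """
    if x == 0 and y == 0:
        return 0
    angle = 0
    # Rotate by half a turn if needed so the vector is in the right half-plane.
    if x < 0:
        angle = PI_FIXED if y >= 0 else -PI_FIXED
        x, y = -x, -y
    # Rotate the vector towards the x axis in ever smaller steps,
    # accumulating the total rotation angle.
    for i in range(ITERATIONS):
        if y > 0:
            x, y = x + (y >> i), y - (x >> i)
            angle += ATAN_TABLE[i]
        elif y < 0:
            x, y = x - (y >> i), y + (x >> i)
            angle -= ATAN_TABLE[i]
        else:
            break
    return angle


print(cordic_atan2(1 << 14, 1 << 14) / (1 << ANGLE_BITS))  # close to pi/4
```

The loop body is two shifts, two add/subtracts and a table lookup, which is why it suits processors without a hardware multiplier.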

Can’t speak for mark, but I wasn’t using a compiler, I was writing assembly (this was in the late 1990s) for disk drive controllers that needed to compute their control output in the order of microseconds. We’d need to do ridiculous low-level stuff like optimize away pipeline stalls. It was for a fairly boutique microprocessor so there wasn’t a ton of compiler support.

I think they rewrote the servo controls in C shortly after I left because the cost/benefit of hand-rolling our own assembly code had finally shifted in the compiler’s favor.

I think he’s referring to implementing fixed-point arithmetic; shifting and scaling are required after some operations. Here’s a SO example with a brief introduction.

Yeah exactly. So to represent, e.g., one-third, you could approximate it as 11/32. So to divide by 3 (i.e. multiply by one third), you’d actually multiply by 11, and do a right shift by 5 bits (dividing by 2^5). If you want rounding, you’d add 16 before shifting. Note here that the denominator needs to be a 2^n number for the shift to work.

Why 11/32 and not something even more precise like 341/1024? Because the more precise you get, the bigger the numerator gets, and the more you are in danger of saturating your accumulator by making the result too big. So it’s a careful dance of maximizing the precision but taking care to not let the result get “too big” first.

And AFAIK compilers can’t really optimize for that.
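The trade-off above can be sketched in a few lines of Python, assuming an unsigned 16-bit accumulator for the overflow check. The function names are illustrative only.

```python
def div3_approx(x: int, numerator: int = 11, shift: int = 5) -> int:
    """Approximate x / 3 as (x * numerator) / 2**shift, with rounding."""
    rounding = 1 << (shift - 1)          # e.g. 16 when shifting by 5
    return (x * numerator + rounding) >> shift


def max_safe_input(numerator: int, shift: int, acc_bits: int = 16) -> int:
    """Largest x for which x * numerator + rounding still fits the accumulator."""
    acc_max = (1 << acc_bits) - 1
    rounding = 1 << (shift - 1)
    return (acc_max - rounding) // numerator


# 11/32 is coarse: 96 / 3 should be 32.
print(div3_approx(96))                            # 33
# 341/1024 is much closer to one third.
print(div3_approx(96, numerator=341, shift=10))   # 32

# But the better approximation overflows a 16-bit accumulator much sooner.
print(max_safe_input(11, 5))      # 5956
print(max_safe_input(341, 10))    # 190
```

With 11/32 the input can be almost 6000 before the accumulator overflows; with 341/1024 it has to stay under 200. Picking the numerator means knowing the real range of the input, which is the part that is hard to leave to a compiler.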