AP®︎/College Computer Science Principles
Number limits, overflow, and roundoff
When computer programs store numbers in variables, the computer needs to find a way to represent that number in computer memory. Computers use different strategies based on whether a number is an integer or not. Due to limitations in computer memory, programs sometimes encounter issues with roundoff, overflow, or precision of numeric variables.
An integer is any number that can be written without a fractional component. The same term is used in both programming and in math, so hopefully it's familiar to you.
All of these numbers are integers: , , , .
How can a programming language represent those integers in computer memory? Well, computers represent all data with bits, so we know that ultimately, each of those numbers is a sequence of 0s and 1s.
To start simple, let's imagine a computer that uses only 4 bits to represent integers. It can use the first bit to represent the sign of the integer, positive or negative, and the other 3 bits for the absolute value.
In that system, the number 1 would be represented like this:
The in the sign bit represents a positive number, and the in the right most bit represents the () place of the value.
What's the largest number this system could represent? Let's fill all the value bits with and see:
That's the positive number , since .
Check your understanding
Consider a computer that uses 6 bits to represent integers: 1 bit for the sign and 5 bits for the actual number. What's the largest positive integer it can represent?
What would happen if we ran a program like this on the 4-bit computer, where the largest positive integer is 7?
var x = 7; var y = x + 1;
The computer can store the variable
xjust fine, but
yis one greater than the largest integer it can represent with the 4 bits. In a case like this, the computer might report an "overflow error" or display a message like "number is too large". It might also truncate the number (capping all results to 7) or wrap the number around (so that 8 becomes 1).
We don't want to end up in any of those situations, so it's important we know the limitations of our language and environment when writing programs.
Check your understanding
On a computer which uses 6 bits to represent integers (with 1 bit to represent the sign), which of these operations result in overflow?
We've seen there are limitations to storing integers in a computer. Numbers that aren't integers, like fractions and irrational numbers, are even trickier to represent in computer memory.
Consider numbers like , , , or the famously never-ending .
Computer languages typically use floating-point representation for non-integers (and sometimes integers, too). It's similar to "scientific notation", a representation you might know from other studies.
In floating-point representation, a number is multiplied by a base that's raised to an exponent:
Since computers use the binary system instead of the decimal system, the base for floating-point numbers is instead of . Because of that, numbers that are exactly powers of are the simplest to represent:
Numbers between powers of 2 look like this:
What about non-integers? Once again, powers of are the simplest to represent.
Floating-point can also represent fractions between powers of 2:
Once the computer determines the floating point representation for a number, it stores that in bits. Modern computers use a 64-bit system that uses 1 bit for the sign, 11 bits for the exponent, and 52 bits for the number in front.
Here's in that binary floating-point representation:
The exact translation to bits is more complicated than we can go into here, but is a great topic for those of you who want to dive deeper.
Floating-point representation still can't fully represent all numbers, however. Consider the fraction and its floating point representation:
In binary, is still an infinitely repeating sequence:
We can't store an infinite sequence in a computer! At some point, the computer has to end the number somehow, either by chopping it off or rounding to the nearest floating point number. Computers have to do that fairly often, as even fractions like (which is a short in decimal) end up as infinitely repeating sequences once converted to binary.
We often don't notice the lower precision of a number's representation until we use it in calculations. That's when we can experience a roundoff error in the results.
✏️ Here's a program that attempts to add . In the non-computer world, we know that's . But in the computer, each of the values is stored as a rounded-off binary fraction, and when they're added together, they don't quite equal what we expect…
The more bits we can use, the more precise our numbers and calculations will be. Modern 64-bit systems offer a high enough precision for low-stakes calculations.
Perhaps at some point in your life, you'll find yourself writing programs that calculate voting outcomes, power a self-driving car, or even launch a rocket. When the stakes are high, precision matters.
🙋🏽🙋🏻♀️🙋🏿♂️Do you have any questions about this topic? We'd love to answer— just ask in the questions area below!
Want to join the conversation?
- 'Floating-point can also represent fractions between powers of 2:
But wait a minute. 1.5 is also a floating-point. So how you represent that? This example does not explain much.(36 votes)
- From the author:Apologies for the confusing explanation. Let's step through how the computer actually represents that number in floating point representation and see if that helps.
Let's look at 0.750:
The number 1.5 is called either the "mantissa" or the "significand". The number -1 is called the "exponent" (as per normal math term).
The floating point representation of 0.750 in binary needs to include the sign (positive/negative), the mantissa, and the exponent.
Here's the binary:
The first bit represents the sign, where 0 is positive.
The next 11 bits represents the exponent -1:
That's the decimal number 1022. According to the floating point standard, the exponent is calculated by subtracting 1023 from that value. 1022-1023 is -1, which is indeed the exponent.
The final 52 bits represent the mantissa 1.5:
That's a 1 followed by 51 zeros. According to the floating point standard, the first bit represents 1/2 (0.5), the second bit represents 1/4, the third bit represents 1/8, etc. The goal is for those bits to be able to represent values between 0 and 1, which then is considered a mantissa between 1 and 2. In this case, there is a 1 in the first bit, so this mantissa is 1.5.
The key thing here is that the mantissa only ever needs to be between 1.0 and 2.0 (excluding 2). If it goes above that, then the exponent can be increased instead.
I find it helpful to see what happens when you change bits in a representation. You can do that at this tool:
It currently represents 0.75 in 64-bit floating point. Try clicking the 1s or 0s to see what happens when they take on different values.
I also recommend this explanation of floating point representation:
- Why is a floating point called that way? What is floating about it?(10 votes)
- From the author:Great question. The floating part is the decimal (between the whole part and the fractional part), as floating point representation can both represent very large numbers with a lot of digits before the decimal (like 1292929.1) and very small numbers with a lot of digits after the decimal (like 1.29292929). Floating point representation can use its 52 bits to represent both the digits in the whole part and the digits in the fractional part. A contrasting type of representation is "fixed point", which always uses a certain number of bits to represent the whole part and a certain number of bits to represent the fractional part.
Here's a nice longer explanation: https://stackoverflow.com/questions/7524838/fixed-point-vs-floating-point-number(24 votes)
- are there some other programming scenarios when 2^1024 = infinity?(2 votes)
- I was experimenting with this in Swift using doubles and floats. I wrote:
var result2: Double = 0.1 + 0.1 + 0.1
var result3: Float = 0.1 + 0.1 + 0.1
The double returned 0.30000000000000004 and the float returned 0.3. In Swift Double represents a 64-bit floating point number and Float represents a 32-bit floating point number. So it did that because a float is... i guess... Less specific?(6 votes)
- Funnily enough, yes.
A number like 0.1 isn't easy store, because it's difficult to turn into a power of two. You see when you tell your computer to work with the number 0.1, the number has to be converted.
Conversion (as a human mind you!) would work something like (we multiply by two, cut off the 1 if it pops up and don't stop until we hit 0 or we're tired of the algorithm)
0.1 * 2 = 0.2 | 0
0.2 * 2 = 0.4 | 0
0.4 * 2 = 0.8 | 0
0.8 * 2 = 1.6 | 1
0.6 * 2 = 1.2 | 1
0.2 * 2 = 0.4 | 0
at this point we're starting a loop :(
so the decimal 0.1 converts to the binary 0.00011001100110011 .... (the string 0011 will just keep repeating making the result more precise but never actually reaching 0.1)
A double can store twice the amount of information of a floating point, so it can happen that this extra space is used to "fix" rounding issues, by being more precise.(11 votes)
- I realized there is a pattern in binary numbers
the last row goes 010101010101010......
the second last row goes 00110011001100110011........
the third last row goes 000011110000111100001111.......
the fourth last row goes 000000001111111100000000111111110000000011111111.....
the fifth last row goes 00000000000000001111111111111111000000000000000011111111111111110000000000000000111111111111111100000000000000001111111111111111......
and so on(8 votes)
- Nice observation!
If you look at the period between changing phases (1 -> 0 or 0 -> 1), you'll see you get exactly 2^i which corresponds to the ith row's contribution to the binary representation.
For instance, the third row changes every 2^2 = 4 cycles. Hence, the 3rd position (starting from 0) contributes 4 to the binary representation.(7 votes)
- Why do modern computers only use 64 bits? Can't they use as many bits as they want? Since more bits give more precise calculations.(3 votes)
This subject gets pretty deep, but it boils down to bit and computer architecture. 64 bits is 8^2 so 8 bytes, and as we learned a few lessons ago computers these days only work in chunks of bytes. Then most computers don't need more then 64 bits because it would be overkill unless we were in a professional setting where calculations needed to be exact.
Thanks for your question!(13 votes)
- I didn't understand how floating point representation number like 0.375 in wrtten in binary.It's getting very confusing for me!
Somebody help me out..(6 votes)
- There are 3 parts of a floating-point representation (using the IEEE-754 standard). The first bit is used to determine the sign of the number, 0 is positive and 1 is negative. The next section is the exponent of the number represented in scientific form with an added bias. The final section of the representation is the mantissa of the number in scientific form after dropping the leading 1.
Suppose we want to convert 0.375 to its floating-point representation.
0.375 = 0.011 = 1.1 * 2^(-2)
sign bit = 0 since 0.375 is positive
exponent = bias + original exponent = 1023 + (-2) = 1021 = 01111111101
mantissa = number after the leading 1 = 1000000000000000000000000000000000000000000000000000
floating-point representation = sign bit, exponent, mantissa = 0011111111011000000000000000000000000000000000000000000000000000
(In this example, the floating-point representation is in double precision format.)(7 votes)
- shouldn't max value be 32 because 0 is redundant? If there are 5 bits to store a value plus one more to store a sign, that would mean there would be both a +0 and a -0. If we shifted the positive binary values so that zero isnt redundant (0 00001 = -1, 0 00000 = 0, 1 00000 = 1, 1 00001 = 2, etc.), then wouldnt the possible range of integers be from -31 to 32?(2 votes)
- Nice observation as this is what has been done in practice! However, one small modification is that a 6-bit number often ranges from -32 to 31. This is because negative numbers have a 1 as the sign bit instead of 0 in your example. One reason to use this convention is to keep consistency.
Adding 1 should increasing the value of a number.
In your scheme, if I do
(-1) + 1 --> 0 00001 + 1
I should somehow get 0 00000 out of it.
A more natural way is to use a 1 as the sign bit so (-1) becomes
1 11111 + 1 --> 0 00000
Hope this helps!(5 votes)
- Why Doesn't an 8-bit calculator experience round off error to the same degree as a 64-bit computer might? If I'm understanding, any number higher than 127 should cause it to overflow. What am I missing?(3 votes)
- Your computer being a 64-bit machine doesn't really have anything to do with the number format, rather it describes its ability to store computational values. So a 64-bit system is more powerful than a 32-bit or 8-bit system because it can work with more values at once.
Numbers are stored at a particular location and can take up space independent from your computational power, the same way a computer game can take up 60gb even though you're running a 64bit system.
Roundoff errors depend on how the system implements numbers, so if a system gives a floating-point number 32 bit it'll be less precise if a system stores the number in a 64bit space.
You can learn more about floating-point numbers by searching for IEEE 754 on Wikipedia.(3 votes)
- I didn't understand the bar over the 3 what was that for ?(2 votes)
- The bar is sometimes used in mathematics to show that the numbers below the bar repeat periodically.
So in this case it means the 3 keeps repeating for ever.(4 votes)