odds of guessing a 4 digit number

PIN analysis

A good friend of mine, Ian, recently forwarded me an internet joke. The headline was something like:

“All credit card PIN numbers in the World leaked”

The body of the message simply said 0000 0001 0002 0003 0004

Ian’s message made me chuckle. Then, later the same day, I read this XKCD cartoon. The merging of these two humorous topics created the seed for this article.

I love Randall’s work. My favorite, to date, is this one. I have a signed copy of it on my office wall.

Like many of his creations, this cartoon is excellent at bifurcating readers; people read it, then either smile and chuckle, or stare blankly at it followed by a “Huh? I don’t get it!” comment. Then you explain it, and get a reply “Yeeaaaaaa…no, I still don’t get it!”

Esoteric humor in action.

You can be cool and buy his signed artwork too.

What is the least common PIN number?

The digits 0-9 can be arranged into 10,000 possible 4-digit PIN codes. Out of these ten thousand codes, which is the least commonly used?

Which of these pin codes is the least predictable?

Which of these pin codes is the most predictable?

If you were given the task of trying to crack a random credit card by repeatedly trying PIN codes, what order should you try guessing to maximize your chances of selecting the correct number in the shortest time?

If you had to make a prediction about what the least commonly used 4-digit PIN is, what would be your guess?

This tangentially relates to the XKCD cartoon. In Randall’s cartoon, the perpetrator’s plan backfired because his selected license plate was so unique that it was very memorable. What is the least memorable license plate? Ask any spy you know (snigger) what the best way to blend into a crowd is. Their answer will be to not stand out, to appear “normal”, and to not be notable in any way.

People are notoriously bad at generating random passwords. I hope this article will scare you into being a little more careful in how you select your next PIN number.

Are you curious about what the least commonly used PIN number might be?

How about the most popular?

DISCLAIMER

This article is not intended to be a hacker bible, or to be used as a utility, resource, or tool to help would-be thieves perform nefarious actions. I will only disclose data sufficient to make my points, and will try to avoid giving specific data outside of the obvious examples. I do not want to be an enabler for script-kiddies. Please do not email me asking for the database I used; if you do, you will be wasting your time as I’m not going to respond. I’m not going to sell, donate or release the source data – don’t ask!

Source

Obviously, I don’t have access to a credit card PIN number database. Instead I’m going to use a proxy. I’m going to use data condensed from released/exposed/discovered password tables and security breaches.

Soap Box – Password Database Exposures

Over the years, there have been numerous password table security breaches: Some very high profile, some low profile, but all embarrassing (and many exceedingly expensive; both in direct fines and indirect loss of business through erosion of trust and reputation).

Fool me once, well, no, even that’s not really acceptable, but fool me twice … I’ll go even further: Any developer who stores the password table of their database in clear text should be so mortified by this lack of security that they should not be sleeping at night until they fix it. Ignoring the fact that you should never have ever coded it this way, you have an obligation to learn from these past breaches.

If you work for a company and know that your customer database is “protected” by such lightweight security then run, don’t walk, to your CEO’s/President’s office, pound on the door and insist (s)he puts out a mandate to fix the matter with extreme prejudice. Don’t leave until you get an affirmative response. Badger, badger then badger them again. Make yourself a proverbial thorn in their side.

I’m not trying to sell my services as a consultant here (though if you are interested, my rates are very reasonable compared to the cost of legal defense, potential FTC sanctions, class action suits, shareholder backlash, fines, loss of reputation and business …) There are plenty of security experts in the industry who can help you (if you need help filtering them and don’t have referrals, someone who has CISSP qualifications is a good place to start).

Bottom line: Security strengthens with layers, and the simple application of encryption on your database table can help protect your customers’ data if this table is exposed. It does not defend against all possible attacks, but it does nothing but good things. What possible reason is there to store things in clear text?

Back to the data

I combined the exposed password databases I’ve encountered and filtered the results to just those rows that are exactly four digits long [0-9]. The output is a database of all the four-digit combinations that people have used as their account passwords.

Given that users have a free choice for their password, if users select a four digit password to their online account, it’s not a stretch to use this as a proxy for four digit PIN codes.
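The filtering step can be sketched in a few lines of Python. This is only an illustration: the helper name four_digit_counts and the sample list are invented here, and none of the real source data is reproduced.

```python
import re
from collections import Counter

# Exactly four ASCII digits and nothing else.
FOUR_DIGITS = re.compile(r"[0-9]{4}")

def four_digit_counts(passwords):
    """Count how often each four-digit password occurs (hypothetical helper)."""
    return Counter(p for p in passwords if FOUR_DIGITS.fullmatch(p))

# Tiny invented sample, not real data.
sample = ["1234", "hunter2", "1234", "0000", "12345", "12a4", "2580"]
counts = four_digit_counts(sample)
print(counts.most_common(2))
```

Using fullmatch rather than search matters here: it rejects longer strings such as 12345 that merely contain four digits.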

The Data

I was able to find almost 3.4 million four digit passwords. Every single one of the 10,000 combinations of digits from 0000 through to 9999 was represented in the dataset.

The most popular password is 1234 …

… it’s staggering how popular this password appears to be. Utterly staggering at the lack of imagination …

… nearly 11% of the 3.4 million passwords are 1234 .

The next most popular 4-digit PIN in use is 1111 with over 6% of passwords being this.

In third place is 0000 with almost 2%.

A table of the top 20 found passwords is shown at the right. A staggering 26.83% of all passwords could be guessed by attempting these 20 combinations!

(Statistically, with 10,000 possible combinations, if passwords were uniformly randomly distributed, we would expect these twenty passwords to account for just 0.2% of the total, not the 26.83% encountered)
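The 0.2% baseline is simply 20 codes out of 10,000. A small sketch of that arithmetic, using only the figures quoted above (the variable names are illustrative):

```python
TOTAL_CODES = 10_000      # 0000 through 9999
TOP_N = 20
OBSERVED_SHARE = 0.2683   # share of the top 20, as quoted in the article

# Share the top 20 would hold if every code were equally likely.
expected_share = TOP_N / TOTAL_CODES
over_representation = OBSERVED_SHARE / expected_share

print(f"expected under uniformity: {expected_share:.1%}")
print(f"observed is roughly {over_representation:.0f}x the uniform expectation")
```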

Looking more closely at the top few records, all the usual suspects are present 1111 2222 3333 … 9999 as well as 1212 and (snigger) 6969 .

It’s not a surprise to see patterns like 1122 and 1313 occurring high up in the list, nor 4321 or 1010 .

2001 makes an appearance at #19. 1984 follows not far behind in position #26, and James Bond fans may be interested to know 0007 is found between the two of them in position #23 (another variant 0070 follows not much further behind at #28).

The first “puzzling” password I encountered was 2580 in position #22. What is the significance of these digits? Why should so many people select this code to make it appear so high up the list?

Then I realized that 2580 is a straight down the middle of a telephone keypad!

(Interestingly, this is very compelling evidence confirming the hypothesis that a 4-digit password list is a great proxy for a PIN number database. If you look at the numeric keypad on a PC keyboard you’ll see that 2580 is slightly more awkward to type on the PC than on a phone because the order of the keys on a keyboard is inverted. Cash machines and other terminals that take credit cards use phone-style numeric pads. It appears that many people have an easy to type/remember PIN number for their credit card and are re-using the same four digits for their online passwords, where the “straight down the middle” mnemonic no longer applies).

(Another fascinating piece of trivia is that people seem to prefer even numbers over odd, and codes like 2468 occur higher than an odd number equivalent, such as 1357 ).

Cumulative Frequency

As noted above, the more popular password selections dominate the frequency tables. The most popular PIN code of 1234 is more popular than the lowest 4,200 codes combined!

That’s right, you might be able to crack over 10% of all codes with one guess! Expanding this, you could get 20% by using just five numbers!

Below is a cumulative frequency graph:

Statistically, one third of all codes can be guessed by trying just 61 distinct combinations!

The 50% cumulative chance threshold is passed at just 426 codes (far fewer than the 5,000 that a uniformly random distribution would predict). Paranoid yet?
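A minimal sketch of how such thresholds can be read off a frequency table: sort the codes by popularity and accumulate until the target share is reached. The helper name codes_needed and the toy counts are assumptions for illustration, not the real dataset.

```python
from collections import Counter

def codes_needed(counts, target_share):
    """Number of most-popular codes that must be tried to cover at least
    target_share of all observations (hypothetical helper)."""
    total = sum(counts.values())
    covered = 0
    for tried, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target_share:
            return tried
    return len(counts)

# Invented counts, chosen only to make the arithmetic easy to follow.
toy = Counter({"1234": 50, "1111": 20, "0000": 10, "2580": 10, "8068": 10})
print(codes_needed(toy, 0.50))   # 1
print(codes_needed(toy, 0.75))   # 3
```

The same loop, run in that order, is also the guessing order that maximizes the chance of an early hit.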

Bottom of the pile

OK, we’ve investigated most frequently used PINS and found they tend to be predictable and easy to remember, let’s turn for a second to the bottom of the pile.

What are the least “interesting” (least used) PINS?

In my dataset the answer is 8068 with just 25 occurrences in 3.4 million (this equates to 0.000744%, far, far fewer than the roughly 340 occurrences, or 0.01%, that a uniform random distribution would predict, and four orders of magnitude behind the most popular choice).

To the right are the twenty least popular 4-digit passwords encountered.

Warning: Now that we’ve learned that, historically, 8068 is (was?) the least commonly used 4-digit PIN, please don’t go out and change yours to this! Hackers can read too! They will also be promoting 8068 up their attempt trees in order to catch people who read this (or similar) articles.

Read about the Nash Equilibrium.

Memorable Years

Many of the high frequency PIN numbers can be interpreted as years, e.g. 1967 1956 1937 … It appears that many people use a year of birth (or possibly an anniversary) as their PIN. This will certainly help them remember their code, but it greatly increases its predictability.

Just look at the stats: Every single 19?? combination can be found in the top fifth of the dataset!

Below is a plot of this in graphical format. In this chart, each yellow line represents a PIN number that starts 19??

If all the passwords were uniformly distributed, there should be no significant difference between the frequency of occurrence of, for instance, 1972 and any other PIN ending in seventy two ??72 . However, as we shall see, this is not the case at all.

1972 occurs in ordinal position #76 (with a frequency 0.099363%). Here’s a histogram for the occurrences of all ??72 probabilities.

You can clearly see the spike at 1972 (with smaller spikes at 7272 and 1472 )

If you calculate the ratio of the peak of 1972 to the average of all the other ??72 PINS you get the ratio of 22:1
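One way to compute that kind of ratio, sketched under the assumption that the counts live in a dictionary keyed by the 4-digit string. The function name year_ratio and the toy counts are made up; the toy numbers are chosen so the answer comes out at 22.

```python
def year_ratio(counts, suffix, prefix="19"):
    """Ratio of prefix+suffix to the mean count of every other ??suffix code."""
    peak = counts.get(prefix + suffix, 0)
    others = [counts.get(f"{p:02d}{suffix}", 0)
              for p in range(100) if f"{p:02d}" != prefix]
    return peak / (sum(others) / len(others))

# Invented counts: every ??72 code seen 10 times, except 1972 seen 220 times.
toy = {f"{p:02d}72": 10 for p in range(100)}
toy["1972"] = 220
print(year_ratio(toy, "72"))   # 22.0
```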

PINS starting with 19?? are much more likely to occur. Of course, it’s not just 1972. Here is a plot of the ratio of 19 to non-19 for all hundred combinations. Along the x-axis are all the combinations of last two digits –XX, and for each of these the ratio of the 19XX to the average of all the other ??XX occurrences has been calculated. Here’s the chart:

It’s a pretty good approximation for a demographic chart! (suggested by the red-dashed trend line) which would probably allow a fair estimation of the ages (years of birth) of the people using the various websites. (Of course, hackers invert this strategy and use the age of a target to try and give information to guess a user’s PIN. Looking at this graph, this might give them up to a 40x advantage!)

Just about all the ratios are above 1.0 . The notable exceptions are ??34 and ??00 (which are easy to explain, since the massive popularity of 1234 and 0000 dwarfs 1934 and 1900 respectively). Similarly 33 44 55 66 … are lower than expected as the quad codes like 3333 mask out even the 1933 boost.

There are also spikes in the graph corresponding to the popular PINS of 1919 1984 and 1999

Patterns in data

I love pretty ways to graphically visualize data. Pictures really do paint thousands of words.

Another interesting way to visualize the PIN data is in this grid plot of the distribution. In this heatmap, the x-axis depicts the left two digits from [00] to [99] and the y-axis depicts the right two digits from [00] to [99] . The bottom left is 0000 and the top right is 9999 .

Color is used to represent frequency. The higher frequency occurrences are yellow to white hot, and the lower frequency occurrences are red, through dark red to black.

Geek Note: The scaling is logarithmic.
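A rough sketch of how a grid like this could be assembled before plotting. The exact scaling used for the original image is not stated beyond "logarithmic", so the log10(count + 1) transform below is an assumption (the +1 just avoids taking the log of zero), and heatmap_grid is a hypothetical helper.

```python
import math

def heatmap_grid(counts):
    """Build a 100x100 grid of log10(count + 1).

    grid[y][x] holds the code whose left two digits are x and whose
    right two digits are y, matching the axes described above.
    """
    grid = [[0.0] * 100 for _ in range(100)]
    for code, n in counts.items():
        x, y = int(code[:2]), int(code[2:])
        grid[y][x] = math.log10(n + 1)
    return grid

# Invented counts for three codes; everything else stays at zero.
toy = {"1234": 999, "0000": 99, "9999": 9}
grid = heatmap_grid(toy)
print(grid[34][12], grid[0][0], grid[99][99])
```

The resulting list of lists can be handed to any plotting library that draws images from a 2D array.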

You could look at this plot all day!

The bright line for the leading diagonal shows the repeated couplets that people love to use for their PIN numbers 0000 0101 0202 … 5454 5555 5656 … 9898 9999 .

Every eleventh dot on the leading diagonal is brighter corresponding to the quad numbers e.g. 4444 5555 . Here is a larger scale version:

Interesting things

There are so many interesting things to learn from this heatmap. Here are just a couple:

The first is the interesting harmonics of shading (seen here more easily in a gray scale plot).

You can make out a “grid pattern” in the plot.

The lighter areas correspond to couplets of numbers that are close to each other. For some reason, people don’t like to select pairs of numbers that have larger numerical gaps between them. Combinations like 45 and 67 occur much more frequently than things like 29 and 37.

Here we see the line corresponding to 19XX . The intensity of the dots relates to the chart we plotted earlier.

There is a strong bias towards the lower left quadrant. People love to start their PIN numbers with 0 , and even more so with the digit 1 .

The chart on the right shows the relative frequency of the first digit of 4-digit pin codes.

As you can see, the digit 1 dominates (and it’s not all down to the 19XX phenomenon.)
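A first-digit breakdown like this is a straightforward aggregation over the frequency table. A minimal sketch, with an invented helper name and toy counts:

```python
from collections import Counter

def first_digit_shares(counts):
    """Share of all observations that start with each digit 0-9."""
    total = sum(counts.values())
    by_digit = Counter()
    for code, n in counts.items():
        by_digit[code[0]] += n
    return {d: by_digit[d] / total for d in "0123456789"}

# Invented counts, not real data.
toy = {"1234": 6, "1972": 2, "0000": 1, "8068": 1}
shares = first_digit_shares(toy)
print(shares["1"])   # 0.8
```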

Little bright specks dot the plot in places corresponding to numerical runs (both ascending and descending) such as 2345 , 4321 and 5678 .

I’ve highlighted just a couple on the plot to the left.

Jumps in steps of two are also visible e.g. 2468

Repeated-pair couplets of numbers are very common, such as XYXY

The hundred sets of repeating couplet pairs represent a staggering 17.8% of all observed PIN numbers.
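For comparison, a hundred codes out of 10,000 would account for only 1% of PINs if choices were uniform. A small sketch of the XYXY test, with illustrative helper names:

```python
def is_repeated_couplet(code):
    """True for XYXY codes such as 1212 or 5555."""
    return len(code) == 4 and code[:2] == code[2:]

def couplet_share(counts):
    """Fraction of all observations that are XYXY codes."""
    total = sum(counts.values())
    hits = sum(n for code, n in counts.items() if is_repeated_couplet(code))
    return hits / total

all_codes = [f"{i:04d}" for i in range(10_000)]
uniform = {code: 1 for code in all_codes}   # every code equally common

print(sum(is_repeated_couplet(c) for c in all_codes))   # 100
print(f"{couplet_share(uniform):.1%}")                  # 1.0%
```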

More than four

The purpose of this posting was to investigate patterns and frequency of four digit PIN numbers. However, the database I collected also has all-numeric passwords of different lengths. It’s worth taking a quick look at these too.

I found close to 7 million all-numeric passwords. Approximately half of these were the four-digit codes we’ve just examined.

Six digit codes are the next most popular length, followed by eight.

I hope, hope that the people who have passwords nine digits long are not using their Social Security Numbers!

Below are the top 20 passwords for the various lengths, along with their share of their same-size namespace.

Some interesting observations (and a little speculation)

For five digit passwords, users appear to have even less imagination in selecting their codes (22.8% select 12345). All the usual suspects occur, but a new arrival is the puerile entry in position #20: the concatenation of 420 and 69.

For six digit passwords, again 696969 appears highly. Also of note is 159753 (an “X” mark over the numeric keypad). James Bond returns with 007007.

For seven digits, the standby of 1234567 is a much lower frequency (though still the top). I speculate that this is because many people may be using their telephone number (without area code) as a seven digit password. Telephone numbers are fairly distinct, and already memorized, so when a seven digit code is needed, they spring to mind easily. The higher frequency of usage of telephone numbers reduces the need to use imagination (or lack thereof) and select something else.

Is Jenny there? The fourth most popular seven digit password is 8675309 (It’s a popular 80’s song).

Eight digit passwords are just as expected. Lots of patterns, and lots of repetition.

Common nine digit passwords also follow patterns and repetition. 789456123 appears as an easy “Along the top, middle and bottom of the keypad”. 147258369 is related in the vertical direction (and other variants appear high up). Again we get a 420 moment with 420420420, and also the shaken, not stirred, but repeated 007007007 returns.

Interestingly for ten digits 1029384756 appears (alternating ascending/descending digits), as well as the odd/even 1357924680.

Hurrah for math! In position #17 of the ten digit password list we get 3141592654 (The first few digits of Pi)

Conclusions

If you are a developer, tester or executive, I hope you are sufficiently paranoid that you will immediately check to see that your systems do not store sensitive information, like passwords, unencrypted. The entire reason I was able to perform this analysis is because dumb, stupid and lazy coders stored information in clear text. Your laziness has the potential to impact millions.

If you are a consumer and you recognize any of the numbers I’ve used in this article to be your passwords/pins I hope you apply common sense and immediately change them to something a little less predictable. Alternatively, you could be lazy and not change things (In that case, at least the only person you are harming with this apathy is yourself.)

Updates

Since publishing this article, it’s been brought to my attention that, of course, in addition to anniversary years, many people encapsulate dates in the format MMDD (such as birthdays …) for their PIN codes.

This clearly explains the lower left corner where, if you look at the heatmap, there is a huge contrast change at the height of around 30-31 (the number of days in a month), extending to 12 on the x-axis. (Thanks to zero79 for first pointing this out).
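To get a feel for how much of the code space the MMDD pattern covers, here is a small sketch that counts the codes readable as a valid month and day. The helper is_mmdd is hypothetical, and the choice to accept February 29 is an assumption.

```python
import calendar

def is_mmdd(code):
    """True if the code reads as a valid month/day (leap day allowed)."""
    month, day = int(code[:2]), int(code[2:])
    if not 1 <= month <= 12:
        return False
    # 2000 is a leap year, so February 29 is accepted.
    return 1 <= day <= calendar.monthrange(2000, month)[1]

mmdd_codes = [f"{i:04d}" for i in range(10_000) if is_mmdd(f"{i:04d}")]
print(len(mmdd_codes))   # 366
```

Those 366 codes sit exactly in the lower left block of the heatmap: left two digits up to 12, right two digits up to 31.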

Many people also asked the significance of 1004 in the four character PIN table. This comes from Korean speakers. When spoken, “1004” is cheonsa (cheon = 1000, sa=4).

“Cheonsa” also happens to be the Korean word for Angel.

Another XKCD cartoon

It only seems appropriate to end with another XKCD cartoon. This one is Password Strength


Probability of Finding Strings In Pi

On the main Pi page, I mentioned that if we view Pi as a big, random string of numbers (which is close enough for our purposes), then we can figure out the odds of finding any string in the first 100 million digits of Pi:

Number Length    Chance of Finding
1-5              100%
6                Nearly 100%
7                99.995%
8                63%
9                9.5%
10               0.995%
11               0.09995%

Where do these numbers come from, and how can you compute them?

Let’s say you’re searching for a single digit in Pi, and pretend again that Pi is random. If you pick a number between 0 and 9 at random, the chance that it’s equal to your search digit is 1 in 10, (10%, or 0.1).

That’s pretty simple, but what happens if you want to search for a two digit string? Well, you can approximate this by picking two numbers. If the first doesn’t match, then it’s over. But if the first does match, you have to try to match the second, too. Each of these has a probability of 0.1, and we’ll assume that the numbers are completely independent. So 10% of the time the first number matches, and 10% of 10% of the time, both numbers match, which is just 1%, or a probability of 0.01. We’d have a 1 in 1000 chance (0.001) of finding a three digit search string, and so on.

If we assume that Pi is random, the above formula gives us the chance that any particular position matches. So for a two digit search string, there’s a 1% chance that it matches at position 1, a 1% chance that it matches at position 2, and so on. So the chance of finding the search string at all is equal to the chance of finding it at any of those positions. How do we figure that out?

Let’s turn the problem on its head. The chance of finding it is simply the opposite of the chance of not finding the search string. “Well, duh, Dave, that’s obvious!” you may say. But wait. We already figured out this kind of chance earlier. How do we not find something? Well, we first have to not find it at position 1, and then not find it at position 2, ... and keep on going all the way to the end of our digits. This is just like what we did earlier to figure out the chance of one position matching!

If we have a 10% chance of matching at any position, then we have a 90% chance of not matching. So the odds of not matching the entire string of pi is equal to 90% of 90% of 90% of ... and so on, for each digit of Pi that we have. Mathematically, this would be 0.9 to the power of “N” ($0.9^N$) if we have N digits. And then the odds of finding the string would just be $1 - 0.9^N$.

Putting that all together, we know that the chance of finding a search string at any one position is $0.1^d$, where “d” is the length of our search string. So the entire probability is $1 - (1 - 0.1^d)^N$.
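This formula can be evaluated directly. The sketch below uses log1p and expm1 because raising a number very close to 1 to the power of 100 million loses precision if done naively; that numerical trick and the function name chance_of_finding are choices made here, not part of the original explanation. It also follows the text in treating every one of the N positions as an independent trial.

```python
import math

def chance_of_finding(d, n_digits=100_000_000):
    """Evaluate 1 - (1 - 0.1**d) ** N in a numerically stable way."""
    p = 0.1 ** d                       # chance of a match at one position
    # (1 - p) ** N == exp(N * log(1 - p)); expm1 keeps small results accurate.
    return -math.expm1(n_digits * math.log1p(-p))

for d in range(7, 12):
    print(d, f"{chance_of_finding(d):.5%}")
```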

Continuing along the mathematical path, it turns out that we’ve accidentally stumbled into something called binomial probabilities. Binomials come about when you ask “what are the odds of getting some number k of heads out of n flips of a coin.” Just to make things tricky, let’s let the coin be biased in some way – it gets “heads” with probability p (that is, if p = 0.6, then 60% of the time, the coin lands heads).

Luckily for us, asking about zero occurrences of heads is easy, as the formula above showed. But we could ask other questions, like “what are the odds of finding my birthday twice in the first 100,000,000 digits of Pi?” These questions are harder (computationally) to answer than the zero case, because we have lots of different ways to find your birthday twice. We could find it once at position 1 and once at position 2, or once at position 1 and once at position three, and so on. Even very fast computers start to choke when the numbers get big. And then we could make it even worse — what if we want to know how likely it is to find your birthday at least 100 times in Pi?

The solution to this problem is to use what’s known as the Poisson approximation to the binomial, when the numbers are large. We can actually approximate the above formula as:

Odds(finding string of length d in N digits of pi) = $1 - 1/e^{N \cdot 0.1^d}$.

That looks a little complicated, until we realize that $0.1^d$ is really just one divided by the number of search strings that have d digits. So if d is three, there are 1000 strings (0, 1, 2, ..., 999). So $0.1^3 = 1/1000 = 0.001$. And N is just the number of digits of pi. Ah-ha! So what this really means is that we can calculate the odds simply as $1 - 1/e^{(\text{digits of pi} / \text{possible searches})}$. So if we have 100,000,000 digits of pi, and we can search for 100,000,000 possible strings (8 digit search strings), then our probability is simply $1 - 1/e$. With twice as many digits as search strings, the probability becomes $1 - 1/e^2$. And so on.
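The same Poisson model also gives a cheap estimate for the "at least twice" style of question raised earlier: subtract the probabilities of seeing the string 0, 1, ..., k-1 times. A minimal sketch, with a hypothetical function name and the 100,000,000-digit default taken from the text:

```python
import math

def poisson_at_least(k, d, n_digits=100_000_000):
    """Poisson estimate of finding a d-digit string at least k times."""
    lam = n_digits * 0.1 ** d          # expected number of matches
    # P(X >= k) = 1 - sum over i < k of e^(-lam) * lam^i / i!
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                for i in range(k))
    return 1.0 - below

print(f"{poisson_at_least(1, 8):.4f}")   # at least once: 1 - 1/e
print(f"{poisson_at_least(2, 8):.4f}")   # at least twice: 1 - 2/e
```

With k = 1 this reduces to the formula above, which is a handy sanity check.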

At this point, other people have explained the math far better than I, so I leave you to the good graces of the Internet.

Thanks to Evan Romer for the suggestion to add this explanation.
