Main content

## Bitcoin

# Bitcoin: Cryptographic hash functions

## Video transcript

Voiceover: Cryptographic hash functions are basically fundamental building blocks that are used within many
cryptographic algorithms and protocols, and they have a number of very important applications in the context
of information security as a whole. Now, some of the more common algorithms in this category that are known as cryptographic has functions include: things like MD5, and also, it has some predecessors
like MD4 and others, as well as algorithms like SHA-256, and actually, SHA-256 is preceded by other algorithms like SHA-1 and so on, and also there are some algorithms that maybe you've heard of, or maybe that are a bit lesser known, but things like RIPEMD,
and BLAKE, and Skein, and others. Now, cryptographic hash
functions are basically used as critical building
blocks in many applications, and really the first
motivating application, the first historical application of these types of hash functions was in the context of what's known as a digital signature, and digital signatures are
used in so many different cryptographic applications today. They're the cornerstone of many ecommerce protocols. They are used in doing things
like bitcoin generation and so on. Cryptographic has functions are also used in things like message
authentication protocols, in pseudorandom number generation and password security, even encryption to some degree. In fact, aside from their
use in digital signatures, these hash functions are
also used in other places in the bitcoin protocol as well. First of all, let me talk about what a cryptographic hash
function actually is, and of course, as the name implies, the first thing it is, it's a hash function. And by hash function, I basically mean that it will take input. It's a mathematical function, a transformation, if you will, that takes a particular input, and we call this input a message, and the message can be
of arbitrary length, and the hash function basically applies a mathematical transformation, or maybe a set of
mathematical transformations to this input to produce a single output, and we typically call
this output a digest, although, sometimes you will see the output referred to
as a tag, or as a hash, or as a fingerprint, but digest is sort of the
most common nomenclature. In fact, MD5, which was one of the earlier hash functions, stands for message digest 5, and MD4 was message digest 4, and so on, and so forth. Now, the message, as I mentioned briefly can be of arbitrary size. It can be as long as you want, or as short as you want, but the output, the size of the digest
or the size of the tag, is going to be fixed in length, and for example, in the
context of a hash function like, let's say, SHA-256, the digest will actually be
exactly 256 bits in length. It's going to have a fixed-length output, but an arbitrary-length input. The other thing I want to point out about these cryptographic hash functions is that the function here
is a deterministic function, and by that, I mean that the output will always be the same for a given input, so if you have a given input, you're going to see the exact same output. You won't have a
situation in which, maybe, a given input will have two
different possible outputs. It's going to be consistent. Now, traditional hash functions have been used in computer science
for quite some time, and they are used in many computing applications, so, for example, you may
have heard of something like a hash function used in something called a hash table, but the kinds of hash
functions that are used in hash tables, and I want to stress this, they aren't necessarily the same as cryptographic hash functions. The qualifier "cryptographic"
here is very important, and it usually means or implies that that hash function has
to have a certain set of critical design goals and maybe some particular properties in mind that make it suitable for use in applications that use cryptography or cryptographic applications, areas where perhaps security or privacy or confidentiality or authentications are a serious concern. First and foremost, maybe in describing
some of these properties is that, and maybe this
almost goes without saying, one of the first properties you want of a cryptographic hash function is that it should be
computationally efficient, and by that, I mean that it
shouldn't take a lot of time to compute the output from a given input. If you're given a message and you want to apply this
set of transformations to that message to get a digest, that set of transformation
should not take a long time to implement on a computer. It should be very fast,
or relatively fast. It almost goes without saying, but I think it's important
to stress and point it out because I've seen people come up with grossly inefficient
hash functions sometimes, and those would not be considered suitable in the context of when typical cryptographic hash functions are used for cryptographic applications. The second property you typically need in the context, and this is especially in the
context of digital signing, is that you want it to be the case that it's hard to find two
inputs that actually map to the same output, and I mean two distinct inputs whose corresponding digest is identical. This property is typically referred to as collision resistance. That means it's hard to find
a colliding pair of input, so in other words, if you have two inputs and let's say you have a message M1, and you have a message M2, their output under the application of the hash function
should not be the same. You won't ever have it
be the same that the, you won't even have it
be the case, rather, that the output of M1 and M2 under an application of the hash function is the same. They should never be the same. It should always be different. Now, I should take a step back here and point out that of course in practice, given that messages can
be of arbitrary size and given that the input
or the output, rather, is a fixed size, it's not mathematically
possible to guarantee that the output will always be different for two distinct messages, but what you typically want
is not that the outputs are necessarily different, but that it's hard to
find two distinct messages that produce the same output. We know they exist by virtue of the fact that there are many
messages that can be hash, and only a finite, small number, relatively small number compared to the number of messages, a smaller number of
possible digest values, but aside from that consideration, it should be hard. It
should take a long time, and by long, I mean an
astromonically long time to find two distinct
messages that would result in the same output under the application
of the hash function. Now, the third thing
that I want to point out is that in many cases, you might want, also, in the context of a hash function, for the hash function to be able to hide information about the inputs. In other words, given the output, it should be hard to glean anything useful or interesting about the input. Anything, any relevant detail, and I don't just mean the whole input, but maybe even something as simple as was the input an even
number or an odd number? I mean, that should be the kind of thing that should be hard to infer
when you see the output, even something as simple as the, the evenness or oddness of the input. Now, a fourth property I want to point out is that you typically
want the output to be well distributed. In other words, the output should look random. In other words, it should look like a set of coin flips took place, not that there was a predictable way in which the output was created. Really, what that means is that, and you can think of it
maybe more conceptually as it's almost as if you
flipped a bunch of coins to get to the output. It should look just that random. And so what you can maybe think of cryptographic hash functions as, as it's, perhaps, maybe
the mathematical equivalent or mathematical analog of
a meat grinder of sort. It can take inputs and apply these mathematical transformations to them such that the output looks, for all intents and purposes, completely random and unrelated to the original input. I do want to make a few quick remarks about these particular properties, and first of all, they are interrelated. For example, if you have a situation where the outputs, let's say, really appear to bear no
relationship to the input and the outputs also look random, then that will, in fact,
give you to a large degree, a lot of the collision
resistance properties because just the fact that you can't predict the output and the fact that it
hides all this information implies that it's going to be hard to find two inputs that are distinct that map to the same output. And so sometimes, you get one property in exchange for the others. The second thing I want to point out is that typically, these
properties in practice, or maybe in the underlying mathematics, are things that you hope for but you can't always guarantee that they'll always hold, and it may be entirely possible that you could design a hash function that you think is completely
collision resistant, but someone might come
along a year from now and come up with a more
clever way for finding a collision. Maybe they'll find a clever shortcut that does not involve
doing a brute force search of any sort. It turns out that cryptographers, for better or worse currently do not have any mathematical techniques. They haven't developed
techniques for being able to work around some of these limitations. So we often do take it on face value that these schemes are secure based on how long they've been around. Now, I also want to point out, and the last thing I want to point out, is that this treatment that I've given is not meant to be mathematically rigorous by any stretch of the imagination. There are far more
precise ways to describe these underlying properties, but my hope, instead, is that this video gives you,
maybe, a bit of a flavor for what is required of a
cryptographic hash function without maybe getting
bogged down in some of the mathematical minutia and formalism.