Prerequisites to understand this post well: Basic notions in integration, probability, and complex numbers
Recently I came across this surprising trigonometric identity:
Now, pretty much all trigonometric identities feel somewhat surprising, even the simplest ones such as double angle identities and addition of sines. But there are two things that stand out with this one. The first is that this is an infinite identity; while most other trigonometric identities deal with finite products or sums, this one deals with an infinite product of cosines. The second is that this is not a completely trigonometric identity; while the left hand side is indeed made up of only cosines, the right hand side has an additional term of dividing by x.
So how would you go about proving this identity? One possible idea is to use the identity for multiplication of cosines:
However, the more we apply this identity, the number of terms we have to deal with grows exponentially. What this will leave us is a nightmare of sines and cosines which you wouldn’t want to give to your worst enemy (or maybe you would, depending on how sadistic you are)
So just using trigonometric tools won’t get us anywhere, plus it also won’t explain the 1/x term on the right hand side. How then, do we prove this? The answer lies surprisingly (or unsurprisingly if you read the title) in probability
First, a prerequisite; Given a random variable X, we can define a function from the reals to the complex plane as follows:
That is, for each real number θ, we define a new random variable eiXθ and take its expectation (mean). This function of θ will be called the characteristic function of X. As the name implies, this function allows us to completely characterize the distribution of X (under some assumptions). What this means is that for two random variables X and Y, their distributions will be the same if and only if their characteristic functions are the same.
Now to address a certain elephant in the room, the characteristic functions is indeed very reminiscent of the Fourier transform of a function:
So is there a difference between the two? Well, yes and no. If the random variable X has a probability density function p, then the two definitions indeed coincide in the following sense:
But not all random variables have a probability density function, say a random variable which is 1 with probability half and 2 with probability half. For them the regular definition of a Fourier transform does not immediately work (unless you want to define it in terms of Dirac delta functions, a move which will piss off a lot of mathematicians including myself). In a sense, the characteristic functions is a generalization of the Fourier transform, which allows us to deal with not just functions but general random variables.
How then, do we use characteristic functions to solve our problem? Here is one possible approach: we will define two random variables, with the same distribution. The characteristic function of the first will be the right hand side of the equation, while the characteristic function of the second will be the left hand side.
So what random variable do we choose? Let’s experiment, for example we can try a random variable X which is uniform on the interval [a,b]. Then its characteristic function will be:
More specifically, if we pick a=-1 and b=1, we can simplify further:
Which is exactly the right hand side of our identity!
As mentioned earlier, we want the distribution of the two variables to be the same, so our second random variable Y must also have a uniform distribution on [-1,1]. But how else can we describe picking a uniform variable on [-1,1]? Let’s say instead we wanted to pick a uniformly random element of the interval [0,1], as we will see later this will actually be sufficient for us. We can then represent each number in binary (base 2) “decimals” like so:
So for example, in this notation, 1/2 will be 0.1, 3/4 will be 0.11, and 1/3 will be 0.01111…. (before continuing, try to think what other numbers, like 1, 0.4, 0.125 , will look like in this notation)
This immediately gives us an idea for another way to choose a uniformly random element in [0,1]; randomize each digit. That is, take an infinite sequence of random variables y1,y2,y3,…, independent from each other, and each one is 0 with probability 1/2 and 1 with probability 1/2 (you can think as these as infinitely many coin tosses which will decide the number). This will give us a random number in [0,1], given by:
But wait, we wanted a uniformly random number in [0,1], not just “a random number”. Is this X necessarily a uniformly random variable? Turns out that it is, though the formal proof for this is a bit convoluted. Here’s some intuition though: note that if the first digit (after the dot) of X is 0, X must be less than or equal to half, otherwise, X must be greater or equal to half. In other words, whether X will be in [0,1/2] or [1/2,1] solely depends on the first digits, and being in each interval has probability 1/2. A similar thing happens if we look only at the first two digits. If the first two digits are 00, X must be in [0,1/4], if they are 01 X is in [1/4,1/2], if they are 10 X is in [1/2,3/4], and finally if they are 11 X is in [3/4,1]. The probability of X in being in each one of those intervals is 1/4. Looking at more digits and more, the more X is looking uniform on the entirety of [0,1], so it is intuitive that at infinity we would get a uniformly random variable on [0,1].
Before calculating the characteristic function, let’s first move back to a random variable on [-1,1] instead. To do this, we can simply multiply X by 2 and subtract 1:
Now to find the characteristic function of X. First, here is a rather nice property of characteristic functions of two independent random variables Y,Z:
Due to the fact that the expectation of a product of independent variables is the product of the expectations. The same fact as above still holds if we have more than two or even infinitely many random variables. So, applying this to our random variable X, we get the following:
Each element of the product will be (recall that yn is 0 with probability 1/2 and 1 with probability 1/2)
Putting this all together, we get that the characteristic function of X’ is:
But wait, we said that X’ is uniform on [-1,1]! So putting this together with the calculation from before of the characteristic function of a uniform random variable, we get:
And that is exactly the identity we wanted to prove! How neat is that?!