Given discrete random variables $X_1, \dots, X_n$ that take $\alpha_1, \dots, \alpha_n$ values, respectively, the number of parameters needed to represent their joint distribution is $\prod_{i=1}^{n} \alpha_i - 1$.
Information theory deals with the efficient coding and transmission of information. A motivating example: transmit a large corpus of, say, English text over a digital line.
$ P(X) $: distribution over a random variable $ X $.
All logarithms are base 2, so information is measured in bits. The entropy of $ X $ is $$ \boxed{\mathbb{H}_P (X) = \mathbb{E}_P \left[ \log{\frac{1}{P(X)}} \right] = \sum_x P(x) \log{\frac{1}{P(x)}}} $$
The entropy of $ X $ is a lower bound on the average number of bits required to encode values of $ X $.
Another interpretation: entropy is a measure of our uncertainty about the value of $ X $.
Proposition: $$ 0 \leq \mathbb{H}_P (X) \leq \log{|Val(X)|} $$ The lower bound is attained when $ X $ is deterministic, and the upper bound when $ X $ is uniform over $ Val(X) $.
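As a sanity check, here is a minimal NumPy sketch (the `entropy` helper and the example distributions are illustrative choices, not from the notes) that computes $ \mathbb{H}_P(X) $ in bits and hits both ends of the bound:

```python
import numpy as np

def entropy(p):
    """Entropy in bits of a discrete distribution given as a probability vector."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p * np.log2(1.0 / p)))

p_uniform = [0.25, 0.25, 0.25, 0.25]
p_skewed  = [0.70, 0.20, 0.05, 0.05]
p_point   = [1.00, 0.00, 0.00, 0.00]

print(entropy(p_uniform))   # 2.0 = log2(|Val(X)|): the upper bound
print(entropy(p_skewed))    # ~1.26: somewhere in between
print(entropy(p_point))     # 0.0: the lower bound (no uncertainty)
```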
$$ \boxed{\mathbb{H}_P (X_1, \dots, X_n) = \mathbb{E}_P \left[ \log{\frac{1}{P(X_1, \dots, X_n)}} \right]} $$
The joint entropy captures how many bits are needed (on average) to encode joint instances of the variables.
The additional cost (in bits) of encoding $ X $ when we have already encoded $ Y $ is the conditional entropy: $$ \begin{align*} P(X, Y) = P(Y) \cdot P(X \mid Y) &\Leftrightarrow \log{\frac{1}{P(X, Y)}} = \log{\frac{1}{P(Y)}} + \log{\frac{1}{P( X \mid Y )}} \\ &\Leftrightarrow \mathbb{H}_P (X, Y) = \mathbb{E} \left[ \log{\frac{1}{P(Y)}} \right] + \mathbb{E} \left[ \log{\frac{1}{P( X \mid Y )}} \right] \\ &\Leftrightarrow \boxed{ \mathbb{H} ( X \mid Y ) = \mathbb{E} \left[ \log{\frac{1}{P( X \mid Y )}} \right] = \mathbb{H} (X, Y) - \mathbb{H} (Y) } \end{align*} $$
Applying this repeatedly gives the chain rule for entropy: $$ \boxed{ \mathbb{H} ( X_1, \dots, X_n) = \mathbb{H}(X_1) + \mathbb{H} (X_2 \mid X_1) + \dots + \mathbb{H} (X_n \mid X_1, \dots, X_{n - 1}) } $$
Intuitively, we would expect $ \mathbb{H} (X \mid Y) $ to be no larger than the cost of encoding $ X $ alone: $$ \mathbb{H} (X \mid Y) \leq \mathbb{H} (X) $$
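A small numerical sketch of the two-variable case, assuming a hypothetical $ 2 \times 2 $ joint table `P_xy` (my example, not from the notes): it computes $ \mathbb{H}(X \mid Y) $ directly, checks that it equals $ \mathbb{H}(X, Y) - \mathbb{H}(Y) $, and verifies the chain rule and the inequality above:

```python
import numpy as np

def H(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# Hypothetical joint distribution P(X, Y): rows index x, columns index y.
P_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
P_x = P_xy.sum(axis=1)                   # marginal P(X)
P_y = P_xy.sum(axis=0)                   # marginal P(Y)

# H(X | Y) computed directly as E[log 1/P(X | Y)] ...
P_x_given_y = P_xy / P_y                 # column j holds P(X | Y = j)
H_x_given_y = float(np.sum(P_xy * np.log2(1.0 / P_x_given_y)))
# ... agrees with H(X, Y) - H(Y):
assert np.isclose(H_x_given_y, H(P_xy) - H(P_y))

# Chain rule: H(X, Y) = H(X) + H(Y | X).
P_y_given_x = P_xy / P_x[:, None]        # row i holds P(Y | X = i)
H_y_given_x = float(np.sum(P_xy * np.log2(1.0 / P_y_given_x)))
assert np.isclose(H(P_xy), H(P_x) + H_y_given_x)

# Conditioning never increases entropy: H(X | Y) <= H(X).
assert H_x_given_y <= H(P_x) + 1e-12
```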
The mutual information between $ X $ and $ Y $ is $$ \boxed{ \mathbb{I}(X ; Y) = \mathbb{H}(X) - \mathbb{H}(X \mid Y) = \mathbb{E} \left[ \log{\frac{P(X \mid Y)}{P(X)}} \right] } $$
Captures how many bits we save (on average) in the encoding of $ X $ given $ Y $. The mutual information satisfies $$ 0 \leq \mathbb{I}(X ; Y) \leq \mathbb{H}(X), $$ with $ \mathbb{I}(X ; Y) = 0 $ if and only if $ X $ and $ Y $ are independent.
$ \rightarrow $ view mutual information as a quantitative measure of the strength of the dependency between $ X $ and $ Y $. The bigger the mutual information, the stronger the dependency.
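A sketch of mutual information as a dependence measure, using the identity $ \mathbb{I}(X;Y) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) $ (equivalent to $ \mathbb{H}(X) - \mathbb{H}(X \mid Y) $ by the decomposition above); the joint tables are illustrative:

```python
import numpy as np

def H(p):
    """Entropy in bits of a distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

def mutual_information(P_xy):
    """I(X; Y) in bits from a joint table (rows: x, columns: y)."""
    P_xy = np.asarray(P_xy, dtype=float)
    # I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X | Y)
    return H(P_xy.sum(axis=1)) + H(P_xy.sum(axis=0)) - H(P_xy)

P_dependent = np.array([[0.30, 0.10],
                        [0.20, 0.40]])
P_independent = np.outer([0.6, 0.4], [0.5, 0.5])    # P(X, Y) = P(X) P(Y)

print(mutual_information(P_dependent))      # > 0: knowing Y saves bits about X
print(mutual_information(P_independent))    # ~0: independence means no savings
# Symmetry: I(X; Y) = I(Y; X).
assert np.isclose(mutual_information(P_dependent), mutual_information(P_dependent.T))
```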
We would like a distance measure $ d $ between distributions that satisfies: positivity ($ d(P, Q) \geq 0 $, with $ d(P, Q) = 0 $ if and only if $ P = Q $), symmetry ($ d(P, Q) = d(Q, P) $), and the triangle inequality ($ d(P, R) \leq d(P, Q) + d(Q, R) $).
However, we often do not know the true distribution $ P $, so we use an approximation $ Q $ instead. Question: how much do we lose due to the inaccuracy of using $ Q $?
The relative entropy of $ P $ and $ Q $ is
$$ \boxed{ \mathbb{D}(P(X_1, \dots, X_n) || Q(X_1, \dots, X_n)) = \mathbb{E}_P \left[ \log{\frac{P(X_1, \dots, X_n)}{Q(X_1, \dots, X_n)}} \right] } $$
Shorthand notation: $$ \mathbb{D}(P || Q) $$
Properties: $ \mathbb{D}(P || Q) \geq 0 $, with equality if and only if $ P = Q $.
Relative entropy is not a distance measure over distributions, since it does not satisfy symmetry ($ \mathbb{D}(P || Q) \neq \mathbb{D}(Q || P) $ in general) or the triangle inequality.
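A sketch of relative entropy on two hypothetical distributions, showing $ \mathbb{D}(P || P) = 0 $, $ \mathbb{D}(P || Q) > 0 $, and the asymmetry that rules it out as a distance:

```python
import numpy as np

def kl(p, q):
    """Relative entropy D(P || Q) in bits; assumes Q(x) > 0 wherever P(x) > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with P(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

P = np.array([0.9, 0.1])
Q = np.array([0.5, 0.5])

print(kl(P, P))     # 0.0: no loss when encoding with the true distribution
print(kl(P, Q))     # ~0.53 bits paid on average for using Q instead of P
print(kl(Q, P))     # ~0.74 bits: differs from D(P || Q), so no symmetry
```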
Maximization/minimization problems:
Example: given data points $ (x[i], y[i]) $, find the point $ (\theta_x, \theta_y) $ closest to them in terms of the sum of squared distances. We can formulate this as a maximization problem by taking the negative of the sum of squared distances as the objective: $$ f_{\text{obj}} (\theta_x, \theta_y) = - \sum_i \left( (x[i] - \theta_x)^2 + (y[i] - \theta_y)^2 \right) $$
At a maximum, the gradient of the function $ f_{\text{obj}} (\theta_1, \dots, \theta_n) $ (the vector of its partial derivatives) is $ \mathbf{0} $: $$ \nabla f = \left\langle \frac{\partial f}{\partial \theta_1}, \ldots, \frac{\partial f}{\partial \theta_n} \right\rangle = \mathbf{0} $$
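A sketch of this optimality condition on randomly generated points (the data and the closed-form gradient are illustrative, derived from $ f_{\text{obj}} $ above): the gradient vanishes exactly at the coordinate-wise means, which is the maximizer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(3.0, 1.0, size=100)    # hypothetical 2-D data, not from the notes
y = rng.normal(-1.0, 2.0, size=100)

def grad_f_obj(theta_x, theta_y):
    """Gradient of f_obj(tx, ty) = -sum_i((x[i] - tx)^2 + (y[i] - ty)^2)."""
    # d f / d theta_x = 2 * sum_i (x[i] - theta_x), and similarly for theta_y.
    return np.array([2.0 * np.sum(x - theta_x), 2.0 * np.sum(y - theta_y)])

theta_star = np.array([x.mean(), y.mean()])   # setting the gradient to 0 gives the means
print(grad_f_obj(*theta_star))                # ~[0, 0]: gradient vanishes at the maximum
print(grad_f_obj(0.0, 0.0))                   # away from the means, the gradient is not 0
```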
Optimizing a continuous function over its entire domain, under a set of equality constraints.
$$ \begin{align*} \text{Find } \theta \text{ maximizing } \quad & f(\theta) \\ \text{subject to } \quad & c_1 (\theta) = 0 \\ & \;\;\vdots \\ & c_m (\theta) = 0 \end{align*} $$
Rephrase each equality constraint as constraining a function $ c_j $ to 0; the feasible set is $$ \mathcal{C} = \{ \theta : \forall j = 1, \dots, m, \; c_j (\theta) = 0 \} $$
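A sketch of this template using SciPy's SLSQP solver; the concrete objective (maximizing the entropy of a 3-valued distribution subject to $ \sum_i \theta_i = 1 $) is an illustrative choice, not from the notes:

```python
import numpy as np
from scipy.optimize import minimize

def f(theta):
    """Objective: entropy (in bits) of a 3-valued distribution theta."""
    return float(np.sum(theta * np.log2(1.0 / theta)))

result = minimize(
    lambda theta: -f(theta),                   # maximize f  <=>  minimize -f
    x0=np.array([0.6, 0.3, 0.1]),              # a feasible starting point
    method="SLSQP",
    bounds=[(1e-9, 1.0)] * 3,                  # keep theta_i > 0 so the log is defined
    constraints=[{"type": "eq",                # c_1(theta) = sum(theta) - 1 = 0
                  "fun": lambda theta: np.sum(theta) - 1.0}],
)
print(result.x)   # ~[1/3, 1/3, 1/3]: the uniform distribution maximizes entropy
```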