Entropy
Prelude: To make sure you won't get lost in the notation, please read through the notation page for a quick refresher on the notation and conventions used in probability theory.
Definition
Entropy is a measure of the uncertainty of a random variable. The entropy of a discrete random variable $X$ with PMF $p(x)$ is:

$$H(X) = -\sum_{x} p(x) \log p(x)$$
As for a continuous random variable $X$ with density $f(x)$, the (differential) entropy is defined as:

$$h(X) = -\int f(x) \log f(x) \, dx$$
- If the base of the logarithm is 2, then the unit of entropy is the bit.
- If the base of the logarithm is $e$, then the unit of entropy is the nat.
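To make the definition and the units concrete, here is a minimal Python sketch (not part of the original notes) of the discrete entropy formula with a configurable logarithm base:

```python
import math

def entropy(pmf, base=2):
    """Entropy of a discrete distribution given as a list of probabilities.

    base=2 gives the result in bits, base=math.e gives nats.
    """
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([0.5, 0.5], math.e))  # same distribution: ~0.693 nats
```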
Let's say random variable $X$ is the number of heads when flipping a fair coin:

$$H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \text{ bit}$$

Random variable $Y$ is the outcome when rolling a fair six-sided die:

$$H(Y) = -\sum_{i=1}^{6} \tfrac{1}{6}\log_2\tfrac{1}{6} = \log_2 6 \approx 2.585 \text{ bits}$$
Going back to the thumbtack example (the thumbtack has different probabilities of landing on head or tail), let's denote $P(\text{head}) = \theta$ and consider several configurations of $\theta$.

For the task of predicting the outcome, which configuration of $\theta$ is the hardest?

When $\theta = 0.5$, the entropy

$$H(X) = -\theta \log_2 \theta - (1-\theta)\log_2(1-\theta)$$

is the highest (1 bit), meaning the outcome is the hardest to predict.
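A quick numeric sketch of the thumbtack example; the particular $\theta$ values below are illustrative, not necessarily the configurations from the original table:

```python
import math

def bernoulli_entropy(theta):
    """Entropy (in bits) of a thumbtack that lands on head with probability theta."""
    if theta in (0.0, 1.0):
        return 0.0
    return -(theta * math.log2(theta) + (1 - theta) * math.log2(1 - theta))

for theta in (0.1, 0.3, 0.5, 0.7, 0.9):  # illustrative configurations of theta
    print(f"theta={theta}: H = {bernoulli_entropy(theta):.3f} bits")
# The entropy peaks at theta = 0.5 (1 bit), the hardest case to predict.
```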
Cross Entropy
Given two probability distributions (or PMFs) $p$ and $q$ over the same random variable $X$, the cross entropy from $p$ to $q$ is defined as:

$$H(p, q) = -\sum_{x} p(x) \log q(x)$$
Warning
The notation of cross entropy is a bit confusing: $H(p, q)$ looks like the joint entropy of two random variables, but actually it's NOT. A better notation would make the asymmetric roles of the two arguments explicit; the semantics are:
- $p$ is the true distribution (where the data comes from).
- $q$ is the approximate distribution (the model we learned).

That's also why we read it as "the cross entropy from $p$ to $q$".
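To illustrate the definition and the asymmetric roles of $p$ and $q$, here is a minimal Python sketch (the distributions are made-up examples):

```python
import math

def cross_entropy(p, q, base=2):
    """Cross entropy H(p, q) = -sum_x p(x) log q(x), with p the true PMF and q the model PMF."""
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution (illustrative)
q = [0.5, 0.3, 0.2]   # "model" distribution (illustrative)

print(cross_entropy(p, q))  # H(p, q)
print(cross_entropy(q, p))  # H(q, p) -- generally a different value: the notation is not symmetric
print(cross_entropy(p, p))  # H(p, p) = H(p), the entropy of p itself
```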
Properties
$$H(p, q) \ge H(p) \ge 0$$

The proof is given by Jensen's inequality applied to the concave $\log$:

$$H(p) - H(p, q) = \sum_x p(x) \log \frac{q(x)}{p(x)} \le \log \sum_x p(x)\,\frac{q(x)}{p(x)} = \log \sum_x q(x) = 0$$

The entropy itself is also bounded above, $H(p) \le \log |\mathcal{X}|$, where $|\mathcal{X}|$ is the cardinality of the range of the random variable.
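A quick numeric check of these bounds, using made-up distributions:

```python
import math

def entropy(p, base=2):
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

def cross_entropy(p, q, base=2):
    return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # illustrative "true" distribution
q = [0.4, 0.4, 0.2]   # illustrative "model" distribution

print(cross_entropy(p, q) >= entropy(p))   # True: H(p, q) >= H(p)
print(entropy(p) <= math.log2(len(p)))     # True: H(p) <= log |X|
```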
Conditional Entropy
Random variable $X$ has an initial distribution $p(x)$, so we have:

$$H(X) = -\sum_x p(x) \log p(x)$$

Given evidence $Y = y$, we have the posterior $p(x \mid y)$, then:

$$H(X \mid Y = y) = -\sum_x p(x \mid y) \log p(x \mid y)$$

Averaging over the evidence gives the conditional entropy $H(X \mid Y) = \sum_y p(y)\, H(X \mid Y = y)$. Later we will show:

$$H(X \mid Y) \le H(X)$$

However, $H(X \mid Y = y)$ might be larger than $H(X)$ for a particular observation $y$.
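A small numeric illustration of this point (the joint distribution below is made up for the sketch): observing a rare outcome can increase uncertainty about $X$, yet the average $H(X \mid Y)$ still does not exceed $H(X)$:

```python
import math

def H(pmf):
    """Entropy in bits of a PMF given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Joint distribution p(x, y): joint[x][y]  (illustrative numbers)
joint = [[0.891, 0.05],   # x = 0
         [0.009, 0.05]]   # x = 1

p_x = [sum(row) for row in joint]                              # marginal p(x)
p_y = [sum(joint[x][y] for x in range(2)) for y in range(2)]   # marginal p(y)

print("H(X) =", H(p_x))
h_cond = 0.0
for y in range(2):
    p_x_given_y = [joint[x][y] / p_y[y] for x in range(2)]
    h_y = H(p_x_given_y)
    h_cond += p_y[y] * h_y
    print(f"H(X | Y={y}) =", h_y)
print("H(X | Y) =", h_cond)
# H(X | Y=1) exceeds H(X), but the average H(X | Y) is still <= H(X).
```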
Divergence
General ML Setup
In a general ML setup, we have:
- $p$: the unknown true distribution (ground truth)
- $\hat{p}$: the data distribution (obtained by sampling from $p$)
- $q$: the learned distribution (the model)

What we want is to make $q$ as close to $p$ as possible. So the question becomes "how do we measure the closeness between two distributions?" Common choices include:
- Kullback-Leibler (KL) Divergence
- Jensen-Shannon (JS) Divergence
- Earth Moverβs Distance (EMD)
- Maximum Mean Discrepancy (MMD)
- Total Variation (TV) Distance
Kullback-Leibler (KL) Divergence
Assume $p$ and $q$ are two distributions over the same random variable $X$; the KL divergence from $p$ to $q$ is defined as:

$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}$$
where:
- $p$ is the true distribution (or reference distribution)
- $q$ is the approximate distribution (or learned distribution)
Properties
- $D_{KL}(p \,\|\, q) \ge 0$
- $D_{KL}(p \,\|\, q) = 0$ when $p = q$
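A minimal Python sketch of the KL divergence (with illustrative distributions), checking non-negativity and showing that it is not symmetric:

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # "true" distribution (illustrative)
q = [0.4, 0.4, 0.2]   # "approximate" distribution (illustrative)

print(kl_divergence(p, q))  # >= 0
print(kl_divergence(q, p))  # a different value: KL is not symmetric
print(kl_divergence(p, p))  # 0 when the two distributions are equal
```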
Unsupervised Learning
Objective: minimize the KL divergence between the data distribution $\hat{p}$ and the model $q$:

$$\min_q D_{KL}(\hat{p} \,\|\, q)$$

Since the ground-truth distribution will not change, $H(\hat{p})$ is fixed, then we have:

$$D_{KL}(\hat{p} \,\|\, q) = \sum_x \hat{p}(x) \log \hat{p}(x) - \sum_x \hat{p}(x) \log q(x) = -H(\hat{p}) + H(\hat{p}, q)$$

With the above equation, we have:

$$\min_q D_{KL}(\hat{p} \,\|\, q) \iff \min_q H(\hat{p}, q) \iff \max_q \mathbb{E}_{x \sim \hat{p}}[\log q(x)]$$

which means that minimizing the KL divergence from the data distribution to the model is equivalent to minimizing the cross entropy, i.e., to maximizing the log-likelihood of the data under the model (MLE).
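A quick numeric sketch of this equivalence with made-up distributions: the cross entropy and the KL divergence differ only by the constant $H(\hat{p})$, so they are minimized by the same model:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return cross_entropy(p, q) - entropy(p)

p_hat = [0.6, 0.3, 0.1]          # empirical data distribution (illustrative)
candidates = [[0.5, 0.3, 0.2],   # candidate models q (illustrative)
              [0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3]]

for q in candidates:
    print(f"q={q}  H(p_hat, q)={cross_entropy(p_hat, q):.4f}  KL={kl(p_hat, q):.4f}")
# The two columns differ by the constant H(p_hat), so they are minimized by the same q
# (here the second candidate, which matches p_hat exactly).
```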