Entropy

Prelude: To make sure you won't get lost in the notation, please read through the notation section for a quick refresher on the notation and conventions used in probability theory.

Definition

Entropy is a measure of the uncertainty of a random variable. The entropy of a discrete random variable $X$ is:

$$H(X) = \sum_{x \in \Omega_X} P(X = x) \log \frac{1}{P(X = x)} = -\sum_{x \in \Omega_X} P(X = x) \log P(X = x)$$

For a continuous random variable, the (differential) entropy is defined as:

$$H(X) = -\int p_X(x) \log p_X(x) \,\mathrm{d}x$$
  • If the base of the logarithm is 2, the unit of entropy is the bit.
  • If the base of the logarithm is $e$, the unit of entropy is the nat.
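
A minimal sketch of the discrete definition in Python (the helper name `entropy_bits` is my own), using base-2 logs so the result is in bits:

```python
import math

def entropy_bits(pmf):
    """Entropy in bits of a discrete distribution given as a sequence of probabilities."""
    # By convention, outcomes with probability 0 contribute 0 to the sum.
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

print(entropy_bits([0.5, 0.5]))  # 1.0
print(entropy_bits([1.0]))       # 0.0 (no uncertainty)
```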

Let's say random variable $X$ is the number of heads when flipping a fair coin:

$$\begin{aligned} H(X) &= P(X = 1) \log \frac{1}{P(X = 1)} + P(X = 0) \log \frac{1}{P(X = 0)} \\ &= \mathbb{P}(\text{Head}) \log \frac{1}{\mathbb{P}(\text{Head})} + \mathbb{P}(\text{Tail}) \log \frac{1}{\mathbb{P}(\text{Tail})} \\ &= \frac{1}{2} \log 2 + \frac{1}{2} \log 2 = 1 \end{aligned}$$

Random variable $Z$ is the outcome when rolling a fair six-sided die:

$$H(Z) = \sum_{i=1}^{6} P(Z = i) \log \frac{1}{P(Z = i)} = 6 \cdot \frac{1}{6} \log 6 = \log 6$$
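
Both results can be cross-checked numerically, e.g. with `scipy.stats.entropy` (`base=2` reports the answer in bits); this is just a sanity check, not part of the derivation:

```python
from scipy.stats import entropy

coin = [0.5, 0.5]             # fair coin
die = [1.0 / 6] * 6           # fair six-sided die

print(entropy(coin, base=2))  # 1.0 bit
print(entropy(die, base=2))   # log2(6) ≈ 2.585 bits
```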

Going back to the thumbtack example (a thumbtack has different probabilities of landing on head or tail), let's denote $\theta = P(X = \text{Head})$. Given the following configurations of $\theta$:

  • $\theta = 0.2$
  • $\theta = 0.5$
  • $\theta = 0.8$

for the task of predicting $x$, which configuration of $\theta$ is the hardest?

$$H(X) = P(\text{Head}) \log \frac{1}{P(\text{Head})} + P(\text{Tail}) \log \frac{1}{P(\text{Tail})} = \theta \log \frac{1}{\theta} + (1 - \theta) \log \frac{1}{1 - \theta}$$

When $\theta = 0.5$, $H(X) = 1$ bit is the highest entropy of the three configurations, meaning it is the hardest to predict.
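
A small sketch that evaluates the binary entropy at the three configurations above (base-2 logs; the helper name `binary_entropy` is my own):

```python
import math

def binary_entropy(theta):
    """Entropy in bits of a Bernoulli(theta) random variable."""
    if theta in (0.0, 1.0):
        return 0.0
    return theta * math.log2(1.0 / theta) + (1 - theta) * math.log2(1.0 / (1 - theta))

for theta in (0.2, 0.5, 0.8):
    print(theta, round(binary_entropy(theta), 4))
# 0.2 -> 0.7219, 0.5 -> 1.0, 0.8 -> 0.7219: theta = 0.5 is the hardest to predict.
```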

Cross Entropy

Given two probability distributions (or PMFs) $P$ and $Q$ over the same random variable $X$, the cross entropy from $P$ to $Q$ is defined as:

$$H(P, Q) = \sum_{x \in \Omega_X} P(X = x) \log \frac{1}{Q(X = x)} = -\sum_{x \in \Omega_X} P(X = x) \log Q(X = x)$$

Warning

The notation for cross entropy, $H(P, Q)$, is a bit confusing; it looks like it should satisfy:

$$H(P, Q) = H(Q, P)$$

but in general it does NOT. A better notation could be $H_P(Q)$; the semantics are:

  • $P$ is the true distribution (where the data comes from).
  • $Q$ is the approximate distribution (the model we learned).

That is also why we read it as "the cross entropy from $P$ to $Q$".
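
A tiny numerical check of the asymmetry (plain Python; the distributions $P$ and $Q$ are arbitrary values picked for illustration):

```python
import math

def cross_entropy_bits(p, q):
    """Cross entropy H(P, Q) in bits: the expectation under P of -log2 Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]   # "true" distribution
Q = [0.6, 0.4]   # "model" distribution

print(cross_entropy_bits(P, Q))  # ≈ 0.795
print(cross_entropy_bits(Q, P))  # ≈ 1.420, so H(P, Q) != H(Q, P)
```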

Properties

$$0 \le H(X) \le \log |\Omega_X|$$

The proof of the upper bound follows from Jensen's inequality (the logarithm is concave):

$$H(X) = \sum_{x \in \Omega_X} P(x) \log \frac{1}{P(x)} \le \log \sum_{x \in \Omega_X} P(x) \frac{1}{P(x)} = \log |\Omega_X|$$

where $|\Omega_X|$ is the cardinality of the range of the random variable. The upper bound is attained when $X$ is uniform over $\Omega_X$.
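
A quick numerical spot check of this bound on a few random PMFs (a NumPy sketch; obviously not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    n = int(rng.integers(2, 10))        # support size |Omega_X|
    p = rng.dirichlet(np.ones(n))       # a random PMF over n outcomes
    h = -np.sum(p * np.log2(p))         # entropy in bits
    assert h <= np.log2(n) + 1e-9       # H(X) <= log2 |Omega_X|
    print(n, round(float(h), 4), round(float(np.log2(n)), 4))
```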

Conditional Entropy

For a random variable $Y$ with initial (marginal) distribution $P(Y)$, we have:

$$H(Y) = \sum_{y \in \Omega_Y} P(y) \log \frac{1}{P(y)}$$

Given the evidence $X = a$, we have the conditional distribution $P(Y \mid X = a)$, and then:

$$\underbrace{H(Y \mid X = a)}_{\text{condition on an event}} = \sum_{y \in \Omega_Y} P(y \mid X = a) \log \frac{1}{P(y \mid X = a)}$$

$$\underbrace{H(Y \mid X)}_{\text{condition on a random variable}} = \sum_{a \in \Omega_X} P(X = a) \, H(Y \mid X = a) = \sum_{x \in \Omega_X} \sum_{y \in \Omega_Y} P(x, y) \log \frac{1}{P(y \mid x)}$$

Later we will show:

$$H(Y \mid X) \le H(Y)$$

However, $H(Y \mid X = a)$ might be larger than $H(Y)$.
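
The following sketch uses a small, hand-picked joint table (the numbers are my own, chosen for illustration): $X = 1$ is rare, $Y$ is nearly deterministic given $X = 0$, and uniform given $X = 1$. Conditioning on the event $X = 1$ increases the entropy of $Y$, while conditioning on the random variable $X$ decreases it on average:

```python
import numpy as np

def H(p):
    """Entropy in bits of a 1-D probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint P(X, Y): rows index X in {0, 1}, columns index Y in {0, 1}.
joint = np.array([[0.9 * 0.99, 0.9 * 0.01],
                  [0.1 * 0.50, 0.1 * 0.50]])

p_x = joint.sum(axis=1)                                   # marginal P(X)
p_y = joint.sum(axis=0)                                   # marginal P(Y)

H_Y = H(p_y)                                              # ≈ 0.323
H_Y_given_x1 = H(joint[1] / p_x[1])                       # H(Y | X=1) = 1.0 (condition on an event)
H_Y_given_X = sum(p_x[a] * H(joint[a] / p_x[a]) for a in (0, 1))  # ≈ 0.173 (condition on the r.v.)

print(H_Y, H_Y_given_x1, H_Y_given_X)
# H(Y | X=1) > H(Y), yet H(Y | X) <= H(Y).
```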

Divergence

General ML Setup

In a general ML setup, we have:

  • $P$: the unknown true distribution (ground truth)
  • $D$: the data distribution (samples drawn from $P$)
  • $Q$: the learned distribution (the model)

What we want is to make $Q$ as close to $P$ as possible. So the question becomes: "how do we measure the closeness between two distributions?" Common options include:

  • Kullback-Leibler (KL) Divergence
  • Jensen-Shannon (JS) Divergence
  • Earth Mover’s Distance (EMD)
  • Maximum Mean Discrepancy (MMD)
  • Total Variation (TV) Distance

Kullback-Leibler (KL) Divergence

Assume $P$ and $Q$ are two distributions over the same random variable $X$; the KL divergence from $Q$ to $P$ is defined as:

$$\mathrm{KL}(P \| Q) = \sum_{x \in \Omega_X} P(x) \log \frac{P(x)}{Q(x)}$$

where:

  • $P$ is the true distribution (or reference distribution)
  • $Q$ is the approximate distribution (or learned distribution)

Properties

$$\mathrm{KL}(P \| Q) \ge 0$$

$$\begin{aligned} \mathrm{KL}(P \| Q) &= \sum_{x \in \Omega_X} P(x) \log \frac{P(x)}{Q(x)} = -\sum_{x \in \Omega_X} P(x) \log \frac{Q(x)}{P(x)} \\ &\ge -\log \sum_{x \in \Omega_X} P(x) \frac{Q(x)}{P(x)} = -\log \sum_{x \in \Omega_X} Q(x) \\ &= -\log 1 \qquad (Q \text{ is a probability distribution}) \\ &= 0 \end{aligned}$$

When $P = Q$, $\mathrm{KL}(P \| Q) = 0$.
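
A small numerical check of both properties, with arbitrary illustrative $P$ and $Q$ (`scipy.special.rel_entr` computes the element-wise terms $P(x)\log\frac{P(x)}{Q(x)}$ in nats):

```python
import numpy as np
from scipy.special import rel_entr

P = np.array([0.9, 0.1])
Q = np.array([0.6, 0.4])

print(rel_entr(P, Q).sum())   # KL(P || Q) ≈ 0.226 nats, >= 0
print(rel_entr(Q, P).sum())   # KL(Q || P) ≈ 0.311 nats, >= 0 but different (KL is asymmetric)
print(rel_entr(P, P).sum())   # 0.0 when P == Q
```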

Unsupervised Learning

Objective: Minimize the KL divergence between $P$ and $Q$:

$$\begin{aligned} \mathrm{KL}(P \| Q) &= \sum_{x \in \Omega_X} P(x) \log \frac{P(x)}{Q(x)} \\ &= \sum_{x \in \Omega_X} P(x) \big( \log P(x) - \log Q(x) \big) \\ &= \underbrace{\sum_{x \in \Omega_X} P(x) \log P(x)}_{-H(P)} - \underbrace{\sum_{x \in \Omega_X} P(x) \log Q(x)}_{-H(P, Q)} \\ &= H(P, Q) - H(P) \end{aligned}$$

Since the ground-truth distribution does not change, $H(P)$ is a constant, so:

$$\min_Q \mathrm{KL}(P \| Q) = \min_Q \big( H(P, Q) - H(P) \big) \iff \min_Q H(P, Q)$$

With the above equivalence, we have:

$$\min_Q \mathrm{KL}(P \| Q) \iff \min_Q H(P, Q), \qquad H(P, Q) = \int P(x) \log \frac{1}{Q(x)} \,\mathrm{d}x = -\int P(x) \log Q(x) \,\mathrm{d}x \approx -\frac{1}{N} \sum_{i=1}^{N} \log Q(x_i)$$

where the $x_i$ are samples drawn from $P$ (i.e., the data), and the last step is a Monte Carlo estimate of the expectation.

which means that:

$$\text{minimizing the KL divergence} \iff \text{minimizing the cross entropy } H(P, Q) \iff \text{maximizing the log-likelihood of } Q$$
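
As a concrete sketch of this chain of equivalences, consider fitting a Bernoulli model $Q$ by grid search (the true $\theta$, the sample size, and the grid are illustrative choices of mine): the $q$ that minimizes the empirical cross entropy, i.e. the average negative log-likelihood, is the same $q$ that maximizes the likelihood, and it lands at the empirical frequency of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3                              # the unknown "true" P (illustrative)
data = rng.random(10_000) < theta_true        # samples x_i ~ P

def avg_nll(q, x):
    """Empirical cross entropy H(P, Q): -(1/N) * sum_i log Q(x_i) for a Bernoulli(q) model."""
    return -np.mean(np.where(x, np.log(q), np.log(1 - q)))

grid = np.linspace(0.01, 0.99, 99)
best_q = grid[np.argmin([avg_nll(q, data) for q in grid])]

print(best_q, data.mean())   # the NLL minimizer coincides with the empirical frequency
```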