Entropy

Prelude: To make sure you won't get lost in the notation, please read through the notation section for a quick refresher on the notation and conventions used in probability theory.

Definition

Entropy is a measure of the uncertainty of a random variable. The entropy of a discrete random variable $X$ is:

$$H(X) = \sum_{x \in \Omega_X} P(X = x) \log \frac{1}{P(X = x)} = -\sum_{x \in \Omega_X} P(X = x) \log P(X = x)$$

For a continuous random variable, the (differential) entropy is defined as:

$$H(X) = -\int p_X(x) \log p_X(x) \,\mathrm{d}x$$
  • If the base of the logarithm is 2, the unit of entropy is the bit.
  • If the base of the logarithm is $e$, the unit of entropy is the nat.
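
A minimal sketch of the discrete definition in Python (the helper name `entropy_bits` is my own), using base-2 logs so the result is in bits:

```python
import math

def entropy_bits(pmf):
    """Entropy in bits of a discrete distribution given as a sequence of probabilities."""
    # By convention, outcomes with probability 0 contribute 0 to the sum.
    return sum(p * math.log2(1.0 / p) for p in pmf if p > 0)

print(entropy_bits([0.5, 0.5]))  # 1.0
print(entropy_bits([1.0]))       # 0.0 (no uncertainty)
```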

Let's say random variable $X$ is the number of heads when flipping a fair coin:

$$\begin{aligned} H(X) &= P(X = 1) \log \frac{1}{P(X = 1)} + P(X = 0) \log \frac{1}{P(X = 0)} \\ &= \mathbb{P}(\text{Head}) \log \frac{1}{\mathbb{P}(\text{Head})} + \mathbb{P}(\text{Tail}) \log \frac{1}{\mathbb{P}(\text{Tail})} \\ &= \frac{1}{2} \log 2 + \frac{1}{2} \log 2 = 1 \end{aligned}$$

Random variable $Z$ is the outcome when rolling a fair six-sided die:

$$H(Z) = \sum_{i=1}^{6} P(Z = i) \log \frac{1}{P(Z = i)} = 6 \cdot \frac{1}{6} \log 6 = \log 6$$
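
Both results can be cross-checked numerically, e.g. with `scipy.stats.entropy` (`base=2` reports the answer in bits); this is just a sanity check, not part of the derivation:

```python
from scipy.stats import entropy

coin = [0.5, 0.5]             # fair coin
die = [1.0 / 6] * 6           # fair six-sided die

print(entropy(coin, base=2))  # 1.0 bit
print(entropy(die, base=2))   # log2(6) ≈ 2.585 bits
```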

Going back to the thumbtack example (a thumbtack has different probabilities of landing on head or tail), let's denote $\theta = P(X = \text{Head})$. Given the following configurations of $\theta$:

  • $\theta = 0.2$
  • $\theta = 0.5$
  • $\theta = 0.8$

for the task of predicting $x$, which configuration of $\theta$ is the hardest?

$$H(X) = P(\text{Head}) \log \frac{1}{P(\text{Head})} + P(\text{Tail}) \log \frac{1}{P(\text{Tail})} = \theta \log \frac{1}{\theta} + (1 - \theta) \log \frac{1}{1 - \theta}$$

When $\theta = 0.5$, $H(X) = 1$ bit is the highest entropy of the three configurations, meaning it is the hardest to predict.
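
A small sketch that evaluates the binary entropy at the three configurations above (base-2 logs; the helper name `binary_entropy` is my own):

```python
import math

def binary_entropy(theta):
    """Entropy in bits of a Bernoulli(theta) random variable."""
    if theta in (0.0, 1.0):
        return 0.0
    return theta * math.log2(1.0 / theta) + (1 - theta) * math.log2(1.0 / (1 - theta))

for theta in (0.2, 0.5, 0.8):
    print(theta, round(binary_entropy(theta), 4))
# 0.2 -> 0.7219, 0.5 -> 1.0, 0.8 -> 0.7219: theta = 0.5 is the hardest to predict.
```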

Cross Entropy

Given two probability distributions (or PMFs) $P$ and $Q$ over the same random variable $X$, the cross entropy from $P$ to $Q$ is defined as:

$$H(P, Q) = \sum_{x \in \Omega_X} P(X = x) \log \frac{1}{Q(X = x)} = -\sum_{x \in \Omega_X} P(X = x) \log Q(X = x)$$

Warning

The notation for cross entropy, $H(P, Q)$, is a bit confusing; it looks like it should satisfy:

$$H(P, Q) = H(Q, P)$$

but in general it does NOT. A better notation could be $H_P(Q)$; the semantics are:

  • $P$ is the true distribution (where the data comes from).
  • $Q$ is the approximate distribution (the model we learned).

That is also why we read it as "the cross entropy from $P$ to $Q$".
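
A tiny numerical check of the asymmetry (plain Python; the distributions $P$ and $Q$ are arbitrary values picked for illustration):

```python
import math

def cross_entropy_bits(p, q):
    """Cross entropy H(P, Q) in bits: the expectation under P of -log2 Q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.9, 0.1]   # "true" distribution
Q = [0.6, 0.4]   # "model" distribution

print(cross_entropy_bits(P, Q))  # ≈ 0.795
print(cross_entropy_bits(Q, P))  # ≈ 1.420, so H(P, Q) != H(Q, P)
```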

Properties

$$0 \le H(X) \le \log |\Omega_X|$$

The proof of the upper bound follows from Jensen's inequality (the logarithm is concave):

$$H(X) = \sum_{x \in \Omega_X} P(x) \log \frac{1}{P(x)} \le \log \sum_{x \in \Omega_X} P(x) \frac{1}{P(x)} = \log |\Omega_X|$$

where $|\Omega_X|$ is the cardinality of the range of the random variable. The upper bound is attained when $X$ is uniform over $\Omega_X$.
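
A quick numerical spot check of this bound on a few random PMFs (a NumPy sketch; obviously not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    n = int(rng.integers(2, 10))        # support size |Omega_X|
    p = rng.dirichlet(np.ones(n))       # a random PMF over n outcomes
    h = -np.sum(p * np.log2(p))         # entropy in bits
    assert h <= np.log2(n) + 1e-9       # H(X) <= log2 |Omega_X|
    print(n, round(float(h), 4), round(float(np.log2(n)), 4))
```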

Conditional Entropy

For a random variable $Y$ with initial (marginal) distribution $P(Y)$, we have:

$$H(Y) = \sum_{y \in \Omega_Y} P(y) \log \frac{1}{P(y)}$$

Given the evidence $X = a$, we have the conditional distribution $P(Y \mid X = a)$, and then:

$$\underbrace{H(Y \mid X = a)}_{\text{condition on an event}} = \sum_{y \in \Omega_Y} P(y \mid X = a) \log \frac{1}{P(y \mid X = a)}$$

$$\underbrace{H(Y \mid X)}_{\text{condition on a random variable}} = \sum_{a \in \Omega_X} P(X = a) \, H(Y \mid X = a) = \sum_{x \in \Omega_X} \sum_{y \in \Omega_Y} P(x, y) \log \frac{1}{P(y \mid x)}$$

Later we will show:

$$H(Y \mid X) \le H(Y)$$

However, $H(Y \mid X = a)$ might be larger than $H(Y)$.
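
The following sketch uses a small, hand-picked joint table (the numbers are my own, chosen for illustration): $X = 1$ is rare, $Y$ is nearly deterministic given $X = 0$, and uniform given $X = 1$. Conditioning on the event $X = 1$ increases the entropy of $Y$, while conditioning on the random variable $X$ decreases it on average:

```python
import numpy as np

def H(p):
    """Entropy in bits of a 1-D probability vector."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint P(X, Y): rows index X in {0, 1}, columns index Y in {0, 1}.
joint = np.array([[0.9 * 0.99, 0.9 * 0.01],
                  [0.1 * 0.50, 0.1 * 0.50]])

p_x = joint.sum(axis=1)                                   # marginal P(X)
p_y = joint.sum(axis=0)                                   # marginal P(Y)

H_Y = H(p_y)                                              # ≈ 0.323
H_Y_given_x1 = H(joint[1] / p_x[1])                       # H(Y | X=1) = 1.0 (condition on an event)
H_Y_given_X = sum(p_x[a] * H(joint[a] / p_x[a]) for a in (0, 1))  # ≈ 0.173 (condition on the r.v.)

print(H_Y, H_Y_given_x1, H_Y_given_X)
# H(Y | X=1) > H(Y), yet H(Y | X) <= H(Y).
```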

Divergence

General ML Setup

In a general ML setup, we have:

  • $P$: the unknown true distribution (ground truth)
  • $D$: the data distribution (samples drawn from $P$)
  • $Q$: the learned distribution (the model)

What we want is to make $Q$ as close to $P$ as possible. So the question becomes: "how do we measure the closeness between two distributions?" Common options include:

  • Kullback-Leibler (KL) Divergence
  • Jensen-Shannon (JS) Divergence
  • Earth Mover’s Distance (EMD)
  • Maximum Mean Discrepancy (MMD)
  • Total Variation (TV) Distance

Kullback-Leibler (KL) Divergence

Assume $P$ and $Q$ are two distributions over the same random variable $X$; the KL divergence from $Q$ to $P$ is defined as:

$$\mathrm{KL}(P \| Q) = \sum_{x \in \Omega_X} P(x) \log \frac{P(x)}{Q(x)}$$

where:

  • $P$ is the true distribution (or reference distribution)
  • $Q$ is the approximate distribution (or learned distribution)

Properties

$$\mathrm{KL}(P \| Q) \ge 0$$

$$\begin{aligned} \mathrm{KL}(P \| Q) &= \sum_{x \in \Omega_X} P(x) \log \frac{P(x)}{Q(x)} = -\sum_{x \in \Omega_X} P(x) \log \frac{Q(x)}{P(x)} \\ &\ge -\log \sum_{x \in \Omega_X} P(x) \frac{Q(x)}{P(x)} = -\log \sum_{x \in \Omega_X} Q(x) \\ &= -\log 1 \qquad (Q \text{ is a probability distribution}) \\ &= 0 \end{aligned}$$

When $P = Q$, $\mathrm{KL}(P \| Q) = 0$.
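
A small numerical check of both properties, with arbitrary illustrative $P$ and $Q$ (`scipy.special.rel_entr` computes the element-wise terms $P(x)\log\frac{P(x)}{Q(x)}$ in nats):

```python
import numpy as np
from scipy.special import rel_entr

P = np.array([0.9, 0.1])
Q = np.array([0.6, 0.4])

print(rel_entr(P, Q).sum())   # KL(P || Q) ≈ 0.226 nats, >= 0
print(rel_entr(Q, P).sum())   # KL(Q || P) ≈ 0.311 nats, >= 0 but different (KL is asymmetric)
print(rel_entr(P, P).sum())   # 0.0 when P == Q
```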

Unsupervised Learning

Objective: Minimize the KL divergence between $P$ and $Q$:

$$\begin{aligned} \mathrm{KL}(P \| Q) &= \sum_{x \in \Omega_X} P(x) \log \frac{P(x)}{Q(x)} \\ &= \sum_{x \in \Omega_X} P(x) \big( \log P(x) - \log Q(x) \big) \\ &= \underbrace{\sum_{x \in \Omega_X} P(x) \log P(x)}_{-H(P)} - \underbrace{\sum_{x \in \Omega_X} P(x) \log Q(x)}_{-H(P, Q)} \\ &= H(P, Q) - H(P) \end{aligned}$$

Since the ground-truth distribution does not change, $H(P)$ is a constant, so:

$$\min_Q \mathrm{KL}(P \| Q) = \min_Q \big( H(P, Q) - H(P) \big) \iff \min_Q H(P, Q)$$

With the above equivalence, we have:

$$\min_Q \mathrm{KL}(P \| Q) \iff \min_Q H(P, Q), \qquad H(P, Q) = \int P(x) \log \frac{1}{Q(x)} \,\mathrm{d}x = -\int P(x) \log Q(x) \,\mathrm{d}x \approx -\frac{1}{N} \sum_{i=1}^{N} \log Q(x_i)$$

where the $x_i$ are samples drawn from $P$ (i.e., the data), and the last step is a Monte Carlo estimate of the expectation.

which means that:

$$\text{minimizing the KL divergence} \iff \text{minimizing the cross entropy } H(P, Q) \iff \text{maximizing the log-likelihood of } Q$$
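
As a concrete sketch of this chain of equivalences, consider fitting a Bernoulli model $Q$ by grid search (the true $\theta$, the sample size, and the grid are illustrative choices of mine): the $q$ that minimizes the empirical cross entropy, i.e. the average negative log-likelihood, is the same $q$ that maximizes the likelihood, and it lands at the empirical frequency of the data.

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.3                              # the unknown "true" P (illustrative)
data = rng.random(10_000) < theta_true        # samples x_i ~ P

def avg_nll(q, x):
    """Empirical cross entropy H(P, Q): -(1/N) * sum_i log Q(x_i) for a Bernoulli(q) model."""
    return -np.mean(np.where(x, np.log(q), np.log(1 - q)))

grid = np.linspace(0.01, 0.99, 99)
best_q = grid[np.argmin([avg_nll(q, data) for q in grid])]

print(best_q, data.mean())   # the NLL minimizer coincides with the empirical frequency
```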