Random Variable

Definition

A random variable 𝑋 is a function that maps each experiment outcome πœ”βˆˆΞ© to a numerical value in ℝ. That is, it maps the sample space Ξ© to the real numbers: πœ”β†¦π‘‹(πœ”)

𝑋:Ω→ℝ

We use Ω𝑋 to denote the range (i.e., the image: the set of values that 𝑋 can actually take) of random variable 𝑋.

  • If Ω𝑋 is finite or countably infinite, 𝑋 is called a discrete random variable.
  • If Ω𝑋 is uncountably infinite, 𝑋 is called a continuous random variable.
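
To make the definition concrete, here is a minimal Python sketch (not from the original notes; the two-coin-flip experiment and all names below are assumptions chosen for illustration):

```python
# A random variable as a plain Python function on a finite sample space.
from itertools import product

# Sample space Ξ©: every outcome of flipping a fair coin twice.
omega = list(product("HT", repeat=2))       # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
prob = {w: 1 / len(omega) for w in omega}   # β„™(Ο‰) = 1/4 for each outcome

def X(w):
    """Random variable X: Ο‰ ↦ number of heads in the outcome Ο‰."""
    return sum(1 for c in w if c == "H")

omega_X = sorted({X(w) for w in omega})     # Ξ©_X = [0, 1, 2]: finite, so X is discrete
print(omega_X)
```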

Probability Mass Function

The probability mass function (PMF) of a discrete random variable 𝑋 assigns probabilities to the possible values of the random variable. That is: 𝑝𝑋:Ω𝑋→[0,1] where:

𝑝𝑋(π‘˜)=β„™(𝑋=π‘˜)=βˆ‘πœ”βˆˆΞ©:𝑋(πœ”)=π‘˜β„™(πœ”)

note that π‘˜βˆˆΞ©π‘‹ as shown in the function signature of 𝑝𝑋, and thus the PMF must satisfy:

βˆ‘π‘§βˆˆΞ©π‘‹π‘π‘‹(𝑧)=1

Expectation

The expectation (or expected value, mean) of a discrete random variable 𝑋 is defined as:

$$\mathbb{E}[X] = \sum_{x \in \Omega_X} x \cdot p_X(x) = \sum_{\omega \in \Omega} X(\omega)\, \mathbb{P}(\omega)$$

Recall that a random variable 𝑋 is a function, so 𝑋(πœ”) is the value of 𝑋 at outcome πœ”; this is why the expectation can equivalently be written as a weighted sum over the sample space Ξ©.
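
Again as a self-contained sketch under the same assumptions, the expectation can be computed both over Ω𝑋 and over Ξ©, and the two agree:

```python
# Expectation of X = number of heads in two fair coin flips.
from itertools import product

omega = list(product("HT", repeat=2))
prob = {w: 1 / len(omega) for w in omega}

def X(w):
    return sum(1 for c in w if c == "H")

values = {X(w) for w in omega}                                   # Ξ©_X
p_X = {k: sum(prob[w] for w in omega if X(w) == k) for k in values}

E_over_values = sum(x * p_X[x] for x in values)                  # Ξ£ x Β· p_X(x)
E_over_outcomes = sum(X(w) * prob[w] for w in omega)             # Ξ£ X(Ο‰) Β· β„™(Ο‰)
print(E_over_values, E_over_outcomes)                            # 1.0 1.0
```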

Notation

The notation around random variables confused me a lot in the beginning. Let's clarify it here:

  • 𝑋=π‘˜ where π‘˜βˆˆΞ©π‘‹ denotes that the random variable 𝑋 takes value π‘˜, which implies that: 𝑋(πœ”)=π‘˜whereπœ”βˆˆΞ©

  • β„™(𝑋=π‘˜)=βˆ‘πœ”βˆˆΞ©:𝑋(πœ”)=π‘˜β„™(πœ”) denotes the probability that random variable 𝑋 takes value π‘˜.

  • Most of the time, β„™(𝑋=π‘˜) is interchangeable with 𝑃(𝑋=π‘˜) and 𝑃(π‘˜)

Important

  • In the β„™(𝑋=☐) notation, ☐ is the value of the random variable, not the event
  • In the β„™(☐) notation, ☐ is the event, not the value of the random variable
  • In the 𝑝𝑋(☐) notation, ☐ is the value of the random variable, not the event
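
A quick worked example may make these distinctions concrete (the fair six-sided die is an assumption added here for illustration): take Ξ© = {1, 2, …, 6} with β„™(πœ”) = 1/6 for every outcome, and let 𝑋(πœ”) = πœ”. Then:

$$\mathbb{P}(X = 3) = \sum_{\omega \in \Omega \,:\, X(\omega) = 3} \mathbb{P}(\omega) = \mathbb{P}(\{3\}) = \tfrac{1}{6} = p_X(3)$$

The 3 in β„™(𝑋=3) and 𝑝𝑋(3) is a value of the random variable, while the {3} in β„™({3}) is an event, i.e. a subset of Ξ©.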

Likelihood

Likelihood Function

The likelihood function β„’(πœƒ|π‘₯) measures how well a particular parameter value πœƒ explains the observed data π‘₯.

$$\mathcal{L}(\theta \mid x) = P_\theta(X = x) = P(X = x \mid \theta)$$

where:

  • πœƒ represents the parameters of the distribution
  • π‘₯ is the observed data (treated as constant)
  • 𝑃(𝑋=π‘₯|πœƒ) is the probability of observing π‘₯ given parameter πœƒ

Important

The two sides of the equation above are equal in value but differ in interpretation:

  • Likelihood: β„’(πœƒ|π‘₯) is a function of the parameters πœƒ with the data π‘₯ held fixed
  • Probability: 𝑃(𝑋=π‘₯|πœƒ) is a function of the data π‘₯ with the parameters πœƒ held fixed

For independent observations π‘₯1,π‘₯2,…,π‘₯𝑛, the joint likelihood is:

$$\mathcal{L}(\theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} P(X = x_i \mid \theta)$$
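
As an illustration (a minimal Python sketch; the Bernoulli model and the made-up coin flips below are assumptions, not part of the original notes), the joint likelihood of i.i.d. Bernoulli(πœƒ) data is just the product of the per-observation probabilities:

```python
# Joint likelihood of i.i.d. Bernoulli(ΞΈ) observations: L(ΞΈ | x_1..x_n) = ∏ P(X = x_i | ΞΈ).
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1])          # made-up flips (1 = heads)

def likelihood(theta, x):
    # P(X = x_i | ΞΈ) is ΞΈ when x_i = 1 and (1 - ΞΈ) when x_i = 0
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

print(likelihood(0.5, x))                 # 0.015625
print(likelihood(4 / 6, x))               # β‰ˆ 0.0219: this ΞΈ explains the data better
```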

Log Likelihood

The log likelihood β„“(πœƒ|π‘₯) is simply the natural logarithm of the likelihood function:

β„“(πœƒ|π‘₯)=logβ„’οΈ€(πœƒ|π‘₯)=log𝑃(𝑋=π‘₯|πœƒ)

For independent observations, the log likelihood becomes:

β„“(πœƒ|π‘₯1,…,π‘₯𝑛)=log(βˆπ‘›π‘–=1𝑃(𝑋=π‘₯𝑖|πœƒ))=βˆ‘π‘›π‘–=1log𝑃(𝑋=π‘₯𝑖|πœƒ)

Why use log likelihood?

  • Converts products to sums (easier to differentiate)
  • More numerically stable (avoids underflow)
  • Preserves the location of the maximum, since log is strictly increasing:

$$\arg\max_{\theta} \mathcal{L}(\theta \mid x) = \arg\max_{\theta} \ell(\theta \mid x)$$
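
The numerical-stability point is easy to see in a short sketch (NumPy, the synthetic data set, and the grid of πœƒ values below are all assumptions for illustration): with a few thousand observations the raw likelihood underflows to 0.0 in double precision, while the log likelihood stays finite and its argmax lands where expected.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.7, size=5000)                  # synthetic Bernoulli(0.7) flips

def log_likelihood(theta, x):
    return np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

thetas = np.linspace(0.01, 0.99, 99)                 # candidate parameter values
raw = [np.prod(np.where(x == 1, t, 1 - t)) for t in thetas]
ll = [log_likelihood(t, x) for t in thetas]

print(max(raw))                                      # 0.0 -- every raw product underflows
print(thetas[np.argmax(ll)])                         # β‰ˆ 0.7 (grid point nearest the sample mean)
```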

Negative Log Likelihood (NLL)

The negative log likelihood (NLL) is simply the negative of the log likelihood:

NLL(πœƒ|π‘₯)=βˆ’β„“(πœƒ|π‘₯)=βˆ’logβ„’οΈ€(πœƒ|π‘₯)

Why use NLL?

  • Most optimization algorithms (like gradient descent) are designed for minimization
  • NLL minimization ⇔ likelihood maximization
  • Often provides cleaner mathematical expressions

For independent observations:

NLL(πœƒ|π‘₯1,…,π‘₯𝑛)=βˆ’βˆ‘π‘›π‘–=1log𝑃(𝑋=π‘₯𝑖|πœƒ)

References

Stanford CS109: Discrete Random Variables: Basics

Stanford CS109: Discrete Random Variables: More on Expectation

Wikipedia: Likelihood function