Entropy Estimation

This module provides functions for estimating entropy of discrete and continuous random variables.

Functions

driada.information.entropy.entropy_d(x)

Calculate entropy for a discrete variable.

Automatically selects between JIT-compiled and numpy implementations based on dataset size for optimal performance. JIT version is used for arrays smaller than ENTROPY_D_JIT_THRESHOLD (1000 elements).

Parameters:

x (array-like) – Discrete variable values. Must be numeric (integers, or floats representing discrete states).

Returns:

Entropy in bits.

Return type:

float

Raises:

ValueError – If input is not numeric.

Examples

>>> entropy_d([1, 1, 2, 2])  # uniform binary distribution
1.0
>>> entropy_d([1, 2, 3, 4])  # uniform 4-way distribution
2.0

Notes

For small datasets (< 1000 elements), automatically uses JIT-compiled implementation if available. For larger datasets, uses optimized numpy implementation to avoid JIT compilation overhead.
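
As an illustration of the quantity being estimated, the following is a minimal pure-Python sketch of a plug-in (frequency-based) entropy estimate. The helper name `plugin_entropy_bits` is hypothetical and this is not the library's JIT or numpy implementation; it only reproduces the values shown in the examples above.

```python
import math
from collections import Counter


def plugin_entropy_bits(values):
    """Plug-in Shannon entropy of a discrete sample, in bits (illustrative helper)."""
    counts = Counter(values)
    n = sum(counts.values())
    # H = sum over observed states of -p * log2(p), with p estimated by relative frequency
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


print(plugin_entropy_bits([1, 1, 2, 2]))  # 1.0
print(plugin_entropy_bits([1, 2, 3, 4]))  # 2.0
```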

driada.information.entropy.joint_entropy_dd(x, y)

Calculate joint entropy for two discrete variables.

Uses the JIT-compiled version when available, which is consistently faster than the histogram2d approach across all dataset sizes.

Parameters:

x (array-like) – First discrete variable. Must have same length as y.
y (array-like) – Second discrete variable. Must have same length as x.

Returns:

Joint entropy H(X,Y) in bits.

Return type:

float

Examples

>>> joint_entropy_dd([1, 1, 2, 2], [1, 2, 1, 2])  # independent
2.0
>>> joint_entropy_dd([1, 1, 2, 2], [1, 1, 2, 2])  # perfectly dependent
1.0

Notes

When JIT compilation is available, always uses the JIT version as it is consistently faster. Falls back to histogram2d-based implementation if JIT is not available.
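
Conceptually, joint entropy is the entropy of the paired observations (x_i, y_i). The sketch below counts pairs directly; the helper name `joint_entropy_bits` is hypothetical, and the approach is neither the library's JIT routine nor its histogram2d fallback.

```python
import math
from collections import Counter


def joint_entropy_bits(x, y):
    """Plug-in joint entropy H(X,Y) in bits, computed by counting (x, y) pairs."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    counts = Counter(zip(x, y))  # each distinct pair is one joint state
    n = len(x)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


print(joint_entropy_bits([1, 1, 2, 2], [1, 2, 1, 2]))  # 2.0 (four equally likely pairs)
print(joint_entropy_bits([1, 1, 2, 2], [1, 1, 2, 2]))  # 1.0 (two equally likely pairs)
```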

driada.information.entropy.conditional_entropy_cdd(z, x, y, k=5, estimator='gcmi')

Calculate conditional differential entropy for a continuous variable given two discrete variables.

Computes H(Z|X,Y) where Z is continuous and X,Y are discrete. Two estimators are available: GCMI (fast, Gaussian assumption) and KSG (accurate, nonparametric).

Parameters:

z (array-like) – Continuous variable. Must have same length as x and y.
x (array-like) – First discrete variable. Must have same length as z and y.
y (array-like) – Second discrete variable. Must have same length as z and x.
k (int, optional) – For KSG: number of nearest neighbors. For GCMI: minimum subset size threshold (partitions smaller than k are excluded). Default: 5.
estimator ({'gcmi', 'ksg'}, optional) – Entropy estimation method. Default: 'gcmi'.

- 'gcmi': Fast, assumes Gaussian distribution.
- 'ksg': Accurate, nonparametric k-nearest neighbor approach.

Returns:

Conditional entropy H(Z|X,Y) in bits.

Return type:

float

Examples

>>> z = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]
>>> x = [1, 1, 2, 2, 1, 2]
>>> y = [1, 2, 1, 2, 1, 1]
>>> result = conditional_entropy_cdd(z, x, y, k=3)
>>> isinstance(result, float)
True

Notes

GCMI estimator is faster but assumes data follows Gaussian distribution. KSG estimator is slower but works for arbitrary continuous distributions.
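
Conditioning on two discrete variables amounts to splitting z into one partition per observed (x, y) label pair. The sketch below shows how such partitions and a minimum-size threshold interact, using the data from the example above. The helper `partition_by_labels` and the threshold value are hypothetical, and how the library forms or filters partitions internally is an assumption here.

```python
from collections import defaultdict


def partition_by_labels(z, x, y):
    """Group continuous samples z by their (x, y) label pair (illustrative helper)."""
    parts = defaultdict(list)
    for zi, xi, yi in zip(z, x, y):
        parts[(xi, yi)].append(zi)
    return dict(parts)


z = [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]
x = [1, 1, 2, 2, 1, 2]
y = [1, 2, 1, 2, 1, 1]

parts = partition_by_labels(z, x, y)
min_size = 2  # hypothetical threshold, playing the role described for k with GCMI
kept = {label: vals for label, vals in parts.items() if len(vals) >= min_size}

print({label: len(vals) for label, vals in parts.items()})
print(sorted(kept))  # label pairs with at least min_size samples
```

With many label combinations and few samples, most partitions can fall below the threshold, so it is worth checking partition sizes before choosing k.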

driada.information.entropy.conditional_entropy_cd(z, x, k=5, estimator='gcmi')

Calculate conditional differential entropy for a continuous variable given a discrete variable.

Computes H(Z|X) where Z is continuous and X is discrete. Two estimators are available: GCMI (fast, Gaussian assumption) and KSG (accurate, nonparametric).

Parameters:

z (array-like) – Continuous variable. Must have same length as x.
x (array-like) – Discrete variable. Must have same length as z.
k (int, optional) – For KSG: number of nearest neighbors. For GCMI: minimum subset size threshold (partitions smaller than k are excluded). Default: 5.
estimator ({'gcmi', 'ksg'}, optional) – Entropy estimation method. Default: 'gcmi'.

- 'gcmi': Fast, assumes Gaussian distribution.
- 'ksg': Accurate, nonparametric k-nearest neighbor approach.

Returns:

Conditional entropy H(Z|X) in bits.

Return type:

float

Examples

>>> z = [0.1, 0.2, 0.8, 0.9]
>>> x = [1, 1, 2, 2]
>>> result = conditional_entropy_cd(z, x, k=1)
>>> isinstance(result, float)
True

Notes

GCMI estimator is faster but assumes data follows Gaussian distribution. KSG estimator is slower but works for arbitrary continuous distributions.
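
To make the Gaussian assumption concrete, here is a minimal sketch of a plain Gaussian plug-in estimate of H(Z|X): each group of z sharing an x label is treated as Gaussian, and the per-group entropies are averaged with weights proportional to group size. The function name, the `min_size` parameter, and the choice to renormalize weights over retained groups are all assumptions for illustration; the library's GCMI estimator may differ (for example in normalization or bias correction).

```python
import math
from collections import defaultdict
from statistics import variance


def gaussian_conditional_entropy_bits(z, x, min_size=2):
    """Gaussian plug-in estimate of H(Z|X) in bits (illustrative, not the library's GCMI)."""
    groups = defaultdict(list)
    for zi, xi in zip(z, x):
        groups[xi].append(zi)

    # Groups below the threshold are skipped; at least 2 samples are needed for a variance.
    kept = [g for g in groups.values() if len(g) >= max(min_size, 2)]
    n_kept = sum(len(g) for g in kept)
    if n_kept == 0:
        raise ValueError("no group has enough samples")

    h = 0.0
    for g in kept:
        var = variance(g)  # unbiased sample variance; assumed to be nonzero
        # Differential entropy of a Gaussian: 0.5 * log2(2 * pi * e * var)
        h += (len(g) / n_kept) * 0.5 * math.log2(2 * math.pi * math.e * var)
    return h


print(gaussian_conditional_entropy_bits([0.0, 2.0, 10.0, 12.0], [1, 1, 2, 2]))
```

Because only the within-group variance enters, shifting a whole group of z values does not change the estimate.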

Usage Examples

Joint Entropy

from driada.information.entropy import joint_entropy_dd
import numpy as np

# Joint entropy of two variables
x = np.random.randint(0, 4, 1000)
y = np.random.randint(0, 3, 1000)

H_xy = joint_entropy_dd(x, y)
print(f"H(X,Y) = {H_xy:.3f} bits")

Conditional Entropy

from driada.information.entropy import conditional_entropy_cdd
import numpy as np

# H(Z|X,Y) - uncertainty in continuous Z given discrete X,Y
x = np.random.randint(0, 3, 1000)  # Discrete
y = np.random.randint(0, 2, 1000)  # Discrete
z = np.random.randn(1000) + x  # Continuous, depends on X

H_z_given_xy = conditional_entropy_cdd(z, x, y)
print(f"H(Z|X,Y) = {H_z_given_xy:.3f} bits")

Theory

Shannon Entropy:

For discrete random variable X:

\[H(X) = -\sum_{i} p(x_i) \log_2 p(x_i)\]

For continuous random variable X (differential entropy):

\[h(X) = -\int p(x) \log_2 p(x) \, dx\]
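
As a worked instance of the continuous case (a standard closed form, not taken from the library's code), a Gaussian variable with variance \(\sigma^2\) has differential entropy

\[h(X) = \frac{1}{2} \log_2\left(2 \pi e \sigma^2\right)\]

For \(\sigma = 1\) this is approximately 2.05 bits, while for \(\sigma = 0.1\) it is approximately -1.27 bits. Unlike discrete entropy, differential entropy can be negative, so a negative value from a continuous estimator is not necessarily an error.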

Conditional Entropy:

\[H(X|Y) = H(X,Y) - H(Y)\]
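
The identity above can be checked numerically on small discrete samples. The sketch below uses hypothetical pure-Python helpers (`entropy_bits`, `conditional_entropy_bits`) with plug-in frequency estimates; it does not call the library.

```python
import math
from collections import Counter


def entropy_bits(samples):
    """Plug-in Shannon entropy in bits of a list of hashable states."""
    counts = Counter(samples)
    n = len(samples)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def conditional_entropy_bits(x, y):
    """H(X|Y) = H(X,Y) - H(Y), with the joint entropy computed over (x, y) pairs."""
    return entropy_bits(list(zip(x, y))) - entropy_bits(list(y))


x = [1, 1, 2, 2]
print(conditional_entropy_bits(x, [1, 2, 1, 2]))  # 1.0: Y tells us nothing about X
print(conditional_entropy_bits(x, [1, 1, 2, 2]))  # 0.0: Y determines X
```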

Mutual Information:

\[I(X;Y) = H(X) + H(Y) - H(X,Y)\]
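
Substituting the conditional entropy identity into this formula gives an equivalent form, which reads as the reduction in uncertainty about X once Y is known:

\[I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - \left[H(X,Y) - H(Y)\right] = H(X) - H(X|Y)\]

Using the values from the joint_entropy_dd examples: for x = [1, 1, 2, 2] and y = [1, 2, 1, 2], H(X) = H(Y) = 1 and H(X,Y) = 2, so I(X;Y) = 0 bits. For y = x, H(X,Y) = 1, so I(X;Y) = 1 bit.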