Information Theory Utilities
Utility functions and helper classes for information-theoretic computations.
Data Types
Advanced type detection for time series data.
This module provides comprehensive detection of time series types, including:
- Discrete vs. continuous classification
- Circular/periodic signal detection
- Probabilistic type inference with confidence scores
- Detection of various data patterns (binary, categorical, count, etc.)
- class driada.information.time_series_types.TimeSeriesType(primary_type, subtype, confidence, is_circular, circular_period, periodicity, metadata)
Result of time series type detection.
Encapsulates the results of comprehensive time series analysis, including classification of the primary data type (discrete/continuous), subtype details, periodicity information, and confidence scores.
This class provides a structured way to represent and query time series characteristics, useful for selecting appropriate analysis methods.
- Parameters:
- primary_type
Primary classification of the time series:
- ‘discrete’: Integer-valued or categorical data
- ‘continuous’: Real-valued measurements
- ‘ambiguous’: Cannot confidently determine type
- Type:
{‘discrete’, ‘continuous’, ‘ambiguous’}
- subtype
More specific classification:
- Discrete subtypes: ‘binary’ (0/1), ‘categorical’, ‘count’ (non-negative integers)
- Continuous subtypes: ‘linear’, ‘circular’ (phase/angle), ‘timeline’
- Type:
{‘binary’, ‘categorical’, ‘count’, ‘timeline’, ‘linear’, ‘circular’}, optional
- confidence
Confidence score for the classification (0-1). Higher values indicate more certainty in the type detection.
- Type:
float
- is_circular
Whether the data represents circular/angular quantities (e.g., phases, angles, time-of-day).
- Type:
bool
- circular_period
Period of circular data (e.g., 2π for radians, 360 for degrees).
- Type:
float, optional
- periodicity
Detected period from autocorrelation analysis. Non-circular data can still be periodic (e.g., oscillations, rhythms).
- Type:
float, optional
- metadata
Statistical properties computed during detection, including:
- n_unique: number of unique values
- unique_ratio: fraction of unique values
- is_integer: whether all values are integers
- has_decimals: whether any decimals are present
- entropy: Shannon entropy
- Various other statistical measures
- Type:
dict
Examples
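>>> import numpy as np
>>> from driada.information.time_series_types import analyze_time_series_type
>>> # Binary spike trains
>>> spikes = np.array([0, 0, 1, 0, 1, 1, 0, 0])
>>> result = analyze_time_series_type(spikes)
>>> result.primary_type
'discrete'
>>> result.subtype
'binary'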
>>> # Circular phase data
>>> phases = np.random.uniform(0, 2*np.pi, 100)
>>> result = analyze_time_series_type(phases)
>>> result.is_circular
True
Notes
The detection algorithm uses multiple statistical tests and heuristics to determine the most likely type. For ambiguous cases (e.g., discretized continuous data), the confidence score helps indicate uncertainty.
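The summary statistics listed under metadata can be approximated with plain NumPy. The following is a minimal sketch and not the library's detection code: `basic_type_stats` is a hypothetical helper, and reporting the entropy in bits is an assumption, since the field description does not state a unit.

```python
import numpy as np


def basic_type_stats(data):
    """Hypothetical helper: compute a few metadata-style statistics."""
    arr = np.asarray(data, dtype=float)
    if arr.size == 0:
        raise ValueError("data is empty")
    if not np.all(np.isfinite(arr)):
        raise ValueError("data contains NaN/Inf values")

    # Unique values and their frequencies
    values, counts = np.unique(arr, return_counts=True)
    p = counts / arr.size

    # True when every sample equals its rounded value
    is_integer = bool(np.all(arr == np.round(arr)))

    return {
        "n_unique": int(values.size),
        "unique_ratio": values.size / arr.size,
        "is_integer": is_integer,
        "has_decimals": not is_integer,
        # Shannon entropy of the empirical distribution (bits, by assumption)
        "entropy": float(-np.sum(p * np.log2(p))),
    }


stats = basic_type_stats([0, 0, 1, 0, 1, 1, 0, 0])
print(stats["n_unique"], stats["unique_ratio"], stats["is_integer"])
```

A low unique_ratio combined with is_integer being True is the sort of evidence that points toward a discrete classification, while many distinct non-integer values point toward a continuous one.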
- property is_discrete: bool
Check if the time series is primarily discrete.
- Returns:
True if the primary type is discrete, False otherwise.
- Return type:
bool
- property is_continuous: bool
Check if the time series is primarily continuous.
- Returns:
True if the primary type is continuous, False otherwise.
- Return type:
bool
- property is_ambiguous: bool
Check if the time series type is ambiguous.
- Returns:
True if the type detection was ambiguous (could not confidently classify as discrete or continuous), False otherwise.
- Return type:
bool
- property is_periodic: bool
Check if the time series has detected periodicity.
- Returns:
True if a valid period was detected (periodicity is not None and is a positive, finite value), False otherwise.
- Return type:
bool
Notes
Returns True only for valid positive finite periods.
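The documented fields and query properties can be pictured as a small dataclass. The sketch below is a stand-in written for illustration only: `TimeSeriesTypeSketch` is a hypothetical name, the default values are assumptions, and the real class may be implemented differently.

```python
import math
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TimeSeriesTypeSketch:
    """Hypothetical stand-in mirroring the documented fields."""
    primary_type: str                       # 'discrete', 'continuous' or 'ambiguous'
    subtype: Optional[str]
    confidence: float
    is_circular: bool = False               # default is an assumption
    circular_period: Optional[float] = None
    periodicity: Optional[float] = None
    metadata: dict = field(default_factory=dict)

    @property
    def is_discrete(self) -> bool:
        return self.primary_type == "discrete"

    @property
    def is_continuous(self) -> bool:
        return self.primary_type == "continuous"

    @property
    def is_ambiguous(self) -> bool:
        return self.primary_type == "ambiguous"

    @property
    def is_periodic(self) -> bool:
        # Only a positive, finite period counts as detected periodicity
        p = self.periodicity
        return p is not None and math.isfinite(p) and p > 0


rhythm = TimeSeriesTypeSketch("continuous", "linear", 0.9, periodicity=12.5)
print(rhythm.is_continuous, rhythm.is_periodic)
```

Callers can branch on these properties to choose an estimator, for example a discrete entropy routine when is_discrete is True.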
- __init__(primary_type, subtype, confidence, is_circular, circular_period, periodicity, metadata)
- Parameters:
- Return type:
None
- driada.information.time_series_types.analyze_time_series_type(data, name=None, confidence_threshold=0.7, min_samples=30, verbose=False)
Analyze and detect the type of a time series using comprehensive statistical analysis.
- Parameters:
data (np.ndarray) – 1D array of time series values. Must contain numeric data.
name (str, optional) – Name of the time series (used for context-aware detection)
confidence_threshold (float) – Minimum confidence for definitive classification
min_samples (int) – Minimum samples for reliable detection
verbose (bool) – Print detection details
- Returns:
Comprehensive type detection results
- Return type:
TimeSeriesType
- Raises:
ValueError – If data is empty, contains non-numeric values, or contains NaN/Inf values.
TypeError – If data cannot be converted to numpy array.
- driada.information.time_series_types.is_discrete_time_series(ts, return_confidence=False)
Return whether a time series is discrete (True) or continuous (False).
- Parameters:
ts (np.ndarray) – Time series data
return_confidence (bool) – If True, also return confidence score
- Returns:
True if discrete, False if continuous. If return_confidence=True, returns (is_discrete, confidence)
- Return type:
bool, or tuple of (bool, float) if return_confidence=True
- Raises:
ValueError – If data is empty, contains non-numeric values, or contains NaN/Inf values.
TypeError – If data cannot be converted to numpy array.
- driada.information.time_series_types.detect_ts_type(ts, return_confidence=False)
Return whether a time series is discrete (True) or continuous (False).
- Parameters:
ts (np.ndarray) – Time series data
return_confidence (bool) – If True, also return confidence score
- Returns:
True if discrete, False if continuous. If return_confidence=True, returns (is_discrete, confidence)
- Return type:
bool, or tuple of (bool, float) if return_confidence=True
- Raises:
ValueError – If data is empty, contains non-numeric values, or contains NaN/Inf values.
TypeError – If data cannot be converted to numpy array.
Helper Functions
- driada.information.info_utils.py_fast_digamma_arr(data)
Compute digamma function for an array of values using fast approximation.
This is a JIT-compiled version that processes arrays efficiently. Uses a series expansion approximation that is accurate for x > 5.
- Parameters:
data (ndarray) – Input array of positive values. All values must be > 0.
- Returns:
Array of digamma values corresponding to input data.
- Return type:
ndarray
- Raises:
None – Invalid inputs (x <= 0) return NaN instead of raising exceptions due to numba JIT compilation constraints.
Notes
The algorithm uses a recurrence relation to shift x to the range where the series expansion is accurate (x > 5), then applies an asymptotic expansion with correction terms.
For x <= 0, the function returns NaN to avoid infinite loops.
- driada.information.info_utils.py_fast_digamma(x)
Compute digamma function for a single value using fast approximation.
This is a JIT-compiled scalar version of the digamma (psi) function. Uses a series expansion approximation that is accurate for x > 5.
- Parameters:
x (float) – Input value. Must be positive (x > 0).
- Returns:
The digamma function value psi(x).
- Return type:
float
- Raises:
None – Invalid inputs (x <= 0) return NaN instead of raising exceptions due to numba JIT compilation constraints.
Notes
The digamma function is the logarithmic derivative of the gamma function:
psi(x) = d/dx log(Gamma(x)) = Gamma'(x) / Gamma(x)
The algorithm uses:
1. Recurrence relation psi(x) = psi(x+1) - 1/x to shift to x > 5
2. Asymptotic expansion for large x with Bernoulli number corrections
For x <= 0, the function returns NaN to avoid infinite loops.
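The two-step scheme described in these notes can be written out in pure Python. This is a sketch under stated assumptions rather than the library's JIT-compiled code: `digamma_sketch` is a hypothetical name, and keeping three Bernoulli correction terms is an assumption about where the series is truncated.

```python
import math


def digamma_sketch(x: float) -> float:
    """Approximate psi(x) for x > 0; return NaN for any other input."""
    if not x > 0.0:
        return math.nan

    result = 0.0
    # Recurrence psi(x) = psi(x + 1) - 1/x: step upward until x > 5
    while x <= 5.0:
        result -= 1.0 / x
        x += 1.0

    # Asymptotic expansion:
    # psi(x) ~ ln(x) - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x
    result -= inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result


# psi(1) equals minus the Euler-Mascheroni constant
print(digamma_sketch(1.0))
```

Returning NaN before the loop matters because a non-positive start value is not a valid input, and a JIT-compiled routine cannot conveniently raise for it.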
- driada.information.info_utils.binary_mi_score(contingency)
Calculate mutual information for discrete variables from contingency table.
Computes the mutual information between two discrete random variables based on their joint probability distribution represented as a contingency table.
- Parameters:
contingency (ndarray of shape (n_classes_x, n_classes_y)) – Contingency table where element [i, j] contains the count of samples with x=i and y=j. Must contain non-negative values.
- Returns:
Mutual information score in nats (natural log base). Returns 0.0 if either variable has only one class.
- Return type:
float
- Raises:
ValueError – If contingency table has wrong dimensions or contains negative values.
TypeError – If contingency is not array-like or contains non-numeric values.
Notes
The mutual information is calculated as:
MI(X,Y) = sum_ij P(x_i, y_j) * log(P(x_i, y_j) / (P(x_i) * P(y_j)))
This implementation:
- Handles sparse contingency tables efficiently by only computing over non-zero entries
- Returns 0 for degenerate cases (a variable with a single class)
- Clips small negative results caused by numerical error to 0
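The formula and the three implementation points above can be illustrated with a short NumPy function. This is a minimal sketch, not the source of binary_mi_score: `mi_from_contingency` is a hypothetical name, and treating an all-zero table as a degenerate case is an assumption.

```python
import numpy as np


def mi_from_contingency(contingency):
    """Hypothetical sketch: mutual information (nats) from a count table."""
    c = np.asarray(contingency, dtype=float)
    if c.ndim != 2:
        raise ValueError("contingency table must be 2-dimensional")
    if (c < 0).any():
        raise ValueError("contingency table must be non-negative")

    total = c.sum()
    # Degenerate cases: a single class on either axis (or no counts at all)
    if c.shape[0] == 1 or c.shape[1] == 1 or total == 0:
        return 0.0

    pxy = c / total           # joint distribution P(x_i, y_j)
    px = pxy.sum(axis=1)      # marginal P(x_i)
    py = pxy.sum(axis=0)      # marginal P(y_j)

    # Only non-zero cells contribute to the sum
    i, j = np.nonzero(pxy)
    vals = pxy[i, j]
    mi = np.sum(vals * (np.log(vals) - np.log(px[i]) - np.log(py[j])))

    # Clip tiny negative results caused by floating-point error
    return float(max(mi, 0.0))


print(mi_from_contingency([[5, 0], [0, 5]]))  # perfectly dependent variables
```

For two perfectly dependent binary variables with equal class counts the result is log(2) nats, and for an independent table it is 0.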
JIT-Optimized Functions
JIT-compiled entropy calculation functions for performance optimization.
Performance characteristics:
entropy_d_jit: Faster for small datasets (< 1000 samples), where the best speedups are seen; slower for large datasets, where numpy’s highly optimized C implementation wins.
joint_entropy_dd_jit: Consistently faster across all dataset sizes (2x-30x speedup) as it avoids the overhead of numpy’s histogram2d function.
The implementations use vectorized operations without explicit loops where possible, leveraging numba’s ability to compile numpy operations efficiently.
- driada.information.entropy_jit.entropy_d_jit(x)
JIT-compiled discrete entropy calculation using vectorized operations.
- Parameters:
x (array-like) – Discrete variable values. Must be numeric and sortable.
- Returns:
Entropy in bits. Returns 0.0 for empty arrays.
- Return type:
float
- Raises:
AttributeError – If x doesn’t have required array attributes (size).
TypeError – If x contains non-numeric or non-sortable values.
Notes
Uses Shannon entropy formula H = -Σ(p*log2(p)) where p are the probabilities of each unique value. The implementation sorts the input array to efficiently count unique values.
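The sort-then-count approach described in this note can be sketched in plain NumPy without the JIT decorator. `entropy_sorted` is a hypothetical name used for illustration; the library's compiled function may differ in detail.

```python
import numpy as np


def entropy_sorted(x):
    """Hypothetical sketch: Shannon entropy (bits) via sorting."""
    x = np.asarray(x)
    if x.size == 0:
        return 0.0

    s = np.sort(x)
    # Positions where the sorted value changes mark the start of a new run
    change = np.flatnonzero(s[1:] != s[:-1]) + 1
    edges = np.concatenate(([0], change, [s.size]))
    counts = np.diff(edges)   # run lengths are the counts per unique value

    p = counts / s.size
    return float(-np.sum(p * np.log2(p)))


print(entropy_sorted([0, 0, 1, 1]))  # two equally likely values
```

Sorting groups equal values into contiguous runs, so the counts fall out of a single pass over the sorted array with no hash table.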
- driada.information.entropy_jit.joint_entropy_dd_jit(x, y)
JIT-compiled joint entropy for two discrete variables.
- Parameters:
x (array-like) – First discrete variable. Must be numeric.
y (array-like) – Second discrete variable. Must be numeric and same length as x.
- Returns:
Joint entropy H(X,Y) in bits.
- Return type:
float
- Raises:
ValueError – If x and y have different lengths or are empty.
AttributeError – If x or y don’t have required array attributes.
TypeError – If x or y contain non-numeric values.
Notes
Creates a joint encoding of (x,y) pairs and calculates entropy of the joint distribution. Uses overflow-safe encoding with automatic fallback to Cantor pairing for large value ranges.
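The joint-encoding idea can be sketched as follows: map each variable to dense integer indices, combine each (x, y) pair into one integer code, then take the entropy of the codes. This is an illustration, not the library's routine. `cantor_pair` and `joint_entropy_sketch` are hypothetical names, and the `use_cantor` flag is an assumption standing in for the automatic fallback, whose trigger condition is not documented here.

```python
import numpy as np


def cantor_pair(a, b):
    """Cantor pairing: a bijection from pairs of non-negative ints to ints."""
    return (a + b) * (a + b + 1) // 2 + b


def joint_entropy_sketch(x, y, use_cantor=False):
    """Hypothetical sketch: joint entropy H(X,Y) in bits."""
    x = np.asarray(x)
    y = np.asarray(y)
    if x.size == 0 or x.size != y.size:
        raise ValueError("x and y must be non-empty and of equal length")

    # Re-index each variable to 0..k-1 so the codes stay small
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    xi = xi.ravel().astype(np.int64)
    yi = yi.ravel().astype(np.int64)

    if use_cantor:
        codes = cantor_pair(xi, yi)
    else:
        # Row-major encoding: unique per pair because yi < yi.max() + 1
        codes = xi * (yi.max() + 1) + yi

    _, counts = np.unique(codes, return_counts=True)
    p = counts / codes.size
    return float(-np.sum(p * np.log2(p)))


print(joint_entropy_sketch([0, 0, 1, 1], [0, 1, 0, 1]))  # four distinct pairs
```

Both encodings assign a distinct integer to each distinct pair, so they give the same entropy; they differ only in how large the intermediate codes become.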