Linear Dimensionality Methods

This module contains linear methods for dimensionality estimation, primarily based on Principal Component Analysis (PCA).

Functions

driada.dimensionality.linear.pca_dimension(data, threshold=0.95, standardize=True)

Estimate dimensionality using Principal Component Analysis (PCA).

This method determines the number of principal components needed to explain a specified fraction of the total variance in the data.

Parameters:

data (array-like of shape (n_samples, n_features)) – The input data matrix where rows are samples and columns are features.
threshold (float, default=0.95) – The fraction of total variance that should be explained by the selected components. Must be between 0 and 1.
standardize (bool, default=True) – Whether to standardize the data (zero mean, unit variance) before applying PCA. Standardization is recommended when features have different scales.

Returns:

n_components – The number of principal components needed to explain the specified fraction of variance.

Return type:

int

Notes

This method provides a linear estimate of dimensionality based on variance explained. It may overestimate the intrinsic dimension for nonlinear manifolds but is useful for understanding the effective linear dimension of the data.
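
The overestimate mentioned in this note can be illustrated with a circle, a one-dimensional manifold embedded in two dimensions. The snippet below is a minimal NumPy sketch, not the library's implementation: `variance_threshold_dimension` is a hypothetical helper that counts how many covariance eigenvalues are needed to reach the variance threshold.

```python
import numpy as np

def variance_threshold_dimension(data, threshold=0.95):
    """Hypothetical helper: smallest k whose top-k eigenvalues reach `threshold`."""
    # Eigenvalues of the sample covariance matrix, sorted in descending order
    eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # Index of the first cumulative ratio that reaches the threshold, plus one
    k = np.searchsorted(cumulative, threshold) + 1
    return int(min(k, eigvals.size))

# Points on a circle: one intrinsic dimension, two ambient dimensions
angles = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

n_linear = variance_threshold_dimension(circle, threshold=0.95)
print(n_linear)  # 2, although the circle is intrinsically one-dimensional
```

Both coordinates carry half of the variance, so no single linear component can reach the 95% threshold.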

Examples

>>> import numpy as np
>>> from driada.dimensionality import pca_dimension
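>>> # Generate data in which only the first 3 features have large variance
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(1000, 10)
>>> data[:, 3:] *= 0.1  # Give features 4-10 a small variance
>>> # standardize=True (the default) rescales every feature to unit variance,
>>> # so the small-variance features count as much as the large ones
>>> n_dim = pca_dimension(data, threshold=0.95)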
>>> print(f"Number of components for 95% variance: {n_dim}")
Number of components for 95% variance: 10

driada.dimensionality.linear.pca_dimension_profile(data, thresholds=None, standardize=True)

Compute PCA dimensionality estimates for multiple variance thresholds.

This function provides a profile of how many components are needed for different levels of variance explained, which can help in understanding the distribution of variance across components.

Parameters:

data (array-like of shape (n_samples, n_features)) – The input data matrix.
thresholds (array-like, optional) – Variance thresholds to evaluate. If None, uses [0.5, 0.8, 0.9, 0.95, 0.99].
standardize (bool, default=True) – Whether to standardize the data before PCA.

Returns:

profile – Dictionary containing:

'thresholds': array of variance thresholds
'n_components': array of components needed for each threshold
'explained_variance_ratio': variance explained by each component
'cumulative_variance': cumulative variance explained

Return type:

dict

Examples

>>> import numpy as np
>>> from driada.dimensionality import pca_dimension_profile
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(1000, 20)
>>> profile = pca_dimension_profile(data)
>>> for thresh, n_comp in zip(profile['thresholds'], profile['n_components']):
...     print(f"{thresh*100:.0f}% variance: {n_comp} components")
50% variance: 9 components
80% variance: 16 components
90% variance: 18 components
95% variance: 19 components
99% variance: 20 components
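
To make the relationship between the eigenvalue spectrum and the profile concrete, the following sketch builds a profile directly from a known set of eigenvalues. `profile_from_spectrum` is a hypothetical helper written for illustration. It only mirrors the dictionary keys documented above and is not taken from the library's source.

```python
import numpy as np

def profile_from_spectrum(eigenvalues, thresholds=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """Hypothetical helper: component counts per threshold for a given spectrum."""
    eigenvalues = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    thresholds = np.asarray(thresholds, dtype=float)
    ratio = eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    # For each threshold, find the first component whose cumulative ratio reaches it
    n_components = np.searchsorted(cumulative, thresholds) + 1
    n_components = np.minimum(n_components, eigenvalues.size)
    return {
        "thresholds": thresholds,
        "n_components": n_components,
        "explained_variance_ratio": ratio,
        "cumulative_variance": cumulative,
    }

# Spectrum with cumulative variance 0.4, 0.7, 0.9, 1.0
profile = profile_from_spectrum([4.0, 3.0, 2.0, 1.0])
print(profile["n_components"].tolist())  # [2, 3, 3, 4, 4]
```

A spectrum that decays quickly yields small counts even at high thresholds, whereas the nearly flat spectrum of independent Gaussian features, as in the example above, requires almost all components.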

driada.dimensionality.linear.effective_rank(data, standardize=True)

Compute the effective rank (Roy & Vetterli, 2007) of the data matrix.

The effective rank is a continuous measure of dimensionality based on the entropy of the normalized eigenvalue distribution.

Parameters:

data (array-like of shape (n_samples, n_features)) – The input data matrix.
standardize (bool, default=True) – Whether to standardize the data before computation.

Returns:

eff_rank – The effective rank of the data matrix.

Return type:

float

References

Roy, O., & Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference (pp. 606-610). IEEE.

Examples

>>> import numpy as np
>>> from driada.dimensionality import effective_rank
>>> # Full rank matrix
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(100, 10)
>>> eff_r = effective_rank(data)
>>> print(f"Effective rank: {eff_r:.2f}")
Effective rank: 9.88
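
The entropy-based definition can be sketched in a few lines. This is an illustrative version that assumes the effective rank is the exponential of the Shannon entropy of the normalized eigenvalues. `effective_rank_from_spectrum` is a hypothetical helper, and the library may differ in details such as using singular values or a different normalization.

```python
import numpy as np

def effective_rank_from_spectrum(eigenvalues):
    """Hypothetical helper: exp of the Shannon entropy of the normalized spectrum."""
    eigenvalues = np.asarray(eigenvalues, dtype=float)
    p = eigenvalues / eigenvalues.sum()  # normalize to a probability distribution
    p = p[p > 0]                         # treat 0 * log(0) as 0
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))

# Flat spectrum: all four directions contribute equally
flat = effective_rank_from_spectrum([1.0, 1.0, 1.0, 1.0])
# Peaked spectrum: a single direction carries all the variance
peaked = effective_rank_from_spectrum([1.0, 0.0, 0.0, 0.0])
print(f"{flat:.2f} {peaked:.2f}")  # 4.00 1.00
```

Intermediate spectra give non-integer values between these extremes, which is why the measure is continuous rather than a component count.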

Usage Examples

PCA-based Dimensionality Estimation

from driada.dimensionality import pca_dimension, pca_dimension_profile
import numpy as np

# Generate sample data
data = np.random.randn(1000, 50)  # 1000 samples, 50 features

# Find dimension explaining 90% variance
dim_90 = pca_dimension(data, threshold=0.90)
print(f"Dimensions for 90% variance: {dim_90}")

# Get full PCA profile
profile = pca_dimension_profile(data)
print(f"Cumulative variance: {profile['cumulative_variance'][:10]}")

Effective Rank

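from driada.dimensionality import effective_rank
import numpy as np

# Generate sample data
data = np.random.randn(1000, 50)  # 1000 samples, 50 features

# Pass the data matrix itself (n_samples, n_features), not its covariance matrix
eff_rank = effective_rank(data)
print(f"Effective rank: {eff_rank:.2f}")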

Implementation Details

The linear dimensionality methods are based on the eigenvalue decomposition of the data covariance matrix. The key idea is that the eigenvalue spectrum shows how the variance is distributed across orthogonal directions, and therefore how many linear dimensions the data effectively occupy:

PCA dimension: Number of components needed to explain the specified fraction of the total variance
Effective rank: Entropy-based measure using the normalized eigenvalue distribution
PCA profile: Component counts at several thresholds, together with the per-component and cumulative explained variance

These methods are computationally efficient and provide interpretable results, making them suitable for initial dimensionality assessment.
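
The effect of the standardize parameter on the eigenvalue spectrum can be checked with plain NumPy. The sketch below is an illustration under the assumption that standardization means subtracting the mean and dividing by the sample standard deviation. In that case the covariance matrix of the standardized data equals the correlation matrix of the raw data, so differences in feature scale no longer influence the spectrum.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 4))
data[:, 2:] *= 0.1  # two features with much smaller variance

# Spectrum of the raw covariance matrix: dominated by the two large-variance features
raw_eigvals = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

# Standardize each feature to zero mean and unit sample variance
standardized = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
std_eigvals = np.linalg.eigvalsh(np.cov(standardized, rowvar=False))[::-1]

# Spectrum of the correlation matrix of the raw data, for comparison
corr_eigvals = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

print(np.allclose(std_eigvals, corr_eigvals))       # True
print(raw_eigvals[:2].sum() / raw_eigvals.sum())     # fraction explained by the top 2 raw components
```

Without standardization, two components explain nearly all of the variance here. With standardization, the eigenvalues sum to the number of features and are spread almost evenly, which is consistent with the pca_dimension example above returning all 10 components.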