Linear Dimensionality Methods
This module contains linear methods for dimensionality estimation, primarily based on Principal Component Analysis (PCA).
Functions
- driada.dimensionality.linear.pca_dimension(data, threshold=0.95, standardize=True)
Estimate dimensionality using Principal Component Analysis (PCA).
This method determines the number of principal components needed to explain a specified fraction of the total variance in the data.
- Parameters:
data (array-like of shape (n_samples, n_features)) – The input data matrix where rows are samples and columns are features.
threshold (float, default=0.95) – The fraction of total variance that should be explained by the selected components. Must be between 0 and 1.
standardize (bool, default=True) – Whether to standardize the data (zero mean, unit variance) before applying PCA. Standardization is recommended when features have different scales.
- Returns:
n_components – The number of principal components needed to explain the specified fraction of variance.
- Return type:
int
Notes
This method provides a linear estimate of dimensionality based on variance explained. It may overestimate the intrinsic dimension for nonlinear manifolds but is useful for understanding the effective linear dimension of the data.
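The overestimation mentioned above can be seen on a toy case. The following is a minimal sketch using plain NumPy rather than driada itself; it assumes the dimension is taken as the smallest number of covariance eigenvalues whose cumulative share reaches the threshold.

```python
import numpy as np

# 500 points on the unit circle: a 1-D manifold embedded in 2-D.
t = np.linspace(0.0, 2.0 * np.pi, 500, endpoint=False)
circle = np.column_stack([np.cos(t), np.sin(t)])

# Eigenvalues of the sample covariance, sorted from largest to smallest.
eigvals = np.linalg.eigvalsh(np.cov(circle, rowvar=False))[::-1]
cumulative = np.cumsum(eigvals) / eigvals.sum()

# Smallest number of components whose cumulative variance reaches 95%.
linear_dim = int(np.searchsorted(cumulative, 0.95) + 1)
print(linear_dim)
```

Both directions carry about half of the variance, so the linear estimate is 2 even though a single angle parameterizes every point.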
Examples
>>> import numpy as np
>>> from driada.dimensionality import pca_dimension
>>> # Generate data with 3 effective dimensions
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(1000, 10)
>>> data[:, 3:] *= 0.1  # Make dimensions 4-10 have small variance
>>> n_dim = pca_dimension(data, threshold=0.95)
>>> print(f"Number of components for 95% variance: {n_dim}")
Number of components for 95% variance: 10
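The example reports 10 components even though only three features have large variance. One plausible reading is that standardize=True rescales every feature to unit variance, which removes the scale difference. The sketch below illustrates this with plain NumPy. `pca_dimension_sketch` is a hypothetical helper written for this illustration, assuming a covariance eigen-decomposition; it is not driada's implementation.

```python
import numpy as np

def pca_dimension_sketch(data, threshold=0.95, standardize=True):
    """Hypothetical helper: components needed to reach `threshold` variance."""
    x = np.asarray(data, dtype=float)
    x = x - x.mean(axis=0)
    if standardize:
        std = x.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # leave constant features unscaled
        x = x / std
    # Eigenvalues of the covariance matrix, largest first.
    eigvals = np.linalg.eigvalsh(np.cov(x, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    # First position where the cumulative share reaches the threshold.
    n = int(np.searchsorted(cumulative, threshold) + 1)
    return min(n, eigvals.size)

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 10))
data[:, 3:] *= 0.1  # features 4-10 get small variance

dim_std = pca_dimension_sketch(data, standardize=True)
dim_raw = pca_dimension_sketch(data, standardize=False)
print(dim_std, dim_raw)
```

Under these assumptions the standardized estimate is 10 and the unstandardized one is 3, so standardization is worth disabling when differences in feature scale are themselves meaningful.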
- driada.dimensionality.linear.pca_dimension_profile(data, thresholds=None, standardize=True)
Compute PCA dimensionality estimates for multiple variance thresholds.
This function provides a profile of how many components are needed for different levels of variance explained, which can help in understanding the distribution of variance across components.
- Parameters:
data (array-like of shape (n_samples, n_features)) – The input data matrix.
thresholds (array-like, optional) – Variance thresholds to evaluate. If None, uses [0.5, 0.8, 0.9, 0.95, 0.99].
standardize (bool, default=True) – Whether to standardize the data before PCA.
- Returns:
profile – Dictionary containing:
  - 'thresholds': array of variance thresholds
  - 'n_components': array of components needed for each threshold
  - 'explained_variance_ratio': variance explained by each component
  - 'cumulative_variance': cumulative variance explained
- Return type:
dict
Examples
>>> import numpy as np
>>> from driada.dimensionality import pca_dimension_profile
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(1000, 20)
>>> profile = pca_dimension_profile(data)
>>> for thresh, n_comp in zip(profile['thresholds'], profile['n_components']):
...     print(f"{thresh*100:.0f}% variance: {n_comp} components")
50% variance: 9 components
80% variance: 16 components
90% variance: 18 components
95% variance: 19 components
99% variance: 20 components
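To make the relationship between the profile entries concrete, here is a small sketch that builds a dictionary with the same keys from a known eigenvalue spectrum. `profile_from_spectrum` is a hypothetical helper for illustration only, and how driada computes these arrays internally is an assumption.

```python
import numpy as np

def profile_from_spectrum(eigvals, thresholds=(0.5, 0.8, 0.9, 0.95, 0.99)):
    """Hypothetical helper: build a profile-like dict from eigenvalues."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratio = eigvals / eigvals.sum()
    cumulative = np.cumsum(ratio)
    thresholds = np.asarray(thresholds, dtype=float)
    # Index of the first cumulative value reaching each threshold, as a count.
    n_components = np.searchsorted(cumulative, thresholds) + 1
    n_components = np.minimum(n_components, eigvals.size)
    return {
        "thresholds": thresholds,
        "n_components": n_components,
        "explained_variance_ratio": ratio,
        "cumulative_variance": cumulative,
    }

profile = profile_from_spectrum([8.0, 4.0, 2.0, 1.0, 1.0])
print(profile["cumulative_variance"])  # 0.5, 0.75, 0.875, 0.9375, 1.0
print(profile["n_components"])         # 1, 3, 4, 5, 5
```

Each entry of 'n_components' is the position at which 'cumulative_variance' first reaches the matching threshold, so the whole profile follows from a single eigen-decomposition.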
- driada.dimensionality.linear.effective_rank(data, standardize=True)
Compute the effective rank (Roy & Vetterli, 2007) of the data matrix.
The effective rank is a continuous measure of dimensionality based on the entropy of the normalized eigenvalue distribution.
- Parameters:
data (array-like of shape (n_samples, n_features)) – The input data matrix.
standardize (bool, default=True) – Whether to standardize the data before computation.
- Returns:
eff_rank – The effective rank of the data matrix.
- Return type:
float
References
Roy, O., & Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference (pp. 606-610). IEEE.
Examples
>>> import numpy as np
>>> from driada.dimensionality import effective_rank
>>> # Full rank matrix
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(100, 10)
>>> eff_r = effective_rank(data)
>>> print(f"Effective rank: {eff_r:.2f}")
Effective rank: 9.88
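The entropy-based definition can be sketched directly on a spectrum. The helper below, `effective_rank_from_spectrum`, is hypothetical and follows the eigenvalue wording used on this page: the exponential of the Shannon entropy of the eigenvalues normalized to sum to one. The cited paper states the definition in terms of singular values, so driada's exact normalization is an assumption here.

```python
import math

def effective_rank_from_spectrum(eigvals):
    """Hypothetical helper: exp of the Shannon entropy of normalized eigenvalues."""
    # Zero eigenvalues are dropped because 0 * log(0) is taken as 0.
    positive = [v for v in eigvals if v > 0]
    total = sum(positive)
    probs = [v / total for v in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

flat = effective_rank_from_spectrum([1.0, 1.0, 1.0, 1.0])         # evenly spread variance
single = effective_rank_from_spectrum([5.0, 0.0, 0.0, 0.0])       # one dominant direction
skewed = effective_rank_from_spectrum([8.0, 4.0, 2.0, 1.0, 1.0])  # decaying spectrum
print(f"{flat:.2f} {single:.2f} {skewed:.2f}")
```

A flat spectrum gives an effective rank equal to the number of eigenvalues, a single nonzero eigenvalue gives 1, and a decaying spectrum falls in between, which is why the measure is continuous rather than an integer count.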
Usage Examples
PCA-based Dimensionality Estimation
from driada.dimensionality import pca_dimension, pca_dimension_profile
import numpy as np
# Generate sample data
data = np.random.randn(1000, 50) # 1000 samples, 50 features
# Find dimension explaining 90% variance
dim_90 = pca_dimension(data, threshold=0.90)
print(f"Dimensions for 90% variance: {dim_90}")
# Get full PCA profile
profile = pca_dimension_profile(data)
print(f"Cumulative variance: {profile['cumulative_variance'][:10]}")
Effective Rank
import numpy as np
from driada.dimensionality import effective_rank
# effective_rank expects the data matrix (n_samples, n_features), not a covariance matrix
data = np.random.randn(1000, 50)  # 1000 samples, 50 features
eff_rank = effective_rank(data)
print(f"Effective rank: {eff_rank:.2f}")
Implementation Details
The linear dimensionality methods are based on the eigenvalue decomposition of the data covariance matrix. The eigenvalue spectrum shows how the variance is distributed across orthogonal directions, which gives an estimate of the effective linear dimensionality:
- PCA dimension: the number of components needed to explain a given fraction of the total variance
- Effective rank: an entropy-based measure of how evenly the variance is spread across the eigenvalues
- PCA profile: the component counts for several thresholds, together with the per-component and cumulative explained variance
These methods are computationally efficient and provide interpretable results, making them suitable for initial dimensionality assessment.