Intrinsic Dimensionality Estimation
This module provides nonlinear methods for estimating the intrinsic dimensionality of data manifolds.
Functions
- driada.dimensionality.intrinsic.nn_dimension(data=None, k=2, graph_method='sklearn', precomputed_graph=None)[source]
Estimate intrinsic dimension using the k-NN algorithm.
This method estimates the intrinsic dimensionality by analyzing the ratios of distances to successive nearest neighbors using maximum likelihood estimation. When k=2, this implements the TWO-NN algorithm [1], where “TWO” refers to using the 1st and 2nd nearest neighbors to compute distance ratios.
The algorithm works by:
For each point, computing the ratio r_i = d_2 / d_1 (for k=2), where d_1 and d_2 are distances to the 1st and 2nd nearest neighbors
Using maximum likelihood to estimate dimension from the distribution of these ratios, which follows a specific form dependent on the intrinsic dimension
For k>2, extending to use multiple successive neighbor ratios (d_2/d_1, d_3/d_2, …, d_k/d_{k-1}) for more robust estimation
- Parameters:
data (array-like of shape (n_samples, n_features), optional) – The input data matrix where rows are samples and columns are features. Either data or precomputed_graph must be provided.
k (int, default=2) – The number of nearest neighbors to consider. k=2 gives the TWO-NN algorithm. Higher values can provide more robust estimates.
graph_method ({'sklearn', 'pynndescent'}, default='sklearn') – Method to use for nearest neighbor graph construction: - ‘sklearn’: Uses scikit-learn’s NearestNeighbors (exact) - ‘pynndescent’: Uses PyNNDescent (approximate, faster for large datasets) Only used if data is provided.
precomputed_graph (tuple of (indices, distances), optional) – Precomputed k-NN graph as a tuple of: - indices: array of shape (n_samples, k+1) with neighbor indices - distances: array of shape (n_samples, k+1) with neighbor distances The first column (index 0) should contain self-references with distance 0. Either data or precomputed_graph must be provided.
- Returns:
d – The estimated intrinsic dimension of the dataset.
- Return type:
Notes
The method is based on the principle that in a d-dimensional manifold, the ratio of distances to successive nearest neighbors follows a specific distribution that depends on d.
References
- [1] Facco, E., d’Errico, M., Rodriguez, A., & Laio, A. (2017).
Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1), 12140. https://doi.org/10.1038/s41598-017-11873-y
Examples
>>> import numpy as np >>> from driada.dimensionality.intrinsic import nn_dimension >>> >>> # Example 1: Clean low-dimensional data >>> np.random.seed(42) >>> # Generate 2D Swiss roll >>> n = 1000 >>> t = 1.5 * np.pi * (1 + 2 * np.random.rand(n)) >>> height = 30 * np.random.rand(n) >>> X = np.zeros((n, 3)) >>> X[:, 0] = t * np.cos(t) >>> X[:, 1] = height >>> X[:, 2] = t * np.sin(t) >>> d_est = nn_dimension(X, k=10) >>> print(f"Swiss roll (true dim=2): {d_est:.2f}") Swiss roll (true dim=2): 1.94 >>> >>> # Example 2: Impact of ambient noise >>> # The TWO-NN estimator (k=2) can overestimate dimension when data >>> # has noise in ambient dimensions. This is because noise affects >>> # the ratios of nearest neighbor distances. >>> >>> # Pure 2D data: accurate estimate >>> theta = np.random.uniform(0, 2*np.pi, 1000) >>> r = np.sqrt(np.random.uniform(0, 1, 1000)) # Uniform distribution on disk >>> data_2d = np.column_stack([r * np.cos(theta), r * np.sin(theta)]) >>> d_est = nn_dimension(data_2d, k=2) >>> print(f"Pure 2D disk: {d_est:.2f}") Pure 2D disk: 2.01 >>> >>> # Same data with small noise in extra dimensions: overestimates >>> # This happens because even small noise changes neighbor relationships >>> noise = 0.01 * np.random.randn(1000, 8) >>> data_10d = np.hstack([data_2d, noise]) >>> d_est = nn_dimension(data_10d, k=2) >>> print(f"2D disk in 10D with noise: {d_est:.2f}") # Overestimates due to noise 2D disk in 10D with noise: 4.73 >>> >>> # For noisy high-dimensional data: >>> # 1. Use higher k values (more robust but may underestimate) >>> # 2. Consider denoising or PCA preprocessing >>> # 3. Use alternative methods like correlation_dimension
- Raises:
ValueError – If inputs are invalid, sample size too small, or duplicate points exist.
- driada.dimensionality.intrinsic.correlation_dimension(data, r_min=None, r_max=None, n_bins=20)[source]
Estimate the correlation dimension using the Grassberger-Procaccia algorithm.
The correlation dimension measures how the number of neighbor pairs scales with distance, providing another estimate of intrinsic dimensionality.
- Parameters:
data (array-like of shape (n_samples, n_features)) – The input data matrix. For large datasets, consider subsampling as this method requires O(n²) memory.
r_min (float, optional) – Minimum distance to consider. If None, uses 1st percentile of distances.
r_max (float, optional) – Maximum distance to consider. If None, uses 50th percentile of distances.
n_bins (int, default=20) – Number of distance bins for the correlation sum.
- Returns:
dimension – The estimated correlation dimension. Returns NaN if estimation fails.
- Return type:
- Raises:
ValueError – If data has fewer than 2 samples or invalid parameters.
Notes
This method has O(n²) memory complexity. For datasets with more than 10,000 samples, consider using a subsample or alternative methods.
References
Grassberger, P., & Procaccia, I. (1983). Characterization of strange attractors. Physical Review Letters, 50(5), 346.
Examples
>>> import numpy as np >>> from driada.dimensionality import correlation_dimension >>> >>> # Generate points on a 2D circle embedded in 3D >>> t = np.linspace(0, 2*np.pi, 500) >>> data = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)]) >>> d_est = correlation_dimension(data, n_bins=15) >>> print(f"Correlation dimension: {d_est:.2f}") # Should be close to 1 Correlation dimension: 1.07 >>> >>> # For noisy data, specify scaling range >>> np.random.seed(42) # For reproducible results >>> noisy_data = data + 0.01 * np.random.randn(*data.shape) >>> d_est = correlation_dimension(noisy_data, r_min=0.05, r_max=0.5) >>> print(f"Correlation dimension: {d_est:.2f}") Correlation dimension: 1.09
- driada.dimensionality.intrinsic.geodesic_dimension(data=None, graph=None, k=15, mode='full', factor=2, dim_step=0.1)[source]
Estimate intrinsic dimension using geodesic distances (De Granata et al. 2016).
This method estimates the intrinsic dimensionality by analyzing the distribution of geodesic distances computed as shortest paths on a k-nearest neighbor graph. The method focuses on the behavior of the distance distribution around its maximum.
Warning
This method is sensitive to the k parameter. Small k values (k < 15) often lead to underestimation due to disconnected or sparse graphs that poorly approximate geodesic distances. For reliable estimates, use k ≥ 15-25, with larger values needed for complex manifolds like Swiss rolls or S-curves.
- Parameters:
data (array-like of shape (n_samples, n_features), optional) – The input data matrix. Either data or graph must be provided.
graph (sparse matrix of shape (n_samples, n_samples), optional) – Precomputed graph adjacency matrix with edge weights as distances. If provided, data is ignored. Either data or graph must be provided.
k (int, default=15) – Number of nearest neighbors for graph construction (if data provided).
mode ({'full', 'fast'}, default='full') – Computation mode: - ‘full’: Use all pairwise geodesic distances - ‘fast’: Use a random subset (1/factor of points)
factor (int, default=2) – Subsampling factor for ‘fast’ mode. Uses n/factor random points.
dim_step (float, default=0.1) – Step size for dimension grid search. Smaller values give more precise estimates but take longer to compute.
- Returns:
dimension – The estimated intrinsic dimension of the dataset/manifold.
- Return type:
Notes
This method implements the approach from: Granata, D., Carnevale, V. (2016). Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets. Scientific Reports, 6, 31377.
The method is particularly robust for complex manifold geometries and can handle small sample sizes effectively.
References
- [1] Granata, D., Carnevale, V. (2016). Accurate Estimation of the Intrinsic
Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets. Scientific Reports, 6, 31377. https://doi.org/10.1038/srep31377
Examples
>>> import numpy as np >>> from driada.dimensionality.intrinsic import geodesic_dimension >>> # Generate 2D Swiss roll >>> from sklearn.datasets import make_swiss_roll >>> np.random.seed(42) # For reproducible results >>> data, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42) >>> d_est = geodesic_dimension(data, k=15) >>> print(f"Estimated dimension: {d_est:.2f}") # Should be close to 2 Estimated dimension: 1.90
- Raises:
ValueError – If inputs are invalid or graph is too disconnected.
RuntimeWarning – If graph is disconnected (infinite distances found).
Usage Examples
k-Nearest Neighbors Method
from driada.dimensionality import nn_dimension
import numpy as np
# Data on a nonlinear manifold
theta = np.random.uniform(0, 2*np.pi, 1000)
phi = np.random.uniform(0, np.pi, 1000)
# Points on a sphere (2D manifold in 3D)
data = np.column_stack([
np.sin(phi) * np.cos(theta),
np.sin(phi) * np.sin(theta),
np.cos(phi)
])
# Estimate intrinsic dimension
dim_knn = nn_dimension(data, k=5)
print(f"k-NN dimension estimate: {dim_knn:.2f}") # Should be ~2
Correlation Dimension
from driada.dimensionality import correlation_dimension
# Estimate correlation dimension
dim_corr = correlation_dimension(data, r_min=0.01, r_max=1.0, n_bins=20)
print(f"Correlation dimension: {dim_corr:.2f}")
Theory
Intrinsic dimensionality methods estimate the dimension of the manifold on which data lies:
k-NN Method: Based on the growth rate of nearest neighbor distances:
Correlation Dimension: Measures scaling of correlation integral:
where \(C(r)\) is the fraction of point pairs within distance \(r\).
MLE Method: Maximum likelihood estimation assuming locally uniform distribution.
Choosing a Method
nn_dimension: Good general-purpose method, robust to noise
correlation_dimension: Classic method, requires careful scale selection
geodesic_dimension: Accounts for manifold curvature, slower
For neural data, nn_dimension with k=5-10 often provides reliable estimates.