Intrinsic Dimensionality Estimation

This module provides nonlinear methods for estimating the intrinsic dimensionality of data manifolds.

Functions

driada.dimensionality.intrinsic.nn_dimension(data=None, k=2, graph_method='sklearn', precomputed_graph=None)[source]

Estimate intrinsic dimension using the k-NN algorithm.

This method estimates the intrinsic dimensionality by analyzing the ratios of distances to successive nearest neighbors using maximum likelihood estimation. When k=2, this implements the TWO-NN algorithm [1], where “TWO” refers to using the 1st and 2nd nearest neighbors to compute distance ratios.

The algorithm works by:

For each point, computing the ratio r_i = d_2 / d_1 (for k=2), where d_1 and d_2 are distances to the 1st and 2nd nearest neighbors
Using maximum likelihood to estimate dimension from the distribution of these ratios, which follows a specific form dependent on the intrinsic dimension
For k>2, extending to use multiple successive neighbor ratios (d_2/d_1, d_3/d_2, …, d_k/d_{k-1}) for more robust estimation

Parameters:

data (array-like of shape (n_samples, n_features), optional) – The input data matrix where rows are samples and columns are features. Either data or precomputed_graph must be provided.
k (int, default=2) – The number of nearest neighbors to consider. k=2 gives the TWO-NN algorithm. Higher values can provide more robust estimates.
graph_method ({'sklearn', 'pynndescent'}, default='sklearn') – Method to use for nearest neighbor graph construction: - ‘sklearn’: Uses scikit-learn’s NearestNeighbors (exact) - ‘pynndescent’: Uses PyNNDescent (approximate, faster for large datasets) Only used if data is provided.
precomputed_graph (tuple of (indices, distances), optional) – Precomputed k-NN graph as a tuple of: - indices: array of shape (n_samples, k+1) with neighbor indices - distances: array of shape (n_samples, k+1) with neighbor distances The first column (index 0) should contain self-references with distance 0. Either data or precomputed_graph must be provided.

Returns:

d – The estimated intrinsic dimension of the dataset.

Return type:

float

Notes

The method is based on the principle that in a d-dimensional manifold, the ratio of distances to successive nearest neighbors follows a specific distribution that depends on d.

References

[1] Facco, E., d’Errico, M., Rodriguez, A., & Laio, A. (2017).: Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1), 12140. https://doi.org/10.1038/s41598-017-11873-y

Examples

>>> import numpy as np
>>> from driada.dimensionality.intrinsic import nn_dimension
>>>
>>> # Example 1: Clean low-dimensional data
>>> np.random.seed(42)
>>> # Generate 2D Swiss roll
>>> n = 1000
>>> t = 1.5 * np.pi * (1 + 2 * np.random.rand(n))
>>> height = 30 * np.random.rand(n)
>>> X = np.zeros((n, 3))
>>> X[:, 0] = t * np.cos(t)
>>> X[:, 1] = height
>>> X[:, 2] = t * np.sin(t)
>>> d_est = nn_dimension(X, k=10)
>>> print(f"Swiss roll (true dim=2): {d_est:.2f}")
Swiss roll (true dim=2): 1.94
>>>
>>> # Example 2: Impact of ambient noise
>>> # The TWO-NN estimator (k=2) can overestimate dimension when data
>>> # has noise in ambient dimensions. This is because noise affects
>>> # the ratios of nearest neighbor distances.
>>>
>>> # Pure 2D data: accurate estimate
>>> theta = np.random.uniform(0, 2*np.pi, 1000)
>>> r = np.sqrt(np.random.uniform(0, 1, 1000))  # Uniform distribution on disk
>>> data_2d = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
>>> d_est = nn_dimension(data_2d, k=2)
>>> print(f"Pure 2D disk: {d_est:.2f}")
Pure 2D disk: 2.01
>>>
>>> # Same data with small noise in extra dimensions: overestimates
>>> # This happens because even small noise changes neighbor relationships
>>> noise = 0.01 * np.random.randn(1000, 8)
>>> data_10d = np.hstack([data_2d, noise])
>>> d_est = nn_dimension(data_10d, k=2)
>>> print(f"2D disk in 10D with noise: {d_est:.2f}")  # Overestimates due to noise
2D disk in 10D with noise: 4.73
>>>
>>> # For noisy high-dimensional data:
>>> # 1. Use higher k values (more robust but may underestimate)
>>> # 2. Consider denoising or PCA preprocessing
>>> # 3. Use alternative methods like correlation_dimension

Raises:: ValueError – If inputs are invalid, sample size too small, or duplicate points exist.

driada.dimensionality.intrinsic.correlation_dimension(data, r_min=None, r_max=None, n_bins=20)[source]

Estimate the correlation dimension using the Grassberger-Procaccia algorithm.

The correlation dimension measures how the number of neighbor pairs scales with distance, providing another estimate of intrinsic dimensionality.

Parameters:

data (array-like of shape (n_samples, n_features)) – The input data matrix. For large datasets, consider subsampling as this method requires O(n²) memory.
r_min (float, optional) – Minimum distance to consider. If None, uses 1st percentile of distances.
r_max (float, optional) – Maximum distance to consider. If None, uses 50th percentile of distances.
n_bins (int, default=20) – Number of distance bins for the correlation sum.

Returns:

dimension – The estimated correlation dimension. Returns NaN if estimation fails.

Return type:

float

Raises:

ValueError – If data has fewer than 2 samples or invalid parameters.

Notes

This method has O(n²) memory complexity. For datasets with more than 10,000 samples, consider using a subsample or alternative methods.

References

Grassberger, P., & Procaccia, I. (1983). Characterization of strange attractors. Physical Review Letters, 50(5), 346.

Examples

>>> import numpy as np
>>> from driada.dimensionality import correlation_dimension
>>>
>>> # Generate points on a 2D circle embedded in 3D
>>> t = np.linspace(0, 2*np.pi, 500)
>>> data = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
>>> d_est = correlation_dimension(data, n_bins=15)
>>> print(f"Correlation dimension: {d_est:.2f}")  # Should be close to 1
Correlation dimension: 1.07
>>>
>>> # For noisy data, specify scaling range
>>> np.random.seed(42)  # For reproducible results
>>> noisy_data = data + 0.01 * np.random.randn(*data.shape)
>>> d_est = correlation_dimension(noisy_data, r_min=0.05, r_max=0.5)
>>> print(f"Correlation dimension: {d_est:.2f}")
Correlation dimension: 1.09

driada.dimensionality.intrinsic.geodesic_dimension(data=None, graph=None, k=15, mode='full', factor=2, dim_step=0.1)[source]

Estimate intrinsic dimension using geodesic distances (De Granata et al. 2016).

This method estimates the intrinsic dimensionality by analyzing the distribution of geodesic distances computed as shortest paths on a k-nearest neighbor graph. The method focuses on the behavior of the distance distribution around its maximum.

Warning

This method is sensitive to the k parameter. Small k values (k < 15) often lead to underestimation due to disconnected or sparse graphs that poorly approximate geodesic distances. For reliable estimates, use k ≥ 15-25, with larger values needed for complex manifolds like Swiss rolls or S-curves.

Parameters:

data (array-like of shape (n_samples, n_features), optional) – The input data matrix. Either data or graph must be provided.
graph (sparse matrix of shape (n_samples, n_samples), optional) – Precomputed graph adjacency matrix with edge weights as distances. If provided, data is ignored. Either data or graph must be provided.
k (int, default=15) – Number of nearest neighbors for graph construction (if data provided).
mode ({'full', 'fast'}, default='full') – Computation mode: - ‘full’: Use all pairwise geodesic distances - ‘fast’: Use a random subset (1/factor of points)
factor (int, default=2) – Subsampling factor for ‘fast’ mode. Uses n/factor random points.
dim_step (float, default=0.1) – Step size for dimension grid search. Smaller values give more precise estimates but take longer to compute.

Returns:

dimension – The estimated intrinsic dimension of the dataset/manifold.

Return type:

float

Notes

This method implements the approach from: Granata, D., Carnevale, V. (2016). Accurate Estimation of the Intrinsic Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets. Scientific Reports, 6, 31377.

The method is particularly robust for complex manifold geometries and can handle small sample sizes effectively.

References

[1] Granata, D., Carnevale, V. (2016). Accurate Estimation of the Intrinsic: Dimension Using Graph Distances: Unraveling the Geometric Complexity of Datasets. Scientific Reports, 6, 31377. https://doi.org/10.1038/srep31377

Examples

>>> import numpy as np
>>> from driada.dimensionality.intrinsic import geodesic_dimension
>>> # Generate 2D Swiss roll
>>> from sklearn.datasets import make_swiss_roll
>>> np.random.seed(42)  # For reproducible results
>>> data, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=42)
>>> d_est = geodesic_dimension(data, k=15)
>>> print(f"Estimated dimension: {d_est:.2f}")  # Should be close to 2
Estimated dimension: 1.90

Raises:

ValueError – If inputs are invalid or graph is too disconnected.
RuntimeWarning – If graph is disconnected (infinite distances found).

Usage Examples

k-Nearest Neighbors Method

from driada.dimensionality import nn_dimension
import numpy as np

# Data on a nonlinear manifold
theta = np.random.uniform(0, 2*np.pi, 1000)
phi = np.random.uniform(0, np.pi, 1000)
# Points on a sphere (2D manifold in 3D)
data = np.column_stack([
    np.sin(phi) * np.cos(theta),
    np.sin(phi) * np.sin(theta),
    np.cos(phi)
])

# Estimate intrinsic dimension
dim_knn = nn_dimension(data, k=5)
print(f"k-NN dimension estimate: {dim_knn:.2f}")  # Should be ~2

Correlation Dimension

from driada.dimensionality import correlation_dimension

# Estimate correlation dimension
dim_corr = correlation_dimension(data, r_min=0.01, r_max=1.0, n_bins=20)
print(f"Correlation dimension: {dim_corr:.2f}")

Theory

Intrinsic dimensionality methods estimate the dimension of the manifold on which data lies:

k-NN Method: Based on the growth rate of nearest neighbor distances:

\[\hat{d} = \left[ \frac{1}{n} \sum_{i=1}^n \log \frac{r_{k}(i)}{r_{j}(i)} \right]^{-1} \log \frac{k}{j}\]

Correlation Dimension: Measures scaling of correlation integral:

\[D_c = \lim_{r \to 0} \frac{\log C(r)}{\log r}\]

where \(C(r)\) is the fraction of point pairs within distance \(r\).

MLE Method: Maximum likelihood estimation assuming locally uniform distribution.

Choosing a Method

nn_dimension: Good general-purpose method, robust to noise
correlation_dimension: Classic method, requires careful scale selection
geodesic_dimension: Accounts for manifold curvature, slower

For neural data, nn_dimension with k=5-10 often provides reliable estimates.