Data Structures
Core data structures for dimensionality reduction.
MVData
- class driada.dim_reduction.data.MVData(data, labels=None, distmat=None, rescale_rows=False, data_name=None, downsampling=None, verbose=False, allow_zero_columns=False)[source]
Main class for multivariate data storage & processing.
This class encapsulates multivariate data and provides methods for preprocessing, distance computation, graph construction, and embedding generation. Data is stored as a matrix with features as rows and samples as columns.
- Parameters:
data (array-like) – Data matrix with shape (n_features, n_samples)
labels (array-like, optional) – Labels for each sample
distmat (array-like, optional) – Precomputed distance matrix
rescale_rows (bool, default=False) – Whether to rescale each row to [0, 1]
data_name (str, optional) – Name for the dataset
downsampling (int, optional) – Downsampling factor
verbose (bool, default=False) – Whether to print progress messages
allow_zero_columns (bool, default=False) – Whether to allow columns with all zero values. If False, raises ValueError when zero columns are detected.
- data
Processed data matrix with shape (n_features, n_samples).
- Type:
np.ndarray
- labels
Labels for each sample.
- Type:
np.ndarray
- distmat
Distance matrix if provided.
- Type:
np.ndarray or None
- Raises:
ValueError – If data contains zero columns and allow_zero_columns=False.
ValueError – From rescale() if rescale_rows=True and the data format is invalid.
ValueError – If the length of labels doesn’t match the number of points after downsampling.
ValueError – If the distance matrix shape doesn’t match (n_points, n_points).
Notes
Data is downsampled by taking every ds-th column, where ds is the downsampling factor
If rescale_rows=True, each row is rescaled to [0,1] range
Labels default to zeros if not provided
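The preprocessing steps listed in these notes can be sketched in plain Python. This is a dependency-free illustration rather than the library's implementation: `preprocess` is a hypothetical helper, the min-max formula used for rescaling is an assumption, and so is the choice to map constant rows to 0.0.

```python
def preprocess(data, labels=None, downsampling=None, rescale_rows=False):
    """Hypothetical helper mirroring the documented preprocessing steps.

    `data` is a list of rows, i.e. shape (n_features, n_samples).
    """
    ds = downsampling or 1
    # Downsample: keep every ds-th column (sample) of every row.
    rows = [list(row[::ds]) for row in data]

    if rescale_rows:
        rescaled = []
        for row in rows:
            lo, hi = min(row), max(row)
            span = hi - lo
            # Assumption: min-max rescaling; constant rows map to 0.0.
            rescaled.append([(v - lo) / span if span else 0.0 for v in row])
        rows = rescaled

    n_samples = len(rows[0])
    # Labels default to zeros, one per remaining sample.
    if labels is None:
        labels = [0] * n_samples
    elif len(labels) != n_samples:
        raise ValueError("labels length must match the number of samples")
    return rows, labels


data = [[0, 1, 2, 3, 4, 5],
        [10, 20, 30, 40, 50, 60]]
rows, labels = preprocess(data, downsampling=2, rescale_rows=True)
print(rows)    # two rows, three columns each, values in [0, 1]
print(labels)
```

Note that labels are checked against the number of samples that remain after downsampling, which matches the ValueError condition documented above.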
- __init__(data, labels=None, distmat=None, rescale_rows=False, data_name=None, downsampling=None, verbose=False, allow_zero_columns=False)[source]
Initialize MVData object with multi-dimensional data.
- Parameters:
data (array-like) – Data matrix with shape (n_features, n_samples)
labels (array-like, optional) – Labels for each sample. Defaults to zeros if not provided
distmat (array-like, optional) – Pre-computed distance matrix with shape (n_points, n_points)
rescale_rows (bool, default=False) – Whether to rescale each row to [0,1] range
data_name (str, optional) – Name for the dataset
downsampling (int, optional) – Downsampling factor
verbose (bool, default=False) – Whether to print progress messages
allow_zero_columns (bool, default=False) – Whether to allow columns with all zero values
- median_filter(window)[source]
Apply median filter to each row of the data.
Median filtering is useful for removing impulse noise while preserving edges in the signal. Operates row-wise on the data.
- Parameters:
window (int or array-like) – Size of the median filter window. If int, uses a window of that size. Must be odd. See scipy.signal.medfilt documentation for valid window specifications.
- Raises:
ValueError – From scipy.signal.medfilt if window size is invalid.
ImportError – If scipy.signal is not available.
Notes
Modifies self.data in-place
Handles both sparse and dense matrices appropriately
For sparse matrices, converts to dense for filtering then back to sparse
Warning: Converting large sparse matrices to dense may cause memory issues
The window parameter is passed directly to scipy.signal.medfilt
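To make the row-wise behaviour concrete, here is a small plain-Python sketch of a median filter with zero padding at the edges, which is how scipy.signal.medfilt treats boundaries. The helper names are illustrative only; the library itself delegates to SciPy.

```python
from statistics import median


def median_filter_row(row, window):
    """Median-filter one row with zero padding at both ends."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    padded = [0] * half + list(row) + [0] * half
    return [median(padded[i:i + window]) for i in range(len(row))]


def median_filter(data, window):
    """Apply the filter independently to every row (feature)."""
    return [median_filter_row(row, window) for row in data]


noisy = [[1, 1, 9, 1, 1],   # single spike in the middle
         [0, 0, 0, 5, 5]]   # step edge
filtered = median_filter(noisy, window=3)
print(filtered)
```

The spike in the first row is removed while the step in the second row stays where it was, which is the "impulse noise removed, edges preserved" property described above.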
- corr_mat(axis=0)[source]
Compute correlation matrix.
- Parameters:
axis (int, default=0) – Axis along which to compute correlations:
- 0: correlations between rows (features)
- 1: correlations between columns (samples/timepoints)
- Returns:
Correlation matrix
- Return type:
np.ndarray
- get_distmat(m_params=None)[source]
Compute pairwise distance matrix.
- Parameters:
m_params (dict or str, optional) – Metric specification:
- If dict: metric parameters with a ‘metric_name’ key and optional metric-specific parameters
- If str: the metric name directly
- If None: defaults to ‘euclidean’
- Returns:
Distance matrix of shape (n_samples, n_samples)
- Return type:
np.ndarray
- Raises:
ValueError – If metric name is invalid or metric parameters are incompatible.
MemoryError – If dataset is too large for pairwise distance computation.
Notes
The metric ‘l2’ is automatically converted to ‘euclidean’ for scipy compatibility
Distances are computed on transposed data (between columns/samples)
Result is stored in self.distmat
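The two conventions in these notes (how m_params is interpreted, and that distances are taken between columns) can be illustrated with a short plain-Python sketch. `resolve_metric` and `euclidean_distmat` are hypothetical names; the real method supports more metrics than the Euclidean case shown here.

```python
import math


def resolve_metric(m_params=None):
    """Return (metric_name, extra_params) from a dict, a str, or None."""
    if m_params is None:
        name, extra = "euclidean", {}
    elif isinstance(m_params, str):
        name, extra = m_params, {}
    else:
        extra = dict(m_params)
        name = extra.pop("metric_name")
    # The notes above state that 'l2' is converted to 'euclidean'.
    if name == "l2":
        name = "euclidean"
    return name, extra


def euclidean_distmat(data):
    """Pairwise Euclidean distances between columns (samples) of `data`."""
    samples = list(zip(*data))  # transpose: one tuple per sample
    return [[math.dist(a, b) for b in samples] for a in samples]


data = [[0.0, 3.0, 0.0],   # feature 1 for three samples
        [0.0, 4.0, 8.0]]   # feature 2 for three samples
distmat = euclidean_distmat(data)
print(resolve_metric("l2"))
print(distmat)
```

With 2 features and 3 samples the result is a 3 x 3 matrix, in line with the documented (n_samples, n_samples) shape.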
- get_embedding(e_params=None, g_params=None, m_params=None, kwargs=None, method=None, **method_kwargs)[source]
Get embedding using specified method.
- Parameters:
e_params (dict, optional) – Embedding parameters (legacy format)
g_params (dict, optional) – Graph parameters (legacy format)
m_params (dict, optional) – Metric parameters (legacy format)
kwargs (dict, optional) – Additional kwargs for the embedding method
method (str, optional) – Method name for simplified API (e.g., ‘pca’, ‘umap’)
**method_kwargs – Additional parameters when using simplified API
- Returns:
The computed embedding
- Return type:
Embedding
- Raises:
ValueError – If neither ‘method’ nor ‘e_params’ is provided.
ValueError – If the method requires a proximity graph but g_params is not provided.
ValueError – If the method requires weights but m_params is not provided.
Exception – If the embedding method is unknown.
Exception – If the method requires a distance matrix but none is available.
Examples
>>> import numpy as np
>>> from driada.dim_reduction.data import MVData
>>> # Create data: 20 features, 100 samples
>>> data = np.random.randn(20, 100)
>>> mvdata = MVData(data)
>>>
>>> # Get PCA embedding
>>> emb = mvdata.get_embedding(method='pca', dim=3, verbose=False)
>>> type(emb).__name__
'Embedding'
>>> emb.coords.shape  # (3 dimensions, 100 samples)
(3, 100)
- get_proximity_graph(m_params, g_params)[source]
Construct proximity graph from the data.
Creates a graph where nodes are data points and edges connect nearby points according to the specified method.
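- Parameters:
m_params (dict) – Metric parameters, e.g. {‘metric_name’: ‘euclidean’}
g_params (dict) – Graph construction parameters, including ‘g_method_name’, e.g. {‘g_method_name’: ‘knn’, ‘nn’: 10, ‘weighted’: False}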
- Returns:
Graph object capturing local neighborhood structure.
- Return type:
ProximityGraph
- Raises:
Exception – If g_method_name is not in GRAPH_CONSTRUCTION_METHODS.
See also
ProximityGraph: The graph construction class.
Embedding
- class driada.dim_reduction.embedding.Embedding(init_data, init_distmat, labels, params, g=None)[source]
Low-dimensional representation of high-dimensional data.
This is an internal class typically created by MVData.get_embedding(). It provides a unified interface for various dimensionality reduction methods including linear (PCA), non-linear manifold learning (Isomap, LLE, UMAP), spectral methods (Laplacian Eigenmaps, Diffusion Maps), and neural network-based approaches (autoencoders, VAEs).
- Parameters:
init_data (ndarray) – Input data matrix of shape (n_features, n_samples).
init_distmat (ndarray or None) – Precomputed distance matrix of shape (n_samples, n_samples). Used by methods like MDS when available.
labels (array-like) – Labels for data points, used for visualization and evaluation.
params (dict) – Filtered embedding parameters from e_param_filter(). Must contain:
- ‘e_method’ : DRMethod object from METHODS_DICT
- ‘e_method_name’ : str, method name (e.g., ‘pca’, ‘umap’)
- ‘dim’ : int, target embedding dimension
May also contain method-specific keys:
- ‘min_dist’ : float, for UMAP only
- ‘dm_alpha’ : float, for dmaps/auto_dmaps only
- ‘dm_t’ : int, for dmaps/auto_dmaps only
g (ProximityGraph, optional) – Precomputed proximity graph. Required for graph-based methods (LE, Isomap, LLE, etc.). If None, must be created before building.
- graph
The proximity graph used by graph-based methods.
- Type:
ProximityGraph or None
- coords
Embedding coordinates of shape (dim, n_samples). None until build() is called.
- Type:
ndarray or None
- labels
Data point labels.
- Type:
array-like
- init_data
Original high-dimensional data.
- Type:
ndarray
- init_distmat
Precomputed distance matrix if provided.
- Type:
ndarray or None
- transformation_matrix
For linear methods (PCA), the transformation matrix to project new data.
- Type:
ndarray or None
- nnmodel
For neural network methods, the trained model.
- Type:
nn.Module or None
- reducer_
The underlying reducer object (sklearn model, etc.) for potential reuse.
- Type:
- Plus all parameters from params dict set as attributes via setattr.
- create_ae_embedding_(**kwargs)[source]
Autoencoder with optional correlation/MI losses (deprecated).
- create_flexible_ae_embedding_(**kwargs)[source]
Flexible autoencoder with modular loss composition.
Notes
Graph-based methods require a ProximityGraph to be provided or created
Neural methods (AE/VAE) require PyTorch to be installed
Some methods (MVU) require additional dependencies (cvxpy)
Coordinates are stored as (dim, n_samples) for consistency
The auto_ prefix indicates methods using external libraries
Examples
Direct instantiation (internal use):
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> import numpy as np
>>> data = np.random.randn(100, 500)  # 100 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> # Parameters after e_param_filter
>>> params = {
...     'e_method': METHODS_DICT['pca'],
...     'e_method_name': 'pca',
...     'dim': 2
... }
>>> embedding = Embedding(data, None, labels, params)
>>> embedding.build(kwargs={'verbose': False})
>>> embedding.coords.shape
(2, 500)
Typical usage via MVData (recommended):
>>> from driada.dim_reduction.data import MVData
>>> mvdata = MVData(data, labels=labels)
>>> # PCA embedding
>>> embedding = mvdata.get_embedding(method='pca', dim=2, verbose=False)
>>> embedding.coords.shape
(2, 500)
>>> # UMAP with custom parameters
>>> embedding = mvdata.get_embedding(
...     method='umap',
...     dim=2,
...     n_neighbors=15,
...     min_dist=0.1
... )
- __init__(init_data, init_distmat, labels, params, g=None)[source]
Initialize Embedding object.
- Parameters:
init_data (array-like) – Initial data matrix with shape (n_features, n_samples)
init_distmat (array-like or None) – Pre-computed distance matrix, optional
labels (array-like) – Labels for each data point
params (dict) – Dictionary of embedding parameters, must include ‘e_method’, ‘e_method_name’, and method-specific parameters
g (ProximityGraph, optional) – Pre-computed proximity graph. If None, may be computed later depending on the embedding method
- Raises:
TypeError – If g is provided but not a ProximityGraph instance.
AttributeError – If required params keys are missing during attribute access.
Notes
All keys in params dict are set as instance attributes via setattr. The params are filtered through e_param_filter before use.
Examples
>>> # Typically created via MVData.get_embedding(), not directly
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> data = np.random.randn(100, 50)  # 100 features, 50 samples
>>> labels = np.repeat([0, 1], 25)
>>> params = {'e_method': METHODS_DICT['pca'], 'e_method_name': 'pca', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
- build(kwargs=None)[source]
Build the embedding using the specified method.
Dynamically calls the appropriate embedding creation method based on the embedding method name (e.g., ‘pca’ calls create_pca_embedding_).
- Parameters:
kwargs (dict, optional) – Additional keyword arguments passed to the specific embedding method. For neural network methods (AE/VAE), this includes training parameters.
- Returns:
Modifies self.coords in-place with embedding coordinates.
- Return type:
None
- Raises:
AttributeError – If e_method_name is invalid or the corresponding method is not found.
AttributeError – If self.graph is None for methods requiring it.
Exception – If the graph is disconnected and the method cannot handle it.
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> data = np.random.randn(20, 100)  # 20 features, 100 samples
>>> labels = np.zeros(100)
>>> params = {'e_method': METHODS_DICT['pca'], 'e_method_name': 'pca', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> emb.build(kwargs={'verbose': False})  # For PCA, no graph needed
>>> emb.coords.shape
(2, 100)
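The name-based dispatch that build() is described as performing can be pictured with a toy class. This is a sketch of the general pattern using getattr, not the library's code: `ToyEmbedding` and its placeholder result are invented for illustration.

```python
class ToyEmbedding:
    """Toy stand-in that mimics name-based dispatch; not the real class."""

    def __init__(self, e_method_name):
        self.e_method_name = e_method_name
        self.coords = None

    def build(self, kwargs=None):
        kwargs = kwargs or {}
        # 'pca' -> 'create_pca_embedding_'; getattr raises AttributeError
        # when no method with that name exists.
        method = getattr(self, f"create_{self.e_method_name}_embedding_")
        method(**kwargs)

    def create_pca_embedding_(self, verbose=True):
        if verbose:
            print("computing PCA")
        self.coords = "pca-coords"  # placeholder for a (dim, n_samples) array


emb = ToyEmbedding("pca")
emb.build(kwargs={"verbose": False})
print(emb.coords)

try:
    ToyEmbedding("unknown").build()
except AttributeError as err:
    print("dispatch failed:", err)
```

A lookup of this kind explains why an invalid method name surfaces as AttributeError rather than ValueError.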
- create_pca_embedding_(verbose=True)[source]
Create PCA (Principal Component Analysis) embedding.
Linear dimensionality reduction using orthogonal transformation to convert data into linearly uncorrelated components ordered by variance.
- Parameters:
verbose (bool, default=True) – Whether to print progress messages.
- Returns:
Sets self.coords to shape (dim, n_samples).
- Return type:
None
- Raises:
AttributeError – If self.dim is not set.
ValueError – If PCA fails (e.g., dim > n_features).
Notes
Sets self.coords to shape (dim, n_samples) and stores the PCA object in self.reducer_ for potential reuse or analysis.
Data is transposed before fitting since init_data is (n_features, n_samples) while sklearn expects (n_samples, n_features).
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['pca'], 'e_method_name': 'pca', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> emb.create_pca_embedding_(verbose=False)
>>> emb.coords.shape
(2, 500)
- create_isomap_embedding_()[source]
Create Isomap embedding using geodesic distances.
Non-linear dimensionality reduction through isometric mapping. Preserves geodesic distances between all points by first computing shortest paths on the neighborhood graph, then applying MDS.
- Returns:
Sets self.coords to shape (dim, n_samples).
- Return type:
None
- Raises:
AttributeError – If self.graph, self.dim, or self.nn not set.
MemoryError – If converting sparse matrix to dense fails.
Notes
Requires a proximity graph. Uses Dijkstra’s algorithm to compute shortest paths, then applies classical MDS to the geodesic distance matrix.
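The first stage described here, turning a neighborhood graph into geodesic distances, can be sketched with a textbook Dijkstra implementation in plain Python. The graph, the `dijkstra` helper, and the dict-of-dicts representation are all illustrative assumptions; the MDS stage is omitted.

```python
import heapq


def dijkstra(adj, source):
    """Shortest-path lengths from `source` in a weighted graph.

    `adj` maps each node to a dict {neighbor: edge_length}.
    """
    dist = {node: float("inf") for node in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist


# Four points along a curve; only consecutive points are neighbors.
adj = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
geodesic = [[dijkstra(adj, i)[j] for j in adj] for i in adj]
for row in geodesic:
    print(row)
```

The resulting matrix measures distance along the graph rather than straight through the ambient space, and it is this matrix that classical MDS then embeds.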
Warning: Converts sparse adjacency to dense matrix which may use excessive memory for large datasets.
The Isomap object is stored in self.reducer_ for potential reuse.
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from driada.dim_reduction.graph import ProximityGraph
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['isomap'], 'e_method_name': 'isomap', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> m_params = {'metric_name': 'euclidean'}
>>> g_params = {'g_method_name': 'knn', 'nn': 10, 'weighted': False}
>>> emb.graph = ProximityGraph(data, m_params, g_params)
>>> emb.create_isomap_embedding_()
>>> emb.coords.shape[0]  # Number of dimensions
2
- create_mds_embedding_()[source]
Create MDS (Multi-Dimensional Scaling) embedding.
Classical MDS finds a low-dimensional representation that preserves pairwise distances between points. Works with either a pre-computed distance matrix or calculates distances from the data.
- Returns:
Sets self.coords to shape (dim, n_samples).
- Return type:
None
- Raises:
ImportError – If sklearn.manifold.MDS not available.
AttributeError – If self.dim not set.
Notes
Sets self.coords to shape (dim, n_samples) containing the MDS coordinates. If init_distmat is provided, uses it as a precomputed distance matrix. Otherwise, computes Euclidean distances from init_data.
The algorithm minimizes the stress function: stress = sum((d_ij - ||x_i - x_j||)^2) where d_ij are the input distances.
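As a worked instance of the stress formula, the sketch below evaluates it for a tiny configuration in plain Python. `raw_stress` is a hypothetical helper. It assumes each unordered pair is counted once, and it takes one coordinate tuple per point, unlike the library's (dim, n_samples) layout.

```python
import math


def raw_stress(distmat, coords):
    """Sum of squared differences between input and embedded distances.

    `distmat[i][j]` is the input distance d_ij; `coords[i]` is the
    low-dimensional position x_i of point i. Each pair is counted once.
    """
    n = len(coords)
    return sum(
        (distmat[i][j] - math.dist(coords[i], coords[j])) ** 2
        for i in range(n)
        for j in range(i + 1, n)
    )


# An equilateral triangle (all input distances 1) cannot be placed on a
# line without distortion, so the stress stays positive.
triangle = [[0.0, 1.0, 1.0],
            [1.0, 0.0, 1.0],
            [1.0, 1.0, 0.0]]
on_a_line = [(0.0,), (1.0,), (0.5,)]
stress = raw_stress(triangle, on_a_line)
print(stress)
```

Here the pair (0, 1) is reproduced exactly, while the two remaining pairs are each off by 0.5, so each contributes 0.25 to the total.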
Uses a fixed random_state=42 for reproducibility; this may change in a future release. The MDS object is stored in self.reducer_ for potential reuse.
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from scipy.spatial.distance import pdist, squareform
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> distmat = squareform(pdist(data.T))
>>> params = {'e_method': METHODS_DICT['mds'], 'e_method_name': 'mds', 'dim': 2}
>>> emb = Embedding(data, distmat, labels, params)
>>> emb.create_mds_embedding_()
>>> emb.coords.shape
(2, 500)
- create_mvu_embedding_()[source]
Create Maximum Variance Unfolding (MVU) embedding.
Non-linear dimensionality reduction that “unfolds” a manifold by maximizing variance while preserving local distances. Solves a semidefinite programming problem to find the optimal embedding.
Sets self.coords to shape (dim, n_samples) embedding coordinates and self.reducer_ to the fitted MVU object.
- Raises:
ImportError – If cvxpy is not installed.
AttributeError – If self.graph or required attributes are not set.
ValueError – If self.dim is not positive or if solver fails to converge.
Notes
Requires cvxpy for convex optimization. Uses the SCS solver with reasonable defaults for most datasets. The embedding preserves local neighborhood structure while maximizing global variance.
The optimization problem maximizes tr(Y'Y) subject to:
- Local distance preservation: ||y_i - y_j||² = ||x_i - x_j||² for neighbors
- Centering: ∑y_i = 0
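For readers who want to see why this is a semidefinite program, the problem is commonly rewritten in terms of the Gram matrix $K$ with entries $K_{ij} = y_i \cdot y_j$. The form below is the standard textbook formulation of MVU; whether the library builds exactly this program is an assumption.

```latex
\begin{aligned}
\max_{K}\quad & \operatorname{tr}(K) \\
\text{s.t.}\quad & K \succeq 0, \\
& \textstyle\sum_{i,j} K_{ij} = 0, \\
& K_{ii} - 2K_{ij} + K_{jj} = \lVert x_i - x_j \rVert^2
  \quad \text{for neighbors } i, j.
\end{aligned}
```

The correspondence is term by term: $\operatorname{tr}(K) = \sum_i \lVert y_i \rVert^2$ is the variance objective, $K_{ii} - 2K_{ij} + K_{jj} = \lVert y_i - y_j \rVert^2$ is the local distance constraint, and $\sum_{i,j} K_{ij} = \lVert \sum_i y_i \rVert^2 = 0$ is the centering constraint. The embedding coordinates are then typically read off from the leading eigenvectors of the optimal $K$.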
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from driada.dim_reduction.graph import ProximityGraph
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['mvu'], 'e_method_name': 'mvu', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> m_params = {'metric_name': 'euclidean'}
>>> g_params = {'g_method_name': 'knn', 'nn': 10, 'weighted': False}
>>> emb.graph = ProximityGraph(data, m_params, g_params)
>>> emb.create_mvu_embedding_()
>>> emb.coords.shape
(2, 500)
- create_lle_embedding_()[source]
Create Locally Linear Embedding (LLE).
Non-linear dimensionality reduction that assumes data lies on a locally linear manifold. Each point is reconstructed from its neighbors, and these weights are preserved in lower dimensions.
Sets self.coords to shape (dim, n_samples) embedding coordinates and self.reducer_ to the fitted sklearn LLE object.
- Raises:
AttributeError – If self.graph or required attributes are not set.
ValueError – If self.dim is not positive or n_neighbors is invalid.
Notes
Uses sklearn’s LocallyLinearEmbedding implementation. The algorithm:
1. Finds k nearest neighbors for each point
2. Computes weights to reconstruct each point from neighbors
3. Finds low-d embedding preserving reconstruction weights
Time complexity is roughly O(D log(k) N log(N)) for the neighbor search plus O(D N k³) for the weight computation, for N points in D dimensions with k neighbors.
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from driada.dim_reduction.graph import ProximityGraph
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['lle'], 'e_method_name': 'lle', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> m_params = {'metric_name': 'euclidean'}
>>> g_params = {'g_method_name': 'knn', 'nn': 10, 'weighted': False}
>>> emb.graph = ProximityGraph(data, m_params, g_params)
>>> emb.create_lle_embedding_()
>>> emb.coords.shape
(2, 500)
References
Roweis & Saul (2000). Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323-2326.
- create_hlle_embedding_()[source]
Create Hessian Locally Linear Embedding (HLLE).
Modified version of LLE that uses Hessian-based regularization to better preserve local geometric structure. Particularly effective for manifolds with varying curvature.
Sets self.coords to shape (dim, n_samples) embedding coordinates and self.reducer_ to the fitted sklearn HLLE object.
- Raises:
AttributeError – If self.graph or required attributes are not set.
ValueError – If self.dim is not positive or if the constraint n_neighbors > n_components * (n_components + 3) / 2 is not satisfied.
Notes
Requires n_neighbors > n_components * (n_components + 3) / 2 for the Hessian estimation. More computationally intensive than standard LLE but often produces better embeddings for complex manifolds.
The algorithm estimates the Hessian at each point to capture local curvature, then finds an embedding that preserves this structure.
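The neighbor constraint is easy to get wrong by one, so a quick check before building the graph can help. The helpers below are a hypothetical pre-flight check written for this note, not part of the library.

```python
def min_hlle_neighbors(n_components):
    """Smallest integer n_neighbors satisfying
    n_neighbors > n_components * (n_components + 3) / 2."""
    # n * (n + 3) is always even, so integer division is exact.
    return n_components * (n_components + 3) // 2 + 1


def check_hlle_neighbors(n_neighbors, n_components):
    """Raise ValueError if the HLLE neighbor constraint is violated."""
    needed = min_hlle_neighbors(n_components)
    if n_neighbors < needed:
        raise ValueError(
            f"HLLE with dim={n_components} needs n_neighbors >= {needed}, "
            f"got {n_neighbors}"
        )


for dim in (1, 2, 3):
    print(dim, min_hlle_neighbors(dim))

check_hlle_neighbors(n_neighbors=10, n_components=2)  # passes silently
```

For a 2D embedding the bound evaluates to more than 5, i.e. at least 6 neighbors, and for 3D it is at least 10.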
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from driada.dim_reduction.graph import ProximityGraph
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['hlle'], 'e_method_name': 'hlle', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> # For 2D embedding, need at least 6 neighbors
>>> m_params = {'metric_name': 'euclidean'}
>>> g_params = {'g_method_name': 'knn', 'nn': 10, 'weighted': False}
>>> emb.graph = ProximityGraph(data, m_params, g_params)
>>> emb.create_hlle_embedding_()
References
Donoho, D. & Grimes, C. (2003). Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. PNAS.
- create_le_embedding_()[source]
Create Laplacian Eigenmaps (LE) embedding.
Spectral embedding method that uses eigenvectors of the graph Laplacian to embed nodes while preserving local neighborhood structure. Particularly effective for data lying on low-dimensional manifolds.
Sets self.coords to shape (dim, n_samples) containing the embedding.
- Raises:
AttributeError – If self.graph or required attributes are not set.
ValueError – If self.dim is not positive or if the graph is disconnected (multiple eigenvalues equal to 1).
Notes
Uses the transition matrix eigenvectors (more stable than Laplacian). Normalizes eigenvectors by node degree to ensure proper embedding.
The embedding minimizes: sum_ij W_ij ||y_i - y_j||^2 subject to orthogonality constraints.
The algorithm:
1. Computes transition matrix P = D^(-1)A
2. Finds top eigenvectors of P (excluding trivial)
3. Normalizes by degree for proper embedding
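Step 1 above, and the reason one eigenvector is called trivial, can be seen on a three-node graph. The plain-Python sketch below is illustrative only: the helper names are invented, and the eigendecomposition and degree normalization steps are left out.

```python
def transition_matrix(adj):
    """Row-normalize an adjacency matrix: P = D^(-1) A."""
    P = []
    for row in adj:
        degree = sum(row)
        if degree == 0:
            raise ValueError("isolated node: degree is zero")
        P.append([a / degree for a in row])
    return P


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Path graph 0 - 1 - 2 (unweighted, symmetric adjacency).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
P = transition_matrix(A)
ones = matvec(P, [1, 1, 1])
print(P)      # every row sums to 1
print(ones)   # the constant vector is left unchanged by P
```

Because every row of P sums to 1, the constant vector is always an eigenvector with eigenvalue 1. It carries no information about the data, which is why it is excluded from the embedding.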
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from driada.dim_reduction.graph import ProximityGraph
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['le'], 'e_method_name': 'le', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> m_params = {'metric_name': 'euclidean'}
>>> g_params = {'g_method_name': 'knn', 'nn': 10, 'weighted': False}
>>> emb.graph = ProximityGraph(data, m_params, g_params)
>>> emb.create_le_embedding_()
>>> emb.coords.shape[0]  # Number of dimensions
2
References
Belkin, M. & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation.
- create_auto_le_embedding_()[source]
Create Laplacian Eigenmaps embedding using sklearn’s implementation.
Alternative implementation using sklearn’s spectral_embedding function. More robust and handles edge cases better than the manual implementation.
Sets self.coords to shape (dim, n_samples).
- Raises:
AttributeError – If self.graph or required attributes are not set.
ValueError – If self.dim is not positive.
MemoryError – If the graph is too large to convert to dense format.
Notes
Uses normalized Laplacian by default for better numerical stability. Automatically handles disconnected graphs by dropping the first eigenvector.
Warning: Converts sparse adjacency matrix to dense format, which may cause memory issues for graphs with more than ~10,000 nodes.
This is the recommended method for most use cases unless you need fine control over the eigendecomposition process.
Examples
>>> import numpy as np
>>> from driada.dim_reduction.dr_base import METHODS_DICT
>>> from driada.dim_reduction.embedding import Embedding
>>> from driada.dim_reduction.graph import ProximityGraph
>>> data = np.random.randn(10, 500)  # 10 features, 500 samples
>>> labels = np.random.randint(0, 3, 500)
>>> params = {'e_method': METHODS_DICT['auto_le'], 'e_method_name': 'auto_le', 'dim': 2}
>>> emb = Embedding(data, None, labels, params)
>>> m_params = {'metric_name': 'euclidean'}
>>> g_params = {'g_method_name': 'knn', 'nn': 10, 'weighted': False}
>>> emb.graph = ProximityGraph(data, m_params, g_params)
>>> emb.create_auto_le_embedding_()
>>> emb.coords.shape[0]  # Number of dimensions
2
- create_dmaps_embedding_()[source]
Create diffusion maps embedding.
Implements the standard diffusion maps algorithm with alpha normalization for anisotropic diffusion and diffusion time parameter t.
- Raises:
AttributeError – If graph is not built or required attributes are missing.
ValueError – If dim is invalid, graph has isolated nodes, or eigendecomposition fails.
Notes
The algorithm performs the following steps:
1. Apply alpha normalization to adjacency matrix
2. Create Markov transition matrix
3. Compute eigendecomposition
4. Scale eigenvectors by eigenvalues^t
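The four steps can be written out in a few lines of NumPy. This is a generic sketch of the Coifman-Lafon construction under stated assumptions, not the library's code. The function name, the default alpha=1.0, and the use of a symmetric conjugate matrix for the eigendecomposition are all choices made here for illustration.

```python
import numpy as np


def diffusion_map(K, dim, alpha=1.0, t=1):
    """Diffusion-map coordinates of shape (dim, n_samples).

    K is a symmetric, non-negative affinity matrix without isolated nodes.
    """
    K = np.asarray(K, dtype=float)
    # 1. Alpha normalization: K_alpha = D^(-alpha) K D^(-alpha).
    d = K.sum(axis=1)
    K_alpha = K / np.outer(d ** alpha, d ** alpha)
    # 2. Markov matrix P = D_alpha^(-1) K_alpha, handled through its
    #    symmetric conjugate S = D_alpha^(-1/2) K_alpha D_alpha^(-1/2),
    #    which has the same eigenvalues as P.
    d_alpha = K_alpha.sum(axis=1)
    S = K_alpha / np.sqrt(np.outer(d_alpha, d_alpha))
    # 3. Eigendecomposition, sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert eigenvectors of S into right eigenvectors of P.
    psi = eigvecs / np.sqrt(d_alpha)[:, None]
    # 4. Drop the trivial pair (eigenvalue 1) and scale by eigenvalues^t.
    coords = (psi[:, 1:dim + 1] * eigvals[1:dim + 1] ** t).T
    return coords, eigvals


# Gaussian affinities between six points on a line.
x = np.arange(6, dtype=float)
K = np.exp(-(x[:, None] - x[None, :]) ** 2)
coords, eigvals = diffusion_map(K, dim=2, alpha=1.0, t=2)
print(coords.shape)
print(round(float(eigvals[0]), 6))
```

Working with the symmetric matrix S lets the sketch use `np.linalg.eigh`, which returns real eigenvalues. Larger t shrinks the contribution of eigenvectors with small eigenvalues, which is what the diffusion time parameter controls.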
Future enhancement: Variable bandwidth diffusion maps
- Berry & Harlim (2016): “Variable bandwidth diffusion kernels”
- DOI: https://doi.org/10.1016/j.acha.2015.01.001
- Would allow adaptive kernel bandwidth based on local density
References
Coifman & Lafon (2006): Diffusion maps
- create_auto_dmaps_embedding_()[source]
Create diffusion maps embedding using pydiffmap library.
Alternative implementation using the pydiffmap library which provides automatic bandwidth selection via the Berry-Giannakis-Harlim (BGH) method. More sophisticated than the manual implementation.
- Raises:
AttributeError – If graph is not built.
ValueError – If parameters are invalid or pydiffmap fails.
ImportError – If pydiffmap is not installed.
Notes
Sets self.coords to shape (dim, n_samples). Uses epsilon=’bgh’ for automatic bandwidth selection based on local geometry. The alpha parameter controls the degree of density normalization (alpha=0: no density normalization; alpha=0.5: Fokker-Planck diffusion; alpha=1: full normalization, approximating the Laplace-Beltrami operator).
This method is preferred when you want automatic parameter tuning and don’t need fine control over the diffusion process.
- create_tsne_embedding_()[source]
Create t-SNE (t-distributed Stochastic Neighbor Embedding).
Non-linear dimensionality reduction that converts similarities between data points to joint probabilities and minimizes KL divergence between high-dimensional and low-dimensional distributions.
- Raises:
ValueError – If dim is invalid or t-SNE computation fails.
ImportError – If scikit-learn is not installed.
Notes
Sets self.coords to shape (dim, n_samples). Particularly effective for visualization (dim=2 or 3). Non-parametric: cannot embed new points without refitting. Stochastic: different runs may produce different results.
The perplexity parameter (related to number of neighbors) controls the balance between local and global structure preservation.
References
van der Maaten, L. & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research.
- create_umap_embedding_()[source]
Create UMAP (Uniform Manifold Approximation and Projection) embedding.
Manifold learning technique based on Riemannian geometry and algebraic topology. It often preserves more global structure than t-SNE while retaining local neighborhoods.
Notes
Sets self.coords to shape (dim, n_samples). The min_dist parameter controls how tightly points are packed (smaller values = tighter packing). Unlike t-SNE, UMAP can transform new points after fitting.
Advantages over t-SNE:
- Preserves more global structure
- Faster for large datasets
- Supports supervised/semi-supervised modes
- Can embed new points
References
McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426.
- create_ae_embedding_(continue_learning=0, epochs=50, lr=0.001, seed=42, batch_size=32, enc_kwargs=None, dec_kwargs=None, feature_dropout=0.2, train_size=0.8, inter_dim=100, verbose=True, add_corr_loss=False, corr_hyperweight=0, add_mi_loss=False, mi_hyperweight=0, minimize_mi_data=None, log_every=1, device=None)[source]
Create autoencoder embedding.
Deprecated: This method is deprecated. Use create_flexible_ae_embedding_ instead for more flexibility and advanced loss functions.
- Parameters:
continue_learning (int, default=0) – Whether to continue training existing model.
epochs (int, default=50) – Number of training epochs.
lr (float, default=1e-3) – Learning rate.
seed (int, default=42) – Random seed.
batch_size (int, default=32) – Batch size for training.
enc_kwargs (dict, optional) – Encoder configuration.
dec_kwargs (dict, optional) – Decoder configuration.
feature_dropout (float, default=0.2) – Feature dropout rate.
train_size (float, default=0.8) – Training set fraction.
inter_dim (int, default=100) – Hidden layer dimension.
verbose (bool, default=True) – Print training progress.
add_corr_loss (bool, default=False) – Add correlation loss to encourage decorrelated latent features.
corr_hyperweight (float, default=0) – Weight for correlation loss.
add_mi_loss (bool, default=False) – Add MI-based orthogonality loss.
mi_hyperweight (float, default=0) – Weight for MI loss.
minimize_mi_data (np.ndarray, optional) – External data to minimize correlation with (for MI loss).
log_every (int, default=1) – Logging frequency.
device (torch.device, optional) – Device to run on.
- Raises:
ValueError – If parameters are invalid.
ImportError – If PyTorch is not installed.
Notes
This method is deprecated in favor of create_flexible_ae_embedding_, which provides more flexibility and advanced loss functions.
- create_vae_embedding_(continue_learning=0, epochs=50, lr=0.001, seed=42, batch_size=32, enc_kwargs=None, dec_kwargs=None, feature_dropout=0.2, kld_weight=1, train_size=0.8, inter_dim=128, verbose=True, log_every=10, **kwargs)[source]
Create variational autoencoder embedding.
Deprecated: This method is deprecated. Use create_flexible_ae_embedding_ instead for more flexibility and advanced loss functions.
- Parameters:
continue_learning (int, default=0) – Whether to continue training existing model.
epochs (int, default=50) – Number of training epochs.
lr (float, default=1e-3) – Learning rate.
seed (int, default=42) – Random seed.
batch_size (int, default=32) – Batch size for training.
enc_kwargs (dict, optional) – Encoder configuration.
dec_kwargs (dict, optional) – Decoder configuration.
feature_dropout (float, default=0.2) – Feature dropout rate.
kld_weight (float, default=1) – Weight for KL divergence loss term.
train_size (float, default=0.8) – Training set fraction.
inter_dim (int, default=128) – Hidden layer dimension.
verbose (bool, default=True) – Print training progress.
log_every (int, default=10) – Logging frequency.
**kwargs – Additional keyword arguments.
- Raises:
ValueError – If parameters are invalid.
ImportError – If PyTorch is not installed.
Notes
This method is deprecated in favor of create_flexible_ae_embedding_, which provides more flexibility and advanced loss functions.
- create_flexible_ae_embedding_(architecture='ae', continue_learning=0, epochs=50, lr=0.001, seed=42, batch_size=32, enc_kwargs=None, dec_kwargs=None, feature_dropout=0.2, train_size=0.8, inter_dim=100, verbose=True, loss_components=None, log_every=1, device=None, logger=None, labels=None)[source]
Create flexible autoencoder embedding with modular loss composition.
- Parameters:
architecture (str, default="ae") – Architecture type: “ae” for standard autoencoder, “vae” for variational.
continue_learning (int, default=0) – Whether to continue training existing model.
epochs (int, default=50) – Number of training epochs.
lr (float, default=1e-3) – Learning rate.
seed (int, default=42) – Random seed for reproducibility.
batch_size (int, default=32) – Batch size for training.
enc_kwargs (dict, optional) – Encoder configuration (e.g., dropout).
dec_kwargs (dict, optional) – Decoder configuration.
feature_dropout (float, default=0.2) – Dropout rate for input features during training.
train_size (float, default=0.8) – Fraction of data for training.
inter_dim (int, default=100) – Hidden layer dimension.
verbose (bool, default=True) – Whether to print training progress.
loss_components (list of dict, optional) – Loss component configurations. Each dict should contain:
- “name”: str, the loss type
- “weight”: float, the loss weight
- Additional parameters specific to each loss type
If None, uses standard reconstruction loss for AE or reconstruction + KLD for VAE.
log_every (int, default=1) – Log frequency (epochs).
device (torch.device, optional) – Device to run on.
logger (logging.Logger, optional) – Logger instance.
labels (array-like, optional) – Integer class labels, shape (n_samples,). Passed to ClassificationLoss during training. If None, classification losses contribute zero.
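The loss_components format can be read as a weighted sum of named loss terms. The sketch below illustrates that reading with plain NumPy. The LOSS_FUNCS registry, the total_loss helper, and both toy loss functions are hypothetical stand-ins chosen for illustration, not driada's implementation.

```python
import numpy as np

def reconstruction_loss(x, x_hat, z):
    """Mean squared reconstruction error (toy stand-in)."""
    return float(np.mean((x - x_hat) ** 2))

def correlation_loss(x, x_hat, z):
    """Mean absolute off-diagonal correlation between latent dimensions (toy stand-in)."""
    corr = np.corrcoef(z, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))

# Hypothetical registry mapping the "name" field to a loss function.
LOSS_FUNCS = {"reconstruction": reconstruction_loss, "correlation": correlation_loss}

def total_loss(loss_components, x, x_hat, z):
    """Combine the configured losses as sum(weight * loss)."""
    total = 0.0
    for comp in loss_components:
        if comp["name"] not in LOSS_FUNCS:
            raise ValueError(f"Unknown loss: {comp['name']}")
        total += comp["weight"] * LOSS_FUNCS[comp["name"]](x, x_hat, z)
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))                 # batch of 64 samples, 8 features
x_hat = x + 0.1 * rng.normal(size=x.shape)   # imperfect reconstruction
z = rng.normal(size=(64, 3))                 # 3-dimensional latent codes

components = [
    {"name": "reconstruction", "weight": 1.0},
    {"name": "correlation", "weight": 0.1},
]
loss = total_loss(components, x, x_hat, z)
```

Under this reading, changing a "weight" rescales one term without touching the others, which is what makes the configuration modular.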
Examples
>>> import numpy as np
>>> from driada.dim_reduction.data import MVData
>>> data = np.random.randn(50, 500)  # 50 features, 500 samples
>>> mvdata = MVData(data)
>>>
>>> # Standard autoencoder with correlation loss
>>> emb = mvdata.get_embedding(
...     method="flexible_ae",
...     architecture="ae",
...     dim=10,
...     loss_components=[
...         {"name": "reconstruction", "weight": 1.0},
...         {"name": "correlation", "weight": 0.1}
...     ],
...     epochs=5,  # Quick test
...     verbose=False
... )

>>> # β-VAE for disentanglement
>>> emb = mvdata.get_embedding(
...     method="flexible_ae",
...     architecture="vae",
...     dim=10,
...     loss_components=[
...         {"name": "reconstruction", "weight": 1.0},
...         {"name": "beta_vae", "weight": 1.0, "beta": 4.0}
...     ],
...     epochs=5,  # Quick test
...     verbose=False
... )

>>> # Recreate deprecated 'ae' method with correlation loss
>>> # Old: method="ae", add_corr_loss=True, corr_hyperweight=0.1
>>> # New:
>>> emb = mvdata.get_embedding(
...     method="flexible_ae",
...     architecture="ae",
...     dim=10,
...     loss_components=[
...         {"name": "reconstruction", "weight": 1.0},
...         {"name": "correlation", "weight": 0.1}
...     ],
...     epochs=5,  # Quick test
...     verbose=False
... )

>>> # Recreate deprecated 'ae' method with MI loss
>>> # Old: method="ae", add_mi_loss=True, mi_hyperweight=0.1, minimize_mi_data=data
>>> # New:
>>> emb = mvdata.get_embedding(
...     method="flexible_ae",
...     architecture="ae",
...     dim=10,
...     loss_components=[
...         {"name": "reconstruction", "weight": 1.0},
...         {"name": "orthogonality", "weight": 0.1, "external_data": data}
...     ],
...     epochs=5,  # Quick test
...     verbose=False
... )

>>> # Recreate deprecated 'vae' method
>>> # Old: method="vae", kld_weight=0.1
>>> # New:
>>> emb = mvdata.get_embedding(
...     method="flexible_ae",
...     architecture="vae",
...     dim=10,
...     loss_components=[
...         {"name": "reconstruction", "weight": 1.0},
...         {"name": "beta_vae", "weight": 1.0, "beta": 0.1}
...     ],
...     epochs=5,  # Quick test
...     verbose=False
... )
- Raises:
ValueError – If parameters are invalid or architecture not in [“ae”, “vae”].
ImportError – If PyTorch is not installed.
Notes
This method provides a flexible framework for various autoencoder architectures with modular loss composition. It replaces the deprecated create_ae_embedding_ and create_vae_embedding_ methods.
- continue_learning(add_epochs, **kwargs)[source]
Continue training an existing autoencoder model.
Allows resuming training of a previously trained autoencoder or VAE for additional epochs with potentially different parameters.
- Parameters:
add_epochs (int) – Number of additional epochs to train. Must be positive.
**kwargs – Additional keyword arguments to pass to the training method. These override the original training parameters (e.g., lr for fine-tuning with a lower learning rate).
- Raises:
ValueError – If add_epochs is not positive or method is not DL-based.
AttributeError – If no model has been trained yet.
Notes
This method requires that an autoencoder model was previously trained using one of the DL-based methods (ae, vae, flexible_ae).
Training parameters from the original call (feature_dropout, batch_size, etc.) are preserved automatically. Only parameters explicitly passed in kwargs are overridden.
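The override rule described above behaves like a dictionary merge in which explicitly passed keyword arguments take precedence. A minimal sketch of that rule follows. The original_params dictionary and the merge_training_params helper are hypothetical names used only for illustration, and how driada actually stores the original parameters is not shown in this reference.

```python
# Hypothetical stored parameters from the original training call.
original_params = {"lr": 1e-3, "batch_size": 32, "feature_dropout": 0.2}

def merge_training_params(original, **kwargs):
    """Return the original parameters with explicit overrides applied."""
    merged = dict(original)   # copy so the stored parameters stay untouched
    merged.update(kwargs)     # only explicitly passed keys are replaced
    return merged

# Fine-tune with a lower learning rate; everything else is preserved.
fine_tune_params = merge_training_params(original_params, lr=1e-4)
```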
- to_mvdata()[source]
Convert embedding coordinates to MVData for further processing.
This allows embeddings to be used as input for additional dimensionality reduction or analysis steps, enabling recursive embedding pipelines.
Label Handling
Graph-based dimensionality reduction methods (LLE, Laplacian Eigenmaps, Isomap, etc.) may remove disconnected nodes during preprocessing, resulting in fewer points in the embedding than in the original data. This method handles labels in the following way:
If all points are preserved: Labels are passed through unchanged.
If nodes were filtered and a node mapping exists: Labels are filtered to match only the kept nodes, preserving the correspondence.
If nodes were filtered but no mapping is available: Labels are set to None to avoid misalignment between data points and labels.
- Returns:
An MVData object containing the embedding coordinates as data. Labels will be:
- Original labels if all points preserved
- Filtered labels matching kept nodes if mapping available
- None if nodes were removed but mapping unavailable
- Return type:
MVData
- Raises:
ValueError – If embedding has not been built yet.
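The three label-handling cases can be sketched as a small standalone function. The name filter_labels and its arguments (n_embedded, kept_nodes) are hypothetical. The sketch mirrors the documented behaviour rather than reproducing the library's code.

```python
import numpy as np

def filter_labels(labels, n_embedded, kept_nodes=None):
    """Return labels aligned with the embedded points, or None if alignment is unknown."""
    labels = np.asarray(labels)
    if n_embedded == len(labels):
        return labels                          # all points preserved
    if kept_nodes is not None:
        return labels[np.asarray(kept_nodes)]  # keep labels of surviving nodes only
    return None                                # nodes removed, no mapping available

labels = np.array(["A", "B", "C", "D"])
unchanged = filter_labels(labels, n_embedded=4)
filtered = filter_labels(labels, n_embedded=3, kept_nodes=[0, 2, 3])
unknown = filter_labels(labels, n_embedded=3)
```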
Examples
>>> import numpy as np
>>> from driada.dim_reduction.data import MVData
>>> np.random.seed(42)
>>> high_dim_data = np.random.randn(20, 500)  # 20 features, 500 samples
>>> labels = np.random.choice(['A', 'B', 'C', 'D'], size=500)
>>> mvdata = MVData(high_dim_data, labels=labels)
>>> # PCA preserves all points
>>> pca_emb = mvdata.get_embedding(method='pca', dim=2, verbose=False)
>>> pca_mvdata = pca_emb.to_mvdata()
>>> assert len(pca_mvdata.labels) == 500  # All labels preserved

>>> # LLE might remove disconnected nodes
>>> lle_emb = mvdata.get_embedding(method='lle', dim=2, nn=2)
>>> lle_mvdata = lle_emb.to_mvdata()
>>> # Labels either filtered to match remaining nodes or None
>>> assert lle_mvdata.labels is None or len(lle_mvdata.labels) == lle_mvdata.n_points
ProximityGraph
- class driada.dim_reduction.graph.ProximityGraph(d, m_params, g_params, create_nx_graph=False, verbose=False)[source]
Proximity graph for manifold learning and dimensionality reduction.
Constructs a graph where nodes are data points and edges connect nearby points, capturing the local geometry of the underlying manifold. Supports multiple graph construction methods including k-NN, UMAP fuzzy topology, and epsilon-ball.
The graph can be used for manifold learning algorithms, intrinsic dimension estimation, and as input to spectral dimensionality reduction methods.
- Parameters:
d (ndarray) – Data matrix of shape (n_features, n_samples). Each column is a data point.
m_params (dict) –
Metric parameters dictionary. Required key:
’metric_name’ : str or callable, distance metric name from pynndescent.distances.named_distances (‘euclidean’, ‘cosine’, ‘manhattan’, etc.), ‘hyperbolic’, or a callable custom distance function.
Optional keys (filtered by m_param_filter):
’sigma’ : float, bandwidth for heat kernel affinity transformation
’p’ : float, parameter for minkowski metric
Additional metric-specific parameters passed to distance function
g_params (dict) –
Graph construction parameters. Required key:
’g_method_name’ : str, graph construction method. Options: ‘knn’, ‘umap’, ‘auto_knn’, ‘eps’.
Method-specific keys (filtered by g_param_filter):
’nn’ : int, number of nearest neighbors (for knn/umap/auto_knn)
’eps’ : float, epsilon radius (for eps method)
’min_density’ : float, minimum graph density threshold (for eps method)
General optional keys:
’weighted’ : bool, whether to create weighted edges
’dist_to_aff’ : str, distance to affinity conversion (‘hk’ for heat kernel)
’max_deleted_nodes’ : float, maximum fraction of nodes that can be deleted during giant component extraction (raises exception if exceeded)
’graph_preprocessing’ : str, preprocessing method (default: ‘giant_cc’)
’seed’ : int, random seed for reproducibility (default: 42)
create_nx_graph (bool, default=False) – Whether to create NetworkX graph representation (passed to Network parent).
verbose (bool, default=False) – Whether to print progress messages.
- data
The input data matrix (n_features, n_samples).
- Type:
ndarray
- adj
Weighted adjacency matrix. For weighted graphs with dist_to_aff=’hk’, contains affinities exp(-d²/(sigma*mean_squared_dist)). For unweighted graphs, same as bin_adj.
- Type:
sparse matrix
- bin_adj
Binary adjacency matrix indicating connections.
- Type:
sparse matrix
- neigh_distmat
Sparse matrix of distances between connected neighbors. For unweighted graphs or methods without distance computation, contains zeros in sparse format.
- Type:
sparse matrix
- knn_indices
k-NN indices array of shape (n_nodes, k+1) including self (only for ‘knn’ method).
- Type:
ndarray or None
- knn_distances
k-NN distances array of shape (n_nodes, k+1) including self (only for ‘knn’ method).
- Type:
ndarray or None
- lost_nodes
Set of node indices removed during giant component preprocessing. Empty set if no preprocessing or no nodes lost.
- Type:
set
- intrinsic_dimensions
Cached intrinsic dimension estimates. Keys are method names with parameters.
- Type:
dict
- Plus all attributes from g_params set via setattr.
- distances_to_affinities()[source]
Convert neigh_distmat to affinity weights using heat kernel (only ‘hk’ implemented).
- create_knn_graph_()[source]
Create k-NN graph using pynndescent. Saves knn_indices and knn_distances.
- create_umap_graph_()[source]
Create graph using UMAP’s fuzzy_simplicial_set. Sets knn arrays to None.
- create_auto_knn_graph_()[source]
Create unweighted k-NN graph using sklearn. Sets knn arrays to None.
- create_eps_graph_()[source]
Create epsilon-ball graph. Checks density against min_density. Sets knn arrays to None.
- get_int_dim(method='geodesic', force_recompute=False, logger=None, **kwargs)[source]
Estimate intrinsic dimension. Methods: ‘geodesic’ (uses neigh_distmat or adj), ‘nn’ (requires saved knn data from ‘knn’ method). Results are cached.
- Raises:
Exception – If more than max_deleted_nodes fraction of nodes are lost during preprocessing.
Exception – If adjacency matrix is not sparse or not symmetric.
ValueError – If unknown metric or graph method specified.
Notes
Inherits from Network class, gaining spectral analysis capabilities
All graphs are enforced to be symmetric/undirected via check_symmetric
Giant connected component extraction is default preprocessing
For weighted graphs with ‘hk’, distances are converted to similarities
Node indices are remapped after giant component extraction
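The giant-component preprocessing and the index remapping mentioned in these notes can be illustrated on a dense adjacency matrix. This is a minimal sketch: the giant_component function is a hypothetical helper, it uses a plain breadth-first search instead of whatever the library uses internally, and it raises RuntimeError where the library documents a generic Exception.

```python
from collections import deque
import numpy as np

def giant_component(adj, max_deleted_nodes=0.5):
    """Return (sub_adjacency, kept_indices, lost_nodes) for the largest component."""
    n = adj.shape[0]
    component = np.full(n, -1)
    n_components = 0
    for start in range(n):
        if component[start] != -1:
            continue
        # Breadth-first search labels every node reachable from `start`.
        component[start] = n_components
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in np.flatnonzero(adj[node]):
                if component[neighbor] == -1:
                    component[neighbor] = n_components
                    queue.append(neighbor)
        n_components += 1
    largest = np.argmax(np.bincount(component))
    kept = np.flatnonzero(component == largest)
    lost = set(np.flatnonzero(component != largest).tolist())
    if len(lost) / n > max_deleted_nodes:
        raise RuntimeError("Too many nodes lost during giant component extraction")
    # Row/column i of the result corresponds to original node kept[i].
    return adj[np.ix_(kept, kept)], kept, lost

# Five nodes: a path 1-2-4 plus a separate edge 0-3.
adj = np.zeros((5, 5), dtype=int)
for i, j in [(1, 2), (2, 4), (0, 3)]:
    adj[i, j] = adj[j, i] = 1

sub_adj, kept, lost = giant_component(adj)
```

Here kept acts as the mapping from new node indices back to original ones, which is the kind of information needed to keep labels aligned after nodes are dropped.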
Examples
>>> import numpy as np
>>> from driada.dim_reduction.graph import ProximityGraph
>>> # Generate sample data
>>> np.random.seed(42)  # For reproducible results
>>> data = np.random.randn(3, 100)  # 100 points in 3D
>>> # Define metric parameters
>>> m_params = {'metric_name': 'euclidean', 'sigma': 1.0}
>>> # Define graph parameters for k-NN graph
>>> g_params = {
...     'g_method_name': 'knn',
...     'nn': 15,
...     'weighted': True,
...     'dist_to_aff': 'hk',
...     'max_deleted_nodes': 0.1
... }
>>> # Create proximity graph
>>> graph = ProximityGraph(data, m_params, g_params)
>>> edges = graph.adj.nnz//2
>>> print(f"Graph has {graph.n} nodes and {edges} edges")
Graph has 100 nodes and ... edges
- __init__(d, m_params, g_params, create_nx_graph=False, verbose=False)[source]
Initialize proximity graph from data.
- Parameters:
d (ndarray of shape (n_features, n_samples)) – Data matrix where each column is a sample.
m_params (dict) – Metric parameters. Must contain ‘metric_name’.
g_params (dict) – Graph parameters. Must contain ‘g_method_name’.
create_nx_graph (bool, default=False) – Whether to create NetworkX representation.
verbose (bool, default=False) – Whether to print progress messages.
- Raises:
ValueError – If data is not 2D or is empty. If required parameters are missing.
Notes
lost_nodes attribute is always set (even if empty). Data shape assumes (features, samples) format.
- distances_to_affinities()[source]
Convert distance matrix to affinity matrix.
Transforms distances between neighbors into similarity weights. The transformation method is specified by self.dist_to_aff parameter.
Currently implemented methods:
‘hk’: Heat kernel with adaptive bandwidth:
w_ij = exp(-d_ij^2 / (sigma * mean_squared_distance))
- Raises:
RuntimeError – If no distance matrix is available or graph is not weighted.
ValueError – If sigma is not positive.
Notes
Only applies to weighted graphs. The resulting affinity matrix is symmetrized to ensure undirected graph structure. Mean distance uses only non-zero entries.
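The heat kernel formula above can be checked by hand on a tiny example. The sketch below works on a dense NumPy array in which zero means "no edge", whereas the class stores sparse matrices. The function name heat_kernel_affinities and the choice of element-wise maximum for symmetrization are assumptions made for illustration.

```python
import numpy as np

def heat_kernel_affinities(dist, sigma=1.0):
    """Apply w_ij = exp(-d_ij^2 / (sigma * mean_squared_distance)) to existing edges."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    dist = np.asarray(dist, dtype=float)
    mask = dist > 0                       # zero entries are treated as missing edges
    mean_sq = np.mean(dist[mask] ** 2)    # mean squared distance over non-zero entries
    aff = np.zeros_like(dist)
    aff[mask] = np.exp(-dist[mask] ** 2 / (sigma * mean_sq))
    return np.maximum(aff, aff.T)         # assumed symmetrization rule

# Path graph 0-1-2 with edge lengths 1 and 2.
dist = np.array([[0.0, 1.0, 0.0],
                 [1.0, 0.0, 2.0],
                 [0.0, 2.0, 0.0]])
aff = heat_kernel_affinities(dist, sigma=1.0)
```

In this example the mean squared distance is (1 + 1 + 4 + 4) / 4 = 2.5, so the short edge gets weight exp(-0.4) and the long edge exp(-1.6).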
- construct_adjacency()[source]
Construct the adjacency matrix using the specified graph method.
Dynamically calls the appropriate graph construction method based on self.g_method_name (e.g., ‘knn’, ‘umap’, ‘auto_knn’, ‘eps’).
- Raises:
ValueError – If g_method_name contains invalid characters.
AttributeError – If the specified graph construction method doesn’t exist.
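Name-based dispatch of this kind is commonly written with getattr. The toy class below sketches the pattern and both documented failure modes. GraphBuilder is a hypothetical stand-in, not ProximityGraph, and the exact character check the library performs is an assumption (letters, digits, and underscores here).

```python
class GraphBuilder:
    """Toy class illustrating name-based dispatch (not the real ProximityGraph)."""

    def __init__(self, g_method_name):
        self.g_method_name = g_method_name

    def create_knn_graph_(self):
        return "knn graph"

    def create_eps_graph_(self):
        return "eps graph"

    def construct_adjacency(self):
        name = self.g_method_name
        # Assumed validation: allow letters, digits and underscores only.
        if not name.replace("_", "").isalnum():
            raise ValueError(f"Invalid characters in method name: {name!r}")
        # getattr raises AttributeError when no matching method exists.
        builder = getattr(self, f"create_{name}_graph_")
        return builder()

result = GraphBuilder("knn").construct_adjacency()
```

Validating the name before building the attribute string keeps arbitrary input from being turned into an attribute lookup.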
- create_umap_graph_()[source]
Create graph using UMAP’s fuzzy simplicial set construction.
Uses UMAP’s algorithm to build a fuzzy topological representation that captures both local and global structure of the data manifold.
Notes
The resulting graph has weighted edges representing fuzzy set membership. Sets self.adj (weighted), self.bin_adj (binary), and self.neigh_distmat. Uses seed from g_params if provided, otherwise defaults to 42 for reproducibility. Data is transposed to match UMAP’s expected (n_samples, n_features) format.
- create_knn_graph_()[source]
Create k-nearest neighbors graph.
Constructs a symmetric k-NN graph where each point is connected to its k nearest neighbors. Supports two engines:
'pynndescent' (default): approximate NN, supports custom metrics
'cKDTree': exact NN via scipy, faster for moderate dimensions
Symmetrization can be 'intersection' (default, .minimum()) or 'union' (.maximum()).
- Raises:
ValueError – If nn exceeds number of samples minus 1. If metric is unknown or invalid. If knn_engine is unknown.
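The difference between the two symmetrization modes is easiest to see on a directed k-NN adjacency matrix. The sketch below uses brute-force Euclidean distances in NumPy rather than pynndescent or cKDTree, and it stores samples as rows for brevity (the class itself expects features as rows). The knn_adjacency helper is hypothetical.

```python
import numpy as np

def knn_adjacency(points, k):
    """Directed k-NN adjacency from brute-force Euclidean distances."""
    n = points.shape[0]
    if k > n - 1:
        raise ValueError("k exceeds number of samples minus 1")
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)               # exclude self-matches
    neighbors = np.argsort(dist, axis=1)[:, :k]  # k closest points per row
    adj = np.zeros((n, n), dtype=int)
    adj[np.repeat(np.arange(n), k), neighbors.ravel()] = 1
    return adj

points = np.array([[0.0], [1.0], [3.0], [7.0]])   # 4 points on a line
directed = knn_adjacency(points, k=1)

intersection = np.minimum(directed, directed.T)   # keep mutual neighbors only
union = np.maximum(directed, directed.T)          # keep an edge if either side chose it
```

With k=1 only points 0 and 1 choose each other, so the intersection graph has a single edge while the union graph is the full path 0-1-2-3.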
- create_auto_knn_graph_()[source]
Create k-NN graph using scikit-learn’s implementation.
A simpler alternative to pynndescent that uses sklearn’s kneighbors_graph. Creates an unweighted, symmetric graph.
- Raises:
ValueError – If nn exceeds number of samples minus 1.
Notes
Always creates unweighted graphs (connectivity mode)
Uses ‘auto’ algorithm selection in sklearn
Sets diagonal to 0 to exclude self-connections
Does not store knn_indices or knn_distances
Data is transposed to match sklearn expected format
Symmetrizes by A = A + A.T
- create_eps_graph_()[source]
Create epsilon-ball graph where edges connect points within distance eps.
Constructs a graph by connecting all pairs of points whose distance is less than or equal to the epsilon threshold. Uses sklearn’s radius_neighbors_graph for efficient computation.
The resulting graph density is checked against self.min_density to ensure sufficient connectivity. Issues a warning if density exceeds 0.5.
- Raises:
ValueError – If eps is not positive. If min_density is not in (0, 1]. If graph density is below self.min_density threshold.
Notes
Sets self.adj (weighted or binary), self.bin_adj (binary), and self.neigh_distmat (distances for weighted graphs, zero matrix otherwise). For weighted graphs, distances are converted to affinities. Does not store knn_indices or knn_distances. Data is transposed to match sklearn expected format.
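An epsilon-ball graph with a density check can be sketched in a few lines of NumPy. This is not the sklearn-based implementation described above: the eps_graph helper is hypothetical, samples are stored as rows, and the density definition (existing edges divided by all possible edges) is an assumption.

```python
import numpy as np

def eps_graph(points, eps, min_density=0.01):
    """Binary epsilon-ball adjacency plus its density."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    if not 0 < min_density <= 1:
        raise ValueError("min_density must be in (0, 1]")
    n = points.shape[0]
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    adj = (dist <= eps).astype(int)
    np.fill_diagonal(adj, 0)                  # no self-loops
    density = adj.sum() / (n * (n - 1))       # assumed definition of graph density
    if density < min_density:
        raise ValueError(f"Graph density {density:.3f} is below {min_density}")
    return adj, density

points = np.array([[0.0], [1.0], [3.0], [7.0]])   # 4 points on a line
adj, density = eps_graph(points, eps=2.0)
```

With eps=2.0 only the pairs (0, 1) and (1, 2) are connected, giving 2 of 6 possible edges.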
- scaling()[source]
Compute normalized diagonal sums of the binary adjacency matrix.
Analyzes graph structure by computing the average connectivity at different node separations. For each diagonal offset i, computes the fraction of possible edges that exist between nodes separated by i positions.
- Returns:
List of length n-1 where element i is the average value along the i-th super-diagonal of the binary adjacency matrix, normalized by the diagonal length (n-i).
- Return type:
list
Notes
This provides a simple graph structure summary. The first elements indicate local connectivity while later elements show longer-range connections. Regular lattices show characteristic patterns. Uses sparse matrix operations to avoid memory issues with large graphs.
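The normalized diagonal sums are straightforward to reproduce on a small dense matrix. The sketch below uses np.diagonal instead of the sparse operations mentioned above, and the function name diagonal_scaling is hypothetical.

```python
import numpy as np

def diagonal_scaling(bin_adj):
    """Average of each super-diagonal (offsets 1 .. n-1) of a binary adjacency matrix."""
    n = bin_adj.shape[0]
    return [float(np.diagonal(bin_adj, offset=i).sum()) / (n - i) for i in range(1, n)]

# Path graph 0-1-2-3: every node is linked only to its immediate successor.
path = np.zeros((4, 4), dtype=int)
for i in range(3):
    path[i, i + 1] = path[i + 1, i] = 1

profile = diagonal_scaling(path)
```

For the path graph the first super-diagonal is fully populated and all longer-range diagonals are empty, which is the "local connectivity only" pattern.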
- get_int_dim(method='geodesic', force_recompute=False, logger=None, **kwargs)[source]
Estimate intrinsic dimension using graph-based methods.
This method estimates the intrinsic dimensionality using either geodesic distances on the graph or k-NN method with precomputed k-NN information. Results are cached to avoid recomputation.
- Parameters:
method ({'geodesic', 'nn'}, default='geodesic') – The method to use for intrinsic dimension estimation: - ‘geodesic’: Uses geodesic distances (shortest paths) on the graph - ‘nn’: k-nearest neighbor based estimation using saved k-NN data
force_recompute (bool, default=False) – If True, recompute the dimension even if it has been computed before. If False, return cached result if available.
logger (logging.Logger, optional) – Logger instance for logging messages. If None, creates a default logger.
**kwargs – Additional parameters passed to the dimension estimation method: - For ‘geodesic’: mode (‘full’/’fast’), factor (subsampling factor)
- Returns:
dimension – The estimated intrinsic dimension of the dataset/manifold.
- Return type:
float
- Raises:
ValueError – If an unknown method is specified or required data is not available.
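The caching behaviour can be pictured as a dictionary keyed by method and parameters. The toy class below is a sketch of that pattern only: DimensionCache and _estimate are hypothetical, the estimator returns a placeholder value, and the key format and the defaults mode='full' and factor=2 are inferred from the cache key 'geodesic_full_f2' shown in this method's example output rather than taken from the library source.

```python
class DimensionCache:
    """Toy illustration of result caching with force_recompute."""

    def __init__(self):
        self.intrinsic_dimensions = {}
        self.n_computations = 0

    def _estimate(self, method, **kwargs):
        self.n_computations += 1
        return 2.0  # placeholder value, not a real estimate

    def get_int_dim(self, method="geodesic", force_recompute=False, **kwargs):
        if method == "geodesic":
            mode = kwargs.get("mode", "full")     # assumed default
            factor = kwargs.get("factor", 2)      # assumed default
            key = f"geodesic_{mode}_f{factor}"
        elif method == "nn":
            key = "nn"
        else:
            raise ValueError(f"Unknown method: {method}")
        # Recompute only when forced or when no cached value exists.
        if force_recompute or key not in self.intrinsic_dimensions:
            self.intrinsic_dimensions[key] = self._estimate(method, **kwargs)
        return self.intrinsic_dimensions[key]

cache = DimensionCache()
first = cache.get_int_dim(method="geodesic")
second = cache.get_int_dim(method="geodesic")            # served from the cache
calls_after_cached = cache.n_computations
third = cache.get_int_dim(method="geodesic", force_recompute=True)
cache.get_int_dim(method="nn")
keys = sorted(cache.intrinsic_dimensions.keys())
```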
Examples
>>> import numpy as np
>>> from driada.dim_reduction.graph import ProximityGraph
>>> # Generate Swiss roll data
>>> from sklearn.datasets import make_swiss_roll
>>> np.random.seed(42)  # For reproducible results
>>> data, _ = make_swiss_roll(n_samples=500, random_state=42)
>>> # Create proximity graph
>>> m_params = {'metric_name': 'euclidean', 'sigma': 1.0}
>>> g_params = {'g_method_name': 'knn', 'nn': 15, 'weighted': True,
...             'dist_to_aff': 'hk', 'max_deleted_nodes': 0.5}
>>> graph = ProximityGraph(data.T, m_params, g_params)
>>> # Estimate dimension using geodesic method
>>> dim_geo = graph.get_int_dim(method='geodesic')
>>> # Get cached result (fast)
>>> dim_geo_cached = graph.get_int_dim(method='geodesic')
>>> # Force recomputation
>>> dim_geo_new = graph.get_int_dim(method='geodesic', force_recompute=True)
>>> # Or using nn method (only if k-NN graph was used)
>>> dim_nn = graph.get_int_dim(method='nn')
>>> # Access all computed dimensions
>>> dims = sorted(graph.intrinsic_dimensions.keys())
>>> print(dims)
['geodesic_full_f2', 'nn']