Utilities
Helper functions and specialized methods for dimensionality reduction.
Data Validation
- driada.dim_reduction.data.check_data_for_errors(d, verbose=True)[source]
Check data matrix for zero columns which can cause issues in DR methods.
- Parameters:
d (np.ndarray or scipy.sparse matrix) – Data matrix with shape (n_features, n_samples)
verbose (bool, default=True) – Whether to print information about zero points
- Raises:
ValueError – If data contains columns with all zeros.
Sequence Validation
- driada.dim_reduction.sequences.validate_sequence_dimensions(steps, initial_dim, logger=None)[source]
Validate and report dimension flow through a sequence of reductions.
This function helps plan reduction sequences by showing how dimensions will change at each step, without actually performing the reductions.
- Parameters:
steps (List[Union[Tuple[str, Dict], str]]) – List of reduction steps in the same format as dr_sequence
initial_dim (int) – Initial data dimension
logger (logging.Logger, optional) – Logger for reporting dimension flow
- Returns:
List of (method_name, input_dim, output_dim) for each step
- Return type:
- Raises:
ValueError – If any step has invalid format. If any reduction method is unknown.
Examples
>>> from driada.dim_reduction.sequences import validate_sequence_dimensions >>> >>> # Check dimension flow before running expensive computation >>> flow = validate_sequence_dimensions( ... [('pca', {'dim': 50}), 'tsne'], ... initial_dim=1000 ... ) >>> flow [('pca', 1000, 50), ('tsne', 50, 2)]
Notes
Logs dimension changes for each step via provided or module logger
Warns when a step attempts to increase dimensions
Does not perform actual reductions, only predicts dimensions
Specialized Methods
Maximum Variance Unfolding
- class driada.dim_reduction.mvu.MaximumVarianceUnfolding(equation='berkley', solver=None, solver_tol=0.01, eig_tol=1e-10, solver_iters=2500, warm_start=False, seed=None)[source]
Maximum Variance Unfolding (MVU) for nonlinear dimensionality reduction.
MVU learns a low-dimensional representation that preserves local distances while maximizing the variance of the embedding. It formulates the problem as a semidefinite program (SDP) to find a kernel matrix that respects local isometry constraints.
- Parameters:
equation ({'berkley', 'wikipedia'}, default='berkley') – Formulation variant to use: - ‘berkley’: UC Berkeley formulation with centered embedding constraint - ‘wikipedia’: Standard formulation with trace maximization
solver (cvxpy.Solver, optional) – CVXPY solver to use. Default is SCS (Splitting Conic Solver).
solver_tol (float, default=1e-2) – Convergence tolerance for the SDP solver.
eig_tol (float, default=1e-10) – Tolerance for eigenvalue thresholding. Eigenvalues in (-eig_tol, eig_tol) are set to zero to handle numerical errors in the PSD constraint.
solver_iters (int, default=2500) – Maximum number of solver iterations. Set to None for default solver limit.
warm_start (bool, default=False) – Whether to use warm start in solver. Useful when solving similar problems repeatedly.
seed (int, optional) – Random seed for reproducibility.
- neighborhood_graph
Binary matrix indicating k-nearest neighbor connections.
- Type:
ndarray of shape (n_samples, n_samples)
- Raises:
ImportError – If cvxpy is not installed. Install with pip install cvxpy.
Examples
>>> import numpy as np >>> from driada.dim_reduction.mvu import MaximumVarianceUnfolding >>> # Generate sample data >>> data = np.random.randn(100, 10) >>> # Initialize MVU >>> mvu = MaximumVarianceUnfolding(equation='berkley', solver_tol=1e-3) >>> # Fit and transform to 2D >>> embedding = mvu.fit_transform(data, dim=2, k=5) >>> print(embedding.shape) (100, 2)
Notes
Requires cvxpy package for solving the semidefinite program. The algorithm can be slow for large datasets due to the SDP formulation. The SDP solver may require tuning for different dataset sizes.
References
Weinberger, K. Q., & Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. AAAI.
- __init__(equation='berkley', solver=None, solver_tol=0.01, eig_tol=1e-10, solver_iters=2500, warm_start=False, seed=None)[source]
Initialize Maximum Variance Unfolding instance.
Sets up the MVU solver with specified configuration for the semidefinite programming problem. Validates cvxpy availability and initializes solver parameters.
- Parameters:
equation ({'berkley', 'wikipedia'}, default='berkley') – Formulation variant to use: - ‘berkley’: UC Berkeley formulation with centered embedding - ‘wikipedia’: Standard formulation with trace maximization
solver (cvxpy.Solver, optional) – CVXPY solver to use. Default is SCS (Splitting Conic Solver). Common options: cp.SCS, cp.MOSEK, cp.CVXOPT
solver_tol (float, default=1e-2) – Convergence tolerance for the SDP solver. Lower values give more accurate solutions but take longer.
eig_tol (float, default=1e-10) – Tolerance for eigenvalue thresholding. Eigenvalues with absolute value less than eig_tol are set to zero.
solver_iters (int, default=2500) – Maximum number of solver iterations. None uses solver default. Increase if solver doesn’t converge.
warm_start (bool, default=False) – Whether to use warm start in solver. Useful when solving similar problems repeatedly.
seed (int, optional) – Random seed for reproducibility in k-NN graph construction.
- Raises:
ImportError – If cvxpy is not installed.
Notes
The Berkeley formulation often provides better numerical stability than the Wikipedia formulation. SCS solver is recommended for most cases as it handles large problems efficiently.
- fit(data, k)[source]
Fit MVU model by solving the semidefinite program.
Constructs a k-nearest neighbor graph and solves an SDP to find the optimal Gram matrix that preserves local distances while maximizing variance.
- Parameters:
data (ndarray of shape (n_samples, n_features)) – High-dimensional input data.
k (int) – Number of nearest neighbors to preserve distances for. Must be large enough to create a connected graph.
- Returns:
Optimal Gram matrix Q from the SDP solution. This is the inner product matrix of the embedded points.
- Return type:
ndarray of shape (n_samples, n_samples)
- Raises:
ValueError – If the k-NN graph has disconnected components and solver_iters is None (some solvers may not converge with disconnected graphs).
ImportError – If cvxpy is not installed.
Examples
>>> import numpy as np >>> # Create simple line data - MVU should preserve distances perfectly >>> np.random.seed(42) >>> n_points = 30 >>> t = np.linspace(0, 2*np.pi, n_points) >>> # Circle in 3D space >>> data = np.column_stack([np.cos(t), np.sin(t), 0.1*np.sin(3*t)]) >>> # Initialize and fit MVU >>> mvu = MaximumVarianceUnfolding(equation='berkley', solver_tol=1e-4) >>> gram_matrix = mvu.fit(data, k=6) >>> print(gram_matrix.shape) (30, 30) >>> # Check that Gram matrix eigenvalues are reasonable >>> # Small negative values are expected numerical errors from SDP solvers >>> eigvals = np.linalg.eigvalsh(gram_matrix) >>> # Verify the matrix is approximately PSD (within solver tolerance) >>> num_significant_negative = np.sum(eigvals < -1e-3) >>> print(f"Number of eigenvalues < -1e-3: {num_significant_negative}") Number of eigenvalues < -1e-3: 0
Notes
The Gram matrix Q satisfies:
Q is positive semidefinite
Sum of each row is zero (centering constraint)
For neighbors i,j:
||x_i - x_j||² = ||y_i - y_j||²where x are original points and y are embedded points
- fit_transform(data, dim, k)[source]
Fit MVU model and transform data to lower dimension.
Combines fit() and transform steps: solves the SDP to get the optimal Gram matrix, then extracts the embedding via eigendecomposition.
- Parameters:
- Returns:
Low-dimensional embedding of the data.
- Return type:
ndarray of shape (n_samples, dim)
- Raises:
ValueError – If k-NN graph has disconnected components (from fit). If requested dimension exceeds data samples.
ImportError – If cvxpy is not installed.
Examples
>>> import numpy as np >>> # Create Swiss roll data >>> n_samples = 100 >>> noise = 0.05 >>> t = 3 * np.pi * (1 + 2 * np.random.rand(n_samples)) >>> x = t * np.cos(t) >>> y = 30 * np.random.rand(n_samples) >>> z = t * np.sin(t) >>> data = np.column_stack([x, y, z]) + noise * np.random.randn(n_samples, 3) >>> # Apply MVU >>> mvu = MaximumVarianceUnfolding(equation='berkley', solver_tol=1e-3) >>> embedding = mvu.fit_transform(data, dim=2, k=7) >>> print(f"Embedded shape: {embedding.shape}") Embedded shape: (100, 2)
Notes
The embedding is obtained by: 1. Solving SDP to get Gram matrix Q 2. Eigendecomposition: Q = V * Lambda * V^T 3. Embedding: Y = sqrt(Lambda_top) * V_top^T where Lambda_top and V_top are the top ‘dim’ eigenvalues/vectors.
Small negative eigenvalues may appear due to numerical errors in the SDP solution. These are thresholded to zero using eig_tol parameter.