Utilities

Helper functions and specialized methods for dimensionality reduction.

Data Validation

driada.dim_reduction.data.check_data_for_errors(d, verbose=True)[source]

Check data matrix for zero columns which can cause issues in DR methods.

Parameters:
  • d (np.ndarray or scipy.sparse matrix) – Data matrix with shape (n_features, n_samples)

  • verbose (bool, default=True) – Whether to print information about zero points

Raises:

ValueError – If data contains columns with all zeros.

Sequence Validation

driada.dim_reduction.sequences.validate_sequence_dimensions(steps, initial_dim, logger=None)[source]

Validate and report dimension flow through a sequence of reductions.

This function helps plan reduction sequences by showing how dimensions will change at each step, without actually performing the reductions.

Parameters:
  • steps (List[Union[Tuple[str, Dict], str]]) – List of reduction steps in the same format as dr_sequence

  • initial_dim (int) – Initial data dimension

  • logger (logging.Logger, optional) – Logger for reporting dimension flow

Returns:

List of (method_name, input_dim, output_dim) for each step

Return type:

List[Tuple[str, int, int]]

Raises:

ValueError – If any step has invalid format. If any reduction method is unknown.

Examples

>>> from driada.dim_reduction.sequences import validate_sequence_dimensions
>>>
>>> # Check dimension flow before running expensive computation
>>> flow = validate_sequence_dimensions(
...     [('pca', {'dim': 50}), 'tsne'],
...     initial_dim=1000
... )
>>> flow
[('pca', 1000, 50), ('tsne', 50, 2)]

Notes

  • Logs dimension changes for each step via provided or module logger

  • Warns when a step attempts to increase dimensions

  • Does not perform actual reductions, only predicts dimensions

Specialized Methods

Maximum Variance Unfolding

class driada.dim_reduction.mvu.MaximumVarianceUnfolding(equation='berkley', solver=None, solver_tol=0.01, eig_tol=1e-10, solver_iters=2500, warm_start=False, seed=None)[source]

Maximum Variance Unfolding (MVU) for nonlinear dimensionality reduction.

MVU learns a low-dimensional representation that preserves local distances while maximizing the variance of the embedding. It formulates the problem as a semidefinite program (SDP) to find a kernel matrix that respects local isometry constraints.

Parameters:
  • equation ({'berkley', 'wikipedia'}, default='berkley') – Formulation variant to use: - ‘berkley’: UC Berkeley formulation with centered embedding constraint - ‘wikipedia’: Standard formulation with trace maximization

  • solver (cvxpy.Solver, optional) – CVXPY solver to use. Default is SCS (Splitting Conic Solver).

  • solver_tol (float, default=1e-2) – Convergence tolerance for the SDP solver.

  • eig_tol (float, default=1e-10) – Tolerance for eigenvalue thresholding. Eigenvalues in (-eig_tol, eig_tol) are set to zero to handle numerical errors in the PSD constraint.

  • solver_iters (int, default=2500) – Maximum number of solver iterations. Set to None for default solver limit.

  • warm_start (bool, default=False) – Whether to use warm start in solver. Useful when solving similar problems repeatedly.

  • seed (int, optional) – Random seed for reproducibility.

neighborhood_graph

Binary matrix indicating k-nearest neighbor connections.

Type:

ndarray of shape (n_samples, n_samples)

Raises:

ImportError – If cvxpy is not installed. Install with pip install cvxpy.

Examples

>>> import numpy as np
>>> from driada.dim_reduction.mvu import MaximumVarianceUnfolding
>>> # Generate sample data
>>> data = np.random.randn(100, 10)
>>> # Initialize MVU
>>> mvu = MaximumVarianceUnfolding(equation='berkley', solver_tol=1e-3)
>>> # Fit and transform to 2D
>>> embedding = mvu.fit_transform(data, dim=2, k=5)
>>> print(embedding.shape)
(100, 2)

Notes

Requires cvxpy package for solving the semidefinite program. The algorithm can be slow for large datasets due to the SDP formulation. The SDP solver may require tuning for different dataset sizes.

References

Weinberger, K. Q., & Saul, L. K. (2006). An introduction to nonlinear dimensionality reduction by maximum variance unfolding. AAAI.

__init__(equation='berkley', solver=None, solver_tol=0.01, eig_tol=1e-10, solver_iters=2500, warm_start=False, seed=None)[source]

Initialize Maximum Variance Unfolding instance.

Sets up the MVU solver with specified configuration for the semidefinite programming problem. Validates cvxpy availability and initializes solver parameters.

Parameters:
  • equation ({'berkley', 'wikipedia'}, default='berkley') – Formulation variant to use: - ‘berkley’: UC Berkeley formulation with centered embedding - ‘wikipedia’: Standard formulation with trace maximization

  • solver (cvxpy.Solver, optional) – CVXPY solver to use. Default is SCS (Splitting Conic Solver). Common options: cp.SCS, cp.MOSEK, cp.CVXOPT

  • solver_tol (float, default=1e-2) – Convergence tolerance for the SDP solver. Lower values give more accurate solutions but take longer.

  • eig_tol (float, default=1e-10) – Tolerance for eigenvalue thresholding. Eigenvalues with absolute value less than eig_tol are set to zero.

  • solver_iters (int, default=2500) – Maximum number of solver iterations. None uses solver default. Increase if solver doesn’t converge.

  • warm_start (bool, default=False) – Whether to use warm start in solver. Useful when solving similar problems repeatedly.

  • seed (int, optional) – Random seed for reproducibility in k-NN graph construction.

Raises:

ImportError – If cvxpy is not installed.

Notes

The Berkeley formulation often provides better numerical stability than the Wikipedia formulation. SCS solver is recommended for most cases as it handles large problems efficiently.

fit(data, k)[source]

Fit MVU model by solving the semidefinite program.

Constructs a k-nearest neighbor graph and solves an SDP to find the optimal Gram matrix that preserves local distances while maximizing variance.

Parameters:
  • data (ndarray of shape (n_samples, n_features)) – High-dimensional input data.

  • k (int) – Number of nearest neighbors to preserve distances for. Must be large enough to create a connected graph.

Returns:

Optimal Gram matrix Q from the SDP solution. This is the inner product matrix of the embedded points.

Return type:

ndarray of shape (n_samples, n_samples)

Raises:
  • ValueError – If the k-NN graph has disconnected components and solver_iters is None (some solvers may not converge with disconnected graphs).

  • ImportError – If cvxpy is not installed.

Examples

>>> import numpy as np
>>> # Create simple line data - MVU should preserve distances perfectly
>>> np.random.seed(42)
>>> n_points = 30
>>> t = np.linspace(0, 2*np.pi, n_points)
>>> # Circle in 3D space
>>> data = np.column_stack([np.cos(t), np.sin(t), 0.1*np.sin(3*t)])
>>> # Initialize and fit MVU
>>> mvu = MaximumVarianceUnfolding(equation='berkley', solver_tol=1e-4)
>>> gram_matrix = mvu.fit(data, k=6)
>>> print(gram_matrix.shape)
(30, 30)
>>> # Check that Gram matrix eigenvalues are reasonable
>>> # Small negative values are expected numerical errors from SDP solvers
>>> eigvals = np.linalg.eigvalsh(gram_matrix)
>>> # Verify the matrix is approximately PSD (within solver tolerance)
>>> num_significant_negative = np.sum(eigvals < -1e-3)
>>> print(f"Number of eigenvalues < -1e-3: {num_significant_negative}")
Number of eigenvalues < -1e-3: 0

Notes

The Gram matrix Q satisfies:

  • Q is positive semidefinite

  • Sum of each row is zero (centering constraint)

  • For neighbors i,j: ||x_i - x_j||² = ||y_i - y_j||² where x are original points and y are embedded points

fit_transform(data, dim, k)[source]

Fit MVU model and transform data to lower dimension.

Combines fit() and transform steps: solves the SDP to get the optimal Gram matrix, then extracts the embedding via eigendecomposition.

Parameters:
  • data (ndarray of shape (n_samples, n_features)) – High-dimensional input data.

  • dim (int) – Target dimensionality for the embedding.

  • k (int) – Number of nearest neighbors to preserve distances for.

Returns:

Low-dimensional embedding of the data.

Return type:

ndarray of shape (n_samples, dim)

Raises:
  • ValueError – If k-NN graph has disconnected components (from fit). If requested dimension exceeds data samples.

  • ImportError – If cvxpy is not installed.

Examples

>>> import numpy as np
>>> # Create Swiss roll data
>>> n_samples = 100
>>> noise = 0.05
>>> t = 3 * np.pi * (1 + 2 * np.random.rand(n_samples))
>>> x = t * np.cos(t)
>>> y = 30 * np.random.rand(n_samples)
>>> z = t * np.sin(t)
>>> data = np.column_stack([x, y, z]) + noise * np.random.randn(n_samples, 3)
>>> # Apply MVU
>>> mvu = MaximumVarianceUnfolding(equation='berkley', solver_tol=1e-3)
>>> embedding = mvu.fit_transform(data, dim=2, k=7)
>>> print(f"Embedded shape: {embedding.shape}")
Embedded shape: (100, 2)

Notes

The embedding is obtained by: 1. Solving SDP to get Gram matrix Q 2. Eigendecomposition: Q = V * Lambda * V^T 3. Embedding: Y = sqrt(Lambda_top) * V_top^T where Lambda_top and V_top are the top ‘dim’ eigenvalues/vectors.

Small negative eigenvalues may appear due to numerical errors in the SDP solution. These are thresholded to zero using eig_tol parameter.