Data Utilities

driada.utils.data.create_correlated_gaussian_data(n_features=10, n_samples=10000, correlation_pairs=None, seed=42)

Generate multivariate Gaussian data with specified correlations.

Creates synthetic data from a multivariate normal distribution with specified correlations between features. The data has zero mean and correlations specified by correlation_pairs.

Parameters:

n_features (int, optional) – Number of features (dimensions). Must be positive. Default is 10.
n_samples (int, optional) – Number of samples to generate. Must be non-negative. Default is 10000.
correlation_pairs (list of tuples or None, optional) – List of (i, j, correlation) tuples specifying correlated features. Indices i, j should be in range [0, n_features). Correlations must be in [-1, 1]. Out-of-bounds indices are silently ignored. If None, uses default pattern: [(1, 9, 0.9), (2, 8, 0.8), (3, 7, 0.7)].
seed (int, optional) – Random seed for reproducibility. Sets global numpy random state. Default is 42.

Returns:

data (np.ndarray) – Data array of shape (n_features, n_samples) with samples as columns.
cov_matrix (np.ndarray) – Correlation matrix used to generate the data, of shape (n_features, n_features). Positive definite, with 1s on the diagonal.

Raises:

ValueError – If n_features <= 0 or n_samples < 0. If any correlation value is outside [-1, 1].

Notes

This function modifies the global numpy random state via np.random.seed(). For thread-safe random generation, consider using numpy.random.Generator.

The correlation matrix is made positive definite if needed by adding a small value to the diagonal.

Examples

>>> data, corr = create_correlated_gaussian_data(n_features=3, n_samples=100)
>>> data.shape
(3, 100)
>>> np.allclose(corr.diagonal(), 1.0)
True
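
The Notes above suggest numpy.random.Generator when the global random state should stay untouched. The following is a minimal sketch of that idea, not the library's implementation: the helper name sample_correlated_gaussian is hypothetical, and skipping pairs with i == j is an assumption made so the diagonal stays at 1.

```python
import numpy as np

def sample_correlated_gaussian(n_features=10, n_samples=10000,
                               correlation_pairs=None, seed=42):
    """Hypothetical helper: zero-mean Gaussian data with pairwise correlations."""
    if correlation_pairs is None:
        correlation_pairs = [(1, 9, 0.9), (2, 8, 0.8), (3, 7, 0.7)]
    corr = np.eye(n_features)
    for i, j, rho in correlation_pairs:
        if not -1.0 <= rho <= 1.0:
            raise ValueError(f"correlation must be in [-1, 1], got {rho}")
        # Out-of-bounds indices are skipped, as documented above.
        # Skipping i == j is an assumption of this sketch.
        if 0 <= i < n_features and 0 <= j < n_features and i != j:
            corr[i, j] = corr[j, i] = rho
    # A local Generator leaves the global numpy random state unchanged.
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(np.zeros(n_features), corr, size=n_samples)
    return samples.T, corr  # features as rows, samples as columns

data, corr = sample_correlated_gaussian(n_features=10, n_samples=20000)
empirical = np.corrcoef(data)  # rows are treated as variables
```

With enough samples, empirical[1, 9] should land close to the requested 0.9. This sketch does not repair a correlation matrix that is not positive definite.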

driada.utils.data.populate_nested_dict(content, outer, inner)

Create a nested dictionary with specified structure and content.

Creates a two-level nested dictionary where each outer key maps to a dictionary of inner keys, and each inner key maps to a copy of the provided content.

Parameters:

content (dict or any copyable object) – The content to populate at each leaf of the nested dictionary. Must have a .copy() method. Will be copied for each inner key to avoid aliasing.
outer (list or iterable) – Keys for the outer level of the nested dictionary.
inner (list or iterable) – Keys for the inner level of the nested dictionary.

Returns:

Nested dictionary with structure {outer_key: {inner_key: content_copy}}.

Return type:

dict

Raises:

AttributeError – If content does not have a .copy() method.

Notes

Duplicate keys in outer or inner iterables will overwrite previous values.
The content must have a .copy() method (e.g., dict, list, numpy array).
Primitive types (int, str, float) will raise AttributeError.

Examples

>>> content = {'value': 0, 'count': 0}
>>> outer = ['A', 'B']
>>> inner = ['x', 'y', 'z']
>>> nested = populate_nested_dict(content, outer, inner)
>>> nested['A']['x']
{'value': 0, 'count': 0}
>>> # Each entry is a separate copy
>>> nested['A']['x']['value'] = 5
>>> nested['B']['x']['value']
0
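
The documented behavior can be approximated with a nested dict comprehension. This is a sketch under that assumption, and populate_nested_dict_sketch is a hypothetical name rather than the library function.

```python
def populate_nested_dict_sketch(content, outer, inner):
    """Hypothetical equivalent: one content.copy() per (outer, inner) pair."""
    inner = list(inner)  # materialize so a one-shot iterable works for every outer key
    return {o: {i: content.copy() for i in inner} for o in outer}

nested = populate_nested_dict_sketch({'value': 0}, ['A', 'B'], ['x', 'y'])
nested['A']['x']['value'] = 5  # only this leaf changes
```

Note that .copy() on a dict or list is shallow, so mutable values nested inside content would still be shared between leaves; copy.deepcopy is the usual alternative when that matters.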

driada.utils.data.nested_dict_to_seq_of_tables(datadict, ordered_names1=None, ordered_names2=None)

Convert a nested dictionary to a sequence of 2D tables.

Transforms a nested dictionary with structure {outer: {inner: {key: value}}} into a dictionary of 2D numpy arrays where rows correspond to outer keys and columns to inner keys.

Parameters:

datadict (dict) – Nested dictionary with three levels. Structure should be: {outer_key: {inner_key: {data_key: value}}}
ordered_names1 (list or None, optional) – Ordered list of outer keys to use as row indices. If None, uses sorted outer keys. Default is None.
ordered_names2 (list or None, optional) – Ordered list of inner keys to use as column indices. If None, uses sorted inner keys. Default is None.

Returns:

Dictionary mapping data keys to 2D numpy arrays where:
- Rows correspond to ordered_names1 (outer keys)
- Columns correspond to ordered_names2 (inner keys)
- Values are from the nested dictionary

Return type:

dict

Raises:

ValueError – If datadict is empty.
IndexError – If datadict structure is inconsistent.

Notes

Missing values are filled with np.nan, not zeros.
Assumes all inner dictionaries have the same data keys.
Uses the first entry to determine the data key structure.

Examples

>>> data = {
...     'A': {'x': {'metric1': 1, 'metric2': 2},
...           'y': {'metric1': 3, 'metric2': 4}},
...     'B': {'x': {'metric1': 5, 'metric2': 6},
...           'y': {'metric1': 7, 'metric2': 8}}
... }
>>> tables = nested_dict_to_seq_of_tables(data)
>>> tables['metric1']
array([[1., 3.],
       [5., 7.]])
>>> # Rows are ['A', 'B'], columns are ['x', 'y']
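
To illustrate the NaN filling described in the Notes, here is a minimal sketch of the same transformation. The name tables_from_nested and its exact handling of missing cells are assumptions, not the library's code.

```python
import numpy as np

def tables_from_nested(datadict, rows=None, cols=None):
    """Hypothetical sketch: one 2D table per data key, NaN where a cell is missing."""
    if not datadict:
        raise ValueError("datadict is empty")
    rows = sorted(datadict) if rows is None else list(rows)
    if cols is None:
        cols = sorted({k for inner in datadict.values() for k in inner})
    else:
        cols = list(cols)
    # The first entry determines the data keys, as stated in the Notes.
    first_outer = next(iter(datadict.values()))
    first_entry = next(iter(first_outer.values()))
    tables = {key: np.full((len(rows), len(cols)), np.nan) for key in first_entry}
    for r, outer in enumerate(rows):
        for c, inner in enumerate(cols):
            entry = datadict.get(outer, {}).get(inner)
            if entry is None:
                continue  # cell stays NaN
            for key in tables:
                if key in entry:
                    tables[key][r, c] = entry[key]
    return tables

data = {'A': {'x': {'m': 1}, 'y': {'m': 3}},
        'B': {'x': {'m': 5}}}  # ('B', 'y') is missing
tables = tables_from_nested(data)
```

Here tables['m'] has rows ['A', 'B'] and columns ['x', 'y'], and the missing ('B', 'y') cell is NaN.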

driada.utils.data.add_names_to_nested_dict(datadict, names1, names2)

Replace numeric keys in a nested dictionary with meaningful names.

Takes a nested dictionary with integer keys at two levels and replaces them with provided names. Useful for converting indexed data structures to named structures for better readability and access.

Parameters:

datadict (dict) – Nested dictionary with integer keys at two levels. Expected structure is datadict[i][j] where i and j are integers starting from 0.
names1 (list-like or None) – Names to use for the first level keys. If None, uses range(n1) where n1 is the number of first-level keys. Length must match the number of first-level keys in datadict.
names2 (list-like or None) – Names to use for the second level keys. If None, uses range(n2) where n2 is the number of second-level keys. Length must match the number of second-level keys in datadict.

Returns:

New nested dictionary with the same data but keys replaced by names. If both names1 and names2 are None, returns the original dict unchanged. Structure: renamed_dict[name1][name2] contains datadict[i][j].

Return type:

dict

Raises:

KeyError – If integer keys are not consecutive starting from 0.
ValueError – If names length doesn’t match number of keys.

Examples

>>> data = {0: {0: {'value': 1}, 1: {'value': 2}},
...         1: {0: {'value': 3}, 1: {'value': 4}}}
>>> names1 = ['row1', 'row2']
>>> names2 = ['col1', 'col2']
>>> renamed = add_names_to_nested_dict(data, names1, names2)
>>> renamed['row1']['col1']
{'value': 1}

Notes

Requires consecutive integer keys starting from 0.
Uses .update() to merge inner dictionaries.
Returns original dict if both names are None.
Assumes all outer keys contain the same inner keys (e.g., if datadict[0] has keys [0,1,2], then datadict[1], datadict[2], etc. must also have keys [0,1,2]).
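
The renaming can be sketched as follows. rename_nested_keys is a hypothetical helper that builds the result with comprehensions instead of .update(), so it only approximates the behavior described above.

```python
def rename_nested_keys(datadict, names1=None, names2=None):
    """Hypothetical sketch of replacing integer keys 0..n-1 with names."""
    if names1 is None and names2 is None:
        return datadict  # documented: original dict is returned unchanged
    n1 = len(datadict)
    n2 = len(datadict[0])  # KeyError if outer keys do not start at 0
    names1 = list(range(n1)) if names1 is None else list(names1)
    names2 = list(range(n2)) if names2 is None else list(names2)
    if len(names1) != n1 or len(names2) != n2:
        raise ValueError("names length does not match number of keys")
    # datadict[i][j] raises KeyError when integer keys are not consecutive.
    return {
        names1[i]: {names2[j]: datadict[i][j] for j in range(n2)}
        for i in range(n1)
    }

data = {0: {0: {'value': 1}, 1: {'value': 2}},
        1: {0: {'value': 3}, 1: {'value': 4}}}
renamed = rename_nested_keys(data, ['row1', 'row2'], ['col1', 'col2'])
```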

driada.utils.data.retrieve_relevant_from_nested_dict(nested_dict, target_key, target_value, operation='=', allow_missing_keys=False)

Find all (outer_key, inner_key) pairs where a condition is met.

Searches through a nested dictionary structure to find all locations where a specific key-value condition is satisfied. Useful for filtering nested data structures based on criteria.

Parameters:

nested_dict (dict) – Nested dictionary with structure {outer_key: {inner_key: data_dict}}, where data_dict contains the key-value pairs to search.
target_key (str) – The key to look for in the innermost dictionaries.
target_value (any) – The value to compare against. Must be comparable with stored values when using “>” or “<” operations.
operation ({"=", ">", "<"}, default="=") – Comparison operation to use:
- “=” : Find entries where target_key equals target_value
- “>” : Find entries where target_key is greater than target_value
- “<” : Find entries where target_key is less than target_value
allow_missing_keys (bool, default=False) – If True, skip entries where target_key is missing instead of raising an error. Missing keys are treated as not matching any criteria.

Returns:

List of (outer_key, inner_key) tuples identifying locations where the condition is satisfied.

Return type:

list of tuples

Raises:

ValueError – If target_key is not found and allow_missing_keys is False, or if an invalid operation is specified.
TypeError – If comparison operations “>” or “<” are used with incomparable types (will propagate from Python’s comparison).

Examples

>>> data = {
...     'exp1': {'cell1': {'score': 0.8, 'type': 'A'},
...              'cell2': {'score': 0.6, 'type': 'B'}},
...     'exp2': {'cell1': {'score': 0.9, 'type': 'A'},
...              'cell2': {'score': 0.7, 'type': 'B'}}
... }
>>> retrieve_relevant_from_nested_dict(data, 'score', 0.7, '>')
[('exp1', 'cell1'), ('exp2', 'cell1')]
>>> retrieve_relevant_from_nested_dict(data, 'type', 'A', '=')
[('exp1', 'cell1'), ('exp2', 'cell1')]

Notes

For “>” and “<” operations:
- Missing keys (when allow_missing_keys=True) are treated as not matching
- None values are treated as not matching (since None comparisons would fail)
- Incomparable types (e.g., string vs number) will raise TypeError
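
The matching rules above can be sketched with the standard operator module. find_matches is a hypothetical name, and the error messages are illustrative rather than the library's own.

```python
import operator

_OPS = {"=": operator.eq, ">": operator.gt, "<": operator.lt}

def find_matches(nested_dict, target_key, target_value,
                 operation="=", allow_missing_keys=False):
    """Hypothetical sketch: collect (outer_key, inner_key) pairs meeting a condition."""
    if operation not in _OPS:
        raise ValueError(f"Unknown operation: {operation!r}")
    compare = _OPS[operation]
    matches = []
    for outer_key, inner in nested_dict.items():
        for inner_key, record in inner.items():
            if target_key not in record:
                if allow_missing_keys:
                    continue  # missing key never matches
                raise ValueError(f"{target_key!r} missing at {(outer_key, inner_key)}")
            value = record[target_key]
            if operation != "=" and value is None:
                continue  # None is treated as not matching for ">" and "<"
            if compare(value, target_value):  # may raise TypeError for mixed types
                matches.append((outer_key, inner_key))
    return matches

data = {
    'exp1': {'cell1': {'score': 0.8}, 'cell2': {'score': 0.6}},
    'exp2': {'cell1': {'score': 0.9}, 'cell2': {'score': None}},
}
above = find_matches(data, 'score', 0.7, '>')
```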

driada.utils.data.rescale(data)

Rescale 1D data to the range [0, 1] using min-max normalization.

Applies min-max scaling to transform data linearly so that the minimum value becomes 0 and the maximum value becomes 1. Useful for normalizing time series or feature vectors to a common scale.

Parameters:

data (array-like) – Input data to rescale. Must be 1-dimensional.

Returns:

Rescaled data with same length as input, values in [0, 1]. If input min equals max, returns array of 0.5 values.

Return type:

ndarray

Raises:

ValueError – If input data has more than 1 dimension.

Notes

Uses sklearn’s MinMaxScaler internally. The transformation is: X_scaled = (X - X.min()) / (X.max() - X.min())
Constant arrays (where all values are equal) return 0.5.
NaN values are preserved in the output.
Single element arrays return 0.0.

Examples

>>> data = np.array([1, 2, 3, 4, 5])
>>> rescale(data)
array([0.  , 0.25, 0.5 , 0.75, 1.  ])

>>> # Attempting to rescale 2D data raises an error
>>> data2d = np.array([[1, 5], [2, 4]])
>>> try:
...     rescale(data2d)
... except ValueError as e:
...     print(f"Error: {e}")
Error: Input data must be 1-dimensional, got shape (2, 2)
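
The transformation stated in the Notes can be written directly in NumPy. This sketch (minmax_rescale is a hypothetical name) covers only non-constant input and does not use MinMaxScaler; constant input is deliberately left out of scope.

```python
import numpy as np

def minmax_rescale(data):
    """Hypothetical sketch of X_scaled = (X - X.min()) / (X.max() - X.min())."""
    x = np.asarray(data, dtype=float)
    if x.ndim != 1:
        raise ValueError(f"Input data must be 1-dimensional, got shape {x.shape}")
    # nanmin/nanmax ignore NaN, so NaN entries stay NaN in the output.
    lo, hi = np.nanmin(x), np.nanmax(x)
    if hi == lo:
        raise ValueError("constant input is not covered by this sketch")
    return (x - lo) / (hi - lo)

scaled = minmax_rescale([1, 2, 3, 4, 5])
with_nan = minmax_rescale([1.0, np.nan, 3.0])
```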

driada.utils.data.get_hash(data)

Create a hash of numpy array or other data.

Parameters:

data (np.ndarray or other) – Data to hash. For arrays, uses the raw bytes. For other types, converts to string first.

Returns:

Hexadecimal hash string

Return type:

str

Notes

For numpy arrays: includes shape and dtype in hash, so same data with different shape produces different hashes.
For non-arrays: uses str() representation which may vary across Python versions and implementations.
Byte order affects the hash for arrays.
Uses SHA256 algorithm with UTF-8 encoding.

Examples

>>> arr = np.array([1, 2, 3])
>>> hash1 = get_hash(arr)
>>> arr_reshaped = arr.reshape(3, 1)
>>> hash2 = get_hash(arr_reshaped)
>>> hash1 == hash2  # Different shapes produce different hashes
False
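
One way to obtain the properties listed in the Notes is to feed shape, dtype, and raw bytes into SHA256. The sketch below assumes that layout; array_hash is a hypothetical name and its digests will not necessarily match those of get_hash.

```python
import hashlib
import numpy as np

def array_hash(data):
    """Hypothetical sketch: SHA256 over shape, dtype, and raw bytes of an array."""
    h = hashlib.sha256()
    if isinstance(data, np.ndarray):
        h.update(str(data.shape).encode("utf-8"))
        h.update(str(data.dtype).encode("utf-8"))
        # tobytes() on a C-contiguous copy gives a layout-independent byte string.
        h.update(np.ascontiguousarray(data).tobytes())
    else:
        h.update(str(data).encode("utf-8"))  # str() form may vary across versions
    return h.hexdigest()

arr = np.array([1, 2, 3])
h1 = array_hash(arr)
h2 = array_hash(arr.reshape(3, 1))  # same bytes, different shape
```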

driada.utils.data.phase_synchrony(vec1, vec2)

Calculate instantaneous phase synchrony between two signals.

Computes phase synchrony using the Hilbert transform to extract instantaneous phases, then measures phase coupling using a sine-based metric that ranges from 0 (no synchrony) to 1 (perfect synchrony).

Parameters:

vec1 (array-like) – First signal, must be 1D array of same length as vec2.
vec2 (array-like) – Second signal, must be 1D array of same length as vec1.

Returns:

Phase synchrony values at each time point, ranging from 0 to 1. Same length as input signals.

Return type:

ndarray

Raises:

ValueError – If input signals have different lengths.

Notes

The algorithm:
1. Applies Hilbert transform to get analytic signals
2. Extracts instantaneous phases using np.angle()
3. Computes phase difference at each time point
4. Maps to [0,1] using: 1 - sin(abs(Δφ)/2)

This metric is 1 when phases are aligned (Δφ = 0) and 0 when maximally misaligned (Δφ = π).

Edge effects from Hilbert transform may affect boundary values.
Designed for real-valued signals.
NaN values propagate through the calculation.
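
The four steps of the algorithm can be sketched with NumPy alone. This version builds the analytic signal with an FFT instead of calling scipy.signal.hilbert, so it is an approximation of the documented procedure; both function names are hypothetical.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: zero the negative frequencies, double the positive ones."""
    x = np.asarray(x, dtype=float)
    n = x.size
    weights = np.zeros(n)
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0       # Nyquist bin kept once
        weights[1:n // 2] = 2.0
    else:
        weights[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(np.fft.fft(x) * weights)

def phase_synchrony_sketch(vec1, vec2):
    """Hypothetical sketch of 1 - sin(|delta_phi| / 2) at each time point."""
    v1 = np.asarray(vec1, dtype=float)
    v2 = np.asarray(vec2, dtype=float)
    if v1.shape != v2.shape:
        raise ValueError("input signals must have the same length")
    phase1 = np.angle(analytic_signal(v1))
    phase2 = np.angle(analytic_signal(v2))
    return 1.0 - np.sin(np.abs(phase1 - phase2) / 2.0)

# Whole number of cycles, so the FFT-based phases are free of edge effects.
t = np.linspace(0, 1, 1000, endpoint=False)
s1 = np.sin(2 * np.pi * 10 * t)
s2 = np.sin(2 * np.pi * 10 * t + np.pi / 4)
sync = phase_synchrony_sketch(s1, s2)
```

For a constant phase offset of π/4 the metric evaluates to 1 - sin(π/8) at every sample, and identical signals give 1 throughout.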

Examples

>>> t = np.linspace(0, 1, 1000)
>>> signal1 = np.sin(2 * np.pi * 10 * t)  # 10 Hz
>>> signal2 = np.sin(2 * np.pi * 10 * t + np.pi/4)  # 10 Hz, phase shifted
>>> sync = phase_synchrony(signal1, signal2)
>>> round(np.mean(sync), 2)  # High synchrony despite phase shift
0.62

See also

scipy.signal.hilbert: Hilbert transform used to extract phases.

driada.utils.data.correlation_matrix(A)

Compute Pearson correlation matrix between variables (rows).

Parameters:

A (numpy array of shape (n_variables, n_observations)) – Data matrix where each row is a variable

Returns:

Correlation matrix

Return type:

numpy array of shape (n_variables, n_variables)

Notes

Variables with zero variance (constant values) are handled by setting their correlation to 1.0 with themselves and NaN with other variables.
Single observation (n=1) returns NaN matrix as correlation is undefined.
Uses row-wise computation (variables are rows).

Examples

>>> A = np.array([[1, 2, 3], [4, 5, 6]])
>>> corr = correlation_matrix(A)
>>> corr.shape
(2, 2)

driada.utils.data.cross_correlation_matrix(A, B)

Compute cross-correlation matrix between two sets of variables.

Computes Pearson correlations between variables (rows) in A and variables (rows) in B.

Parameters:

A (numpy array of shape (n_variables1, n_observations)) – First data matrix where each row is a variable
B (numpy array of shape (n_variables2, n_observations)) – Second data matrix where each row is a variable

Returns:

Cross-correlation matrix where element [i,j] is the correlation between A[i,:] and B[j,:]

Return type:

numpy array of shape (n_variables1, n_variables2)

Raises:

ValueError – If A and B have different numbers of observations (columns).

Notes

Uses row-wise computation (variables are rows), consistent with correlation_matrix.
Variables with zero variance will result in NaN correlations.
Centers data by row means.

Examples

>>> A = np.array([[1, 2, 3], [4, 5, 6]])  # 2 variables, 3 observations
>>> B = np.array([[7, 8, 9], [10, 11, 12]])  # 2 variables, 3 observations
>>> cross_corr = cross_correlation_matrix(A, B)
>>> cross_corr.shape
(2, 2)
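
The row-wise computation described in the Notes (center each row, then normalize) can be sketched as below. cross_corr_rows is a hypothetical name, and the result is checked here against np.corrcoef rather than against the library.

```python
import numpy as np

def cross_corr_rows(A, B):
    """Hypothetical sketch: Pearson correlation between rows of A and rows of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape[1] != B.shape[1]:
        raise ValueError("A and B must have the same number of observations")
    A_c = A - A.mean(axis=1, keepdims=True)  # center by row means
    B_c = B - B.mean(axis=1, keepdims=True)
    norm_a = np.sqrt((A_c ** 2).sum(axis=1))
    norm_b = np.sqrt((B_c ** 2).sum(axis=1))
    # Zero-variance rows give 0/0, which yields NaN as documented.
    with np.errstate(divide="ignore", invalid="ignore"):
        return (A_c @ B_c.T) / np.outer(norm_a, norm_b)

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 50))
B = rng.normal(size=(3, 50))
cc = cross_corr_rows(A, B)
reference = np.corrcoef(np.vstack([A, B]))[:2, 2:]  # off-diagonal block
```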

driada.utils.data.norm_cross_corr(a, b, mode='full')

Compute normalized cross-correlation between two signals.

This function computes the normalized cross-correlation between two signals. Each overlapping window is normalized independently to have zero mean and unit variance, making the correlation insensitive to amplitude scaling and DC offset.

Parameters:

a (array_like) – First signal
b (array_like) – Second signal
mode ({'full', 'valid', 'same'}, optional) – Mode parameter controlling output size. Default is ‘full’.
- ‘full’: returns correlation at all lags (length: len(a) + len(b) - 1)
- ‘valid’: returns only correlations without zero-padding (length: max(len(a) - len(b) + 1, 1))
- ‘same’: returns correlation of same length as first input (length: len(a))

Returns:

Normalized cross-correlation values. Values are in the range [-1, 1]. The lag index can be computed as: lag = index - (len(a) - 1) for ‘full’ mode.

Return type:

np.ndarray

Raises:

ValueError – If mode is not one of ‘full’, ‘valid’, or ‘same’. If either input signal is empty.

Notes

The normalization is performed on each overlapping window separately, ensuring that correlation values are in the range [-1, 1], where:
- 1 indicates perfect positive correlation
- -1 indicates perfect negative correlation (anti-correlation)
- 0 indicates no correlation

For constant signals (zero variance), the function returns zeros.

Examples

>>> # Detect time shift between signals
>>> signal = np.sin(np.linspace(0, 4*np.pi, 100))
>>> shifted = np.roll(signal, 10)  # Shift by 10 samples
>>> corr = norm_cross_corr(signal, shifted, mode='full')
>>> lag = np.argmax(corr) - (len(signal) - 1)  # Should be close to -10
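
The per-window normalization can be illustrated with an explicit loop over lags. This is a slow sketch for the 'full' case only; windowed_norm_xcorr is a hypothetical name, and the lag convention (pairing a[n + k] with b[n]) is an assumption that may differ from the library's.

```python
import numpy as np

def windowed_norm_xcorr(a, b):
    """Hypothetical sketch: Pearson correlation of the overlapping samples at each lag."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.size == 0 or b.size == 0:
        raise ValueError("input signals must be non-empty")
    lags = np.arange(-(b.size - 1), a.size)
    values = np.zeros(lags.size)
    for idx, k in enumerate(lags):
        # Overlap at lag k pairs a[n + k] with b[n].
        n_start = max(0, -k)
        n_stop = min(b.size, a.size - k)
        seg_a = a[n_start + k:n_stop + k]
        seg_b = b[n_start:n_stop]
        std_a, std_b = seg_a.std(), seg_b.std()
        if std_a > 0 and std_b > 0:  # constant windows stay at 0
            centered = (seg_a - seg_a.mean()) * (seg_b - seg_b.mean())
            values[idx] = centered.mean() / (std_a * std_b)
    return lags, values

rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = 3.0 * a + 5.0  # rescaled and offset copy of a
lags, values = windowed_norm_xcorr(a, b)
```

At lag 0 the value is 1 despite the scaling and offset. In this sketch, windows that contain only two samples always score exactly +1 or -1, so the extreme lags are best masked before searching for a peak.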

driada.utils.data.to_numpy_array(data)

Convert various data types to numpy array.

Handles numpy arrays, sparse matrices, and other array-like objects.

Parameters:

data (array_like, sparse matrix, or any object) – Input data to convert to numpy array. Can be:
- numpy array (returned as-is)
- scipy sparse matrix (converted to dense)
- list, tuple, or other array-like object

Returns:

Dense numpy array representation of the input data.

Return type:

numpy.ndarray

Warning

Converting large sparse matrices to dense arrays can cause memory issues. Consider the memory implications before converting sparse data.

Examples

>>> import scipy.sparse as sp
>>> sparse_data = sp.csr_matrix([[1, 0, 0], [0, 2, 0]])
>>> dense = to_numpy_array(sparse_data)
>>> dense  # already a dense numpy.ndarray
array([[1, 0, 0],
       [0, 2, 0]])

driada.utils.data.remove_outliers(data, method='zscore', threshold=3.0, quantile_range=(0.05, 0.95))

Remove outliers from data using various detection strategies.

Parameters:

data (np.ndarray) – 1D array of data points.
method (str, default='zscore') – Outlier detection method:
- ‘zscore’: Remove points beyond threshold standard deviations from mean
- ‘iqr’: Interquartile range method (1.5*IQR beyond Q1/Q3)
- ‘mad’: Median absolute deviation method
- ‘quantile’: Remove points outside specified quantile range
- ‘isolation’: Local outlier factor based on density
threshold (float, default=3.0) – Detection threshold:
- For ‘zscore’: number of standard deviations
- For ‘iqr’: multiplier for IQR (typically 1.5)
- For ‘mad’: number of median absolute deviations
- For ‘isolation’: contamination fraction (0-0.5)
quantile_range (tuple, default=(0.05, 0.95)) – For ‘quantile’ method: (lower_quantile, upper_quantile)

Returns:

inlier_indices (np.ndarray) – Indices of non-outlier points.
clean_data (np.ndarray) – Data with outliers removed.

Raises:

ValueError – If method is not recognized.
ImportError – If ‘isolation’ method is used but scikit-learn is not installed.

Notes

Input data is flattened with ravel().
For constant data (zero variance), all points are considered inliers.
MAD method uses scaling factor 1.4826 for consistency with normal distribution.
Isolation method uses random_state=42 for reproducibility.

Examples

>>> data = np.array([1, 2, 3, 100, 4, 5])  # 100 is outlier
>>> # Z-score method
>>> indices, cleaned = remove_outliers(data, method='zscore', threshold=2)
>>> cleaned
array([1, 2, 3, 4, 5])

>>> # IQR method
>>> indices, cleaned = remove_outliers(data, method='iqr', threshold=1.5)

>>> # Quantile method
>>> indices, cleaned = remove_outliers(data, method='quantile', quantile_range=(0.1, 0.9))
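
The three statistical rules can be sketched in plain NumPy. inlier_mask is a hypothetical helper; the use of the population standard deviation and of inclusive bounds are assumptions, so borderline points may be classified differently by the library.

```python
import numpy as np

def inlier_mask(data, method="zscore", threshold=3.0):
    """Hypothetical sketch of the 'zscore', 'iqr' and 'mad' rules."""
    x = np.asarray(data, dtype=float).ravel()
    if method == "zscore":
        spread = x.std()  # population std (ddof=0), an assumption
        if spread == 0:
            return np.ones(x.size, dtype=bool)  # constant data: all inliers
        return np.abs(x - x.mean()) / spread <= threshold
    if method == "iqr":
        q1, q3 = np.percentile(x, [25, 75])
        iqr = q3 - q1
        return (x >= q1 - threshold * iqr) & (x <= q3 + threshold * iqr)
    if method == "mad":
        median = np.median(x)
        mad = 1.4826 * np.median(np.abs(x - median))  # normal-consistency factor
        if mad == 0:
            return np.ones(x.size, dtype=bool)
        return np.abs(x - median) / mad <= threshold
    raise ValueError(f"Unknown method: {method!r}")

data = np.array([1, 2, 3, 100, 4, 5])
mask = inlier_mask(data, method="zscore", threshold=2)
indices = np.flatnonzero(mask)  # positions of the retained points
cleaned = data[mask]
```

On this data all three rules drop the value 100. With the default z-score threshold of 3.0 it would be kept, because a single extreme point in a sample of six inflates the standard deviation.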

driada.utils.data.check_nonnegative(**kwargs)

Check that all provided parameters are non-negative.

Validates that numeric parameters are >= 0. Useful for input validation in functions that require non-negative values (counts, rates, probabilities).

Parameters:

**kwargs – Parameter name to value mappings. All values should be numeric.

Raises:

ValueError – If any parameter value is negative, NaN, or infinite. Error message includes parameter name and value.

Examples

>>> check_nonnegative(n_neurons=10, rate=0.5)  # No error

>>> check_nonnegative(n_neurons=10, rate=-0.5)
Traceback (most recent call last):
    ...
ValueError: rate must be non-negative, got -0.5

>>> import numpy as np
>>> check_nonnegative(count=5, prob=np.nan)
Traceback (most recent call last):
    ...
ValueError: prob cannot be NaN

driada.utils.data.check_unit(left_open=False, right_open=False, **kwargs)

Check that all provided parameters are in the unit interval.

Validates that numeric parameters are within [0, 1] with configurable bound inclusion. Useful for probabilities, fractions, and normalized values.

Parameters:

left_open (bool, optional) – If True, left bound is open (0, 1]. If False, closed [0, 1]. Default: False.
right_open (bool, optional) – If True, right bound is open [0, 1). If False, closed [0, 1]. Default: False.
**kwargs – Parameter name to value mappings. All values should be numeric in [0, 1].

Raises:

ValueError – If any parameter value is outside the unit interval, NaN, or infinite. Error message includes parameter name, value, and expected range.

Examples

>>> check_unit(probability=0.5, fraction=0.8)  # No error

>>> check_unit(probability=1.5)
Traceback (most recent call last):
    ...
ValueError: probability must be in [0, 1], got 1.5

>>> check_unit(left_open=True, rate=0.0)
Traceback (most recent call last):
    ...
ValueError: rate must be in (0, 1], got 0.0

>>> check_unit(left_open=True, right_open=True, value=0.5)  # No error
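
The open and closed bound handling can be sketched as follows. check_unit_sketch is a hypothetical stand-in: its range message follows the examples above, while the NaN message is borrowed from the check_nonnegative example and is an assumption here.

```python
import math

def check_unit_sketch(left_open=False, right_open=False, **kwargs):
    """Hypothetical sketch: validate that every value lies in the unit interval."""
    left = "(" if left_open else "["
    right = ")" if right_open else "]"
    interval = f"{left}0, 1{right}"
    for name, value in kwargs.items():
        if math.isnan(value):
            raise ValueError(f"{name} cannot be NaN")  # assumed message
        below = value <= 0 if left_open else value < 0
        above = value >= 1 if right_open else value > 1
        if below or above:  # infinite values are caught here as well
            raise ValueError(f"{name} must be in {interval}, got {value}")

check_unit_sketch(probability=0.5, fraction=0.8)  # passes silently
try:
    check_unit_sketch(left_open=True, rate=0.0)
except ValueError as err:
    message = str(err)
```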

driada.utils.data.check_positive(**kwargs)

Check that all provided parameters are positive (> 0).

Validates that numeric parameters are strictly positive. Useful for input validation in functions that require positive values (dimensions, sizes).

Parameters:

**kwargs – Parameter name to value mappings. All values should be numeric.

Raises:

ValueError – If any parameter value is not positive, NaN, or infinite. Error message includes parameter name and value.

Examples

>>> check_positive(n_neurons=10, dim=5)  # No error

>>> check_positive(n_neurons=0)
Traceback (most recent call last):
    ...
ValueError: n_neurons must be positive, got 0

>>> check_positive(dim=-5)
Traceback (most recent call last):
    ...
ValueError: dim must be positive, got -5

Utilities for data manipulation, I/O operations, and common data transformations.