TerminatedWorkerError when using GridSearchCV #177

Open
JohannesWiesner opened this issue Jul 26, 2023 · 29 comments

@JohannesWiesner
Contributor

Hi James, with the latest version of cca_zoo I get this error:

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11), SIGSEGV(-11)}

This didn't happen in older versions (although I am using the exact same script). Can you reproduce this? Here's my full code; my X, y, and groups are attached as txt files.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import SCCA_PMD

###############################################################################
## Settings ###################################################################
###############################################################################

n_jobs = 8
pre_dispatch = 3
rng = np.random.RandomState(42)

###############################################################################
## Prepare Analysis ###########################################################
###############################################################################

X = np.loadtxt('X.txt')
y = np.loadtxt('y.txt')
groups = np.loadtxt('groups.txt')

###############################################################################
## Analysis settings ##########################################################
###############################################################################

# define latent dimensions
latent_dimensions = 3

# pretend that there are subject groups in the dataset
cv = GroupShuffleSplit(n_splits=10,train_size=0.7,random_state=rng)

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[np.arange(0.1,1.1,0.1),0]}

# define an estimator
estimator = SCCA_PMD(latent_dimensions=latent_dimensions,random_state=rng)

##############################################################################
## Run GridSearch
##############################################################################

def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=n_jobs, pre_dispatch=pre_dispatch, cv=cv)
grid.fit([X,y],groups=groups)

Data:

groups.txt
X.txt
y.txt

Note that X and y have been normalized prior to GridSearch, so each fold "sees" different batches of the normalized dataset. Not sure if this is related to #175

@jameschapman19
Owner

OK, I haven't managed to replicate this exactly on my (Windows) laptop or on Colab. Will try Linux later.

In the meantime, a small change to your code is to go with:

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}

Instead of

# define a search space (optimize left and right penalty parameters)
param_grid = {'tau':[np.arange(0.1,1.1,0.1),0]}

I'm not sure if the code previously supported numpy arrays and I lost this support in a refactor or if it's always been this way. I think I should be able to add the support back relatively easily.
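
For illustration, a minimal sketch of one way that support could be added back - normalizing any numpy arrays in the grid to plain lists before the search (normalize_param_grid is a hypothetical helper, not actual cca_zoo code):

import numpy as np

def normalize_param_grid(param_grid):
    # Convert numpy arrays (top-level or nested one level deep) to plain lists
    # so every grid value is an ordinary Python iterable.
    normalized = {}
    for key, value in param_grid.items():
        if isinstance(value, np.ndarray):
            normalized[key] = value.tolist()
        elif isinstance(value, (list, tuple)):
            normalized[key] = [v.tolist() if isinstance(v, np.ndarray) else v for v in value]
        else:
            normalized[key] = value
    return normalized

param_grid = normalize_param_grid({'tau': [np.arange(0.1, 1.1, 0.1), 0]})
# {'tau': [[0.1, 0.2, ..., 1.0], 0]}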

It's possible that this alone will fix your example, because multiprocessing can give confusing error codes whenever there is a bug.

@JohannesWiesner
Contributor Author

True, now I remember that we had this issue before. Unfortunately I still get the error, even when using param_grid = {'tau':[list(np.arange(0.1,1.0,0.1)),0]}

P.S.: Maybe it would make sense to open a separate issue for the data types in param_grid? I think it would make sense if lists, numpy arrays, and other iterables were all valid inputs.

@JohannesWiesner
Contributor Author

I then tried to use simulated data and it seems to work with:

import numpy as np
from cca_zoo.data.simulated import LinearSimulatedData

n = 100
p = 10
q = 100
latent_dims = 3

data = LinearSimulatedData(
    view_features=[p, q],
    latent_dims=latent_dims,
    correlation=[0.9, 0.8, 0.7],
    structure='identity'
)
(X, y) = data.sample(n)
groups = np.repeat(np.arange(0, 5, 1), 20)

Here I get LinAlgError: SVD did not converge, but I guess that could stem from a different source (at least GridSearchCV runs through).

@jameschapman19
Owner

P.S.: Maybe it would make sense to open a separate issue for the data types in param_grid? I think it would make sense if lists, numpy arrays, and other iterables were all valid inputs.

Yes, agree

@JohannesWiesner
Contributor Author

I guess I have to take a look at my dataset and check if the error stems from there. Maybe it has something to do with #175, because that should be the only difference here. Right now, my X and y are already normalized before passing them to GridSearchCV.

@JohannesWiesner
Contributor Author

I will implement StandardScaler to mimic the old scale=True behavior and come back to you. The data should not have changed in the meantime, so right now I don't really see why the error should stem from there.

@JohannesWiesner
Contributor Author

Okay, I tested the following code on a Windows machine and it works fine. On both the Windows and the Ubuntu machine I have scikit-learn 1.3.0 and cca-zoo 2.1.0 installed.

import numpy as np
import pandas as pd

from sklearn.model_selection import GroupShuffleSplit
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import rCCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from mvlearn.utils import check_Xs
from sklearn.base import TransformerMixin
from sklearn.utils.validation import check_is_fitted

###############################################################################
## Prepare Analysis ###########################################################
###############################################################################

rng = np.random.RandomState(42)

brain_df = np.loadtxt('brain_df.txt')
behavior_df = np.loadtxt('behavior_df.txt')
groups = np.loadtxt('groups.txt')

###############################################################################
## Analysis settings ##########################################################
###############################################################################

# define latent dimensions
latent_dimensions = 1

# define cross validation strategy
cv = GroupShuffleSplit(n_splits=10,train_size=0.7,random_state=rng)

# define a search space (optimize left and right penalty parameters)
param_grid = {'cca__c':[list(np.arange(0.1,1.1,0.1)),list(np.arange(0.1,1.1,0.1))]}


"""
Class which allows for the different (or the same) processing of multiple views of data.
"""


class MultiViewPreprocessing(TransformerMixin):
    def __init__(self, preprocessing_list):
        self.preprocessing_list = preprocessing_list

    def fit(self, views, y=None):
        """
        Fits the associated preprocessing steps to each view.

        Parameters
        ----------
        views : list of array-likes, one entry per view
        y : ignored, present for scikit-learn API consistency

        Returns
        -------
        self
        """
        if len(self.preprocessing_list) == 1:
            self.preprocessing_list = self.preprocessing_list * len(views)
        elif len(self.preprocessing_list) != len(views):
            raise ValueError("Length of preprocessing_list must be 1 (apply the same preprocessing to each view) or equal to the number of views")
        check_Xs(views, enforce_views=range(len(self.preprocessing_list)))
        for view, preprocessing in zip(views, self.preprocessing_list):
            preprocessing.fit(view, y)
        return self

    def transform(self, X, y=None):
        """
        Transforms each view using the associated preprocessing steps.

        Parameters
        ----------
        X : list of array-likes, one entry per view
        y : ignored, present for scikit-learn API consistency

        Returns
        -------
        list of transformed views
        """
        [check_is_fitted(preprocessing) for preprocessing in self.preprocessing_list]
        check_Xs(X, enforce_views=range(len(self.preprocessing_list)))
        return [preprocessing.transform(view) for view, preprocessing in zip(X, self.preprocessing_list)]


# define an estimator
estimator = Pipeline([
    ('preprocessing', MultiViewPreprocessing((StandardScaler(),StandardScaler()))),
    ('cca',rCCA(latent_dimensions=latent_dimensions,random_state=rng))
    ])

###############################################################################
## Run GridSearch
##############################################################################

def scorer(estimator, views):
    scores = estimator.score(views)
    return np.mean(scores)

grid = GridSearchCV(estimator,param_grid,scoring=scorer,n_jobs=5,cv=cv)
grid.fit([brain_df,behavior_df],groups=groups)
best_params = grid.best_params_
estimator_best = grid.best_estimator_
X_weights,y_weights = estimator_best.weights


print(f"Best parameters are: {best_params}\n")

Data:

behavior_df.txt
brain_df.txt
groups.txt

Could you perhaps also re-check on a Linux machine whether this is an OS issue?

@jameschapman19
Owner

Works fine or doesn't work fine?

@JohannesWiesner
Contributor Author

Ah sorry. Works fine on Windows but not on Ubuntu.

@jameschapman19
Owner

OK - do you have an error message I can see? Otherwise I can try and get one myself XD

@JohannesWiesner
Contributor Author

Here's the complete traceback:

Traceback (most recent call last):

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/work/projects/project_hcp/testing/test_cca.py:108
    grid.fit([brain_df,behavior_df],groups=groups)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/cca_zoo/model_selection/_search.py:208 in fit
    self = BaseSearchCV.fit(self, np.hstack(X), y=y, groups=groups, **fit_params)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/base.py:1151 in wrapper
    return fit_method(estimator, *args, **kwargs)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/model_selection/_search.py:898 in fit
    self._run_search(evaluate_candidates)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/cca_zoo/model_selection/_search.py:199 in _run_search
    evaluate_candidates(param_grid)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/model_selection/_search.py:845 in evaluate_candidates
    out = parallel(

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/sklearn/utils/parallel.py:65 in __call__
    return super().__call__(iterable_with_config)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1944 in __call__
    return output if self.return_generator else list(output)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1587 in _get_outputs
    yield from self._retrieve()

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1691 in _retrieve
    self._raise_error_fast()

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:1726 in _raise_error_fast
    error_job.get_result(self.timeout)

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:735 in get_result
    return self._return_or_raise()

  File ~/micromamba/envs/csp_wiesner_johannes/lib/python3.8/site-packages/joblib/parallel.py:753 in _return_or_raise
    raise self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGSEGV(-11), SIGSEGV(-11)}

@JohannesWiesner
Contributor Author

Okay, tested it on our Linux server. Same error here. Seems to be an OS issue!

@jameschapman19
Owner

OK, I think it's also possible that it's consuming more memory than expected. Will investigate - apologies, and thanks for bringing this to my attention!

@jameschapman19
Owner

I'm thinking trying two things will help diagnose this:

  1. n_jobs=1
  2. removing preprocessing

Option 1 will help work out whether multiprocessing is causing the problem; option 2 will help work out whether the problem is in the pipeline.
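
For example, a quick sketch of both checks, reusing the estimator, data and settings from the script above (variable names as defined there):

import numpy as np
from cca_zoo.model_selection import GridSearchCV
from cca_zoo.linear import rCCA

# 1. Rule out multiprocessing: run the identical search with a single worker
grid_single = GridSearchCV(estimator, param_grid, scoring=scorer, n_jobs=1, cv=cv)
grid_single.fit([brain_df, behavior_df], groups=groups)

# 2. Rule out the pipeline: drop the preprocessing step and fit rCCA directly
#    (the grid keys lose their 'cca__' prefix once the Pipeline is gone)
grid_no_prep = GridSearchCV(
    rCCA(latent_dimensions=latent_dimensions, random_state=rng),
    {'c': [list(np.arange(0.1, 1.1, 0.1)), list(np.arange(0.1, 1.1, 0.1))]},
    scoring=scorer,
    n_jobs=5,
    cv=cv,
)
grid_no_prep.fit([brain_df, behavior_df], groups=groups)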

@JohannesWiesner
Contributor Author

Yup, with n_jobs=1 I don't have this issue, but of course that makes sense because apparently it's a parallelization issue. Removing the preprocessing does not change the error.

@jameschapman19
Owner

Thanks for this. Will have a dig around.

@jameschapman19
Owner

jameschapman19 commented Aug 2, 2023

Hi @JohannesWiesner. From some reading, I'm thinking this comes down to the scipy/numpy versions, because nothing substantive has changed in rCCA that could have caused this (it essentially does a lossless PCA [keeping all components] for efficiency and then sets up an eigenvalue problem which it hands to scipy).

So I think if you give updating scipy/numpy a go, that might work?
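
For comparing environments, a small snippet like this (standard-library importlib.metadata, Python 3.8+) prints the installed versions:

import importlib.metadata as md

for pkg in ("numpy", "scipy", "scikit-learn", "joblib", "cca-zoo"):
    try:
        print(pkg, md.version(pkg))
    except md.PackageNotFoundError:
        print(pkg, "not installed")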

@JohannesWiesner
Contributor Author

Hm, are you sure? I also tried it with SCCA_PMD and got the same error. Will update scipy/numpy and also do a sanity check with other sparse CCAs!

@JohannesWiesner
Contributor Author

Ah true, all of those will use the same underlying numpy/scipy functions I guess. You'll get an update tomorrow!

@JohannesWiesner
Contributor Author

JohannesWiesner commented Aug 3, 2023

Worked: On our Windows machine, these versions are installed:

numpy         1.23.3   py39h9061af7_0    conda-forge
scipy         1.9.1    py39h316f440_0    conda-forge
scikit-learn  1.3.0    pypi_0            pypi
cca-zoo       2.1.0

Worked: We also tested in a Docker container (with Debian Bookworm as the base image) running on Windows:

numpy         1.21.5   pypi_0            pypi
scipy         1.10.1   py39h7360e5f_0    conda-forge
scikit-learn  1.2.2    py39h86b2a18_0    conda-forge
cca-zoo       2.1.0

Worked: Then I set up a completely fresh conda environment with cca-zoo only:

numpy         1.25.2
scipy         1.11.1
scikit-learn  1.3.0
cca-zoo       2.1.0

Did not work: And in my default conda environment I got these versions:

numpy         1.21.5   pypi_0            pypi
scipy         1.10.1   py38h59b608b_3    conda-forge
scikit-learn  1.3.0    py38hc099248_0    conda-forge
cca-zoo       2.1.0

@JohannesWiesner
Contributor Author

Hard to say what's causing the issue. I can't really see how the numpy or scipy versions could be responsible. Maybe it's a complex interplay between different packages? I also checked whether the Python version is causing the issue: I created a second environment forcing Python to 3.8.17 (to match the Python version of my non-working environment) but couldn't reproduce the error.

@jameschapman19
Owner

Ergh! And this worked in a previous version?

@JohannesWiesner
Contributor Author

Geez, that sounds non-trivial. For now, I will just use the working conda environment for the analysis. Let me know if I should test something for you. It's probably a good idea to implement a testing workflow with different OS runners in the long term.

@jameschapman19
Owner

Agree about testing with different OS - my ‘hack’ has been that I develop on Windows and the automatic tests here use Ubuntu.

Although weirdly that suggests the package does work on Ubuntu! So it suggests I need to make the numpy/scipy versions explicit (I’ve tended towards laziness, relying on scikit-learn’s dependencies to be about right).

@jameschapman19
Owner

So this passes all the tests on Ubuntu:

Installing numpy (1.24.4)

Installing scipy (1.9.3)

Installing scikit-learn (1.3.0)

If that works then I’ll pin the dependency versions to avoid your issue in the future - thanks and apologies! I’m always learning 🙏

@jameschapman19
Owner

Ah no, because I haven’t been testing the n_jobs>1 behaviour. Will add it to the tests.
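
Something like this minimal sketch could cover it (a hypothetical test, not the existing cca-zoo test suite; it just checks that a parallel grid search runs without workers crashing):

import numpy as np
from cca_zoo.linear import rCCA
from cca_zoo.model_selection import GridSearchCV

def scorer(estimator, views):
    # average correlation across latent dimensions, as in the scripts above
    return np.mean(estimator.score(views))

def test_gridsearchcv_runs_with_multiple_jobs():
    rng = np.random.RandomState(0)
    X = rng.rand(50, 10)
    Y = rng.rand(50, 8)
    param_grid = {"c": [[0.1, 0.5, 0.9], [0.1, 0.5, 0.9]]}
    grid = GridSearchCV(rCCA(latent_dimensions=1), param_grid, scoring=scorer, cv=3, n_jobs=2)
    grid.fit([X, Y])
    assert hasattr(grid, "best_estimator_")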

@JohannesWiesner
Contributor Author

Agree about testing with different OS - my ‘hack’ has been that I develop on Windows and the automatic tests here use Ubuntu.

Although weirdly that suggests the package does work on Ubuntu! So it suggests I need to make the numpy/scipy versions explicit (I’ve tended towards laziness, relying on scikit-learn’s dependencies to be about right).

Should be feasible to implement a CI workflow with different OS runners and then run the pytest suite for each of them.

@JohannesWiesner
Contributor Author

I could send a PR if I have some time.
