MPI crash on Debian/Ubuntu during classifier training #104
Comments
Message from NV
Here is a dataset and a scenario for the MPI bug
Some details
We found a temporary fix for the release: pinning MPICH to at most 3.4.3 in the Linux conda package. It is temporary because we haven't really found the cause of the error, and it could be a bug. @sgouache has a patch for it. This issue will be closed once we have investigated the problem thoroughly.
This issue may be related to pmodels/mpich#6186, in which case it will hopefully be fixed in mpich 4.2.0. The change that causes it was introduced in mpich 4.1 (source). This reinforces the idea of pinning mpich to 3.4.3 on Linux and Mac as a workaround.
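To make the version reasoning above concrete, here is a minimal sketch that classifies an MPICH version string. The helper names (`parse_version`, `is_suspect_mpich`, `satisfies_pin`) are hypothetical, and the "suspect" range (4.1 up to but not including 4.2.0) is only the assumption stated in this thread, not a confirmed diagnosis.

```python
import re


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '4.1.2'."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    # A missing patch component (e.g. '4.1') is treated as 0.
    return int(major), int(minor), int(patch or 0)


def is_suspect_mpich(version_text):
    """True if the version lies in the range suspected in this thread.

    Assumption: the regression arrived in 4.1 and is hoped to be
    fixed in 4.2.0; neither bound has been verified here.
    """
    return (4, 1, 0) <= parse_version(version_text) < (4, 2, 0)


def satisfies_pin(version_text):
    """True if the version respects the temporary pin (at most 3.4.3)."""
    return parse_version(version_text) <= (3, 4, 3)
```

For example, `is_suspect_mpich("4.1.2")` is true, while `satisfies_pin("3.4.3")` is true and `satisfies_pin("4.0.0")` is false.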
Nicolas managed to reproduce the bug in a very simple case: the Adult dataset with the type of one variable changed from Categorical to Numerical, which generates many errors and crashes MPI when running on multiple cores. Here is the minimal test case, which I have just added to the test base: LearningTest\TestKhiops\Standard\BugMPIWithErrors
Contact: NV
A bug (with "khiops ended with return code 15") was recently observed (2023/12/04) by Nicolas in a GCP cloud environment.
ubuntu: only on Debian 11 or 10
[0] 2023-12-05 16:30:36.677 Run task Selection operand extraction (signature 'Selection operand extraction_11_10_16')
[0] 2023-12-05 16:30:36.680 In MasterInitialize [Selection operand extraction]
...
[0] 2023-12-05 16:30:37.564 In MasterFinalize
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x558e32098028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x5582628c7028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
[0] 2023-12-05 16:30:37.564 Out MasterFinalize
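When triaging other runs, it can help to check whether a log contains the same signature as the trace above. The following is a minimal sketch; the function name `unmatched_message_counts` is hypothetical, and the pattern is only based on the message text as it appears in this log.

```python
import re

# Matches the MPICH message seen in the trace above and captures the count.
UNMATCHED_RE = re.compile(r"being freed has (\d+) unmatched message\(s\)")


def unmatched_message_counts(log_text):
    """Return the unmatched-message count of each matching report in the log."""
    return [int(count) for count in UNMATCHED_RE.findall(log_text)]
```

An empty list means the signature was not found; a non-empty list gives one entry per error report.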
To be handled as a hot-fix as soon as possible, after the release of the open-source version.