MPI crash on Debian/Ubuntu during classifier training #104

Closed
marcboulle opened this issue Dec 6, 2023 · 4 comments · Fixed by #155

marcboulle (Collaborator) commented Dec 6, 2023

Contact: NV

A bug (khiops ended with return code 15) was recently observed (2023/12/04) by Nicolas on a GCP cloud environment.

  • return code 15 corresponds to a SIGTERM
  • cf. https://komodor.com/learn/sigterm-signal-15-exit-code-143-linux-graceful-termination/
  • the GCP cloud uses this signal: https://cloud.google.com/run/docs/samples/cloudrun-sigterm-handler?hl=fr
  • bug under investigation
  • reproduced even in scenario mode, without going through python (conda install)
  • not reproduced on Windows or Ubuntu: only on Debian 11 or 10
  • log information obtained:
    • nothing noteworthy in the log
      • warning : Data table : ...
    • on stderr
      • Abort(609831951) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 609831951) - process 2
      • Abort(542723087) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 542723087) - process 1
    • on stdout (with environment variables KhiopsPreparationTraceMode=true and KhiopsParallelTrace=2); see the sketch at the end of this comment for what the "unmatched message(s)" error refers to
      [0] 2023-12-05 16:30:36.677 Run task Selection operand extraction (signature 'Selection operand extraction_11_10_16')
      [0] 2023-12-05 16:30:36.680 In MasterInitialize [Selection operand extraction]
      ...
      [0] 2023-12-05 16:30:37.564 In MasterFinalize
      error : MPI driver : Other MPI error, error stack:
      internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x558e32098028) failed
      MPID_Comm_disconnect(493)......:
      MPIR_Comm_free_impl(809).......:
      MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
      error : MPI driver : Other MPI error, error stack:
      internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x5582628c7028) failed
      MPID_Comm_disconnect(493)......:
      MPIR_Comm_free_impl(809).......:
      MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
      [0] 2023-12-05 16:30:37.564 Out MasterFinalize

To be handled as a hot-fix as soon as possible, after the release of the open-source version.
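For reference, here is a minimal standalone sketch (not taken from the Khiops code base) of the situation that MPICH reports: a communicator is torn down while a message sent on it was never matched by a receive. It is an illustration under that assumption, not a guaranteed reproducer; whether it produces the exact same error stack depends on the MPI implementation and version.

```cpp
// Minimal sketch of an "unmatched message" at disconnect time (illustration only).
// Build/run, for example: mpicxx unmatched.cpp -o unmatched && mpirun -n 2 ./unmatched
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
	MPI_Init(&argc, &argv);

	int rank;
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	// Work on a duplicated communicator, since MPI_Comm_disconnect cannot be
	// called on MPI_COMM_WORLD itself.
	MPI_Comm comm;
	MPI_Comm_dup(MPI_COMM_WORLD, &comm);

	if (rank == 1)
	{
		// Rank 1 sends a small message that rank 0 never receives.
		int payload = 42;
		MPI_Send(&payload, 1, MPI_INT, 0, 7, comm);
	}
	// Rank 0 posts no matching MPI_Recv, so the message stays pending ("unmatched").

	// Tearing down the communicator while an unexpected message is still pending
	// is the condition MPICH describes as
	// "Communicator ... being freed has 1 unmatched message(s)".
	MPI_Comm_disconnect(&comm);

	MPI_Finalize();
	printf("rank %d done\n", rank);
	return 0;
}
```

In Khiops itself the messages on the communicator belong to the master/worker protocol; the sketch only mimics the end state, not that protocol.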

@marcboulle marcboulle added Type/Bug Something isn't working v11 Issue for Khiops 11 labels Dec 6, 2023
@folmos-at-orange folmos-at-orange added this to the v11.0.0 milestone Dec 6, 2023
marcboulle (Collaborator, Author) commented Dec 11, 2023

Message from NV

bug_mpi.zip

Here is a dataset and a scenario for the MPI bug.
It occurs on Linux with the conda packaging when the mpich 4.1.2 version is installed.
There is no bug when installing the deb version, which uses mpi 3.3, on Ubuntu focal, Debian 11 and 10.
When there are no more orphan records in the T_sample2_synthetic_log.csv database, the bug disappears; a certain number of them is needed for it to appear.
The bug occurs quickly.

... orphan record LOGS[ID00004] with key inferior to that of the including record TablePrincipale[ID00006]
warning : Data table ./T_sample2_synthetic_log.csv : Record 449 : Ignored record, orphan record LOGS[ID00004] with key inferior to that of the including record TablePrincipale[ID00006]
warning : Data table ./T_sample2_synthetic_log.csv : Record 450 : Ignored record, orphan record LOGS[ID00004] with key inferior to that of the including record TablePrincipale[ID00006]
warning : Data table ./T_sample2_synthetic_log.csv : Record 451 : Ignored record, orphan record LOGS[ID00004] with key inferior to that of the including record TablePrincipale[ID00006]
warning : Data table : ...
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x562e64c4a028) failed
MPID_Comm_disconnect(493)......: 
MPIR_Comm_free_impl(809).......: 
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x559268036028) failed
MPID_Comm_disconnect(493)......: 
MPIR_Comm_free_impl(809).......: 
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)

Some additional details

  • test.prm in LearningTest format: test.zip
    • simplified test set with a single variable to construct
    • run with 3 cores
    • crashes very quickly, even in debug
      • runs through completely and correctly in debug on Windows
    • crashes in a different task
      • so the problem does not seem to come from one particular task
      • it seems to come from the interaction with MPI starting with mpich 4
  • crashes on Nicolas's machine in the task "Run task Database slicer"
    • trace on stdout:
...
[0]	2023-12-12 10:40:13.757	In MasterFinalize
[0]	2023-12-12 10:40:13.758	Out MasterFinalize
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x555abbeef028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x556ac7f0a028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
  • Traces added in the StartFileServers and StopFileServers methods of PLMPITaskDriver to pinpoint the problem
    • cf. on Stephane's cloned repo: Bug MPI traces 1
    • crashes in StopFileServers, at the time of the call to MPI_Barrier (see the drain sketch after the traces below)
...
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x562936cab028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)
error : MPI driver : Other MPI error, error stack:
internal_Comm_disconnect(81)...: MPI_Comm_disconnect(comm=0x55bb78216028) failed
MPID_Comm_disconnect(493)......:
MPIR_Comm_free_impl(809).......:
MPIR_Comm_delete_internal(1224): Communicator (handle=84000003) being freed has 1 unmatched message(s)

[0]	2023-12-12 15:05:54.601	Out MasterFinalize
StopFileServers	BEGIN	1
	StopFileServers	Step1
	StopFileServers	Step2
[0]	2023-12-12 15:05:54.601	0 SEND MASTER_STOP_FILE_SERVERS to 1
	StopFileServers	Step3	1
	StopFileServers	Step4	1
[0]	2023-12-12 15:05:54.601	0 SEND MASTER_STOP_FILE_SERVERS to 2
	StopFileServers	Step3	2
	StopFileServers	Step4	2
	StopFileServers	Step5
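The "unmatched message(s)" error stack suggests that when the master reaches StopFileServers, a message sent by a worker or file server on the communicator (for example an error or warning report) has not yet been received. A common defensive pattern, sketched below purely as an illustration and not as the actual Khiops fix, is to drain any pending message on the communicator before the final MPI_Barrier / MPI_Comm_disconnect. The helper name DrainPendingMessages is hypothetical.

```cpp
// Sketch only: drain any message still pending on comm before tearing it down,
// so that MPI_Comm_disconnect does not find "unmatched" messages.
// DrainPendingMessages is a hypothetical helper, not an existing Khiops method.
#include <mpi.h>
#include <vector>

static void DrainPendingMessages(MPI_Comm comm)
{
	int flag;
	MPI_Status status;

	// Probe as long as a message from any source, with any tag, is pending on comm.
	while (true)
	{
		MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &flag, &status);
		if (!flag)
			break;

		// Receive and discard the pending message; MPI_PACKED matches a message
		// sent with any datatype, and its byte size is obtained from the status.
		int size = 0;
		MPI_Get_count(&status, MPI_PACKED, &size);
		std::vector<char> buffer(size > 0 ? size : 1);
		MPI_Recv(buffer.data(), size, MPI_PACKED, status.MPI_SOURCE, status.MPI_TAG, comm, MPI_STATUS_IGNORE);
	}
}
```

Whether the right fix is this kind of draining, a change in the stop protocol between master, workers and file servers, or simply waiting for an upstream mpich fix, is what remains to be investigated.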

folmos-at-orange (Member) commented:

We found a temporary fix for the release: pinning the version of MPICH to at most 3.4.3 in the Linux conda package. It is temporary because we haven't really found the cause of the error, and it could be a bug. @sgouache has a patch for it.

This issue will be closed once we have investigated the problem thoroughly.

@folmos-at-orange folmos-at-orange changed the title Crash on Debian 11 GCP during classifier training Crash on Debian/Ubuntu during classifier training Dec 12, 2023
folmos-at-orange (Member) commented Dec 12, 2023

Maybe this issue is related to pmodels/mpich#6186, so it will hopefully be fixed in mpich 4.2.0. The change that generates it was introduced in mpich 4.1 (source).

This reinforces the idea of pinning mpich to 3.4.3 on Linux and Mac as a workaround.

@folmos-at-orange folmos-at-orange changed the title Crash on Debian/Ubuntu during classifier training MPI crash on Debian/Ubuntu during classifier training Dec 12, 2023
marcboulle (Collaborator, Author) commented Dec 15, 2023

Nicolas managed to reproduce the bug in a very simple case: the adult database with the type of one variable changed from Categorical to Numerical, which generates many errors and makes MPI crash when running multi-core.

Here is the minimal test case that I have just integrated into the test base LearningTest\TestKhiops\Standard\BugMPIWithErrors
BugMPIWithErrors.zip
