Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port

A simple MPI application is failing with the following error when host1 is included in the hostfile.

Error: Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port description in business card

This application works fine when host1 is excluded from the hostfile. I tried using Intel Cluster Checker and have included the corresponding log below. Can you please help me interpret it? It seems to mostly list the differences between the hosts that were specified with "-f (machinelist)", without highlighting any issue with host-e8 that could explain this error. The log follows.

SUMMARY
  Command-line:   clck -f machinesToTest -c clck.xml -Fhealth_user -Fhealth_base
                  -Fhealth_extended_user -Fmpi_prereq_user -l debug
  Tests Run:      health_user, health_base, health_extended_user,
                  mpi_prereq_user
  **WARNING**:    9 tests failed to run. Information may be incomplete. See
                  clck_execution_warnings.log for more information.
  Overall Result: 33 issues found - FUNCTIONALITY (3), HARDWARE UNIFORMITY (11),
                  PERFORMANCE (9), SOFTWARE UNIFORMITY (10)
--------------------------------------------------------------------------------
7 nodes tested:         host-a2, host-b[1,3,6], host1,
                        host-c1, host-d
0 nodes with no issues: 
7 nodes with issues:    host-a2, host-b[1,3,6], host1,
                        host-c1, host-d
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
  1. mpi-local-broken
       Message: The single node MPI "Hello World" program did not run
                successfully.
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_local_functionality
  2. memlock-too-small
       Message: The memlock limit, '64', is smaller than recommended.
       Remedy:  We recommend correcting the limit of locked memory in
                /etc/security/limits.conf to the following values: "* hard
                memlock unlimited" "* soft memlock unlimited"
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    memory_uniformity_user
  3. memlock-too-small-ethernet
       Message: The memlock limit, '64', is smaller than recommended.
       Remedy:  We recommend correcting the limit of locked memory in
                /etc/security/limits.conf to the following values: "* hard
                memlock unlimited" "* soft memlock unlimited"
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_ethernet

HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
  1.  memory-not-uniform
        Message: The amount of physical memory is not within the range of
                 792070572.0 KiB +/- 262144.0 KiB defined by nodes in the same
                 grouping.
        5 nodes: host-b[1,3], host1, host-c1, host-d
        Test:    memory_uniformity_base
        Details: 
          #Nodes Memory           Nodes           
          1      1584974816.0 KiB host-c1    
          1      2113513608.0 KiB host1    
          1      529153152.0 KiB  host-d    
          1      790940180.0 KiB  host-b1 
          1      790940184.0 KiB  host-b3 
  2.  logical-cores-not-uniform:24
        Message: The logical cores, '24', is not uniform across all nodes in the
                 same grouping. 67% of nodes in the same grouping have the same
                 number of logical cores.
        Remedy:  Please ensure that BIOS settings that can influence the number
                 of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
                 are uniform across nodes in the same grouping.
        2 nodes: host-b[1,3]
        Test:    cpu_base
  3.  logical-cores-not-uniform:48
        Message: The logical cores, '48', is not uniform across all nodes in the
                 same grouping. 33% of nodes in the same grouping have the same
                 number of logical cores.
        Remedy:  Please ensure that BIOS settings that can influence the number
                 of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
                 are uniform across nodes in the same grouping.
        1 node:  host-b6
        Test:    cpu_base
  4.  threads-per-core-not-uniform:1
        Message: The number of threads available per core, '1', is not uniform.
                 67% of nodes in the same grouping have the same number of
                 threads available per core.
        Remedy:  Please enable/disable hyper-threading uniformly on Intel(R)
                 CPUs.
        2 nodes: host-b[1,3]
        Test:    cpu_base
  5.  threads-per-core-not-uniform:2
        Message: The number of threads available per core, '2', is not uniform.
                 33% of nodes in the same grouping have the same number of
                 threads available per core.
        Remedy:  Please enable/disable hyper-threading uniformly on Intel(R)
                 CPUs.
        1 node:  host-b6
        Test:    cpu_base
  6.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host-a2
        Test:    cpu_base
  7.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) Gold 6256 CPU @ 3.60GHz', is
                 not uniform. 43% of nodes in the same grouping have the same
                 CPU model.
        3 nodes: host-b[1,3,6]
        Test:    cpu_base
  8.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host1
        Test:    cpu_base
  9.  cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host-c1
        Test:    cpu_base
  10. cpu-model-name-not-uniform
        Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz', is
                 not uniform. 14% of nodes in the same grouping have the same
                 CPU model.
        1 node:  host-d
        Test:    cpu_base
  11. ethernet-firmware-version-is-not-consistent
        Message: Inconsistent Ethernet firmware version.
        3 nodes: host-a2, host1, host-c1
        Test:    ethernet
        Details: 
          #Nodes Firmware Version          Nodes         
          1      0x80000887, 1.2028.0      host-c1  
          1      0x800008e8                host-a2 
          1      4.0.596                   host1  
          1      5719-v1.46 NCSI v1.3.16.0 host-a2 

PERFORMANCE
The following performance issues were detected:
  1. process-is-high-cpu
       Message: Processes using high CPU.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       5 nodes: host-a2, host-b[3,6], host1, host-c1
       Test:    node_process_status
       Details: 
         #Nodes User     PID    %CPU Process                                                                      Nodes           
         1      usera 204058 98.9 /med/code7/usera/blue4/rnd/software/amd64.linux.gnu.product/distribVelsyn host-b3 
         1      userb 120854 98.5 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn            host1    
         1      userb 71486  98.6 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn            host1    
         1      wvgrid   11116  37.2 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host-a2   
         1      wvgrid   19160  21.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host-b6 
         1      wvgrid   25097  79.7 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host-c1    
         1      wvgrid   90731  58.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd                                        host1    
  2. substandard-dgemm-due-to-high-cpu-process
       Message: The substandard DGEMM benchmark result of 1.528 TFLOPS is due to
                a conflicting process, pid '204058', using a large amount of
                cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b3
       Test:    node_process_status
  3. substandard-sgemm-due-to-high-cpu-process
       Message: The substandard SGEMM benchmark result of 3.277 TFLOPS is due to
                a conflicting process, pid '204058', using a large amount of
                cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b3
       Test:    node_process_status
  4. sgemm-data-is-substandard-avx512
       Message: The following SGEMM benchmark results are below the accepted
                4.147 TFLOPS(100%). The acceptable fraction (90%) can be set
                using the <sgemm-peak-fraction> option in the configuration
                file. For more details, please refer to the Intel(R) Cluster
                Checker User Guide.
       3 nodes: host-b[1,3,6]
       Test:    sgemm_cpu_performance
       Details: 
         #Nodes Result       %Below Peak Nodes           
         1      2.355 TFLOPS 57          host-b6 
         1      3.181 TFLOPS 77          host-b1 
         1      3.277 TFLOPS 79          host-b3 
  5. substandard-sgemm-due-to-high-cpu-process
       Message: The substandard SGEMM benchmark result of 2.355 TFLOPS is due to
                a conflicting process, pid '19160', using a large amount of cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b6
       Test:    node_process_status
  6. dgemm-data-is-substandard-avx512
       Message: The DGEMM benchmark result is below the accepted 2.074
                TFLOPS(100%). The acceptable fraction (90%) can be set using the
                <dgemm-peak-fraction> option in the configuration file. For more
                details, please refer to the Intel(R) Cluster Checker User
                Guide.
       3 nodes: host-b[1,3,6]
       Test:    dgemm_cpu_performance
       Details: 
         #Nodes Result       %Below Peak Nodes           
         1      1.389 TFLOPS 67          host-b1 
         1      1.528 TFLOPS 74          host-b3 
         1      1.570 TFLOPS 76          host-b6 
  7. dgemm-data-is-substandard
       Message: The following DGEMM benchmark results are below the theoretical
                peak of 1.165 TFLOPS.
       1 node:  host-a2
       Test:    dgemm_cpu_performance
       Details: 
         #Nodes Result         %Below Peak Nodes         
         1      845.441 GFLOPS 73          host-a2 
  8. substandard-dgemm-due-to-high-cpu-process
       Message: The substandard DGEMM benchmark result of 845.441 GFLOPS is due
                to a conflicting process, pid '11116', using a large amount of
                cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-a2
       Test:    node_process_status
  9. substandard-dgemm-due-to-high-cpu-process
       Message: The substandard DGEMM benchmark result of 1.570 TFLOPS is due to
                a conflicting process, pid '19160', using a large amount of cpu.
       Remedy:  If this command is running in error, kill the process on the
                node (if you are not the owner of the process, elevated
                privileges may be required.)
       1 node:  host-b6
       Test:    node_process_status

SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
  1.  ethernet-driver-is-not-consistent
        Message: Inconsistent Ethernet driver.
        2 nodes: host-a2, host1
        Test:    ethernet
        Details: 
          #Nodes Driver     Nodes         
          1      netxen_nic host1  
          1      tg3        host-a2 
  2.  kernel-not-uniform
        Message: The Linux kernel version, '3.10.0-957.27.2.el7.x86_64', is not
                 uniform. 86% of nodes in the same grouping have the same
                 version.
        6 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1
        Test:    kernel_version_uniformity
  3.  kernel-not-uniform
        Message: The Linux kernel version, '2.6.32-573.26.1.el6.x86_64', is not
                 uniform. 14% of nodes in the same grouping have the same
                 version.
        1 node:  host-d
        Test:    kernel_version_uniformity
  4.  environment-variable-not-uniform
        Message: Environment variables are not uniform across the nodes.
        7 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1, host-d
        Test:    environment_variables_uniformity
        Details: 
          #Nodes Variable           Value                                           Nodes         
          6      G_BROKEN_FILENAMES                                                 host-a2, host-b[1,3,6], host1, host-c1
          6      KDE_IS_PRELINKED                                                   host-a2, host-b[1,3,6], host1, host-c1
          6      MODULEPATH                                                         host-a2, host-b[1,3,6], host1, host-c1
          6      MODULESHOME                                                        host-a2, host-b[1,3,6], host1, host-c1
          1      G_BROKEN_FILENAMES 1                                               host-d  
          1      KDE_IS_PRELINKED   1                                               host-d  
          1      MODULEPATH         /usr/share/Modules/modulefiles:/etc/modulefiles host-d  
          1      MODULESHOME        /usr/share/Modules                              host-d  
  5.  perl-not-uniform
        Message: The Perl version, '5.16.3', is not uniform. 86% of nodes in the
                 same grouping have the same version.
        6 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1
        Test:    perl_functionality
  6.  perl-not-uniform
        Message: The Perl version, '5.10.1', is not uniform. 14% of nodes in the
                 same grouping have the same version.
        1 node:  host-d
        Test:    perl_functionality
  7.  python-not-uniform
        Message: The Python version, '2.7.5', is not uniform. 86% of nodes in
                 the same grouping have the same version.
        6 nodes: host-a2, host-b[1,3,6], host1,
                 host-c1
        Test:    python_functionality
  8.  python-not-uniform
        Message: The Python version, '2.6.6', is not uniform. 14% of nodes in
                 the same grouping have the same version.
        1 node:  host-d
        Test:    python_functionality
  9.  ethernet-driver-version-is-not-consistent
        Message: Inconsistent Ethernet driver version.
        2 nodes: host-a2, host1
        Test:    ethernet
        Details: 
          #Nodes Version Nodes         
          1      3.137   host-a2 
          1      4.0.82  host1  
  10. ethernet-interrupt-coalescing-state-not-uniform
        Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
                 across nodes in the same grouping.
        Remedy:  Append "/sbin/ethtool -C eno1 rx-usecs <value>" to the site
                 specific system startup script. Use '0' to permanently disable
                 Ethernet interrupt coalescing or other value as needed. The
                 site specific system startup script is typically
                 /etc/rc.d/rc.local or /etc/rc.d/boot.local.
        1 node:  host1
        Test:    ethernet
        Details: 
          #Nodes State   Interface Nodes        
          1      enabled eno1      host1 
          1      enabled eno3      host1 

--------------------------------------------------------------------------------
INFORMATIONAL
The following additional information was detected:
  1. mpi-network-interface
       Message: The cluster has 1 network interfaces (Ethernet). Intel(R) MPI
                Library uses by default the first interface detected in the
                order of: (1) Intel(R) Omni-Path Architecture (Intel(R) OPA),
                (2) InfiniBand, (3) Ethernet. You can set a specific interface
                by setting the environment variable I_MPI_OFI_PROVIDER.
                Ethernet: I_MPI_OFI_PROVIDER=sockets mpiexec.hydra; InfiniBand:
                I_MPI_OFI_PROVIDER=verbs mpiexec.hydra; Intel(R) OPA:
                I_MPI_OFI_PROVIDER=psm2 mpiexec.hydra.
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_prereq_user

--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 1
00:34:46 April 23 2021 UTC
Nodefile used: machinesToTest
Databases used: $HOME/.clck/2021.1.1/clck.db
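
Because the summary mixes cluster-wide findings with per-host ones, one way to narrow it down is to list only the issues that name host1 but not every tested node. The following is a minimal sketch, not part of Intel Cluster Checker: the function names (expand_nodes, parse_issues, issues_specific_to) are hypothetical, the parsing is based only on the layout of the summary shown above, and bracket ranges such as [1-3] are not handled because the log does not use them.

```python
import re

# Patterns derived from the summary layout above (an assumption, not a clck spec).
ISSUE_RE = re.compile(r"^\s*\d+\.\s+(\S+)\s*$")          # "  1. mpi-local-broken"
NODES_RE = re.compile(r"^\s*\d+ nodes?:\s*(.*)$")        # "7 nodes: host-a2, ..."
FIELD_RE = re.compile(r"^\s*(Test|Details|Message|Remedy):")

def expand_nodes(spec):
    """Expand 'host-b[1,3,6], host1' into individual node names."""
    nodes = []
    # Split on commas that are not inside square brackets.
    for part in re.split(r",\s*(?![^\[]*\])", spec):
        part = part.strip()
        if not part:
            continue
        m = re.fullmatch(r"(.*)\[([^\]]*)\](.*)", part)
        if m:
            prefix, items, suffix = m.groups()
            nodes.extend(prefix + i.strip() + suffix for i in items.split(","))
        else:
            nodes.append(part)
    return nodes

def parse_issues(summary_text):
    """Return a list of (issue_name, set_of_nodes) found in a summary."""
    issues, name, spec = [], None, None
    for line in summary_text.splitlines():
        m = ISSUE_RE.match(line)
        if m:
            name, spec = m.group(1), None
            continue
        m = NODES_RE.match(line)
        if m and name:
            spec = m.group(1)
            continue
        if spec is not None:
            if FIELD_RE.match(line) or not line.strip():
                issues.append((name, set(expand_nodes(spec))))
                name, spec = None, None
            else:
                spec += " " + line.strip()  # node list wrapped onto the next line
    if spec is not None:
        issues.append((name, set(expand_nodes(spec))))
    return issues

def issues_specific_to(issues, host, all_nodes):
    """Keep issues that list `host` but do not affect every tested node."""
    return [(n, nodes) for n, nodes in issues if host in nodes and nodes != all_nodes]

# Excerpt copied from the summary above.
SAMPLE = """\
  1. mpi-local-broken
       Message: The single node MPI "Hello World" program did not run
                successfully.
       7 nodes: host-a2, host-b[1,3,6], host1,
                host-c1, host-d
       Test:    mpi_local_functionality
  1.  ethernet-driver-is-not-consistent
        Message: Inconsistent Ethernet driver.
        2 nodes: host-a2, host1
        Test:    ethernet
  10. ethernet-interrupt-coalescing-state-not-uniform
        Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
                 across nodes in the same grouping.
        1 node:  host1
        Test:    ethernet
"""

if __name__ == "__main__":
    found = parse_issues(SAMPLE)
    everyone = set().union(*(nodes for _, nodes in found))
    for issue, nodes in issues_specific_to(found, "host1", everyone):
        print(issue, sorted(nodes))
```

Filtering this way only surfaces the findings that distinguish host1 from the other nodes; it does not by itself show which of them, if any, causes the PMPI_Init failure.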

There is 1 answer

Arpita - Intel On

I tried using a consistent Ethernet driver version on host1, followed the remedy provided in the log for ethernet-interrupt-coalescing-state-not-uniform, and ran the sample on heterogeneous nodes including host1.
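
The log also carries a memlock remedy (memlock-too-small, reported on all seven nodes including host1). After editing /etc/security/limits.conf, a quick per-node check can confirm whether the new limit is in effect for the user who launches the job. This is a minimal sketch using Python's standard resource module; the helper name describe_memlock is hypothetical, and treating the '64' in the log as KiB is an assumption based on how "ulimit -l" usually reports this limit.

```python
import resource

def describe_memlock():
    """Return the soft and hard locked-memory limits as readable strings."""
    def fmt(value):
        # RLIM_INFINITY corresponds to "unlimited" in limits.conf.
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"  # getrlimit reports this limit in bytes
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)

if __name__ == "__main__":
    soft, hard = describe_memlock()
    print(f"memlock soft={soft} hard={hard}")
```

Running it in a fresh login session on each node matters, since changes to limits.conf normally apply only to sessions started after the edit.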