A simple MPI application fails with the following error whenever host1 is included in the hostfile:
Error: Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port description in business card
The application works fine when host1 is excluded from the hostfile. I ran Intel Cluster Checker and have included the resulting log below. Could you please help me interpret it? It seems to consist mostly of differences between the hosts that were specified with "-f (machinelist)", without highlighting any issue with host-e8 that would explain this error. The log follows.
SUMMARY
Command-line: clck -f machinesToTest -c clck.xml -Fhealth_user -Fhealth_base
-Fhealth_extended_user -Fmpi_prereq_user -l debug
Tests Run: health_user, health_base, health_extended_user,
mpi_prereq_user
**WARNING**: 9 tests failed to run. Information may be incomplete. See
clck_execution_warnings.log for more information.
Overall Result: 33 issues found - FUNCTIONALITY (3), HARDWARE UNIFORMITY (11),
PERFORMANCE (9), SOFTWARE UNIFORMITY (10)
--------------------------------------------------------------------------------
7 nodes tested: host-a2, host-b[1,3,6], host1,
host-c1, host-d
0 nodes with no issues:
7 nodes with issues: host-a2, host-b[1,3,6], host1,
host-c1, host-d
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. mpi-local-broken
Message: The single node MPI "Hello World" program did not run
successfully.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_local_functionality
2. memlock-too-small
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: memory_uniformity_user
3. memlock-too-small-ethernet
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_ethernet
HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
1. memory-not-uniform
Message: The amount of physical memory is not within the range of
792070572.0 KiB +/- 262144.0 KiB defined by nodes in the same
grouping.
5 nodes: host-b[1,3], host1, host-c1, host-d
Test: memory_uniformity_base
Details:
#Nodes  Memory            Nodes
1       1584974816.0 KiB  host-c1
1       2113513608.0 KiB  host1
1       529153152.0 KiB   host-d
1       790940180.0 KiB   host-b1
1       790940184.0 KiB   host-b3
2. logical-cores-not-uniform:24
Message: The logical cores, '24', is not uniform across all nodes in the
same grouping. 67% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
2 nodes: host-b[1,3]
Test: cpu_base
3. logical-cores-not-uniform:48
Message: The logical cores, '48', is not uniform across all nodes in the
same grouping. 33% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
1 node: host-b6
Test: cpu_base
4. threads-per-core-not-uniform:1
Message: The number of threads available per core, '1', is not uniform.
67% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
2 nodes: host-b[1,3]
Test: cpu_base
5. threads-per-core-not-uniform:2
Message: The number of threads available per core, '2', is not uniform.
33% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
1 node: host-b6
Test: cpu_base
6. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-a2
Test: cpu_base
7. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) Gold 6256 CPU @ 3.60GHz', is
not uniform. 43% of nodes in the same grouping have the same
CPU model.
3 nodes: host-b[1,3,6]
Test: cpu_base
8. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host1
Test: cpu_base
9. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-c1
Test: cpu_base
10. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-d
Test: cpu_base
11. ethernet-firmware-version-is-not-consistent
Message: Inconsistent Ethernet firmware version.
3 nodes: host-a2, host1, host-c1
Test: ethernet
Details:
#Nodes  Firmware Version            Nodes
1       0x80000887, 1.2028.0        host-c1
1       0x800008e8                  host-a2
1       4.0.596                     host1
1       5719-v1.46 NCSI v1.3.16.0   host-a2
PERFORMANCE
The following performance issues were detected:
1. process-is-high-cpu
Message: Processes using high CPU.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
5 nodes: host-a2, host-b[3,6], host1, host-c1
Test: node_process_status
Details:
#Nodes  User    PID     %CPU  Process                                                                    Nodes
1       usera   204058  98.9  /med/code7/usera/blue4/rnd/software/amd64.linux.gnu.product/distribVelsyn  host-b3
1       userb   120854  98.5  /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn             host1
1       userb   71486   98.6  /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn             host1
1       wvgrid  11116   37.2  /wv/wv-med/sge/bin/lx-amd64/sge_execd                                      host-a2
1       wvgrid  19160   21.1  /wv/wv-med/sge/bin/lx-amd64/sge_execd                                      host-b6
1       wvgrid  25097   79.7  /wv/wv-med/sge/bin/lx-amd64/sge_execd                                      host-c1
1       wvgrid  90731   58.1  /wv/wv-med/sge/bin/lx-amd64/sge_execd                                      host1
2. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.528 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
3. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 3.277 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
4. sgemm-data-is-substandard-avx512
Message: The following SGEMM benchmark results are below the accepted
4.147 TFLOPS(100%). The acceptable fraction (90%) can be set
using the <sgemm-peak-fraction> option in the configuration
file. For more details, please refer to the Intel(R) Cluster
Checker User Guide.
3 nodes: host-b[1,3,6]
Test: sgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 2.355 TFLOPS 57 host-b6
1 3.181 TFLOPS 77 host-b1
1 3.277 TFLOPS 79 host-b3
5. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 2.355 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
6. dgemm-data-is-substandard-avx512
Message: The DGEMM benchmark result is below the accepted 2.074
TFLOPS(100%). The acceptable fraction (90%) can be set using the
<dgemm-peak-fraction> option in the configuration file. For more
details, please refer to the Intel(R) Cluster Checker User
Guide.
3 nodes: host-b[1,3,6]
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 1.389 TFLOPS 67 host-b1
1 1.528 TFLOPS 74 host-b3
1 1.570 TFLOPS 76 host-b6
7. dgemm-data-is-substandard
Message: The following DGEMM benchmark results are below the theoretical
peak of 1.165 TFLOPS.
1 node: host-a2
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 845.441 GFLOPS 73 host-a2
8. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 845.441 GFLOPS is due
to a conflicting process, pid '11116', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-a2
Test: node_process_status
9. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.570 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
1. ethernet-driver-is-not-consistent
Message: Inconsistent Ethernet driver.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Driver Nodes
1 netxen_nic host1
1 tg3 host-a2
2. kernel-not-uniform
Message: The Linux kernel version, '3.10.0-957.27.2.el7.x86_64', is not
uniform. 86% of nodes in the same grouping have the same
version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: kernel_version_uniformity
3. kernel-not-uniform
Message: The Linux kernel version, '2.6.32-573.26.1.el6.x86_64', is not
uniform. 14% of nodes in the same grouping have the same
version.
1 node: host-d
Test: kernel_version_uniformity
4. environment-variable-not-uniform
Message: Environment variables are not uniform across the nodes.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: environment_variables_uniformity
Details:
#Nodes  Variable            Value                                            Nodes
6       G_BROKEN_FILENAMES                                                   host-a2, host-b[1,3,6], host1, host-c1
6       KDE_IS_PRELINKED                                                     host-a2, host-b[1,3,6], host1, host-c1
6       MODULEPATH                                                           host-a2, host-b[1,3,6], host1, host-c1
6       MODULESHOME                                                          host-a2, host-b[1,3,6], host1, host-c1
1       G_BROKEN_FILENAMES  1                                                host-d
1       KDE_IS_PRELINKED    1                                                host-d
1       MODULEPATH          /usr/share/Modules/modulefiles:/etc/modulefiles  host-d
1       MODULESHOME         /usr/share/Modules                               host-d
5. perl-not-uniform
Message: The Perl version, '5.16.3', is not uniform. 86% of nodes in the
same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: perl_functionality
6. perl-not-uniform
Message: The Perl version, '5.10.1', is not uniform. 14% of nodes in the
same grouping have the same version.
1 node: host-d
Test: perl_functionality
7. python-not-uniform
Message: The Python version, '2.7.5', is not uniform. 86% of nodes in
the same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: python_functionality
8. python-not-uniform
Message: The Python version, '2.6.6', is not uniform. 14% of nodes in
the same grouping have the same version.
1 node: host-d
Test: python_functionality
9. ethernet-driver-version-is-not-consistent
Message: Inconsistent Ethernet driver version.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Version Nodes
1 3.137 host-a2
1 4.0.82 host1
10. ethernet-interrupt-coalescing-state-not-uniform
Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
across nodes in the same grouping.
Remedy: Append "/sbin/ethtool -C eno1 rx-usecs <value>" to the site
specific system startup script. Use '0' to permanently disable
Ethernet interrupt coalescing or other value as needed. The
site specific system startup script is typically
/etc/rc.d/rc.local or /etc/rc.d/boot.local.
1 node: host1
Test: ethernet
Details:
#Nodes State Interface Nodes
1 enabled eno1 host1
1 enabled eno3 host1
--------------------------------------------------------------------------------
INFORMATIONAL
The following additional information was detected:
1. mpi-network-interface
Message: The cluster has 1 network interfaces (Ethernet). Intel(R) MPI
Library uses by default the first interface detected in the
order of: (1) Intel(R) Omni-Path Architecture (Intel(R) OPA),
(2) InfiniBand, (3) Ethernet. You can set a specific interface
by setting the environment variable I_MPI_OFI_PROVIDER.
Ethernet: I_MPI_OFI_PROVIDER=sockets mpiexec.hydra; InfiniBand:
I_MPI_OFI_PROVIDER=verbs mpiexec.hydra; Intel(R) OPA:
I_MPI_OFI_PROVIDER=psm2 mpiexec.hydra.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_prereq_user
--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 1
00:34:46 April 23 2021 UTC
Nodefile used: machinesToTest
Databases used: $HOME/.clck/2021.1.1/clck.db
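Since the report lists nodes in a compressed form such as host-b[1,3,6], one way to narrow it down is to expand each node list and keep only the entries that name host1 and no other node. The following is a minimal sketch, assuming the issue names and node lists are copied by hand from the report above (only a subset is shown). The helper names expand_nodes and issues_only_on are hypothetical and are not part of Intel Cluster Checker.

```python
import re

# One token is a node name, optionally followed by a bracketed suffix list,
# e.g. "host1" or "host-b[1,3,6]".
TOKEN = re.compile(r"([^,\s\[\]]+)(?:\[([^\]]*)\])?")


def expand_nodes(spec):
    """Expand a node list such as 'host-a2, host-b[1,3,6]' into plain names."""
    nodes = []
    for prefix, suffixes in TOKEN.findall(spec):
        if suffixes:
            nodes.extend(prefix + s.strip() for s in suffixes.split(","))
        else:
            nodes.append(prefix)
    return nodes


def issues_only_on(node, issues):
    """Return the issue names whose node list contains only the given node."""
    return [name for name, spec in issues.items() if expand_nodes(spec) == [node]]


# Hand-copied subset of the report above: issue name -> node list as printed.
issues = {
    "mpi-local-broken": "host-a2, host-b[1,3,6], host1, host-c1, host-d",
    "memory-not-uniform": "host-b[1,3], host1, host-c1, host-d",
    "ethernet-firmware-version-is-not-consistent": "host-a2, host1, host-c1",
    "cpu-model-name-not-uniform (E7- 4870)": "host1",
    "process-is-high-cpu": "host-a2, host-b[3,6], host1, host-c1",
    "ethernet-driver-is-not-consistent": "host-a2, host1",
    "ethernet-driver-version-is-not-consistent": "host-a2, host1",
    "ethernet-interrupt-coalescing-state-not-uniform": "host1",
}

if __name__ == "__main__":
    for name in issues_only_on("host1", issues):
        print(name)
```

With the subset above, only the CPU model entry and the interrupt coalescing entry name host1 alone, while mpi-local-broken is reported on all seven nodes and so does not single out host1.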
I also tried making the Ethernet driver version on host1 consistent with the other nodes, applied the remedy given in the log for ethernet-interrupt-coalescing-state-not-uniform, and then ran the sample on the heterogeneous set of nodes including host1.
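Because the error mentions a missing hostname in the business card, one thing that could be compared between host1 and a working node is how each node resolves its own hostname. This is only a guess at the cause and is not something the Cluster Checker log states. Below is a minimal sketch using Python's standard socket module, to be run on each node. The function name describe_local_hostname is hypothetical, and the code avoids newer syntax because the log reports Python 2.7.5 and 2.6.6 on these nodes.

```python
import socket


def describe_local_hostname():
    """Collect how this node names itself and whether that name resolves."""
    name = socket.gethostname()
    info = {
        "hostname": name,
        "fqdn": socket.getfqdn(),
        "address": None,
        "error": None,
    }
    try:
        # Look up the IPv4 address that the node's own hostname resolves to.
        info["address"] = socket.gethostbyname(name)
    except socket.error as exc:
        # Resolution failed; keep the message so nodes can be compared.
        info["error"] = str(exc)
    return info


if __name__ == "__main__":
    result = describe_local_hostname()
    for key in ("hostname", "fqdn", "address", "error"):
        print("%s: %s" % (key, result[key]))
```

If host1 reports an error or an address that differs in kind from what the working nodes report (for example a loopback address), that difference would be worth including here.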