Bringing up InfiniBand fails

justj

Member of Honour
Hello,

I am currently installing and configuring an HPC cluster. It consists of a SUN Bladecenter X6048 with 48 SUN X6275 blades. The blades boot diskless over the network into a heavily stripped-down CentOS 7.2.
The blades are interconnected via InfiniBand QDR. Resources are managed from a head node using Torque/PBS. The head node provides DNS, DHCP, and PXE services over both Ethernet and InfiniBand (IPoIB), in two separate subnets. The InfiniBand subnet manager (OpenSM) also runs on the head node. All services work without problems.

For the initial setup I connected every blade via both Ethernet and InfiniBand. PXE boot works over either path, as do DHCP and DNS.
Because the cluster is meant to run over InfiniBand only (among other reasons, because of missing hardware and the amount of cabling), I removed the Ethernet cables. The PXE boot itself still works fine: every client obtains its assigned IP address from the DHCP server, registers with the DNS server, and boots the provided image.

This is where my problem starts:
During boot, the system tries to activate the InfiniBand adapter, which apparently fails. As a result, IPoIB cannot be brought up either, and the NFS mounts fail. If I log in directly on the affected blade and activate the InfiniBand adapter with
Code:
ifup ib0
everything works and I can reach the node over the network. With 48 blades of 2 nodes each, however, manual intervention is not a solution.

As soon as the Ethernet cable is plugged in, activating the IB adapter during boot works as well.

Has anyone seen this problem before, or do you have a suggestion for a fix?

Below are two excerpts from the boot logs: first with Ethernet and InfiniBand connected, then with InfiniBand only.
Feb 22 13:08:20 localhost systemd[1]: Starting LSB: Bring up/down networking...
Feb 22 13:08:20 localhost systemd[1]: Starting GSSAPI Proxy Daemon...
Feb 22 13:08:20 localhost sshd-keygen[360]: Generating SSH2 RSA host key: [ OK ]
Feb 22 13:08:20 localhost sshd-keygen[360]: Generating SSH2 ECDSA host key: [ OK ]
Feb 22 13:08:20 localhost systemd[1]: Starting Login Service...
Feb 22 13:08:20 localhost sshd-keygen[360]: Generating SSH2 ED25519 host key: [ OK ]
Feb 22 13:08:20 localhost systemd[1]: Started D-Bus System Message Bus.
Feb 22 13:08:20 localhost dbus[446]: [system] Successfully activated service 'org.freedesktop.systemd1'
Feb 22 13:08:20 localhost systemd[1]: Starting D-Bus System Message Bus...
Feb 22 13:08:20 localhost dbus-daemon[446]: dbus[446]: [system] Successfully activated service 'org.freedesktop.systemd1'
Feb 22 13:08:20 localhost systemd[1]: Starting Network Time Service...
Feb 22 13:08:20 localhost ntpd[453]: ntpd 4.2.6p5@1.2349-o Mon Jan 25 14:27:34 UTC 2016 (1)
Feb 22 13:08:20 localhost ntpd[454]: proto: precision = 6.146 usec
Feb 22 13:08:20 localhost ntpd[454]: 0.0.0.0 c01d 0d kern kernel time sync enabled
Feb 22 13:08:20 localhost ntpd[454]: ntp_io: estimated max descriptors: 1024, initial socket boundary: 16
Feb 22 13:08:20 localhost ntpd[454]: Listen and drop on 0 v4wildcard 0.0.0.0 UDP 123
Feb 22 13:08:20 localhost ntpd[454]: Listen and drop on 1 v6wildcard :: UDP 123
Feb 22 13:08:20 localhost ntpd[454]: Listen normally on 2 lo 127.0.0.1 UDP 123
Feb 22 13:08:20 localhost ntpd[454]: Listen normally on 3 lo ::1 UDP 123
Feb 22 13:08:20 localhost ntpd[454]: Listening on routing socket on fd #20 for interface updates
Feb 22 13:08:20 localhost ntpd[454]: 0.0.0.0 c016 06 restart
Feb 22 13:08:20 localhost ntpd[454]: 0.0.0.0 c012 02 freq_set kernel 0.000 PPM
Feb 22 13:08:20 localhost ntpd[454]: 0.0.0.0 c011 01 freq_not_set
Feb 22 13:08:20 localhost network[366]: Bringing up loopback interface: [ OK ]
Feb 22 13:08:20 localhost systemd[1]: Started OpenSSH Server Key Generation.
Feb 22 13:08:20 localhost systemd[1]: Started Dump dmesg to /var/log/dmesg.
Feb 22 13:08:20 localhost network[366]: Bringing up interface enp0s25:
Feb 22 13:08:20 localhost systemd[1]: Started GSSAPI Proxy Daemon.
Feb 22 13:08:20 localhost systemd[1]: Started Network Time Service.
Feb 22 13:08:20 localhost systemd[1]: Started RPC security service for NFS client and server.
Feb 22 13:08:20 localhost systemd[1]: Started RPC security service for NFS server.
Feb 22 13:08:20 localhost systemd-logind[422]: New seat seat0.
Feb 22 13:08:20 localhost systemd[1]: Reached target NFS client services.
Feb 22 13:08:20 localhost systemd[1]: Starting NFS client services.
Feb 22 13:08:20 localhost systemd-logind[422]: Watching system buttons on /dev/input/event1 (Power Button)
Feb 22 13:08:20 localhost systemd-logind[422]: Watching system buttons on /dev/input/event0 (Power Button)
Feb 22 13:08:20 localhost systemd[1]: Started Login Service.
Feb 22 13:08:20 localhost kernel: e1000e 0000:00:19.0: irq 38 for MSI/MSI-X
Feb 22 13:08:20 localhost kernel: e1000e 0000:00:19.0: irq 38 for MSI/MSI-X
Feb 22 13:08:20 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): enp0s25: link is not ready
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: PCIe link speed is 5.0GT/s, device supports 5.0GT/s
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: PCIe link width is x8, device supports x8
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 39 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 40 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 41 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 42 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 43 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 44 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 45 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 46 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 47 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 48 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 49 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 50 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_core 0000:02:00.0: irq 51 for MSI/MSI-X
Feb 22 13:08:21 localhost kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014)
Feb 22 13:08:21 localhost kernel: mlx4_en 0000:02:00.0: UDP RSS is not supported on this device
Feb 22 13:08:21 localhost kernel: <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 2014)
Feb 22 13:08:21 localhost kernel: <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
Feb 22 13:08:21 localhost kernel: Rounding down aligned max_sectors from 4294967295 to 4294967288
Feb 22 13:08:21 localhost kernel: Loading iSCSI transport class v2.0-870.
Feb 22 13:08:21 localhost kernel: iscsi: registered transport (iser)
Feb 22 13:08:21 localhost kernel: RPC: Registered rdma transport module.
Feb 22 13:08:21 localhost systemd[1]: Started Initialize the iWARP/InfiniBand/RDMA stack in the kernel.
Feb 22 13:08:21 localhost systemd[1]: Reached target Remote File Systems (Pre).
Feb 22 13:08:21 localhost systemd[1]: Starting Remote File Systems (Pre).
Feb 22 13:08:23 localhost kernel: e1000e: enp0s25 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
Feb 22 13:08:23 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): enp0s25: link becomes ready
Feb 22 13:08:23 localhost dhclient[626]: DHCPDISCOVER on enp0s25 to 255.255.255.255 port 67 interval 3 (xid=0x6caeb5d6)
Feb 22 13:08:25 localhost ntpd[454]: Listen normally on 4 enp0s25 fe80::221:28ff:fe6b:5346 UDP 123
Feb 22 13:08:25 localhost ntpd[454]: new interface(s) found: waking up resolver
Feb 22 13:08:26 localhost dhclient[626]: DHCPDISCOVER on enp0s25 to 255.255.255.255 port 67 interval 8 (xid=0x6caeb5d6)
Feb 22 13:08:34 localhost dhclient[626]: DHCPDISCOVER on enp0s25 to 255.255.255.255 port 67 interval 15 (xid=0x6caeb5d6)
Feb 22 13:08:49 localhost dhclient[626]: DHCPDISCOVER on enp0s25 to 255.255.255.255 port 67 interval 10 (xid=0x6caeb5d6)
Feb 22 13:08:59 localhost dhclient[626]: DHCPDISCOVER on enp0s25 to 255.255.255.255 port 67 interval 17 (xid=0x6caeb5d6)
Feb 22 13:09:00 localhost dhclient[626]: DHCPREQUEST on enp0s25 to 255.255.255.255 port 67 (xid=0x6caeb5d6)
Feb 22 13:09:00 localhost dhclient[626]: DHCPOFFER from 10.0.0.1
Feb 22 13:09:00 localhost dhclient[626]: DHCPACK from 10.0.0.1 (xid=0x6caeb5d6)
Feb 22 13:09:02 localhost NET[678]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Feb 22 13:09:02 localhost dhclient[626]: bound to 10.0.10.5 -- renewal in 3217 seconds.
Feb 22 13:09:02 localhost network[366]: Determining IP information for enp0s25... done.
Feb 22 13:09:02 localhost network[366]: [ OK ]
Feb 22 13:09:02 localhost kernel: ib0: enabling connected mode will cause multicast packet drops
Feb 22 13:09:02 localhost kernel: ib0: mtu > 4092 will cause multicast packet drops.
Feb 22 13:09:02 localhost network[366]: Bringing up interface ib0:
Feb 22 13:09:02 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
Feb 22 13:09:02 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
Feb 22 13:09:02 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
Feb 22 13:09:02 localhost dhclient[753]: DHCPDISCOVER on ib0 to 255.255.255.255 port 67 interval 8 (xid=0x27fab603)
Feb 22 13:09:02 localhost dhclient[753]: DHCPREQUEST on ib0 to 255.255.255.255 port 67 (xid=0x27fab603)
Feb 22 13:09:02 localhost dhclient[753]: DHCPOFFER from 10.10.0.1
Feb 22 13:09:02 localhost dhclient[753]: DHCPACK from 10.10.0.1 (xid=0x27fab603)
Feb 22 13:09:03 localhost ntpd[454]: Listen normally on 5 enp0s25 10.0.10.5 UDP 123
Feb 22 13:09:03 localhost ntpd[454]: new interface(s) found: waking up resolver
Feb 22 13:09:04 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:09:04 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
Feb 22 13:09:04 localhost NET[794]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Feb 22 13:09:04 n1001 dhclient[753]: bound to 10.10.1.101 -- renewal in 3523 seconds.
Feb 22 13:09:04 n1001 network[366]: Determining IP information for ib0... done.
Feb 22 13:09:04 n1001 network[366]: [ OK ]
Feb 22 13:09:04 n1001 systemd[1]: Started LSB: Bring up/down networking.
Feb 22 13:09:04 n1001 systemd[1]: Reached target Network.
Feb 22 13:09:04 n1001 systemd[1]: Starting Network.
Feb 22 13:09:04 n1001 systemd[1]: Starting TORQUE pbs_mom daemon...
Feb 22 13:09:04 n1001 systemd[1]: Starting Notify NFS peers of a restart...
Feb 22 13:09:04 n1001 sm-notify[844]: Version 1.3.0 starting
Feb 22 13:09:04 n1001 kernel: pbs_mom (846): /proc/846/oom_adj is deprecated, please use /proc/846/oom_score_adj instead.
Feb 22 13:09:04 n1001 pbs_mom[857]: LOG_ERROR::read_tcp_reply, Mismatching protocols. Expected protocol 4 but read reply for 0
Feb 22 13:09:04 n1001 pbs_mom[857]: LOG_ERROR::read_tcp_reply, Could not read reply for protocol 4 command 4: End of File
Feb 22 13:09:04 n1001 pbs_mom[857]: LOG_ERROR::mom_server_update_stat, Couldn't read a reply from the server
Feb 22 13:09:04 n1001 pbs_mom[857]: LOG_ERROR::send_update_to_a_server, Could not contact any of the servers to send an update
Feb 22 13:09:04 n1001 pbs_mom[857]: LOG_ERROR::send_update_to_a_server, Status not successfully updated for 1 MOM status update intervals
Feb 22 13:09:04 n1001 sshd[859]: Server listening on 0.0.0.0 port 22.
Feb 22 13:09:04 n1001 sshd[859]: Server listening on :: port 22.
Feb 22 13:09:04 n1001 systemd[1]: Started OpenSSH server daemon.
Feb 22 13:09:04 n1001 systemd[1]: Starting OpenSSH server daemon...
Feb 22 13:09:04 n1001 systemd[1]: Reached target Network is Online.
Feb 22 13:09:04 n1001 systemd[1]: Starting Network is Online.
Feb 22 13:09:04 n1001 systemd[1]: home.mount: Directory /home to mount over is not empty, mounting anyway.
Feb 22 13:09:04 n1001 systemd[1]: Mounting /home...
Feb 22 13:09:04 n1001 kernel: FS-Cache: Loaded
Feb 22 13:09:04 n1001 kernel: FS-Cache: Netfs 'nfs' registered for caching
Feb 22 13:09:04 n1001 kernel: Key type dns_resolver registered
Feb 22 13:09:04 n1001 kernel: NFS: Registering the id_resolver key type
Feb 22 13:09:04 n1001 kernel: Key type id_resolver registered
Feb 22 13:09:04 n1001 kernel: Key type id_legacy registered
Feb 22 13:09:04 n1001 systemd[1]: Mounting /data...
Feb 22 13:09:05 n1001 systemd[1]: Mounted /home.
Feb 22 13:09:05 n1001 systemd[1]: Mounted /data.
Feb 22 13:09:05 n1001 systemd[1]: Started TORQUE pbs_mom daemon.
Feb 22 13:09:05 n1001 systemd[1]: Started Notify NFS peers of a restart.
Feb 22 13:09:05 n1001 systemd[1]: Reached target Remote File Systems.
Feb 22 13:09:05 n1001 systemd[1]: Starting Remote File Systems.
Feb 22 13:09:05 n1001 systemd[1]: Starting Permit User Sessions...
Feb 22 13:09:05 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
Feb 22 13:09:05 n1001 systemd[1]: Started Permit User Sessions.
Feb 22 13:09:05 n1001 systemd[1]: Started Getty on tty1.
Feb 22 13:09:05 n1001 systemd[1]: Starting Getty on tty1...
Feb 22 13:09:05 n1001 systemd[1]: Reached target Login Prompts.
Feb 22 13:09:05 n1001 systemd[1]: Starting Login Prompts.
Feb 22 13:09:05 n1001 systemd[1]: Reached target Multi-User System.
Feb 22 13:09:05 n1001 systemd[1]: Starting Multi-User System.
Feb 22 13:09:05 n1001 systemd[1]: Reached target Graphical Interface.
Feb 22 13:09:05 n1001 systemd[1]: Starting Graphical Interface.
Feb 22 13:09:05 n1001 systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Feb 22 13:09:05 n1001 systemd[1]: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Feb 22 13:09:05 n1001 systemd[1]: Starting Update UTMP about System Runlevel Changes...
Feb 22 13:09:05 n1001 systemd[1]: Started Update UTMP about System Runlevel Changes.
Feb 22 13:09:05 n1001 systemd[1]: Startup finished in 15.420s (kernel) + 46.964s (userspace) = 1min 2.384s.
Feb 22 13:09:06 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:09:06 n1001 ntpd[454]: Listen normally on 6 ib0 10.10.1.101 UDP 123
Feb 22 13:09:06 n1001 ntpd[454]: Listen normally on 7 ib0 fe80::5280:200:b3:9739 UDP 123
Feb 22 13:09:06 n1001 ntpd[454]: 10.10.0.1 interface 10.0.10.5 -> 10.10.1.101
Feb 22 13:09:06 n1001 ntpd[454]: new interface(s) found: waking up resolver
Feb 22 13:09:07 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
Feb 22 13:09:07 n1001 ntpd[454]: 0.0.0.0 c61c 0c clock_step +0.130122 s
Feb 22 13:09:07 n1001 ntpd[454]: 0.0.0.0 c614 04 freq_mode
Feb 22 13:09:07 n1001 systemd[1]: Time has been changed
Feb 22 13:09:08 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:09:08 n1001 ntpd[454]: 0.0.0.0 c618 08 no_sys_peer
Feb 22 13:09:10 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:09:12 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:09:14 n1001 kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:09:35 n1001 systemd[1]: Starting Stop Read-Ahead Data Collection...
Feb 22 13:09:35 n1001 systemd[1]: Started Stop Read-Ahead Data Collection.

Feb 22 13:08:16 localhost systemd[1]: Starting LSB: Bring up/down networking...
Feb 22 13:08:16 localhost systemd[1]: Starting GSSAPI Proxy Daemon...
Feb 22 13:08:16 localhost systemd[1]: Starting OpenSSH Server Key Generation...
Feb 22 13:08:16 localhost sshd-keygen[406]: Generating SSH2 RSA host key: [ OK ]
Feb 22 13:08:16 localhost systemd[1]: Started D-Bus System Message Bus.
Feb 22 13:08:16 localhost sshd-keygen[406]: Generating SSH2 ECDSA host key: [ OK ]
Feb 22 13:08:16 localhost systemd[1]: Starting D-Bus System Message Bus...
Feb 22 13:08:16 localhost sshd-keygen[406]: Generating SSH2 ED25519 host key: [ OK ]
Feb 22 13:08:16 localhost systemd[1]: Starting Login Service...
Feb 22 13:08:16 localhost network[364]: Bringing up loopback interface: [ OK ]
Feb 22 13:08:16 localhost systemd[1]: Started Dump dmesg to /var/log/dmesg.
Feb 22 13:08:16 localhost network[364]: Bringing up interface enp0s25:
Feb 22 13:08:16 localhost systemd[1]: Started Network Time Service.
Feb 22 13:08:16 localhost systemd[1]: Started GSSAPI Proxy Daemon.
Feb 22 13:08:16 localhost systemd[1]: Started OpenSSH Server Key Generation.
Feb 22 13:08:16 localhost systemd[1]: Started RPC security service for NFS server.
Feb 22 13:08:16 localhost systemd[1]: Started RPC security service for NFS client and server.
Feb 22 13:08:16 localhost systemd-logind[449]: New seat seat0.
Feb 22 13:08:16 localhost systemd[1]: Reached target NFS client services.
Feb 22 13:08:16 localhost systemd[1]: Starting NFS client services.
Feb 22 13:08:16 localhost systemd-logind[449]: Watching system buttons on /dev/input/event1 (Power Button)
Feb 22 13:08:16 localhost systemd-logind[449]: Watching system buttons on /dev/input/event0 (Power Button)
Feb 22 13:08:16 localhost systemd[1]: Started Login Service.
Feb 22 13:08:16 localhost kernel: e1000e 0000:00:19.0: irq 38 for MSI/MSI-X
Feb 22 13:08:16 localhost kernel: e1000e 0000:00:19.0: irq 38 for MSI/MSI-X
Feb 22 13:08:16 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): enp0s25: link is not ready
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: PCIe link speed is 5.0GT/s, device supports 5.0GT/s
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: PCIe link width is x8, device supports x8
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 39 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 40 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 41 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 42 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 43 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 44 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 45 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 46 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 47 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 48 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 49 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 50 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_core 0000:02:00.0: irq 51 for MSI/MSI-X
Feb 22 13:08:17 localhost kernel: mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014)
Feb 22 13:08:17 localhost kernel: mlx4_en 0000:02:00.0: UDP RSS is not supported on this device
Feb 22 13:08:17 localhost kernel: <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 2014)
Feb 22 13:08:17 localhost kernel: <mlx4_ib> mlx4_ib_add: counter index 0 for port 1 allocated 0
Feb 22 13:08:17 localhost kernel: Rounding down aligned max_sectors from 4294967295 to 4294967288
Feb 22 13:08:17 localhost kernel: Loading iSCSI transport class v2.0-870.
Feb 22 13:08:17 localhost kernel: iscsi: registered transport (iser)
Feb 22 13:08:17 localhost kernel: RPC: Registered rdma transport module.
Feb 22 13:08:17 localhost systemd[1]: Started Initialize the iWARP/InfiniBand/RDMA stack in the kernel.
Feb 22 13:08:17 localhost systemd[1]: Reached target Remote File Systems (Pre).
Feb 22 13:08:17 localhost systemd[1]: Starting Remote File Systems (Pre).
Feb 22 13:08:22 localhost network[364]: Determining IP information for enp0s25... failed; no link present. Check cable?
Feb 22 13:08:22 localhost network[364]: [FAILED]
Feb 22 13:08:22 localhost kernel: ib0: enabling connected mode will cause multicast packet drops
Feb 22 13:08:22 localhost kernel: ib0: mtu > 4092 will cause multicast packet drops.
Feb 22 13:08:22 localhost network[364]: Bringing up interface ib0:
Feb 22 13:08:22 localhost kernel: IPv6: ADDRCONF(NETDEV_UP): ib0: link is not ready
Feb 22 13:08:27 localhost network[364]: Determining IP information for ib0... failed; no link present. Check cable?
Feb 22 13:08:27 localhost network[364]: [FAILED]
Feb 22 13:08:27 localhost systemd[1]: network.service: control process exited, code=exited status=1
Feb 22 13:08:27 localhost systemd[1]: Failed to start LSB: Bring up/down networking.
Feb 22 13:08:27 localhost systemd[1]: Unit network.service entered failed state.
Feb 22 13:08:27 localhost systemd[1]: network.service failed.
Feb 22 13:08:27 localhost systemd[1]: Reached target Network.
Feb 22 13:08:27 localhost systemd[1]: Starting Network.
Feb 22 13:08:27 localhost systemd[1]: Starting Notify NFS peers of a restart...
Feb 22 13:08:27 localhost sm-notify[705]: Version 1.3.0 starting
Feb 22 13:08:27 localhost systemd[1]: Starting TORQUE pbs_mom daemon...
Feb 22 13:08:27 localhost pbs_mom[708]: LOG_ERROR::Access from host not allowed, or unknown host (15010) in mom_server_add, Cannot resolve host acinonyx for pbs_server
Feb 22 13:08:27 localhost kernel: pbs_mom (709): /proc/709/oom_adj is deprecated, please use /proc/709/oom_score_adj instead.
Feb 22 13:08:28 localhost systemd[1]: Reached target Network is Online.
Feb 22 13:08:28 localhost systemd[1]: Starting Network is Online.
Feb 22 13:08:28 localhost systemd[1]: Mounting /data...
Feb 22 13:08:28 localhost kernel: FS-Cache: Loaded
Feb 22 13:08:28 localhost kernel: FS-Cache: Netfs 'nfs' registered for caching
Feb 22 13:08:28 localhost mount[722]: mount.nfs: Network is unreachable
Feb 22 13:08:28 localhost kernel: Key type dns_resolver registered
Feb 22 13:08:28 localhost kernel: NFS: Registering the id_resolver key type
Feb 22 13:08:28 localhost kernel: Key type id_resolver registered
Feb 22 13:08:28 localhost kernel: Key type id_legacy registered
Feb 22 13:08:28 localhost systemd[1]: home.mount: Directory /home to mount over is not empty, mounting anyway.
Feb 22 13:08:28 localhost systemd[1]: Mounting /home...
Feb 22 13:08:28 localhost mount[736]: mount.nfs: Network is unreachable
Feb 22 13:08:28 localhost sshd[739]: Server listening on 0.0.0.0 port 22.
Feb 22 13:08:28 localhost sshd[739]: Server listening on :: port 22.
Feb 22 13:08:28 localhost systemd[1]: Started OpenSSH server daemon.
Feb 22 13:08:28 localhost systemd[1]: Starting OpenSSH server daemon...
Feb 22 13:08:28 localhost systemd[1]: Started Notify NFS peers of a restart.
Feb 22 13:08:28 localhost systemd[1]: Started TORQUE pbs_mom daemon.
Feb 22 13:08:28 localhost systemd[1]: data.mount mount process exited, code=exited status=32
Feb 22 13:08:28 localhost systemd[1]: Failed to mount /data.
Feb 22 13:08:28 localhost systemd[1]: Dependency failed for Remote File Systems.
Feb 22 13:08:28 localhost systemd[1]: Job remote-fs.target/start failed with result 'dependency'.
Feb 22 13:08:28 localhost systemd[1]: Unit data.mount entered failed state.
Feb 22 13:08:28 localhost systemd[1]: home.mount mount process exited, code=exited status=32
Feb 22 13:08:28 localhost systemd[1]: Failed to mount /home.
Feb 22 13:08:28 localhost systemd[1]: Unit home.mount entered failed state.
Feb 22 13:08:28 localhost systemd[1]: Starting Permit User Sessions...
Feb 22 13:08:28 localhost systemd[1]: Started Permit User Sessions.
Feb 22 13:08:28 localhost systemd[1]: Started Getty on tty1.
Feb 22 13:08:28 localhost systemd[1]: Starting Getty on tty1...
Feb 22 13:08:28 localhost systemd[1]: Reached target Login Prompts.
Feb 22 13:08:28 localhost systemd[1]: Starting Login Prompts.
Feb 22 13:08:28 localhost systemd[1]: Reached target Multi-User System.
Feb 22 13:08:28 localhost systemd[1]: Starting Multi-User System.
Feb 22 13:08:28 localhost systemd[1]: Reached target Graphical Interface.
Feb 22 13:08:28 localhost systemd[1]: Starting Graphical Interface.
Feb 22 13:08:28 localhost systemd[1]: Started Stop Read-Ahead Data Collection 10s After Completed Startup.
Feb 22 13:08:28 localhost systemd[1]: Starting Stop Read-Ahead Data Collection 10s After Completed Startup.
Feb 22 13:08:28 localhost systemd[1]: Starting Update UTMP about System Runlevel Changes...
Feb 22 13:08:28 localhost systemd[1]: Started Update UTMP about System Runlevel Changes.
Feb 22 13:08:28 localhost systemd[1]: Startup finished in 15.426s (kernel) + 14.100s (userspace) = 29.526s.
Feb 22 13:08:35 localhost kernel: IPv6: ADDRCONF(NETDEV_CHANGE): ib0: link becomes ready
Feb 22 13:08:35 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
Feb 22 13:08:36 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:08:37 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0016, status -22
Feb 22 13:08:38 localhost ntpd[363]: Listen normally on 4 ib0 fe80::5280:200:b3:9769 UDP 123
Feb 22 13:08:38 localhost ntpd[363]: new interface(s) found: waking up resolver
Feb 22 13:08:38 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:08:40 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:08:42 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:08:44 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:08:46 localhost kernel: ib0: sendonly multicast join failed for ff12:601b:ffff:0000:0000:0000:0000:0002, status -22
Feb 22 13:08:58 localhost systemd[1]: Starting Stop Read-Ahead Data Collection...
Feb 22 13:08:58 localhost systemd[1]: Started Stop Read-Ahead Data Collection.
Feb 22 13:09:08 localhost systemd[1]: Created slice user-0.slice.
Feb 22 13:09:08 localhost systemd[1]: Starting user-0.slice.
Feb 22 13:09:08 localhost systemd-logind[449]: New session 1 of user root.
Feb 22 13:09:08 localhost systemd[1]: Started Session 1 of user root.
Feb 22 13:09:08 localhost systemd[1]: Starting Session 1 of user root.
Feb 22 13:09:08 localhost login[742]: pam_unix(login:session): session opened for user root by LOGIN(uid=0)
Feb 22 13:09:08 localhost login[742]: ROOT LOGIN ON tty1
Feb 22 13:09:12 localhost pbs_mom[769]: LOG_ERROR::send_update_to_a_server, Status update successfully sent after 1 MOM status update intervals
Feb 22 13:09:17 localhost kernel: ib0: enabling connected mode will cause multicast packet drops
Feb 22 13:09:17 localhost dhclient[788]: DHCPDISCOVER on ib0 to 255.255.255.255 port 67 interval 3 (xid=0x3b109a1)
Feb 22 13:09:17 localhost dhclient[788]: DHCPREQUEST on ib0 to 255.255.255.255 port 67 (xid=0x3b109a1)
Feb 22 13:09:17 localhost dhclient[788]: DHCPOFFER from 10.10.0.1
Feb 22 13:09:17 localhost dhclient[788]: DHCPACK from 10.10.0.1 (xid=0x3b109a1)
Feb 22 13:09:19 localhost NET[833]: /usr/sbin/dhclient-script : updated /etc/resolv.conf
Feb 22 13:09:19 n1003 dhclient[788]: bound to 10.10.1.103 -- renewal in 3031 seconds.
Feb 22 13:09:21 n1003 ntpd[363]: Listen normally on 5 ib0 10.10.1.103 UDP 123
Feb 22 13:09:21 n1003 ntpd[363]: new interface(s) found: waking up resolver
Feb 22 13:09:22 n1003 ntpd[363]: 0.0.0.0 c614 04 freq_mode
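Reading the two excerpts side by side: in the InfiniBand-only boot, the network script reports "no link present" for ib0 at 13:08:27, five seconds after it started bringing the interface up, while the kernel only logs "ib0: link becomes ready" at 13:08:35. In the boot with Ethernet attached, DHCP on enp0s25 runs until 13:09:02, and ib0 becomes ready immediately once its turn comes. One possible reading is that the IB link simply needs longer than the network script waits. The following is a minimal sketch for pulling the relevant lines out of a syslog file. The function name `ib_link_timeline` is made up for illustration.

```shell
#!/bin/sh
# ib_link_timeline FILE
# Prints the syslog lines that show when ib0 was brought up, when the
# network script finished (or gave up) determining its IP information,
# and when the kernel reported the link state of ib0.
# The function name is hypothetical; the patterns are taken from the
# log excerpts in this thread.
ib_link_timeline() {
    grep -E 'Bringing up interface ib0|Determining IP information for ib0|ib0: link (is not|becomes) ready' "$1"
}
```

It could be called as `ib_link_timeline /var/log/messages`, assuming the node logs to that file.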

Regards, justj
 

bitmuncher

Senior-Nerd
I would also check ibcheckstate after a failure. Sometimes it gives useful hints. An IB adapter that does not come up during boot is a fairly common problem, though. Normally it works after a while, because the system automatically tries to activate the adapter again later. If that does not happen, it can sometimes help to create an init script containing the ifup command that runs at the very end of the boot process.
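A minimal sketch of what such a late-boot script could look like, assuming the interface is called ib0 and that the kernel exposes its link state in /sys/class/net/ib0/carrier. The function names, the `SYSFS_NET` variable, and the timeout are illustrative choices and not part of any existing tool.

```shell
#!/bin/sh
# Sketch: wait for a carrier on the IPoIB interface, then run ifup.
# SYSFS_NET and both function names are assumptions made for this example.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# wait_for_carrier IFACE SECONDS
# Polls once per second; returns 0 as soon as the carrier file reads "1",
# or 1 if that has not happened after SECONDS attempts.
wait_for_carrier() {
    remaining="$2"
    while [ "$remaining" -gt 0 ]; do
        # A failed read (file missing or device down) counts as "no carrier yet".
        state="$(cat "$SYSFS_NET/$1/carrier" 2>/dev/null || echo 0)"
        [ "$state" = "1" ] && return 0
        sleep 1
        remaining=$((remaining - 1))
    done
    return 1
}

# bring_up_ib IFACE SECONDS
# Sets the device up, waits for the link, then hands over to ifup.
bring_up_ib() {
    iface="$1"
    timeout="$2"
    # Set the device administratively up so the carrier state can be read.
    ip link set dev "$iface" up || return 1
    if wait_for_carrier "$iface" "$timeout"; then
        ifup "$iface"
    else
        echo "no carrier on $iface after ${timeout}s" >&2
        return 1
    fi
}
```

A script run at the end of the boot process could then call, for example, `bring_up_ib ib0 60`. The 60 seconds are a guess and would need tuning to the fabric.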
 

justj

Member of Honour
Hello Bitmuncher,

thanks a lot for your reply. I have not been able to solve the problem as such, but I found a workaround that is acceptable for me:
I installed a cron job that runs after boot and, in sequence, brings up the IPoIB interface, mounts the NFS shares, and starts the Torque daemon for the compute nodes.

It is probably not the cleanest solution, but it works reproducibly and very reliably.
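The sequence described above could be sketched roughly as follows. This is not the actual script, only an illustration: the mount points /home and /data are taken from the boot logs and are assumed to be defined in /etc/fstab, the unit name `pbs_mom` is an assumption based on the "TORQUE pbs_mom daemon" log line, and the function names are made up.

```shell
#!/bin/sh
# Sketch of a post-boot bring-up sequence; all names are illustrative.

# run_step DESCRIPTION COMMAND...
# Announces the step, runs the command, and reports a failure on stderr.
run_step() {
    desc="$1"; shift
    echo "step: $desc"
    "$@" || { echo "failed: $desc" >&2; return 1; }
}

# post_boot_bringup
# Order matters: IPoIB first, then the NFS mounts, then the Torque MOM.
# The sequence stops at the first step that fails.
post_boot_bringup() {
    run_step "bring up ib0"  ifup ib0                || return 1
    run_step "mount /home"   mount /home             || return 1
    run_step "mount /data"   mount /data             || return 1
    # Assumed unit name; "restart" may be needed if the unit already runs.
    run_step "start pbs_mom" systemctl start pbs_mom || return 1
}
```

One way to trigger it after boot would be a cron entry such as `@reboot root /usr/local/sbin/post-boot-bringup.sh` in a file under /etc/cron.d, where the script path is hypothetical and the script ends with a call to `post_boot_bringup`.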

Regards,
justj
 