• This can't be due to a recent kernel upgrade, as the VM servers are still running SL5.3.
  • According to Red Hat bug 520888, a fix for servers with bnx2 cards was included in the 2.6.18-194.3.1.el5 kernel.

  • 04-10-2010 16:00, metropolitan (R710): lost connectivity to various servers, including on-wire. Active slave was eth0. Taking eth0 down was sufficient to force a failover to eth1 and restore connectivity. Note: metropolitan is not configured to use SOL.
  • 05-10-2010 14:00, metropolitan (R710): lost all network connectivity; routing disappeared. Active slave was eth1. Took the network down, removed the bnx2 module, added the line options bnx2 disable_msi=1 to /etc/modprobe.conf, reloaded bnx2 with modprobe and restarted the network. We can't confirm that disable_msi actually took effect, though.
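One possible way to check whether disable_msi took effect is to look at the interrupt type that /proc/interrupts reports for the NIC. This is a minimal sketch under an assumption worth confirming on a real host first: that MSI vectors are listed with a type beginning PCI-MSI (for example PCI-MSI-edge), while legacy interrupts show an IO-APIC type. The function name uses_msi is hypothetical.

```python
def uses_msi(interrupts_text, iface):
    """Report whether iface's interrupt lines in /proc/interrupts look like MSI.

    Returns True if any IRQ line naming iface (or a per-vector name such as
    "eth0-0") has a type starting with "PCI-MSI", False if iface appears only
    on non-MSI lines, and None if iface does not appear at all.
    Assumption: MSI vectors are labelled "PCI-MSI..." in the type column.
    """
    found = False
    for line in interrupts_text.splitlines():
        # Shared IRQs list their devices comma-separated, so split on commas too.
        tokens = line.replace(",", " ").split()
        # Only numbered IRQ rows (e.g. "98:") name devices; skip the CPU
        # header row and summary rows such as "NMI:" or "LOC:".
        if not tokens or not tokens[0].rstrip(":").isdigit():
            continue
        names = tokens[1:]
        if any(t == iface or t.startswith(iface + "-") for t in names):
            found = True
            if any(t.startswith("PCI-MSI") for t in names):
                return True
    return False if found else None
```

After reloading bnx2, uses_msi(open("/proc/interrupts").read(), "eth0") returning False would suggest the option was honoured; None means the interface name was not found in the file at all.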

As of 28 October 2010, we have not seen a recurrence for a while. If it does happen again, try the following LCFG resources:

 !hardware.options       mADD(bnx2)
 hardware.modopts_bnx2   disable_msi=1


April 2012:

cockerel, an HP ProLiant DL180 G6 running SL6_64 (currently at SL6.1), occasionally loses bonding: /proc/net/bonding/bond0 reports eth1 down, but ifconfig -a reports eth1 as UP, although without the RUNNING flag that eth0 has. See below. Physically, eth1 is the add-on PCI network card.

  [cockerel]root: /sbin/ifconfig -a
  bond0     Link encap:Ethernet  HWaddr 78:E3:B5:08:DB:01
            inet addr:129.215.33.10  Bcast:129.215.33.255  Mask:255.255.255.0
            inet6 addr: fe80::7ae3:b5ff:fe08:db01/64 Scope:Link
            UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
            RX packets:252218532 errors:0 dropped:69466 overruns:1823 frame:0
            TX packets:308686496 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:0
            RX bytes:81414229216 (75.8 GiB)  TX bytes:46581724233 (43.3 GiB)

  eth0      Link encap:Ethernet  HWaddr 78:E3:B5:08:DB:01
            UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
            RX packets:252148811 errors:0 dropped:0 overruns:1823 frame:0
            TX packets:308686496 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:81409226811 (75.8 GiB)  TX bytes:46581724233 (43.3 GiB)
            Memory:fbee0000-fbf00000

  eth1      Link encap:Ethernet  HWaddr 78:E3:B5:08:DB:01
            UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
            RX packets:69721 errors:0 dropped:69466 overruns:0 frame:0
            TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:5002405 (4.7 MiB)  TX bytes:0 (0.0 b)
            Interrupt:32 Memory:f8000000-f8012800
  ...[snip]...

  [cockerel]root: cat /proc/net/bonding/bond0
  Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

  Bonding Mode: fault-tolerance (active-backup)
  Primary Slave: None
  Currently Active Slave: eth0
  MII Status: up
  MII Polling Interval (ms): 100
  Up Delay (ms): 0
  Down Delay (ms): 0

  Slave Interface: eth0
  MII Status: up
  Link Failure Count: 0
  Permanent HW addr: 78:e3:b5:08:db:01
  Slave queue ID: 0

  Slave Interface: eth1
  MII Status: down
  Link Failure Count: 1
  Permanent HW addr: d8:d3:85:ae:68:40
  Slave queue ID: 0
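The mismatch is visible in the ifconfig flag lines: both slaves are UP, but only eth0 is RUNNING. A minimal sketch of a hypothetical helper that picks out such interfaces from saved ifconfig -a output, assuming the net-tools layout shown above, where each stanza starts with "NAME Link encap:" and the flags line is the one carrying the MTU: field:

```python
import re

# Start of an interface stanza, e.g. "eth1      Link encap:Ethernet ...".
HEADER = re.compile(r"^\s*(\S+)\s+Link encap:")


def up_not_running(ifconfig_text):
    """Return interfaces whose flags include UP but not RUNNING, in output order."""
    suspects = []
    current = None
    for line in ifconfig_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = match.group(1)
            continue
        tokens = line.split()
        # The flags line is the one that also carries the "MTU:nnnn" field.
        if current and any(t.startswith("MTU:") for t in tokens):
            if "UP" in tokens and "RUNNING" not in tokens:
                suspects.append(current)
            current = None  # one flags line per interface
    return suspects
```

Fed the cockerel output above, it would return ['eth1']; fed the sauce output further down, it would return ['eth0'] (eth2 there is not UP at all, so it is not reported).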

To 'fix' the problem, it seems enough to run

  [cockerel]root: /sbin/ifdown eth1
  [cockerel]root: /sbin/ifup eth1

On all occasions this has happened, the 'down' interface has been eth1; eth0 has always been okay.
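Before running ifdown/ifup, it may help to confirm which slave the bonding driver itself considers down. A minimal sketch, assuming the /proc/net/bonding/bond0 layout shown in the transcripts on this page; the function names are hypothetical:

```python
def bonding_slaves(text):
    """Map each slave interface to its MII Status from /proc/net/bonding/bondN text."""
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current is not None:
            # The first MII Status after a "Slave Interface" line belongs to
            # that slave; the bond-level MII Status comes before any slave.
            slaves[current] = value
            current = None
    return slaves


def down_slaves(text):
    """Return the sorted names of slaves whose MII Status is not "up"."""
    return sorted(name for name, status in bonding_slaves(text).items()
                  if status != "up")
```

On the host, down_slaves(open("/proc/net/bonding/bond0").read()) should name the interface to cycle with ifdown/ifup; for the cockerel output above it gives ['eth1'].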

Next time this happens, we should also record the output of mii-tool.

-- AlastairScobie - 04 Oct 2010


June 2014:

sauce lost bonding. It's an HP ProLiant DL180 G6 running SL6.4 with kernel 2.6.32-431.17.1. Nagios reported eth0 down and eth1 up. As suggested above, ifdown followed by ifup fixed the problem.

[sauce+]cc: cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v3.6.0 (September 26, 2009)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: eth1 (primary_reselect always)
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: down
Speed: Unknown
Duplex: Unknown
Link Failure Count: 1
Permanent HW addr: d4:85:64:65:18:dc
Slave queue ID: 0

Slave Interface: eth1
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: d8:d3:85:dc:fa:69
Slave queue ID: 0

[sauce+]cc: ifconfig -a
bond0     Link encap:Ethernet  HWaddr D4:85:64:65:18:DC  
          inet addr:129.215.216.6  Bcast:129.215.216.255  Mask:255.255.255.0
          inet6 addr: fe80::d685:64ff:fe65:18dc/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:13350103 errors:0 dropped:9924 overruns:0 frame:0
          TX packets:6957807 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:13589365904 (12.6 GiB)  TX bytes:711261478 (678.3 MiB)

eth0      Link encap:Ethernet  HWaddr D4:85:64:65:18:DC  
          UP BROADCAST SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:10179 errors:0 dropped:9924 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:887237 (866.4 KiB)  TX bytes:0 (0.0 b)

eth1      Link encap:Ethernet  HWaddr D4:85:64:65:18:DC  
          UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
          RX packets:13339924 errors:0 dropped:0 overruns:0 frame:0
          TX packets:6957807 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:13588478667 (12.6 GiB)  TX bytes:711261478 (678.3 MiB)
          Memory:fbee0000-fbf00000 

eth2      Link encap:Ethernet  HWaddr D8:D3:85:DC:FA:68  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Memory:fbe60000-fbe80000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:158957623 errors:0 dropped:0 overruns:0 frame:0
          TX packets:158957623 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:34119287298 (31.7 GiB)  TX bytes:34119287298 (31.7 GiB)

[sauce]root: /sbin/mii-tool bond0
bond0: 10 Mbit, half duplex, link ok
[sauce]root: /sbin/mii-tool eth0
eth0: negotiated 100baseTx-FD, link ok
[sauce]root: /sbin/mii-tool eth1
eth1: negotiated 100baseTx-FD, link ok

[sauce]root: /sbin/mii-tool -v bond0
bond0: 10 Mbit, half duplex, link ok
  product info: vendor 00:01:00, model 0 rev 4
  basic mode:   10 Mbit, half duplex
  basic status: link ok
  capabilities:
  advertising: 
[sauce]root: /sbin/mii-tool -v eth0
eth0: negotiated 100baseTx-FD, link ok
  product info: vendor 00:08:18, model 54 rev 6
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
[sauce]root: /sbin/mii-tool -v eth1
eth1: negotiated 100baseTx-FD, link ok
  product info: vendor 00:aa:00, model 57 rev 1
  basic mode:   autonegotiation enabled
  basic status: autonegotiation complete, link ok
  capabilities: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD
  advertising:  100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD flow-control
  link partner: 100baseTx-FD 100baseTx-HD 10baseT-FD 10baseT-HD

tycho Tue, 26 Aug 2014 07:16:18 +0100 eth2

Linux tycho.inf.ed.ac.uk 2.6.32-431.17.1.el6.x86_64 #1 SMP Wed May 7 14:14:17 CDT 2014 x86_64 x86_64 x86_64 GNU/Linux (up since Wed Aug 20 13:46)

This is the first time tycho has done this. It was fixed by ifdown/ifup, but not immediately: eth2 did not show as back up in /proc/net/bonding/bond0 straight after the ifdown/ifup, but did a few minutes later with no other intervention.
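Because the bonding driver on tycho took a few minutes to report eth2 as up again, a check after the fix should allow for some delay rather than reading the file once. A minimal sketch, assuming the /proc/net/bonding layout shown on this page; the function names, the 600-second timeout and the 10-second interval are arbitrary choices for illustration, not measured values:

```python
import time


def slave_mii_status(text, slave):
    """Return the MII Status reported for `slave` in /proc/net/bonding text, or None."""
    current = None
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
        elif key == "MII Status" and current == slave:
            return value
    return None


def wait_for_slave_up(slave, read_state, timeout=600.0, interval=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll read_state() until `slave` reports MII Status "up".

    read_state is any callable returning the bonding file's text.
    Returns True if the slave came back within `timeout` seconds, else False.
    """
    deadline = clock() + timeout
    while True:
        if slave_mii_status(read_state(), slave) == "up":
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

For example, after ifdown/ifup on tycho: wait_for_slave_up("eth2", lambda: open("/proc/net/bonding/bond0").read()). The sleep and clock parameters exist only so the loop can be exercised without real waiting.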

See the inf-unit report for the 2014-08-27 Operational meeting for more details.

Services Machines June/July 2014

Having not seen problems for many months, in the last week or so we've had the following occur. The machines are all HP DL180s. The usual ifdown/ifup of the affected NIC restores normal operation.

  Machine   Date of bonding issue  NIC down  Last rebooted / kernel
  naga      02-07-2014 13:31:38    eth0      Mon Jun 23 06:34
  naga      03-07-2014 12:41:26    eth0      Mon Jun 23 06:34
  cetus     24-06-2014 21:43:54    eth0      Tue Jun 24 06:37
  gorgon    02-07-2014 02:03:47    eth0      Thu Jun 26 06:34
  gorgon    18-07-2014 01:07:22    eth0      Thu Jun 26 06:34
  minotaur  27-06-2014 05:31:57    eth0      Wed Jun 25 06:34
  minotaur  17-07-2014             eth0      Wed Jun 25 06:40
  naga      31-07-2014 10:40       eth0      still June 23
  cetus     31-07-2014 23:41       eth0      Tue Jun 24 06:48
  minotaur  01-08-2014 16:600      eth0      Wed Jun 25 06:40
  minotaur  02-08-2014 09:15       eth0      Wed Jun 25 06:40
  minotaur  05-08-2014 00:05       eth0      Wed Jun 25 06:40
  cetus     07-08-2014 16:35       eth0      still Jun 25
  minotaur  19-08-2014 04:00       eth0
  naga      21-08-2014 19:10       eth0      still June 23
  cetus     22-08-2014 13:45       eth0      still June 24
  cetus     23-08-2014 00:45       eth0      still June 24
  naga      27-08-2014 01:50       eth0      still June 23
  cetus     27-08-2014 11:40       eth0      still June 24
  cetus     29-08-2014 22:00       eth0      still June 24
  cetus     31-08-2014 16:00       eth0      still June 24
  cetus     09-09-2014 09:40       eth0      5 Sept 2014 - latest kernel, SL6.5, and firmware updated
  cetus     09-09-2014 17:20       eth0      5 Sept 2014 - latest kernel, SL6.5, and firmware updated
  cetus     11-09-2014 22:50       eth0      5 Sept 2014 - note it does not have the MSI LCFG tweak
  cetus     14-09-2014 08:30       eth0      ditto (it will get the tweak at next reboot)
  cetus     14-09-2014 20:30       eth0      ditto (it will get the tweak at next reboot)

  cetus rebooted 15-09-2014 06:30 with MSI disabled.

-- RogerBurroughes - 19 Aug 2014

Topic revision: r33 - 15 Sep 2014 - 08:35:50 - NeilBrown