Thursday, February 26, 2015

Server Network Port Enumeration in RHEL5.X/6.X

Problem

Let's say you have three network cards plugged into your server's PCI expansion slots as well as four LOM (LAN On Motherboard) ports. The default port enumeration on the rear of the server is as follows:

LOM        PCI
===        ================
[3]        -[8 ]-[4]-[12]-
[2]        -[9 ]-[5]-[13]-
[1]        -[10]-[6]-[14]-
[0]        -[11]-[7]-[15]-
===        ================

You would like to change the port enumeration and set up bonding channels as follows:

LOM        PCI
===        ================
[15]       -[0]-[4]-[8 ]-   bond0: eth0,4
[14]       -[1]-[5]-[9 ]-   bond1: eth1,5
[13]       -[2]-[6]-[10]-   bond2: eth2,6
[12]       -[3]-[7]-[11]-   bond3: eth3,7
===        ================

Before RHEL 5.X (kernel 2.6.18.X) and RHEL 6.X (kernel 2.6.32.X), it was customary to bind network interface names (eth0, eth1, etc.) to MAC addresses or UUIDs in the ifcfg-eth{0..N} files in /etc/sysconfig/network-scripts/.

For example, here is a sample ifcfg-eth0 that would work reliably pre-RHEL5.X/RHEL6.X:

DEVICE=eth0
ONBOOT=yes
BOOTPROTO=none
HWADDR=00:1a:2b:3c:4d:5e
MASTER=bond0
SLAVE=yes
USERCTL=no

Regardless of the order in which network interfaces come up and make themselves known to the Linux kernel, since we have assigned the name eth0 to MAC 00:1a:2b:3c:4d:5e the interface will keep this name. If we had not explicitly mapped this HWADDR to eth0, it might for instance change to eth2 if it was the 3rd interface to come up on boot. Obviously port names that change with every boot would be a big headache.

Unfortunately, explicitly mapping a MAC to a device name within ifcfg-ethX files is no longer guaranteed to work 100% of the time:
https://bugzilla.redhat.com/show_bug.cgi?id=491432


Solution

Because explicitly assigning HWADDR or UUID values to network device names within the files in /etc/sysconfig/network-scripts/ is not reliable, we will first comment out all lines starting with HWADDR or UUID in the ifcfg-ethX files. Assuming there are 16 ports to be enumerated (0~15), you could achieve this with the following bash for-loop (run from the directory /etc/sysconfig/network-scripts):

for i in {0..15}; do
  # Anchor on the start of the line so a second run does not produce ##HWADDR
  sed -i -e 's/^HWADDR/#HWADDR/' -e 's/^UUID/#UUID/' "ifcfg-eth$i"
done

We will rely exclusively on /etc/udev/rules.d/70-persistent-net.rules to enumerate ports. The udev daemon reads this file at boot to determine the device names for network ports. Within this file you can map either MAC addresses or PCI bus-info IDs to ethX device names.

Although MAC addresses have historically been the more common choice, I recommend mapping device names by PCI bus-info ID in 70-persistent-net.rules. When a NIC crashes and you restart networking with service network stop followed by service network start, two different device names using the same MAC can appear. For some reason, this issue does not occur when mapping by the network cards' PCI bus-info IDs.

1. Find PCI bus-info IDs

You can find the PCI bus-info ID in several ways: from the kernel ring buffer with dmesg, from lspci or from ethtool -i. I will cover the latter two methods as they are easiest to parse.

# lspci -D | grep Solar
0000:41:00.0 Ethernet controller: Solarflare Communications SFC9020 [Solarstorm]
0000:41:00.1 Ethernet controller: Solarflare Communications SFC9020 [Solarstorm]

The -D flag makes sure that PCI domain numbers are printed. From man lspci:

"Always show PCI domain numbers. By default, lspci suppresses them on machines which have only domain 0."

In the example above, I found the PCI bus-info IDs for two ports on a 10G Solarflare NIC. Since the first field of the output always holds the PCI bus-info ID, we can print that field alone with awk '{print $1}'. The following command prints just the bus-info IDs for all Ethernet interfaces:

lspci -D | grep -i ether | awk '{print$1}'

Note that the -i flag for grep above enables case-insensitive search.

On a machine with two 4-port NICs plugged into PCI slots and 3 onboard network ports, the output of the above command might look something like this:

0000:04:00.3
0000:04:00.2
0000:04:00.1
0000:04:00.0
0000:03:00.0
0000:03:00.1
0000:03:00.2
0000:03:00.3
0000:0d:04.2
0000:0d:04.1
0000:0d:04.0

You will notice that ports on the same NIC share the same PCI bus prefix and differ only in the final number. Therefore we can deduce that 0000:04:00.{0..3} denote 4 ports on one particular PCI network card and that 0000:03:00.{0..3} denote 4 ports on a separate PCI NIC. Also notice that the 3 LOM (LAN On Motherboard) Ethernet ports have PCI bus-info IDs that are entirely separate from those of the PCI NICs. lspci -D | grep -i ether usually prints the PCI bus IDs in the same order as the network interfaces, so the first line would be eth0, the second line eth1, and so on; confirm the mapping with ethtool -i before relying on it.
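
As a small sketch of the grouping described above, the following helper strips the final number from each bus-info ID and counts how many ports share each prefix. The name ports_per_card is my own invention for illustration, and the helper only assumes that the IDs arrive one per line on standard input.

```shell
#!/bin/sh
# Count network ports per card by stripping the trailing ".N"
# from each PCI bus-info ID read on stdin.
# ports_per_card is a hypothetical helper name, not part of any tool.
ports_per_card() {
    cut -d. -f1 | sort | uniq -c | awk '{print $2, $1}'
}

# Example with three IDs taken from the listing above (fed in by hand):
printf '%s\n' 0000:04:00.3 0000:04:00.2 0000:0d:04.0 | ports_per_card
```

On a real machine you would pipe the lspci command from above into it: lspci -D | grep -i ether | awk '{print $1}' | ports_per_card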

You can also obtain the PCI bus ID for each network port using ethtool -i ethX which returns the following fields:

driver: 
version:
firmware-version: 
bus-info:
supports-statistics:
supports-test:
supports-eeprom-access:
supports-register-dump:
supports-priv-flags:

Typing this command by hand for each interface is tedious on a server with many network ports, so use a bash for-loop one-liner (replace N with the highest-numbered network port on your server):

for i in {0..N}; do ethtool -i eth$i; done
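
The loop above prints every field for every interface. If you only want the name-to-bus-info pairs, something like the following sketch may be easier to read. The helper names extract_bus_info and list_bus_info are hypothetical, and the sketch assumes sysfs is mounted at /sys and that ethtool -i prints its fields as "key: value" lines as shown above.

```shell
#!/bin/sh
# Pull the bus-info value out of `ethtool -i` output read on stdin.
# extract_bus_info and list_bus_info are hypothetical helper names.
extract_bus_info() {
    awk -F': ' '$1 == "bus-info" {print $2}'
}

# Print "<iface> <bus-info>" for every ethN interface under /sys/class/net.
list_bus_info() {
    for path in /sys/class/net/eth*; do
        [ -e "$path" ] || continue      # the glob matched nothing
        iface=${path##*/}
        printf '%s %s\n' "$iface" "$(ethtool -i "$iface" | extract_bus_info)"
    done
}
```

Call list_bus_info with no arguments to get one line per interface, which is handy to keep next to your notes from the next step.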


2. Map NIC port locations using ethtool -p ethX

In the problem statement at the beginning of this post, we assumed that you already knew the layout of device names to ports (i.e. eth0 is the bottom LOM port, eth12 is the top right-most NIC port). But if you're working on a brand-new server, how would you map out device name/port locations in the first place?

Thanks to ethtool -p ethX you can identify where each network port is located, because this command tells the port to flash its status LED until the command is terminated. For the command to work, however, the network interface ethX must be up. To quickly bring up all the Ethernet interfaces on a system, use the following bash for-loop one-liner (where N is the highest-numbered network port on your server):

for i in {0..N}; do ip link set eth$i up; done

It is better to use ip link set ... up from iproute2 rather than the venerable ifup script (part of the initscripts package on RHEL, not net-tools) because the first command is very fast; ifup is 8-10x slower because it tries to bind an IP to each interface when bringing it up, whereas ip link set ethX up simply activates the interface.

Once you have activated all Ethernet interfaces, you can use ethtool -p ethX, pencil and paper to write down the location of each port when it flashes. (Note that some LOM ports do not support ethtool -p.)
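
To avoid pressing Ctrl-C for every port, you can pass ethtool -p a duration in seconds and walk through the interfaces one by one. The following is only a sketch: blink_ports is a hypothetical helper, and the ETHTOOL variable is an assumption of mine that lets you substitute echo for a dry run.

```shell
#!/bin/sh
# Blink each interface's LED in turn so the physical port can be noted.
# blink_ports is a hypothetical helper; ETHTOOL can be overridden
# (e.g. ETHTOOL=echo) for a dry run that only prints the arguments.
# Usage: blink_ports SECONDS IFACE [IFACE...]
blink_ports() {
    seconds=$1
    shift
    for iface in "$@"; do
        echo "Blinking $iface for ${seconds}s"
        "${ETHTOOL:-ethtool}" -p "$iface" "$seconds" ||
            echo "$iface: port identification failed or is not supported" >&2
    done
}
```

For example, blink_ports 10 eth0 eth1 eth2 eth3 flashes each of the four ports for ten seconds in sequence.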


3. Map PCI bus-info ID to network ifaces in 70-persistent-net.rules

The format for entries in /etc/udev/rules.d/70-persistent-net.rules for RHEL6.X using PCI bus IDs instead of MAC addresses is as follows:

SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",KERNELS=="0000:01:00.0",NAME="ethX"
...
where ethX is eth0, eth1, etc.

In step 1, you should have found the PCI bus ID for each network port. To change the name of a given port, simply assign a different name to the line containing the relevant PCI bus ID. The PCI bus-info ID should be entered between double quotes after KERNELS==

RHEL6.X initially generates /etc/udev/rules.d/70-persistent-net.rules automatically from the udev rule /lib/udev/rules.d/75-persistent-net-generator.rules but the network ifaces will not be nicely ordered in the file. By default this file will contain MAC addresses mapped to port names and will look something like this:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c6", ATTR{type}=="1", KERNEL=="eth*", NAME="eth6"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c7", ATTR{type}=="1", KERNEL=="eth*", NAME="eth7"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c5", ATTR{type}=="1", KERNEL=="eth*", NAME="eth5"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c2", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c1", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c3", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c4", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"

This is inconvenient because eth6, not eth0, is on the first line, so when we change the port names, it is easy to get confused. It is apparent that the field separator character is comma "," and that the field we wish to sort in ascending order is field 7, "NAME". We can achieve this using the GNU coreutils program sort:

sort -t ',' -k7 -V -o 70-persistent-net.rules 70-persistent-net.rules

where -t sets the delimiter character, -k7 sorts on the 7th field, -V (--version-sort) sorts numbers embedded in text naturally (names of the form string+number), and -o writes the result back to the file (sort reads all of its input before writing, so naming the input file as the output is safe). Without the -V flag, eth10, eth11 ... would follow eth1 instead of eth9.

If we now look at 70-persistent-net.rules we can see that the file is nicely sorted in ascending order of network ports:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c1", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c2", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="28:80:23:a2:d7:c3", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c4", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c5", ATTR{type}=="1", KERNEL=="eth*", NAME="eth5"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c6", ATTR{type}=="1", KERNEL=="eth*", NAME="eth6"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="40:a8:f0:3b:a1:c7", ATTR{type}=="1", KERNEL=="eth*", NAME="eth7"
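
The effect of the -V flag described above is easy to check on a few made-up interface names. This is just an illustration with GNU sort; LC_ALL=C is set so the plain sort compares bytes regardless of your locale.

```shell
#!/bin/sh
# Compare plain lexical sort with GNU version sort on interface names.
names=$(printf '%s\n' eth10 eth2 eth1 eth11)

echo "plain sort:"
printf '%s\n' "$names" | LC_ALL=C sort       # eth1 eth10 eth11 eth2

echo "version sort:"
printf '%s\n' "$names" | LC_ALL=C sort -V    # eth1 eth2 eth10 eth11
```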

Now you have to replace the fields DRIVERS, ATTR{address} and ATTR{type} with the fields BUS and KERNELS, and then paste in the PCI bus IDs you obtained earlier through lspci -D | grep -i ether or ethtool -i ethX.

On RHEL5.X, however, udev does not recognize the key KERNELS, so you must use the key ID instead (RHEL6.X understands both KERNELS and ID):

SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:04:00.0",NAME="eth7"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:04:00.1",NAME="eth6"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:04:00.2",NAME="eth5"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:04:00.3",NAME="eth4"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:03:00.0",NAME="eth0"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:03:00.1",NAME="eth1"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:03:00.2",NAME="eth2"
SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",ID=="0000:03:00.3",NAME="eth3"

(Note that the network ifaces above in 70-persistent-net.rules are not sorted anymore because we have customized the port enumerations.)
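
Rather than editing each line by hand, you could generate the rules from a list of bus-info/name pairs. The following is a minimal sketch in the rule format shown above; make_rule and make_rules are hypothetical helper names, and the optional key argument (KERNELS by default, ID for RHEL5.X) is my own convention.

```shell
#!/bin/sh
# Print one udev rule mapping a PCI bus-info ID to an interface name.
# make_rule is a hypothetical helper; the third argument selects the
# match key: KERNELS (RHEL 6.X, the default here) or ID (RHEL 5.X).
make_rule() {
    bus_id=$1
    name=$2
    key=${3:-KERNELS}
    printf 'SUBSYSTEM=="net",ACTION=="add",BUS=="pci",KERNEL=="eth*",%s=="%s",NAME="%s"\n' \
        "$key" "$bus_id" "$name"
}

# Read "<bus-info> <name>" pairs on stdin and emit one rule per pair.
# Usage: make_rules [KERNELS|ID] < pairs.txt
make_rules() {
    while read -r bus_id name; do
        if [ -n "$bus_id" ]; then
            make_rule "$bus_id" "$name" "$1"
        fi
    done
}
```

For example, printf '%s\n' '0000:04:00.0 eth7' '0000:03:00.0 eth0' | make_rules ID prints two RHEL5.X-style rules. Review the output before copying it into the rules file.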

Also, when creating a port enumeration file for RHEL/CentOS 5.X, beware of existing port enumeration files in /etc/udev/rules.d/.

Sometimes a file named 60-net.rules exists in this path in lieu of 70-persistent-net.rules. If another udev rule file that enumerates ports already exists, edit that file instead of creating a new 70-persistent-net.rules!
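
One quick way to look for such files is to search the rules directory for anything that matches the net subsystem. This is only a sketch: find_net_rules is a hypothetical helper, and it assumes the rules are written with the usual SUBSYSTEM=="net" spelling (no spaces around ==).

```shell
#!/bin/sh
# List udev rules files in a directory that mention the net subsystem.
# find_net_rules is a hypothetical helper; the directory defaults to
# /etc/udev/rules.d when no argument is given.
find_net_rules() {
    dir=${1:-/etc/udev/rules.d}
    grep -l 'SUBSYSTEM=="net"' "$dir"/*.rules 2>/dev/null
}
```

Running find_net_rules with no arguments prints the matching file names, one per line. No output means nothing matched.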

For some reason, ATCA blade servers (from Emerson Network Power, Adlink, etc) will accept a shortened format for 70-persistent-net.rules that only contains the fields KERNEL, ID (for RHEL5.X) and NAME as follows:

KERNEL=="eth*", ID=="0000:07:00.0", NAME="ethX"

Note that this format only seems to work for ATCA (Advanced Telecom Computing Architecture) hardware. If you try this on HP Proliant machines, for example, /var/log/messages will complain of invalid udev rules.


4. Apply changes

First shut down network services:

service network stop

Unload all network drivers (this step is not always necessary, but it is a good habit; you can find the driver used by each iface with ethtool -i ethX):

modprobe -r driverName (tg3, igb, bnx2, ixgbe, e1000e etc)
...

Note that if your machine uses Solarflare 10G cards, you cannot simply remove the sfc driver with modprobe -r or rmmod; you must use a special script called onload_tool provided by Solarflare. From the directory containing the Solarflare scripts:

./onload_tool unload (with no further arguments)

To reload the sfc driver, simply replace unload with reload.

Unload bonding module (if you have bonding channels defined in /etc/sysconfig/network-scripts)

modprobe -r bonding

Reload the kernel modules you just removed

modprobe driverName
modprobe bonding
...

Reload udev rules to apply the changes in 70-persistent-net.rules

start_udev
(or udevadm control --reload-rules)

Start network services

service network start

Verify with ethtool -p ethX that your network port enumerations have taken effect.
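
The whole sequence in this step can be collected into one small script. The following is a sketch under a few assumptions: run and apply_enumeration are hypothetical helpers, the DRY_RUN variable is my own addition for previewing the commands, the bonding module is assumed to be in use, and Solarflare's sfc driver is not handled (use onload_tool for that as described above).

```shell
#!/bin/sh
# Sketch of the apply sequence from step 4. run() and apply_enumeration()
# are hypothetical helpers; with DRY_RUN set to a non-empty value the
# commands are only printed, not executed.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Usage: apply_enumeration DRIVER [DRIVER...]
apply_enumeration() {
    run service network stop
    for drv in "$@"; do run modprobe -r "$drv"; done
    run modprobe -r bonding
    for drv in "$@"; do run modprobe "$drv"; done
    run modprobe bonding
    run start_udev
    run service network start
}
```

For example, set DRY_RUN=1 and call apply_enumeration tg3 igb to preview the commands, then unset DRY_RUN and run it again as root to apply them.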