Monday, December 15, 2014

[SOLVED] PXE booting woes on ATCA blade server hardware (ENP & Adlink)

Background

At my current company, we do a lot of work for two of the "Big Three" (SKT, LGU+, KT) mobile telecom providers in Korea. Interestingly, most of the 3G and 4G mobile communications infrastructure here runs on RHEL 5.X and 6.X.

In addition to more conventional server hardware like the HP ProLiant line, Korean telecoms also use ATCA (Advanced Telecommunications Computing Architecture) blade servers made by ENP (Emerson Network Power) and AdLink. ATCA blades are unusual in that they have no regular back panel with RJ45 network ports. Instead, each blade connects to an ATCA switch through the chassis backplane.

Another oddity of ATCA hardware is that in many cases there is no VGA port to connect to (as in the ENP ATCA 7350 and 736X series). Instead, to get console output, you must connect a serial console cable and use a terminal emulator like minicom or PuTTY. Although the ATCA hardware engineers usually provide console settings (e.g. 57600 8N1, 115200 8N2) for the terminal emulator, in many cases you still have to play around with the speed until you get a screen that doesn't show gibberish.

Unlike 4U servers, which have PCI expansion slots for plugging in network cards, everything on ATCA blades is on-board, including the network adapters. This makes it infeasible to flash an updated PXE ROM onto the NIC itself. Instead, you would need a BIOS upgrade to replace an outdated PXE implementation.

Problem

gpxelinux.0 from syslinux 4.05-8.el7 for RHEL 7 / CentOS 7 returns a kernel error when attempting to PXE boot various models of ATCA blades from ENP and Adlink. If I use pxelinux.0, I can at least get as far as the PXE menu (menu.c32), but the boot then seems to hang after the Linux kernel image vmlinuz and its associated initrd.img have been sent.

My PXE server setup using dnsmasq and darkhttpd works just fine when installing via PXE to more conventional server hardware like the HP ProLiant series.

The pxe config I used for PXE on ATCA blades is as follows:


The syslinux documentation mentions that broken PXE implementations are not uncommon, and presents a list of hardware known to have problems with syslinux PXE. I didn't see any mention of ATCA in the list, but I suspect that differences in PXE implementation are causing problems for me.

Stabs at a solution

Other engineers at my company use relatively old versions of syslinux, generally version 4.X available from the CentOS 6 repos, yet are able to install Linux over PXE using pxelinux.0 and a more conventional pxe server setup with httpd, xinetd, dhcpd, tftp, etc. I plan to recreate their setup and see if that resolves my issues with PXE on ATCA hardware.

I also need to note which PXE boot agents (e.g. Intel Boot Agent) the ATCA blades use, as well as their PXE firmware version numbers.

Starting from syslinux 5.X, lpxelinux.0 became available, which natively supports fetching boot files over HTTP and FTP in addition to TFTP. Perhaps trying lpxelinux.0, or the PXE files from the most recent version of syslinux (6.03 as of Dec. 2014), will address my problem.

Postscript 2014-12-26:

It turns out that the PXE booting problems I experienced were due to "luser" error, not any problems with ATCA hardware.

Luser Error 1:

Incorrect syntax in the append line specifying serial console settings

Since most ATCA blades don't have VGA connectors, the only way to see console output is over a remote serial console (through a serial-to-RJ45 cable). The console= parameter must be added to the append line of the PXE config entry (the same line that carries initrd=) so that it is passed to the kernel. The correct syntax is

console=tty0 console=ttyS0,X (where X is serial communication speed in bps)

Unfortunately, I specified console=ttyS0 first, which did not work for me. The tldp Remote Serial Console HOW-TO describes how the console parameter is handled:

The Linux kernel is configured to select the console by passing it the console parameter. The console parameter can be given repeatedly, but the parameter can only be given once for each console technology. So console=tty0 console=lp0 console=ttyS0 is acceptable but console=ttyS0 console=ttyS1 will not work.

Information from kernel.org regarding console over serial port:

tty0 for the foreground virtual console
ttyX for any other virtual console
ttySx for a serial port
lp0 for the first parallel port
ttyUSB0 for the first USB serial device 

You can specify multiple console= options on the kernel command line.
Output will appear on all of them. The last device will be used when
you open /dev/console. So, for example:
console=ttyS1,9600 console=tty0
defines that opening /dev/console will get you the current foreground
virtual console, and kernel messages will appear on both the VGA
console and the 2nd serial port (ttyS1 or COM2) at 9600 baud.

Since I specify console=ttyS0,57600 last, it will be the device used when /dev/console is opened.
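
The ordering rule quoted above can be checked with a few lines of Python. This is only a rough sketch, not anything the kernel or syslinux provides: the function names console_devices and dev_console are made up for illustration, and the two sample command lines are assumed examples built from the settings mentioned in this post.

```python
def console_devices(cmdline):
    """Return the console= devices in the order they appear on the command line."""
    devices = []
    for token in cmdline.split():
        if token.startswith("console="):
            # Drop the "console=" prefix and any ",speed" style options.
            value = token[len("console="):]
            devices.append(value.split(",", 1)[0])
    return devices


def dev_console(cmdline):
    """Return the last console= device (the one /dev/console opens), or None."""
    devices = console_devices(cmdline)
    return devices[-1] if devices else None


# Assumed example command lines; only the order of the console= entries differs.
good = "initrd=initrd.img console=tty0 console=ttyS0,57600"
bad = "initrd=initrd.img console=ttyS0,57600 console=tty0"

print(dev_console(good))  # ttyS0
print(dev_console(bad))   # tty0
```

With the serial port listed last, /dev/console ends up on ttyS0, which is what a blade without a VGA port needs.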


Luser Error 2:

Incorrect vmlinuz and initrd.img

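The PXE boot .cfg file above indicates that RHEL 5.4 will be installed, but I accidentally used the vmlinuz and initrd.img from RHEL 5.6! The kernel and installer initrd need to come from the same release as the installation tree they are pointed at.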
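
A mix-up like this could be caught before booting by comparing the files served over PXE against the copies shipped in the install tree. The sketch below is one possible way to do that, assuming the tree keeps its boot files under images/pxeboot (as RHEL install media does); the function names and the example paths in the comment are hypothetical and not taken from my actual setup.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_boot_files(tftp_dir, tree_root, names=("vmlinuz", "initrd.img")):
    """Return the names whose served copy differs from the install tree's copy."""
    pxeboot = Path(tree_root) / "images" / "pxeboot"
    return [
        name
        for name in names
        if sha256_of(Path(tftp_dir) / name) != sha256_of(pxeboot / name)
    ]


# Hypothetical usage; both paths are placeholders:
# mismatched_boot_files("/srv/tftp/rhel5.4", "/srv/http/rhel5.4")
```

An empty list means the served vmlinuz and initrd.img are byte-for-byte the ones from that tree; any name in the list points at a file copied from somewhere else.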


In addition to fixing errors 1 and 2, I also tried using lpxelinux.0 from syslinux 6.03 as the dhcp-boot image and am happy to report that it works fine. Now PXE boot and RHEL 5.X installation over http works for me!