Emerging Technologies: SSI at SuperComputing ‘09

The Intel Server System Infrastructure (SSI) Project has a lofty goal: to standardize the hardware for x86/x86_64 based blade servers and their backplanes. This is without a doubt a game changer for enterprise and academic computing, but its current incarnation leaves a big hole in debuggability and security for the applications and operating systems running on the blades.

So, what’s a blade server?

I’ll get to exactly what a blade server is shortly, but first an analogy helps with visualization: think of your local telephone company and the phone line they provide you. If you want an extra telephone in your house and you’ve already got the connectors wired, you buy a new phone, connect it to the jack, and you’ve instantly got a dial tone with no fuss and no pain. This is exactly how blade servers are supposed to work for the computing industry. Put simply, a blade server (or “blade” for short) is a modular computer: a self-contained motherboard, processor, RAM, and storage module that’s easily plugged into or unplugged from a rack specifically designed to accept it. It typically has a proprietary connector carrying power, network, and health management functionality that’s automatically connected and configured when the blade is plugged into the backplane (which is the “jack”).

So, when a company or university needs more computational power because the web-server is bogged down generating pages or because a scientific simulation would be too slow otherwise, they buy an extra blade and plug it in to instantly get another computing resource. Of course they probably need to configure software on the new computer for it to be useful, but that’s not important for this discussion. The take-away point is that a blade gives a no-fuss method for adding computers to your infrastructure, and as a bonus, when a computer inevitably fails, blades give a great way to swap out the failed module for a new one in a matter of seconds.

The SSI Project

A major problem hindering blade adoption has always been the lack of any standard (blades have been around en masse for at least a decade, and yet most people have never heard of them!). Concretely, Appro will sell you their own design for a backplane and blade, which is different than the one IBM sells, which is different than the one Dell sells. Of course, this is a headache for numerous reasons. First, there is inherent risk in the future cost of any blade you buy; if a vendor decides there isn’t enough margin in their blade product line and doubles their prices, in general you can’t turn to a third-party blade as a second source to combat the margin surfing. Next, the same vendor may decide to end-of-life the very blades that your backplane accepts, leaving you searching eBay for used parts should you ever need more blades or replacement components. And worst case, what happens if the vendor goes out of business right after you buy your first backplane and single node? It’s potentially the Edsel car of the computing industry!

Intel has done something to change all of this. Their goal is of course to sell more CPUs, and blades are a perfect way for them to do so: the marginal per-blade upgrade cost is typically much less than that of a full computer, so people buy more CPUs because they can afford more blades. So, they’ve pulled together a consortium of juggernauts in the blade industry to design a standard architecture for blade servers and their backplanes, so that most of the drawbacks of blade infrastructure are washed away. If and when vendors adopt the standard, you’ll be able to cross company lines for sourcing blade servers and backplanes, just like you can cross company lines for hard disks, RAM, workstations, etc. today. If they pull off such a feat, it will be a landmark event in the computing industry to say the least, an event as significant as Compaq’s upheaval of the PC market with the reverse engineering and re-implementation of IBM’s BIOS, or AMD’s implementation of the x86 processor line in the i386 and i486 processor days. Let’s hope they do.

The Gaping Hole

Unfortunately, the SSI picture isn’t all roses today; the standards committee has inadvertently created a security and debuggability nightmare.

I’ve glossed over the networking aspects of blade computing, but further discussion is warranted, because this is the chief problem with the current SSI implementation. The backplanes for blade servers usually have an integrated network switch of some sort, with Ethernet and InfiniBand incarnations being the most common. The utility here is clear: by including a network switch in the backplane, a single network cable can be connected to the backplane and provide outside-world connectivity for every blade in the rack. There’s not a thing wrong with the idea behind this method, but the SSI implementation lacks a way to monitor inter-node traffic, which probably makes the security administrators and MPI application developers among my readers groan.

What exactly is the problem, in case you didn’t catch it? (And if you didn’t catch it, don’t worry; it can be a subtle point even if you’re on the periphery of one of the above categories.) The problem is that you can’t see anything that the nodes in an SSI backplane say to one another. In effect, you can only monitor the connections from the backplane to the outside world.
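To make the blind spot concrete, here is a minimal sketch of which traffic a monitor attached to the backplane’s outside-world link can observe. Everything in it is hypothetical and for illustration only: the `Flow` type, the `visible_at_uplink` function, and the host names are my own, not part of the SSI specification. It simply assumes that a flow crosses the uplink only when at least one endpoint sits outside the chassis.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    """One direction of a conversation between two hosts (hypothetical model)."""
    src: str
    dst: str


def visible_at_uplink(flows, blades):
    """Return the flows a monitor on the backplane uplink can observe.

    Assumption: traffic between two blades is switched inside the
    backplane and never appears on the uplink.
    """
    return [f for f in flows if f.src not in blades or f.dst not in blades]


# Hypothetical chassis with one web-server blade and two database blades.
blades = {"web1", "db1", "db2"}
flows = [
    Flow("shopper", "web1"),  # crosses the uplink
    Flow("web1", "db1"),      # stays on the backplane
    Flow("db1", "web1"),      # stays on the backplane
    Flow("web1", "shopper"),  # crosses the uplink
]

seen = visible_at_uplink(flows, blades)
hidden = [f for f in flows if f not in seen]
print("seen:  ", seen)
print("hidden:", hidden)
```

In this toy model both blade-to-blade flows land in `hidden`, which is exactly the traffic a security administrator or MPI developer would most like to inspect.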

Consider the following illustrative case for why the current SSI specification is broken. A very common architecture for an internet website running an online store is to run one or a few web-server blades that talk directly to internet shoppers, serving them images, shopping carts, and other pages, with two or three times as many database blades connected to the web-servers and holding current information about item availability, stock, prices, outstanding orders, etc. A typical attack vector is the following: a malicious user breaks into the web-server through a known or newly developed vulnerability over an encrypted (https) link. The user then directs the web-server to fetch credit card numbers, names, and addresses from the database server, typically through the unencrypted link between the web-server and the database server, then tunnels the information through the encrypted link back to their PC. With the SSI blade system, a network forensics or capture device would have no way of seeing the unencrypted data leakage, since it would happen exclusively on the blade backplane. In fact, unless the database query statements are audited and/or an SSL decryptor is used to feed the forensics systems, the company under attack will probably never know. In practice, most corporations have neither an SSL decryptor nor query auditing, since both are expensive and detail-oriented undertakings, and the need for them is normally mitigated by forensics devices snooping the unencrypted traffic.
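As a rough illustration of what a forensics device can do once it sees the unencrypted web-server-to-database traffic, the sketch below scans a captured payload for digit runs that look like card numbers and keeps only those passing the Luhn checksum. The function names, the 13 to 16 digit pattern, and the sample payload are assumptions made for this example; a real product would do far more, and none of this helps if the traffic never reaches the capture point.

```python
import re

# Hypothetical pattern: standalone runs of 13 to 16 ASCII digits.
CARD_RE = re.compile(rb"\b\d{13,16}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(payload: bytes) -> list[str]:
    """Return digit runs in an unencrypted payload that look like card numbers."""
    candidates = (m.group().decode("ascii") for m in CARD_RE.finditer(payload))
    return [c for c in candidates if luhn_ok(c)]


# Made-up database reply: a well-known test card number plus an order
# number that fails the checksum and is therefore ignored.
payload = b"Jane Doe|4111111111111111|12 Main St|order 1234567890123"
print(find_card_numbers(payload))  # ['4111111111111111']
```

A check like this only works on traffic the device can actually capture, which is why the missing view of the backplane matters so much in the scenario above.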

Another concrete example is in order. Super-computers are typically strung together from many single computers of the same makeup, obviously a prime market for blade servers. The developers of the applications that run on super-computers typically use the Message Passing Interface (MPI) framework to make the single computers act in parallel and in lock-step as one large super-computer. MPI programming is unfortunately hard and error prone, however, which is why super-computer programmers command big salaries. The quintessential method for debugging MPI programs is to capture the messages that individual computers pass to one another and examine them for errors or other incorrect behaviour. With a super-computer made of SSI blades, however, this debugging paradigm is completely unavailable. A packet capture appliance has no single point of entry, and thus doesn’t see the messages that nodes pass to one another. Instead, developers need to debug through other means, like developing a framework for dumping messages to a log on each machine, collecting them, ordering them, and analyzing them, hoping that the framework didn’t miss a critical component of the message. Or perhaps they could run tcpdump on each node and hope that the traffic is slow enough for that tool to keep up (which may sound trivial but is in fact a major problem), though in that case they still need a way to collect and coalesce the resulting PCAP files.
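The “collect and order” step of such a workaround might look something like the sketch below, which merges per-node message logs into a single timeline. The `Message` record and `merge_node_logs` function are hypothetical names of my own, and the sketch leans on two big assumptions: each node’s log is already sorted by time, and the nodes’ clocks are synchronized closely enough for cross-node ordering to mean anything.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Message(NamedTuple):
    """One logged message (hypothetical record layout)."""
    timestamp: float  # seconds; assumes clocks are synchronized across nodes
    node: str
    summary: str


def merge_node_logs(*logs: Iterable[Message]) -> Iterator[Message]:
    """Lazily merge per-node logs into one timeline ordered by timestamp.

    Each input log must already be sorted by timestamp.
    """
    return heapq.merge(*logs, key=lambda m: m.timestamp)


# Made-up logs from two nodes exchanging a pair of messages.
node0 = [
    Message(0.001, "node0", "send tag=7 to node1"),
    Message(0.004, "node0", "recv tag=8 from node0's peer node1"),
]
node1 = [
    Message(0.002, "node1", "recv tag=7 from node0"),
    Message(0.003, "node1", "send tag=8 to node0"),
]

timeline = list(merge_node_logs(node0, node1))
for msg in timeline:
    print(f"{msg.timestamp:.3f} {msg.node}: {msg.summary}")
```

For real PCAP files, Wireshark’s `mergecap` tool performs a similar timestamp-ordered merge. Either way, the result is only as trustworthy as the per-node captures and clocks behind it, which is the whole reason a single capture point on the backplane would be preferable.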

Of course, there are many other examples of what is lost without the ability to snoop backplane network traffic, but the idea behind the problem should at least be clear from the scenarios already presented. What’s needed, then, is a fix. The SSI specification can be augmented to support a network TAP port, and all of these issues vanish in a blink. I’ve personally told the SSI developers about this issue, and it’s now on their radar. More feedback, of course, will always help.

Conclusion

The SSI platform represents a giant leap forward for the computing industry as a whole, but it introduces a major security and debugging nightmare into environments that already have too many of those things. A simple change can make the collective lives of every SSI blade user simpler, so they can worry about everything else.
