Unlearning College

I’m writing up my Shmoocon talk as a series of blog posts. In this post, I’m going to talk about the pernicious problem of college indoctrination. What colleges teach about networking is bad. A lot of the material is out of date. A lot is just plain inaccurate (having never been “in date”). A lot is aspirational: networks don’t work that way, but your professor wishes they would. As a consequence, students leave college helpless, failing at real-world problems like portability, reliability, cybersecurity, and scalability.

A good example of this problem is the college textbook “Unix Network Programming” by Richard Stevens. It’s the most common textbook on the subject. In my Shmoocon talk, over half the audience raised their hands in response to my question “have you read this book?”.

Students love this book. Stevens is considered almost a saint. But he’s not a saint, he’s the devil. Stevens is the AOL of network programming. AOL was a huge company during the dot-com boom that aggressively marketed dialup access to newbies. It flooded the Internet with people steeped in the AOL version of the Internet, which was out of sync with the real Internet. AOL and Stevens teach the same lesson: that AOL (or Unix) is the primary provider of the Internet, and that if it’s not something AOL (or Unix) provides, then it’s not something you need to worry about.

But in much the same way AOL sucked at networking, so does Unix.

Consider the issue of byte-order or “endianness”. When computers store a multi-byte integer in memory, they can put the most significant byte first (“big-endian”) or the least significant byte first (“little-endian”). It’s a simple idea made simpler by the fact that you almost NEVER need to worry about it.

Consider the lowly IPv4 address. There are three “canonical” forms of the IP address. These are:

char *ip_string = "10.1.2.3";
unsigned char ip_bytes[] = {10, 1, 2, 3};
int ip_integer = 0x0a010203;

The first is the string humans read. The second is how the address appears in a packet on the network. The third is an internal representation of the address as an integer (written in hex, so the decimal number “10” becomes “0a”).

Converting “ip_string” or “ip_bytes” into an integer means parsing the bytes one at a time, such as in this example:

unsigned ip_integer = (unsigned)ip_bytes[0]<<24 | ip_bytes[1]<<16 | ip_bytes[2]<<8 | ip_bytes[3];
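
Spelled out a little more fully, both conversions might look like the sketch below. The function names “ip_from_bytes” and “ip_from_string” are my own invention for illustration, not part of any standard API, and the string parser only accepts plain dotted-decimal input. Neither function ever asks what byte-order the CPU uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the integer form from the four packet bytes, most significant
 * byte first. The result is the same on big- and little-endian CPUs,
 * and the input pointer needs no particular alignment. */
uint32_t ip_from_bytes(const unsigned char *b)
{
    assert(b != NULL);
    return (uint32_t)b[0] << 24 | (uint32_t)b[1] << 16
         | (uint32_t)b[2] << 8  | (uint32_t)b[3];
}

/* Parse a dotted-decimal string such as "10.1.2.3" one character at a
 * time. Returns 1 and stores the integer in *out on success, or returns
 * 0 on malformed input (wrong number of fields, octet > 255, etc.). */
int ip_from_string(const char *s, uint32_t *out)
{
    uint32_t result = 0;
    int i;

    assert(s != NULL && out != NULL);
    for (i = 0; i < 4; i++) {
        unsigned octet = 0;
        int digits = 0;

        while (*s >= '0' && *s <= '9') {
            octet = octet * 10 + (unsigned)(*s - '0');
            if (octet > 255 || ++digits > 3)
                return 0;
            s++;
        }
        if (digits == 0)
            return 0;               /* empty field, e.g. "10..2.3" */
        result = result << 8 | octet;
        if (i < 3) {
            if (*s != '.')
                return 0;           /* too few fields */
            s++;
        }
    }
    if (*s != '\0')
        return 0;                   /* trailing junk */
    *out = result;
    return 1;
}
```

Given the bytes {10, 1, 2, 3} or the string "10.1.2.3", both functions produce 0x0a010203.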

The problem is that back in 1983 when Bill Joy and BBN put the first TCP/IP stack into Unix, they added a fourth, incorrect form, where they used C to “cast” the raw bytes of the packet into an integer:

int ip_something = *(int*)ip_bytes;

While conceptually the same, “ip_integer” and “ip_something” are different, for two reasons.

The first is that on RISC processors, this will sometimes crash. RISC demands that integers be aligned on natural boundaries. Compilers guarantee that all “internal” integers will be aligned with padding and such. But they can’t guarantee that for “external” integers. When the coder casts a packet pointer into an integer pointer, the program may crash.

The second issue is the byte-order/endian problem. On little-endian processors like Intel x86 and most ARM, the bytes will be reversed, so the value of “ip_something” will be:

int ip_something = 0x0302010a;
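
If you want to see the difference on your own machine, something like the following sketch should do it. The helper names are hypothetical, and I’ve assumed “memcpy()” as a stand-in for the pointer cast: it reads the same four bytes in host order, but without the alignment crash, so the demo behaves the same on RISC as on x86.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Read four bytes the way the cast does: in whatever order the host CPU
 * uses. memcpy stands in for *(int*)p so the read is well-defined even
 * when p is not 4-byte aligned. */
uint32_t load_host_order(const unsigned char *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}

/* Read four bytes by shifting: most significant byte first, with the
 * same result on every CPU. */
uint32_t load_by_shifting(const unsigned char *p)
{
    assert(p != NULL);
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16
         | (uint32_t)p[2] << 8  | (uint32_t)p[3];
}

/* Returns 1 when the host stores the least significant byte first. */
int host_is_little_endian(void)
{
    const unsigned char probe[4] = {1, 0, 0, 0};

    return load_host_order(probe) == 1;
}
```

For the bytes {10, 1, 2, 3}, “load_by_shifting()” gives 0x0a010203 everywhere, while “load_host_order()” gives 0x0302010a on a little-endian host and 0x0a010203 on a big-endian one.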

Having first made the wrong decision to cast integers, Unix then provides “byte-swapping” macros to fix the situation: “htons()”, “ntohs()”, “htonl()”, and “ntohl()”. You might use them like:

int ip_integer = ntohl(*(int*)ip_bytes);

Students think Unix’s byte-swapping macros are a good solution to the problem. But this is a problem of Unix’s own making. Had you never mixed integer pointers with byte pointers in the first place, you never would’ve needed to swap the bytes.
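
The same goes for the other direction, and for 16-bit fields like port numbers. Here is a rough sketch of writing and reading packet fields without any of the swapping macros. The names “store_be32()”, “store_be16()”, and “fetch_be16()” are made up for this example, not taken from any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 32-bit value into a packet buffer, most significant byte
 * first. No cast, no byte swap, no alignment requirement on p. */
void store_be32(unsigned char *p, uint32_t v)
{
    assert(p != NULL);
    p[0] = (unsigned char)((v >> 24) & 0xff);
    p[1] = (unsigned char)((v >> 16) & 0xff);
    p[2] = (unsigned char)((v >> 8) & 0xff);
    p[3] = (unsigned char)(v & 0xff);
}

/* Write a 16-bit value (for example a port number) the same way. */
void store_be16(unsigned char *p, uint16_t v)
{
    assert(p != NULL);
    p[0] = (unsigned char)((v >> 8) & 0xff);
    p[1] = (unsigned char)(v & 0xff);
}

/* Read a 16-bit value back out of the packet, one byte at a time. */
uint16_t fetch_be16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((unsigned)p[0] << 8 | p[1]);
}
```

Storing 0x0a010203 this way leaves the bytes 10, 1, 2, 3 in the buffer on any CPU, which is exactly what goes on the wire.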

By the way, if you learned networking using some language other than C/C++, you are probably wondering what the big deal is. Your language doesn’t allow you to cast a byte pointer into an integer pointer. You’ve always been forced to do things the right way. Therefore, you’ve never been confused by byte-order. This is purely a problem for people who use C and who cast pointers between integers and bytes.

I use byte-order as my example because, while it’s a largely unimportant issue, it’s easy to understand. The more important issues are harder to understand.

My Shmoocon talk was about scalability. The Unix kernel has enormous scalability problems. That’s because it wasn’t designed with carrying network traffic in mind. Back in the day, AT&T designed Unix to be a “control” system, to control how data was flowing through the network. It wasn’t designed to be that “data” system itself.

Over the years, it’s in fact been good enough as a “data” system for most purposes. But as the Internet grows, its limitations are becoming more and more important. The solution to big Internet scale problems is to get rid of more and more of the kernel. That’s why the Nginx web server is rapidly displacing Apache on big websites. Apache relies heavily on the kernel to maintain abstractions like threads. Nginx does more raw programming, getting rid of this abstraction.

But even Nginx has limits. That was the purpose of my talk: to introduce people to the idea that instead of relying upon the Unix kernel to do things for you, you can do things for yourself and achieve much higher levels of scalability. The Unix solution to byte-order is wrong. So is the Unix solution to network stacks, multi-core programming, and network management. These things are holding you back, so if you learn more non-Unix network programming, you can achieve things that otherwise look impossible today.

Tuesday, 19 February 2013