Wednesday, October 16, 2013

Understanding Virtualization


@ericwhyne
Originally posted at  http://datatactics.blogspot.com/

The word virtual means to be very close to being something without actually being it. Virtualization is one of those terms that get thrown around in our industry and usually only get virtually close to being understood. This article is a short primer on what virtualization is, along with some of the surrounding vocabulary you need in order to get a stronger grasp on its more nuanced aspects.

In order to better understand virtualization, let's first review a few basics about computers. Most modern computers have three fundamental things: persistent memory, temporary memory, and processors. The bits that make up a program and the bits that make up the data it operates on are stored and processed alongside each other in memory. This is a basic description of what's called the Von Neumann architecture, and there's a ton of other stuff augmenting this approach on modern computers. Programs are loaded from persistent memory into temporary memory for execution. The processor is pointed at the memory location where the program starts, then reads chunks of it and acts on the instructions contained therein. The actions usually involve the modification of other memory locations, like the memory that holds the image on your screen or the memory that sends information over your network card.

Processors have something called an instruction set. This is the list of valid "go do this" instructions that a given processor will understand. When you compile software, human-readable instructions called source code are converted into machine-readable instructions called machine code. It's all very simple, but it gets more complicated because not all processors have the same instruction set. If you remember one thing from this paragraph, remember this: processors aren't all the same. Different families of processors are called processor architectures. Machine code compiled for one architecture won't run on any of the others, and this created some big hurdles to making software that could be widely distributed. Despite processors being different, we expect the software we write and compile to run across all of them without error.
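To make that concrete, here's a trivial C function and, roughly, what compilers emit for it on two common architectures. The exact instructions depend on the compiler and its settings, so treat the assembly as illustrative:

    /* add.c -- one piece of source code... */
    int add(int a, int b) {
        return a + b;
    }

    /*
     * ...becomes different machine code on different architectures.
     * Compiled for x86-64, roughly:    lea  eax, [rdi+rsi]
     *                                  ret
     * Compiled for ARM (AArch64):      add  w0, w0, w1
     *                                  ret
     * The x86-64 bytes are gibberish to an ARM processor, and vice versa.
     */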

Some clever software engineers came up with a solution to this compatibility problem. The general idea was to create a virtual machine. A virtual machine can be logically similar to a real machine, but in most cases it isn't. In most cases, it's really just more software that acts as a translator from a standard set of instructions it knows to the instruction sets of the various physical computers. Software could then be written once for this uniform virtual machine, and the virtual machine itself could be written for each of the processor architectures. Recall from the previous paragraph that when you compile software for a specific architecture, the result is called machine code. When you compile software to run on a virtual machine, the result is called byte code or portable code (P-code for short). Byte code is a bit more general purpose since it's made to be interpreted by the uniform constructs of the virtual machine, and hence runs slower.
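Here's a minimal sketch of that idea in C: a toy virtual machine with a made-up four-instruction byte code (not any real format, Java's included). The program array is the portable part; it runs unchanged anywhere this interpreter can be compiled, and only the interpreter itself is architecture-specific:

    #include <stdio.h>

    /* A made-up four-instruction byte code, for illustration only. */
    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    /* The "virtual machine": a loop that translates each byte code
     * instruction into work the real processor knows how to do. */
    static void run(const int *code) {
        int stack[64];
        int sp = 0;                         /* stack pointer */
        for (int pc = 0; ; ) {              /* program counter */
            switch (code[pc++]) {
            case OP_PUSH:  stack[sp++] = code[pc++];         break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_PRINT: printf("%d\n", stack[--sp]);      break;
            case OP_HALT:  return;
            }
        }
    }

    int main(void) {
        /* "Byte code" for: print(2 + 3) */
        int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(program);
        return 0;
    }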

The piece of software that manages virtual machines is called a hypervisor (for a language runtime like Java's, the usual term is just the virtual machine or the runtime, but the job is similar). The most famous virtual machine is probably Java's. The Java runtime spawns Java Virtual Machines (JVMs), which are the magic that makes the same compiled Java files run on multiple processor architectures and operating systems. In this way, Java achieved something called portability. You could write once and run anywhere; albeit, originally, it ran kind of slow. This problem was overcome with a technique called Just In Time compiling, or JIT. JVMs that do JIT, rather than interpreting each byte code instruction every time it executes, compile the byte code to machine code, which runs much faster. But since compiling is often a time-consuming task itself, there's a prioritization that has to take place. JIT is usually only done for the pieces of the code that run a lot during execution: inner loops and such.
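That prioritization is typically just bookkeeping: interpret everything, count how often each method runs, and hand it to the compiler once it crosses a hotness threshold. Here's a runnable C sketch of that decision logic; the threshold, struct, and function names are invented for illustration, not taken from any real JVM:

    #include <stdio.h>

    #define HOT_THRESHOLD 3   /* invented for the demo; real VMs use large, tuned values */

    /* Hypothetical per-method bookkeeping a VM might keep. */
    struct method {
        const char *name;
        long call_count;
        int  compiled;        /* in a real VM, a pointer to machine code */
    };

    /* Stand-ins for the interpreter and the JIT compiler. */
    static void interpret(struct method *m)   { printf("interpreting %s\n", m->name); }
    static void run_native(struct method *m)  { printf("running compiled %s\n", m->name); }
    static void jit_compile(struct method *m) { printf("JIT compiling %s\n", m->name); m->compiled = 1; }

    /* The prioritization: interpret cold code, compile code that proves hot. */
    static void execute(struct method *m) {
        if (!m->compiled && ++m->call_count >= HOT_THRESHOLD)
            jit_compile(m);   /* one-time cost, paid only for hot code */
        if (m->compiled)
            run_native(m);
        else
            interpret(m);
    }

    int main(void) {
        struct method inner_loop = { "inner_loop", 0, 0 };
        for (int i = 0; i < 5; i++)
            execute(&inner_loop);   /* switches to native on the third call */
        return 0;
    }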

If you're like most people in the information technology industry, I've probably already drifted away from your preconceived notions of virtualization. Most people don't think of byte code versus machine code when you mention virtualization. I'm about to clear things up a bit for you before I muddy them up again. Hypervisors come in two types. Type 1 hypervisors act as an arbitrator directly on top of the hardware. Type 2 hypervisors are what I think most people picture when they think of virtualization. In this case, we go beyond providing an abstract interface to an architecture and instead arbitrate access to a host operating system's resources. Type 2 hypervisors are how you can run a Windows virtual machine on a Linux host operating system, or vice versa. With the advent of multi-processor and multi-core machines, and of processor architectures designed to support virtualization (I know, crazy, right?), it has become efficient to run multiple instances of operating systems simultaneously on the same piece of hardware, and that ability has been a boon to productivity. Virtualizing whole operating system instantiations has enabled configuration management at the OS level, allowing operating systems to be configured in a specific way and rapidly deployed to or removed from hardware. It's a lot easier to do configuration management at the machine level if you have a working machine to do it with.

[Photo: there are no good pictures of virtualization, so here's a picture of my kid playing in sand.]
Another great side effect of both types of virtualization is called sandboxing. If you want to give an untrusted user or program resources, but want to be able to easily destroy any changes they make and completely block them off from the rest of the system, virtualization lets you truly achieve that. Just think of a child playing in a sandbox. It doesn't matter what they do to the sand; you can always set it back to whatever state it was in before they got there. Rake the sand, so to speak. The analogy stops when the child buries your car keys (I have a toddler at home): a sandboxed program, unlike a real child, can't reach anything outside the box. Sandboxes are for untrusted users or programs, and they give you the ability to easily wipe away anything those users could possibly do, even if they have full permissions inside the sandboxed environment.

Now that you think you understand it all, it's time to muddy things up a bit again. Having virtualized operating systems is great, but duplicating an entire operating system is still a lot of overhead. Creating a virtual machine that pretends to be actual hardware, with all its caveats and exceptions, is hard work for a computer. Wouldn't it be easier if we could reuse the work already done by the host operating system, and have the tenant systems implement only the new stuff we want, while still being properly sandboxed? Enter lightweight virtualization, most notably Linux containers. With some clever modifications to the OS kernel (the part of the OS that interfaces with the hardware), lightweight virtual systems achieve something called resource isolation. This is still a topic with a lot of change happening, but I think we are going to see containers overtake more traditional Type 2 hypervisors in most production environments within a short timeframe.
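One of those kernel modifications is Linux namespaces, which give a process its own private view of some system resource. Here's a minimal C sketch (assuming Linux and root privileges) that uses clone() to put a child process in its own hostname namespace; real container engines combine several namespace types with cgroups and filesystem images to get full resource isolation:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Runs inside a private UTS (hostname) namespace: changes made here
     * are invisible to the host and to any other container. */
    static int child(void *arg) {
        (void)arg;
        sethostname("sandbox", 7);
        char name[64];
        gethostname(name, sizeof(name));
        printf("inside the container: hostname is %s\n", name);
        return 0;
    }

    static char child_stack[1024 * 1024];   /* stack for the cloned child */

    int main(void) {
        /* CLONE_NEWUTS gives the child its own hostname namespace. Real
         * containers add CLONE_NEWPID, CLONE_NEWNS, CLONE_NEWNET, and so on,
         * plus cgroups for CPU and memory limits. Needs root to run. */
        pid_t pid = clone(child, child_stack + sizeof(child_stack),
                          CLONE_NEWUTS | SIGCHLD, NULL);
        if (pid == -1) { perror("clone"); return 1; }
        waitpid(pid, NULL, 0);

        char name[64];
        gethostname(name, sizeof(name));
        printf("back on the host: hostname is still %s\n", name);
        return 0;
    }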

If this article was helpful, don't forget to comment or share it. We will keep writing this stuff if people keep finding it useful.