Thursday, December 15, 2011

Questions on Enterprise & Cloud Computing

Last year we ran a little series called Ask the Experts, in which you sent in your virtualization-related questions and we had them answered by experts at Intel and VMware, as well as by our own expert on all things Enterprise & Cloud Computing, Johan de Gelas.
Below you will find the first two answers to some of the questions posed in our "Ask the Experts: Enterprise & Cloud Computing" blog post.
Q ("Tarrant64"): With the growing use of cloud computing, what are ISPs doing to ensure adequate bandwidth for both the provider and the customers?
A: As a company outsources more and more services to the cloud, it becomes increasingly important to make sure that the internet connection is not a single point of failure. Quite a few cloud vendors understand this and offer dedicated links to their data centers, along with appropriate SLAs for uptime or response time. Of course, these SLAs are pretty expensive. These kinds of "network costs" are often "forgotten" when vendors praise the cost efficiency of "going cloud".

Or the short answer: Don't rely on a simple internet connection to an ISP if you are planning to outsource vital IT services.
Q ("Tarrant64"): How do cloud servers return results so fast? I imagine there are literally petabytes of data in Google's servers, and it would be way too expensive to run it all on SSDs. Is there some sort of hierarchy that masks latency? Even then, wouldn't it still take long just to communicate with the server itself?
A: We are currently in contact with some of Facebook's engineers, as we are evaluating the servers that are part of Facebook's Open Compute Project. Facebook builds its own software, servers and of course data centers, and shares these technologies. As Facebook is a lot more open about its underlying architecture, I will focus on Facebook.
The foundations of Facebook are no different from what most sites use: PHP and MySQL. But that is not a very scalable combination. The first trick the Facebook engineers used was to compile the PHP code into C++ code with the help of their own "HipHop" software. C++ code can be compiled into much faster and smaller binaries. Facebook reported that this alone reduced CPU usage by 50%. These compute-intensive binaries now run on dual Xeon "Westmere" servers.
A look inside an empty datacenter room at Facebook.
The main speedup comes from an improved version of memcached. In the early days of Facebook, it was Mark Zuckerberg himself who improved this open-source software. He describes the improvements Facebook made to memcached here. By caching web objects, only 5% of requests have to hit the database. Memcached runs on many dual Opteron "Magny-Cours" servers, as these are the cheapest way to house 384 GB (now even 512 GB in version 2) of RAM in one server.
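To make the role of the cache concrete, here is a minimal sketch of the cache-aside pattern that memcached enables. All names are illustrative (a plain dict stands in for the memcached cluster and for MySQL); this is not Facebook's actual code, just the lookup pattern that lets 95% of requests skip the database.

```python
cache = {}                                  # stands in for a memcached cluster
database = {"user:42": {"name": "Alice"}}   # stands in for MySQL

db_hits = 0

def get(key):
    """Return a web object, consulting the cache first."""
    global db_hits
    if key in cache:            # cache hit: no database round trip
        return cache[key]
    db_hits += 1                # cache miss: fall through to the database
    value = database.get(key)
    cache[key] = value          # populate the cache for next time
    return value

get("user:42")   # first request: misses, goes to the database
get("user:42")   # second request: served straight from memory
print(db_hits)   # 1
```

The more RAM a cache server holds, the more objects stay resident and the fewer requests fall through to MySQL, which is why Facebook optimizes its memcached boxes for memory capacity rather than CPU.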
The most recent cache servers are AMD Magny-Cours based. Notice the 12 DIMM slots on each CPU node.

The database (of the inbox, for example) is not plain MySQL either. It is based on a distributed database management system called Cassandra. Cassandra is decentralized and scales horizontally, i.e. across cheap servers. This is in stark contrast with most relational databases, which only scale well vertically unless you pay huge premiums for complex solutions like Oracle RAC.
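Horizontal scaling works because each row is owned by one node, chosen by hashing its key onto a ring. Below is a minimal sketch of such a consistent-hashing ring, the general kind of partitioning scheme Cassandra uses; the class and node names are invented for illustration and omit real-world details like virtual nodes and replication.

```python
import bisect
import hashlib

def _hash(s):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring up to its hash position.
        self._positions = sorted(_hash(n) for n in nodes)
        self._nodes = {_hash(n): n for n in nodes}

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next node."""
        idx = bisect.bisect(self._positions, _hash(key)) % len(self._positions)
        return self._nodes[self._positions[idx]]

ring = Ring(["node-a", "node-b", "node-c"])
owner = ring.node_for("inbox:12345")   # same key always maps to the same node
```

Adding a fourth cheap server only moves the keys on one arc of the ring, which is what makes scaling out across commodity hardware practical, whereas scaling a single relational database means buying a bigger box.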


Below you will find the next three answers to some of the questions posed in our "Ask the Experts: Enterprise & Cloud Computing" blog post.
Q ("Gamoniac"): Cloud is the buzzword, but while it makes sense to outsource hardware resources using cloud computing, does it really make sense to add extra layers of complexity to the application?
A: Cloud computing does not necessarily add extra layers of complexity. Let me focus on the cloud computing I know best: IaaS. Running a VM on, for example, Terremark's Enterprise Cloud is no different from running one on your own VMware-based infrastructure. The only extra complexity comes from the fact that Terremark uses DRS (Dynamic Resource Scheduling). We have noticed that response times are sometimes relatively high even under medium traffic. This happens when your web traffic has increased quickly over the past few minutes: DRS moves a previously low-intensity website that has become much more active to a different resource pool.
But that is about it. Our own measurements have shown that in most cases, using IaaS is not that different from running your own virtualized server.
While I have less experience with Amazon EC2, I believe the same applies. An Amazon VM runs, after all, on a Xen hypervisor. The real magic is the load balancing that goes on in the Amazon datacenter, but that should be transparent to your application. I have to admit, though, that it gets a lot more complex for storage-intensive applications: it is a lot harder to guarantee response times for those applications.
Q ("Gamoniac"): Are the (cloud computing) savings justifiable when you are inadvertently creating a single point of failure?
A: One of the advantages of the better cloud vendors is that there is no single point of failure, unless you have only one connection to the internet. As I explained in the previous post, it is unwise to rely on only one internet connection.
Large cloud vendors make sure that every link in the datacenter chain is redundant, from power lines to switches and storage (check out Google's presentation here). It is very unlikely that the datacenter infrastructure of a typical SME can reach the same level of redundancy.
Q ("FireKingdom"): Assume a server with 4 CPUs. Virtual Machine 1 (VM1) runs on CPU 1, VM2 also runs on CPU 1, and VM3 runs on CPU 2. VM1 asks for data from VM2. VM2 asks its backup, VM3, to see if it has it. VM2 and VM3 are like name servers, so they trust each other, and VM1 trusts VM2 and VM3. Can VM3 send the data to VM1 without it leaving the machine and without involving the NIC?
A: I'll assume that you run VMware's ESX as hypervisor. Then the answer is relatively simple: yes, if both VMs are connected to the same virtual switch.
A VMware virtual switch (vSwitch) is a piece of software that is part of the ESX kernel. So if your VMs are connected to the same vSwitch, the network traffic never goes out on the wire and is not limited by the speed of the physical NIC. According to VMware's measurements, two Windows VMs on the same vSwitch can send data to each other at about 1.35 to 1.6 Gbps; two Linux VMs even reached 2.5 Gbps. This was measured on ESX 3.5; vSphere (ESX 4.1) might show even better results.

A vSwitch is a Layer 2 network device, so it cannot perform any routing.

If your VMs are on separate vSwitches, the traffic has to go through an external switch, or even a router (if the VMs are on different networks).
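The in-memory forwarding described above can be sketched as a toy model: a Layer 2 switch simply looks up the destination MAC address in a table of attached ports and hands the frame over directly. The class and MAC values are invented for illustration and have nothing to do with VMware's actual implementation; the point is only that delivery within one switch never touches a physical NIC.

```python
class VM:
    def __init__(self, mac):
        self.mac = mac
        self.inbox = []          # frames delivered to this VM

class VSwitch:
    """Toy Layer 2 switch: forwards frames by MAC, no routing."""
    def __init__(self):
        self.ports = {}          # MAC address -> attached VM

    def attach(self, vm):
        self.ports[vm.mac] = vm

    def send(self, dst_mac, payload):
        vm = self.ports.get(dst_mac)
        if vm is not None:               # destination on the same vSwitch:
            vm.inbox.append(payload)     # delivered in memory, NIC untouched
            return True
        return False                     # would need the physical uplink

switch = VSwitch()
vm1, vm3 = VM("aa:01"), VM("aa:03")
switch.attach(vm1)
switch.attach(vm3)

switch.send("aa:01", "requested data")   # VM3 -> VM1, same vSwitch
print(vm1.inbox)                         # ['requested data']
```

A frame addressed to a MAC the switch does not know (`send` returning `False` here) is exactly the case where real traffic would have to leave through the physical NIC toward an external switch or router.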
