As Raja stated in his New Year Resolution for Network Engineers post, virtualization is the new hotness (for very valid reasons), but with it come new challenges. The interaction of the virtualization and network layers tends to create new classes of problems that manifest in obscure and hard-to-track ways. More and more IT teams are running into this, and the more mature organizations are becoming mindful of it and coming up with new strategies to deal with it.
Let's give you a real-world example.
We were monitoring a customer's internal VLANs with our ExtraHop Application Delivery Assurance system. A cursory glance at the virtualized clusters showed a lot of potential issues. Take a quick look at the chart below. What jumps out at you? A lot of dropped segments and very high Retransmission Timeouts (RTOs).
Because all stats inside the ExtraHop Application Delivery Assurance system are symmetrical, it's easy to click through and drill down to the other side. So we clicked over to see who was experiencing all these delays and packet loss because of this virtual server, and it was the core DB server. Looking at the Round-Trip Times (RTTs), even though the median is quite reasonable at 2ms, the maximum is much more worrisome at 679ms.
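To see why a healthy median can hide this kind of problem, here is a minimal Python sketch. The function name `summarize_rtts` and the sample values are made up for illustration (they only echo the 2ms and 679ms figures above) and are not taken from the customer's data or from the ExtraHop system. It simply shows that a handful of stalled round trips barely move the median while dominating the tail.

```python
import statistics

def summarize_rtts(rtts_ms):
    """Return the median, nearest-rank 99th percentile, and max of RTT samples (ms)."""
    if not rtts_ms:
        raise ValueError("need at least one RTT sample")
    ordered = sorted(rtts_ms)
    n = len(ordered)
    # Nearest-rank p99: ceil(0.99 * n), computed with integer math to avoid float rounding.
    rank = (99 * n + 99) // 100
    return {
        "median": statistics.median(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }

# Synthetic samples (an assumption for illustration): mostly ~2ms, with a few long stalls.
samples = [2.0] * 95 + [1.0, 3.0, 150.0, 400.0, 679.0]
summary = summarize_rtts(samples)
print(summary)
```

With samples like these, the median stays at 2ms even though a few requests waited hundreds of milliseconds, which is why it pays to look at the maximum (or a high percentile) and not just the middle of the distribution.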
This issue was systemic to the entire VM cluster: every virtual machine was showing similar levels of packet loss. The customer's server team looked at the issue more closely, and it turned out that the VM servers were under-provisioned. This resulted in a lot of context switching, which manifested as "virtual packet loss". Re-provisioning and moving some of the less essential VMs out of this core cluster solved the problem handily.
Since then, we've seen a number of customers with the exact same issue. Moral of the story? Virtualization is a great technology, but you need different tools and strategies to manage the performance of applications running in a virtual environment.