Capacity Management is now a must-have capability for VMware Infrastructure, not a "nice to have" or "I'll do it later". VMware Infrastructure builds software mainframes, and mainframes have a lot of compute resource that must be well managed to:
Ensure the Enterprise receives ROI for their IT resources
Ensure capacity levels support established service level targets – Quality of Service
Ensure capacity is forecasted based on business events
The first step to efficient VI3 and effective capacity management is to bring together the VMware experts and Capacity Management experts: this workshop starts that process.
The outcome of this workshop is (a) a shared understanding between the VMware and capacity management teams, and (b) a list of design decisions and research items to make sure VI3 Capacity Management is fit for purpose.
This workshop is intended for VMware Certified Professionals (VCPs) and Capacity Management professionals.
Key Metrics for VI3
Increasing Efficiency
Intelligent Reporting
Intelligent Trending / Forecasting
Intelligent Modeling
ITIL Capacity Management
7-point action plan
1. Key Metrics for VI3
This part of the workshop explores which metrics to collect, monitor and alert on, and why.
Host / guest metric families
CPU %busy, %ready etc
Memory used/free, reclaimed, swapped etc
I/O rate and response times by disk
NIC packets in/out, data rate in/out
Datastore size/free/used (host only)
Logical disk size/free/used (guest only)
Resource pools
Definitions, limits, shares
CPU Available for VM Reservation
CPU Usage
Memory Available for VM reservation
Memory Usage
Is the Resource Pool expandable?
See Proven Practice - Resource Pool Capacity Management with Metron Athene
Clusters
Effective CPU Available for VMs
Effective Memory Available for VMs
Total Number of Hosts
Total VM Migrations
See Proven Practice: Capacity Management Reporting for VMware Clusters with Metron Athene
See Proven Practice - Creating a VMware Capacity Management Dashboard with Metron Athene
Guest
Ready Time
Ballooning Driver Activity
CPU & Memory Usage
Disk Occupancy
Storage
Total IOPS (I/Os per second)
Data Reads and Writes
Disk Capacity
Disk Freespace
Not a metric: review the placement of the disks within the enterprise and their access
Tool Requirements
Capacity data for VI3 is spread around multiple components, so you need an efficient and effective way to centralize this intelligently into a capacity management database. The tool to do this needs to have the following characteristics:
Data capture/collect/storage
Simple, consistent data retrieval
Scalable, accessible database
Auto-manage data (aggregate, copy, delete)
Automated reporting
How it was yesterday / last week / this year
How it might be if things carry on the same way
Create reports in HTML, Word, Excel, PDF,…
Hands-on tool needed
Capacity data lives on different levels and you need a tool to bring all the data together
2. Increasing Efficiency
Capacity management is about increasing the efficiency of your VI3 by providing the right resources to the right workloads at the right time - and NOT over-provisioning (hurts ROI), and NOT under-provisioning (hurts service levels).
Designing for efficiency
Capacity management influences the design and implementation of VI within the enterprise
Allocate VMs just 1 vCPU rather than multiple vCPUs by default - only use SMP (2 or 4 vCPUs) if you have data to back up that decision.
Transparent Page Sharing
Rogue and Under-Utilized VMs
Identify VMs that are idle (< 10% CPU?) and/or rogue (administrator created on a Friday afternoon, but it is not really needed so it can be turned off, archived and probably deleted).
CPU efficiency
See how inefficient 2 vCPUs can be in the following diagram:
Figure 1 - %READY time increase on a 2vCPU guest where there is contention for CPU
Figure 1 illustrates the potential for blocking work on a guest system that has more than one virtual CPU. Guest VM1GBVIF021 (the area graph) has two vCPUs allocated to it, but spends much more time ready to run yet unable to do so than VM1GBVAP199 (the bar graph), which has only 1 vCPU. This effect is magnified even further with a 4-way guest.
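Ready time is often reported as a millisecond total per sampling interval rather than as a percentage. The following is a minimal sketch of the conversion, assuming a 20-second sampling interval (an assumption; use whatever interval your collector reports). The function name and the sample values are hypothetical.

```python
def ready_percent(ready_ms: float, interval_s: float = 20.0, vcpus: int = 1) -> float:
    """Convert a CPU ready total in milliseconds to a percentage.

    interval_s: length of the sampling interval in seconds (assumed 20 s here).
    vcpus: divide by the vCPU count to get an average per-vCPU figure.
    """
    if interval_s <= 0 or vcpus < 1:
        raise ValueError("interval_s must be positive and vcpus at least 1")
    return ready_ms / (interval_s * 1000.0 * vcpus) * 100.0


# Hypothetical sample: 2,000 ms of ready time in one 20-second interval.
print(ready_percent(2000))           # treated as a 1 vCPU guest
print(ready_percent(2000, vcpus=2))  # same total, averaged over 2 vCPUs
```

Averaging over vCPUs makes guests of different sizes comparable, but a high total on a multi-vCPU guest is still worth investigating on its own.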
Memory efficiency
Figure 2 shows memory statistics from a single guest system using VMware's Transparent Page Sharing (TPS), which significantly improves the VM-to-host ratio by allowing a greater density of VMs on a single host. The first column shows that the guest has been granted 768 MB of memory.
Figure 2 - Guest memory sharing
The guest is using about 53 MB for its own private pages, which is pretty small. The ESX overhead to support this VM is about 70 MB of memory, again fairly small.
Now look at the memory shared between VMs: about 272 MB. Without Transparent Page Sharing, every VM on this host would need another 272 MB of real memory to be able to run.
With all this sharing going on, there is no pressure on real memory, as can be seen by the zeroes in the “Swapped Out” and “reclamation” columns.
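The arithmetic behind this can be sketched as follows. This is an idealised model, not how ESX accounts for memory: it assumes every VM on the host has the same profile as the guest in Figure 2 (about 53 MB private, 70 MB overhead, 272 MB shared) and that all VMs share the same set of pages, so the shared pages are stored once. The function names and the VM count are hypothetical.

```python
def host_memory_with_tps(vm_count: int, private_mb: float,
                         overhead_mb: float, shared_mb: float) -> float:
    """Real memory needed when the shared pages are stored only once."""
    return vm_count * (private_mb + overhead_mb) + shared_mb


def host_memory_without_tps(vm_count: int, private_mb: float,
                            overhead_mb: float, shared_mb: float) -> float:
    """Real memory needed when every VM keeps its own copy of the shared pages."""
    return vm_count * (private_mb + overhead_mb + shared_mb)


# Figures from the guest in Figure 2, applied to a hypothetical 10 VMs.
with_tps = host_memory_with_tps(10, 53, 70, 272)
without_tps = host_memory_without_tps(10, 53, 70, 272)
print(with_tps, without_tps, without_tps - with_tps)
```

In practice the amount shared varies by guest OS and workload, so treat the result as an upper bound on the saving.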
Rogue/Idle VMs
Figure 3 shows how you can identify guests that are idle or rogue by simply finding the ones that consistently use less than 10% of CPU - with this list, you can investigate whether to turn off, archive or isolate these VMs so they free up precious capacity.
Figure 3 - Identifying rogue/idle VMs
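The idle/rogue check can be sketched in a few lines. This is a minimal sketch, assuming CPU utilisation samples are already available per VM as percentages. The 10% threshold comes from the text above; treating "consistently" as 95% of samples is an assumption, and the VM names are hypothetical.

```python
from typing import Dict, List, Sequence


def find_idle_vms(cpu_samples: Dict[str, Sequence[float]],
                  threshold_pct: float = 10.0,
                  min_fraction: float = 0.95) -> List[str]:
    """Return VMs whose CPU usage is below threshold_pct in at least
    min_fraction of their samples. VMs with no samples are skipped."""
    idle = []
    for vm, samples in cpu_samples.items():
        if not samples:
            continue  # no data: cannot say whether the VM is idle
        below = sum(1 for s in samples if s < threshold_pct)
        if below / len(samples) >= min_fraction:
            idle.append(vm)
    return sorted(idle)


# Hypothetical data: one quiet VM, one busy VM, one with no samples.
samples = {
    "vm-quiet": [2.0, 3.5, 4.0, 5.0],
    "vm-busy": [5.0, 50.0, 60.0, 7.0],
    "vm-nodata": [],
}
print(find_idle_vms(samples))
```

The resulting list is a starting point for investigation, not an automatic shutdown list: a VM may be idle most of the month and busy at month end.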
3. Intelligent Reporting
The keys to intelligent reporting are:
Different schedules for different reports: Daily / Weekly / Monthly
Simple Dashboard
Use Web Portal Publishing instead of paper printouts (not just greener, but faster and easier)
Intelligent data that is easy to understand and quick to prepare
Resource Pool
Figure 4 shows a Resource Pool report.
Figure 4 - Resource Pool Report
A resource pool consists of an amount of CPU and Memory. Therefore reporting will only need to cover these 2 areas. As a resource pool is populated by VMs you may want to overlay the VM usage of the Resource Pool like we did previously with the ESX Host.
Monitor the utilisation of the resource pool against its Reservation and its Limit. If pools are consistently using more than their reservations, it may be time to reassess the settings you have.
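A check of pool usage against Reservation and Limit might look like the sketch below. It assumes usage samples, reservation and limit are all in the same unit (MHz here), and that "near the limit" means 90% of the limit; both the function name and that 90% figure are assumptions for illustration.

```python
from typing import Dict, Sequence


def assess_pool(usage: Sequence[float], reservation: float, limit: float,
                near_limit: float = 0.9) -> Dict[str, float]:
    """Summarise resource pool usage against its reservation and limit."""
    if not usage:
        raise ValueError("at least one usage sample is required")
    if limit <= 0:
        raise ValueError("limit must be positive")
    n = len(usage)
    return {
        # Fraction of samples where the pool used more than it has reserved.
        "over_reservation": sum(1 for u in usage if u > reservation) / n,
        # Fraction of samples at or above near_limit * limit.
        "near_limit": sum(1 for u in usage if u >= near_limit * limit) / n,
        # Peak usage as a percentage of the limit.
        "peak_pct_of_limit": 100.0 * max(usage) / limit,
    }


# Hypothetical pool: 500 MHz reserved, 1000 MHz limit.
print(assess_pool([400, 600, 700, 950], reservation=500, limit=1000))
```

A high `over_reservation` value suggests the reservation is set too low for the workload; a high `near_limit` value suggests the limit is constraining it.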
Cluster
Figure 5 shows a cluster dashboard.
Figure 5 - Cluster Dashboard
With a cluster we need to report all the same items we were looking at for the ESX Host. But as we get to these higher levels there is likely to be significantly more management interest in the “health” of the Virtual Infrastructure, so at these levels you might need to start including management-level reports.
Figure 6 shows a cluster report.
Figure 6 - Cluster Report
Host
Figure 7 shows a host report.
Figure 7 - Host Report
A host has the same hardware as a conventional physical server, so we need to monitor the same items: CPU, memory, disk and network cards, against agreed thresholds.
Figure 8 shows a host report, focusing on overhead.
Figure 8 - Host Report (Overhead)
While we need to monitor all the usual items, we can introduce some unique VMware data as a comparison. Here we can see the Pink area representing the observed CPU usage of the ESX Host. The stacked line graph in front represents the CPU usage of the ESX Host by the Guest VMs.
As you can see at the beginning of the graph some VMs generate significant overheads in ESX. In this case the workload in the “blue” VM (VM001) was almost entirely graphical.
Figure 9 shows a host report for memory.
Figure 9 - Host Memory Report
This chart illustrates both the usage of real memory in GB by an ESX host and the average memory percentage used by all the VMs it is currently supporting.
The area graph is the percentage of memory used by the VMs, and generally tracks the shape of the line graph, which provides detail of the amount of memory ESX is really using. The more variable nature of the ESX host memory line can be explained by the additional work ESX is performing on behalf of guests, which they do not “see”.
Guest
Figure 10 shows stacked VM CPU.
Figure 10 - Stacked vCPU
This chart illustrates the potential for blocking work on a guest system that has more than one virtual CPU. Guest VM1GBVIF021 (the area graph) has two vCPUs allocated to it, but spends much more time ready to run yet unable to do so than VM1GBVAP199 (the bar graph), which has only 1 vCPU.
This effect is magnified even further with a 4-way guest.
Alerting
We have covered metrics, monitoring and reports; now for the alerting approach:
Determine what to alert on
Determine how often to alert
Determine what reports to have available when an alert is received
Have a tool box of reports that you run when an alert is received
Review past reports to determine whether the alert is an anomaly or an indication of a future problem that requires action
4. Intelligent Trending / Forecasting
Figure 11 shows a threshold/trend alert.
Figure 11 - Threshold alerting
Forecasting
When will I run out of capacity?
Business data needed
Trending
Straight-line trends
Trends with a point-in-time increase (“dog-leg trend”)
Figure 12 shows a trend report that you can build alerts on using thresholds.
Figure 12 - Trends and Thresholds
Figure 13 shows a dog-leg trend report, which shows the likely effect of a 20% jump in CPU utilisation if more work were added to the cluster at a given date/time.
Figure 13 - Dog-leg Trend
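Both kinds of trend can be sketched with an ordinary least-squares line. This is a minimal sketch, assuming evenly spaced utilisation samples (weekly, say) and modelling the dog-leg as an additive jump of 20 percentage points at a chosen period. That reading of "20% jump" is an assumption, as are the function names and sample data.

```python
from typing import Optional, Sequence, Tuple


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares straight line through samples at x = 0, 1, 2, ...
    Returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2.0
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast(x: int, slope: float, intercept: float,
             jump_at: Optional[int] = None, jump: float = 0.0) -> float:
    """Trend value at period x, with an optional step increase (dog-leg)."""
    value = intercept + slope * x
    if jump_at is not None and x >= jump_at:
        value += jump
    return value


def first_breach(threshold: float, slope: float, intercept: float,
                 start: int, horizon: int,
                 jump_at: Optional[int] = None, jump: float = 0.0) -> Optional[int]:
    """First period in [start, start + horizon] at or above the threshold."""
    for x in range(start, start + horizon + 1):
        if forecast(x, slope, intercept, jump_at, jump) >= threshold:
            return x
    return None


# Hypothetical weekly CPU utilisation (%), with an 80% threshold.
slope, intercept = fit_line([40, 42, 44, 46, 48])
print(first_breach(80, slope, intercept, start=5, horizon=52))
print(first_breach(80, slope, intercept, start=5, horizon=52, jump_at=8, jump=20))
```

Comparing the two results shows how much reaction time a planned workload increase removes.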
5. Intelligent Modeling
Modeling is used to answer questions like: Where do I put the next VMs? Assuming all else is equal and only capacity differs between two clusters, which cluster is best for my next VM?
Cluster capacity attributes that might affect your decision include the capacity goals for each cluster (e.g. Cluster 1 <50% full for a Gold Standard, Cluster 2 <80% full for a Low Cost Standard).
Ideally, from modeling you want to get a suggestion “at a glance” without having to study reports.
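An "at a glance" suggestion of this kind can be sketched as follows. The sketch assumes a single capacity dimension (CPU in GHz) and picks the cluster with the most headroom against its own fill target after the new VM is placed. The 50% and 80% targets come from the example above; the class, function names, capacities and usage figures are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cluster:
    name: str
    capacity_ghz: float
    used_ghz: float
    target_fill: float  # e.g. 0.5 for "<50% full", 0.8 for "<80% full"


def headroom_after(cluster: Cluster, vm_demand_ghz: float) -> float:
    """Capacity left under the cluster's target once the VM is placed."""
    target_ghz = cluster.target_fill * cluster.capacity_ghz
    return target_ghz - (cluster.used_ghz + vm_demand_ghz)


def best_cluster(clusters: List[Cluster], vm_demand_ghz: float) -> Optional[Cluster]:
    """Cluster with the most remaining headroom, or None if no cluster
    can take the VM without breaching its target."""
    candidates = [c for c in clusters if headroom_after(c, vm_demand_ghz) >= 0]
    if not candidates:
        return None
    return max(candidates, key=lambda c: headroom_after(c, vm_demand_ghz))


# Hypothetical clusters using the Gold / Low Cost targets.
clusters = [
    Cluster("Cluster 1", capacity_ghz=100, used_ghz=45, target_fill=0.5),
    Cluster("Cluster 2", capacity_ghz=100, used_ghz=60, target_fill=0.8),
]
choice = best_cluster(clusters, vm_demand_ghz=2)
print(choice.name if choice else "no cluster has room")
```

A real model would apply the same test to memory and storage and take the tightest constraint.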
Modeling requirements:
More detailed than cluster level
Unit of planning = Host
Make workloads = VMs
Must be simple to set up and use
Must be able to incorporate business information or application data, if available
Typical Modeling scenario: Growing Virtual Workloads
You have a 4-CPU ESX Server currently running 5 virtual machines and your model requires that you grow all workloads by 90% over 10 quarters. Figure 14 shows the effect of this growth.
Figure 14 - Model growth at 90% over 10 quarters
You can then model what happens if you add more storage, in Figure 15.
Figure 15 - Modeling additional storage
You can then model the addition of more CPU, in Figure 16:
Figure 16 - Modeling additional CPU
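The growth scenario itself (90% over 10 quarters) reduces to simple arithmetic. The sketch below assumes the growth is compounded evenly across the quarters, which is one possible reading of the scenario. The starting utilisation of 45% and the 70% threshold are hypothetical, and this is not the model a capacity tool would use.

```python
from typing import List, Optional


def project(base: float, total_growth: float, periods: int) -> List[float]:
    """Utilisation at the end of each period (index 0 is today),
    compounding evenly so that the final value is base * (1 + total_growth)."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    factor = 1.0 + total_growth
    return [base * factor ** (q / periods) for q in range(periods + 1)]


def periods_until(base: float, total_growth: float, periods: int,
                  threshold: float) -> Optional[int]:
    """First period at which the projection reaches the threshold, if any."""
    for q, value in enumerate(project(base, total_growth, periods)):
        if value >= threshold:
            return q
    return None


# Hypothetical host at 45% CPU today, growing 90% over 10 quarters.
trajectory = project(45.0, 0.90, 10)
print([round(v, 1) for v in trajectory])
print(periods_until(45.0, 0.90, 10, threshold=70.0))
```

Re-running the projection with a lower base utilisation is a quick way to approximate the effect of adding CPU capacity to the host.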
6. ITIL Capacity Management
7. 7-point action plan
1. People
Have the VMware team talk with the Capacity team. Use this slide deck & VI:OPS.
Improve your knowledge of ITIL; VI is part of a larger entity. See VIOPS and Metron training webinars.
2. Tools
Automate laborious activities with a tool such as Athene. See the whole picture and make informed decisions. Fast payback.
3. Monitoring
Focus on the key metrics and create processes and reports around them.
4. Reporting
Set up reports shown in this presentation and meet regularly with Stakeholders to review.
5. Trending
Create charts of your capacity trends
Use this workshop and VI:OPS as a guide and automate it using a tool such as Athene
6. Modeling
Run scenarios on a regular basis. It’s easy to do with a tool. See when capacity and service levels are impacted. How long do you have to react and what decisions need to be made?
7. Improve
Use your new skills and knowledge to improve the efficiency of your infrastructure. Measure it using the KPIs in this presentation.
Capacity Management Best Practices
What everyone should be doing at a minimum:
Design
1 vCPU – avoid SMP contention, get more density (4x more VMs per Host)
Use Transparent Page Sharing, get more density (reduce memory needs for VMs)
Metric
%READY – highlight saturated hosts (over-dense)
Process
Track impact of idle + rogue VMs (remove resources not being used)
Capacity management hooked into release and change procedures
Resources
Author
Big thanks to Metron, who provided all of the expertise for this workshop. If you want to talk Capacity Management and VMware, they are your primary team to work with.
Reach out to the Metron team on VIOPS:
Reach out to the VIOPS team on Capacity Management: