Those of you who have been using VMware for a while will no doubt be
aware of the concept of "virtual server sprawl". This is partly caused by the ease with which
Virtual Machines can be created, and it means that, far from utilizing
your VMware infrastructure at an optimal level, you in fact start to re-create
the very issue that VMware was designed to solve, i.e. lots of small Windows
servers (albeit now virtualized) running at minimal utilization.
From a Capacity Management perspective this situation obviously needs
attention as, left unchecked, it may erode any spare capacity that has been purchased
and, in the worst case, force you to expand your infrastructure
unnecessarily. There may also be a
requirement to review the links to the Change Management process to
ensure all new additions to your VMware infrastructure are being properly
assessed.
The knock-on effects to the business may include:
increased expenditure caused by extra hardware, software licenses, power, floor space, disaster recovery and support
a potential reduction in server/service performance as the overall utilization increases
a devalued perception of the technology within the business
The underlying message is that whilst a VM may only be running at 5% or
10% utilization, the more of these "idle" VMs you have running, the more likely
it is you will start to suffer from performance and capacity issues. Whilst one or two of these VMs don't present
an issue, multiple occurrences can start to add up to entire physical
hosts being occupied unnecessarily.
The purpose of this paper is to discuss the concept of "recycling" VMs that are underutilized, and what this means to both the business and the Capacity Management process.
We will also discuss the practical steps required to quickly identify potential recycling candidates using Metron's Athene Capacity Management software.
Targeted at Capacity Management and Service Management professionals, this will also be of interest to VCPs.
1. Introduction
2. Operational Process
3. Using Athene
3.1 Create the appropriate APR structure
3.2 Creating the custom report
3.3 Create the bulletin
3.4 Create the schedule
4. Conclusion
Metron is a privately owned limited company which was founded in 1986. Metron-Athene Inc is a wholly owned subsidiary of Metron Technology Ltd. The company is Europe's foremost Capacity Planning and Systems Performance Management specialist. Metron's flagship product, Athene, provides fully integrated ITIL-compliant capacity management, automatic performance analysis and reporting for UNIX, Linux, Windows and Mainframe servers.
Robert Ford
You use this proven practice at your discretion. VMware and the author do not guarantee any results from the use of this proven practice. This proven practice is provided on an as-is basis and is for demonstration purposes only.
As discussed, the next section walks through the process that an Athene user would follow to determine viable recycling candidates. Whilst this document is built around the Athene tool, some aspects of the process (the same metrics, thresholds and time periods, for example) could be applied to a different toolset.
The basic premise is to review all the VMs within your estate (or a subset) and determine which of these machines are continuously running at below 10% CPU utilization. This review should be done automatically where possible and sampled over a period that represents “usual” operational activity.
We will use the custom reporting, threshold and HTML generation functions within Athene to generate a dynamic daily list of the Virtual Machines that could potentially be recycled. You probably wouldn’t act on the first list generated, but would instead use it as a baseline and, over perhaps a couple of weeks, look for machines that regularly appear.
Then as part of an operational process you would look to delete these VMs if appropriate or perhaps consolidate them into existing workloads.
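The premise above can also be sketched outside of any particular tool. The following minimal Python illustration is not Athene functionality: the sample data layout, the VM names and the function name `daily_candidates` are hypothetical, and the 10% figure is simply the one quoted above. It averages each VM's CPU utilization samples for a single day and lists the VMs that fall below the threshold.

```python
from statistics import mean

# Hypothetical input: CPU utilization samples (percent) per VM for one day.
SAMPLES = {
    "vm-web-01": [35.0, 42.5, 28.0, 31.5],
    "vm-test-07": [3.0, 4.5, 2.5, 6.0],
    "vm-old-app": [9.0, 8.5, 7.5, 9.0],
}


def daily_candidates(samples, threshold_pct=10.0):
    """Return sorted names of VMs whose daily average CPU is below threshold_pct."""
    return sorted(
        name
        for name, values in samples.items()
        if values and mean(values) < threshold_pct  # VMs with no samples are skipped
    )


if __name__ == "__main__":
    print(daily_candidates(SAMPLES))
```

A single day's list produced this way is only a baseline; it becomes useful once several days of lists are available for comparison.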
Ultimately the resulting report will be produced in HTML and can be added to your existing APR (Automatic Reporting) structure. Figure 1 shows the extra "Overview" category called "Under utilized VMs" that I have added to this example APR structure. It will be populated with a report list detailing all the VMs that have breached the threshold.
Having created the placeholder for the resulting report, we now need to create the appropriate report template.
As discussed previously, the purpose of this exercise is to determine which of our VMs are consistently running at minimal CPU utilization, so we need a report that shows this. Figure 2 shows a graph created with Athene's "Define a Report" application detailing the "Percentage of CPU Time Utilization" for a particular VM. Warning and Alarm thresholds have been added at 8% and 5% respectively.
Figure 2 - Define a Report
This graph will then be reproduced for each of the selected VMs at daily intervals.
Now that we have created the report, we need to create the associated bulletin. This will allow us to manage the sampling time and alerting thresholds, and to replicate the report for each VM.
Figure 3 - Add Report
Figure 3 shows the "Add Reports" panel which is accessed via the "Define a Bulletin" application. Here we are selecting the newly created report for CPU Time Utilization per VM.
Once the report has been added, we can edit its properties. To produce this bulletin we will need to edit the "General" tab to select the required targets, the "Settings" tab to alter the thresholds and the "Actions" tab to ensure that the report is only generated when the thresholds are breached.
Figure 4 - Select target VMs
Figure 4 shows the selection of Virtual Machines. In this example the "All VMware VM" standard grouping has been chosen. This will include all VMs currently monitored by Athene. This is fine for smaller installations but should be used carefully if you have a large number of monitored VMs, as the time taken and resources used to generate the report could prove prohibitive.
For larger installations it may be worth splitting VMs into specific service or operational groups to reduce the generation time and the load on the Athene database. You can easily do this with the "Groups" application.
We will skip the "Analysis" tab, as the default of looking back one day will suffice, and move on to the "Settings" tab.
Figure 5 - Bulletin Settings
Figure 5 details the threshold changes that are required. Remember that for this report we need a reverse threshold, i.e. we want to alert when the CPU usage is below the thresholds, so the Warning and Alarm thresholds have been set to 8% and 5% respectively. We are now effectively saying that if the average CPU utilization across the day remains above 8% we aren't interested.
The remaining settings can be left at their defaults.
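To make the reverse threshold logic explicit, here is a small Python sketch. It is an illustration only and does not reflect how Athene evaluates thresholds internally; the function name `classify_cpu` and the treatment of a value that sits exactly on a threshold (not counted as a breach) are assumptions.

```python
WARNING_PCT = 8.0  # Warning threshold used in this example bulletin
ALARM_PCT = 5.0    # Alarm threshold used in this example bulletin


def classify_cpu(avg_cpu_pct):
    """Classify a daily average CPU utilization against reverse thresholds.

    With a reverse threshold a breach occurs when utilization falls BELOW
    the threshold. Assumption: a value exactly equal to a threshold is not
    treated as a breach.
    """
    if avg_cpu_pct < ALARM_PCT:
        return "alarm"
    if avg_cpu_pct < WARNING_PCT:
        return "warning"
    return None  # busy enough that we are not interested


if __name__ == "__main__":
    for value in (3.2, 6.5, 12.0):
        print(value, classify_cpu(value))
```

Checking the Alarm threshold first matters here, because any value below 5% is also below 8% and would otherwise only ever be reported as a warning.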
Figure 6 - Actions
Figure 6 shows the configuration of the report actions. The only thing we need to change here is the "Production Options" from its default of "Always produce this report" to "On any breach of Threshold or Variability". This means that we will only generate a report for the VMs that have breached our thresholds, which filters out any VMs that are being actively used and so aren't recycling candidates. We now click on "OK" and save the bulletin, and we are ready to start configuring a schedule to run the bulletin.
Figure 7 shows the "Reports" tab with the new bulletin added and the dispatch set to a file save into the APR location that we created initially. We have also added the standard AXM641 post processing task to update the APR structure once the schedule has completed. In this instance we've created a new schedule, but assuming the recurrence times are appropriate there is nothing to stop you from adding this bulletin to an existing schedule.
Figure 7 - The Schedule
Now that we have created all the required components, we can run the schedule and then review the report.
The resulting web-based report shown in Figure 8 gives you a list of all the VMs that are running below our preset thresholds and that may be candidates for recycling.
Figure 8 - The Report
This is obviously only a daily snapshot. Ideally we would run this schedule for a week or more during a period of usual business activity (excluding weekends if appropriate) and compare the daily lists to determine which VMs regularly appear. At that point the candidate VMs have been identified, but this is only the first part of the review process. It is likely that the following steps will also be required:
Analyze the operational significance of the VM; it may be that it is key and cannot be recycled
Prior to consolidation or deletion, a review should be conducted through the Change Management process
Ensure that the thresholds used are appropriate to your environment; those used here are based on our experience and represent the approximate usage of a VM that is running nothing more than an idle Windows installation
It is likely that this process will need to be repeated at regular intervals to ensure that the results captured are representative.
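The comparison of daily lists described above could be automated along the following lines. This is a hedged Python sketch rather than an Athene feature: the `WEEK` data, the VM names, the function name `regular_candidates` and the 80% cut-off used to define "regularly appear" are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical daily lists of VMs that breached the thresholds,
# covering Monday 2 June 2025 to Sunday 8 June 2025.
WEEK = {
    date(2025, 6, 2): ["vm-test-07", "vm-old-app"],
    date(2025, 6, 3): ["vm-test-07"],
    date(2025, 6, 4): ["vm-test-07", "vm-old-app"],
    date(2025, 6, 5): ["vm-test-07", "vm-old-app"],
    date(2025, 6, 6): ["vm-test-07", "vm-batch-02"],
    date(2025, 6, 7): ["vm-test-07", "vm-web-01"],
    date(2025, 6, 8): ["vm-web-01"],
}


def regular_candidates(daily_lists, min_fraction=0.8, skip_weekends=True):
    """Return VMs present in at least min_fraction of the daily lists considered."""
    days = [
        day
        for day in daily_lists
        if not (skip_weekends and day.weekday() >= 5)  # 5 = Saturday, 6 = Sunday
    ]
    if not days:
        return []
    # Count each VM at most once per day, even if a list repeats a name.
    counts = Counter(vm for day in days for vm in set(daily_lists[day]))
    return sorted(vm for vm, n in counts.items() if n / len(days) >= min_fraction)


if __name__ == "__main__":
    print(regular_candidates(WEEK))
```

The cut-off is deliberately a parameter: a stricter value reduces the risk of recycling a VM that is only quiet on some days, at the cost of a shorter candidate list.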
Should you have any questions regarding this document, any of the broader topics discussed or the services Metron provide please feel free to contact me on
Email: rob.ford@metron.co.uk
Web: www.metron.co.uk