This document is currently being drafted by the VIOPS community and is constantly changing
(if you want to help shape this document, send a private message to Steve Chambers)
This document provides the material, references and structure for a VI3 Event Management workshop.
The purpose of the workshop is to bring together VMware Certified Professionals (VCP), Operational and Systems Management professionals to share understanding of VI3 and Event/Systems Management so that a monitoring system can be jointly developed for VI3.
The goal of the workshop is to use a whiteboard diagram of non-virtual and VI3 managed objects to identify the events and integrations that need to be designed to integrate VI3 with your event management system.
This document is not intended to be a treatise or thesis on event management for VI3 - it focuses on the pieces that are absolutely necessary for monitoring in the VI3.Blueprint, and can of course be improved and expanded to suit more complex requirements.
This workshop is one of the VI3.Blueprint activities and is essential to create a fit-for-purpose VI3 solution.
VMware Certified Professionals (VCP), Systems Management professionals, and Operational professionals.
The outline of this document should form the agenda of the VI3 Event Management workshop.
Your Event Management System
Map the Transient/Operational Events
Map the Trend/Planning Events
ITIL v3 Areas Most Affected by Virtualization
VI3 Event Management Workshop |
|---|
The first part of the workshop is to develop a common understanding of VI3 and event management.
This is essential to bring the VCPs, the IT Ops staff and the systems management professionals to a common understanding before planning the VI3 Event Management solution and feeding into the next stage, design.
This is only a discovery workshop, so do not expect a fully working solution at the end of it: be happy with a team that shares the same big picture and is focused on the next steps of going deeper to design and implement the final solution.
1 Your Event Management System |
|---|
The first step of the workshop is to draw up, side by side, the non-virtual event management system on the left and the virtualization event management system on the right. The second part of this workshop then connects the two.
You are aiming to have a diagram like this (also attached to this document as PNG and Visio files).
Figure 1 - Your Event Management System
As well as creating your own representation like this diagram, you are also trying to produce a table of Managed Object types.
From the left-hand, non-virtual side you should be able to build up a list of managed objects - a sample:
Non-VI3 Managed Object Type | Monitoring Type |
|---|---|
Intranet Web Instance | Stateful (up/down) and performance (<2s response) |
Apache process | Stateful (up/down) |
Linux Server | Stateful (up/down) |
Storage Metalun | Stateful (available/unavailable) and capacity (warning>70, critical>80) |
Network switch port | Stateful (available/unavailable) |
Action: Complete the above list with your team.
You should also be able to do the same for the right-hand, VMware Infrastructure side - a sample:
VI3 Managed Object Type | Monitoring Type |
|---|---|
vCenter Application | Stateful (up/down) |
vCenter Database | Stateful (up/down), performance (tpsping), and capacity (warning>70, critical>80) |
ESX pNIC | Stateful (up/down) |
ESX HBA | Stateful (up/dead/down) |
VM | Stateful (up/down/resetting/powerdown/archived/deleted) |
Action: Complete the above list with your team.
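As the team fills in the two tables, it can help to capture the rows in a machine-readable form. The following is a minimal sketch, assuming a hypothetical `ManagedObjectType` record and reading the `warning>70, critical>80` capacity thresholds as percent used. None of these names come from VI3 or from any monitoring product.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ManagedObjectType:
    """One row of a managed-object table (hypothetical structure)."""
    name: str
    side: str                                   # "non-virtual" or "vi3"
    states: Tuple[str, ...]                     # allowed stateful values
    capacity_warning: Optional[float] = None    # assumed: percent used
    capacity_critical: Optional[float] = None   # assumed: percent used

    def capacity_severity(self, percent_used: float) -> str:
        """Classify a capacity reading against this row's thresholds."""
        if self.capacity_critical is not None and percent_used > self.capacity_critical:
            return "critical"
        if self.capacity_warning is not None and percent_used > self.capacity_warning:
            return "warning"
        return "ok"


# Sample rows taken from the two tables above.
INVENTORY = [
    ManagedObjectType("Linux Server", "non-virtual", ("up", "down")),
    ManagedObjectType("Storage Metalun", "non-virtual",
                      ("available", "unavailable"), 70, 80),
    ManagedObjectType("ESX HBA", "vi3", ("up", "dead", "down")),
    ManagedObjectType("vCenter Database", "vi3", ("up", "down"), 70, 80),
]

if __name__ == "__main__":
    metalun = INVENTORY[1]
    for reading in (65, 75, 85):
        print(metalun.name, reading, metalun.capacity_severity(reading))
```

Rows without capacity thresholds, such as the Linux Server, always classify as `ok`, so purely stateful objects and capacity-monitored objects can share one list.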
2 Map the Transient/Operational Events |
|---|
A transient alert signals a period of deviation from normal thresholds which may require the attention of operational staff. Brief or isolated threshold crossings should be filtered out before an alert is passed along to operational staff. Such events could include unusually high guest activity due to an application or operating system fault, virus activity, or an unusual guest action such as batch file compression or conversions.
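The filtering of brief or isolated threshold crossings described above could be sketched as follows. This is an illustration only: the function name `sustained_breaches`, the sample format and the numbers in the usage example are assumptions, not part of VI3 or any event management product.

```python
from typing import Iterable, List, Tuple


def sustained_breaches(samples: Iterable[Tuple[float, float]],
                       threshold: float,
                       min_duration: float) -> List[Tuple[float, float]]:
    """Return (start, end) windows where the value stayed at or above
    `threshold` for at least `min_duration` seconds.

    `samples` are (timestamp_seconds, value) pairs in ascending time order.
    Shorter or isolated crossings are dropped.
    """
    windows = []
    start = None  # timestamp of the first sample in the current breach
    last = None   # timestamp of the latest sample in the current breach
    for ts, value in samples:
        if value >= threshold:
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and last - start >= min_duration:
                windows.append((start, last))
            start = None
    # Close a breach that is still open at the end of the data.
    if start is not None and last - start >= min_duration:
        windows.append((start, last))
    return windows


if __name__ == "__main__":
    # Hypothetical CPU readings (percent) taken every 20 seconds.
    readings = [(0, 50), (20, 90), (40, 60), (60, 88),
                (80, 91), (100, 95), (120, 70)]
    print(sustained_breaches(readings, threshold=85, min_duration=40))
```

In the usage example the single spike at 20 seconds is ignored, while the run from 60 to 100 seconds is reported as one window that could be turned into an alert.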
Some causes of these alerts may require operational staff to make adjustments to the VMware Infrastructure including initiation of troubleshooting procedures. Operational responses can be routed to appropriate teams based on the Managed Object (e.g. ESX Host) which has triggered the alert.
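Routing by Managed Object can be as simple as a lookup table. The sketch below assumes alerts arrive as dictionaries with a `managed_object` field; the team names and the `route_alert` helper are hypothetical placeholders to be replaced with your own organization's queues.

```python
# Hypothetical mapping from Managed Object to the responsible team.
ROUTES = {
    "esx host": "vmware-ops",
    "vcenter": "vmware-ops",
    "enterprise network monitoring tool": "network-ops",
    "server diagnostic interface": "hardware-support",
}

# Fallback queue for Managed Objects nobody has claimed yet (assumption).
DEFAULT_TEAM = "service-desk"


def route_alert(alert: dict) -> str:
    """Return the team that should receive the alert."""
    managed_object = str(alert.get("managed_object", "")).strip().lower()
    return ROUTES.get(managed_object, DEFAULT_TEAM)


if __name__ == "__main__":
    alert = {"name": "ESX Service Console memory swap usage",
             "managed_object": "ESX Host"}
    print(route_alert(alert))
```

Keeping the mapping in data rather than in code means the workshop output (the list of Managed Objects and their owners) can be loaded directly once the teams have agreed on it.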
Note: the table below is not an exhaustive list because the list changes with releases - check the documentation for your release
Name | Managed Object | Definition | Purpose |
|---|---|---|---|
ESX Host Connectivity | Enterprise Network Monitoring Tool | Alerts on loss of ESX Host interface availability by pings sent to each IP address assigned to the Host. (Example: Alert when host is not available for more than 10 seconds) | Provides simple network diagnostics to indicate connectivity issue of ESX Hosts |
Guest Connectivity | Enterprise Network Monitoring Tool | Alerts on loss of Guest network interface availability by pings sent to Guest IP Address. (Example: Alert when Guest is not available for more than 10 seconds) | Provides simple network diagnostics to indicate connectivity issue with a virtual machine. |
Essential Service Availability | Enterprise Network Monitoring Tool | Checks availability of essential network services throughout VMware Infrastructure. Example: Alert when services are not available on standard VI ports | Used to determine when network or service availability is compromised. Services may need to be restarted or network problems addressed. |
Service state change: • sshd, vpxa, httpd • vmware-vpxa • vmware-hostd • vmware-served • vmware-snmpd | ESX host | Checks availability of essential ESX host services. | These services are required to manage an ESX host. |
ESX Service Console memory swap usage | ESX host | Alerts when service console memory swap usage on an ESX host reaches a predefined level. | Used to determine if the ESX Service Console has sufficient RAM dedicated to it. |
Low Datastore Disk Space | vCenter / ESX Host | Alerts when remaining disk space on a datastore is less than a target value. Example: Alert when less than 2 GB of datastore space remains | Used to determine when data storage is running low and requires reorganization of virtual machines via Storage vMotion |
ESX Server CPU Utilization | vCenter | Alerts when CPU usage on an ESX host reaches a predefined level. Recommended: >= 85% | Used to determine when workloads may need to be reorganized via vMotion. |
ESX Server Memory Utilization | vCenter | Alerts when Memory usage on an ESX host reaches a predefined level. Recommended: >= 90% | Used to determine when workloads may need to be reorganized via vMotion. |
Virtual Machine CPU utilization | vCenter | Alerts when virtual machine CPU usage reaches a predefined level. Recommended: >= 95% | Used to determine when a virtual machine may need additional CPU resources. |
Virtual Machine Memory utilization | vCenter | Alerts when virtual machine memory usage reaches a predefined level. Recommended: >= 95% | Used to determine when a virtual machine may need additional memory resources. |
Virtual Machine CPU Ready | vCenter | Alerts when a virtual machine’s CPU ready value reaches a predefined level. Recommended: >= 15% (sustained) | Used to determine possible contention within an ESX host. |
Virtual Machine Disk I/O | vCenter | Alerts when a virtual machine’s disk I/O utilization consistently surpasses a predefined level. | Used to determine when virtual machines may need storage paths and/or LUN placement re-balanced. |
Virtual Machine Network I/O | vCenter | Alerts when a virtual machine’s network I/O utilization consistently surpasses a predefined level. | Used to determine when virtual machines may need additional network adaptors. |
HA: Host Failure | vCenter HA | Alerts when vCenter detects a host failure. | Availability may be compromised. |
HA: Host Isolation | vCenter HA | Alerts when vCenter detects a host is isolated. | Availability may be compromised. |
ESX Server Hardware Alert | Server Diagnostic Interface | Alerts on physical problems with an ESX server. Alerts are dependent on the particular server hardware monitoring system. | Used to diagnose and respond to server hardware failures. |
Security Events: Bad Root Logon | ESX host | Alerts on attempted logons as root using the wrong password. | Indicates a possible unauthorized access attempt. |
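The CPU Ready row in the table gives its threshold as a percentage, while vCenter performance data commonly reports CPU ready as a millisecond summation per sampling interval. The sketch below shows one way to convert between the two and to test for a sustained breach. It assumes a 20-second sampling interval in the usage example and treats "sustained" as three consecutive samples; both are assumptions to be checked against the documentation for your release, and the function names are illustrative.

```python
from typing import Iterable


def cpu_ready_percent(ready_ms: float, interval_seconds: float) -> float:
    """Convert a CPU ready summation (ms) into percent of the interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return ready_ms / (interval_seconds * 1000.0) * 100.0


def ready_is_sustained(ready_ms_samples: Iterable[float],
                       interval_seconds: float,
                       threshold_percent: float = 15.0,
                       consecutive: int = 3) -> bool:
    """True if `consecutive` samples in a row reach the threshold.

    The default of 15% follows the table above; `consecutive=3` is an
    assumed definition of "sustained".
    """
    run = 0
    for ready_ms in ready_ms_samples:
        if cpu_ready_percent(ready_ms, interval_seconds) >= threshold_percent:
            run += 1
            if run >= consecutive:
                return True
        else:
            run = 0
    return False


if __name__ == "__main__":
    # Hypothetical ready-time samples in milliseconds, 20-second intervals.
    samples = [1000, 4000, 4000, 4000]
    print(cpu_ready_percent(4000, 20))
    print(ready_is_sustained(samples, interval_seconds=20))
```

For a virtual machine with several vCPUs the reported value may be a sum across vCPUs, so a per-vCPU figure would need a further division; that adjustment is left out here.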
3 Map the Trend/Planning Events |
|---|
A long-term alert signals a trigger point for expansion within the virtual environment. Expansion triggers originating from an ESX Host may signify a lack of overall resources in the cluster and can initiate the expansion process to add another node to the cluster. Cluster-initiated alerts are clear signals to initiate expansion procedures including the addition of nodes to the cluster.
All long-term monitoring alerts need to be routed to the VMware Center of Excellence so they can be evaluated and expansion procedures initiated. Detailed discussions of the recommended capacity management process are found in the section entitled Capacity Management in VMware Infrastructure Service Design.
Note: the table below is not an exhaustive list because the list changes with releases - check the documentation for your release
Name | Managed Object | Definition | Purpose |
|---|---|---|---|
High CPU Usage | vCenter | Average CPU usage at or above a threshold over a period of time. (Example: CPU above 90% over 7 days) | Trigger event to notify operations of virtual machine CPU usage |
High Memory Usage | vCenter | Average memory usage at or above a threshold over a period of time. (Example: Memory usage above 70% over 7 days) | Trigger event to notify operations of performance impact; should also send notification to the virtual machine owner that memory usage is high. |
Low Disk Space | Guest Operating System | Remaining disk space below a threshold. (Example: Less than 1 GB of space remaining on C:) | Notification event to operations and virtual machine owner that an increase in storage allocation is required. |
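A trend event such as "CPU above 90% over 7 days" can be evaluated from daily averages. The following is a minimal sketch under that assumption; the `trend_alert` function and the sample figures are hypothetical, and in practice the daily values would come from whatever historical statistics your tooling retains.

```python
from statistics import mean
from typing import Sequence


def trend_alert(daily_averages: Sequence[float],
                threshold: float,
                window_days: int = 7) -> bool:
    """True if the mean of the most recent `window_days` daily averages
    is at or above `threshold`.

    Returns False when there is not yet a full window of data, so a
    newly deployed object does not raise a planning event too early.
    """
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    if len(daily_averages) < window_days:
        return False
    return mean(daily_averages[-window_days:]) >= threshold


if __name__ == "__main__":
    # Hypothetical daily average CPU usage (percent) for one week.
    cpu_week = [92, 95, 91, 90, 93, 94, 96]
    print(trend_alert(cpu_week, threshold=90))
```

Because the check uses an average over the whole window, a single quiet day does not clear the event and a single busy day does not raise it, which matches the long-term intent of these alerts.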
4 ITIL v3 Areas Most Affected by Virtualization |
|---|
ITIL drives business value by helping IT organizations standardize processes, share common terminology, and provide integrated IT service management across the IT organization. VMware virtual infrastructure enhances these processes.
Financial and Service Portfolio Mgmt
• The cost model changes from a fixed-cost basis in the physical environment to a charge-for-what-you-use model in a shared virtual environment.
Availability, Capacity and Continuity Mgmt
• Higher utilization on the same asset and a better ability to manage capacity across a larger pool of shared resources (instead of simply increasing capacity).
• Availability and continuity can be achieved with less cost and complexity.
Change, Configuration and Release Mgmt
• VMotion allows movement of VMs across physical resources.
• Increased flexibility to change, as well as more ways to control change.
Event, Incident, Problem and IT Operations Mgmt
• Eliminate agents while improving visibility for incident detection and problem root-cause analysis.
• Greater automation in incident response and better ways to minimize problem impact.
Resources
Authors
Disclaimer
TBD |