VI3 Event Management Workshop

VERSION 2 Published

Created on: Feb 23, 2009 9:20 AM by Steve Chambers - Last Modified:  Feb 26, 2009 1:44 PM by Steve Chambers

Introduction

 

This document is currently being drafted by the VIOPS community and is constantly changing

(if you want to help shape this document, send a private message to Steve Chambers)

 

This document provides the material, references and structure for a VI3 Event Management workshop.

 

The purpose of the workshop is to bring together VMware Certified Professionals (VCPs) and Operational and Systems Management professionals to share their understanding of VI3 and Event/Systems Management, so that a monitoring system can be jointly developed for VI3.

 

The goal of the workshop is to use a whiteboard diagram of non-virtual and VI3 managed objects to identify the events and integrations that need to be designed to integrate VI3 with your event management system.

 

This document is not intended to be a treatise or thesis on event management for VI3 - it focuses on the pieces that are absolutely necessary for monitoring under the VI3.Blueprint, and can be improved and expanded to suit more complex requirements.

 

This workshop is one of the VI3.Blueprint activities and is essential to create a fit-for-purpose VI3 solution.

 

Intended Audience

VMware Certified Professionals (VCPs), and Systems Management and Operational professionals.

 

Outline

The outline of this document should form the agenda of the VI3 Event Management workshop.

 

  1. Your Event Management System

  2. Map the Transient/Operational Events

  3. Map the Trend/Planning Events

  4. ITIL v3 Areas Most Affected by Virtualization

 

VI3 Event Management Workshop

 


 

The first part of the workshop is to develop a common understanding of VI3 and event management.

 

This is essential to bring the VCPs, the IT Ops staff and the systems management professionals to a common understanding before planning the VI3 Event Management solution and feeding into the next stage, design.

 

This is only a discovery workshop so do not expect a fully working solution at the end of this workshop: be happy with a team who understand the same big picture and are focused on next steps of going deeper to design and implement the final solution.

 

1 Your Event Management System

 

The first step of the workshop is to draw up, side by side, the non-virtual event management system on the left, and the virtualization event management system on the right - then the second part of this workshop connects the two.

 

You are aiming to have a diagram like this (also attached to this document in PNG and Visio formats).

 

VI3 Event Management Whiteboard.png

 

Figure 1 - Your Event Management System

 

As well as creating your own representation like this diagram, you are also trying to produce a table of Managed Object types.

 

From the left-hand, non-virtual side you should be able to build up a list of managed objects - a sample:

 

|Non-VI3 Managed Object Type |Monitoring Type |
|Intranet Web Instance |Stateful (up/down) and performance (<2s response) |
|Apache process |Stateful (up/down) |
|Linux Server |Stateful (up/down) |
|Storage Metalun |Stateful (available/unavailable) and capacity (warning>70, critical>80) |
|Network switch port |Stateful (available/unavailable) |

 

Action: Complete the above list with your team.

 

You should also be able to do the same for the right-hand, VMware Infrastructure side - a sample:

 

|VI3 Managed Object Type |Monitoring Type |
|vCenter Application |Stateful (up/down) |
|vCenter Database |Stateful (up/down), performance (tpsping), and capacity (warning>70, critical>80) |
|ESX pNIC |Stateful (up/down) |
|ESX HBA |Stateful (up/dead/down) |
|VM |Stateful (up/down/resetting/powerdown/archived/deleted) |

 

Action: Complete the above list with your team.
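
One way to capture the two lists during the workshop is as a small machine-readable catalogue. The sketch below is illustrative only: the class and field names are hypothetical, the entries are copied from the sample rows above, and the capacity thresholds are assumed to be percentages.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ManagedObjectType:
    """One row of a managed-object list (hypothetical structure)."""
    name: str
    side: str                                   # "non-VI3" or "VI3"
    states: Tuple[str, ...]                     # valid stateful values
    capacity_warning: Optional[float] = None    # assumed to be a percentage
    capacity_critical: Optional[float] = None   # assumed to be a percentage

    def capacity_severity(self, used_percent: float) -> str:
        """Classify a capacity reading against this type's thresholds."""
        if self.capacity_critical is not None and used_percent > self.capacity_critical:
            return "critical"
        if self.capacity_warning is not None and used_percent > self.capacity_warning:
            return "warning"
        return "ok"


# Sample entries taken from the two lists; extend these with your team.
CATALOGUE = [
    ManagedObjectType("Linux Server", "non-VI3", ("up", "down")),
    ManagedObjectType("Storage Metalun", "non-VI3", ("available", "unavailable"), 70, 80),
    ManagedObjectType("vCenter Database", "VI3", ("up", "down"), 70, 80),
    ManagedObjectType("ESX HBA", "VI3", ("up", "dead", "down")),
]

metalun = CATALOGUE[1]
print(metalun.capacity_severity(75))  # prints "warning"
```

Keeping both sides in one structure makes it easier to spot managed objects that have no equivalent on the other side of the whiteboard.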

 

2 Map the Transient/Operational Events

 

A transient alert signals a period of deviation from normal thresholds which may require the attention of operational staff. Brief or isolated threshold crossings should be filtered out before an alert is passed along to operational staff. Such events could include unusually high guest activity due to an application or operating system fault, virus activity, or an unusual guest action such as batch file compression or conversion.
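
The filtering of brief or isolated crossings can be sketched as follows. This is a minimal illustration, not a feature of any particular monitoring product: the function name, the sample format and the 60-second minimum duration are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def sustained_breaches(samples: Iterable[Sample],
                       threshold: float,
                       min_duration: float) -> List[Tuple[float, float]]:
    """Return (start, end) intervals where the value stayed at or above
    the threshold for at least min_duration seconds.
    Shorter or isolated crossings are dropped."""
    intervals = []
    start = None
    last = None
    for ts, value in samples:
        if value >= threshold:
            if start is None:
                start = ts          # a new crossing begins
            last = ts               # most recent sample still in breach
        else:
            if start is not None and last - start >= min_duration:
                intervals.append((start, last))
            start = None            # crossing ended; reset
    # Handle a breach that is still in progress at the end of the data.
    if start is not None and last - start >= min_duration:
        intervals.append((start, last))
    return intervals


# One sample every 20 seconds: an isolated spike at t=20, then a
# sustained breach from t=60 to t=120.
values = [50, 90, 60, 88, 91, 95, 92, 70]
samples = list(zip(range(0, 160, 20), values))
print(sustained_breaches(samples, threshold=85, min_duration=60))  # prints [(60, 120)]
```

Only the sustained interval would be forwarded to operational staff; the single spike at t=20 is discarded.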

 

Some causes of these alerts may require operational staff to make adjustments to the VMware Infrastructure including initiation of troubleshooting procedures. Operational responses can be routed to appropriate teams based on the Managed Object (e.g. ESX Host) which has triggered the alert.
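
Routing by Managed Object can be as simple as a lookup table. The sketch below assumes an alert is represented as a dictionary with a `managed_object` key; the team names and the default route are hypothetical placeholders to be replaced with your own organization's teams.

```python
# Hypothetical mapping from Managed Object type to responsible team.
ROUTES = {
    "ESX Host": "virtualization-team",
    "Datastore": "storage-team",
    "Network switch port": "network-team",
}
DEFAULT_TEAM = "operations-desk"  # assumed catch-all for unmapped objects


def route_alert(alert: dict) -> str:
    """Return the team that should receive this alert."""
    return ROUTES.get(alert["managed_object"], DEFAULT_TEAM)


alert = {"name": "ESX Server CPU Utilization", "managed_object": "ESX Host"}
print(route_alert(alert))  # prints "virtualization-team"
```

A default route keeps alerts from unmapped Managed Objects from being silently dropped.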

 

Note: the table below is not an exhaustive list because the list changes with releases - check the documentation for your release

 

|Name |Managed Object |Definition |Purpose |
|ESX Host Connectivity |Enterprise Network Monitoring Tool |Alerts on loss of ESX Host interface availability, detected by pings sent to each IP address assigned to the Host. (Example: Alert when the host is not available for more than 10 seconds) |Provides simple network diagnostics to indicate connectivity issues with ESX Hosts. |
|Guest Connectivity |Enterprise Network Monitoring Tool |Alerts on loss of Guest network interface availability, detected by pings sent to the Guest IP address. (Example: Alert when the Guest is not available for more than 10 seconds) |Provides simple network diagnostics to indicate connectivity issues with a virtual machine. |
|Essential Service Availability |Enterprise Network Monitoring Tool |Checks availability of essential network services throughout VMware Infrastructure. (Example: Alert when services are not available on standard VI ports) |Used to determine when network or service availability is compromised. Services may need to be restarted or network problems addressed. |
|Service state change: sshd, vpxa, httpd, vmware-vpxa, vmware-hostd, vmware-served, vmware-snmpd |ESX host |Checks availability of essential ESX host services. |These services are required to manage an ESX host. |
|ESX Service Console memory swap usage |ESX host |Alerts when service console memory swap usage on an ESX host reaches a predefined level. |Used to determine if the ESX Service Console has sufficient RAM dedicated to it. |
|Low Datastore Disk Space |vCenter / ESX Host |Alerts when remaining disk space on a datastore is less than a target value. (Example: Alert when less than 2 GB of datastore space remains) |Used to determine when data storage is running low and requires reorganization of virtual machines via Storage vMotion. |
|ESX Server CPU Utilization |vCenter |Alerts when CPU usage on an ESX host reaches a predefined level. Recommended: >= 85% |Used to determine when workloads may need to be reorganized via vMotion. |
|ESX Server Memory Utilization |vCenter |Alerts when memory usage on an ESX host reaches a predefined level. Recommended: >= 90% |Used to determine when workloads may need to be reorganized via vMotion. |
|Virtual Machine CPU utilization |vCenter |Alerts when virtual machine CPU usage reaches a predefined level. Recommended: >= 95% |Used to determine when a virtual machine may need additional CPU resources. |
|Virtual Machine Memory utilization |vCenter |Alerts when virtual machine memory usage reaches a predefined level. Recommended: >= 95% |Used to determine when a virtual machine may need additional memory resources. |
|Virtual Machine CPU Ready |vCenter |Alerts when a virtual machine's CPU ready value reaches a predefined level. Recommended: >= 15% (sustained) |Used to determine possible contention within an ESX host. |
|Virtual Machine Disk I/O |vCenter |Alerts when a virtual machine's disk I/O utilization consistently surpasses a predefined level. |Used to determine when virtual machines may need storage paths and/or LUN placement re-balanced. |
|Virtual Machine Network I/O |vCenter |Alerts when a virtual machine's network I/O utilization consistently surpasses a predefined level. |Used to determine when virtual machines may need additional network adaptors. |
|HA: Host Failure |vCenter HA |Alerts when vCenter detects a host failure. |Availability may be compromised. |
|HA: Host Isolation |vCenter HA |Alerts when vCenter detects a host is isolated. |Availability may be compromised. |
|ESX Server Hardware Alert |Server Diagnostic Interface |Alerts on physical problems with an ESX server. Alerts are dependent on the particular server hardware monitoring system. |Used to diagnose and respond to server hardware failures. |
|Security Events |Bad Root Logon on Host |Alerts on attempted logons as root using the wrong password. |Indicates possible unauthorized access attempt. |
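
The recommended levels in the table above can be collected into a single threshold map and checked against a set of readings. This sketch does not query vCenter; the metric key names are hypothetical and the readings are assumed to have been gathered already by your monitoring tool.

```python
# Recommended alert levels from the table (percent). Key names are
# hypothetical labels, not vCenter counter names.
RECOMMENDED_THRESHOLDS = {
    "esx_cpu_percent": 85.0,
    "esx_memory_percent": 90.0,
    "vm_cpu_percent": 95.0,
    "vm_memory_percent": 95.0,
    "vm_cpu_ready_percent": 15.0,
}


def breached(readings: dict) -> list:
    """Return the sorted names of metrics at or above their recommended level.
    Readings with no recommended threshold are ignored."""
    return sorted(
        name for name, value in readings.items()
        if name in RECOMMENDED_THRESHOLDS and value >= RECOMMENDED_THRESHOLDS[name]
    )


readings = {
    "esx_cpu_percent": 88.0,
    "esx_memory_percent": 72.0,
    "vm_cpu_ready_percent": 15.0,
    "vm_disk_io": 1200.0,   # no recommended level in the table; skipped
}
print(breached(readings))  # prints ['esx_cpu_percent', 'vm_cpu_ready_percent']
```

A check like this reports instantaneous crossings only; it would normally be combined with a duration filter so that brief spikes are not passed to operational staff.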

 

3 Map the Trend/Planning Events

 

A long-term alert signals a trigger point for expansion within the virtual environment. Expansion triggers originating from an ESX Host may signify a lack of overall resources in the cluster and can initiate the expansion process to add another node to the cluster. Cluster-initiated alerts are clear signals to initiate expansion procedures including the addition of nodes to the cluster.

 

All long-term monitoring alerts need to be routed to the VMware Center of Excellence so they can be evaluated and expansion procedures initiated. Detailed discussions of the recommended capacity management process are found in the section entitled Capacity Management in VMware Infrastructure Service Design.

 

Note: the table below is not an exhaustive list because the list changes with releases - check the documentation for your release

 

|Name |Managed Object |Definition |Purpose |
|High CPU Usage |vCenter |Average CPU usage at or above a threshold over a period of time. (Example: CPU above 90% over 7 days) |Trigger event to notify operations of virtual machine CPU usage. |
|High Memory Usage |vCenter |Average memory usage at or above a threshold over a period of time. (Example: Memory usage above 70% over 7 days) |Trigger event to notify operations of performance impact; should also notify the virtual machine owner that memory usage is high. |
|Low Disk Space |Guest Operating System |Remaining disk space below a threshold. (Example: Less than 1 GB of space remaining on C:) |Notification event to operations and the virtual machine owner that an increase in storage allocation is required. |
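
The long-term triggers in the table above depend on an average over a window rather than on a single reading. A minimal sketch, assuming one averaged value per day has already been collected (the function name and input format are hypothetical):

```python
from statistics import mean
from typing import Sequence


def long_term_trigger(daily_averages: Sequence[float],
                      threshold: float,
                      window_days: int = 7) -> bool:
    """Return True when the mean of the most recent window_days values
    is at or above the threshold. With fewer than window_days values
    there is not enough history, so no trigger is raised."""
    if len(daily_averages) < window_days:
        return False
    return mean(daily_averages[-window_days:]) >= threshold


# Seven daily CPU averages (percent); their mean is 91.
cpu_days = [88, 91, 93, 90, 92, 94, 89]
print(long_term_trigger(cpu_days, threshold=90))  # prints True
```

Using the window mean rather than any single day keeps one unusually busy day from starting an expansion procedure.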

 

4 ITIL v3 Areas Most Affected by Virtualization

 

ITIL drives business value by helping IT organizations standardize processes, share common terminology, and provide integrated IT service management across the IT organization. VMware virtual infrastructure enhances these processes.

 

itil_v3.png

 

  • Financial and Service Portfolio Mgmt

    • Cost model changes from a fixed-cost basis in the physical environment to a charge-for-what-you-use model in a shared virtual environment

  • Availability, Capacity and Continuity Mgmt

    • Higher utilization on the same asset, better ability to manage capacity across a larger pool of shared resources (instead of simply increasing capacity)

    • Availability and continuity can be done with less cost and complexity

  • Change, Configuration and Release Mgmt

    • VMotion allows movement of VMs across physical resources

    • Increased flexibility to change as well as more ways to control change

  • Event, Incident, Problem and IT Operations Mgmt

    • Eliminate agents while improving visibility for Incident detection and problem root-cause

    • Greater automation in incident response, better ways to minimize problem impact

 

Resources

 

Authors

  • Steve Chambers

 

Disclaimer

 

TBD

 

 

 

 

 

 

 

 
