Using Expert Systems to Manage Diverse Networks & Systems
With a Focus on Operations
Greg Stanley

I. Overview
II. Representation of networks & applications
III. Architectures
IV. Case studies
I. Overview
• Managing diverse networks
• Major operational goals
• Major components
• Alarm filtering & correlation examples
Diverse networks & systems
• Numerous device types & manufacturers
Circuit switching/packet switching hardware
• Data vs. real-time for voice & video
(it's not all ATM yet...)
• Different protocols (TCP/IP, CMIP, ...)
Connection-oriented vs. connectionless
• LAN/WAN differences
• Wireless vs. terrestrial
• Changing topologies
(portable computers, wireless, low-earth-orbit satellites)
• Complex devices
Sub-objects with one IP address
• Subsystem interfaces & proxies
Element management systems
Diverse enterprise network systems
Network, plus software processes and overall applications need to be managed
• Software processes & overall application
• New Client/Server applications
• Applications, including legacy
• Services
• Resources
e.g., disk
Rapidly changing technology increases the need for flexible systems & rapid development
OSI Network management
Operations areas considered here
• Fault management
• Performance management
Other areas
• Security management
• Configuration management
• Accounting management
How does a real-time, object-oriented expert system help?
• Flexibility
• Speed of development - overall development environment
• Incremental development environment for rapid development & feedback, partial solutions
• Representation power: modelling the systems for use in diagnostics, analysis, prediction, ...
• Portability between platforms
• Systems integration capabilities
Some operations issues addressed by real-time, object-oriented expert systems
• Early detection of problems (proactive)
Predictions from performance or patterns
• Alarm/message/event filtering
Suppression of repetitive alarms
• Alarm correlation
Grouping of related alarms
• Diagnosis
Pinpointing the causes of alarms
• Procedure automation
Testing for diagnostic & filtering purposes
Resolving problems
Enforcing standard procedures
Semi-automatic - guiding operator
• Online information
Help, topology, hierarchy, relations
• "What-if" simulations for analysis or training
Alarm/message filtering examples
• Alarm X occurs, then clears by itself within timeout. Suppress it (do not present to operator).
(Also log suppressed alarms for analysis)
• Alarm X occurs. Further testing reveals this alarm to be false or to have cleared itself. Suppress it.
• Alarm X is repeated n times. Present first alarm only, update a repetition counter
• Alarm X is not a real problem until it occurs n times within timeout. Present one alarm only, after n alarms, update repetition counter
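The repetition-based filtering cases above can be sketched in a few lines of Python. This is only an illustration, not the OPA or G2 implementation: the class name `RepetitionFilter` and its fields are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class RepetitionFilter:
    """Hypothetical filter: present an alarm once, then only count repeats."""
    threshold: int = 1      # occurrences needed before the alarm is presented
    window: float = 60.0    # timeout (seconds) within which occurrences count
    recent: dict = field(default_factory=dict)      # alarm id -> recent timestamps
    counter: dict = field(default_factory=dict)     # alarm id -> repetition counter
    suppressed: list = field(default_factory=list)  # log of suppressed alarms

    def on_alarm(self, alarm_id, now):
        """Return True if this occurrence should be presented to the operator."""
        if alarm_id in self.counter:
            # Already presented once: update the counter and log the repeat.
            self.counter[alarm_id] += 1
            self.suppressed.append((now, alarm_id))
            return False
        # Keep only occurrences that are still inside the timeout window.
        times = [t for t in self.recent.get(alarm_id, []) if now - t < self.window]
        times.append(now)
        self.recent[alarm_id] = times
        if len(times) >= self.threshold:
            self.counter[alarm_id] = len(times)
            return True
        self.suppressed.append((now, alarm_id))
        return False
```

With `threshold=1` this behaves like "present first alarm only, update a repetition counter"; with a larger threshold it behaves like "not a real problem until it occurs n times within timeout".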
Alarm correlation examples
• Alarm X and Alarm Y occur within timeout. Suppress these, present new message Z to operator
• Alarms X1, X2, ..., X6, sent from different agents, are all complaining about "target" device Y. Acknowledge X1...X6, and send an alarm about Y.
• Alarms X1, X2, ..., X8 were all sent by "sender" device Y. Send an alarm indicating suspicious behavior of Y.
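As a rough illustration of the "target" correlation case above, the sketch below groups alarms by the device they complain about and replaces each sufficiently large group with one summary. The function name, the `(sender, target)` tuple format, and the `min_agents` parameter are assumptions made for this example.

```python
from collections import defaultdict


def correlate_by_target(alarms, min_agents=2):
    """Group (sender, target) alarms by target.

    Returns (summaries, acknowledged): one summary string per target that was
    reported by at least `min_agents` distinct senders, plus the list of
    source alarms that the summary replaces.
    """
    by_target = defaultdict(set)
    for sender, target in alarms:
        by_target[target].add(sender)

    summaries, acknowledged = [], []
    for target, senders in sorted(by_target.items()):
        if len(senders) >= min_agents:
            summaries.append(f"{target}: reported by {len(senders)} agents")
            acknowledged.extend((s, target) for s in sorted(senders))
    return summaries, acknowledged
```

The "sender" case is symmetric: group by the first element of each pair instead of the second.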
Model-based alarm correlation & diagnosis
Models are typically based on connectivity, part-of hierarchy, cause-effect failure models, individual device models such as state diagrams
• Multiple failures have occurred on the same LAN segment. Poll the remaining devices - if all fail, then warn the operator that the segment as a whole has failed (e.g., cable break), and acknowledge the individual source alarms
• Multiple devices X1, X2, ... are sending messages complaining that they cannot communicate with device Y. Send a message that device Y has failed, and acknowledge all the messages for X1, X2, ...
• High-level services requiring particular interface cards X1, X2, ... are all failing. X1, X2, ..., are all plugged into a common backplane or have some other common failure mode. Diagnose and alarm on the common mode failure, and acknowledge X1, X2, ... .
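The common-mode case above amounts to intersecting the dependencies of the failing components. The following is a minimal sketch under that assumption; the `depends_on` mapping and the component names are hypothetical stand-ins for what a real model would derive from connectivity and the part-of hierarchy.

```python
def common_mode_candidates(failed, depends_on):
    """Return the resources shared by every failing component.

    failed:      iterable of failing component names
    depends_on:  dict mapping component name -> set of resources it needs
    """
    shared = None
    for component in failed:
        deps = set(depends_on.get(component, ()))
        shared = deps if shared is None else shared & deps
    return shared or set()


# Hypothetical model: two interface cards share one backplane.
DEPENDS_ON = {
    "X1": {"backplane-A", "psu-1"},
    "X2": {"backplane-A", "psu-2"},
    "X3": {"backplane-B"},
}
```

If the result is non-empty, the shared resource is the candidate to test and alarm on, and the individual card alarms can be acknowledged.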
Procedure-based alarm correlation, diagnosis & resolution
• Alarm X occurs. Wait 60 seconds. Check for symptoms again by polling, or log in to a computer and execute some UNIX commands (using remote shell). If the problem is still there, send an alarm; otherwise suppress it (except for an optional log entry).
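A minimal sketch of this wait-and-recheck procedure follows. The function and parameter names are assumptions; the `still_failing` callable stands in for whatever poll or remote command would be used, and `sleep` is injectable so the logic can be exercised without waiting.

```python
import time


def recheck_after_delay(still_failing, delay=60.0, sleep=time.sleep, log=None):
    """Wait `delay` seconds, test again, and decide what to do with the alarm.

    still_failing: zero-argument callable returning True if symptoms persist
    log:           optional list that receives an entry for suppressed alarms
    """
    sleep(delay)
    if still_failing():
        return "alarm"
    if log is not None:
        log.append("suppressed after recheck")
    return "suppress"
```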
II. Representation of networks as a basis for applications
• Knowledge management view
• "Build yourself a graphical language" to more closely match your tool to your domain
• Representation in OPA
Knowledge management view
• Emphasizes representation of knowledge for applications
• Not just data!
- System hardware & software models
- System topology
- Failure & fault propagation models
- Operating rules & procedures
• Example: alarm correlation & diagnostics need process topology and device models, also usable in operator training and in planning.
• Object-oriented, with graphical representation
Major characteristics of a KBES ("Knowledge-Based Expert System")
• KBES represents both qualitative and quantitative models
• Object orientation is the key part of modern expert systems
• KBES represent information explicitly, rather than embedded in code
analogy: simultaneous mathematical equations vs. a set of assignment statements and an iterative procedure in FORTRAN; or a schematic vs. a set of statements that generate the schematic
• Emphasis on building "declarative" descriptions, independent of subsequent use, and easily inspectable by wide class of users
- Goal to simplify representation & re-use of knowledge for multiple purposes
• Some KBESs (e.g., G2) have a strong graphics orientation as part of their declarative knowledge
• Static vs. real-time KBES
• Development environment
• KBES provide powerful new high-level tools for modelling and re-use
• High-level descriptions:
- Equipment class implies behavior
- Schematic drawings: connections imply fault propagation, data flow, reachability, reliability
- "Part-of" relation implies fault propagation model
- "Is-a-kind-of" specialization simplifies descriptions
all modems share some common properties
reachability analysis ignores differences between most devices, and may include software processes
- Generic statements utilize these high-level constructs to generate specific diagnosis or simulation, using common attributes
• Model declarations are independent of ultimate usage
• Qualitative models (e.g., cause-effect)
• Portion of a class hierarchy

Portions of class hierarchy (indented form)
| | | TELECOM-DEVICE
| | | | ELECTRONIC-DEVICE
| | | | | LOGICAL-UNIT
| | | | | BUS-NODE -- 1 instance
| | | | | | TOKEN-RING-REPEATER -- 1 instance
| | | | | | LAN-TRANSCEIVER -- 1 instance
| | | | | | | ETHERNET-TRANSCEIVER -- 48 instances
| | | | | | CLUSTER-CONTROL-EXT -- 1 instance
| | | | | | IBM-CHANNEL-CONTROLLER -- 4 instances
| | | | | | Q-BUS-NODE -- 3 instances
| | | | | | | Q-BUS-RS-232-NODE -- 1 instance
| | | | | | HUB -- 1 instance
| | | | | COMM-TWO-PORT
| | | | | | MODEM
| | | | | | | REMOTE-LOOPBACK-MODEM -- 7 instances
| | | | | | | | REMOTE-LOOPBACK-MODEM-RS-232 -- 11 instances
| | | | | | | MANUAL-LOOPBACK-MODEM -- 1 instance
| | | | | | | | MANUAL-LOOPBACK-MODEM-RS-232 -- 5 instances
| | | | | | | IN-HOUSE-MODEM -- 1 instance
| | | | | | | | IN-HOUSE-MODEM-RS-232 -- 3 instances
| | | | | | | MODEM-NO-LOOPBACK -- 1 instance
| | | | | | | | MODEM-NO-LOOPBACK-RS-232 -- 1 instance
| | | | | | PROTOCOL-CONVERTER -- 4 instances
| | | | | | BRIDGE -- 1 instance
| | | | | | REPEATER -- 3 instances
| | | | | | GATEWAY -- 1 instance
| | | | | | ROUTER -- 2 instances
| | | | | | TRANS-LAN -- 3 instances
| | | | | CLUSTER-CONTROL
| | | | | | SMALL-CLUSTER-CONTROL
| | | | | | | IBM-3274-CLUSTER-CONTROLLER -- 2 instances
| | | | | SERVER
| | | | | | TERMINAL-SERVER
| | | | | | | RS-232-TERMINAL-SERVER -- 4 instances
| | | | | COMPUTER
| | | | | | MEDIUM-COMPUTER -- 5 instances
| | | | | | SMALL-COMPUTER -- 11 instances
| | | | | | BIG-COMPUTER -- 2 instances
| | | | | | WORKSTATION -- 27 instances
| | | | | | GMS-NODE -- 124 instances
| | | | | COMPUTER-PERIPHERAL-DEVICE
| | | | | | TERMINAL -- 4 instances
| | | | | | | RS-232-TERMINAL -- 8 instances
| | | | | | PRINTER -- 4 instances
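To make the "is-a-kind-of" idea concrete, here is a small Python sketch of one branch of the hierarchy above. It is not G2 class syntax, and the attributes `ports` and `has_loopback` are invented for illustration; only the class names and their nesting come from the hierarchy listing.

```python
class TelecomDevice:
    """Root class: every device has a name and a status."""
    has_loopback = False            # assumed default, overridden by subclasses

    def __init__(self, name):
        self.name = name
        self.status = "ok"


class ElectronicDevice(TelecomDevice):
    pass


class CommTwoPort(ElectronicDevice):
    ports = 2                       # assumed attribute shared by two-port devices


class Modem(CommTwoPort):
    pass


class RemoteLoopbackModem(Modem):
    has_loopback = True             # only the difference from Modem is stated


def failed_devices(devices):
    """Generic statement: works for any TelecomDevice, whatever its subclass."""
    return [d.name for d in devices if d.status == "failed"]
```

Each subclass states only how it differs from its parent, and a generic function such as `failed_devices` ignores the differences entirely.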
Sample class definitions


Using the editor to change the stubs for a class

Example: Icon editor

Example relation used in diagnosis

Example generic rule using connectivity
For any telecom-device D connected to any hub H
if the status of H is failed
then conclude that the status of D is failed
and conclude that the alarm-priority of D = the alarm-priority of H
(actual syntax)
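For readers unfamiliar with generic rules, the rule above can be paraphrased in Python roughly as follows. This is an approximation for illustration only: a rule engine would fire the rule whenever the hub status changes, whereas this sketch is a single pass over an explicit list of `(device, hub)` connections, and the `Node` class is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    status: str = "ok"
    alarm_priority: int = 0


def apply_hub_rule(connections):
    """For any device D connected to any hub H: if H is failed, conclude that
    D is failed and that D takes the alarm priority of H.

    connections: iterable of (device, hub) Node pairs.
    Returns the names of the devices that were marked failed.
    """
    marked = []
    for device, hub in connections:
        if hub.status == "failed":
            device.status = "failed"
            device.alarm_priority = hub.alarm_priority
            marked.append(device.name)
    return marked
```

The point of the generic form is that the rule is written once for the class of devices and hubs, not once per device.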
Example generic rule used in alarm filtering
For any electrical-device D
whenever any message MSG becomes an-event-for D
and when the count of each message MSG2 that is an-event-for D > 4
then conclude that ....
and start multiple-message-filter(D)
(actual syntax)
Some benefits of KBES representation
• Reduces gaps between system analysis, specification, design, implementation, run-time use, maintenance.
- Explicit models carried through all phases
- Inspectable by all classes of users, not just programmers
• Common representation for multiple applications, with one consistent model for development & maintenance
• Generic library: default behavior specified for given class of object, connections - no additional special lists to fill out unless object deviates from the defaults
Some features of G2 - the graphically-oriented, real-time Knowledge-Based Expert System (KBES)
• Objects with attributes
• Class hierarchy for objects, with inheritance of properties and behavior - allowing "differential modelling"
• Associative knowledge, relating objects in the form of connections and relations
• Structural knowledge (e.g., "part-of" relation)
• Representation and manipulation of objects and connections graphically
• Generic rules and associated inference engine
• Concurrent procedures
• Analytic knowledge, such as functions, formulas, differential equation simulation
• Real-time task scheduler, supporting concurrency, priorities, time stamping, validity intervals, timed actions, event-driven activity, reasoning within a fixed deadline, history-keeping, data interfaces
• Interactive development environment and run-time environment
• Graphics
• External interfaces for systems integration
An option: "Build yourself a graphical language"
• Match tool to domain - reduce semantic gap between tool and problem
• Build library of classes & methods (procedures), rules, etc.
• Build "configurer" GUI based on cloning objects from a palette, connecting them, filling out tables of attributes
• Fairly common in many domains
Common graphical elements
• Containment hierarchy/"part-of" for physical areas, common-modes, physical equipment, hierarchy
• Objects in a class hierarchy with specialization & inheritance.
Workstation is a-kind-of computer
Abstract classes such as "hardware"
• Objects include attributes and methods (procedures), e.g., test methods
• Almost everything, whether physical or abstract, is an object
• Graphical connections represent physical connectivity, logical connectivity, or relationships such as cause/effect, hierarchy
RTES-based Petri net example
The language
• Petri net represents actions & state transitions
• Procedures executed at each node
• "Token" passed among nodes, split when parallel operations are launched
• Explicit concurrency control
e.g., "Rendezvous" to re-unite concurrent operations
• Used in control & other applications, to execute sequential, procedural operations
The RTES/Object-oriented implementation
• Objects represent nodes, rendezvous, token
• Methods (procedures) called at each node, using underlying implementation language
• Connections (objects) for transitions
• Rules or procedures watch for state transitions
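The token-passing behavior can be illustrated with a classic place/transition Petri net in Python. This is a generic textbook formulation, not the RTES implementation described above: actions are attached to transitions, each input place is assumed to be listed once per transition, and all names are hypothetical.

```python
class PetriNet:
    def __init__(self, marking):
        self.marking = dict(marking)   # place -> number of tokens
        self.transitions = {}          # name -> (inputs, outputs, action)

    def add_transition(self, name, inputs, outputs, action=None):
        self.transitions[name] = (tuple(inputs), tuple(outputs), action)

    def enabled(self, name):
        inputs, _, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name!r} is not enabled")
        inputs, outputs, action = self.transitions[name]
        for p in inputs:               # consume one token per input place
            self.marking[p] -= 1
        for p in outputs:              # produce one token per output place
            self.marking[p] = self.marking.get(p, 0) + 1
        if action is not None:
            action()

    def run(self, max_steps=100):
        """Fire enabled transitions one at a time; return the firing order."""
        fired = []
        for _ in range(max_steps):
            ready = [t for t in self.transitions if self.enabled(t)]
            if not ready:
                break
            self.fire(ready[0])
            fired.append(ready[0])
        return fired


# The token is split at "fork" and re-united at "rendezvous".
net = PetriNet({"start": 1})
net.add_transition("fork", ["start"], ["a", "b"])
net.add_transition("do_a", ["a"], ["a_done"])
net.add_transition("do_b", ["b"], ["b_done"])
net.add_transition("rendezvous", ["a_done", "b_done"], ["done"])
```

The rendezvous transition cannot fire until both branches have delivered their tokens, which is the explicit concurrency control mentioned above.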
RTES-based state diagram example
The language
• Diagram represents states & state transitions
• Procedures executed at each node
• "Token" passed among nodes
The RTES/Object-oriented implementation
• Objects represent nodes, token
• Connections (objects) for transitions
• Rules or procedures watch for state transitions
Implementation is similar to the Petri net, but simpler
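A state diagram with a single token reduces to a lookup table of transitions. The sketch below assumes event-labelled transitions and optional per-state procedures; the class name, the states, and the events are made up for illustration.

```python
class StateDiagram:
    def __init__(self, initial, transitions, on_enter=None):
        self.state = initial              # where the single "token" sits
        self.transitions = transitions    # (state, event) -> next state
        self.on_enter = on_enter or {}    # state -> procedure run on entry

    def handle(self, event):
        """Move the token if a transition matches; return True if it moved."""
        nxt = self.transitions.get((self.state, event))
        if nxt is None:
            return False
        self.state = nxt
        procedure = self.on_enter.get(nxt)
        if procedure is not None:
            procedure()
        return True


# Hypothetical device model.
DEVICE_TRANSITIONS = {
    ("up", "errors-rising"): "degraded",
    ("degraded", "errors-cleared"): "up",
    ("degraded", "no-response"): "down",
    ("down", "responding"): "up",
}
```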
Other common graphical approaches
• Logic networks (AND/OR gates, etc.)
Input symptoms, output causes
Roughly equivalent to specific rules
• Fault trees, decision trees, AND/OR trees, hierarchical fault models, with goal-seeking
Similar objects, different program control
• Cause/effect diagrams
• Procedures to analyze schematic/map
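As one illustration of the logic-network and AND/OR tree approaches, the following sketch evaluates a tree whose leaves are observed symptoms and whose root is a candidate cause. The tuple encoding and the symptom names are assumptions made for this example.

```python
def evaluate(node, symptoms):
    """Evaluate an AND/OR tree.

    node:     a symptom name (leaf), or a tuple (op, children) with op "AND" or "OR"
    symptoms: dict mapping symptom name -> bool (missing symptoms count as False)
    """
    if isinstance(node, str):
        return bool(symptoms.get(node, False))
    op, children = node
    results = [evaluate(child, symptoms) for child in children]
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown gate {op!r}")


# Hypothetical cause: the modem has failed if power is lost, or if there is
# no carrier and the loopback test also fails.
MODEM_FAILED = ("OR", ["power-alarm", ("AND", ["no-carrier", "loopback-fails"])])
```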
Representation in OPA
• Telecom devices and software processes
includes the "managed objects"
• Class hierarchy
• Workspace ("part-of", "containment") hierarchy
• Containers ("sites", "networks")
alarm and acknowledgement status is propagated up the containment hierarchy
• Alarms/messages/events
• Relations
• Connections - topology information
• Test and operator actions representation
• Common framework shared between message-handling, OPAC graphics language, and schematics
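The propagation of alarm and acknowledgement status up the containment hierarchy can be sketched as a simple tree walk. The class below is hypothetical and assumes that a larger priority number means a more severe alarm; OPA's actual convention is not stated here.

```python
class Container:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        self.own_priority = 0        # 0 = no alarm; larger = more severe (assumed)
        self.acknowledged = True
        if parent is not None:
            parent.children.append(self)

    def raise_alarm(self, priority):
        self.own_priority = priority
        self.acknowledged = False

    def effective_priority(self):
        """Worst priority found at this level or anywhere below it."""
        return max([self.own_priority] +
                   [c.effective_priority() for c in self.children])

    def fully_acknowledged(self):
        """True only if nothing at or below this level is unacknowledged."""
        return self.acknowledged and all(c.fully_acknowledged()
                                         for c in self.children)


network = Container("network")
site = Container("site-1", parent=network)
modem = Container("modem-7", parent=site)
```

An alarm raised on `modem` then shows up in the status of `site` and `network` without any per-object bookkeeping.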
Example palettes for telecom-devices

Example attribute table: telecom-device

Container configuration palette

Workspace (map) hierarchy I

Workspace (map) hierarchy II

A site

Processing for incoming events
• Decode messages as needed, including identification of target, sender, category
• Eliminate obvious repetitions by simple message filtering
• Create "raw" warning messages
• Apply model-based diagnosis, heuristics, procedural reasoning when possible
• Acquire additional information & run tests
• Select candidate "most likely" failures based on model or other information
• Draw conclusions about root causes and sympathy events, prove nodes "good" or "bad"
• Cluster remaining alarms into reasonable groups when possible
• Automatically fix problems where possible
• Notify the operator with summarized alarms and other alarms, guide through repairs
• Pass information to trouble ticket system
• Recognize recurring problems & notify system administrator
Sample filtered message

Filtered messages
The main messages sent would be the ones on the filtered-message handler.
The above message shows up in summary form on the message handler as:

G2-manager-process




Example filtering scenario: raw messages

Instead of sending all these failure messages to the operator, the following
filtered message would be sent, as shown on the "filtered messages" handler:
Filtered version of the previous 22 messages, sent to operator

The details of this message show the original information that went into this
summarized message. The additional-text explanation is assembled automatically.






Decision block with manual input

The endless loop was started for "target" P6S04. The menu above was generated
automatically by the decision block, which had the following table. Note the use of
variables, indicated with the $.

Block pause capability

The menu above was generated automatically by the block-pause-capability,
which had the following table. Note the use of variables, indicated with the $.










III. OPA Architecture
• Overall Architecture/System Integration
• Major components
Overall Architecture

OPA Major building blocks

Message Management - a message "MIB"
• Messages are objects with attributes such as priority, acknowledgement status, time stamp, timeouts for procedure execution such as escalation, etc.
• Message handlers store messages
• Individual views or message handlers can be set up per Telewindows user
• Messages are organized and related to "target", "sender", "ID" (category), window, etc., for analysis or browsing.
• Unified framework with OPAC, e.g., same "target", "sender", "ID" (OPAC uses "notify" to designate the message handler)
• Messages determine the priority and acknowledgement status of objects in the schematic (map)
• Programmatic access, as well as access by users
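A rough Python analogue of messages as objects held by a message handler is shown below. The field names follow the attributes listed above, but the classes and methods themselves are hypothetical and are not the OPA API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Message:
    target: str
    sender: str
    category: str                    # the "ID" in the text above
    text: str = ""
    priority: int = 1
    acknowledged: bool = False
    timestamp: float = field(default_factory=time.time)


class MessageHandler:
    """Stores messages and supports browsing by target and acknowledgement."""

    def __init__(self):
        self.messages = []

    def add(self, message):
        self.messages.append(message)
        return message

    def for_target(self, target):
        return [m for m in self.messages if m.target == target]

    def acknowledge(self, target):
        """Acknowledge all messages for a target; return how many changed."""
        changed = 0
        for m in self.for_target(target):
            if not m.acknowledged:
                m.acknowledged = True
                changed += 1
        return changed
```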
User interactions with message handlers
• Acknowledgement
• Deletion
• Optional modification (e.g., comments)
• Navigation to find sender, target, etc.
• Navigation from schematic objects to browse messages at any level (object, or larger unit with a subworkspace hierarchy)
Systems Integration
• GSI C-based library to build custom bridges
• GSI runs as separate process, across network
• Asynchronous communications
• Remote procedure calls
• Polling or event-driven
• SQL-type interfaces to databases (Oracle, Sybase, ...)
• OpenView (SNMP/DM) interface
• File I/O and process spawning
OpenView bridge
• Interfaces G2/OPA to OpenView and general network via SNMP
• Runs as separate process, on same CPU as OpenView (G2 generally runs on a different machine)
• Written in C, using binaries from GSI library and OpenView library
• Works with OpenView DM platform, or the SNMP platform (which is a subset of DM)
• Supports standard SNMP get, get-next, set, send-trap, receive-trap
• With DM platform, can register for events using HP Event Management Services
• XMP calls (which support CMIP protocol) available later
• "Blocking" and "Non-blocking" modes
Using the OpenView bridge: mechanics
• G2/OPA can initiate interactions, or receive unsolicited traps
• G2/OPA can poll, or do management by exception
• G2/OPA can communicate directly with other managers, agents, or software (e.g., Bridgeway's Eventix) by getting and sending traps
• G2/OPA can change colors on OpenView map by sending standard "status" trap
• Operators using OpenView Windows can send traps directly to G2 (via "snmptrap" utility, after configuration of executable icons or menu entries)
• G2/OPA communications to or from bridge are G2 remote procedure calls
Using the OpenView bridge: strategies
• Might configure G2 as an "intelligent operator", or as an "intermediary", intercepting all alarms on the way to OpenView
• Might use polling or management by exception
Polling can cause slow response, poor scaling
Pure management by exception works better when messages have guaranteed delivery, but SNMP datagrams don't guarantee delivery
• Large distributed system may require some filtering, parsing & tokenization of alarm messages close to the sources
May want "proxy" or other agents
Future OPA versions may directly help generate intelligent agents when needed
File I/O and process spawning
• Typical need to launch UNIX processes, receive results, read and write files
example - log in via remsh, do a "ps -ef | grep xxxxx" to find if a particular process is running, and interpret the results, possibly kill a process and start a process
example - again via remsh, check if a file exists. If not, start some process. When the file exists, read its first line and take action based on that first line.
• OPAC language blocks directly execute spawns, file I/O
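For comparison, the first example (checking whether a process is running) could be written in Python along these lines. This sketch runs `ps -ef` locally with the standard `subprocess` module rather than through remsh, and the `run` parameter is an assumption added so the parsing logic can be tried on canned output.

```python
import subprocess


def list_processes():
    """Return the output of `ps -ef` on the local machine."""
    result = subprocess.run(["ps", "-ef"], capture_output=True,
                            text=True, check=True)
    return result.stdout


def process_running(name, run=list_processes):
    """Return True if any line of the process listing mentions `name`.

    run: zero-argument callable returning the process listing as text;
         defaults to the local `ps -ef`, and could be replaced by a
         function that executes the same command on a remote host.
    """
    return any(name in line for line in run().splitlines())
```

Matching on a substring is as loose as `grep xxxxx`; a stricter check would compare against the command column only.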
IV. Case studies using real-time expert systems
• AT&T EasyLink (Commercial electronic mail service): Using OPA for alarm filtering & diagnostics, procedure automation
• Intelsat: network monitoring, satellite telemetry monitoring
• Stanford Telecom ATM applications: DoD SSCN & SPANet, ANMA ATM manager
• Texaco Trading & Transportation
• SWIFT (Belgium) - monitoring bank wire transfers
• CRT Banca (Italy), others: Remote bank security monitoring
• Telefonica (Spain)