Openincident Technology


Projects and Thoughts on Incident Management Technology


Computer Systems Monitoring: An Overview

4th December 2019

Recently at work, while evaluating our monitoring infrastructure, I've come across very disparate views on what computer system monitoring is, and on the differences between monitoring and notification. I also haven't seen the middle routing layer documented well anywhere, despite it existing in most monitoring software today. Hence this post.

First of all, what is monitoring? Everybody I talk to has a different opinion of it, depending on their role in a company. From a software developer's perspective, system monitoring entails a software package that looks at the responsiveness of a website, showing which calls to which URLs are slow. From an SRE's or system administrator's point of view, however, monitoring is a software package that gives insight into the status of a server: storage usage, IO load, or how busy a CPU is. To a business manager, all of that information is meaningless; he wants to know how the website is being used to make money. So he would view relevant monitoring as an external website that tracks sales versus clicks and similar metrics.

You can generalize these different concepts onto the OSI Stack. (If you don't know what the OSI stack is, read this document https://en.wikipedia.org/wiki/OSI_model)



Briefly, the OSI model is an abstraction of computer communications mapped out in a chart. At the bottom is where we connect ethernet cables, and at the top is the actual website. Data center technicians occupy themselves with the lowest level, SREs concern themselves with the middle layers, software devs look at layers 6 and 7, and the business people concern themselves with how the entire structure is working and how well it's making money for the company.

As you can see, if you wanted to monitor any single part of that structure, you would need an entirely different package, measuring different variables in different ways.
I can break down these different types into the following classes:
* Layer 6-7: Website Analytics (Google Analytics, Tealeaf, Piwik, Dynatrace, etc.)
* Layer 4-5: System Monitoring (Solarwinds, Prometheus, TICK Stack, etc.)
* Layer 2-3: Network Monitoring (ehop, Netmon, iftop, etc.)
* Layer 1: Hardware monitoring (various SNMP and baseboard monitoring packages)

Note that a lot of these products lay claim to being the one product to monitor your entire infrastructure. (Additionally, the more a package costs, the bigger the claims it makes to this end.) For instance, Dynatrace started out as a website analytics package but now offers extended insight into network issues, servers that are down, and database problems. All of that, however, is seen through the eyes of a product that was designed for application monitoring.

So, as you can see, different users have different expectations when monitoring computer infrastructure. Also, packages that are designed for each of those expectations often attempt to branch out to fill other users' needs.

Monitoring vs. Notification

So, each of these packages looks at a different part of the computer system stack in a different way, for different users. Almost all of them offer some type of web browser interface to configure the package and to see the current status of the objects being monitored. Now, how do you find out that something is broken? Usually the product offers a screen where you can view the current status. But how do you find out when something breaks and you're not at the screen?

This is the difference between monitoring and notification, and this is an important distinction. Monitoring is the act of observing a system, and is a polling activity. That is, you have to constantly check the status of the system by observing it. This is the activity when SREs are at a NOC watching screens full of graphs and charts. In order to not miss a problem, you have to not get distracted. You can't ever leave or you'll miss an event.

Contrast that with notification. Notification is the activity of being told when something in the monitored system changes. An SRE is able to set up notification for a specific event and walk away, go to lunch, or go home and go to sleep, and will be informed via a phone call, text or pager when something happens.
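To make the contrast concrete, here is a minimal sketch in Python. Everything in it is hypothetical and made up for illustration (the `DiskGauge` class, its `subscribe` and `update` methods, and the disk usage numbers); it is not taken from any real monitoring package. The `poll` function stands in for a person watching a screen, while `subscribe` stands in for setting up a notification and walking away.

```python
from typing import Callable, List


class DiskGauge:
    """Hypothetical monitored object: holds a value and tells subscribers when it changes."""

    def __init__(self, percent_used: float) -> None:
        self.percent_used = percent_used
        self._subscribers: List[Callable[[float], None]] = []

    def subscribe(self, callback: Callable[[float], None]) -> None:
        # Notification: register once, then stop watching.
        self._subscribers.append(callback)

    def update(self, percent_used: float) -> None:
        # The monitored system changes and pushes the new value out.
        self.percent_used = percent_used
        for callback in self._subscribers:
            callback(percent_used)


def poll(gauge: DiskGauge, checks: int) -> List[float]:
    """Monitoring: the observer has to keep asking what the current value is."""
    return [gauge.percent_used for _ in range(checks)]


gauge = DiskGauge(40.0)

# Monitoring: three looks at the screen, three readings.
seen = poll(gauge, 3)

# Notification: subscribe, then the change comes to us.
received: List[float] = []
gauge.subscribe(received.append)
gauge.update(95.0)
```

In the polling half the observer does all the work and only knows what the value was at the moments it looked. In the notification half the observer does nothing until the change arrives.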

Routing and Triggering

When you differentiate these two ideas, you can start to see there is a third process that sits in between. I call it a routing layer: a set of rules that says when a monitored metric has exceeded some level and should therefore become a notification. This third process is usually integrated into a product and set up for you as a general use case, and is very rarely configurable.
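A routing rule of this kind can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Rule` dataclass, the `route` function, the metric name `disk_used_percent`, and the 90 percent threshold are all invented here and do not come from any particular product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical routing rule: which metric, what level, what to say."""
    metric: str
    threshold: float
    message: str


def route(rule: Rule, metric: str, value: float) -> Optional[str]:
    """Return notification text if the reading exceeds the rule's level, else None."""
    if metric == rule.metric and value > rule.threshold:
        return f"{rule.message} ({metric}={value})"
    return None


disk_rule = Rule(metric="disk_used_percent", threshold=90.0, message="Disk almost full")

quiet = route(disk_rule, "disk_used_percent", 72.0)   # below the level: nothing to send
alert = route(disk_rule, "disk_used_percent", 95.5)   # above the level: becomes a notification
```

The point of the sketch is that the rule is its own piece: it neither collects the metric nor delivers the message, it only decides whether a reading should turn into a notification.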

In order for a given user (say, an SRE or a software developer) to set up notification, he needs to do three things: find the package that monitors the system he is interested in, set up rules or triggers that fire when the system is out of line, and then configure whether he wants an email, text or call when something happens.

Note here again that, for the same system, a different developer or SRE will be concerned about different parts of the system. Thus different users will create different rules and triggers, and each user will want to be notified separately and via different methods. This points to a new requirement: a monitoring solution that can support triggers per user group.
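Here is a rough Python sketch of what triggers per user group could look like. The `Trigger` dataclass, the `notifications_for` function, the group names, the metric names, and the thresholds are all hypothetical, chosen only to show two groups watching the same system with different rules and different delivery methods.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Trigger:
    """Hypothetical per-group trigger: who cares, about what, at what level, and how to reach them."""
    group: str
    metric: str
    threshold: float
    channel: str  # e.g. "sms", "email", "call"


def notifications_for(
    triggers: List[Trigger], sample: Dict[str, float]
) -> List[Tuple[str, str, str]]:
    """Return (group, channel, metric) for every trigger that the sample exceeds."""
    fired = []
    for trigger in triggers:
        value = sample.get(trigger.metric)
        if value is not None and value > trigger.threshold:
            fired.append((trigger.group, trigger.channel, trigger.metric))
    return fired


triggers = [
    Trigger("sre", "cpu_percent", 90.0, "call"),
    Trigger("developers", "p95_latency_ms", 500.0, "email"),
    Trigger("sre", "p95_latency_ms", 2000.0, "sms"),
]

# One reading of the same system, seen through both groups' rules.
sample = {"cpu_percent": 35.0, "p95_latency_ms": 800.0}
sent = notifications_for(triggers, sample)
```

With this sample, the developers get an email about latency while the SREs hear nothing, because their own latency level is much higher and the CPU is idle. Same system, same data, different notifications.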

Summary

In summary, I've tried here to illustrate some of the issues in deploying modern monitoring and notification software on computer systems. There are different packages targeting different sections of the software stack, and each caters to a different type of computer professional. There is also a tendency for each package to claim to be the end-all-be-all monitoring solution for a company.

However, as I've shown, there are three separate activities that get lumped together under the term "Monitoring". Monitoring itself is the process of observing a system in a certain way. Then, users with different needs create different routing and triggering rules. Finally, these rules create notifications that get sent out in different ways (SMS, call, email, etc.).

Hopefully, with these new definitions, more useful monitoring systems can be deployed.


AUTHOR

Dan Zubey
