Modernising regional monitoring
Monitoring and alerting are key to keeping a distributed platform operational. This has typically been done by generating a topology for a region from the GOCDB and using Nagios to monitor services at sites. These monitors are linked to operational level agreements that define thresholds; if thresholds on service state, availability, etc. are exceeded, an alert is issued and site administrators are urged to take action.
This is all well and good, but it’s time to experiment with a different monitoring service, Prometheus. This is to scratch the following itches:
- get a closer view of trends in site services
- get deeper tooling, in order to better understand service behaviour
- simply investigate new ways of doing things.
Regional monitoring infrastructure
The regional services include everything that falls under the GOC region, as well as the services in the region which are not in the GOC. These are perhaps out-of-project resources, services that are not yet at production level, or experimental services which are not in the EGI service catalogue.
We would like a monitoring infrastructure that discovers these services (via some as-yet unknown mechanism), includes them in the monitoring topology and collects metrics, eventually sending them to a dashboard. This requires some initial infrastructure:
- Deployment cloud: We need somewhere to deploy the initial services to
- Monitoring service: We’ll use Prometheus for this and put a Prometheus server on the cloud
- Alerting service: We need a service to issue alerts based on monitoring data; we’ll use the Prometheus “AlertManager” for this
- Dashboard: The metrics will be displayed on a Grafana dashboard.
- Instrumentation: We need to instrument the services so that we can collect monitoring data from them. This data is host-level as well as application-level data.
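To make the monitoring side of this concrete, here is a minimal sketch of what the Prometheus server configuration might look like once the pieces above are in place. The job names, target hostnames and intervals are placeholders, not our real topology:

```yaml
# prometheus.yml — minimal sketch; targets and job names are illustrative
global:
  scrape_interval: 1m
  evaluation_interval: 1m

rule_files:
  - alerts.yml                    # alerting rules evaluated by the server

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager.example.org:9093"]

scrape_configs:
  - job_name: node                # host-level metrics via node_exporter
    static_configs:
      - targets:
          - "site-a.example.org:9100"
          - "site-b.example.org:9100"
```

Eventually the `static_configs` stanza would be replaced by whatever service-discovery mechanism we settle on for the regional topology.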
Starting from the bare cloud¹, we adopt the pattern of “Infrastructure as Code”. This means expressing the deployment strategy for our monitoring services as code, and deciding on an execution model to deploy them. Deployment models could be built with tools such as Terraform, or with Ansible or Puppet code.
For our purposes, I will assume a set of Ansible roles used in playbooks which express the desired state layer by layer, using the Ansible OpenStack modules to provision machines. Of course, these roles need to be developed and tested before being used in the playbooks, and the playbooks themselves need to be tested in the staging environment, in order to run functional tests on their results. We will use a combination of Molecule, OpenStack², Docker and Travis to run the tests.
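As a sketch of the provisioning layer, a task using the Ansible `os_server` module might look like this; the cloud name, image, flavor and key are placeholders:

```yaml
# provision.yml — sketch of provisioning a monitoring VM with the
# Ansible OpenStack modules. All names here are illustrative.
- hosts: localhost
  connection: local
  tasks:
    - name: Provision the Prometheus server
      os_server:
        cloud: monitoring-cloud     # entry in clouds.yaml
        name: prometheus-0
        image: Ubuntu-18.04
        flavor: m1.medium
        key_name: deploy-key
        auto_ip: true
```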
Typically, the roles are tested independently with Molecule, both locally and by Travis on every push. Once the roles pass their own checks, they are combined into the deployment playbook, which both provisions the servers and configures them. These playbooks express different servers depending on the environment they are given³. Once the machines are provisioned, functional tests are run against the servers using ServerSpec or TestInfra; these can be considered integration tests that supplement the functional tests Molecule runs during development. If the staging environment passes the tests, it is destroyed (its VMs are deleted) and the playbook is run with the prod environment to update the service configuration.
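A per-role Molecule scenario wiring together the Docker driver and TestInfra verifier mentioned above might look like the following sketch (image and platform names are placeholders):

```yaml
# molecule/default/molecule.yml — sketch of a role's test scenario.
# The platform image is illustrative.
dependency:
  name: galaxy          # resolve role dependencies from Galaxy
driver:
  name: docker          # run converge against a throwaway container
platforms:
  - name: instance
    image: centos:7
provisioner:
  name: ansible         # apply the role under test
verifier:
  name: testinfra       # assert on the converged state
```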
This will require a few roles and playbooks:
- Prometheus: the Prometheus server itself, as well as the alert manager. Monitoring data will be sent to a Grafana dashboard.
- Grafana: a role to deploy the Grafana dashboard, which displays the monitoring data.
The playbook would include these roles as follows:
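A sketch of such a playbook, assuming hypothetical `monitors` and `dashboards` host groups:

```yaml
# site.yml — sketch of the deployment playbook.
# Group names (monitors, dashboards) and role names are illustrative.
- hosts: monitors
  become: true
  roles:
    - prometheus        # Prometheus server + Alertmanager

- hosts: dashboards
  become: true
  roles:
    - grafana           # Grafana dashboard
```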
Here we can see that the Prometheus role is applied to the set of machines in the monitors group, and the Grafana role is applied to a different set of machines in their own group. The main task in the Grafana role is the configuration of the server.
Orchestration and context
The most important aspect of the Grafana task is the configuration of the server, which is kept in an Ansible template. Grafana has first-class support for a Prometheus data source, but according to the docs the administrator has to add the source manually once the installation is done.
It seems to be possible to automate this⁴, via the Grafana REST interface:
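One way to do this from the playbook is with the Ansible `uri` module, POSTing to Grafana’s `/api/datasources` endpoint. The hostnames and credential variable below are placeholders:

```yaml
# Sketch: register Prometheus as a Grafana data source via the HTTP API.
# URLs and the password variable are illustrative.
- name: Add Prometheus data source to Grafana
  uri:
    url: "http://grafana.example.org:3000/api/datasources"
    method: POST
    user: admin
    password: "{{ grafana_admin_password }}"
    force_basic_auth: true
    body_format: json
    body:
      name: Prometheus
      type: prometheus
      url: "http://prometheus.example.org:9090"
      access: proxy       # Grafana proxies queries to Prometheus
    status_code: 200
```

This would slot naturally into the Grafana role as a post-install task, making the data source part of the desired state rather than a manual step.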
References and Footnotes
We will be assuming an OpenStack private cloud, but you could probably do this with any provider; it would just change the infrastructure-provisioning side.
This role is not in Galaxy yet; perhaps later we’ll move it out of the main DevOps repo and into its own Galaxy repo.