2017 : Off to a … start.
Once more round the sun we go. Oh yay. Whatever, the holidays are over, we’d best just get used to the idea.
All the same, it’s time to get the year off to a solid start in the Africa-Arabia Regional Operations Centre. The main activity that we undertake here is to ensure operations coordination, which is the responsibility of the so-called Regional Operator on Duty, or ROD. The term was coined long ago during the EGEE projects, and survived into the current EGI world of federated infrastructure.
ROD (Regional Operator on Duty) is a role which oversees the smooth operation of EGI infrastructure in the respective NGI. The ROD team is responsible for solving problems on the infrastructure within its own Operations Centre according to agreed procedures. They ensure that problems are properly recorded and progress according to specified timelines. They ensure that necessary information is available to all parties. The team is provided by each Operations Centre and requires procedural knowledge of the process. The role is usually covered by a team of people and is provided by each NGI. Depending on how an NGI is organised there might be a number of members in the ROD team who work on a duty roster (shifts on a daily or weekly basis), or there may be one person working as ROD on a daily basis and a few deputies who take over the responsibilities when necessary. This latter model is generally more suitable for small NGIs.
Operations == Stability
Operations is a place where you typically want as little as possible to change. The art of ROD is making sure that there is as little disruption as possible to the services in the federation. Of course, things do happen from time to time (and yes, bad things also happen). Avoiding these disruptions is in practice impossible, so the responsibility of ROD is to ensure that the relevant actors are informed and that any mitigating actions necessary are taken.
This typically means checking the ROD dashboard on the Operations Portal for new alarms from the ARGO monitoring service. When there’s a new alarm generated by ARGO due to a service failure or problem, a GGUS ticket is opened by ROD and assigned to the site. Already open tickets are checked on, to see whether the probe that generated them has changed state (we always hope for the beloved green probe, meaning that the issue has been fixed); if not, an update is requested from the responsible site.
The Site Operating Level Agreement describes the acceptable levels of availability and reliability of services (A/R), and the expected timeframes within which issues should be addressed. If issues are not fixed or acknowledged by the site administrator within the timeframe specified in the OLA (typically 5 working days), the ticket starts to expire and should be escalated by ROD, perhaps to a relevant support unit if expert advice is needed. The goal is to always achieve maximum reliability even when services aren’t 100 % available.
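To make the "5 working days" window concrete, here is a minimal sketch in Python of how ROD might work out when a ticket starts to expire. The function names are hypothetical, the weekend is assumed to be Saturday and Sunday (which does not hold for every country in the region), and public holidays are ignored; the authoritative deadline is whatever the OLA and GGUS say.

```python
from datetime import date, timedelta


def add_working_days(start, days):
    """Return the date `days` working days after `start`.

    Assumption: Monday to Friday are working days; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def is_expiring(opened, today, limit_days=5):
    """True once `today` is past the working-day limit counted from `opened`."""
    return today > add_working_days(opened, limit_days)
```

A ticket opened on Monday 9 January 2017 would, under these assumptions, reach its limit on Monday 16 January and be a candidate for escalation from the 17th.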
Stability == Good communications
This emphasis on reliability even in imperfect operating conditions translates to humans taking appropriate actions. These actions may require just a short amount of time and minimal effort, but they help to keep user communities and peer infrastructures alike informed of the state of our services. Most importantly, however, ROD has to ensure that where support is required at site level, the experts who can provide that support are informed and are able to help. The worst-case scenario is where a site administrator is told that they have problems at their site, but has no idea how to fix them, and feels abandoned to their fate by a team that was supposed to support them.
The relationship between the sites (and of course the people who staff them) and the Regional Operations Centre is a delicate one, based on compromise and mutual support, for the greater benefit of infrastructure over the limited benefit of isolated control. Often the need to respond to operations issues, make changes to services, or indeed maintain services necessary for interoperation can seem like unnecessary overhead to a site administrator who has to deal with the incessant demands of an active local user group. It’s ROD’s job therefore to ensure that the federation’s need for a transparent and reliable infrastructure is met, as well as the site’s need to satisfy their users’ demands, local policies and the demands that the rest of their job makes on their time.
How we roll.
The ROD duty is in principle shared between two institutes in the region - CSIR Meraka and CNRST. Since the activity is not very demanding for experienced operators, it is usually done by CSIR Meraka. A few minutes each day (typically 2 pomodoros) is sufficient to go through the alarms, tickets and issues, reminding the people to whom they’ve been assigned that action is needed, and closing them where possible.
Duties and procedures
The ROD Duties are well-described on the EGI Wiki. Quoted from their wiki page :
The ROD on duty is required to:
- check alarm notifications in the Dashboard at least twice a day;
- close alarms which are in the OK state;
- handle non-OK alarms less than 24 hours old (notify the site administrators according to your NGI’s procedures);
- create tickets for alarms older than 24 hours that are not in an OK state;
- escalate tickets to NGI Management/EGI Operations if necessary (in the Dashboard);
- monitor and update any GGUS tickets up to the solved status (preferably via the Dashboard);
- handle the final state of GGUS tickets not opened from the Dashboard by changing their status to verified.
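The alarm-handling duties in the list above boil down to a small triage rule. The following is only an illustrative sketch in Python: the `Alarm` class, its fields and the action labels are hypothetical and are not part of the Operations Portal or ARGO. It also assumes that an alarm exactly 24 hours old is treated as due for a ticket, which the quoted duties do not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alarm:
    """Hypothetical stand-in for an alarm shown on the ROD dashboard."""
    site: str
    probe: str
    status: str  # e.g. "OK", "WARNING", "CRITICAL" (assumed labels)
    raised_at: datetime


def triage(alarm, now, ticket_after=timedelta(hours=24)):
    """Map one alarm to the action suggested by the duty list."""
    if alarm.status == "OK":
        return "close"          # close alarms which are in the OK state
    if now - alarm.raised_at < ticket_after:
        return "notify-site"    # non-OK and younger than 24 hours
    return "open-ticket"        # non-OK and 24 hours old or more
```

Running `triage` over every alarm on the dashboard twice a day would cover the first four duties; the ticket follow-up duties still need a human.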
Complementing this list of duties is an outline procedure to be followed. Following common procedures is an important part of ensuring interoperability between AAROC and other infrastructures. For this reason, we try to follow the standard operation procedures when addressing issues.
While the operations portal really helps in ensuring that raised alarms are indeed associated with a ticket, the last item in the list of the ROD’s duties - “handle the final state of tickets not opened from the Dashboard” - is not the easiest thing in the world to keep track of. This is especially so because some issues are not specifically related to operations, but rather to various campaigns which are undertaken either at the ROC or EGI level. Keeping track of these - specifically their deadlines and followups - is a black art.
Things we do in Africa-Arabia
Every region or NGI organises things their own way. Some have weekly meetings face-to-face, some rely on common task trackers, email lists, whiteboards, etc. We have the privilege of living in the largest region (bar, perhaps, Russia) that EGI interoperates with, and have sites which are vastly different from one another, both in the nature of the hardware and services, as well as the actual people who staff them. In the early days of SAGrid, we had the luxury of at least inhabiting the same country, speaking the same language, being more or less on the same technical level. With the fusion of ex-SAGrid sites with the ex-EUMedGrid sites during CHAIN, we were able to operate at a large scale, thereby making interoperation with peer infrastructures feasible, but we also introduced some new complexity.
Too much complexity can cause chaos, and inhibit the capacity of a system to do work (in a thermodynamic sense, you understand), but just the right amount can be a good thing. In order to organise the work that’s requested of the sites, we need tasks to be properly assigned, with clear responsibilities, deadlines and checking. Often this is done with a periodic teleconference where a task list is taken item by item to get feedback and see what still needs to be done. In my opinion however, this is more of a time-eater in an environment which is already tight on that precious commodity, so we turned to the web to make things better.
We use Slack a lot at the Africa-Arabia Regional Operations Centre, for all kinds of situational awareness and collaboration. We have an operations channel for each site, which typically only has the site admin for that site in it, as well as anyone who is working on technical issues related to the site. The idea here is to keep things focussed. ROD also creates a new channel each month to give the operators a place to quickly discuss whatever issues are faced by the region as a whole. Some pertinent aspects of this channel :
- Operators Only: Only site operators and bots which help with operations issues are in the channel.
- Issues at scale: The channel is for dealing with collective issues, not site-specific issues. There are dedicated channels for that.
- Get Stuff Done ™: Or in other words : “A little less talk, a little more action please”. This is not a place for discussion, but for ensuring that people are informed of relevant issues timeously.
- Best Before: The channel lasts a month. This makes it easy to summarise the events of the month into a handover message.
- Bots, bots, bots: We use several bots and integrations to help us keep track of who needs to do what.
The /remind command in Slack helps ROD and operators to gently remind each other of things that need to be looked at by a particular time. The Trello integration is also very useful in organising work.
The Africa-Arabia Regional Operations Centre Trello Board is organised into columns per site, as well as an incoming “triage” column and a ROD column. The site columns are integrated into the site-specific ops channel in Slack, so that changes to cards in those columns are propagated specifically to the people in the channel responsible for the site.
Site operators can also create cards for ROD, triage etc on Trello directly from Slack using the /trello integration. This makes it much easier to stay in Slack and update ROD and other teammates on progress or new issues, as well as file tasks away as “done”. Another great thing about Trello is the ability to update cards via email.
One of the things we’ve got in “triage” is the full list of GGUS tickets.
This allows ROD to keep a more comprehensive list of GGUS tickets than are present on the ROD dashboard, as well as add other issues or tasks that need to stay in context.
Finally the ability to generate and retain knowledge about technical issues was for a long time relegated to the capability of individuals to trawl their email to find that one mail that contained the magic fix for the thing that keeps breaking. Well, clearly there are better ways of communicating and remembering things than using email. Slack isn’t great for this either since it’s a more private, real-time place to chat, so we’ve turned to Discourse to build a discussion forum where longer-form discussions can be started - and most importantly ended - by marking them as “solved”.
Let’s get it together in ‘17
So, the tools are pretty much in place to make sure that we can sustain stable operations in 2017. Whether or not we actually do, as a ROC or as individual sites, really depends on our commitment to respecting the procedures we’ve set out - discussing them and updating them where necessary, but most importantly sticking to them for the most part. Let’s get the year off with the traditional “New Year’s resolution”. Although 2017 promises to be a very demanding year of tremendous activity, let’s set a few goals for ROD this year :
- Honour the Deal: We will try to stick to the deadlines specified in the ROD procedure
- Use Downtimes better: We will put sites into downtime rather than allow tickets to expire
- Improve Core Services: Site A/R has often suffered from problems in core services. We need to improve the performance of these, and be more proactive on dealing with issues which impact them.
- Tell the story: We undertake to write a monthly summary of the region’s infrastructure status and issues, by summarising the ops channel for that month, and writing a handover blog post.
Ok, 2017. Time to take you on.