relief.engineering

Slack Bots @ ITDRC

15.06.2020 — ITDRC, Bots — 3 min read

I'll have another post covering the Information Technology Disaster Resource Center (ITDRC) as an organization later on (it's super awesome and you should join us!), but first I wanted to dive into some of the tooling we've been working on to support our disaster response efforts.

The challenge

ITDRC uses a large number of tools to support our backend operations, including:

Snipe (asset/inventory management)
Salesforce (volunteer, site, and operation management)
Confluence (wiki/knowledge base)
Trello (field assignments and documentation)
Everbridge (volunteer messaging)
Slack (95% of our internal communications)
Vendor-specific tools, like access point controllers and SDN configurations

This tooling is essential for managing the hundreds of physical sites and thousands of assets we have deployed for our COVID-19 response. However, it is also challenging in a number of ways.

For example, to process one sample installation on our backend, we'd need to:

Log in to Trello to pull up the installers' field data (3 clicks)
Log in to Salesforce and create an instance of our custom object to represent the site. This means copying a large number of data fields from our submission form spreadsheet and populating additional support fields, for a total of 16 fields. Let's say optimistically that this takes 2 clicks per field, so ~32 clicks to create a Salesforce site.
Log in to our Ruckus controller to create a new zone (~6 clicks plus a lot of waiting for load screens), set up DHCP scopes (~10 clicks), set up logging (~5 clicks), and determine the external IP (~3 clicks)
Log in to Umbrella and create a new network (~5 clicks plus a lot more waiting), and add it to our policy (~7 clicks and even more waiting)
Log in to Snipe and, for each asset at the site, check in the asset (~4 clicks), then check it out to the site (~6 clicks)

In the simplest case, assuming we have only a single asset at a site, that's 55 very error-prone clicks and a lot of waiting for slow Web UIs to load and render. If we have 10 assets at a site, that's 145 clicks. Multiply this by the hundreds of sites we've set up, and my mouse is a lot shinier than it was when we started this operation a few months ago.

Besides being incredibly error-prone, that's a lot of context switching and a lot of open tabs, and we're often dropping links to these tools in Slack when we have conversations about these sites. We also have limitations on the number of accounts we can create to access our tooling due to licensing, which means that we are unable to create accounts for users and would have to share credentials if we wanted to give access to our 2,200+ volunteers (ugh). What if we could do better?

Slack Bots to the rescue

I eventually got tired of clicking (especially in Snipe), and built Botty McBotface to do some automation for us. Snipe can easily eat up the most clicks, so I started with a simple /snipe command.

The command takes care of automatically checking in assets if necessary, and provides bulk checkout capabilities when you need to check out multiple assets to a user or site, e.g. /snipe checkout [asset1 asset2 ...] [user].
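To make the command shape concrete, here is a minimal Python sketch of how the argument text of a command like /snipe checkout [asset1 asset2 ...] [user] might be split apart. The SnipeCommand type, the parse_snipe_text function, and the rule that the last token is the user or site are all assumptions made for illustration, not the bot's actual parsing logic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SnipeCommand:
    action: str          # e.g. "checkout"
    assets: list[str]    # one or more asset tags
    target: str | None   # user or site receiving the assets, if any


def parse_snipe_text(text: str) -> SnipeCommand:
    """Split the text after '/snipe' into an action, asset tags, and a target."""
    tokens = text.split()
    if not tokens:
        raise ValueError("usage: /snipe <action> [asset1 asset2 ...] [user]")
    action, rest = tokens[0].lower(), tokens[1:]
    if action == "checkout":
        # Assumption: the final token names the user or site,
        # and everything before it is an asset tag.
        if len(rest) < 2:
            raise ValueError("checkout needs at least one asset and a user or site")
        return SnipeCommand(action, rest[:-1], rest[-1])
    # Assumption: any other action takes only asset tags.
    return SnipeCommand(action, rest, None)
```

With this sketch, parse_snipe_text("checkout A100 A101 jdoe") yields the action "checkout", the asset tags A100 and A101, and the target jdoe (all made-up values).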

How this works

Slack has a powerful API that provides the ability to create apps that can be installed in Slack workspaces, including the ability to install slash commands. You can connect these slash commands to webhooks that call your middleware server of choice.
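For a rough idea of what arrives at the webhook: Slack delivers a slash command as a form-encoded POST body with fields such as command, text, and response_url, and accepts a JSON reply. The Python sketch below decodes such a body and builds a simple reply. The function names and the sample values are hypothetical, and the exact set of fields should be checked against Slack's documentation.

```python
from __future__ import annotations

import json
from urllib.parse import parse_qs


def parse_slash_command(body: str) -> dict[str, str]:
    """Decode the form-encoded body of a slash command request."""
    fields = parse_qs(body, keep_blank_values=True)
    # parse_qs returns a list per key; slash command fields are single-valued.
    return {key: values[0] for key, values in fields.items()}


def slack_reply(text: str, in_channel: bool = False) -> str:
    """Build a JSON reply; 'ephemeral' is visible only to the invoking user."""
    return json.dumps({
        "response_type": "in_channel" if in_channel else "ephemeral",
        "text": text,
    })


# Hypothetical request body, shortened to the fields used here.
example_body = "command=%2Fsnipe&text=checkout+A100+jdoe&user_name=volunteer"
payload = parse_slash_command(example_body)
```

Whatever middleware sits behind the webhook, it ends up doing some version of this: read command and text, do the work, and answer with a small JSON message.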

ITDRC has been using Zapier, which is another powerful tool that makes it relatively easy to connect data from different platforms (for example, it takes alerts from Meraki's dashboard via webhooks, and sends them to a Slack channel where we monitor our networks). Pricing is a bit steep though, and we'd easily hit the $299/month tier doing all the integrations we need to do (and Zapier's non-profit discounts are minimal). There are plenty of copycat tools, but I found the best functionality-price ratio in Integromat, which charges $29/month for roughly the same number of operations. With Integromat, I prototyped a workflow that accepted the webhook from the /snipe command I created on Slack, did some light parsing, queried Snipe's APIs, and returned the results to Slack. Pretty neat!
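The "query Snipe's APIs" step could look something like the following Python sketch, which only builds the HTTP requests and does not send them. It assumes Snipe-IT's REST API with bearer-token authentication, a lookup by asset tag, and a checkout endpoint. The paths and payload fields reflect my understanding of that API and should be verified against the Snipe-IT documentation. The base URL, token, and IDs are placeholders, and this is not the Integromat workflow itself.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def _headers(token: str) -> dict:
    # Assumption: Snipe-IT expects a bearer token and JSON content negotiation.
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }


def lookup_request(base_url: str, token: str, asset_tag: str) -> Request:
    """Build a GET request that looks an asset up by its tag."""
    url = f"{base_url.rstrip('/')}/api/v1/hardware/bytag/{quote(asset_tag, safe='')}"
    return Request(url, headers=_headers(token), method="GET")


def checkout_request(base_url: str, token: str, asset_id: int, user_id: int) -> Request:
    """Build a POST request that checks an asset out to a user."""
    url = f"{base_url.rstrip('/')}/api/v1/hardware/{asset_id}/checkout"
    body = json.dumps({"checkout_to_type": "user", "assigned_user": user_id})
    return Request(url, data=body.encode("utf-8"), headers=_headers(token), method="POST")
```

Passing either request to urllib.request.urlopen would perform the call. Keeping request construction separate from sending makes the parsing and mapping logic easy to check without touching a live Snipe instance.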

Building this workflow took an evening, and a lot of that time was spent playing around with how I wanted to structure the /snipe command and mucking around with Snipe's API (which was not nearly as bad as I'd imagined). Working with and debugging data flows in Integromat is relatively easy, though you can see the logic branching becoming complex pretty fast. We also ran into some issues with Slack returning '/snipe failed with error "operation_timeout"', often immediately after an Integromat workflow deploys, but also intermittently well after deploys. This error is triggered when Slack doesn't get a response within 3 seconds of the webhook going out. It's not clear to me what Integromat is doing behind the scenes, but my guess is that when a workflow runs from a cold start, it spins up a container, which introduces additional latency. Occasionally, when these containers are idle for long enough, they get evicted, and the next run triggers another cold-start delay. Interestingly, Slack will continue to process responses from Integromat despite the error, but it's not a great user experience.
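One common way to stay inside the 3-second window, when you control the middleware, is to acknowledge the command immediately and deliver the real result later to the response_url that Slack includes in the request. The Python sketch below illustrates that general pattern. It is not what the Integromat workflow described here does, and handle_slash_command, do_work, and send are hypothetical names.

```python
import json
import threading
from typing import Callable
from urllib.request import Request, urlopen


def post_json(url: str, payload: dict) -> None:
    """POST a JSON payload, e.g. to the response_url from the slash command."""
    req = Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10):
        pass


def handle_slash_command(
    payload: dict,
    do_work: Callable[[str], str],
    send: Callable[[str, dict], None] = post_json,
):
    """Return an acknowledgement right away; send the real result when ready."""
    def run() -> None:
        try:
            text = do_work(payload["text"])  # the slow part, e.g. Snipe API calls
        except Exception as exc:
            text = f"/snipe failed: {exc}"
        send(payload["response_url"], {"response_type": "ephemeral", "text": text})

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    # This dict is what the HTTP handler would return to Slack immediately.
    ack = {"response_type": "ephemeral", "text": "Working on it..."}
    return ack, worker
```

The user sees "Working on it..." straight away, and the result shows up as a follow-up message once the slow calls finish, so a cold start only delays the final answer instead of producing a timeout error.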