I'll have another post covering the Information Technology Disaster Resource Center (ITDRC) as an organization later on (it's super awesome and you should join us!), but first I wanted to dive into some of the tooling we've been working on to support our disaster response efforts.
ITDRC uses a large number of tools to support our backend operations, including:
This tooling is essential to help us manage the hundreds of physical sites and thousands of assets we have deployed for our COVID-19 response. However, it is also challenging in a number of ways:
For example, to process one sample installation on our backend, we'd need to:
In the simplest case, assuming we have only a single asset at a site, that's 55 very error-prone clicks and a lot of waiting for slow Web UIs to load and render. If we have 10 assets at a site, that's 145 clicks. Multiply this by the hundreds of sites we've set up, and my mouse is a lot shinier than it was when we started this operation a few months ago.
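Those two data points imply a fixed cost of about 45 clicks per site plus about 10 clicks per asset ((145 - 55) / 9 = 10 per extra asset). As a back-of-the-envelope sketch, assuming that per-asset cost stays constant (an inference from the two numbers above, not something we measured), the estimate looks like this; the 200-site, 3-asset deployment at the bottom is purely hypothetical:

```python
def clicks_per_site(assets: int, base: int = 45, per_asset: int = 10) -> int:
    """Estimate clicks to process one site.

    Assumes a fixed base cost plus a constant per-asset cost, inferred from
    55 clicks for 1 asset and 145 clicks for 10 assets.
    """
    if assets < 1:
        raise ValueError("a site has at least one asset")
    return base + per_asset * assets


def total_clicks(sites: list[int]) -> int:
    """Sum the per-site estimate over a list of per-site asset counts."""
    return sum(clicks_per_site(n) for n in sites)


if __name__ == "__main__":
    print(clicks_per_site(1))   # 55
    print(clicks_per_site(10))  # 145
    # Hypothetical deployment: 200 sites with 3 assets each.
    print(total_clicks([3] * 200))  # 15000
```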
Besides being incredibly error-prone, that's a lot of context switching and a lot of open tabs, and we're often dropping links to these tools in Slack when we have conversations about these sites. Licensing also limits the number of accounts we can create for our tooling, so we can't give every user their own account and would have to share credentials if we wanted to give access to our 2,200+ volunteers (ugh). What if we could do better?
I eventually got tired of clicking (especially in Snipe), and built Botty McBotface to do some automation for us. Snipe easily eats up the most clicks, so I started with a simple /snipe command:
This takes care of automatically checking in assets if necessary, and provides bulk checkout capabilities when you need to check out multiple assets to a user or site, e.g. /snipe checkout [asset1 asset2 ...] [user].
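The bulk checkout logic boils down to parsing the command text and, for each asset, checking it in first if it is currently assigned. A minimal sketch, assuming a hypothetical convention where the last token is the target user or site and the current assignments are already known (in practice they would come from Snipe); parse_checkout and plan_operations are illustrative names, not Botty's actual code:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckoutRequest:
    assets: list[str]
    target: str


def parse_checkout(text: str) -> CheckoutRequest:
    """Parse the text Slack passes after /snipe, e.g. 'checkout A1 A2 jdoe'.

    Hypothetical convention: the last token is the user or site, and every
    token between the subcommand and the target is an asset tag.
    """
    tokens = text.split()
    if len(tokens) < 3 or tokens[0].lower() != "checkout":
        raise ValueError("usage: /snipe checkout asset1 [asset2 ...] target")
    return CheckoutRequest(assets=tokens[1:-1], target=tokens[-1])


def plan_operations(
    req: CheckoutRequest, assigned_to: dict[str, Optional[str]]
) -> list[tuple[str, ...]]:
    """Return the ordered operations needed for a bulk checkout.

    Any asset that is currently assigned is checked in before being checked
    out to the requested target.
    """
    ops: list[tuple[str, ...]] = []
    for tag in req.assets:
        if assigned_to.get(tag) is not None:
            ops.append(("checkin", tag))
        ops.append(("checkout", tag, req.target))
    return ops
```

For example, plan_operations(parse_checkout("checkout A100 A101 site-42"), {"A100": "jdoe"}) checks A100 in, then checks both assets out to site-42.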
Slack has a powerful API for building apps that can be installed in Slack workspaces, including apps that add slash commands. Each slash command points at a webhook URL, so Slack sends the command to your middleware server of choice.
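For a sense of what that middleware receives: Slack POSTs the slash command as a form-encoded body with fields such as command, text, user_name and response_url, and accepts a JSON reply with response_type and text. A minimal standard-library sketch, assuming we ran our own server instead of Integromat; handle_slash_command and SlashCommandHandler are hypothetical names, and a production version should also verify Slack's request signature:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs


def handle_slash_command(body: str) -> dict:
    """Turn a form-encoded Slack slash-command payload into a JSON reply."""
    fields = {k: v[0] for k, v in parse_qs(body).items()}
    command = fields.get("command", "")
    text = fields.get("text", "")
    user = fields.get("user_name", "someone")
    if command != "/snipe":
        return {"response_type": "ephemeral", "text": f"Unknown command {command}"}
    # A real handler would parse `text` and call Snipe's API here.
    return {"response_type": "ephemeral", "text": f"{user} asked: /snipe {text}"}


class SlashCommandHandler(BaseHTTPRequestHandler):
    """Minimal HTTP endpoint to use as the slash command's request URL."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length).decode("utf-8")
        reply = json.dumps(handle_slash_command(body)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(reply)))
        self.end_headers()
        self.wfile.write(reply)


def run(port: int = 8080) -> None:
    """Serve until interrupted; Slack needs a public HTTPS URL in front of it."""
    HTTPServer(("", port), SlashCommandHandler).serve_forever()
```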
ITDRC has been using Zapier, which is another powerful tool that makes it relatively easy to connect data from different platforms (for example, we use it to take alerts from Meraki's dashboard via webhooks and send them to a Slack channel where we monitor our networks). Pricing is a bit steep though, and we'd easily hit the $299/month tier doing all the integrations we need to do (and Zapier's non-profit discounts are minimal). There are plenty of copycat tools, but I found the best functionality-price ratio in Integromat, which charges $29/month for roughly the same number of operations. With Integromat, I prototyped a workflow that accepted the webhook from the /snipe command I created on Slack, did some light parsing, queried Snipe's APIs, and returned the results to Slack. Pretty neat!
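The Snipe side of that workflow is a handful of REST calls. The sketch below only builds the requests (sending one is a matter of urllib.request.urlopen); the endpoint paths and field names are our reading of Snipe-IT's API documentation and should be checked against your Snipe version, and SNIPE_URL and API_TOKEN are placeholders:

```python
import json
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

SNIPE_URL = "https://snipe.example.org"  # hypothetical instance URL
API_TOKEN = "REPLACE_ME"  # hypothetical API token


def _request(method: str, path: str, payload: Optional[dict] = None) -> Request:
    """Build an authenticated JSON request against the Snipe-IT API."""
    headers = {
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
        "Content-Type": "application/json",
    }
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    return Request(f"{SNIPE_URL}/api/v1{path}", data=data, headers=headers, method=method)


def lookup_by_tag(tag: str) -> Request:
    """Look up an asset by its tag (the response includes its numeric id)."""
    return _request("GET", f"/hardware/bytag/{quote(tag)}")


def checkin(asset_id: int) -> Request:
    """Check an asset back in before reassigning it."""
    return _request("POST", f"/hardware/{asset_id}/checkin", {})


def checkout_to_user(asset_id: int, user_id: int) -> Request:
    """Check an asset out to a user."""
    return _request(
        "POST",
        f"/hardware/{asset_id}/checkout",
        {"checkout_to_type": "user", "assigned_user": user_id},
    )
```

Checking out to a site would presumably use checkout_to_type "location" with an assigned_location id instead.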
Building this workflow took an evening, and a lot of that time was spent playing around with how I wanted to structure the /snipe command and mucking around with Snipe's API (which was not nearly as bad as I'd imagined). Working with and debugging data flows in Integromat is relatively easy, though the branching logic gets complex pretty fast. We also ran into some issues with Slack returning /snipe failed with error "operation_timeout", often immediately after an Integromat workflow deploys, but also intermittently well after deploys. Slack triggers this error when it doesn't get a response within 3 seconds of sending the webhook. It's not clear to me what Integromat is doing behind the scenes, but my guess is that a workflow running from a cold start spins up a container, which adds latency, and that containers left idle long enough get evicted, triggering another cold start. Interestingly, Slack will continue to process responses from Integromat despite the error, but it's not a great user experience.
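For reference, Slack's documentation describes a general way around the 3-second limit: acknowledge the slash command right away, do the slow work in the background, and post the result to the response_url included in the payload. A minimal sketch of that pattern, assuming our own Python middleware rather than Integromat (a generic pattern, not necessarily the fix we landed on); post_to_response_url and acknowledge_then_work are hypothetical helpers:

```python
import json
import threading
from typing import Callable
from urllib.request import Request, urlopen


def post_to_response_url(url: str, message: dict) -> None:
    """Deliver a delayed reply to Slack via the command's response_url."""
    req = Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=10):
        pass


def acknowledge_then_work(
    response_url: str,
    work: Callable[[], str],
    send: Callable[[str, dict], None] = post_to_response_url,
) -> tuple[dict, threading.Thread]:
    """Return an immediate reply for Slack and run `work` in the background.

    The slow part (e.g. Snipe API calls) runs off the request thread, and its
    result is sent later through `send`, so the HTTP reply is not delayed.
    """

    def background() -> None:
        send(response_url, {"response_type": "ephemeral", "text": work()})

    thread = threading.Thread(target=background, daemon=True)
    thread.start()
    return {"response_type": "ephemeral", "text": "Working on it..."}, thread
```

The send parameter is injectable mainly so the background step can be exercised without a live Slack workspace.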
More on how we solved this soon...