Designing High Availability and Disaster Recovery for IoT/Event Hub

Before jumping into the topic, a few prerequisites are worth covering, so get comfortable with them first.

According to Wikipedia, High Availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher-than-normal period. It is commonly measured as the percentage of uptime in a given year. For details, refer to the Wikipedia article on high availability.
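To make "percentage of uptime in a given year" concrete, here is a minimal sketch that converts an availability target into a yearly downtime budget. The function name and the 365-day year are my own simplifying assumptions, not part of any SLA definition.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored (assumption)


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an uptime percentage."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows about {hours:.2f} hours of downtime per year")
```

For example, a 99.9% target leaves roughly 8.76 hours of downtime per year, while 99% leaves about 87.6 hours.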

Disaster Recovery (DR), in turn, involves a set of policies, procedures and tools to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions. For details, refer to the Wikipedia article on disaster recovery.

Azure, like any other cloud provider, has many built-in platform features that support highly available applications. However, you still need to design the application-specific logic (a checklist) that absorbs fluctuations in availability, load, and temporary failures in dependent services and hardware, so that the overall solution continues to perform acceptably, as defined by business requirements or application service-level agreements (SLAs). For details, refer to the Azure documentation on designing resilient applications.

Hopefully the information above gives a high-level picture of HA/DR. The rest of this post focuses on a specific scenario in the Internet of Things (IoT): basically, the headline 🙂

I’m intentionally skipping the conceptual part: why HA/DR matters, how to measure it, and its different enablers. Enough literature on that is available on the web.

Designing HA/DR for a solution that uses IoT/Event Hub involves a few considerations:

  • Devices are Smart – Devices should either have logic to differentiate between the primary and secondary region/site, or should not be hard-coded with any endpoint at all. One way to achieve the latter is for devices to regularly check a concierge service for the currently active endpoint. The concierge service can be a web service that is replicated and kept reachable using DNS-redirection techniques (for example, Azure Traffic Manager or AWS Route 53). You also need to ask yourself what will happen to messages when the cloud endpoint is not available. Is message loss acceptable? If yes, fine; otherwise you also need some offline storage/queue on the device.
  • Device Identities – Generally the endpoint knows the device identities. If so, all device identities should be geo-replicated or backed up, and pushed to the secondary IoT hub before the active endpoint is switched for the devices. The concierge service, and ultimately the devices, must then be made aware of the change in endpoint. You also need to develop tools/utilities to quickly upload/push device metadata to the IoT hub.
  • Delta Identification and Merge – Once the primary region becomes available again, all the state and data that have been created in the secondary site must be migrated back to the primary region. This state and data mostly relate to device identities and application metadata, which must be merged with the primary IoT hub and any other application-specific stores in the primary region.
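The "smart device" idea can be sketched roughly as follows. This is only an illustration under my own assumptions: `DeviceSender`, `resolve_endpoint` (the concierge lookup) and `send` (the transport) are hypothetical names, not part of any IoT Hub SDK, and a real device would persist the buffer to disk and poll the concierge on a timer instead of on every flush.

```python
from collections import deque
from typing import Callable, Deque


class DeviceSender:
    """Hypothetical device-side sender: asks a concierge for the active
    endpoint and buffers messages while the cloud is unreachable."""

    def __init__(
        self,
        resolve_endpoint: Callable[[], str],  # concierge lookup (assumed)
        send: Callable[[str, str], None],     # transport (assumed); raises ConnectionError on failure
        max_buffered: int = 1000,
    ) -> None:
        self._resolve_endpoint = resolve_endpoint
        self._send = send
        # Bounded in-memory queue: when full, the oldest message is dropped.
        self._buffer: Deque[str] = deque(maxlen=max_buffered)

    def publish(self, message: str) -> int:
        """Queue the message, then try to flush. Returns messages delivered."""
        self._buffer.append(message)
        return self.flush()

    def flush(self) -> int:
        delivered = 0
        try:
            # Re-resolve on every flush so a failover is picked up.
            endpoint = self._resolve_endpoint()
            while self._buffer:
                self._send(endpoint, self._buffer[0])
                self._buffer.popleft()  # remove only after a successful send
                delivered += 1
        except ConnectionError:
            pass  # keep the remaining messages for the next attempt
        return delivered
```

If message loss is acceptable for your solution, the buffer can simply be left out; if it is not, the buffer size and its persistence become part of your RPO discussion.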

How long it should take to fail over to the secondary site, and to recover from it, is solution-specific and depends on the solution’s recovery point objectives (RPOs) and recovery time objectives (RTOs).

The overall approach includes the following considerations in two major areas:

  • Device – IoT Hub
    • A secondary IoT hub
    • Back up identities to a geo-redundant store
    • Device routing logic
    • Merging identities when the primary is back
    • Either an interim message store on the device, or acceptance of message loss
  • Application Components/Storages
    • A secondary App/Services Instance
    • Enable geo-redundancy for all storage
    • Restoration of data/state from the storage in use (SQL & NoSQL)
    • Anything custom
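The "merging identities" step from the list above could look something like this minimal sketch. It assumes each registry has been exported as a dictionary keyed by device ID and that every record carries a comparable `updated_at` value; both the shape of the records and the field name are my assumptions for illustration, not an IoT Hub export format. Deletions made on the secondary are not handled here.

```python
from typing import Any, Dict

Registry = Dict[str, Dict[str, Any]]  # device_id -> identity record (assumed shape)


def find_delta(primary: Registry, secondary: Registry) -> Registry:
    """Return records that are new in, or more recently updated in, the secondary."""
    delta: Registry = {}
    for device_id, record in secondary.items():
        existing = primary.get(device_id)
        if existing is None or record["updated_at"] > existing["updated_at"]:
            delta[device_id] = record
    return delta


def merge_back(primary: Registry, secondary: Registry) -> Registry:
    """Return a copy of the primary registry with the secondary's delta applied."""
    merged = dict(primary)
    merged.update(find_delta(primary, secondary))
    return merged


if __name__ == "__main__":
    primary = {
        "dev-1": {"updated_at": 10, "status": "enabled"},
        "dev-2": {"updated_at": 12, "status": "enabled"},
    }
    secondary = {
        "dev-1": {"updated_at": 10, "status": "enabled"},   # unchanged
        "dev-2": {"updated_at": 15, "status": "disabled"},  # changed during failover
        "dev-3": {"updated_at": 14, "status": "enabled"},   # registered during failover
    }
    print(sorted(find_delta(primary, secondary)))
```

Computing the delta first, rather than overwriting the primary wholesale, keeps the amount of data pushed back small and gives you a list you can review before applying it.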

Here is the conceptual architecture diagram, which depicts the proposed solution.

Although the diagram is self-explanatory, feel free to comment or ask about anything.