Kubernetes DR strategy
Over the last few years, many organizations have chosen Kubernetes as the platform for their core business systems. It provides more reliable infrastructure and high availability at the application level. But we sometimes forget that Kubernetes itself runs on another layer of infrastructure (VMs, nodes), so we need to consider the availability of that layer as well.
All business-critical applications on the Kubernetes platform need a disaster recovery (DR) strategy along with high availability (HA), so here I will talk about how to drive a Kubernetes disaster recovery strategy. (HA is different from DR; this time I will focus on DR.)
I will also focus only on Kubernetes itself, NOT on the integrated applications or network instances linked to the clusters.
Stakeholders & Requirements
Before scoping the resources to include in the DR plan, we need to know the stakeholders and their requirements (expectations). The cost varies greatly depending on how comprehensively we prepare for DR. For instance, a major stakeholder might say: “I don’t expect a hot/warm DR site. It will be fine if I can recover in 7 days, and I don’t care how much data I lose.” In this case we have only a minimum requirement, and anything beyond that expectation is wasted effort.
- Identify the stakeholders
- Collect their requirements (scope, RTO, RPO, MTTR, etc.)
- Get them signed off, as they will shape the overall DR architecture and recovery approach
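To make the RTO/RPO discussion concrete, here is a minimal Python sketch of how signed-off requirements could be recorded and checked against a backup schedule. The class, the function names, and the example stakeholders are all hypothetical and only illustrate the idea; they do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class DrRequirement:
    """Signed-off expectation from one stakeholder (illustrative fields only)."""
    stakeholder: str
    rto: timedelta  # maximum acceptable time to restore the service
    rpo: timedelta  # maximum acceptable window of lost data


def meets_rpo(req: DrRequirement, backup_interval: timedelta) -> bool:
    # Worst case, the disaster hits just before the next backup,
    # so roughly one full backup interval of data is lost.
    return backup_interval <= req.rpo


def strictest(reqs: list[DrRequirement]) -> DrRequirement:
    # A shared cluster has to satisfy the tightest RTO and RPO of all stakeholders.
    return DrRequirement(
        stakeholder="combined",
        rto=min(r.rto for r in reqs),
        rpo=min(r.rpo for r in reqs),
    )


# Hypothetical stakeholders with very different expectations.
requirements = [
    DrRequirement("billing", rto=timedelta(hours=4), rpo=timedelta(hours=1)),
    DrRequirement("reporting", rto=timedelta(days=7), rpo=timedelta(days=1)),
]
target = strictest(requirements)
```

With these example numbers, a backup every 30 minutes satisfies the combined RPO, while a backup every 6 hours does not. The relaxed 7-day stakeholder alone would be fine with far less.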
Scope the resources to be recovered
After identifying the stakeholders (the business applications running on Kubernetes) and collecting their requirements, we need to scope the resources.
What do I mean by “scope the resources”? Resources here can be any component of the Kubernetes cluster.
- All applications to be included in the DR plan
- Resources whose loss would severely impact our business (e.g. Pods/Deployments, Dockerfiles, YAML files, etcd)
Which resources will you recover?
- Master & Worker nodes
- Network instances
- Kubernetes platform-related instances (cluster and Docker images)
Backups! (What is essential to back up)
Let me emphasize again: we are not discussing HA or resiliency testing; we are discussing DR. So we can’t expect other replicated instances to be up and running when the disaster occurs. (Even HA instances are not the same as a hot DR site.)
So we need to make sure everything we want to bring back (recover) has a backup.
- State files (configuration files)
- Cluster configuration files (YAML files)
- Certificates
- etcd
- Persistent storage (if any)
- Dockerfiles (container files) and the image registry
If you want to recover something but don’t have a backup of it, you are not ready for a disaster scenario.
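The backup checklist above can be turned into a simple readiness check. The following is a minimal Python sketch; the labels in `REQUIRED` and both function names are hypothetical, and in practice the `backed_up` set would come from whatever inventory your backup tooling provides.

```python
# Backup checklist from the list above; the strings are illustrative labels,
# not real Kubernetes resource names.
REQUIRED = {
    "state files",
    "cluster YAML",
    "certificates",
    "etcd",
    "persistent storage",
    "image registry",
}


def missing_backups(required: set[str], backed_up: set[str]) -> list[str]:
    # Anything required but not backed up cannot be recovered after a disaster.
    return sorted(required - backed_up)


def is_dr_ready(required: set[str], backed_up: set[str]) -> bool:
    # Ready only when nothing on the checklist is missing a backup.
    return not missing_backups(required, backed_up)


# Hypothetical inventory: only two items currently have backups.
gaps = missing_backups(REQUIRED, {"etcd", "certificates"})
```

Here `gaps` lists the four items that still need a backup, which is exactly the "not ready" situation described above.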
DR scenarios: specify how to simulate the failure and how to recover from it
Once you have identified the stakeholders, defined the DR scope, and checked that all backups are in place, it’s time to prepare the real documentation.
Don’t expect principal (senior) engineers to always be on standby to bring all services back to life. You may have only a less experienced technician available when the disaster happens.
Expect an ordinary on-call technician to be the one who recovers everything that was lost and saves your business. For that, you really need detailed documents and explanations so that anyone can follow the instructions and save the world.
Recovery plan
- Automated redeployment. If you have IaC such as Terraform, you can redeploy everything to a different region (a cold DR site) or to a different zone in the same region.
- Manual redeployment. If only part of the Kubernetes cluster is broken, you can identify that part and recover just that area.
- Well-prepared scripts. If you have backed-up resources, you can prepare a script that recovers, for example, the database from its backup in a single automated run.
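As an illustration of the "well-prepared script" idea, here is a minimal Python sketch of a runbook runner that an on-call technician could execute step by step. The `Step` class, the `run_runbook` function, the manifest directory, and the backup name are all hypothetical placeholders; the two commands assume `kubectl` and Velero are installed and that a Velero backup with that name exists. The dry-run default only prints the commands, so the plan can be reviewed before anything is changed.

```python
import shlex
import subprocess
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    command: list[str]


def run_runbook(steps: list[Step], dry_run: bool = True) -> list[str]:
    """Handle steps in order and return the command lines that were handled."""
    handled = []
    for number, step in enumerate(steps, start=1):
        line = shlex.join(step.command)
        print(f"[{number}/{len(steps)}] {step.description}: {line}")
        if not dry_run:
            # check=True raises CalledProcessError on failure,
            # so later steps never run against a half-restored cluster.
            subprocess.run(step.command, check=True)
        handled.append(line)
    return handled


# Hypothetical recovery steps; paths and names are placeholders.
RESTORE_STEPS = [
    Step("Recreate cluster objects from saved manifests",
         ["kubectl", "apply", "-f", "backup/manifests/"]),
    Step("Restore namespaces and volumes from a Velero backup",
         ["velero", "restore", "create", "--from-backup", "nightly-backup"]),
]

planned = run_runbook(RESTORE_STEPS)  # dry run: prints the plan only
```

Calling `run_runbook(RESTORE_STEPS, dry_run=False)` would actually execute the commands. Keeping each step's description next to its command means the script doubles as part of the written documentation.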
Backup solutions
There are many solutions that help organizations perform backup and DR.
Most of them help us back up Kubernetes storage.
- Velero (https://velero.io/)
- KubeDR (https://github.com/catalogicsoftware/kubedr)
- Cloud providers (if you are using a cloud such as GCP, AWS, or Azure, the provider has its own solutions for backing up clusters)
- Kasten (https://www.kasten.io/)
- Portworx (https://portworx.com/products/px-backup/)
- Cohesity (https://www.cohesity.com/solutions/backup-and-recovery/kubernetes/)
- OpenEBS (https://openebs.io/)
- Rancher Longhorn (https://rancher.com/products/longhorn/)
- Trilio TrilioVault (https://www.trilio.io/triliovault-for-kubernetes/)