15 minutes
The ability of a system recover from infrastructure or service desruption, and dynamically acquire computing resources to meet demand & migrate desruptions such as misconfiguration or transient network issues.
- Test "Recovery Procedure" :
Use automation to simulate different failures or recreate scenarios that led to failovers before.
- Automatically Recover from failure :
Anticipate & remidiate failures before they occur.
- Scale Horizontally to increase Aggregate System Availablity:
Distribute Request across multiple smaller resources to ensure that they dont share a common point of failure.
- Stop Guessing Capacity:
Maintain the optimal level to satisfy demand without over/under provisioning it. i.e. Use AutoScaling
- Manage Change in Automation: Use Automation to make changes to infrastructure.
- Foundation: IAM, VPC, Service Limits, Trusted Advisor
- Change Management: AWS AutoScaling, CloudWatch, CloudTrail, Config
- Failure Management: AWS Backs, CloudFormation, S3, S3 Glacier, Route53