The 6 AWS Well-Architected pillars¶
In this note, we discuss the 6 “pillars” introduced in the AWS Well-Architected Framework:
- The operational excellence pillar focuses on running and monitoring systems, and continually improving processes and procedures. Key topics include automating changes, responding to events, and defining standards to manage daily operations.
- The security pillar focuses on protecting information and systems. Key topics include confidentiality and integrity of data, managing user permissions, and establishing controls to detect security events.
- The reliability pillar focuses on workloads performing their intended functions and how to recover quickly from failure to meet demands. Key topics include distributed system design, recovery planning, and adapting to changing requirements.
- The performance efficiency pillar focuses on structured and streamlined allocation of IT and computing resources. Key topics include selecting resource types and sizes optimized for workload requirements, monitoring performance, and maintaining efficiency as business needs evolve.
- The cost optimization pillar focuses on avoiding unnecessary costs. Key topics include understanding spending over time and controlling fund allocation, selecting resources of the right type and quantity, and scaling to meet business needs without overspending.
- The sustainability pillar focuses on minimizing the environmental impacts of running cloud workloads. Key topics include a shared responsibility model for sustainability, understanding impact, and maximizing utilization to minimize required resources and reduce downstream impacts.
These pillars apply to any robust IaaS/PaaS environment, whether from a major hyperscaler, a European provider, or a sophisticated self-hosted private cloud (like Hop3).
Discussion¶
Here are additional points for each pillar.
1. The Operational Excellence Pillar¶
This pillar is about running your workloads effectively, gaining insight into their operations, and continuously improving supporting processes and procedures to deliver business value. It’s the “how” of managing your environment day-to-day and evolving it.
- Preparation and Planning: Before deploying any workload, operational readiness must be assessed. This involves defining clear operational requirements, understanding dependencies, and ensuring that runbooks and playbooks are in place. This includes planning for capacity, monitoring, and logging from the outset. Standardized architectures and deployment patterns are crucial here, as they make operations more predictable and repeatable.
- Automation of Operations: Manual operations are prone to error, slow, and don’t scale well. This pillar heavily emphasizes automating everything possible: infrastructure provisioning (Infrastructure as Code - IaC), configuration management, deployments (CI/CD pipelines), patching, backups, and even responses to common operational events. Automation ensures consistency, reduces toil, and allows teams to focus on higher-value activities.
- Monitoring and Observability: You can’t manage what you can’t measure. Comprehensive monitoring is essential, collecting metrics, logs, and traces from all layers of the stack (infrastructure, platform, application). Observability goes a step further, enabling you to ask arbitrary questions about your system’s state without pre-defining all metrics. This allows for proactive identification of issues, faster troubleshooting, and understanding system behavior under various conditions. Effective alerting mechanisms are critical to notify the right people at the right time.
- Event and Incident Management: When incidents occur (and they will), having well-defined processes to detect, respond to, and recover from them is vital. This includes clear escalation paths, communication strategies, and roles/responsibilities. Post-incident, conducting blameless retrospectives (post-mortems) is key to identifying root causes and implementing preventative measures, fostering a culture of continuous learning and improvement.
- Continuous Improvement (Kaizen): Operational excellence is not a one-time achievement but an ongoing process. Regularly reviewing operational procedures, analyzing operational data (e.g., incident frequency, resolution times, deployment success rates), and seeking feedback helps identify areas for refinement. This iterative approach ensures that operations become more efficient, reliable, and aligned with business goals over time. This also includes regularly updating documentation, runbooks, and automation scripts.
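The operational-data review described above can be made concrete with a couple of small metrics. A minimal sketch, using invented incident and deployment records (not any real monitoring API):

```python
# Illustrative computation of two of the improvement metrics mentioned
# above: mean time to resolve (MTTR) and deployment success rate.
# All records below are made-up sample data.

incidents = [  # (detected_at_minute, resolved_at_minute)
    (0, 30),
    (100, 160),
    (400, 490),
]
deployments = ["ok", "ok", "failed", "ok"]

mttr = sum(end - start for start, end in incidents) / len(incidents)
success_rate = deployments.count("ok") / len(deployments)

print(f"MTTR: {mttr:.0f} min, deployment success rate: {success_rate:.0%}")
```

Tracking such numbers release over release is what turns "continuous improvement" from a slogan into a measurable feedback loop.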
2. The Security Pillar¶
Security is foundational and must be integrated into every aspect of your system design and operation. It’s about protecting data, systems, and assets by leveraging a defense-in-depth strategy.
- Identity and Access Management (IAM): Controlling who (identities – users, services, applications) can do what (permissions) on which resources is paramount. This involves implementing strong authentication mechanisms (e.g., multi-factor authentication), practicing the principle of least privilege (granting only necessary permissions), and regularly reviewing access. Role-Based Access Control (RBAC) simplifies permission management. Centralized identity providers help manage identities consistently across various services.
- Detective Controls: It’s crucial to be able to detect potential security threats or incidents. This involves implementing logging and monitoring for security-relevant events, using intrusion detection/prevention systems (IDS/IPS), security information and event management (SIEM) systems, and anomaly detection. Regular security assessments, vulnerability scanning, and penetration testing help proactively identify weaknesses.
- Infrastructure Protection: This involves protecting your compute, network, and storage resources. Key practices include network segmentation (e.g., using virtual private networks, subnets, security groups/firewalls), protecting against Distributed Denial of Service (DDoS) attacks, implementing web application firewalls (WAFs), and ensuring secure configurations for all infrastructure components. This also extends to physical security if managing on-premises components of a hybrid system.
- Data Protection: Data must be protected both at rest (while stored) and in transit (while moving across networks). This is achieved through encryption, using strong cryptographic algorithms and robust key management practices. Data classification is also important to identify sensitive data and apply appropriate levels of protection. Implementing data loss prevention (DLP) strategies can help prevent accidental or malicious data exfiltration.
- Incident Response: Despite best efforts, security incidents may occur. Having a well-defined and practiced incident response plan is critical. This includes steps for containment, eradication, recovery, and post-incident analysis. Clear communication channels and roles are essential during an incident. Learning from incidents helps strengthen security posture.
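The least-privilege and RBAC ideas above can be sketched in a few lines. This is a deny-by-default toy model with hypothetical role and permission names, not any provider's IAM API:

```python
# Minimal RBAC sketch: identities are bound to roles, roles hold explicit
# permissions, and anything not granted is denied (least privilege).
# Role, permission, and user names are all hypothetical.

ROLES = {
    "viewer": {"storage:read"},
    "operator": {"storage:read", "compute:restart"},
}
BINDINGS = {"alice": ["viewer"], "bob": ["operator"]}

def is_allowed(user: str, permission: str) -> bool:
    """Deny by default; allow only if some bound role grants the permission."""
    return any(permission in ROLES.get(role, set())
               for role in BINDINGS.get(user, []))

assert is_allowed("bob", "compute:restart")
assert not is_allowed("alice", "compute:restart")  # least privilege: denied
```

The important property is the default: an identity with no binding, or a permission no role grants, is denied without any special-casing.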
3. The Reliability Pillar¶
This pillar ensures that a workload performs its intended function correctly and consistently when expected. It’s about designing systems that can withstand failures and recover quickly, minimizing disruption to users and business operations.
- Foundations and Change Management: Reliability starts with a solid foundation. This includes appropriate resource planning (network bandwidth, compute capacity) and careful management of quotas or service limits imposed by the platform provider. Change management is critical: a large share of outages are triggered by changes, so implementing controlled, automated, and reversible change processes reduces risk.
- Failure Detection and Recovery: Systems should be designed to automatically detect failures at all levels (hardware, software, network, dependencies). Upon detection, automated recovery mechanisms should initiate, such as failing over to redundant components, restarting services, or rerouting traffic. Defining clear Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) guides the design of these mechanisms.
- Resiliency and Redundancy: Design for failure. This means building redundancy into every critical component. This can involve deploying applications across multiple isolated locations (e.g., availability zones or regions), using load balancers to distribute traffic, and implementing data replication. Architectures should be fault-tolerant, meaning they can continue operating even if some components fail.
- Testing and Simulation: Regularly test failure and recovery procedures. This includes backup restoration drills, failover tests, and even chaos engineering (intentionally injecting failures into a live system) to verify that the system behaves as expected under adverse conditions. This builds confidence in the system’s ability to withstand real-world failures.
- Capacity and Scalability: Workloads must be able to adapt to changes in demand. This involves designing for scalability, both up/down (vertical scaling – increasing resources of existing instances) and in/out (horizontal scaling – adding or removing instances). Automated scaling based on demand ensures performance while optimizing costs. Capacity planning helps anticipate future needs.
4. The Performance Efficiency Pillar¶
This pillar focuses on using computing resources efficiently to meet system requirements and maintaining that efficiency as demand changes and technologies evolve. It’s about getting the most “bang for your buck” in terms of performance.
- Resource Selection: Choosing the right type, size, and configuration of resources (compute instances, storage types, database services, network configurations) is fundamental. This requires understanding the workload’s specific performance characteristics (CPU-bound, memory-bound, I/O-bound, network-bound). Over-provisioning wastes resources, while under-provisioning leads to poor performance.
- Performance Monitoring and Analysis: Continuously monitor key performance indicators (KPIs) such as response time, throughput, error rates, and resource utilization. Analyzing this data helps identify performance bottlenecks and areas for optimization. This data is also crucial for making informed decisions about scaling and resource selection.
- Optimization Strategies: There are many ways to optimize performance:
- Compute: Right-sizing instances, using optimized instance families for specific tasks (e.g., compute-optimized, memory-optimized).
- Storage: Selecting appropriate storage tiers (e.g., high-performance SSDs for databases, lower-cost object storage for archives), optimizing I/O patterns.
- Database: Query optimization, indexing, choosing appropriate database engines, read replicas.
- Network: Optimizing data transfer, using content delivery networks (CDNs), choosing regions close to users.
- Application Code: Efficient algorithms, caching, connection pooling, asynchronous processing.
- Review and Evolve: Technology evolves rapidly. Regularly review and adopt new, more efficient technologies and architectural patterns (e.g., serverless, containers, newer instance types) that can improve performance and efficiency. What was optimal yesterday might not be today.
- Trade-offs: Performance often involves trade-offs with other pillars, particularly cost and sometimes reliability. For instance, achieving extremely low latency might require more expensive resources. Understanding these trade-offs and making conscious decisions based on business requirements is key.
5. The Cost Optimization Pillar¶
This pillar is about achieving business outcomes and delivering value at the lowest possible price point. It’s not just about being cheap, but about being smart with spending and avoiding unnecessary costs.
- Expenditure Awareness and Tracking: You can’t optimize what you can’t see. Implementing robust cost tracking, monitoring, and reporting mechanisms is crucial. This includes using tagging to allocate costs to specific projects, departments, or applications. Setting budgets and alerts helps prevent unexpected overruns.
- Cost-Effective Resource Selection: Similar to performance efficiency, choosing the right resources is key, but with a cost focus. This means selecting the most cost-effective instance types, storage options, and services that still meet performance and reliability requirements. Evaluating different pricing models (e.g., on-demand, reserved capacity, spot/interruptible instances) can lead to significant savings.
- Matching Supply with Demand (Elasticity): Avoid over-provisioning by designing systems that can scale dynamically based on actual demand. Automate the scaling process to add resources when needed and, importantly, remove them when they are no longer required. Shutting down non-production environments (dev, test, staging) during non-business hours is a common and effective tactic.
- Optimizing Over Time: Cost optimization is an ongoing process. Regularly review spending, analyze cost trends, and identify opportunities for further savings. As new services and pricing models become available, re-evaluate existing architectures. Decommission unused resources promptly.
- Managed Services and Outsourcing: Leveraging managed services (e.g., managed databases, serverless functions, container orchestration platforms) can often reduce operational overhead, which translates to lower total cost of ownership (TCO), even if the direct service cost seems higher. The platform provider takes on tasks like patching, backups, and scaling the underlying infrastructure.
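The tagging and budget-alert points above can be sketched as a small aggregation. Line items, tag keys, and budget figures are invented for illustration:

```python
# Sketch of tag-based cost allocation with a budget check, per the
# expenditure-awareness point above. All figures are invented.

from collections import defaultdict

line_items = [  # (cost, tags) — one entry per billed resource
    (120.0, {"project": "webshop", "env": "prod"}),
    (40.0,  {"project": "webshop", "env": "staging"}),
    (75.0,  {"project": "analytics", "env": "prod"}),
]
budgets = {"webshop": 150.0, "analytics": 100.0}

spend: dict[str, float] = defaultdict(float)
for cost, tags in line_items:
    spend[tags["project"]] += cost  # roll costs up by the "project" tag

over_budget = [p for p, total in spend.items() if total > budgets[p]]
assert spend["webshop"] == 160.0
assert over_budget == ["webshop"]  # alert: staging spend pushed it over
```

Consistent tagging is the prerequisite: an untagged resource simply disappears from this kind of report, which is why tag enforcement policies are worth automating.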
6. The Sustainability Pillar¶
This pillar focuses on the long-term environmental, economic, and societal impacts of your workloads. It encourages designing and operating systems that minimize resource consumption and environmental footprint.
- Understanding Impact: The first step is to understand the environmental impact of your workloads. This can be challenging, but some platform providers offer tools or data to estimate the carbon footprint associated with resource consumption. Consider factors like energy consumption, hardware manufacturing, and e-waste.
- Maximizing Utilization: Running services on fewer, more highly utilized resources is generally more sustainable. This means avoiding idle resources and consolidating workloads where possible. Virtualization and containerization technologies are key enablers here, as is adopting serverless architectures that only consume resources when actively processing requests.
- Efficient Resource Provisioning and Configuration: Choose the most energy-efficient hardware (if self-hosting) or select regions and services from your IaaS/PaaS provider that are known for their energy efficiency or use of renewable energy sources. Right-sizing instances to match workload demand prevents over-provisioning, which wastes energy.
- Data Management and Transfer: Storing and transferring large amounts of data consumes energy. Implement data lifecycle management policies to delete or archive unnecessary data. Optimize data transfer patterns to reduce the amount of data moved across networks, particularly over long distances. Use compression and efficient data formats.
- Software and Architecture Design: Design applications to be efficient. This includes writing lean code, optimizing algorithms, and choosing architectures that minimize resource consumption per transaction or user. For example, asynchronous processing or event-driven architectures can often be more resource-efficient than traditional request-response models for certain tasks.
- Managed Services: Similar to cost optimization, managed services can contribute to sustainability. Platform providers operating at scale can achieve higher levels of utilization and energy efficiency in their data centers than most individual organizations could on their own.
- Organizational Culture and Goals: Integrate sustainability considerations into your organization’s cloud strategy and governance. Set sustainability goals and track progress. Educate development and operations teams on sustainable design principles. This aligns technological choices with broader corporate social responsibility objectives.
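The "maximizing utilization" point above is essentially a packing problem: run the same workloads on fewer, fuller hosts. A minimal first-fit sketch, with capacities and demands in arbitrary illustrative units:

```python
# Sketch of workload consolidation: greedy first-fit packing of workload
# demands onto as few hosts as possible, instead of one host each.
# Demands and capacity are illustrative units (e.g., percent of a host).

def consolidate(demands: list[int], capacity: int) -> list[list[int]]:
    """Pack demands onto hosts first-fit, largest workloads first."""
    hosts: list[list[int]] = []
    for d in sorted(demands, reverse=True):
        for host in hosts:
            if sum(host) + d <= capacity:
                host.append(d)  # fits on an existing host
                break
        else:
            hosts.append([d])  # no room anywhere: provision a new host
    return hosts

workloads = [60, 30, 50, 20, 40]
hosts = consolidate(workloads, capacity=100)
assert len(hosts) == 2  # five workloads run on two hosts, not five
```

This is the same mechanism that virtualization, containers, and serverless platforms apply at scale: fewer idle machines means less energy per unit of useful work.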
Page last modified: 2025-05-28 08:41:39