
Attendee guide: Reliability engineering

If you’re on a mission to find out how you can improve the resilience, reliability, and availability of your application, this guide aims to lead you to some of the sessions that can help you level up on these topics.

Sathyajith Bhat
Container Hero

Staff Software Engineer, Site Reliability - The Trade Desk Inc

Reliability is a strong focus for AWS, and it is one of the six pillars of the AWS Well-Architected Framework. As a service owner and maintainer, ensuring the reliability of your service can be a challenging endeavor.

 

This guide for reliability engineering focuses on talks, workshops, and labs that should interest attendees looking to build, operate, and maintain reliable systems. Whether you are an experienced reliability engineer looking to enhance your knowledge or a keen engineer looking to embark on your reliability engineering journey, this guide should set you on the right path.

 

Reliability engineering sessions often make use of advanced features of many AWS services and are therefore categorized at the 300 or 400 level. You will find that many of my session suggestions align with these levels. It might be advantageous to explore the 200-level sessions on those services or do some light reading before attending the sessions.

General guidance

Tips for your first AWS re:Invent

If you’re a first-time attendee at re:Invent, you might be very tempted to fill your calendar with numerous sessions. I made the same mistake and realized I was burned out by day three. AWS re:Invent is not just another tech conference; it’s better to pace yourself and limit your schedule to two or three sessions a day for the first day or two. You can gradually increase the number of sessions on subsequent days. Trust me; you’ll never run out of sessions to attend!

 

Regarding session types, I would recommend giving priority to chalk talks, workshops, or labs over breakout sessions. You can watch most breakout sessions on demand later, but the experience of collaborating and discussing with fellow attendees in the other sessions is invaluable.

 

I also suggest watching the keynotes on topics that interest you. The lines for keynote viewing can be quite long, so it’s best to catch them from the comfort of your hotel room or, if you have a session scheduled immediately after a keynote, in the content hub rooms. Also keep in mind that AWS may launch new services during re:Invent week, and some of those services might address long-standing problems, so leave room in your schedule for a session or two about the newly launched services.


Breakout sessions

ARC305 | Resilient architectures at scale: Real-world use cases from Amazon.com

A perennial favorite session by Seth Eliot. Seth has worked with multiple Amazon teams, helping them incorporate Well-Architected best practices for building resilient workloads, and he will share those teams’ technology and processes for running services in the cloud. Join this session to learn from Amazon’s experience scaling and building resilient systems.


ARC309 | Build applications that recover from an Availability Zone impairment

Your service is now running in multiple Availability Zones—but is it prepared to handle Availability Zone failures? Join this session to find out how you can implement mechanisms to route requests away from occasional, temporary failures—both hard and gray—in an Availability Zone using capabilities such as the Amazon Route 53 Application Recovery Controller zonal shift.


ARC308 | Best practices for creating multi-Region architectures on AWS

Just deploying an application to multiple Regions doesn’t make for a resilient system. Building multi-Region architectures often comes with a new set of challenges around dependencies, infrastructure, data replication, observability, and more. Whether you need to expand to multiple Regions to improve resilience, adhere to governmental data regulations, or improve end-user latency, this session highlights best practices, design principles, and sample architectures to help you meet your requirements.


CON401 | Deep dive into Amazon ECS resilience and availability

Often, the best learnings about system reliability come from people who have already built and operated such systems. In this session, learn how Amazon ECS can help you address your requirements for running resilient and reliable applications. Dive deep into how Amazon ECS service architecture, design, and operational practices provide a secure and resilient foundation for your applications.


COP319 | Best practices for modern application observability

Reliable systems often need a pane of glass to look through and see how they are performing under the hood. Observability is the ability to measure a system’s current state based on generated output such as logs, metrics, and traces. Join this session to delve into the best practices for modern applications and container observability. Discover how to effectively monitor, analyze, and troubleshoot complex distributed systems, microservices architectures, and cloud-native environments with AWS observability.


DAT333 | Building highly resilient applications with Amazon DynamoDB

If you’re looking at building resilient applications and can make use of nonrelational datasets, this session will suit you. Join this session to explore Amazon DynamoDB capabilities like redundant storage, automatic throughput scaling, and multi-active, multi-Region data replication to achieve the reliability levels that your application demands.


Builders’ sessions

DAT306 | Improve resilience of database workloads by using chaos engineering

Chaos engineering is the process of testing a system by deliberately introducing unexpected failures to examine how an application behaves. During chaos testing, databases are often left out on the expectation that they will hum along with no issues. This builders’ session helps you build a standard architectural pattern for injecting failures and running chaos engineering experiments on Amazon Aurora and Amazon RDS database instances to test for high availability.
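
To make the idea of deliberately injecting failures concrete, here is a minimal sketch in plain Python. It is not the pattern taught in the session and does not touch Amazon Aurora or Amazon RDS; `FlakyDatabase`, `query_with_retry`, and the failure rate are hypothetical names and values chosen only for illustration. The wrapper fails a configurable fraction of calls so you can observe whether the calling code copes.

```python
import random


class FlakyDatabase:
    """Hypothetical wrapper that fails a fraction of queries on purpose."""

    def __init__(self, query_fn, failure_rate, seed=0):
        self.query_fn = query_fn
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so an experiment is repeatable
        self.injected = 0               # number of faults injected so far

    def query(self, sql):
        if self.rng.random() < self.failure_rate:
            self.injected += 1
            raise ConnectionError("injected fault")
        return self.query_fn(sql)


def query_with_retry(db, sql, attempts=5):
    """Code under test: retry a failed query up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return db.query(sql)
        except ConnectionError as err:
            last_error = err
    raise last_error


# Experiment: inject faults into roughly 30% of calls and count the
# requests that still succeed thanks to the retry logic.
db = FlakyDatabase(lambda sql: "ok", failure_rate=0.3)
succeeded = 0
for _ in range(100):
    try:
        query_with_retry(db, "SELECT 1")
        succeeded += 1
    except ConnectionError:
        pass
print(f"injected faults: {db.injected}, successful requests: {succeeded}/100")
```

Setting `failure_rate` to 1.0 simulates a full outage, which is a quick way to confirm that the retry loop eventually gives up rather than spinning forever.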


NET303 | How to manage your network using infrastructure as code

Reliability engineers often spend time improving the network infrastructure, especially for cases of expanding to multiple Regions. In this builders’ session, learn about recommended approaches to using infrastructure as code (IaC) to manage multi-Region networks.


Chalk talks

API305 | Building resilience in decoupled applications with dead-letter queues

Managed systems offload a lot of operational overhead, but that doesn’t mean they are fully resilient. In decoupled applications, an event that cannot be processed is often assumed to be lost. In this chalk talk, you will learn about dead-letter queues (DLQs), which provide a mechanism for handling messages that cannot be processed successfully, and how to use them to manage unconsumed messages in decoupled applications.
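
As a rough illustration of the dead-letter queue idea, the sketch below uses in-memory Python queues rather than any AWS service. The `Message` class, the `drain` function, and the `max_receives` limit are hypothetical names chosen for this example; the limit plays a role loosely similar to the `maxReceiveCount` setting in an Amazon SQS redrive policy. A message that keeps failing is moved aside for inspection instead of being lost or retried forever.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Message:
    body: str
    receive_count: int = 0  # how many times delivery has been attempted


def drain(queue, dead_letters, handler, max_receives=3):
    """Process messages until the main queue is empty.

    A message whose handler raises goes back on the queue; once it has been
    attempted max_receives times it is moved to dead_letters instead.
    """
    while queue:
        msg = queue.popleft()
        msg.receive_count += 1
        try:
            handler(msg.body)
        except Exception:
            if msg.receive_count >= max_receives:
                dead_letters.append(msg)  # park it for later inspection
            else:
                queue.append(msg)         # leave it for another attempt


def handler(body):
    # Hypothetical consumer that cannot parse one particular payload.
    if body == "malformed":
        raise ValueError("cannot parse message")


main_queue = deque([Message("order-1"), Message("malformed"), Message("order-2")])
dlq = deque()
drain(main_queue, dlq, handler)
print([m.body for m in dlq], dlq[0].receive_count)  # ['malformed'] 3
```

The healthy messages are consumed normally, while the unprocessable one ends up in `dlq` after three attempts, where it can be examined or replayed once the consumer is fixed.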


BOA204 | Hands-on with multi-Region architectures

As a reliability engineer, you’ve been tasked with the job of expanding your application to multiple Regions. In this 200-level chalk talk, ask your questions and learn how to implement a multi-Region strategy, using many of the tools and features that AWS offers.


DOP319 | Zero-downtime deployment strategies

If you’re looking for tools, tips, and strategies to improve your deployments and adopt zero-downtime deployments, this chalk talk is for you. Hear about multiple options for deploying changes to the Amazon EC2, Amazon ECS, and AWS Lambda compute platforms using AWS CodeDeploy, AWS CloudFormation, AWS Cloud Development Kit (AWS CDK), and Amazon CodeCatalyst.


NET316 | Networking ask-me-anything talk

Have questions about improving your network resiliency? Not sure if you should migrate from AWS Transit Gateway or how Amazon VPC Lattice will fit your architecture? This chalk talk covers the breadth of AWS networking services—from load balancing to AWS Verified Access to AWS PrivateLink.


DAT319 | Making your Amazon Aurora cluster more resilient

If you’re looking to improve the reliability of your database, this chalk talk is for you. Learn how to make use of Amazon Aurora Global Database and Amazon RDS Proxy, and find out about best practices for achieving high availability and disaster recovery.


COP316 | Automating incident response with Incident Manager

Incident response provides a system for responding to and managing an incident and is another part of reliability engineering. If you’re looking to set up incident management, join this chalk talk to learn how to prepare for incidents, act automatically when a critical issue is detected and flagged by an Amazon CloudWatch alarm or Amazon EventBridge event, and perform post-incident analysis.


Workshops

ARC201 | Monitoring resilient architectures with AWS Resilience Hub

AWS Resilience Hub provides a central place to define, validate, and track the resilience of your applications on AWS. In this workshop, explore how to proactively manage your application’s resilience posture by enabling a notification system that will alert you when your application no longer meets your recovery objectives, which happens when real-world environments change in unexpected, unplanned, or even planned ways.


ARC301 | Advanced Multi-AZ resilience patterns: Mitigating gray failures

To run workloads reliably, AWS recommends running them in multiple Availability Zones. While detecting an entire Availability Zone failure and draining workloads away from it is relatively straightforward if done well, detecting gray failures (failures that different workloads observe in different ways) is much harder. This workshop demonstrates how you can use services like Amazon CloudWatch to detect gray failures and implement two mitigation patterns to respond to them.
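
One way to picture gray failure detection is to compare each Availability Zone against its peers rather than against a fixed alarm threshold. The sketch below is an illustrative assumption, not the workshop's method and not a CloudWatch API: the function name `find_outlier_zones`, the zone labels, and the `min_rate` and `ratio` thresholds are all hypothetical.

```python
def find_outlier_zones(stats, min_rate=0.05, ratio=3.0):
    """Return zones whose error rate is at least min_rate and at least
    `ratio` times the combined error rate of all the other zones.

    stats maps a zone name to a tuple of (error count, request count).
    """
    outliers = []
    for zone, (errors, requests) in stats.items():
        other_errors = sum(e for z, (e, _) in stats.items() if z != zone)
        other_requests = sum(r for z, (_, r) in stats.items() if z != zone)
        rate = errors / requests if requests else 0.0
        other_rate = other_errors / other_requests if other_requests else 0.0
        if rate >= min_rate and rate >= ratio * other_rate:
            outliers.append(zone)
    return outliers


# Hypothetical per-zone counts over the same time window.
stats = {
    "az-1": (12, 10_000),
    "az-2": (900, 10_000),  # degraded, but still serving most requests
    "az-3": (15, 10_000),
}
print(find_outlier_zones(stats))  # ['az-2']
```

Comparing zones with each other helps separate a zonal problem from a Regional one: if every zone shows the same elevated error rate, none is flagged, which suggests that shifting traffic away from a single zone would not help.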


NET302 | Become a network support expert: We break it, you fix it

Network issues are one of the major causes of outages. As a reliability engineer, having a networking background can go a long way toward reducing mitigation time. What better way to improve your networking skills than learning from actual failures? In this workshop, you assume the role of a network support engineer at a fictitious company. Your task is to solve various issues in your AWS network environment and help your colleagues with network tasks. By the end of the workshop, you are likely to have improved your troubleshooting skills, used AWS services and features in new ways, and learned more about operating a network on AWS.


FSI304 | Make applications highly resilient with AWS Fault Injection Simulator

AWS Fault Injection Simulator (FIS) is a fully managed service for running fault injection experiments to improve an application’s performance, observability, and resiliency. One of the tenets of reliability engineering is to perform frequent drills of controlled failures to observe and improve application reliability. Although this workshop uses specific technologies (Java, containers, PostgreSQL, and Apache Kafka), I would highly recommend attending it even if you’re not familiar with them, to learn how you can apply similar tests to your own application stack.


DOP308 | Enforcing development standards with Amazon CodeCatalyst

This workshop highlights Amazon CodeCatalyst functionality that empowers central teams to enforce policies on development teams. It will be interesting to see how CodeCatalyst does that.


OPN301 | Accelerate your serverless journey with Powertools for AWS Lambda

Powertools for AWS Lambda has grown to be a truly helpful set of tools that enables developers to build better serverless applications. This workshop introduces the toolkit and shows you how to use it.


FWM303 | Create a cross-platform Flutter application with AWS Amplify

Flutter is a popular cross-platform technology. This workshop introduces how to use it with AWS Amplify.


DOP302 | Build software faster with Amazon CodeCatalyst

This workshop gives you hands-on experience with Amazon CodeCatalyst.


Closing comments

Whether this is your first re:Invent or you’re a multi-year veteran, I hope you find this guide helpful in navigating the session catalog toward sessions on reliability engineering. Remember to make the most of the networking options available, whether via PeerTalk, the “hallway track,” the community booths, or the AWS Heroes Lounge. Don’t hesitate to meet and talk to people.

© 2023, Amazon Web Services, Inc. or its affiliates. All rights reserved.