Nisän Haramati

Distributed Systems Engineer at NVIDIA

Talks


Title

Resilience Engineering: Identifying Reliability Dependencies and Common Mitigation Strategies

Abstract

Learn how to identify reliability dependencies and apply mitigation strategies such as failure isolation, failure injection, end-to-end testing, and system-level fuzzing. These skills help you reduce the occurrence of unexpected failures and improve the reliability of your systems.

Description

Resilience engineering is one of those things we all want to do, but never quite find the time to practice. There are always fires to put out and new infrastructure to deploy.

This talk will cover the fundamentals required for applying resilience engineering principles in your day-to-day work. It will cover how to identify the system dependencies that affect component-level and system-level service quality, and then go over some of the mitigation strategies that you can employ to prevent and alleviate failures in your own environment, starting from the simplest and cheapest, and then increasing in terms of time, cost, and knowledge requirements.

It will cover:
  1. Dependency mapping: identifying the pathways through which service level degradation in one component can affect that of another (e.g. backpressure, partial outages, network connections).
  2. Failure isolation: isolating components and eliminating failure cascades where possible.
  3. Destructive testing and failure injection: learning about the failure mechanisms and behaviours of your system. This can be done in a separate “lab” environment or directly in a production system. The latter is often called Chaos Engineering.
  4. End-to-end testing: conformance testing and verification for overall system behaviour and output under varying conditions and sequences of events.
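
As a rough illustration of the first two items, dependency mapping and failure isolation can be sketched as a graph walk. This is a minimal sketch and not material from the talk: the service names, the `DEPENDS_ON` map, and the `blast_radius` function are all hypothetical, and it assumes that an "isolated" edge (for example, a call protected by a timeout and a fallback) fully stops a failure from propagating, which real systems only approximate.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "web": ["api"],
    "api": ["auth", "db"],
    "auth": ["db"],
    "reports": ["db"],
    "db": [],
}


def blast_radius(depends_on, failed, isolated=frozenset()):
    """Return every service that can degrade when `failed` degrades.

    `isolated` holds (caller, dependency) edges that are assumed to be
    protected, so a failure does not propagate across them.
    """
    # Invert the map: for each service, record which services call it.
    callers = {}
    for caller, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(caller)

    # Breadth-first walk from the failed service up through its callers.
    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in callers.get(current, []):
            if (caller, current) in isolated or caller in affected:
                continue
            affected.add(caller)
            queue.append(caller)
    return affected


# Without isolation, a "db" failure can reach every other service.
unprotected = blast_radius(DEPENDS_ON, "db")

# Isolating only api -> db is not enough: "api" is still reachable via "auth".
one_edge = blast_radius(DEPENDS_ON, "db", isolated={("api", "db")})

# Isolating both of api's paths to "db" keeps "api" and "web" unaffected.
two_edges = blast_radius(
    DEPENDS_ON, "db", isolated={("api", "db"), ("api", "auth")}
)
```

In this toy map, isolating a single edge leaves the cascade intact because the failure still arrives through an indirect path, which is the kind of hidden pathway a dependency map is meant to expose.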

Additional Resources

Testing

  1. Resources on Testing Distributed Systems
  2. Testing a Distributed System (End to End testing)
  3. Types of Tests
  4. Chaos Engineering

Resilience

  1. Resilience Engineering
  2. Resilience Engineering Resources