Key Concepts

This guide covers the essential terminology and concepts for Harness Resilience Testing.

Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent and unexpected conditions in production.

Core Principles

Steady State Hypothesis

The steady state represents your system's normal operating condition. Before running chaos experiments, you define:

Measurable system outputs that indicate normal behavior
Baseline metrics using Service Level Objectives (SLOs)
Acceptable thresholds for system performance

Example: "Our API should maintain 99.9% availability with response times under 200ms during normal operations."

Blast Radius

The scope of impact a chaos experiment can have on your system. Best practices:

Start small with non-critical systems or components
Gradually increase experiment scope as confidence grows
Use infrastructure controls to limit impact
Implement automatic rollback mechanisms

Hypothesis-Driven Testing

Each chaos experiment follows a scientific approach:

Identify the steady state and specify SLOs
Hypothesize what will happen when a fault is injected
Inject the failure in a controlled manner with minimal blast radius
Validate whether the system maintains steady state and meets SLOs

Key Components

Chaos Experiments

A chaos experiment is a set of operations coupled together to inject faults into a target resource and validate the system's resilience. Each experiment:

Targets specific infrastructure or application components
Injects one or more chaos faults in a defined sequence
Uses probes to validate system behavior
Can include custom actions for notifications or integrations
Generates a resilience score based on results

Chaos Faults

Chaos faults are pre-built failure scenarios that simulate real-world issues:

Fault Categories

Kubernetes Faults: Pod deletions, container kills, resource stress
Cloud Platform Faults: AWS, GCP, Azure service disruptions
Infrastructure Faults: CPU stress, memory exhaustion, network latency, disk pressure
Application Faults: Service failures, error injection, timeout simulation

Harness provides 200+ ready-to-use chaos faults in the Enterprise ChaosHub.

Resilience Probes

Resilience probes are health validation mechanisms that run during chaos experiments to verify your system maintains its steady state:

Probe Modes

Continuous Mode: Monitor throughout experiment duration
Edge Mode: Check at specific experiment phases (before, during, after)
OnChaos Mode: Validate only during fault injection

Probe Types

HTTP Probe: Validate API endpoints and services
Command Probe: Execute custom commands and validate output
Prometheus Probe: Query Prometheus metrics
Datadog Probe: Query Datadog metrics
Dynatrace Probe: Query Dynatrace metrics

Actions

Actions are custom tasks that execute within experiments:

Send notifications to Slack, PagerDuty, or email
Trigger webhooks for external integrations
Execute custom scripts or commands
Add delays between experiment steps
Integrate with monitoring and observability tools

ChaosHub

A ChaosHub is a centralized repository for reusable chaos engineering resources:

Enterprise ChaosHub

Provided by Harness with 200+ chaos faults
Regularly updated with new faults and improvements
Available across all projects and organizations

Custom ChaosHub

Create your own hub for organization-specific resources
Store custom experiment templates
Share probe templates across teams
Version control using Git integration

Chaos Infrastructure

Chaos infrastructure represents the target environment where chaos experiments execute:

Deployment Models

Agent-Based: Deploy chaos agents on Linux or Windows hosts
Agentless: Use Harness Delegate for Kubernetes and cloud resources

Supported Targets

Kubernetes clusters (EKS, GKE, AKS, OpenShift)
Linux and Windows hosts
AWS, GCP, Azure cloud resources
VMware infrastructure

Load Testing

Load Testing validates that your system can handle expected and peak traffic while maintaining performance and reliability.

Virtual Users

A virtual user (VU) simulates a real user executing your defined scenario: sending HTTP requests, waiting for responses, and looping continuously for the duration of the test. The Number of Users setting controls peak concurrency.

Load Profile

The load profile defines how virtual users are introduced over time:

Ramp-Up Phase: Users increase linearly from 0 to the target count over the configured Ramp-Up Duration. This models realistic traffic growth and avoids an artificial cold-start spike.
Steady-State Phase: After ramp-up, the full user count runs for the remaining test duration (Test Duration - Ramp-Up Duration).

Scenario

A scenario is the sequence of HTTP requests each virtual user executes. A scenario can be defined visually in the UI editor or through a custom Locust Python script.

Assertions

Assertions define per-request success criteria. Two types are supported:

Text: Validates that the response body contains a specific string
Response Time: Validates that the response arrives within a specified latency threshold

Requests that fail assertions are counted as errors in test results.

Load Test Infrastructure

A Linux host with the Harness chaos agent installed with the --load flag. This is the same Linux Chaos Infrastructure used for chaos testing, with load testing support enabled. The agent receives execution commands, runs the Locust process locally, and streams metrics back to Harness. See Load Test Infrastructure.

Disaster Recovery Testing

Disaster Recovery Testing validates that backup systems, failover mechanisms, and recovery procedures work during catastrophic scenarios. Each DR test is a Harness pipeline stage, giving you full orchestration of failover, validation, and notification steps.

RTO and RPO

Recovery Time Objective (RTO): The maximum acceptable time for a system to be restored to full operation after a failure. DR tests validate that your recovery procedures complete within this window.
Recovery Point Objective (RPO): The maximum acceptable amount of data loss measured in time. DR tests validate backup recency and data consistency after a simulated recovery.

Pipeline-Based DR Tests

DR tests are built using Harness Pipeline Studio. Each DR test is a stage with four tabs:

Overview: Stage name, objective, timeout, and stage variables
Environment: Target Harness environment, Chaos Infrastructure, and stage-level failure strategy
Execution: Step canvas where you compose the recovery workflow using DR-specific steps (Chaos Probe, Chaos Fault, Chaos Action) and standard Harness steps
Advanced: Delegate selector, conditional execution, looping strategy, and step-level failure strategy

Failure Strategy

Defines what happens when a step or stage encounters an error. You can handle specific failure types (Authentication Errors, Connectivity Errors, Timeout Errors, etc.) with actions like Rollback Pipeline, Retry Step, Abort, Manual Intervention, or Mark As Failure.

Conditional Execution

Controls whether a stage runs based on pipeline state: on success (default), on failure, always, or via a custom JEXL expression. Useful for running rollback stages only when a failover stage fails.

See DR Testing Concepts for a full breakdown of all concepts.

Harness Resilience Testing Concepts

These concepts apply across all resilience testing activities in the Harness platform.

Environments

Logical groupings of your infrastructure where tests are executed:

Organize resources by purpose (dev, staging, production)
Control access and permissions per environment
Isolate test execution to specific infrastructure scopes

Governance

Controls and policies to ensure safe, controlled testing:

RBAC (Role-Based Access Control)

Fine-grained permissions for who can:

Create and modify tests
Execute tests on specific infrastructure
View results and analytics
Manage governance policies

ChaosGuard

ChaosGuard provides advanced governance specifically for chaos experiments:

Define when experiments can run (time windows)
Specify where experiments can execute (infrastructure scope)
Control what faults can be injected (fault restrictions)
Set approval requirements for high-risk tests

Risks

Automated identification and tracking of system weaknesses:

Resilience Risks: Potential failure points in your system
Performance Risks: Bottlenecks and capacity issues
Compliance Risks: Gaps in meeting RTO/RPO targets

Risks are discovered automatically and can be validated through targeted tests.

Next Steps

Now that you understand the core concepts:

Architecture: Learn about the control plane and execution plane architecture
Get Started with Chaos Testing: Run your first chaos experiment
Explore Chaos Faults: Browse 200+ ready-to-use fault scenarios
Set Up Governance: Configure RBAC and ChaosGuard for safe testing

Chaos Engineering​

Core Principles​

Steady State Hypothesis​

Blast Radius​

Hypothesis-Driven Testing​

Key Components​

Chaos Experiments​

Chaos Faults​

Fault Categories​

Resilience Probes​

Probe Modes​

Probe Types​

Actions​

ChaosHub​

Enterprise ChaosHub​

Custom ChaosHub​

Chaos Infrastructure​

Deployment Models​

Supported Targets​

Load Testing​

Virtual Users​

Load Profile​

Scenario​

Assertions​

Load Test Infrastructure​

Disaster Recovery Testing​

RTO and RPO​

Pipeline-Based DR Tests​

Failure Strategy​

Conditional Execution​

Harness Resilience Testing Concepts​

Environments​

Governance​

RBAC (Role-Based Access Control)​

ChaosGuard​

Risks​

Next Steps​