RCA Example & QA Perspective

Scenario: Timeout Error in Payment Service

Problem Description: In the production environment, 15% of users cannot complete payment transactions and receive a "Transaction timed out" error. The issue occurs more frequently during peak hours.

Impact:

  • Daily transaction loss

  • Customer complaints

  • Increased ticket volume to support team

5 Whys Technique Application

Step | Question | Answer
1 | Why are users unable to make payments? | The Payment API returns a timeout error
2 | Why does the API time out? | The database query responds late
3 | Why does the query respond late? | A full table scan is performed on the Orders table
4 | Why is a full table scan performed? | No index is defined on the relevant column
5 | Why is the index not defined? | Table growth was not anticipated, and performance testing was not conducted

Root Cause: The orders.user_id column lacks an index, and performance testing under load was not performed.
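
The link between the missing index and the full table scan can be checked by inspecting the query plan before and after adding the index. The following is a minimal sketch, not the incident's actual setup: the source does not name the database engine, so SQLite (via Python's built-in sqlite3 module) is used as a stand-in, and the table layout, the amount column, and the index name idx_orders_user_id are assumptions made for illustration.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

# Hypothetical schema: only the user_id column comes from the scenario.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (user_id, amount) VALUES (?, ?)",
    [(i % 1000, 9.99) for i in range(10_000)],
)

lookup = "SELECT id, amount FROM orders WHERE user_id = ?"

# Without an index, the planner has to scan the whole table.
plan_before = query_plan(conn, lookup, (42,))

# With an index on user_id, the planner can search the index instead.
conn.execute("CREATE INDEX idx_orders_user_id ON orders (user_id)")
plan_after = query_plan(conn, lookup, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

In SQLite the first plan is reported as a SCAN and the second as a SEARCH that names the index. Other engines expose the same information through their own EXPLAIN output, with different wording.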

Fishbone Analysis (Ishikawa Diagram)

The following table categorizes the factors contributing to the payment timeout error:

Category | Factors
PEOPLE | Performance check was skipped during code review; DBA approval was not obtained
METHOD | Load testing was not performed; DB migration checklist was incomplete
SYSTEM | Index is missing; Connection pool is insufficient
ENVIRONMENT | Peak hour traffic increase; Concurrent user count increased 3x
MEASUREMENT | DB response time was not monitored; Alert threshold was not defined

Main Problem: PAYMENT TIMEOUT ERROR

Actions Taken

Short-Term (Immediate Fix)

  • Index added to orders.user_id column

  • Connection pool size increased

  • API timeout duration optimized
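
A rough capacity estimate helps show why the index and the connection pool size interact. Each connection can finish at most one query per query duration, so the pool as a whole sustains at most pool size divided by query time. The sketch below applies that bound; the pool size and both latencies are hypothetical figures chosen for illustration, not measurements from this incident.

```python
def max_throughput(pool_size: int, query_seconds: float) -> float:
    """Upper bound on queries per second that a connection pool can sustain.

    Each connection completes at most 1 / query_seconds queries per second,
    so the whole pool completes at most pool_size / query_seconds.
    """
    if pool_size <= 0 or query_seconds <= 0:
        raise ValueError("pool_size and query_seconds must be positive")
    return pool_size / query_seconds

# Hypothetical figures, not taken from the incident.
POOL_SIZE = 20
SLOW_QUERY_S = 2.0    # assumed latency with a full table scan
FAST_QUERY_S = 0.02   # assumed latency with an index lookup

slow_limit = max_throughput(POOL_SIZE, SLOW_QUERY_S)
fast_limit = max_throughput(POOL_SIZE, FAST_QUERY_S)
doubled_pool_limit = max_throughput(POOL_SIZE * 2, SLOW_QUERY_S)

print(f"slow query, pool of {POOL_SIZE}:     {slow_limit:.0f} queries/s")
print(f"slow query, pool of {POOL_SIZE * 2}:     {doubled_pool_limit:.0f} queries/s")
print(f"indexed query, pool of {POOL_SIZE}:  {fast_limit:.0f} queries/s")
```

Under these assumed numbers, doubling the pool only doubles the ceiling, while shortening the query raises it by the same factor as the latency drop. Requests arriving faster than the ceiling queue for a connection, which is one way a slow query can surface as an API timeout during peak hours.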

Medium-Term (Preventive Actions)

  • Index review step added to DB migration checklist

  • Load testing process integrated into CI/CD pipeline

  • Monitoring and alerts defined for database response time
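
An alert on database response time is commonly expressed as a percentile compared against a threshold. The following is a minimal sketch of that check, assuming a nearest-rank 95th percentile; the function names, the sample latencies, and the 200 ms threshold are all hypothetical, and a real setup would normally rely on the monitoring system's own alert rules.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def should_alert(latencies_ms, threshold_ms, pct=95):
    """True when the chosen percentile of DB response times exceeds the threshold."""
    return percentile(latencies_ms, pct) > threshold_ms

# Hypothetical samples in milliseconds, for illustration only.
normal_hours = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
peak_hours = [12, 15, 900, 14, 1200, 16, 1500, 15, 1100, 13]

THRESHOLD_MS = 200  # assumed alert threshold

print(should_alert(normal_hours, THRESHOLD_MS))
print(should_alert(peak_hours, THRESHOLD_MS))
```

A percentile is used here instead of the mean so that a minority of slow requests, such as the 15% of users affected in this scenario, is not hidden by the fast majority.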

Long-Term (Process Improvement)

  • Performance testing strategy documented

  • Performance criteria added to code review checklist

  • Capacity planning process established

QA's Role in the RCA Process

QA engineers should take an active role in the RCA process and ask the following questions:

Test Coverage Analysis

  • Was this scenario within test coverage?

  • Was load testing performed? If so, under what conditions?

  • Were edge cases evaluated?

Test Strategy Assessment

  • Why did the current testing approach fail to catch this bug?

  • Which type of testing was missing? (Performance, Load, Stress)

  • How accurately did the test environment reflect production?
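
If the answer to the second question is load testing, the gap can be closed with a check that runs concurrent calls and fails when latency exceeds a budget. The sketch below shows the idea using only the Python standard library. The fake_payment_call function is a hypothetical stand-in for a real request to the payment API, and the call count, concurrency, and 1-second budget are assumptions; a dedicated load testing tool would be the usual choice in practice.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def fake_payment_call():
    """Hypothetical stand-in for a request to the payment API."""
    time.sleep(0.005)

def timed_call(fn):
    """Run fn once and return its duration in seconds (queue wait excluded)."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def run_load(fn, total_calls, concurrency):
    """Execute fn total_calls times using up to `concurrency` worker threads."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_call(fn), range(total_calls)))

def p95(latencies):
    """Nearest-rank 95th percentile of a non-empty list of durations."""
    ordered = sorted(latencies)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]

def passes_gate(latencies, budget_s):
    """True when the 95th percentile latency stays within the budget."""
    return p95(latencies) <= budget_s

latencies = run_load(fake_payment_call, total_calls=50, concurrency=10)
print(f"p95 = {p95(latencies) * 1000:.1f} ms, "
      f"gate passed: {passes_gate(latencies, budget_s=1.0)}")
```

How closely such a test predicts production behavior depends on the third question: the data volume and concurrency in the test environment need to resemble production, otherwise a missing index on a small table will not show up.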

Regression Strategy

  • Which areas carry regression risk after this fix?

  • What should be added to automation coverage?

  • Should the smoke test suite be updated?

Defect Pattern Analysis

  • Have similar bugs occurred before?

  • Is there a common pattern?

  • Which module or component is at risk?

RCA Report Template

An RCA report should include the following sections:

Summary

  • Incident ID: INC-2024-0892

  • Date: 2024-01-15

  • Severity: P1

  • Affected System: Payment Service

  • Impact Duration: 3 hours 20 minutes

Problem Description

[Description of the problem and how it was detected]

Timeline

  • 09:15 - First alert triggered

  • 09:22 - Incident opened

  • 09:45 - Root cause identified

  • 10:30 - Hotfix deployed

  • 12:35 - Service stabilized
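
The intervals between timeline entries can be derived from the timestamps, which is a quick way to confirm that the Impact Duration in the Summary matches the Timeline. The following is a small sketch using the sample entries above; it assumes all events fall on the same day, and the minutes_between helper is a hypothetical name introduced for illustration.

```python
from datetime import datetime

# Sample timeline entries from the template above.
timeline = [
    ("09:15", "First alert triggered"),
    ("09:22", "Incident opened"),
    ("09:45", "Root cause identified"),
    ("10:30", "Hotfix deployed"),
    ("12:35", "Service stabilized"),
]

def minutes_between(start, end):
    """Whole minutes from start to end, both 'HH:MM' on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)

first_alert = timeline[0][0]
for clock, event in timeline:
    elapsed = minutes_between(first_alert, clock)
    print(f"{clock}  +{elapsed:>3} min  {event}")

total = minutes_between(first_alert, timeline[-1][0])
print(f"Impact duration: {total // 60} hours {total % 60} minutes")
```

For the sample values, the span from the first alert to stabilization comes to 3 hours 20 minutes, consistent with the Impact Duration listed in the Summary.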

Root Cause

[Root cause identified through 5 Whys or other technique used]

Impact Analysis

  • Number of affected users

  • Number of failed transactions

  • Financial impact (if any)

Actions Taken Table

Action | Owner | Due Date | Status
Add index | DB Team | 2024-01-16 | Completed
Add load testing | QA Team | 2024-01-22 | In Progress
Set up monitoring | DevOps Team | 2024-01-25 | Planned
Update checklist | QA Team | 2024-01-20 | Completed

Lessons Learned

[Conclusions drawn regarding process, system, and team]

Approval

  • Prepared by: [Name]

  • Approved by: [Name]

  • Date: [Date]

Lessons Learned

Questions to be evaluated as a team at the end of each RCA process:

From a Process Perspective:

  • Which control points were missing?

  • Which existing processes should be updated?

  • Is there an automation opportunity?

From a Technical Perspective:

  • Is an architectural or design change required?

  • Are monitoring and alerting sufficient?

  • Is documentation up to date?

From a Team Perspective:

  • Was there a knowledge gap?

  • Was a training need identified?

  • Were communication processes adequate?

The outputs of the lessons-learned review should be shared with the entire team and stored as a reference to help prevent similar incidents.