QA & SDET Handbook: Kiểm Thử Core Banking Phân Tán

Chuỗi bài (Phần 8 của 8): Bài tổng kết series tập hợp toàn bộ chiến lược kiểm thử dành riêng cho từng layer của Core Banking Architecture đã được đề cập ở các phần trước — từ ledger consistency đến distributed SQL, Saga, ISO 20022, API Security, và Streaming Fraud Detection.

Tại Sao Core Banking Cần SDET Riêng?

Kiểm thử hệ thống tài chính phân tán đòi hỏi kỹ năng chuyên biệt ngoài unit tests thông thường. Các lỗi nghiêm trọng nhất thường chỉ xuất hiện khi có concurrency cao, network failures, clock drift, hoặc partial system failures — những điều kiện không thể reproduce bằng integration tests đơn giản.

6 category chiến lược kiểm thử trong bài này tương ứng trực tiếp với từng phần của series:

Test Category	Tương ứng với Phần	Risk nếu bỏ qua
Double-Entry Invariant	Phần 1 (Ledger)	Double-spend, unbalanced GL
Distributed SQL & Clock	Phần 2 (Distributed SQL)	Split-brain, stale reads
Event Replay & Outbox	Phần 3 (Event Sourcing)	Data inconsistency, lost events
Saga Compensation	Phần 4 (Saga)	Orphaned holds, money stuck
Idempotency & API	Phần 5, 6 (ISO 20022, Security)	Double-charge, token theft
Flink State & SLA	Phần 7 (Fraud Detection)	Undetected fraud, false positives

Category 1: Double-Entry Invariant Auditing

Test 1.1: Concurrent Double-Spend Prevention

Mục tiêu: 100 goroutines đồng thời rút tiền từ một tài khoản — chỉ số yêu cầu đủ balance được phép thành công.

func TestConcurrentDoubleSpend(t *testing.T) {
    const (
        numWorkers     = 100
        withdrawAmount = 10_000   // 10,000 VND
        initialBalance = 100_000  // 100,000 VND
    )
    
    // Setup: tạo account với balance cố định
    accountID := createTestAccount(initialBalance)
    
    var (
        successCount atomic.Int64
        wg           sync.WaitGroup
    )
    
    // Chạy 100 concurrent withdrawals
    for i := 0; i < numWorkers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            err := withdrawFromAccount(accountID, withdrawAmount)
            if err == nil {
                successCount.Add(1)
            }
        }()
    }
    wg.Wait()
    
    // Chỉ 10 requests được phép thành công
    assert.Equal(t, int64(10), successCount.Load(),
        "Chính xác 10 withdrawals được phép với balance 100,000 VND")
    
    // Balance cuối phải bằng 0 — không âm, không double-counted
    finalBalance := getAccountBalance(accountID)
    assert.Equal(t, int64(0), finalBalance,
        "Balance sau khi rút hết phải = 0, không âm")
    
    // Ledger invariant: SUM(DEBIT) = SUM(CREDIT)
    imbalance := checkLedgerBalance(accountID)
    assert.Equal(t, int64(0), imbalance,
        "Sổ cái phải cân bằng: SUM(DEBIT) = SUM(CREDIT)")
}

Test 1.2: Continuous Reconciliation Job

// Chạy mỗi 5 phút trong production monitoring
func RunLedgerReconciliation(ctx context.Context, db *sql.DB) ([]DiscrepancyReport, error) {
    query := `
        SELECT
            transaction_id,
            SUM(CASE WHEN direction = 'DEBIT' THEN amount ELSE -amount END) AS discrepancy
        FROM entries
        GROUP BY transaction_id
        HAVING SUM(CASE WHEN direction = 'DEBIT' THEN amount ELSE -amount END) <> 0
        LIMIT 100
    `
    
    rows, err := db.QueryContext(ctx, query)
    if err != nil {
        return nil, err
    }
    defer rows.Close()
    
    var reports []DiscrepancyReport
    for rows.Next() {
        var report DiscrepancyReport
        rows.Scan(&report.TransactionID, &report.Discrepancy)
        reports = append(reports, report)
    }
    
    if len(reports) > 0 {
        // CRITICAL: Fire P1 alert — ledger balance violated
        fireP1Alert(ctx, "LEDGER_IMBALANCE", reports)
    }
    
    return reports, nil
}

Test 1.3: Deadlock Prevention Verification

func TestDeadlockFreeTransfers(t *testing.T) {
    // Tạo 2 accounts
    accountA := createTestAccount(1_000_000)
    accountB := createTestAccount(1_000_000)
    
    var wg sync.WaitGroup
    errs := make(chan error, 100)
    
    // 50 goroutines: A → B
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            err := transferBetween(accountA, accountB, 1000)
            if err != nil {
                errs <- err
            }
        }()
    }
    
    // 50 goroutines: B → A (tạo ra điều kiện deadlock nếu lock order sai)
    for i := 0; i < 50; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            err := transferBetween(accountB, accountA, 1000)
            if err != nil {
                errs <- err
            }
        }()
    }
    
    // Chạy với timeout để detect deadlock
    done := make(chan struct{})
    go func() { wg.Wait(); close(done) }()
    
    select {
    case <-done:
        // No deadlock — tất cả goroutines hoàn thành
    case <-time.After(10 * time.Second):
        t.Fatal("DEADLOCK DETECTED: transfers không hoàn thành trong 10s")
    }
    
    // Tổng balance phải không đổi
    totalBalance := getAccountBalance(accountA) + getAccountBalance(accountB)
    assert.Equal(t, int64(2_000_000), totalBalance,
        "Tổng balance phải không thay đổi sau tất cả transfers")
}

Category 2: Distributed SQL & Clock Resilience Testing

Test 2.1: Network Partition (Split-Brain) Simulation

#!/bin/bash
# Simulation: 5-node CockroachDB cluster bị partition thành 3 + 2

MINORITY_NODES=("node4" "node5")
MAJORITY_NODES=("node1" "node2" "node3")

echo "=== Starting network partition simulation ==="

# Drop packets giữa minority và majority nodes
for node in "${MINORITY_NODES[@]}"; do
    # SSH vào node và drop packets đến majority
    ssh "$node" "sudo tc qdisc add dev eth0 root netem loss 100%"
    echo "Partitioned: $node disconnected from cluster"
done

sleep 5  # Đợi partition có hiệu lực

echo "=== Testing write behavior during partition ==="

# Test: Write trên majority side phải thành công
echo "Testing majority write..."
cockroach sql --host=node1:26257 --insecure \
    --execute="INSERT INTO test_transactions VALUES (gen_random_uuid(), 1000, 'VND', NOW())"
echo "Majority write: EXPECTED SUCCESS"

# Test: Write trên minority side phải fail
echo "Testing minority write..."
cockroach sql --host=node4:26257 --insecure --timeout=5s \
    --execute="INSERT INTO test_transactions VALUES (gen_random_uuid(), 1000, 'VND', NOW())" \
    && echo "FAIL: Minority write succeeded (should have failed!)" \
    || echo "PASS: Minority write correctly rejected"

echo "=== Healing partition ==="
for node in "${MINORITY_NODES[@]}"; do
    ssh "$node" "sudo tc qdisc del dev eth0 root"
done

sleep 10  # Đợi cluster đồng bộ

# Verify: Minority nodes phải catch up về consistent state
cockroach node status --host=node1:26257 --insecure
echo "All nodes should show consistent RANGES count"

Test 2.2: Clock Skew Injection (libfaketime)

#!/bin/bash
# Inject clock drift vượt max_clock_offset của CockroachDB (500ms)

# Install libfaketime
apt-get install -y libfaketime

# Test với clock drift = 600ms (vượt 500ms threshold)
echo "=== Testing with 600ms clock drift ==="
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1 \
FAKETIME="+0.6s" \
go test ./distributed/... -run TestClockSkewResilience -v 2>&1

# Kỳ vọng: Database phải detect và reject hoặc retry
# KHÔNG được: return stale/out-of-order data

# Test với TiDB: inject drift > TSO timestamp
echo "=== Testing TiDB clock skew ==="
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/faketime/libfaketime.so.1 \
FAKETIME="+2s" \
go test ./tidb/... -run TestTSOClockDrift -v 2>&1

// Go test: Verify database reaction to clock skew
func TestClockSkewResilience(t *testing.T) {
    // Hàm này chạy DƯỚI faketime injection (600ms drift)
    
    // Attempt write trên node với drifted clock
    err := performDatabaseWrite("test-account-001", 50000)
    
    // CockroachDB phải:
    // - Detect clock skew > 500ms offset
    // - Either: reject với clear error, or
    // - Retry với timestamp push
    
    if err != nil {
        // Verify error là clock-related, không phải network/business error
        assert.Contains(t, err.Error(), "clock skew",
            "Error phải mention clock skew")
    } else {
        // Nếu succeed: verify read của chúng ta không bị stale
        result := readLastWrite("test-account-001")
        assert.NotNil(t, result, "Nếu write succeed, phải đọc được ngay")
    }
}

Category 3: Event Sourcing & Outbox Testing

Test 3.1: Event Store Replay Consistency

func TestEventReplayConsistency(t *testing.T) {
    ctx := context.Background()
    
    // 1. Verify "live" balance từ CQRS read model
    liveBalance := getReadModelBalance(ctx, "account-001")
    t.Logf("Live balance: %d", liveBalance)
    
    // 2. Drop và rebuild read model từ event store
    dropAccountBalancesTable(ctx)
    replayAllEventsFromEventStore(ctx)
    
    // 3. So sánh replayed balance với live balance
    replayedBalance := getReadModelBalance(ctx, "account-001")
    
    assert.Equal(t, liveBalance, replayedBalance,
        "Replayed balance PHẢI khớp chính xác với live balance")
}

Test 3.2: Outbox Atomicity — DB Success, Kafka Fail

func TestOutboxAtomicityKafkaFailure(t *testing.T) {
    ctx := context.Background()
    
    // Mock Kafka để fail publish
    mockKafka := NewFailingKafkaPublisher()
    
    initialBalanceA := getAccountBalance("account-A")
    initialOutboxCount := countPendingOutboxEvents()
    
    // Execute transfer — Kafka bị mocked để fail
    err := transferService.WithKafka(mockKafka).Transfer(ctx, TransferRequest{
        From: "account-A", To: "account-B", Amount: 500_000,
    })
    
    // Transfer database THÀNH CÔNG (vì trong cùng DB transaction)
    assert.NoError(t, err, "Transfer phải succeed ở database level")
    
    // Balance đã thay đổi
    newBalanceA := getAccountBalance("account-A")
    assert.Equal(t, initialBalanceA-500_000, newBalanceA)
    
    // Outbox event vẫn ở PENDING (Kafka fail)
    pendingCount := countPendingOutboxEvents()
    assert.Greater(t, pendingCount, initialOutboxCount,
        "Phải có PENDING outbox events do Kafka fail")
    
    // Sau khi Kafka recover → outbox worker sẽ retry publish
    mockKafka.Recover()
    waitForOutboxWorkerProcessing(5 * time.Second)
    
    // Verify: Kafka đã nhận event
    assert.True(t, mockKafka.HasPublished("MoneyTransferred"),
        "Outbox worker phải retry và publish event lên Kafka")
    
    // Final balance check
    assert.Equal(t, initialBalanceA-500_000, getAccountBalance("account-A"))
    assert.Equal(t, int64(500_000), getAccountBalance("account-B"))
}

Category 4: Saga Compensation Testing

Test 4.1: Step 2 Failure → Compensation Verification

func TestSagaStep2FailCompensation(t *testing.T) {
    ctx := context.Background()
    
    initialBalanceA := getAccountBalance("account-A")
    initialBalanceB := getAccountBalance("account-B")
    
    // Mock Payment Gateway để fail ở step 2
    mockGateway := NewFailingPaymentGateway(
        FailAt: PaymentSubmission,
        Error:  "NAPAS_TIMEOUT",
    )
    
    // Execute saga
    err := sagaOrchestrator.WithGateway(mockGateway).ExecuteTransfer(ctx, TransferRequest{
        From: "account-A", To: "account-B", Amount: 1_000_000,
    })
    
    // Saga phải return error
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "NAPAS_TIMEOUT")
    
    // Chờ compensation hoàn thành
    waitForCompensation(5 * time.Second)
    
    // CRITICAL: Balance A phải được hoàn về nguyên trạng
    finalBalanceA := getAccountBalance("account-A")
    assert.Equal(t, initialBalanceA, finalBalanceA,
        "Compensation PHẢI hoàn tiền về account-A")
    
    // Balance B phải không thay đổi
    finalBalanceB := getAccountBalance("account-B")
    assert.Equal(t, initialBalanceB, finalBalanceB,
        "Account-B không được nhận tiền khi saga failed")
}

Test 4.2: Compensation Chain Failure → DLQ Routing

func TestCompensationFailureRoutesToDLQ(t *testing.T) {
    ctx := context.Background()
    
    // Mock: Step 2 fail VÀ RefundActivity cũng fail
    mockGateway := NewFailingPaymentGateway(FailAt: PaymentSubmission)
    mockRefund := NewFailingRefundService() // Compensation cũng fail
    
    dlqCapture := NewDLQCapture()
    alertCapture := NewAlertCapture()
    
    sagaOrchestrator.
        WithGateway(mockGateway).
        WithRefund(mockRefund).
        WithDLQ(dlqCapture).
        WithAlerts(alertCapture).
        ExecuteTransfer(ctx, validTransferRequest)
    
    // Chờ retry exhaust (3 attempts × exponential backoff)
    waitForRetryExhaustion(30 * time.Second)
    
    // Phải có event trong DLQ
    assert.Greater(t, dlqCapture.Count(), 0,
        "Failed compensation PHẢI route vào DLQ")
    
    // DLQ event phải chứa đủ context để ops team xử lý
    dlqEvent := dlqCapture.Last()
    assert.NotEmpty(t, dlqEvent.TransferID)
    assert.NotEmpty(t, dlqEvent.Reason)
    
    // Phải có P1 alert
    assert.True(t, alertCapture.HasFiredP1(),
        "P1 alert PHẢI được fired cho failed compensation")
}

Category 5: Idempotency & API Security Testing

Test 5.1: Concurrent Double-Submit

func TestConcurrentDoubleSubmit(t *testing.T) {
    const idempotencyKey = "transfer-idempotency-key-xyz-001"
    
    type result struct {
        statusCode int
        body       string
    }
    
    results := make(chan result, 2)
    
    // Fire 2 requests đồng thời với cùng idempotency key
    for i := 0; i < 2; i++ {
        go func() {
            resp, _ := httpClient.Post("/v1/transfers",
                "application/json",
                buildTransferBody(idempotencyKey, 1_000_000),
            )
            body, _ := io.ReadAll(resp.Body)
            results <- result{resp.StatusCode, string(body)}
        }()
    }
    
    r1 := <-results
    r2 := <-results
    
    codes := []int{r1.statusCode, r2.statusCode}
    
    // Đúng 1 request nhận 201 Created
    createdCount := 0
    for _, code := range codes {
        if code == 201 {
            createdCount++
        }
    }
    assert.Equal(t, 1, createdCount, "Chỉ một request được phép 201 Created")
    
    // Request còn lại phải nhận 200 (cached response) hoặc 409 (conflict)
    for _, code := range codes {
        if code != 201 {
            assert.True(t, code == 200 || code == 409,
                "Duplicate request phải nhận 200 (cached) hoặc 409 (conflict)")
        }
    }
    
    // Tài khoản chỉ bị debit một lần
    totalDebit := getTotalDebitForIdempotencyKey(idempotencyKey)
    assert.Equal(t, int64(1_000_000), totalDebit,
        "Chỉ 1,000,000 VND được trừ, không phải 2,000,000")
}

Test 5.2: DPoP Token Replay Attack

func TestDPoPTokenReplayAttack(t *testing.T) {
    // Lấy valid access token và DPoP proof
    accessToken, dpopProof := getValidDPoPTokenAndProof("POST", "/v1/transfers")
    
    // Request 1: Hợp lệ
    resp1 := makeAuthenticatedRequest(accessToken, dpopProof, "POST", "/v1/transfers", validBody)
    assert.Equal(t, 200, resp1.StatusCode, "Request 1 phải thành công")
    
    // Request 2: Replay CÙNG dpopProof (attacker intercepted)
    resp2 := makeAuthenticatedRequest(accessToken, dpopProof, "POST", "/v1/transfers", validBody)
    
    // Phải bị từ chối: jti đã được dùng
    assert.Equal(t, 401, resp2.StatusCode,
        "Replay attack PHẢI bị từ chối với 401")
    
    var errorResp map[string]string
    json.Unmarshal(resp2.Body, &errorResp)
    assert.Equal(t, "dpop_replay_detected", errorResp["error"])
}

Test 5.3: Idempotency Key Payload Mismatch

func TestIdempotencyPayloadMismatch(t *testing.T) {
    const key = "idem-key-mismatch-001"
    
    // Request 1: Amount = 1,000,000 VND
    resp1 := sendTransferWithKey(key, 1_000_000)
    assert.Equal(t, 201, resp1.StatusCode)
    
    // Request 2: Cùng key, amount KHÁC = 2,000,000 VND
    resp2 := sendTransferWithKey(key, 2_000_000)
    
    // Phải từ chối với 422 Unprocessable Entity
    assert.Equal(t, 422, resp2.StatusCode,
        "Payload mismatch với existing key PHẢI trả 422")
    
    var errorResp map[string]interface{}
    json.Unmarshal(resp2.Body, &errorResp)
    assert.Equal(t, "idempotency_key_payload_mismatch", errorResp["error"])
}

Category 6: Flink State & Fraud Detection Testing

Test 6.1: Flink Operator Unit Test (TestHarness)

@Test
public void testFraudDetectorWithTestHarness() throws Exception {
    // TestHarness: không cần Flink cluster thật
    KeyedOneInputStreamOperatorTestHarness<String, Event, AlertEvent> harness =
        new KeyedOneInputStreamOperatorTestHarness<>(
            new FraudDetectorOperator(),
            Event::getUserId,
            Types.STRING
        );
    harness.open();

    long now = System.currentTimeMillis();

    // Scenario A: 3 failed logins → high-value tx trong 5 phút → SHOULD ALERT
    harness.processElement(new Event("user-001", "login_failed", 0.0), now);
    harness.processElement(new Event("user-001", "login_failed", 0.0), now + 30_000);
    harness.processElement(new Event("user-001", "login_failed", 0.0), now + 60_000);
    harness.processElement(new Event("user-001", "transaction", 5_000_000.0), now + 120_000);

    // Advance watermark để trigger timer
    harness.processWatermark(now + 300_001);

    List<StreamRecord<AlertEvent>> output = harness.extractOutputStreamRecords();
    assertEquals(1, output.size(), "PHẢI có 1 fraud alert cho Scenario A");
    assertEquals("user-001", output.get(0).getValue().getUserId());

    // Scenario B: 3 failed logins → small tx (< 1M) → SHOULD NOT ALERT
    harness.processElement(new Event("user-002", "login_failed", 0.0), now);
    harness.processElement(new Event("user-002", "login_failed", 0.0), now + 30_000);
    harness.processElement(new Event("user-002", "login_failed", 0.0), now + 60_000);
    harness.processElement(new Event("user-002", "transaction", 50_000.0), now + 120_000); // 50K VND
    harness.processWatermark(now + 300_002);

    // Scenario B không tạo thêm alert
    assertEquals(1, harness.extractOutputStreamRecords().size(),
        "KHÔNG nên có alert cho Scenario B (small transaction)");

    harness.close();
}

Test 6.2: RocksDB State Recovery After Failure

@Test
public void testRocksDBStateRecoveryAfterTaskFailure() throws Exception {
    MiniClusterWithClientResource cluster = setupFlinkMiniCluster();

    // Run pipeline và manually trigger checkpoint
    String checkpointDir = "/tmp/flink-test-checkpoint";
    env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
    env.setDefaultSavepointDirectory(checkpointDir);
    env.enableCheckpointing(1000);

    // Inject test events
    List<Event> testEvents = generateHighVolumeEvents(10_000);
    DataStream<Event> input = env.fromCollection(testEvents);
    DataStream<AlertEvent> output = buildFraudDetectionPipeline(input);

    // Start job, wait for first checkpoint
    JobClient job = env.executeAsync("test-fraud-pipeline");
    waitForCheckpoint(job, checkpointDir);

    // Simulate task failure (kill one TaskManager)
    cluster.getMiniCluster().getTaskManagers().get(0).shutDown();

    // Restart — should recover from checkpoint
    cluster.getMiniCluster().startNewTaskManager();

    // Verify: pipeline continues, no duplicate alerts
    List<AlertEvent> alerts = collectAlerts(output, 30_000);
    long duplicateAlerts = alerts.stream()
        .collect(Collectors.groupingBy(AlertEvent::getTransferId, Collectors.counting()))
        .values().stream()
        .filter(count -> count > 1)
        .count();

    assertEquals(0, duplicateAlerts,
        "Sau failure recovery, KHÔNG được có duplicate fraud alerts (Exactly-Once)");
}

Chaos Engineering Checklist

## Pre-Production Chaos Engineering Gates

### Category 1: Ledger Integrity
- [ ] Concurrent withdraw test: 100 goroutines, 1 account → chỉ đủ requests succeed
- [ ] Reconciliation job: 0 rows trả về từ imbalance query
- [ ] Deadlock-free transfers: 10,000 cross-transfers hoàn thành < 30s

### Category 2: Distributed SQL
- [ ] 3+2 partition test: majority write success, minority write rejected
- [ ] Clock drift 600ms: database detect và handle correctly
- [ ] Leader failover: new leader tiếp nhận trong <30s, data intact

### Category 3: Event Sourcing
- [ ] Event replay: replayed balance = live balance (zero diff)
- [ ] Outbox atomicity: Kafka fail không mất event, retry thành công
- [ ] Snapshot consistency: load từ snapshot + partial replay = full replay result

### Category 4: Saga
- [ ] Step 2 fail: compensation executed, source account restored
- [ ] Compensation fail: DLQ có event, P1 alert fired
- [ ] Timeout: pending transfer auto-void sau TTL

### Category 5: API Security
- [ ] DPoP replay: 401 on second use of same JTI
- [ ] Stolen token + wrong key: 401 thumbprint mismatch
- [ ] Double-submit: 201 once, 409 or 200 cached second time
- [ ] Payload mismatch: 422 when idempotency key reused with different payload

### Category 6: Fraud Detection
- [ ] CEP alert: ≥3 failed logins + high-value tx → alert generated
- [ ] SLA: P50 <50ms, P99 <100ms fraud scoring
- [ ] State recovery: zero duplicate alerts after TaskManager restart
- [ ] False positive rate: <30% in staging with real transaction patterns

Category 7: Load Testing & Performance Validation

Load testing là bước bắt buộc cuối cùng trước khi ship Core Banking system ra production — không có benchmark tự động thì SLA chỉ là con số trên giấy.

Các tool phù hợp cho Core Banking load testing:

Tool	Dùng cho	Ưu điểm
k6	API-level load test (REST/gRPC)	Scripting bằng JavaScript, threshold assertions, cloud execution
Gatling	JVM-based high-concurrency simulation	Scala DSL, built-in HTML reports, phù hợp với Java/Kotlin services
Artillery	Lightweight API load test	YAML config, phù hợp quick spike tests
Locust	Python-based distributed load test	Flexible custom scenarios, phù hợp ML-heavy workflows

Test 1: Ledger Throughput Benchmark (k6)

Target: ≥10,000 transfers/giây với P99 latency ≤50ms trên single-node PostgreSQL.

// k6/ledger-throughput.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Counter, Rate, Trend } from 'k6/metrics';

// Custom metrics
const transferErrors    = new Counter('transfer_errors');
const transferSuccesses = new Counter('transfer_successes');
const transferLatency   = new Trend('transfer_p99_latency', true);
const errorRate         = new Rate('error_rate');

export const options = {
  scenarios: {
    // Ramp up: 0 → 10,000 VUs trong 2 phút, giữ 5 phút, ramp down
    ledger_stress: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '2m', target: 500  },   // Warm-up
        { duration: '3m', target: 2000 },   // Moderate load
        { duration: '5m', target: 5000 },   // Target load
        { duration: '2m', target: 0    },   // Ramp down
      ],
    },
  },
  // SLA thresholds — test FAILS if violated
  thresholds: {
    'http_req_duration{scenario:ledger_stress}': [
      'p(95)<30',   // P95 < 30ms
      'p(99)<50',   // P99 < 50ms ← SLA gate
    ],
    'error_rate': ['rate<0.001'], // < 0.1% error rate
    'http_req_failed': ['rate<0.005'],
  },
};

// Seed data: pre-generated account pairs (avoid hot accounts)
const ACCOUNT_PAIRS = JSON.parse(open('./fixtures/account-pairs.json'));

export default function () {
  // Pick random account pair to distribute load evenly
  const pair = ACCOUNT_PAIRS[Math.floor(Math.random() * ACCOUNT_PAIRS.length)];

  const payload = JSON.stringify({
    idempotency_key: `load-test-${__VU}-${__ITER}`,  // Unique per VU+iteration
    source_account:  pair.source,
    target_account:  pair.target,
    amount:          Math.floor(Math.random() * 100000) + 1000, // 1,000–100,000 VND
    currency:        'VND',
  });

  const res = http.post(
    `${__ENV.BASE_URL}/api/v1/transfers`,
    payload,
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${__ENV.API_TOKEN}`,
      },
      timeout: '5s',
    }
  );

  const success = check(res, {
    'status 201': (r) => r.status === 201,
    'has transaction_id': (r) => JSON.parse(r.body).transaction_id !== undefined,
    'response < 50ms': (r) => r.timings.duration < 50,
  });

  transferLatency.add(res.timings.duration);

  if (success) {
    transferSuccesses.add(1);
    errorRate.add(0);
  } else {
    transferErrors.add(1);
    errorRate.add(1);
    console.error(`Transfer failed: ${res.status} — ${res.body.substring(0, 200)}`);
  }

  sleep(0.01); // 10ms think time giữa các requests
}

// Teardown: log summary
export function handleSummary(data) {
  return {
    'load-test-results/ledger-throughput.json': JSON.stringify(data),
  };
}

Chạy test:

# Tạo fixture data trước
node scripts/generate-account-pairs.js --count 10000 > fixtures/account-pairs.json

# Run k6 với 500 VUs ban đầu
k6 run \
  --env BASE_URL=https://staging.core-banking.internal \
  --env API_TOKEN=$(vault read -field=token secret/staging/api) \
  --out json=results/ledger-throughput-$(date +%Y%m%d).json \
  k6/ledger-throughput.js

# View results
k6 inspect results/ledger-throughput-$(date +%Y%m%d).json

Test 2: Idempotency Stress Test — Duplicate Request Bombardment

Mục tiêu: Đảm bảo cùng idempotency_key không bao giờ tạo ra 2 transactions khác nhau, dù bị gọi đồng thời 100 lần.

// k6/idempotency-stress.js
import http from 'k6/http';
import { check } from 'k6';
import { SharedArray } from 'k6/data';

export const options = {
  // 100 VUs cùng gọi 1 idempotency_key — không có success nào là duplicate
  scenarios: {
    duplicate_storm: {
      executor: 'shared-iterations',
      vus: 100,
      iterations: 1000,
      maxDuration: '30s',
    },
  },
  thresholds: {
    'checks': ['rate==1.0'], // 100% checks phải pass
  },
};

// Một idempotency_key duy nhất cho tất cả VUs
const IDEMPOTENCY_KEY = `idem-stress-test-${Date.now()}`;
const results = new SharedArray('results', () => []);

export default function () {
  const res = http.post(
    `${__ENV.BASE_URL}/api/v1/transfers`,
    JSON.stringify({
      idempotency_key: IDEMPOTENCY_KEY,  // Cùng key!
      source_account:  'ACC-001',
      target_account:  'ACC-002',
      amount:          50000,
      currency:        'VND',
    }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  const body = JSON.parse(res.body);

  check(res, {
    // Chỉ chấp nhận 201 (first create) hoặc 200 (idempotent return)
    'valid status (201 or 200)': (r) => r.status === 201 || r.status === 200,
    // transaction_id phải nhất quán — không thể có 2 giá trị khác nhau
    'same transaction_id always': () => {
      const txId = body.transaction_id;
      if (results.length === 0) {
        results.push(txId);
        return true;
      }
      return txId === results[0]; // Mọi response phải cùng transaction_id
    },
  });
}

Test 3: Payment Gateway Latency Profile (k6 + thresholds per endpoint)

// k6/gateway-latency-profile.js
import http from 'k6/http';
import { check, group } from 'k6';

export const options = {
  vus: 200,
  duration: '5m',
  thresholds: {
    // Per-endpoint SLA từ ISO 20022 integration requirements
    'http_req_duration{endpoint:pacs008_parse}':  ['p(99)<100'], // XML parse < 100ms
    'http_req_duration{endpoint:transfer_submit}': ['p(99)<200'], // Submit < 200ms
    'http_req_duration{endpoint:status_query}':    ['p(95)<20'],  // Status check < 20ms (hot path)
    'http_req_failed': ['rate<0.001'],
  },
};

export default function () {
  group('pacs.008 Parse', () => {
    const res = http.post(
      `${__ENV.BASE_URL}/api/v1/payments/parse`,
      open('./fixtures/pacs008-sample.xml'),
      {
        headers: { 'Content-Type': 'application/xml' },
        tags: { endpoint: 'pacs008_parse' },
      }
    );
    check(res, { 'parse ok': (r) => r.status === 200 });
  });

  group('Transfer Submit', () => {
    const res = http.post(
      `${__ENV.BASE_URL}/api/v1/transfers`,
      JSON.stringify({ /* transfer payload */ }),
      {
        headers: { 'Content-Type': 'application/json' },
        tags: { endpoint: 'transfer_submit' },
      }
    );
    check(res, { 'submit ok': (r) => r.status === 201 });
  });

  group('Status Query', () => {
    const txId = `test-${__VU}-${__ITER - 1}`;
    const res = http.get(
      `${__ENV.BASE_URL}/api/v1/transfers/${txId}/status`,
      { tags: { endpoint: 'status_query' } }
    );
    check(res, { 'status ok': (r) => r.status === 200 || r.status === 404 });
  });
}

Pre-Production Load Testing Gates

#!/bin/bash
# scripts/pre-prod-load-gate.sh
# Chạy trong CI/CD pipeline trước deploy production

set -e

echo "=== Core Banking Load Testing Gate ==="

# 1. Ledger throughput SLA
k6 run --quiet \
  --env BASE_URL=$STAGING_URL \
  --env API_TOKEN=$STAGING_TOKEN \
  k6/ledger-throughput.js
echo "✅ Ledger throughput: P99 < 50ms"

# 2. Idempotency stress
k6 run --quiet \
  --env BASE_URL=$STAGING_URL \
  k6/idempotency-stress.js
echo "✅ Idempotency: no duplicate transactions under 100-VU storm"

# 3. Gateway latency profile
k6 run --quiet \
  --env BASE_URL=$STAGING_URL \
  k6/gateway-latency-profile.js
echo "✅ Gateway latency: pacs.008 parse < 100ms, transfer < 200ms, query < 20ms"

echo ""
echo "All load testing gates PASSED — safe to deploy to production"

KPI của Load Testing phase:

Metric	Pass Threshold	Fail → Action
Transfer P99	≤50ms	Investigate DB locking, connection pool
Error Rate	≤0.1%	Check idempotency logic, retry policy
Idempotency	100% same tx_id	Bug in unique constraint / cache logic
pacs.008 parse P99	≤100ms	Profile XML streaming parser
Status query P95	≤20ms	Check Redis cache hit rate

Phụ Lục: Testing Tools & Libraries

Tool	Dùng cho	Language
libfaketime	Clock drift injection	C/Linux
tc (traffic control)	Network partition simulation	Linux
toxiproxy	Programmable network conditions	Multi-language
Flink TestHarness	Operator unit testing	Java
Flink MiniCluster	Integration testing	Java
Go testing/iotest	I/O failure injection	Go
testcontainers-go	DB containers for integration tests	Go
k6	Load testing HTTP APIs	JavaScript
chaos-mesh	Kubernetes chaos engineering	YAML/Go

FAQ

Bao nhiêu coverage là đủ cho một Core Banking system?

Không có con số tuyệt đối, nhưng nguyên tắc 3-layer:

Unit tests: ≥90% coverage cho business logic (balance calculations, state machines)
Integration tests: Toàn bộ happy path + top 5 failure scenarios cho mỗi API
Chaos engineering: Ít nhất 1 lần mỗi sprint với network partition và clock skew

Quan trọng hơn coverage % là coverage của failure modes — đặc biệt là concurrent scenarios không thể test bằng sequential unit tests.

Flink TestHarness có thể test toàn bộ pipeline không?

TestHarness tốt cho operator-level unit tests (kiểm tra một operator riêng lẻ với mock inputs). Nhưng để test toàn bộ pipeline (Kafka source → CEP → ML inference → Kafka sink), cần dùng MiniCluster hoặc environment staging với real Kafka/Flink cluster.

Nên mock hay integration test cho database trong ledger tests?

Không mock database cho ledger invariant tests. Sử dụng testcontainers-go để spin up real PostgreSQL/TiDB instance trong Docker — kiểm tra thực sự các race conditions, deadlocks, và ACID properties mà mock không thể reproduce. Mock chỉ phù hợp cho external services (Kafka, NAPAS gateway, notification service).

Làm thế nào detect silent data corruption trong production?

Chạy continuous reconciliation — một background job đọc từ event store và recompute balance, so sánh với CQRS read model. Bất kỳ difference nào → P1 alert. Interval phụ thuộc vào transaction volume: 5 phút cho hệ thống lớn, 1 giờ cho hệ thống nhỏ. Đây là “immune system” của Core Banking.

Tổng Kết Series: Core Banking Architecture

Qua 8 phần của series, chúng ta đã đi qua toàn bộ stack của một hệ thống Core Banking production-grade:

Phần	Kiến thức Cốt lõi	Benchmark Quan Trọng
1	Double-Entry Ledger Schema, TigerBeetle Zig	1M TPS single-threaded
2	Distributed SQL, TrueTime, HLC, Percolator	1-3ms TSO overhead
3	Event Sourcing, CQRS, Outbox Pattern	<1ms vs 200ms balance lookup
4	Saga Orchestration, Temporal, DLQ	10-50ms per orchestration hop
5	ISO 20022, XML streaming parser	10-30x JSON faster than XML
6	FAPI 2.0, DPoP, mTLS	<0.1ms pooled mTLS overhead
7	Flink CEP, RocksDB, ML inference	50-100ms fraud scoring SLA
8	SDET handbook, chaos engineering	0 double-spends, 0 imbalances

Nội dung liên quan để đọc thêm:

Composable Banking Architecture — Từ monolith đến modular core
PayPay Architecture — Scale 70M users với TiDB và Kafka idempotency
High Concurrency Systems — Distributed locking và idempotency APIs

Tại Sao Core Banking Cần SDET Riêng?#

Category 1: Double-Entry Invariant Auditing#

Test 1.1: Concurrent Double-Spend Prevention#

Test 1.2: Continuous Reconciliation Job#

Test 1.3: Deadlock Prevention Verification#

Category 2: Distributed SQL & Clock Resilience Testing#

Test 2.1: Network Partition (Split-Brain) Simulation#

Test 2.2: Clock Skew Injection (libfaketime)#

Category 3: Event Sourcing & Outbox Testing#

Test 3.1: Event Store Replay Consistency#

Test 3.2: Outbox Atomicity — DB Success, Kafka Fail#

Category 4: Saga Compensation Testing#

Test 4.1: Step 2 Failure → Compensation Verification#

Test 4.2: Compensation Chain Failure → DLQ Routing#

Category 5: Idempotency & API Security Testing#

Test 5.1: Concurrent Double-Submit#

Test 5.2: DPoP Token Replay Attack#

Test 5.3: Idempotency Key Payload Mismatch#

Category 6: Flink State & Fraud Detection Testing#

Test 6.1: Flink Operator Unit Test (TestHarness)#

Test 6.2: RocksDB State Recovery After Failure#

Chaos Engineering Checklist#

Category 7: Load Testing & Performance Validation#

Test 1: Ledger Throughput Benchmark (k6)#

Test 2: Idempotency Stress Test — Duplicate Request Bombardment#

Test 3: Payment Gateway Latency Profile (k6 + thresholds per endpoint)#

Pre-Production Load Testing Gates#

Phụ Lục: Testing Tools & Libraries#

FAQ#

Bao nhiêu coverage là đủ cho một Core Banking system?#

Flink TestHarness có thể test toàn bộ pipeline không?#

Nên mock hay integration test cho database trong ledger tests?#

Làm thế nào detect silent data corruption trong production?#

Tổng Kết Series: Core Banking Architecture#