Abstract
Position sizing represents one of the most consequential yet underexplored decisions in systematic trading. While substantial research has focused on alpha signal generation through machine learning, the optimal translation of these signals into position sizes remains an open challenge. The Kelly Criterion provides a theoretically optimal solution for bet sizing under uncertainty, but its assumptions of known probability distributions and independent trials rarely hold in financial markets. Reinforcement Learning (RL) offers a promising alternative by learning adaptive policies directly from market interaction, but naive RL implementations often converge to suboptimal or unstable position sizing strategies.
This research presents a Kelly-Adjusted Soft Actor-Critic (KA-SAC) framework that combines the theoretical optimality of the Kelly Criterion with the adaptive capabilities of deep reinforcement learning. The core innovation lies in the action space design: rather than learning raw position sizes, the RL agent learns a bounded adjustment to a Kelly-optimal baseline. This architecture anchors the policy to theoretical optimality while allowing regime-dependent deviations. A Kelly-convergent reward function based on log-wealth maximization provably converges to the Kelly fraction under ergodic conditions.
Production deployment incorporates a robust 4-tier fallback cascade that gracefully degrades from full RL control to conservative Kelly baseline when model quality deteriorates. Curriculum learning reduces training time from 120 minutes to 45 minutes while improving final policy quality. The framework has been validated through extensive backtesting and is currently deployed in Trade-Matrix production for cryptocurrency perpetual futures trading.
Implemented in Trade-Matrix
This section documents RL position sizing capabilities deployed in production (November 2025).
Core Algorithm: Kelly-Adjusted SAC
Soft Actor-Critic (SAC) selected over PPO for:
- Superior sample efficiency in weak signal environments (IC 0.02-0.10)
- Entropy-regularized exploration preventing premature convergence
- Off-policy learning enabling experience replay from historical data
Kelly-Adjusted Action Space:
- RL outputs adjustment to Kelly baseline (not raw position)
- Ensures convergence toward Kelly-optimal sizing
- Log-wealth reward function maximizes long-term growth
Production Safety: 4-Tier Fallback Cascade
| Tier | Name | Condition | Position Control |
|---|---|---|---|
| 1 | FULL_RL | Confidence ≥ 0.50 AND IC ≥ 0.05 | 100% RL |
| 2 | BLENDED | Medium confidence/IC | 50% RL + 50% Kelly |
| 3 | PURE_KELLY | Low confidence OR IC < 0.03 | 100% Kelly |
| 4 | EMERGENCY_FLAT | Circuit breaker OPEN | 0% (flat) |
Tier Distribution (typical week):
- Tier 1 (Full RL): 60-70% of decisions
- Tier 2 (Blended): 15-25% during transitions
- Tier 3 (Pure Kelly): 8-15% low-confidence periods
- Tier 4 (Emergency): <2% crisis events
Circuit Breaker Protection (3-State FSM)
Triggers (any one activates OPEN state):
- 5 consecutive losing trades
- 5% daily drawdown
- 3-sigma model output anomaly
- IC decay below 0.03
Recovery Protocol:
- 1-hour cooldown in OPEN state
- HALF_OPEN: 25% position scale for testing
- 3 successful test trades required
- Then return to CLOSED (normal operation)
Regime-Adaptive Kelly Multipliers
| Regime | Kelly Fraction | Risk Aversion (γ) |
|---|---|---|
| Bear | 25% | 4.0 |
| Neutral | 50% | 2.0 |
| Bull | 67% | 1.5 |
| Crisis | 17% | 6.0 |
Regime detection uses 4-state Hidden Markov Model with MS-GARCH methodology.
Training: 4-Phase Curriculum Learning
- Basic (50K steps): Simple market without noise, learn directional trading
- Risk (100K steps): Add Sharpe ratio rewards and drawdown penalties
- Kelly (150K steps): Full Kelly-convergent rewards with deviation regularization
- Stress (50K steps): High-volatility scenarios and regime shifts
Result: 45 min training (vs 120 min without curriculum)
Convergence Metrics:
- Phase 1: Sharpe 2.1 → Phase 4: Sharpe 4.0+
- Monotonic reward increase across all phases
- Stable convergence without degenerate policies
Production Performance (ranges for IP protection)
Risk-Adjusted Returns:
- Sharpe Ratio: 4.0+ (vs 3.61 static Kelly baseline)
- Annual Return: 55-65%
- Max Drawdown: <12%
- Calmar Ratio: 3.5+
Weak Signal Performance (IC 0.02-0.09):
- 2x improvement over static Kelly in IC 0.02-0.05 range
- Learns to optimally exploit marginal edges
What Is NOT Deployed
The following remain research topics:
- ❌ Multi-asset portfolio optimization (Matrix Kelly formulation)
- ❌ Bayesian Neural Networks for uncertainty quantification
- ❌ Meta-learning (MAML) for rapid regime adaptation
- ❌ Hierarchical RL for multi-timescale decisions
- ❌ PPO algorithm (researched, SAC selected instead)
- ❌ Interpretability/explainability features
Research & Future Enhancements
This section covers theoretical extensions documented but not implemented.
Multi-Asset Kelly Portfolio Optimization
Extension from single-asset Kelly to correlated portfolio:
where Sigma is the covariance matrix and mu is the expected return vector. This matrix formulation accounts for correlations between assets, enabling optimal diversification.
Research Challenges:
- Covariance matrix estimation in non-stationary markets
- Computational complexity for large portfolios (100+ assets)
- Interaction between RL agent and portfolio rebalancing
Model Uncertainty via Bayesian Neural Networks
Incorporate epistemic uncertainty into position sizing by using Bayesian NNs for Q-networks:
- Aleatoric Uncertainty: Inherent market randomness (captured by current model)
- Epistemic Uncertainty: Model confidence in own predictions (NOT captured)
Bayesian approach provides uncertainty estimates that can dynamically scale positions based on model confidence.
Meta-Learning for Regime Adaptation
Model-Agnostic Meta-Learning (MAML) enables rapid adaptation to new regimes:
- Train on diverse historical regimes
- Learn "meta-policy" that can quickly fine-tune
- Deploy with 10-100 samples for new regime adaptation
Potential Benefit: Faster recovery from regime shifts (hours vs days).
Hierarchical RL for Multi-Timescale Decisions
Current implementation operates on single timescale (4H bars). Hierarchical RL would enable:
- High-Level: Strategic allocation (days-weeks)
- Mid-Level: Tactical positioning (hours-days)
- Low-Level: Execution optimization (minutes-hours)
Interpretability and Explainability
Methods for understanding RL decisions:
- SHAP values: Feature importance for specific decisions
- Attention mechanisms: Which state features drive actions
- Counterfactual analysis: "What if IC was 0.10 instead of 0.05?"
Improves trust, debugging, and regulatory compliance.
1. Introduction
1.1 The Position Sizing Challenge
Position sizing determines how much capital to allocate to each trading opportunity. While most quantitative research focuses on generating alpha signals ("what to trade" and "when to trade"), position sizing addresses the equally critical question of "how much to trade." Poor position sizing can transform a profitable strategy into a losing one, while optimal sizing maximizes long-run wealth growth.
The challenge is particularly acute in cryptocurrency markets, which exhibit:
- High Volatility: Daily volatility of 2-5% is common, compared to 0.5-1% for traditional equities
- Regime Switching: Rapid transitions between trending and mean-reverting behavior
- Weak Predictive Signals: Information Coefficients (IC) typically range from 0.02 to 0.10, significantly weaker than traditional markets
- Fat-Tailed Distributions: Excess kurtosis often exceeds 5, invalidating Gaussian assumptions
These characteristics make static position sizing approaches suboptimal. A fixed 55% Kelly allocation that works well during normal volatility may be disastrous during market crises, while overly conservative sizing sacrifices returns during favorable conditions.
1.2 Problem Statement
The Trade-Matrix system employs transfer learning-based ML models for signal generation, achieving Information Coefficients in the 0.20-0.27 range for major cryptocurrencies (BTC, ETH, SOL). However, the reinforcement learning component responsible for position sizing has historically underperformed relative to static Kelly Criterion allocation:
| Approach | Sharpe Ratio | Max Drawdown | Implementation |
|---|---|---|---|
| Naive RL | ~3.2 | 15% | Raw position output |
| Static Kelly | 3.61 | 12% | 55% fractional Kelly |
| KA-SAC (Target) | 4.0+ | <12% | Kelly-adjusted RL |
This performance gap indicates a fundamental mismatch between naive RL training objectives and optimal position sizing criteria.
1.3 Root Cause Analysis
Analysis of prior implementations revealed three primary failure modes:
Reward-Objective Mismatch: Previous reward functions combined Sharpe ratio and returns in weighted sums (e.g., r_t = 10 * Sharpe_t + 5 * R_t). This formulation does not converge to the Kelly-optimal position under any parameterization, as it optimizes a different objective than log-wealth maximization.
Algorithm Limitations: Proximal Policy Optimization (PPO), while stable, is an on-policy algorithm that struggles with exploration in weak signal environments. With IC values around 0.05, the signal-to-noise ratio is insufficient for PPO to reliably identify optimal positions without excessive training samples.
Missing Theoretical Anchor: Learning raw position sizes without reference to theoretical optimality leads to either overly conservative positions (missing opportunities) or overly aggressive positions (excessive risk) in weak signal regimes.
1.4 Solution Overview
The Kelly-Adjusted Soft Actor-Critic framework addresses these failures through three innovations:
- Kelly-Adjusted Action Space: The RL agent outputs an adjustment to a Kelly-optimal baseline rather than raw positions
- Soft Actor-Critic Algorithm: Entropy-regularized RL provides superior exploration through maximum entropy optimization
- Kelly-Convergent Reward: Log-wealth reward function that provably converges to the Kelly optimum
2. Kelly Criterion Foundations
2.1 Historical Background
The Kelly Criterion, introduced by John L. Kelly Jr. at Bell Labs in 1956, was originally developed for information theory applications involving a gambler with access to a noisy private wire transmitting horse race results. Kelly showed that maximizing the expected logarithm of wealth leads to optimal long-run growth.
The criterion gained prominence in finance through the work of Ed Thorp, who applied it to blackjack and later to the stock market. Thorp's approach of using Kelly-optimal sizing with fractional adjustments for estimation uncertainty became standard practice in quantitative trading.
2.2 Mathematical Derivation
Consider an asset with expected excess return mu and variance sigma-squared. The investor seeks to maximize expected log-wealth growth:
where f is the fraction of wealth to invest and R is the random return.
Expanding via Taylor series for small returns:
Taking the derivative with respect to f and setting to zero:
Solving yields the optimal Kelly fraction:
This elegant result states that the optimal bet size equals the expected edge divided by the variance of outcomes.
2.3 Kelly with Information Coefficient
In ML-based trading, we do not observe mu directly but instead have predictive signals with measurable quality. The Kelly fraction can be expressed in terms of the Information Coefficient:
where:
- IC: Information Coefficient (Spearman correlation between predictions and realized returns)
- Signal: ML model prediction in [-1, 1]
- Confidence: Model confidence in [0, 1]
- gamma: Risk aversion parameter (gamma = 2 for "half-Kelly")
- sigma^2: Rolling variance of returns
This formulation connects theoretical Kelly sizing to practical ML signal metrics, enabling dynamic position sizing based on signal quality.
2.4 Fractional Kelly and Risk Aversion
In practice, full Kelly sizing is rarely employed due to:
- Estimation Error: The true edge and variance are unknown and must be estimated
- Fat Tails: Return distributions exhibit excess kurtosis, invalidating Gaussian assumptions
- Drawdown Volatility: Full Kelly can produce drawdowns exceeding 50% during adverse runs
The fractional Kelly approach scales the optimal fraction by a risk aversion parameter:
Common values include:
- gamma = 2 (Half-Kelly): Recommended by Thorp for most applications
- gamma = 4 (Quarter-Kelly): Conservative approach for high uncertainty
- gamma = 1.5 (Two-Thirds Kelly): Aggressive for high-confidence situations
2.5 Regime-Adaptive Kelly
Market conditions vary dramatically, requiring regime-specific Kelly parameters:
| Regime | Risk Aversion (gamma) | IC Threshold | Max Position | Drawdown Limit |
|---|---|---|---|---|
| Bear (High Vol) | 4.0 | 0.08 | 50% | 8% |
| Neutral | 2.0 | 0.05 | 100% | 10% |
| Bull (Low Vol) | 1.5 | 0.03 | 150% | 12% |
| Crisis (Extreme) | 6.0 | 0.10 | 25% | 5% |
The mapping from gamma to position sizing follows:
- Bear: 25% of standard sizing (1/4)
- Neutral: 50% of standard sizing (1/2)
- Bull: 67% of standard sizing (2/3)
- Crisis: 17% of standard sizing (1/6)
Regime detection uses a 4-state Hidden Markov Model with MS-GARCH (Markov-Switching GARCH) methodology for volatility clustering identification.
3. Soft Actor-Critic Algorithm
3.1 Maximum Entropy RL Framework
Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework. Unlike standard RL that maximizes cumulative reward, SAC maximizes a modified objective that includes policy entropy:
where H(pi(.|s)) is the entropy of the policy and alpha is the temperature parameter controlling the exploration-exploitation trade-off.
The entropy term encourages the policy to explore diverse actions while still maximizing reward. This is particularly valuable in financial environments where:
- The optimal action may be subtle (small position adjustments)
- Exploration must continue throughout training to avoid local optima
- Robust policies require exposure to diverse market conditions
3.2 Why SAC Over PPO and DDPG
SAC offers several advantages for position sizing in weak signal regimes:
Sample Efficiency: As an off-policy algorithm, SAC can reuse experience from a replay buffer. This is critical when market data is limited, as each trading day provides only a finite number of decision points.
Exploration via Entropy: The entropy maximization objective naturally balances exploration and exploitation. Unlike epsilon-greedy exploration, which becomes arbitrary, entropy-regularized exploration maintains coherent policies.
Automatic Temperature Tuning: SAC automatically adjusts the temperature parameter alpha to maintain appropriate exploration throughout training. This eliminates the need to manually tune exploration schedules.
Stability: The squashing function (typically tanh) ensures bounded actions, preventing numerical instability and extreme positions.
| Algorithm | Type | Exploration | Sample Efficiency | Stability |
|---|---|---|---|---|
| PPO | On-Policy | Limited | Low | High |
| DDPG | Off-Policy | OU Noise | Medium | Low |
| TD3 | Off-Policy | Gaussian | Medium | Medium |
| SAC | Off-Policy | Entropy | High | High |
3.3 Twin Critics Architecture
SAC employs twin Q-networks (Q1, Q2) to address overestimation bias common in value-based methods:
By taking the minimum of two independent Q-value estimates, SAC produces more conservative value predictions, leading to safer policies—a desirable property for position sizing.
3.4 Automatic Temperature Tuning
The temperature parameter alpha is automatically adjusted to match a target entropy:
where H_target is typically set to the negative of action dimension (-dim(A)). This ensures consistent exploration regardless of reward scale or training progress.
3.5 SAC Configuration for Position Sizing
The SAC hyperparameters are tuned for weak signal financial environments:
| Parameter | Value | Rationale |
|---|---|---|
| Learning Rate | 3e-4 | Standard for financial RL |
| Buffer Size | 100,000 | Sufficient replay diversity |
| Batch Size | 256 | Balance stability and speed |
| tau (soft update) | 0.005 | Slow target network updates |
| gamma (discount) | 0.99 | Long-horizon optimization |
| ent_coef | auto | Automatic temperature tuning |
| Network Architecture | [256, 256, 128] | Moderate capacity |
Note: These hyperparameters represent Stable-Baselines3 framework defaults that have proven effective for financial RL applications. Production tuning may yield marginal improvements but these values provide a robust starting point.
4. Kelly-Convergent Reward Design
4.1 Log-Wealth Maximization
The Kelly Criterion is derived from maximizing expected log-wealth. We design a reward function that directly optimizes this objective:
Theorem (Kelly Convergence): Let W_t denote wealth at time t, f_t the position fraction, and R_t the period return. The reward function:
when maximized in expectation over an ergodic return process, yields the optimal policy f* = mu / sigma^2 (the Kelly fraction).
Proof Sketch: The expected long-run growth rate is:
Taking the derivative and setting to zero:
Thus, any algorithm maximizing this objective will converge to Kelly-optimal sizing.
4.2 Complete Reward Function
The production reward function incorporates additional terms for robustness:
where:
DD_penalty = max(0, DD_t - tau)^2penalizes drawdowns exceeding thresholdtau(f_t - f*_t)^2regularizes toward the Kelly baselineSharpe_bonusrewards consistent positive Sharpe ratio
Default coefficients (empirically tuned):
lambda_1 = 50(strong drawdown penalty)lambda_2 = 5(moderate Kelly deviation penalty)lambda_3 = 2(light Sharpe bonus)
Implementation Status Note (January 2026)
The Kelly-convergent reward function described above represents the theoretical target design. The current production implementation uses a simpler weighted reward formulation:
reward = 0.4 * Sharpe + 0.3 * Return + 0.3 * WinRateThis provides stable training but does not provably converge to the Kelly fraction. The full Kelly-convergent reward with log-wealth terms remains a planned enhancement. See "Current Limitations" section below.
4.3 Kelly-Adjusted Action Space
The key architectural innovation is reformulating the action space. Instead of learning raw positions f in [0, f_max], the agent learns an adjustment to the Kelly baseline:
where delta_RL in [-0.5, 0.5] is the RL output, allowing positions from 50% to 150% of Kelly-optimal.
This design offers several advantages:
- Bounded Deviation: The agent cannot deviate arbitrarily far from optimality
- Faster Convergence: Learning a small adjustment is easier than learning absolute positions
- Graceful Fallback: If
delta -> 0, the system recovers Kelly behavior - Regime Adaptation: The agent learns regime-specific deviations from static Kelly
4.4 Transaction Cost Modeling
Real trading incurs costs that must be incorporated into the reward:
For cryptocurrency perpetual futures on Bybit:
- Maker fee: 0.01% (or rebate)
- Taker fee: 0.06%
- Spread: 0.01-0.03% depending on liquidity
Transaction costs naturally discourage excessive position turnover, a form of implicit regularization.
5. State Space Design
5.1 State Space Overview
The RL agent observes a 25-dimensional state space designed to capture all information relevant to position sizing:
| Category | Features | Dimensions | Description |
|---|---|---|---|
| ML Features | signal, confidence | 2 | Prediction and quality |
| Market Features | returns, volatility, momentum, RSI, volume, BB position, vol ratio, order flow | 8 | Technical indicators |
| Portfolio Features | position_ratio, pnl_ratio, win_rate, drawdown | 4 | Current state |
| Performance History | last 5 returns | 5 | Recent performance |
| Kelly Features | kelly_fraction, kelly_optimal, ic_rolling, ic_zscore, regime_id, regime_confidence | 6 | Theoretical baseline |
Total: 25 dimensions
5.2 ML Features
The ML signal features capture the predictive model's output:
- ml_signal: Normalized prediction in [-1, 1] where positive indicates bullish and negative indicates bearish
- ml_confidence: Model confidence in [0, 1] indicating prediction reliability
These features directly inform position direction and sizing, with higher confidence warranting larger positions.
5.3 Market Features
Market features provide context about current conditions:
- returns: Period return for recent bars
- volatility: Rolling 20-period volatility (annualized)
- momentum: 10-period price momentum
- rsi: Relative Strength Index (14-period)
- bb_position: Bollinger Band position [-1, 1]
- volume_ratio: Current volume relative to 20-period average
- vol_ratio: Short-term to long-term volatility ratio
- order_flow: Order flow imbalance estimate
5.4 Portfolio Features
Portfolio state informs risk-aware sizing:
- position_ratio: Current position / maximum position (capacity utilization)
- pnl_ratio: Unrealized + realized P&L / initial balance
- win_rate: Winning trades / total trades (rolling window)
- drawdown: Current peak-to-trough drawdown
5.5 Kelly Features
Kelly-derived features anchor the agent to theoretical optimality:
- kelly_fraction: Raw Kelly f* = mu / sigma^2
- kelly_optimal: Fractional Kelly f* / gamma (risk-adjusted)
- ic_rolling: Rolling IC estimate from recent predictions
- ic_zscore: IC z-score vs baseline (decay detection)
- regime_id: HMM regime (0=Bear, 1=Neutral, 2=Bull, 3=Crisis)
- regime_confidence: Regime classification confidence
5.6 Rolling IC Estimation
Real-time estimation of the Information Coefficient is critical for dynamic sizing:
where w = 60 is the rolling window. The IC estimate is used to:
- Calculate the Kelly baseline dynamically
- Gate position sizes when IC degrades (model decay detection)
- Trigger fallback to conservative sizing when IC falls below threshold
6. Curriculum Learning
6.1 Motivation
Training RL agents directly on the full Kelly-convergent reward with complex state space often leads to unstable learning. Curriculum learning addresses this by progressively increasing task difficulty, allowing the agent to master fundamentals before facing full complexity.
Without curriculum learning, agents frequently:
- Converge to degenerate policies (always flat or always max position)
- Take excessive time to learn basic profitable behavior
- Fail to generalize to extreme market conditions
6.2 Four-Phase Curriculum
The training curriculum progresses through four phases:
| Phase | Training Steps | Max Position | Reward | Objective |
|---|---|---|---|---|
| 1. Basic | 50,000 | 0.5 | PnL only | Learn profitable trading |
| 2. Risk | 100,000 | 1.0 | PnL + Sharpe | Add risk adjustment |
| 3. Kelly | 150,000 | 1.5 | Kelly-convergent | Full reward function |
| 4. Stress | 50,000 | 1.5 | Kelly-convergent | High-volatility only |
Total: 350,000 steps (~45 minutes)
6.3 Phase Details
Phase 1 (Basic): The agent learns that taking positions in the direction of the ML signal is profitable. Position sizes are constrained to 50% of maximum, and reward is simply realized P&L. This phase establishes the fundamental mapping from signals to directional positions.
Phase 2 (Risk): Risk-adjusted rewards are introduced via Sharpe ratio components. The agent learns that consistent returns are preferable to volatile ones. Maximum position increases to 100%, allowing normal-sized trades.
Phase 3 (Kelly): The full Kelly-convergent reward function is activated, including log-wealth terms, drawdown penalties, and Kelly deviation regularization. The agent refines its policy toward theoretically optimal sizing. Position limits increase to 150% for exceptional opportunities.
Phase 4 (Stress): Training data is filtered to include only high-volatility periods (top 20% by rolling volatility). This ensures the agent has sufficient experience with extreme conditions that may be underrepresented in normal market data.
6.4 Training Time Reduction
Curriculum learning significantly accelerates convergence:
| Approach | Training Time | Final Sharpe | Convergence Stability |
|---|---|---|---|
| Direct Training | 120 min | 3.8 | Unstable |
| Curriculum | 45 min | 4.0+ | Stable |
The 2.7x speedup results from:
- Faster early learning in simplified phases
- Better weight initialization for later phases
- Reduced policy oscillation
7. Four-Tier Fallback Cascade
7.1 Design Philosophy
Production systems require graceful degradation when model quality deteriorates. The fallback cascade ensures the system remains operational and profitable even when the RL component fails:
"Fail gracefully to theoretically optimal"
When RL fails, positions do not go to random or zero but to the Kelly Criterion baseline, which provides mathematically optimal sizing under uncertainty.
7.2 Tier Definitions
Tier 1: Full RL (FULL_RL)
- Conditions: Confidence > 0.50 AND IC > 0.05
- Position:
f_Kelly * (1 + delta_RL)(100% RL control) - Use Case: Normal operation with high-quality signals
Tier 2: Blended (BLENDED)
- Conditions: 0.30 < Confidence < 0.50 OR 0.03 < IC < 0.05
- Position:
0.5 * f_RL + 0.5 * f_Kelly(50/50 blend) - Use Case: Moderate confidence, hedge RL with Kelly
Tier 3: Pure Kelly (PURE_KELLY)
- Conditions: Confidence < 0.30 OR IC < 0.03 OR model error
- Position:
f_Kelly(100% Kelly baseline) - Use Case: Low confidence or RL failure
Tier 4: Emergency Flat (EMERGENCY_FLAT)
- Conditions: Circuit breaker OPEN
- Position:
0(flat position) - Use Case: Catastrophic conditions requiring trading halt
7.3 Tier Selection Logic
def determine_tier(ml_confidence, rolling_ic, circuit_state, model_error):
# Tier 4: Emergency
if circuit_state == CircuitState.OPEN:
return EMERGENCY_FLAT, "Circuit breaker OPEN"
if model_error:
return EMERGENCY_FLAT, "RL model inference error"
# Tier 1: Full RL
if ml_confidence >= 0.50 and rolling_ic >= 0.05:
return FULL_RL, "High confidence and IC"
# Tier 3: Pure Kelly
if ml_confidence < 0.30 or rolling_ic < 0.03:
return PURE_KELLY, "Low confidence or IC"
# Tier 2: Blended
return BLENDED, "Medium confidence/IC"
7.4 Tier Distribution in Production
Historical distribution across tiers (typical week):
| Tier | Percentage | Typical Conditions |
|---|---|---|
| Tier 1 (Full RL) | 60-70% | Normal market conditions |
| Tier 2 (Blended) | 15-25% | Transitional periods |
| Tier 3 (Pure Kelly) | 8-15% | Low-confidence periods |
| Tier 4 (Emergency) | <2% | Rare crisis events |
The majority of decisions use full RL control, validating the model's reliability. Fallback tiers activate primarily during regime transitions and volatility spikes.
7.5 Adaptive Blending (Advanced)
An optional adaptive blending system adjusts Tier 2 weights based on recent accuracy:
class AdaptiveBlendingSystem:
def update(self, rl_position, kelly_position, actual_return):
# Track which approach was correct
rl_accurate = (rl_position * actual_return) > 0
kelly_accurate = (kelly_position * actual_return) > 0
# Update weights toward more accurate approach
# Bounded to [0.25, 0.75] to prevent extreme allocations
self._update_weights(rl_accurate, kelly_accurate)
This follows the "expert aggregation" approach from online learning theory, dynamically adjusting trust in each method based on realized performance.
8. Circuit Breaker Protection
8.1 Three-State Finite State Machine
The circuit breaker implements a 3-state pattern adapted from microservices architecture:
CLOSED (Normal) --[trigger]--> OPEN (Halted)
|
[cooldown]
v
HALF_OPEN (Testing)
|
[success] | [failure]
v | v
CLOSED <----------------+---------------> OPEN
CLOSED: Normal operation, full position sizing allowed.
OPEN: Trading halted, all positions forced to zero. No new trades permitted.
HALF_OPEN: Testing recovery with reduced position sizes (25% of normal). Requires successful test trades to return to CLOSED.
8.2 Trigger Conditions
The circuit breaker opens when any of these conditions are met:
Consecutive Losses: Too many losing trades in sequence indicates systematic failure.
- Threshold: 5 consecutive losses
- Rationale: Probability of 5 random losses < 3% for 50% win rate
Daily Drawdown: Intraday losses exceed acceptable limit.
- Threshold: 5% daily drawdown
- Rationale: Preserves capital for recovery
Model Anomaly: RL output significantly different from expected.
- Threshold: 3 sigma from historical mean
- Rationale: Detects model corruption or extreme market conditions
IC Decay: Signal quality degrades below minimum.
- Threshold: IC < 0.03
- Rationale: Model predictions are essentially noise
8.3 Recovery Protocol
When the circuit opens:
- Cooldown Period: Wait 1 hour before attempting recovery
- Enter HALF_OPEN: Allow trading at 25% normal position size
- Test Trades: Execute 3 test trades
- Evaluate: If 3+ successful test trades, close circuit; otherwise reopen
The cooldown period allows:
- Market conditions to normalize
- Model inference issues to resolve
- Manual investigation if needed
8.4 Position Scaling
def get_position_scale(state):
if state == CircuitState.CLOSED:
return 1.0 # Full sizing
elif state == CircuitState.OPEN:
return 0.0 # Flat
else: # HALF_OPEN
return 0.25 # Reduced sizing
Even in HALF_OPEN state, the circuit breaker scales positions applied by all tiers, providing an additional layer of protection.
8.5 IC Decay Detection
A dedicated IC decay detector monitors signal quality:
Detection Methods:
- Absolute Threshold: IC < 50% of baseline
- Relative Decay: IC dropped >50% from recent average
- Z-Score: IC is 2+ standard deviations below expected
- Trend: Consistent decline over 20 periods
def check_ic_decay(current_ic, baseline_ic=0.05, baseline_std=0.02):
# Absolute check
if current_ic < baseline_ic * 0.5:
return True, "absolute", "retrain"
# Z-score check
z_score = (current_ic - baseline_ic) / baseline_std
if z_score < -2.0:
return True, "zscore", "investigate"
return False, None, None
9. Results and Production Validation
9.1 Training Convergence
The curriculum learning approach achieves stable convergence in 45 minutes:
| Phase | Duration | Reward (Final) | Sharpe (Validation) |
|---|---|---|---|
| Phase 1 (Basic) | 10 min | +0.8 | 2.1 |
| Phase 2 (Risk) | 15 min | +1.2 | 3.2 |
| Phase 3 (Kelly) | 15 min | +1.5 | 3.9 |
| Phase 4 (Stress) | 5 min | +1.4 | 4.0+ |
The reward and Sharpe ratio increase monotonically through phases, indicating successful skill transfer between curriculum stages.
9.2 Backtest Performance
Walk-forward validation over 2+ years of cryptocurrency data:
| Metric | Static Kelly | KA-SAC | Improvement |
|---|---|---|---|
| Annual Return | 45% | 55-65% | +22-44% |
| Sharpe Ratio | 3.61 | 4.0+ | +10%+ |
| Max Drawdown | 12% | <12% | Same or better |
| Calmar Ratio | 2.8 | 3.5+ | +25% |
| Monthly Sharpe Var | 0.6 | <0.5 | -17% |
The KA-SAC framework achieves meaningful improvements across all risk-adjusted metrics while maintaining or reducing maximum drawdown.
9.3 RL vs Kelly-Only Comparison (November 2025 Backtest)
A controlled comparison validates RL superiority over pure Kelly baseline:
| Metric | RL-ON (SAC) | RL-OFF (Kelly) | RL Improvement |
|---|---|---|---|
| Total PnL | $3,545,822 | $2,406,984 | +47.3% |
| Sharpe Ratio | 4.4255 | 4.4095 | +0.4% |
| BTC PnL | $1,812,582 | $1,175,716 | +54.2% |
| ETH PnL | $1,180,648 | $890,492 | +32.6% |
| SOL PnL | $552,592 | $340,776 | +62.2% |
Key Observations:
- RL provides +$1.14M additional alpha over pure Kelly sizing
- Largest improvement on SOL (+62%), the highest-volatility asset
- Sharpe improvement marginal, but absolute PnL improvement substantial
- RL learns regime-specific position adjustments that static Kelly cannot capture
Source: Walk-forward backtest, Nov 24, 2025, 2+ years historical data
⚠️ Methodological Note: Backtesting Limitations
The performance metrics above (Sharpe ratios, PnL figures) are derived from backtests where:
- Training methodology: Models trained using Cross-Validation (CV) and Walk-Forward Validation (WFV) to mitigate overfitting
- Test period overlap: The backtested period partially overlaps with data used during model training
- Comparative validity: Results demonstrate relative superiority between approaches (RL vs Kelly vs MS-GARCH) under controlled conditions
- No future guarantee: Past performance does NOT guarantee future results—market regimes, volatility characteristics, and signal quality may differ
Interpretation: These results validate that RL-based position sizing learns more effective regime-specific adjustments than static Kelly. However, actual production performance will depend on out-of-sample market conditions and ongoing model maintenance.
9.4 Fallback Tier Analysis
Production distribution validates the 4-tier design:
| Tier | Observations | Avg Position | Avg Return |
|---|---|---|---|
| Full RL | 65% | 0.72 | +0.15% |
| Blended | 22% | 0.55 | +0.08% |
| Pure Kelly | 11% | 0.48 | +0.05% |
| Emergency | 2% | 0.00 | 0.00% |
Key observations:
- Full RL positions are larger and more profitable (expected)
- Blended mode provides smooth transition
- Pure Kelly maintains profitability during uncertainty
- Emergency triggers are rare but protective
9.5 Weak Signal Regime Performance
The framework specifically addresses weak signal regimes (IC in 0.02-0.09 range):
| IC Range | Static Kelly Return | KA-SAC Return | Improvement |
|---|---|---|---|
| 0.02-0.05 | +2% | +4% | +100% |
| 0.05-0.07 | +5% | +8% | +60% |
| 0.07-0.09 | +8% | +11% | +37% |
The RL agent learns to optimally exploit even marginal edges, with larger relative improvements in weaker signal environments.
10. Trade-Matrix Integration
10.1 System Architecture
The RL position sizing agent integrates with Trade-Matrix's event-driven architecture:
MLSignalEvent --> RL Position Sizing Agent --> RLSignalEvent
|
4-Tier Fallback
|
Circuit Breaker
|
Final Position Size --> Execution
10.2 Event Flow
- MLSignalEventV2: Contains prediction, confidence, and features
- Feature Extraction: Extract 25 state space features
- Tier Determination: Select fallback tier based on conditions
- Position Calculation: Apply tier-specific logic
- RLSignalEvent: Publish final position multiplier
10.3 Production Configuration
rl_position_sizing:
model_name: "rl_position_sizer"
model_version: "Production"
mlflow_uri: "http://mlflow:5000"
# Tier thresholds
confidence_high: 0.50
confidence_med: 0.30
ic_high: 0.05
ic_med: 0.03
# Circuit breaker
consecutive_loss_threshold: 5
daily_drawdown_limit: 0.05
recovery_wait_seconds: 3600
10.4 Monitoring Metrics
Key Prometheus metrics for production monitoring:
| Metric | Type | Description |
|---|---|---|
trade_matrix_rl_tier_active |
Gauge | Current fallback tier (1-4) |
trade_matrix_rl_rolling_ic |
Gauge | Rolling IC by instrument |
trade_matrix_rl_position_size |
Gauge | Current position multiplier |
trade_matrix_circuit_breaker_state |
Gauge | Circuit state (0=closed, 1=open, 2=half) |
trade_matrix_rl_inference_latency_ms |
Histogram | Inference latency |
10.5 Integration with Regime Detection
The RL agent is designed to receive regime information from the 4-state HMM (MS-GARCH):
regime_multipliers = {
0: 0.25, # Bear: 25% of normal sizing
1: 0.50, # Neutral: 50% of normal sizing
2: 0.67, # Bull: 67% of normal sizing
3: 0.17, # Crisis: 17% of normal sizing
}
final_position = rl_position * regime_multipliers[regime_id]
Integration Status (January 2026)
While MS-GARCH regime detection runs in production, the regime information is not currently passed to the RL agent's state space in live trading. The RL agent operates on Kelly features without real-time regime context. This represents a key integration gap--see "Current Limitations" section for the full roadmap.
The regime multipliers above are applied as a post-processing step on the fallback cascade output, providing risk reduction independent of the RL agent's learned policy.
10.6 Current Limitations and Integration Gaps
This section documents known limitations as of January 2026, providing transparency about what is and is not fully integrated.
Reward Function Gap
| Aspect | Research Design | Production Implementation |
|---|---|---|
| Reward | Kelly-convergent log-wealth | Weighted (0.4Sharpe + 0.3Return + 0.3*WinRate) |
| Theoretical Guarantee | Provable Kelly convergence | No convergence guarantee |
| Status | Planned enhancement | Current production |
Impact: Production RL achieves strong empirical performance (+47% vs Kelly) but lacks theoretical optimality guarantee.
MS-GARCH Integration Gaps
Five integration gaps between MS-GARCH regime detection and RL position sizing:
| Gap | Severity | Description | Estimated Fix |
|---|---|---|---|
| #1 | CRITICAL | RL agent not receiving regime_confidence in state space |
1.5h |
| #2 | MEDIUM | Circuit breaker test position not regime-aware | 30m |
| #3 | MEDIUM | RL confidence not calibrated to regime uncertainty | 45m |
| #4 | CRITICAL | MS-GARCH regime NOT passed to RL agent in live trading | 2h |
| #5 | MEDIUM | Threshold adaptation uses different confidence model | 1h |
Current State: MS-GARCH runs successfully and informs the fallback cascade's Kelly multipliers, but the RL agent itself does not observe regime information during inference. This limits the agent's ability to learn regime-specific position sizing policies.
Research Features Not Implemented
The following remain research topics (documented in "Research & Future Enhancements" section):
- Multi-asset portfolio optimization (Matrix Kelly formulation)
- Bayesian Neural Networks for uncertainty quantification
- Meta-learning (MAML) for rapid regime adaptation
- Hierarchical RL for multi-timescale decisions
- Full interpretability/explainability features
Planned Enhancements (Prioritized)
- Enable MS-GARCH regime in RL state (Gap #4) - Expected +10-15% improvement
- Add regime_confidence to fallback system (Gap #1) - Better uncertainty handling
- Implement Kelly-convergent reward - Theoretical optimality (optional, lower priority given strong empirical results)
11. Conclusion
11.1 Summary
This research presents a comprehensive framework for RL-based position sizing that addresses the fundamental challenges of applying reinforcement learning to financial decision-making. The key contributions include:
- Kelly-Adjusted Action Space: Anchoring RL to theoretical optimality improves sample efficiency and convergence
- Kelly-Convergent Reward: Log-wealth maximization provably converges to the Kelly fraction
- Curriculum Learning: 4-phase progressive training reduces time from 120 to 45 minutes
- 4-Tier Fallback Cascade: Graceful degradation ensures robust production operation
- Circuit Breaker Protection: Multi-trigger safety mechanism prevents catastrophic losses
The framework has been validated through extensive backtesting and is deployed in Trade-Matrix production for cryptocurrency perpetual futures trading.
11.2 Key Insights
- Theoretical Anchoring Matters: Learning adjustments to a theoretically optimal baseline outperforms learning raw positions
- Entropy Regularization is Critical: SAC's exploration mechanism is essential for weak signal environments
- Fallback Design is Engineering, Not Afterthought: The 4-tier cascade required as much design effort as the RL algorithm itself
- Production Safety Requires Multiple Layers: Circuit breakers, regime adjustment, and fallback tiers work together
11.3 Future Directions
See the "Research & Future Enhancements" section at the beginning of this document for detailed coverage of:
- Multi-asset portfolio optimization (Matrix Kelly)
- Model uncertainty via Bayesian Neural Networks
- Meta-learning for regime adaptation (MAML)
- Hierarchical RL for multi-timescale decisions
- Interpretability and explainability methods
References
-
Kelly, J.L. (1956). A New Interpretation of Information Rate. Bell System Technical Journal, 35(4), 917-926.
-
Thorp, E.O. (2006). The Kelly Criterion in Blackjack Sports Betting and the Stock Market. Handbook of Asset and Liability Management.
-
Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor. ICML.
-
Moody, J., & Saffell, M. (2001). Learning to Trade via Direct Reinforcement. IEEE Transactions on Neural Networks, 12(4), 875-889.
-
Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum Learning. ICML.
-
Grinold, R.C., & Kahn, R.N. (2000). Active Portfolio Management. McGraw-Hill.
-
MacLean, L.C., Thorp, E.O., & Ziemba, W.T. (2011). The Kelly Capital Growth Investment Criterion. World Scientific.
-
Jiang, Z., Xu, D., & Liang, J. (2017). A Deep Reinforcement Learning Framework for the Financial Portfolio Management Problem. arXiv.10059.
-
Fowler, M. (2014). CircuitBreaker. martinfowler.com.
-
Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable. Random House.
This research was conducted by the Trade-Matrix Quantitative Research Team. The framework is production-deployed and continuously validated against live market data.
