ML and Cloud deployment
-
Anyone with ML experience, can you look this over? I want to make sure I get this right!

# COLOSSUS Hyperparameter Tuning Guide
Current Defaults
| Parameter | Day Mode | Night Mode | Notes |
|---|---|---|---|
| Workers | 8 | 8 | 1 per physical core (Ryzen 7 7800X3D) |
| MCTS Simulations | 100 | 200 | Searches per move |
| Batch Size | 256 | 512 | Training batch size |
| Learning Rate | 0.001 | 0.001 | Adam optimizer |
| Temperature (early) | 1.0 | 1.0 | Exploration in first 30 moves |
| Temperature (late) | 0.1 | 0.1 | Exploitation after 30 moves |
| Max Moves | 300 | 300 | Game length cap |
| Buffer Capacity | 50K | 50K | Training sample buffer |
Mercy Rule (TUV + IPC Victory Condition)
The mercy rule ends games early when one side has a decisive advantage, providing clear win/loss training signals instead of relying solely on IPC tiebreakers.
Formula
```
Score = Total_Unit_Value + (IPC_Income × income_weight)
```
Parameters
| Parameter | Default | Description |
|---|---|---|
| min_moves | 50 | Minimum moves before mercy can trigger |
| score_ratio_threshold | 1.05 | Required improvement from baseline (5%) |
| income_weight | 3.0 | Multiplier for IPC income in score |

Baseline Adjustment
The starting position favors Allies economically:
- Axis starting score: 824 (TUV 629 + IPC 65×3)
- Allies starting score: 977 (TUV 689 + IPC 96×3)
- Baseline ratio: 0.84 (Axis/Allies)
The mercy rule compares against this baseline, so:
- Axis wins if: `current_ratio / 0.84 >= 1.05` → ratio ≥ 0.882
- Allies wins if: `0.84 / current_ratio >= 1.05` → ratio ≤ 0.800
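A minimal sketch of this check in Python (the names, constants, and structure here are illustrative; the project's actual `MercyRule.check()` in rewards.py may differ):

```python
# Hedged sketch of the mercy-rule check described above.
BASELINE_RATIO = 824 / 977   # Axis/Allies starting scores ≈ 0.84
THRESHOLD = 1.05             # score_ratio_threshold
INCOME_WEIGHT = 3.0          # income_weight

def score(tuv, ipc_income):
    return tuv + ipc_income * INCOME_WEIGHT

def mercy_winner(axis_tuv, axis_ipc, allies_tuv, allies_ipc, move_count, min_moves=50):
    """Return 'axis', 'allies', or None if the mercy rule does not trigger."""
    if move_count < min_moves:
        return None
    current_ratio = score(axis_tuv, axis_ipc) / score(allies_tuv, allies_ipc)
    if current_ratio / BASELINE_RATIO >= THRESHOLD:    # ratio ≥ ~0.882 → Axis win
        return "axis"
    if BASELINE_RATIO / current_ratio >= THRESHOLD:    # ratio ≤ ~0.800 → Allies win
        return "allies"
    return None
```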
Why 1.05x Threshold?
Analysis of trained-network self-play showed that score ratios swing by about ±4% from the baseline:
- Trained games stay within 0.80-0.87 ratio range
- 1.2x threshold (old) required ratio ≥ 1.008 for Axis win → never triggered
- 1.05x threshold allows decisive advantages to end games early
Results with Random Play
| Threshold | Axis Wins | Allies Wins | Draws | Notes |
|---|---|---|---|---|
| 1.2x | 26% | 16% | 58% | Too conservative for trained play |
| 1.1x | 28% | 30% | 2% | Good balance |
| 1.05x (default) | 95% | 5% | 0% | Random play favors Axis |

Note: Random play heavily favors Axis. With trained networks playing balanced games, expect closer to a 50/50 Axis/Allies split.
Tuning Guidelines
| If you see... | Try adjusting... |
|---|---|
| Still 100% draws | Lower threshold to 1.03x |
| Too many early mercy wins | Increase min_moves to 100 |
| Unbalanced win rates | Check baseline calculation |
| AI exploiting mercy rule | Raise threshold to 1.1x or 1.2x |

Curriculum Learning (Future)
As training progresses, tighten the mercy rule (a sketch of one possible schedule follows this list):
- Phase 1 (early): threshold=1.2x (easy wins, frequent signal)
- Phase 2 (mid): threshold=1.5x (harder to trigger)
- Phase 3 (late): threshold=2.0x or disable (require actual VC capture)
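A minimal sketch of how such a schedule could be wired up (the game-count boundaries are placeholders, not values from this guide):

```python
# Hedged sketch of a mercy-threshold curriculum. The trainer is assumed to
# know how many self-play games have completed; boundaries are placeholders.

def mercy_threshold_for(games_played: int):
    """Return the score_ratio_threshold to use, or None to disable mercy."""
    if games_played < 1_000:        # Phase 1: easy wins, frequent signal
        return 1.2
    elif games_played < 10_000:     # Phase 2: harder to trigger
        return 1.5
    else:                           # Phase 3: near-total dominance required
        return 2.0                  # or None to require actual VC capture
```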
When to Tune What
Phase 1: Getting It Working (You Are Here)
Don't tune yet. Use defaults until you confirm:
- Model beats random (>60% win rate)
- Loss is decreasing
- No degenerate strategies
Phase 2: Initial Optimization
Once basics work, try:
| If you see... | Try adjusting... |
|---|---|
| Loss stuck high (>5.0) | Lower learning rate (0.0003) |
| Loss drops then spikes | Lower learning rate, add warmup |
| Axis wins 90%+ | Check game balance, maybe remove German bid |
| Draws 50%+ | Increase max_moves, check mercy rule |
| Very slow training | Reduce simulations (50), keep workers at 8 |

Phase 3: Serious Training
After 10K+ games, consider:
| Parameter | When to increase | When to decrease |
|---|---|---|
| Simulations | Model plateaued, need deeper search | Training too slow |
| Batch size | Stable training, want faster | Loss unstable |
| Learning rate | Training too slow | Loss unstable/spiking |
| Temperature | Too deterministic, missing good moves | Too random, not converging |
Specific Recommendations
Learning Rate
- 0.001 - Default, good starting point
- 0.0003 - If loss is unstable
- 0.0001 - Fine-tuning after initial training

MCTS Simulations
- 50 - Fast iteration, early experiments
- 100 - Default day mode (balanced)
- 200 - Night mode (better quality)
- 400+ - Only if you have time and see benefit

Batch Size
- 128 - If running out of GPU memory
- 256 - Default (good for 17GB VRAM)
- 512 - Night mode, faster training

Workers (Self-Play)
- 4 - Light usage (gaming, browsing)
- 6 - Medium usage (some background tasks)
- 8 - Optimal for Ryzen 7 7800X3D (1 per physical core)

Important: 8 workers = 1 per physical core. More workers cause CPU contention and slower training due to context-switching overhead. Testing showed 8 workers outperform 10-14 workers on 8-core CPUs.
Memory Management
Buffer Capacity
| Setting | Memory Usage | Notes |
|---|---|---|
| 500K (old default) | ~210 GB | WILL CRASH - impossible |
| 100K | ~44 GB | Too large for 32GB RAM |
| 50K (current default) | ~22 GB | Safe for 32GB RAM |
| 25K | ~11 GB | Conservative, for smaller systems |

Each training sample is ~440KB due to the 104,729-action space.
Formula:
```
memory_gb ≈ buffer_capacity × 0.00044
```
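A quick sanity check of that formula for the buffer sizes in the table above (plain arithmetic, not project code):

```python
# memory_gb ≈ buffer_capacity × 0.00044 (≈ 440 KB per sample)
for capacity in (25_000, 50_000, 100_000, 500_000):
    print(f"{capacity:>7} samples ≈ {capacity * 0.00044:5.0f} GB")
# → 11 GB, 22 GB, 44 GB, and ~220 GB respectively
```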
Memory Leak Signs
- Memory usage climbing steadily over hours
- System slowdown after 2-3 hours
- Windows "low virtual memory" warnings
- BSOD with CRITICAL_PROCESS_DIED
Solutions
- Set `--buffer-capacity 50000` (or lower)
- Monitor with Task Manager during training (a programmatic check is sketched below)
- Restart training if memory exceeds 28GB
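A minimal, hedged sketch of a programmatic memory check using psutil (already in the project's dependency list); the 28 GB limit matches the guideline above, and `save_checkpoint_and_exit()` is a hypothetical stand-in for whatever shutdown path the training loop actually uses:

```python
import psutil

MEMORY_LIMIT_GB = 28  # restart threshold recommended above for a 32GB system

def memory_ok() -> bool:
    """Return False once system memory use approaches the restart threshold."""
    used_gb = psutil.virtual_memory().used / 1024**3
    if used_gb > MEMORY_LIMIT_GB:
        print(f"Memory at {used_gb:.1f} GB (> {MEMORY_LIMIT_GB} GB) - time to restart")
        return False
    return True

# Hypothetical use inside the training loop:
# if not memory_ok():
#     save_checkpoint_and_exit()  # placeholder for the real shutdown path
```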
Warning Signs
Model Not Learning
- Loss stays >10 after 1000 steps → Check data pipeline
- Loss oscillates wildly → Reduce learning rate
- 0% or 100% win rate → Game mechanics bug
Degenerate Strategies
- >50% passes → Model learned "doing nothing is safe"
- All games draw → Peace treaty equilibrium (check mercy rule is enabled)
- Games <50 moves → Suicide attacks or mercy threshold too low
- 100% Axis or Allies wins → Mercy rule baseline may be miscalibrated
- Alternating Axis/Allies wins → Peace Treaty 2.0 (see below)
Peace Treaty 2.0 (Mercy Rule Gaming)
The mercy rule solves the original "peace treaty" where neither side attacks and all games draw. But there's a subtler exploit:
The Problem: In self-play, the same network plays both sides. It could learn to trade mercy wins - "I'll tank my TUV to let you win this game, you do the same next game." The network doesn't distinguish between Axis and Allies identities across games.
Detection Signs:
| Pattern | What It Means |
|---|---|
| Win rate oscillates: 80% Axis → 80% Allies → 80% Axis | Network cycling between strategies |
| Mercy triggers at exactly min_moves (50) | Intentionally fast losses |
| One side's TUV drops to near-zero quickly | Deliberate unit sacrifice |
| Suspiciously balanced 50/50 Axis/Allies wins | Too perfect to be real learning |
| Low game variance (all games look similar) | Memorized "trade" pattern |

How to Detect Programmatically:
```python
# In analyze_progress.py or training loop
# (count_streaks, avg_streak_length, win_rate_variance, last_100_games and
#  mercy_games are assumed analysis helpers/inputs)

# Check for alternating win streaks
recent_winners = [game.winner for game in last_100_games]
axis_streaks = count_streaks(recent_winners, 0)    # Count consecutive Axis wins
allies_streaks = count_streaks(recent_winners, 1)

# Suspicious if we see many short alternating streaks
if avg_streak_length < 3 and win_rate_variance < 0.1:
    print("WARNING: Possible Peace Treaty 2.0 detected")

# Check mercy trigger timing
mercy_moves = [game.end_move for game in mercy_games]
if np.std(mercy_moves) < 10:  # All mercy at same move count
    print("WARNING: Suspiciously consistent mercy timing")
```

Prevention Strategies:
| Strategy | Implementation | Tradeoff |
|---|---|---|
| Asymmetric rewards | Axis win = +1.0, Allies win = +0.9 | Breaks symmetry, may bias learning |
| Minimum game length | Raise min_moves from 50 → 100 | Slower training, but harder to game |
| TUV floor check | No mercy if loser TUV > 50% of start | Prevents deliberate tanking |
| Streak detection | Pause training if alternating pattern detected | Reactive, not preventive |
| Diverse opponents | Play against older checkpoints (not just self) | Best solution, more complex |

Recommended Fix (TUV Floor):
Add to mercy rule: Don't trigger mercy win if the "losing" side still has significant forces.
```python
# In rewards.py MercyRule.check()
loser_tuv_ratio = loser_tuv / loser_starting_tuv

if loser_tuv_ratio > 0.5:
    # Loser still has >50% of starting army
    # This is a legitimate beatdown, allow mercy
    return (winner, reason)
else:
    # Loser's army evaporated suspiciously fast
    # Could be intentional tanking - don't trigger mercy
    return None
```

When to Worry:
- Early training (first 1000 games): Don't worry, randomness dominates
- Mid training (1000-10000 games): Watch for patterns emerging
- Late training (10000+ games): If 50/50 split persists with low variance, investigate
Overfitting
- Loss decreases but eval win rate drops → Reduce training, add regularization
- Training loss << validation loss → More diverse self-play
The "Suicide Loop"
The AI learns that losing quickly is less painful than dragging out a loss.
Detection: Games ending in <30 moves, one side's units disappear immediately.
Fix: Add small per-move survival bonus to reward function, or raise min_moves for mercy.
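A hedged sketch of the survival-bonus idea (the function, the `SURVIVAL_BONUS` name, and the 0.001 value are illustrative placeholders, not project code):

```python
# Illustrative reward shaping to discourage the "suicide loop":
# a tiny per-move bonus makes a slow loss score slightly better than a
# fast one, without swamping the ±1 win/loss signal.

SURVIVAL_BONUS = 0.001  # placeholder; keep the accumulated bonus small vs. ±1

def shaped_result(raw_result: float, moves_survived: int) -> float:
    """raw_result is +1 for a win, -1 for a loss, 0 for a draw."""
    if raw_result < 0:
        # Losing on move 20 now scores worse than holding out to move 200.
        return raw_result + SURVIVAL_BONUS * moves_survived
    return raw_result
```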
The "Hyperparameter Mismatch"
Learning rate too high → AI forgets everything every few minutes.
Detection: Loss oscillates wildly, win rate swings between 0% and 100%.
Fix: Lower learning rate to 0.0003 or 0.0001.
Command Examples
```bash
# Conservative (stable but slow)
python scripts/train.py --lr 0.0003 --simulations 50 --workers 8

# Aggressive (faster but riskier)
python scripts/train.py --lr 0.003 --batch-size 512 --workers 8

# Debug mode (fast iteration)
python scripts/train.py --duration 600 --simulations 25 --workers 4

# Full night run
python scripts/train.py --night --simulations 200 --workers 8

# Memory-safe long run
python scripts/train.py --night --buffer-capacity 50000 --workers 8
```
Monitoring Checklist
Every training run, check:
- Loss trending down (not stuck or spiking)
- Win rates not extreme (20-80% range is healthy)
- Games completing (not all hitting max moves)
- Mercy rule triggering (expect 30-50% with trained network)
- Mercy timing varies (not all at exactly min_moves)
- No alternating Axis/Allies win streaks (Peace Treaty 2.0)
- Passes reasonable (<30% of moves)
- Game lengths have healthy variance (not all identical)
- Memory usage stable (<28GB for 32GB system)
After training:
```bash
python scripts/analyze_progress.py
python scripts/watch_game.py --checkpoint checkpoints/latest.pt --speed fast
python scripts/evaluate.py --checkpoint checkpoints/latest.pt --games 20
```
Neural Network Encoding Limitations
Current Cargo Encoding (Channels 48-53)
The network sees aggregate cargo counts per territory, normalized to a max of 4 (a sketch of this encoding follows the channel list):
- Channels 48-49: Our transport cargo (infantry / other)
- Channels 50-51: Enemy transport cargo (infantry / other)
- Channels 52-53: Carrier fighters (ours / enemy)
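A minimal sketch of how these aggregate channels could be filled, assuming a [54, 12, 12] board tensor (as used in the Titans section below) and transport objects exposing `.infantry_cargo` / `.other_cargo` counts; the project's real encoder may differ:

```python
CARGO_NORM_CAP = 4  # aggregate counts are capped and normalized to a max of 4

def fill_cargo_channels(board, zone_xy, our_transports, enemy_transports,
                        our_carrier_fighters, enemy_carrier_fighters):
    """Write aggregate cargo counts into channels 48-53 for one sea zone (sketch)."""
    x, y = zone_xy

    def norm(count):
        return min(count, CARGO_NORM_CAP) / CARGO_NORM_CAP

    board[48, x, y] = norm(sum(t.infantry_cargo for t in our_transports))
    board[49, x, y] = norm(sum(t.other_cargo for t in our_transports))
    board[50, x, y] = norm(sum(t.infantry_cargo for t in enemy_transports))
    board[51, x, y] = norm(sum(t.other_cargo for t in enemy_transports))
    board[52, x, y] = norm(our_carrier_fighters)
    board[53, x, y] = norm(enemy_carrier_fighters)
```

Because each channel holds a single number per sea zone, two half-loaded transports look the same as one full transport, which is exactly the "aggregate counts only" limitation in the table below.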
Known Limitations
| Limitation | Strategic Impact | Priority |
|---|---|---|
| No per-transport cargo visibility | Can't plan which transport to unload first | Medium |
| No transport capacity remaining | Can't see if transport is full (2 units) or has space | High |
| No carrier capacity remaining | Can't see if carrier has 0/1/2 fighters | Medium |
| Aggregate counts only | Multiple transports in same sea zone appear as one blob | Medium |
| "Other" cargo lumps unit types | Artillery, armor, AA guns indistinguishable in cargo | Low |
| Normalization cap of 4 | Large fleets (5+ transports) lose precision | Low |

Future Encoding Improvements
If training shows transport/carrier coordination issues:
- Per-unit cargo channels: Separate channel for each transport's cargo state
- Capacity remaining channels: Binary flags for "has space" vs "full"
- Distinct unit encoding: Track which specific units are loaded where
- Separate "other" cargo types: Artillery vs armor vs AA gun distinction
AA Gun Transport Rules
Summary (verified and tested 2026-01):
| Action | Combat Move | Non-Combat Move |
|---|---|---|
| Load AA gun on transport | Not allowed | Allowed |
| Unload AA gun from transport | Not allowed | Allowed |
| Amphibious assault with AA gun | Allowed (as cargo) | N/A |

Details:
- AA guns CAN load on transports during NCM

- AA guns CANNOT load on transports during Combat Move

- AA guns CAN unload during NCM

- AA guns CAN participate in amphibious assault (unloaded as non-combatant)

- AA guns CANNOT move independently during Combat Move (no combat movement for AA)
This is enforced by the game engine in `gen_transport_loads_combat_dedup`, which excludes AA guns, while `gen_transport_loads_dedup` (NCM) includes them.

Test coverage: See `tests/integration_tests.rs` and the unit tests in `src/moves/generation_fixed.rs`.
Titans Memory Integration (FUTURE)
Status: PLANNING - Implement after 5,000+ games baseline established
Paper: "Titans: Learning to Memorize at Test Time" (Behrouz, Zhong, Mirrokni - Google Research, December 2024)
Implementation: `lucidrains/titans-pytorch` (MIT License, 1.5k+ stars)

Overview
Titans is a "surprise-based neural memory" architecture that enables test-time learning. This is a "brain transplant" rather than a full rebuild: we keep the MCTS chassis but replace the static neural network with one that adapts during gameplay.
In vanilla AlphaZero, the neural network is frozen during gameplay. It learned patterns from millions of self-play games but cannot adapt to:
- Opponent-specific tendencies - Is this player aggressive? Defensive? Risk-tolerant?
- Strategic surprises - Unusual openings, unconventional purchases
- Game-specific adaptations - Adjusting mid-game when something unexpected happens
Why Titans for COLOSSUS?
| Feature | Standard AlphaZero | Titans-Enhanced |
|---|---|---|
| Network weights during game | Static | Dynamic (memory module updates) |
| Opponent modeling | None | Implicit (learns from surprises) |
| Adaptation speed | Zero | Real-time (after each opponent move) |
| Memory of game history | CNN sees last N states | Neural long-term memory |
| "Surprise" awareness | None | Quantified (gradient of prediction error) |

A&A-Specific Benefits:
- Purchase Phase Adaptation: Opponent buys 6 bombers → High surprise → Memory updates → Value network shifts toward anti-air strategies
- Risk Tolerance Modeling: Opponent attacks with 30% win probability → Memory encodes "opponent is risk-seeking" → MCTS values "bait" moves higher
- Strategic Flexibility: Russia stacks Ukraine instead of expected Caucasus defense → AI adjusts strategic evaluation for remainder of game
- Breaking Peace Treaty Pattern: Surprising aggressive moves become memorable, encouraging counter-play
Architecture
Current COLOSSUS Architecture
```
MCTS Engine
  For each simulation:
    1. Select (UCB)
    2. Expand
    3. Evaluate → Query Neural Network (STATIC)
    4. Backpropagate
          ↓
  Policy Head → Move probabilities
  Value Head  → Win probability
  (ResNet/CNN) [FROZEN]
```

Titans-Enhanced Architecture
```
MCTS Engine
  For each simulation:
    1. Select (UCB)
    2. Expand
    3. Evaluate → Query Titans Network
    4. Backpropagate
  [Memory LOCKED during thinking]
          ↓
  TITANS ARCHITECTURE
    Short-Term Memory    Long-Term Memory    Persistent Memory
    (Attention)          (Neural MLP)        (Fixed)
    [Window=128]         [UPDATES!]          [Task]
          ↓                    ↓                   ↓
          Policy + Value Heads + Surprise Metric
          ↓
  AFTER OPPONENT MOVES:
    1. Calculate prediction vs actual
    2. Compute surprise (gradient loss)
    3. Backprop to Long-Term Memory ONLY
    4. Memory weights updated for next turn
```
Hyperparameters
Memory Module Configuration
| Parameter | Default | Range | Notes |
|---|---|---|---|
| enabled | false | - | Enable after base training complete |
| type | MemoryMLP | MemoryMLP, FactorizedMemoryMLP | 2-layer MLP recommended |
| dim | 384 | 256-512 | Match board embedding size |
| num_layers | 2 | 1-4 | More = more expressive, slower |
| chunk_size | 64 | 32-128 | History window per update |

Surprise Mechanism
| Parameter | Default | Range | Notes |
|---|---|---|---|
| learning_rate | 0.01 | 0.001-0.1 | Memory update step size |
| min_threshold | 0.1 | 0.0-0.5 | Skip tiny surprises |
| max_gradient | 1.0 | 0.5-2.0 | Clip extreme surprises |

Integration Flags
| Parameter | Default | Notes |
|---|---|---|
| lock_during_mcts | true | Don't learn from simulations |
| update_on_opponent_move | true | Core mechanism |
| reset_between_games | true | Avoid opponent overfitting |

Full YAML Configuration
```yaml
titans:
  enabled: false                    # Enable after base training complete
  memory_module:
    type: "MemoryMLP"               # MemoryMLP, MemoryAttention, etc.
    dim: 384                        # Match board embedding dimension
    num_layers: 2                   # MLP depth (2 recommended by lucidrains)
    chunk_size: 64                  # History window for processing
  surprise:
    learning_rate: 0.01             # Memory update step size
    min_threshold: 0.1              # Don't update if surprise below this
    max_gradient: 1.0               # Clip large surprises
  integration:
    lock_during_mcts: true          # Don't learn from simulations
    update_on_opponent_move: true   # Core surprise mechanism
    update_on_own_move: false       # Usually not needed
  history:
    max_length: 300                 # Max game states to track
    include_purchases: true         # Track purchase decisions
    include_combat_results: true    # Track battle outcomes
```
Implementation
Installation
```bash
pip install titans-pytorch
```

Core Imports
```python
from titans_pytorch import NeuralMemory, MemoryAsContextTransformer

# Memory models available:
from titans_pytorch import (
    MemoryMLP,               # Simple 1-4 layer MLP (paper default)
    MemoryAttention,         # Attention-based memory
    FactorizedMemoryMLP,     # Efficient factorized version
    MemorySwiGluMLP,         # SwiGLU activation variant
    GatedResidualMemoryMLP   # With residual connections
)
```

Board State Encoding (Keep Current)
```python
# Current encoding (unchanged)
board_tensor = encode_board(game_state)   # Shape: [1, 54, 12, 12]

# Flatten for Titans
board_flat = board_tensor.view(1, -1)     # Shape: [1, 7776]

# Project to Titans dimension
embedding = self.projection(board_flat)   # Shape: [1, 384]
```

History Sequence
```python
class GameHistory:
    def __init__(self, embedding_dim=384, max_length=300):
        self.states = []
        self.dim = embedding_dim

    def add_state(self, board_embedding):
        self.states.append(board_embedding)

    def get_sequence(self):
        if not self.states:
            return torch.zeros(1, 1, self.dim)
        return torch.stack(self.states, dim=1)  # [1, T, dim]
```

The Surprise Calculation (Key Innovation)
```python
def calculate_surprise(network, board_before_opponent, actual_opponent_move):
    """
    Core Titans mechanism: How surprised was the AI by opponent's move?

    High surprise → Large gradient → Memory updates significantly
    Low surprise  → Small gradient → Memory mostly unchanged
    """
    with torch.enable_grad():
        # Get prediction BEFORE opponent moved
        policy_pred, value_pred, _ = network(board_before_opponent)

        # What probability did we assign to their actual move?
        move_idx = encode_move(actual_opponent_move)
        predicted_prob = policy_pred[0, move_idx]

        # Surprise = negative log probability (cross-entropy style)
        # If we predicted 90%  → low surprise
        # If we predicted 0.1% → high surprise
        surprise_loss = -torch.log(predicted_prob + 1e-8)

    return surprise_loss


def update_memory(network, surprise_loss, learning_rate=0.01):
    """Update ONLY the memory module weights, not the full network."""
    network.memory_module.zero_grad()
    surprise_loss.backward()

    with torch.no_grad():
        for param in network.memory_module.parameters():
            if param.grad is not None:
                param.data -= learning_rate * param.grad
```

Game Loop Integration
```python
class TitansEnhancedMCTS:
    def __init__(self, network):
        self.network = network
        self.history = GameHistory()

    def play_turn(self, game_state):
        # === THINK PHASE ===
        # Lock memory during MCTS (don't learn from imagination)
        self.network.memory_module.eval()

        # Standard MCTS search
        best_move = self.mcts_search(game_state, simulations=200)

        # Store state BEFORE our move
        self.board_before_move = encode_board(game_state)

        return best_move

    def observe_opponent_move(self, opponent_move, new_state):
        # === SURPRISE PHASE ===
        # Calculate how unexpected opponent's move was
        surprise = calculate_surprise(
            self.network,
            self.board_before_move,
            opponent_move
        )

        # Update memory based on surprise
        self.network.memory_module.train()
        update_memory(self.network, surprise)

        # Add to history for context
        self.history.add_state(encode_board(new_state))

        # Log for analysis
        print(f"Opponent move surprise: {surprise.item():.4f}")
```
Three Titans Variants
The paper presents three ways to incorporate memory. For COLOSSUS, we recommend MAC (Memory as Context):
1. Memory as Context (MAC) - RECOMMENDED
```
History → [Neural Memory] → context
Current → [Attention]     → query
(context, query) → [Combine] → Policy/Value
```

Why for A&A: Game history matters. What territories changed hands, what was purchased - this context informs current decisions. (A minimal combine sketch follows the three variants below.)
2. Memory as Layer (MAL)
```
Input → [Memory Layer] → [Attention Layer] → ... → Output
```

Better for: Very long sequences (2M+ tokens). Overkill for A&A games.
3. Memory as Gate (MAG)
```
Input → [Memory Branch] ─┐
      → [Attention]     ─┼→ [Gated Combine] → Output
```

Better for: When you need fine-grained control over memory influence.
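To make the recommended MAC wiring concrete, here is a minimal, hedged sketch of combining a memory-produced context with the current board embedding before the heads. This is not the lucidrains API, just an illustration of the data flow; names and sizes are placeholders (dims follow the `dim: 384` default above, action space from the encoding section):

```python
import torch
import torch.nn as nn

class MACHeads(nn.Module):
    """Illustrative Memory-as-Context combine: [history context | current query] → heads.

    Assumes the memory module has already summarized game history into a
    384-dim context vector; this is a sketch, not project code.
    """
    def __init__(self, dim=384, action_space=104_729):
        super().__init__()
        self.combine = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.policy_head = nn.Linear(dim, action_space)
        self.value_head = nn.Linear(dim, 1)

    def forward(self, current_embedding, memory_context):
        # Concatenate the current-state query with the long-term memory context
        x = self.combine(torch.cat([current_embedding, memory_context], dim=-1))
        return self.policy_head(x), torch.tanh(self.value_head(x))
```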
Pre-Training Requirements
CRITICAL: Titans surprise-based learning only works if the AI already knows what's "normal". You must:
- Train base model first (current COLOSSUS training)
  - 10,000+ self-play games minimum
  - Network learns rules, basic strategy
  - This is your "persistent memory" foundation
- Then enable surprise updates
  - Network can now detect deviations from learned patterns
  - Memory module adapts to specific opponents
  - Value shifts reflect game-specific surprises
Implementation Phases
Phase 1: Validation (Current Priority)
- Continue current training to 5,000+ games
- Validate learning is happening (loss decreasing)
- Resolve peace treaty pattern
- Establish baseline performance metrics
Phase 2: Titans Infrastructure (After Baseline)
- Install titans-pytorch: `pip install titans-pytorch`
- Create TitansNetwork wrapper class
- Implement GameHistory tracking
- Add surprise calculation utilities
- Unit tests for memory updates
Phase 3: Integration (Careful)
- Modify MCTS to use Titans network
- Implement memory locking during search
- Add post-opponent-move surprise calculation
- Test on single games first
- Monitor memory weight changes
Phase 4: Training (New Paradigm)
- Train base model (frozen memory) - 10,000 games
- Enable memory updates during inference only
- Test against vanilla COLOSSUS
- Measure adaptation effectiveness
Phase 5: Optimization
- Tune memory learning rate (0.001 - 0.1 range)
- Experiment with memory architectures (MLP layers)
- Adjust chunk_size for A&A game length
- Profile memory/compute overhead
Expected Outcomes
| Metric | Without Titans | With Titans (Expected) |
|---|---|---|
| Adaptation to unusual openings | None | Within 3-5 turns |
| Opponent tendency modeling | None | Implicit after 10+ moves |
| Response to strategic surprises | Fixed policy | Dynamic adjustment |
| "Stuck in local minima" games | Common | Reduced (surprise breaks patterns) |

A&A-Specific Scenarios
- Germany buys navy instead of land units
  - Current: AI follows trained policy regardless
  - Titans: High surprise → Memory updates → UK/US naval strategy shifts
- Japan ignores India, attacks Australia
  - Current: AI continues India-focused defense
  - Titans: Surprise registered → Pacific defense prioritized
- Russia trades Ukraine aggressively
  - Current: Standard Eastern Front evaluation
  - Titans: Risk-seeking behavior encoded → AI sets traps
Known Challenges
- PyTorch Functional Transforms: The `titans-pytorch` library uses `torch.func.grad`, which has compatibility issues with some setups. May need:

  ```python
  torch._C._jit_set_profiling_mode(False)
  torch._C._jit_set_profiling_executor(False)
  ```

- Memory Overhead: Neural memory adds parameters. Monitor GPU memory during MCTS (many forward passes). Memory state size grows with game length.

- Overfitting to Opponent: Risk that the AI adapts TOO much to one opponent's style and becomes exploitable. Mitigation: decay memory updates over time, reset memory between games, train on diverse self-play opponents (a small sketch follows).
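A hedged sketch of the "decay and reset" mitigations, reusing `update_memory(...)` and `network.memory_module` from the integration code above; the class name and decay schedule are illustrative placeholders:

```python
import copy

class MemoryHygiene:
    """Illustrative mitigations: decay the surprise learning rate within a game
    and restore pristine memory weights between games (sketch, not project code)."""

    def __init__(self, network, base_lr=0.01, decay=0.97):
        self.network = network
        self.base_lr = base_lr
        self.decay = decay
        # Snapshot the memory module so it can be reset between games
        self.initial_memory = copy.deepcopy(network.memory_module.state_dict())
        self.updates_this_game = 0

    def decayed_lr(self) -> float:
        # Later updates in a game move memory less than early ones
        return self.base_lr * (self.decay ** self.updates_this_game)

    def apply_update(self, surprise_loss):
        update_memory(self.network, surprise_loss, learning_rate=self.decayed_lr())
        self.updates_this_game += 1

    def reset_between_games(self):
        self.network.memory_module.load_state_dict(self.initial_memory)
        self.updates_this_game = 0
```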
Decision Point
| Condition | Action |
|---|---|
| Current training < 5,000 games | Wait. Build baseline first. |
| Peace treaty pattern persists after 10k games | Consider Titans to break equilibrium |
| Want opponent-adaptive AI for human play | Titans is the answer |
| Cloud training budget available | Train base → add Titans layer |
Resources
Papers:
- Titans: Learning to Memorize at Test Time - Core paper
- MIRAS - Theoretical framework
- Test-Time Training Done Right - Related approach
Code:
- lucidrains/titans-pytorch - MIT licensed implementation
COLOSSUS Cloud Deployment Plan
Budget: $100
Goal: 20,000-50,000 games
Timeline: Week 3 (after PC validation)
Pre-Cloud Checklist
Complete these BEFORE spending money:
| Task | Status | Notes |
|---|---|---|
| 5,000 games on PC | | ~2 weeks at current rate |
| Watch games with watch_game.py | | Verify AI is learning, not broken |
| Win rate not 100% draws | | Some Axis/Allied wins appearing |
| Test in WSL | | Catch Linux bugs free |
| Checkpoint upload working | | Don't lose work if instance dies |
| Git repo ready | | Push code to GitHub/GitLab |

DO NOT proceed to cloud until all boxes are checked!
Phase 1: WSL Testing (Free)
Test on Linux before paying for cloud:
```bash
# 1. Enable WSL (Windows Terminal as admin)
wsl --install

# 2. Open Ubuntu
wsl

# 3. Install dependencies
sudo apt update
sudo apt install -y build-essential python3-pip curl

# 4. Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env

# 5. Copy project (from Windows path)
cp -r /mnt/c/colossus ~/colossus
cd ~/colossus

# 6. Install Python deps
pip3 install torch numpy maturin psutil

# 7. Build Rust extension
maturin develop --release

# 8. Run quick test
python3 scripts/train.py --workers 4 --hours 0.5

# 9. Run full test suite
cargo test --all
```

If this works, cloud will work.
Phase 2: Checkpoint Cloud Sync
Add automatic checkpoint upload so you don't lose progress:
Option A: Google Drive (Recommended - You already use it)
Install rclone and configure:
```bash
# Install rclone
curl https://rclone.org/install.sh | sudo bash

# Configure Google Drive
rclone config
# Follow prompts to add "gdrive" remote

# Test upload
rclone copy checkpoints/latest.pt gdrive:colossus/checkpoints/
```

Add to training script (auto-upload every checkpoint):
```python
# In async_pipeline.py after saving checkpoint:
import subprocess

subprocess.run([
    "rclone", "copy",
    "checkpoints/",
    "gdrive:colossus/checkpoints/",
    "--quiet"
])
```

Option B: Simple SCP (manual but reliable)
After training stops:
```bash
# From your Windows machine:
scp -r user@cloud-ip:~/colossus/checkpoints ./cloud_checkpoints/
```
Phase 3: Cloud Provider Setup
Recommended: Vast.ai
Best price for your budget.
- Create account: https://vast.ai
- Add $100 credits
- Find instance:
  - GPU: RTX 4090 or A100
  - CPU: 32+ cores
  - RAM: 64GB+
  - Storage: 50GB+
  - Price: $0.30-0.80/hr
Instance Selection
| GPU | $/hr | Cores | For $100 | Best For |
|---|---|---|---|---|
| RTX 4090 | $0.30-0.50 | 32 | 200-300 hrs | Best value |
| A100 40GB | $0.80-1.20 | 64 | 80-125 hrs | Max speed |
| RTX 3090 | $0.20-0.35 | 16-32 | 280-500 hrs | Budget |

Recommendation: RTX 4090 with 32+ CPU cores at ~$0.40/hr = 250 hours = ~10 days
Phase 4: Cloud Training
One-Time Setup
```bash
# SSH into instance
ssh -i your_key root@instance_ip

# Run setup script
curl -sSL https://raw.githubusercontent.com/YOUR_USERNAME/colossus/main/scripts/cloud_setup.sh | bash

# OR manual:
git clone https://github.com/YOUR_USERNAME/colossus.git
cd colossus
pip install torch numpy maturin
maturin develop --release
```

Upload Your Checkpoint (Continue Training)
```bash
# From your Windows machine, upload current checkpoint:
scp C:\colossus\checkpoints\latest.pt root@instance_ip:~/colossus/checkpoints/
```

Start Training
```bash
# Use screen (stays running after disconnect)
screen -S training

# Start with more workers (cloud has more CPU cores)
cd ~/colossus
python scripts/train.py \
    --workers 24 \
    --simulations 100 \
    --hours 240 \
    --resume checkpoints/latest.pt

# Detach: Ctrl+A then D
# Reconnect: screen -r training
```

Monitor
```bash
# New SSH session
screen -r training   # Watch live

# Or check logs
tail -f checkpoints/training.log
```
Phase 5: Download Results
When done or budget running low:
```bash
# From Windows, download checkpoint:
scp root@instance_ip:~/colossus/checkpoints/latest.pt C:\colossus\checkpoints\cloud_latest.pt

# Download all checkpoints:
scp -r root@instance_ip:~/colossus/checkpoints/ C:\colossus\cloud_checkpoints/
```
Budget Tracking
| Item | Hours | Cost |
|---|---|---|
| Budget | - | $100 |
| Instance ($0.40/hr) | 250 | -$100 |
| Remaining | 0 | $0 |

Expected Results for $100
| Instance Type | Hours | Workers | Games/hr | Total Games |
|---|---|---|---|---|
| RTX 4090 (32 core) | 250 | 24 | ~150 | ~37,500 |
| A100 (64 core) | 100 | 48 | ~250 | ~25,000 |
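For planning, the table boils down to simple arithmetic; a quick sketch (the rates are the rough estimates above, not guarantees):

```python
# Back-of-the-envelope planner: how many games does a budget buy?
def games_for_budget(budget_usd, price_per_hr, games_per_hr):
    hours = budget_usd / price_per_hr
    return hours, hours * games_per_hr

hours, games = games_for_budget(100, 0.40, 150)
print(f"RTX 4090: {hours:.0f} hrs, ~{games:,.0f} games")   # 250 hrs, ~37,500 games

hours, games = games_for_budget(100, 1.00, 250)
print(f"A100:     {hours:.0f} hrs, ~{games:,.0f} games")   # 100 hrs, ~25,000 games
```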
Cloud Training Config
Update `scripts/train.py` for cloud:

```python
# Cloud-optimized settings
CLOUD_CONFIG = {
    'workers': 24,               # 32-core machine
    'simulations': 100,
    'batch_size': 512,           # Bigger GPU
    'hours': 240,                # 10 days max
    'checkpoint_interval': 600,  # Every 10 min
}
```

Or create `train_cloud.sh`:

```bash
#!/bin/bash
python scripts/train.py \
    --workers 24 \
    --simulations 100 \
    --batch-size 512 \
    --hours 240 \
    --resume checkpoints/latest.pt \
    2>&1 | tee training.log
```
Exit Criteria
Stop training when:
| Condition | Action |
|---|---|
| Budget exhausted | Download checkpoint, stop instance |
| 50,000 games reached | You have enough for evaluation |
| AI beats random 80%+ | Success! Time to evaluate |
| Loss stops decreasing | May need hyperparameter tuning |
| Still 100% draws at 20K games | Something's wrong, stop and debug |
Troubleshooting
Instance Dies / Gets Preempted
- Checkpoints auto-save every 10 min
- Use rclone to sync to Google Drive
- Restart on new instance, resume from latest.pt
Out of GPU Memory
```bash
# Reduce batch size
python scripts/train.py --batch-size 256 ...
```

Training Too Slow
```bash
# More workers (up to CPU cores - 2)
python scripts/train.py --workers 48 ...

# Fewer simulations (faster but lower quality)
python scripts/train.py --simulations 50 ...
```
Summary Checklist
Before Cloud:
- 5,000 games on PC
- Watched games, AI is learning
- WSL test passed
- Git repo pushed
- Checkpoint sync tested
On Cloud:
- Instance launched
- Setup script ran
- Uploaded local checkpoint
- Training started in screen
- rclone syncing checkpoints
After Cloud:
- Downloaded final checkpoint
- Stopped instance (stop billing!)
- Tested checkpoint locally
- Watch AI play
Last updated: 2026-01
-
@kindwind Rather than a mercy rule, I would suggest a fixed turn limit, where the game is adjudicated based upon VCs. For the boardgame ports, a 10 turn limit would be sufficient.
Keep in mind it is possible for a game to reach a stalemate.
-
@rogercooper Right now I am trying to pin down a good signal from the tree search. I finally stopped getting draws. Axis were winning like 25%, but the game engine was wrong. I think I have the game engine pinned down. I will do some training tonight to see if I can get a signal. If I have to, I will add turns. We'll have to see how it plays out.
-
@kindwind It would seem that training against random games would be very inefficient compared to training against the AI that TripleA comes with. TripleA is not Go; random play should be very bad.