ML and Cloud deployment
-
Anyone with ML experience, can you look this over? I want to make sure I get this right!

# COLOSSUS Hyperparameter Tuning Guide
Current Defaults
| Parameter | Day Mode | Night Mode | Notes |
|---|---|---|---|
| Workers | 8 | 8 | 1 per physical core (Ryzen 7 7800X3D) |
| MCTS Simulations | 100 | 200 | Searches per move |
| Batch Size | 256 | 512 | Training batch size |
| Learning Rate | 0.001 | 0.001 | Adam optimizer |
| Temperature (early) | 1.0 | 1.0 | Exploration in first 30 moves |
| Temperature (late) | 0.1 | 0.1 | Exploitation after 30 moves |
| Max Moves | 300 | 300 | Game length cap |
| Buffer Capacity | 50K | 50K | Training sample buffer |
Mercy Rule (TUV + IPC Victory Condition)
The mercy rule ends games early when one side has a decisive advantage, providing clear win/loss training signals instead of relying solely on IPC tiebreakers.
Formula
```
Score = Total_Unit_Value + (IPC_Income × income_weight)
```
Parameters
| Parameter | Default | Description |
|---|---|---|
| min_moves | 50 | Minimum moves before mercy can trigger |
| score_ratio_threshold | 1.05 | Required improvement from baseline (5%) |
| income_weight | 3.0 | Multiplier for IPC income in score |

Baseline Adjustment
The starting position favors Allies economically:
- Axis starting score: 824 (TUV 629 + IPC 65×3)
- Allies starting score: 977 (TUV 689 + IPC 96×3)
- Baseline ratio: 0.84 (Axis/Allies)
The mercy rule compares against this baseline, so:
- Axis wins if: `current_ratio / 0.84 >= 1.05` → ratio ≥ 0.882
- Allies wins if: `0.84 / current_ratio >= 1.05` → ratio ≤ 0.800
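A minimal sketch of this check in Python (the names, constants, and structure here are illustrative; the project's actual `MercyRule.check()` in rewards.py may differ):

```python
# Hedged sketch of the mercy-rule check described above.
BASELINE_RATIO = 824 / 977   # Axis/Allies starting scores ≈ 0.84
THRESHOLD = 1.05             # score_ratio_threshold
INCOME_WEIGHT = 3.0          # income_weight

def score(tuv, ipc_income):
    return tuv + ipc_income * INCOME_WEIGHT

def mercy_winner(axis_tuv, axis_ipc, allies_tuv, allies_ipc, move_count, min_moves=50):
    """Return 'axis', 'allies', or None if the mercy rule does not trigger."""
    if move_count < min_moves:
        return None
    current_ratio = score(axis_tuv, axis_ipc) / score(allies_tuv, allies_ipc)
    if current_ratio / BASELINE_RATIO >= THRESHOLD:    # ratio ≥ ~0.882 → Axis win
        return "axis"
    if BASELINE_RATIO / current_ratio >= THRESHOLD:    # ratio ≤ ~0.800 → Allies win
        return "allies"
    return None
```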
Why 1.05x Threshold?
Analysis of trained-network self-play showed that score ratios swing by about ±4% from the baseline:
- Trained games stay within 0.80-0.87 ratio range
- 1.2x threshold (old) required ratio ≥ 1.008 for Axis win → never triggered
- 1.05x threshold allows decisive advantages to end games early
Results with Random Play
| Threshold | Axis Wins | Allies Wins | Draws | Notes |
|---|---|---|---|---|
| 1.2x | 26% | 16% | 58% | Too conservative for trained play |
| 1.1x | 28% | 30% | 2% | Good balance |
| 1.05x (default) | 95% | 5% | 0% | Random play favors Axis |

Note: Random play heavily favors Axis. With trained networks playing balanced games, expect closer to a 50/50 Axis/Allies split.
Tuning Guidelines
| If you see... | Try adjusting... |
|---|---|
| Still 100% draws | Lower threshold to 1.03x |
| Too many early mercy wins | Increase min_moves to 100 |
| Unbalanced win rates | Check baseline calculation |
| AI exploiting mercy rule | Raise threshold to 1.1x or 1.2x |

Curriculum Learning (Future)
As training progresses, tighten the mercy rule (a sketch of one possible schedule follows this list):
- Phase 1 (early): threshold=1.2x (easy wins, frequent signal)
- Phase 2 (mid): threshold=1.5x (harder to trigger)
- Phase 3 (late): threshold=2.0x or disable (require actual VC capture)
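A minimal sketch of how such a schedule could be wired up (the game-count boundaries are placeholders, not values from this guide):

```python
# Hedged sketch of a mercy-threshold curriculum. The trainer is assumed to
# know how many self-play games have completed; boundaries are placeholders.

def mercy_threshold_for(games_played: int):
    """Return the score_ratio_threshold to use, or None to disable mercy."""
    if games_played < 1_000:        # Phase 1: easy wins, frequent signal
        return 1.2
    elif games_played < 10_000:     # Phase 2: harder to trigger
        return 1.5
    else:                           # Phase 3: near-total dominance required
        return 2.0                  # or None to require actual VC capture
```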
When to Tune What
Phase 1: Getting It Working (You Are Here)
Don't tune yet. Use defaults until you confirm:
- Model beats random (>60% win rate)
- Loss is decreasing
- No degenerate strategies
Phase 2: Initial Optimization
Once basics work, try:
| If you see... | Try adjusting... |
|---|---|
| Loss stuck high (>5.0) | Lower learning rate (0.0003) |
| Loss drops then spikes | Lower learning rate, add warmup |
| Axis wins 90%+ | Check game balance, maybe remove German bid |
| Draws 50%+ | Increase max_moves, check mercy rule |
| Very slow training | Reduce simulations (50), keep workers at 8 |

Phase 3: Serious Training
After 10K+ games, consider:
| Parameter | When to increase | When to decrease |
|---|---|---|
| Simulations | Model plateaued, need deeper search | Training too slow |
| Batch size | Stable training, want faster | Loss unstable |
| Learning rate | Training too slow | Loss unstable/spiking |
| Temperature | Too deterministic, missing good moves | Too random, not converging |
Specific Recommendations
Learning Rate
- 0.001 - Default, good starting point
- 0.0003 - If loss is unstable
- 0.0001 - Fine-tuning after initial training

MCTS Simulations
- 50 - Fast iteration, early experiments
- 100 - Default day mode (balanced)
- 200 - Night mode (better quality)
- 400+ - Only if you have time and see benefit

Batch Size
- 128 - If running out of GPU memory
- 256 - Default (good for 17GB VRAM)
- 512 - Night mode, faster training

Workers (Self-Play)
- 4 - Light usage (gaming, browsing)
- 6 - Medium usage (some background tasks)
- 8 - Optimal for Ryzen 7 7800X3D (1 per physical core)

Important: 8 workers = 1 per physical core. More workers cause CPU contention and slower training due to context-switching overhead. Testing showed 8 workers outperform 10-14 workers on 8-core CPUs.
Memory Management
Buffer Capacity
| Setting | Memory Usage | Notes |
|---|---|---|
| 500K (old default) | ~210 GB | WILL CRASH - impossible |
| 100K | ~44 GB | Too large for 32GB RAM |
| 50K (current default) | ~22 GB | Safe for 32GB RAM |
| 25K | ~11 GB | Conservative, for smaller systems |

Each training sample is ~440KB due to the 104,729-action space.
Formula:
```
memory_gb ≈ buffer_capacity × 0.00044
```
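A quick sanity check of that formula for the buffer sizes in the table above (plain arithmetic, not project code):

```python
# memory_gb ≈ buffer_capacity × 0.00044 (≈ 440 KB per sample)
for capacity in (25_000, 50_000, 100_000, 500_000):
    print(f"{capacity:>7} samples ≈ {capacity * 0.00044:5.0f} GB")
# → 11 GB, 22 GB, 44 GB, and ~220 GB respectively
```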
Memory Leak Signs
- Memory usage climbing steadily over hours
- System slowdown after 2-3 hours
- Windows "low virtual memory" warnings
- BSOD with CRITICAL_PROCESS_DIED
Solutions
- Set `--buffer-capacity 50000` (or lower)
- Monitor with Task Manager during training (a programmatic check is sketched below)
- Restart training if memory exceeds 28GB
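A minimal, hedged sketch of a programmatic memory check using psutil (already in the project's dependency list); the 28 GB limit matches the guideline above, and `save_checkpoint_and_exit()` is a hypothetical stand-in for whatever shutdown path the training loop actually uses:

```python
import psutil

MEMORY_LIMIT_GB = 28  # restart threshold recommended above for a 32GB system

def memory_ok() -> bool:
    """Return False once system memory use approaches the restart threshold."""
    used_gb = psutil.virtual_memory().used / 1024**3
    if used_gb > MEMORY_LIMIT_GB:
        print(f"Memory at {used_gb:.1f} GB (> {MEMORY_LIMIT_GB} GB) - time to restart")
        return False
    return True

# Hypothetical use inside the training loop:
# if not memory_ok():
#     save_checkpoint_and_exit()  # placeholder for the real shutdown path
```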
Warning Signs
Model Not Learning
- Loss stays >10 after 1000 steps → Check data pipeline
- Loss oscillates wildly → Reduce learning rate
- 0% or 100% win rate → Game mechanics bug
Degenerate Strategies
- >50% passes → Model learned "doing nothing is safe"
- All games draw → Peace treaty equilibrium (check mercy rule is enabled)
- Games <50 moves → Suicide attacks or mercy threshold too low
- 100% Axis or Allies wins → Mercy rule baseline may be miscalibrated
- Alternating Axis/Allies wins → Peace Treaty 2.0 (see below)
Peace Treaty 2.0 (Mercy Rule Gaming)
The mercy rule solves the original "peace treaty" where neither side attacks and all games draw. But there's a subtler exploit:
The Problem: In self-play, the same network plays both sides. It could learn to trade mercy wins - "I'll tank my TUV to let you win this game, you do the same next game." The network doesn't distinguish between Axis and Allies identities across games.
Detection Signs:
| Pattern | What It Means |
|---|---|
| Win rate oscillates: 80% Axis → 80% Allies → 80% Axis | Network cycling between strategies |
| Mercy triggers at exactly min_moves (50) | Intentionally fast losses |
| One side's TUV drops to near-zero quickly | Deliberate unit sacrifice |
| Suspiciously balanced 50/50 Axis/Allies wins | Too perfect to be real learning |
| Low game variance (all games look similar) | Memorized "trade" pattern |

How to Detect Programmatically:
```python
# In analyze_progress.py or training loop
# (count_streaks, avg_streak_length, win_rate_variance, last_100_games and
#  mercy_games are assumed analysis helpers/inputs)

# Check for alternating win streaks
recent_winners = [game.winner for game in last_100_games]
axis_streaks = count_streaks(recent_winners, 0)    # Count consecutive Axis wins
allies_streaks = count_streaks(recent_winners, 1)

# Suspicious if we see many short alternating streaks
if avg_streak_length < 3 and win_rate_variance < 0.1:
    print("WARNING: Possible Peace Treaty 2.0 detected")

# Check mercy trigger timing
mercy_moves = [game.end_move for game in mercy_games]
if np.std(mercy_moves) < 10:  # All mercy at same move count
    print("WARNING: Suspiciously consistent mercy timing")
```

Prevention Strategies:
| Strategy | Implementation | Tradeoff |
|---|---|---|
| Asymmetric rewards | Axis win = +1.0, Allies win = +0.9 | Breaks symmetry, may bias learning |
| Minimum game length | Raise min_moves from 50 → 100 | Slower training, but harder to game |
| TUV floor check | No mercy if loser TUV > 50% of start | Prevents deliberate tanking |
| Streak detection | Pause training if alternating pattern detected | Reactive, not preventive |
| Diverse opponents | Play against older checkpoints (not just self) | Best solution, more complex |

Recommended Fix (TUV Floor):
Add to mercy rule: Don't trigger mercy win if the "losing" side still has significant forces.
```python
# In rewards.py MercyRule.check()
loser_tuv_ratio = loser_tuv / loser_starting_tuv

if loser_tuv_ratio > 0.5:
    # Loser still has >50% of starting army
    # This is a legitimate beatdown, allow mercy
    return (winner, reason)
else:
    # Loser's army evaporated suspiciously fast
    # Could be intentional tanking - don't trigger mercy
    return None
```

When to Worry:
- Early training (first 1000 games): Don't worry, randomness dominates
- Mid training (1000-10000 games): Watch for patterns emerging
- Late training (10000+ games): If 50/50 split persists with low variance, investigate
Overfitting
- Loss decreases but eval win rate drops → Reduce training, add regularization
- Training loss << validation loss → More diverse self-play
The "Suicide Loop"
The AI learns that losing quickly is less painful than dragging out a loss.
Detection: Games ending in <30 moves, one side's units disappear immediately.
Fix: Add small per-move survival bonus to reward function, or raise min_moves for mercy.
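A hedged sketch of the survival-bonus idea (the function, the `SURVIVAL_BONUS` name, and the 0.001 value are illustrative placeholders, not project code):

```python
# Illustrative reward shaping to discourage the "suicide loop":
# a tiny per-move bonus makes a slow loss score slightly better than a
# fast one, without swamping the ±1 win/loss signal.

SURVIVAL_BONUS = 0.001  # placeholder; keep the accumulated bonus small vs. ±1

def shaped_result(raw_result: float, moves_survived: int) -> float:
    """raw_result is +1 for a win, -1 for a loss, 0 for a draw."""
    if raw_result < 0:
        # Losing on move 20 now scores worse than holding out to move 200.
        return raw_result + SURVIVAL_BONUS * moves_survived
    return raw_result
```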
The "Hyperparameter Mismatch"
Learning rate too high → AI forgets everything every few minutes.
Detection: Loss oscillates wildly, win rate swings between 0% and 100%.
Fix: Lower learning rate to 0.0003 or 0.0001.
Command Examples
```bash
# Conservative (stable but slow)
python scripts/train.py --lr 0.0003 --simulations 50 --workers 8

# Aggressive (faster but riskier)
python scripts/train.py --lr 0.003 --batch-size 512 --workers 8

# Debug mode (fast iteration)
python scripts/train.py --duration 600 --simulations 25 --workers 4

# Full night run
python scripts/train.py --night --simulations 200 --workers 8

# Memory-safe long run
python scripts/train.py --night --buffer-capacity 50000 --workers 8
```
Monitoring Checklist
Every training run, check:
- Loss trending down (not stuck or spiking)
- Win rates not extreme (20-80% range is healthy)
- Games completing (not all hitting max moves)
- Mercy rule triggering (expect 30-50% with trained network)
- Mercy timing varies (not all at exactly min_moves)
- No alternating Axis/Allies win streaks (Peace Treaty 2.0)
- Passes reasonable (<30% of moves)
- Game lengths have healthy variance (not all identical)
- Memory usage stable (<28GB for 32GB system)
After training:
```bash
python scripts/analyze_progress.py
python scripts/watch_game.py --checkpoint checkpoints/latest.pt --speed fast
python scripts/evaluate.py --checkpoint checkpoints/latest.pt --games 20
```
Neural Network Encoding Limitations
Current Cargo Encoding (Channels 48-53)
The network sees aggregate cargo counts per territory, normalized to a max of 4 (a sketch of this encoding follows the channel list):
- Channels 48-49: Our transport cargo (infantry / other)
- Channels 50-51: Enemy transport cargo (infantry / other)
- Channels 52-53: Carrier fighters (ours / enemy)
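A minimal sketch of how these aggregate channels could be filled, assuming a [54, 12, 12] board tensor (as used in the Titans section below) and transport objects exposing `.infantry_cargo` / `.other_cargo` counts; the project's real encoder may differ:

```python
CARGO_NORM_CAP = 4  # aggregate counts are capped and normalized to a max of 4

def fill_cargo_channels(board, zone_xy, our_transports, enemy_transports,
                        our_carrier_fighters, enemy_carrier_fighters):
    """Write aggregate cargo counts into channels 48-53 for one sea zone (sketch)."""
    x, y = zone_xy

    def norm(count):
        return min(count, CARGO_NORM_CAP) / CARGO_NORM_CAP

    board[48, x, y] = norm(sum(t.infantry_cargo for t in our_transports))
    board[49, x, y] = norm(sum(t.other_cargo for t in our_transports))
    board[50, x, y] = norm(sum(t.infantry_cargo for t in enemy_transports))
    board[51, x, y] = norm(sum(t.other_cargo for t in enemy_transports))
    board[52, x, y] = norm(our_carrier_fighters)
    board[53, x, y] = norm(enemy_carrier_fighters)
```

Because each channel holds a single number per sea zone, two half-loaded transports look the same as one full transport, which is exactly the "aggregate counts only" limitation in the table below.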
Known Limitations
| Limitation | Strategic Impact | Priority |
|---|---|---|
| No per-transport cargo visibility | Can't plan which transport to unload first | Medium |
| No transport capacity remaining | Can't see if transport is full (2 units) or has space | High |
| No carrier capacity remaining | Can't see if carrier has 0/1/2 fighters | Medium |
| Aggregate counts only | Multiple transports in same sea zone appear as one blob | Medium |
| "Other" cargo lumps unit types | Artillery, armor, AA guns indistinguishable in cargo | Low |
| Normalization cap of 4 | Large fleets (5+ transports) lose precision | Low |

Future Encoding Improvements
If training shows transport/carrier coordination issues:
- Per-unit cargo channels: Separate channel for each transport's cargo state
- Capacity remaining channels: Binary flags for "has space" vs "full"
- Distinct unit encoding: Track which specific units are loaded where
- Separate "other" cargo types: Artillery vs armor vs AA gun distinction
AA Gun Transport Rules
Summary (verified and tested 2026-01):
| Action | Combat Move | Non-Combat Move |
|---|---|---|
| Load AA gun on transport | Not allowed | Allowed |
| Unload AA gun from transport | Not allowed | Allowed |
| Amphibious assault with AA gun | Allowed (as cargo) | N/A |

Details:
- AA guns CAN load on transports during NCM

- AA guns CANNOT load on transports during Combat Move

- AA guns CAN unload during NCM

- AA guns CAN participate in amphibious assault (unloaded as non-combatant)

- AA guns CANNOT move independently during Combat Move (no combat movement for AA)
This is enforced by the game engine in `gen_transport_loads_combat_dedup`, which excludes AA guns, while `gen_transport_loads_dedup` (NCM) includes them.

Test coverage: See `tests/integration_tests.rs` and the unit tests in `src/moves/generation_fixed.rs`.
Titans Memory Integration (FUTURE)
Status: PLANNING - Implement after 5,000+ games baseline established
Paper: "Titans: Learning to Memorize at Test Time" (Behrouz, Zhong, Mirrokni - Google Research, December 2024)
Implementation: `lucidrains/titans-pytorch` (MIT License, 1.5k+ stars)

Overview
Titans is a "surprise-based neural memory" architecture that enables test-time learning. This is a "brain transplant" rather than a full rebuild: we keep the MCTS chassis but replace the static neural network with one that adapts during gameplay.
In vanilla AlphaZero, the neural network is frozen during gameplay. It learned patterns from millions of self-play games but cannot adapt to:
- Opponent-specific tendencies - Is this player aggressive? Defensive? Risk-tolerant?
- Strategic surprises - Unusual openings, unconventional purchases
- Game-specific adaptations - Adjusting mid-game when something unexpected happens
Why Titans for COLOSSUS?
| Feature | Standard AlphaZero | Titans-Enhanced |
|---|---|---|
| Network weights during game | Static | Dynamic (memory module updates) |
| Opponent modeling | None | Implicit (learns from surprises) |
| Adaptation speed | Zero | Real-time (after each opponent move) |
| Memory of game history | CNN sees last N states | Neural long-term memory |
| "Surprise" awareness | None | Quantified (gradient of prediction error) |

A&A-Specific Benefits:
- Purchase Phase Adaptation: Opponent buys 6 bombers → High surprise → Memory updates → Value network shifts toward anti-air strategies
- Risk Tolerance Modeling: Opponent attacks with 30% win probability → Memory encodes "opponent is risk-seeking" → MCTS values "bait" moves higher
- Strategic Flexibility: Russia stacks Ukraine instead of expected Caucasus defense → AI adjusts strategic evaluation for remainder of game
- Breaking Peace Treaty Pattern: Surprising aggressive moves become memorable, encouraging counter-play
Architecture
Current COLOSSUS Architecture
```
MCTS Engine
  For each simulation:
    1. Select (UCB)
    2. Expand
    3. Evaluate → Query Neural Network (STATIC)
    4. Backpropagate
          ↓
  Policy Head → Move probabilities
  Value Head  → Win probability
  (ResNet/CNN) [FROZEN]
```

Titans-Enhanced Architecture
```
MCTS Engine
  For each simulation:
    1. Select (UCB)
    2. Expand
    3. Evaluate → Query Titans Network
    4. Backpropagate
  [Memory LOCKED during thinking]
          ↓
  TITANS ARCHITECTURE
    Short-Term Memory    Long-Term Memory    Persistent Memory
    (Attention)          (Neural MLP)        (Fixed)
    [Window=128]         [UPDATES!]          [Task]
          ↓                    ↓                   ↓
          Policy + Value Heads + Surprise Metric
          ↓
  AFTER OPPONENT MOVES:
    1. Calculate prediction vs actual
    2. Compute surprise (gradient loss)
    3. Backprop to Long-Term Memory ONLY
    4. Memory weights updated for next turn
```
Hyperparameters
Memory Module Configuration
| Parameter | Default | Range | Notes |
|---|---|---|---|
| enabled | false | - | Enable after base training complete |
| type | MemoryMLP | MemoryMLP, FactorizedMemoryMLP | 2-layer MLP recommended |
| dim | 384 | 256-512 | Match board embedding size |
| num_layers | 2 | 1-4 | More = more expressive, slower |
| chunk_size | 64 | 32-128 | History window per update |

Surprise Mechanism
| Parameter | Default | Range | Notes |
|---|---|---|---|
| learning_rate | 0.01 | 0.001-0.1 | Memory update step size |
| min_threshold | 0.1 | 0.0-0.5 | Skip tiny surprises |
| max_gradient | 1.0 | 0.5-2.0 | Clip extreme surprises |

Integration Flags
| Parameter | Default | Notes |
|---|---|---|
| lock_during_mcts | true | Don't learn from simulations |
| update_on_opponent_move | true | Core mechanism |
| reset_between_games | true | Avoid opponent overfitting |

Full YAML Configuration
```yaml
titans:
  enabled: false                    # Enable after base training complete
  memory_module:
    type: "MemoryMLP"               # MemoryMLP, MemoryAttention, etc.
    dim: 384                        # Match board embedding dimension
    num_layers: 2                   # MLP depth (2 recommended by lucidrains)
    chunk_size: 64                  # History window for processing
  surprise:
    learning_rate: 0.01             # Memory update step size
    min_threshold: 0.1              # Don't update if surprise below this
    max_gradient: 1.0               # Clip large surprises
  integration:
    lock_during_mcts: true          # Don't learn from simulations
    update_on_opponent_move: true   # Core surprise mechanism
    update_on_own_move: false       # Usually not needed
  history:
    max_length: 300                 # Max game states to track
    include_purchases: true         # Track purchase decisions
    include_combat_results: true    # Track battle outcomes
```
Implementation
Installation
```bash
pip install titans-pytorch
```

Core Imports
```python
from titans_pytorch import NeuralMemory, MemoryAsContextTransformer

# Memory models available:
from titans_pytorch import (
    MemoryMLP,               # Simple 1-4 layer MLP (paper default)
    MemoryAttention,         # Attention-based memory
    FactorizedMemoryMLP,     # Efficient factorized version
    MemorySwiGluMLP,         # SwiGLU activation variant
    GatedResidualMemoryMLP   # With residual connections
)
```

Board State Encoding (Keep Current)
```python
# Current encoding (unchanged)
board_tensor = encode_board(game_state)   # Shape: [1, 54, 12, 12]

# Flatten for Titans
board_flat = board_tensor.view(1, -1)     # Shape: [1, 7776]

# Project to Titans dimension
embedding = self.projection(board_flat)   # Shape: [1, 384]
```

History Sequence
```python
class GameHistory:
    def __init__(self, embedding_dim=384, max_length=300):
        self.states = []
        self.dim = embedding_dim

    def add_state(self, board_embedding):
        self.states.append(board_embedding)

    def get_sequence(self):
        if not self.states:
            return torch.zeros(1, 1, self.dim)
        return torch.stack(self.states, dim=1)  # [1, T, dim]
```

The Surprise Calculation (Key Innovation)
```python
def calculate_surprise(network, board_before_opponent, actual_opponent_move):
    """
    Core Titans mechanism: How surprised was the AI by opponent's move?

    High surprise → Large gradient → Memory updates significantly
    Low surprise  → Small gradient → Memory mostly unchanged
    """
    with torch.enable_grad():
        # Get prediction BEFORE opponent moved
        policy_pred, value_pred, _ = network(board_before_opponent)

        # What probability did we assign to their actual move?
        move_idx = encode_move(actual_opponent_move)
        predicted_prob = policy_pred[0, move_idx]

        # Surprise = negative log probability (cross-entropy style)
        # If we predicted 90%  → low surprise
        # If we predicted 0.1% → high surprise
        surprise_loss = -torch.log(predicted_prob + 1e-8)

    return surprise_loss


def update_memory(network, surprise_loss, learning_rate=0.01):
    """Update ONLY the memory module weights, not the full network."""
    network.memory_module.zero_grad()
    surprise_loss.backward()

    with torch.no_grad():
        for param in network.memory_module.parameters():
            if param.grad is not None:
                param.data -= learning_rate * param.grad
```

Game Loop Integration
```python
class TitansEnhancedMCTS:
    def __init__(self, network):
        self.network = network
        self.history = GameHistory()

    def play_turn(self, game_state):
        # === THINK PHASE ===
        # Lock memory during MCTS (don't learn from imagination)
        self.network.memory_module.eval()

        # Standard MCTS search
        best_move = self.mcts_search(game_state, simulations=200)

        # Store state BEFORE our move
        self.board_before_move = encode_board(game_state)

        return best_move

    def observe_opponent_move(self, opponent_move, new_state):
        # === SURPRISE PHASE ===
        # Calculate how unexpected opponent's move was
        surprise = calculate_surprise(
            self.network,
            self.board_before_move,
            opponent_move
        )

        # Update memory based on surprise
        self.network.memory_module.train()
        update_memory(self.network, surprise)

        # Add to history for context
        self.history.add_state(encode_board(new_state))

        # Log for analysis
        print(f"Opponent move surprise: {surprise.item():.4f}")
```
Three Titans Variants
The paper presents three ways to incorporate memory. For COLOSSUS, we recommend MAC (Memory as Context):
1. Memory as Context (MAC) - RECOMMENDED
```
History → [Neural Memory] → context
Current → [Attention]     → query
(context, query) → [Combine] → Policy/Value
```

Why for A&A: Game history matters. What territories changed hands, what was purchased - this context informs current decisions. (A minimal combine sketch follows the three variants below.)
2. Memory as Layer (MAL)
```
Input → [Memory Layer] → [Attention Layer] → ... → Output
```

Better for: Very long sequences (2M+ tokens). Overkill for A&A games.
3. Memory as Gate (MAG)
```
Input → [Memory Branch] ─┐
      → [Attention]     ─┼→ [Gated Combine] → Output
```

Better for: When you need fine-grained control over memory influence.
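To make the recommended MAC wiring concrete, here is a minimal, hedged sketch of combining a memory-produced context with the current board embedding before the heads. This is not the lucidrains API, just an illustration of the data flow; names and sizes are placeholders (dims follow the `dim: 384` default above, action space from the encoding section):

```python
import torch
import torch.nn as nn

class MACHeads(nn.Module):
    """Illustrative Memory-as-Context combine: [history context | current query] → heads.

    Assumes the memory module has already summarized game history into a
    384-dim context vector; this is a sketch, not project code.
    """
    def __init__(self, dim=384, action_space=104_729):
        super().__init__()
        self.combine = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.policy_head = nn.Linear(dim, action_space)
        self.value_head = nn.Linear(dim, 1)

    def forward(self, current_embedding, memory_context):
        # Concatenate the current-state query with the long-term memory context
        x = self.combine(torch.cat([current_embedding, memory_context], dim=-1))
        return self.policy_head(x), torch.tanh(self.value_head(x))
```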
Pre-Training Requirements
CRITICAL: Titans surprise-based learning only works if the AI already knows what's "normal". You must:
- Train base model first (current COLOSSUS training)
  - 10,000+ self-play games minimum
  - Network learns rules, basic strategy
  - This is your "persistent memory" foundation
- Then enable surprise updates
  - Network can now detect deviations from learned patterns
  - Memory module adapts to specific opponents
  - Value shifts reflect game-specific surprises
Implementation Phases
Phase 1: Validation (Current Priority)
- Continue current training to 5,000+ games
- Validate learning is happening (loss decreasing)
- Resolve peace treaty pattern
- Establish baseline performance metrics
Phase 2: Titans Infrastructure (After Baseline)
- Install titans-pytorch: `pip install titans-pytorch`
- Create TitansNetwork wrapper class
- Implement GameHistory tracking
- Add surprise calculation utilities
- Unit tests for memory updates
Phase 3: Integration (Careful)
- Modify MCTS to use Titans network
- Implement memory locking during search
- Add post-opponent-move surprise calculation
- Test on single games first
- Monitor memory weight changes
Phase 4: Training (New Paradigm)
- Train base model (frozen memory) - 10,000 games
- Enable memory updates during inference only
- Test against vanilla COLOSSUS
- Measure adaptation effectiveness
Phase 5: Optimization
- Tune memory learning rate (0.001 - 0.1 range)
- Experiment with memory architectures (MLP layers)
- Adjust chunk_size for A&A game length
- Profile memory/compute overhead
Expected Outcomes
| Metric | Without Titans | With Titans (Expected) |
|---|---|---|
| Adaptation to unusual openings | None | Within 3-5 turns |
| Opponent tendency modeling | None | Implicit after 10+ moves |
| Response to strategic surprises | Fixed policy | Dynamic adjustment |
| "Stuck in local minima" games | Common | Reduced (surprise breaks patterns) |

A&A-Specific Scenarios
- Germany buys navy instead of land units
  - Current: AI follows trained policy regardless
  - Titans: High surprise → Memory updates → UK/US naval strategy shifts
- Japan ignores India, attacks Australia
  - Current: AI continues India-focused defense
  - Titans: Surprise registered → Pacific defense prioritized
- Russia trades Ukraine aggressively
  - Current: Standard Eastern Front evaluation
  - Titans: Risk-seeking behavior encoded → AI sets traps
Known Challenges
- PyTorch Functional Transforms: The `titans-pytorch` library uses `torch.func.grad`, which has compatibility issues with some setups. May need:

  ```python
  torch._C._jit_set_profiling_mode(False)
  torch._C._jit_set_profiling_executor(False)
  ```

- Memory Overhead: Neural memory adds parameters. Monitor GPU memory during MCTS (many forward passes). Memory state size grows with game length.

- Overfitting to Opponent: Risk that the AI adapts TOO much to one opponent's style and becomes exploitable. Mitigation: decay memory updates over time, reset memory between games, train on diverse self-play opponents (a small sketch follows).
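A hedged sketch of the "decay and reset" mitigations, reusing `update_memory(...)` and `network.memory_module` from the integration code above; the class name and decay schedule are illustrative placeholders:

```python
import copy

class MemoryHygiene:
    """Illustrative mitigations: decay the surprise learning rate within a game
    and restore pristine memory weights between games (sketch, not project code)."""

    def __init__(self, network, base_lr=0.01, decay=0.97):
        self.network = network
        self.base_lr = base_lr
        self.decay = decay
        # Snapshot the memory module so it can be reset between games
        self.initial_memory = copy.deepcopy(network.memory_module.state_dict())
        self.updates_this_game = 0

    def decayed_lr(self) -> float:
        # Later updates in a game move memory less than early ones
        return self.base_lr * (self.decay ** self.updates_this_game)

    def apply_update(self, surprise_loss):
        update_memory(self.network, surprise_loss, learning_rate=self.decayed_lr())
        self.updates_this_game += 1

    def reset_between_games(self):
        self.network.memory_module.load_state_dict(self.initial_memory)
        self.updates_this_game = 0
```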
Decision Point
| Condition | Action |
|---|---|
| Current training < 5,000 games | Wait. Build baseline first. |
| Peace treaty pattern persists after 10k games | Consider Titans to break equilibrium |
| Want opponent-adaptive AI for human play | Titans is the answer |
| Cloud training budget available | Train base → add Titans layer |
Resources
Papers:
- Titans: Learning to Memorize at Test Time - Core paper
- MIRAS - Theoretical framework
- Test-Time Training Done Right - Related approach
Code:
- lucidrains/titans-pytorch - MIT licensed implementation
COLOSSUS Cloud Deployment Plan
Budget: $100
Goal: 20,000-50,000 games
Timeline: Week 3 (after PC validation)
Pre-Cloud Checklist
Complete these BEFORE spending money:
| Task | Status | Notes |
|---|---|---|
| 5,000 games on PC | | ~2 weeks at current rate |
| Watch games with watch_game.py | | Verify AI is learning, not broken |
| Win rate not 100% draws | | Some Axis/Allied wins appearing |
| Test in WSL | | Catch Linux bugs free |
| Checkpoint upload working | | Don't lose work if instance dies |
| Git repo ready | | Push code to GitHub/GitLab |

DO NOT proceed to cloud until all boxes are checked!
Phase 1: WSL Testing (Free)
Test on Linux before paying for cloud:
```bash
# 1. Enable WSL (Windows Terminal as admin)
wsl --install

# 2. Open Ubuntu
wsl

# 3. Install dependencies
sudo apt update
sudo apt install -y build-essential python3-pip curl

# 4. Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
source $HOME/.cargo/env

# 5. Copy project (from Windows path)
cp -r /mnt/c/colossus ~/colossus
cd ~/colossus

# 6. Install Python deps
pip3 install torch numpy maturin psutil

# 7. Build Rust extension
maturin develop --release

# 8. Run quick test
python3 scripts/train.py --workers 4 --hours 0.5

# 9. Run full test suite
cargo test --all
```

If this works, cloud will work.
Phase 2: Checkpoint Cloud Sync
Add automatic checkpoint upload so you don't lose progress:
Option A: Google Drive (Recommended - You already use it)
Install rclone and configure:
```bash
# Install rclone
curl https://rclone.org/install.sh | sudo bash

# Configure Google Drive
rclone config
# Follow prompts to add "gdrive" remote

# Test upload
rclone copy checkpoints/latest.pt gdrive:colossus/checkpoints/
```

Add to training script (auto-upload every checkpoint):
```python
# In async_pipeline.py after saving checkpoint:
import subprocess

subprocess.run([
    "rclone", "copy",
    "checkpoints/",
    "gdrive:colossus/checkpoints/",
    "--quiet"
])
```

Option B: Simple SCP (manual but reliable)
After training stops:
```bash
# From your Windows machine:
scp -r user@cloud-ip:~/colossus/checkpoints ./cloud_checkpoints/
```
Phase 3: Cloud Provider Setup
Recommended: Vast.ai
Best price for your budget.
- Create account: https://vast.ai
- Add $100 credits
- Find instance:
  - GPU: RTX 4090 or A100
  - CPU: 32+ cores
  - RAM: 64GB+
  - Storage: 50GB+
  - Price: $0.30-0.80/hr
Instance Selection
| GPU | $/hr | Cores | For $100 | Best For |
|---|---|---|---|---|
| RTX 4090 | $0.30-0.50 | 32 | 200-300 hrs | Best value |
| A100 40GB | $0.80-1.20 | 64 | 80-125 hrs | Max speed |
| RTX 3090 | $0.20-0.35 | 16-32 | 280-500 hrs | Budget |

Recommendation: RTX 4090 with 32+ CPU cores at ~$0.40/hr = 250 hours = ~10 days
Phase 4: Cloud Training
One-Time Setup
```bash
# SSH into instance
ssh -i your_key root@instance_ip

# Run setup script
curl -sSL https://raw.githubusercontent.com/YOUR_USERNAME/colossus/main/scripts/cloud_setup.sh | bash

# OR manual:
git clone https://github.com/YOUR_USERNAME/colossus.git
cd colossus
pip install torch numpy maturin
maturin develop --release
```

Upload Your Checkpoint (Continue Training)
```bash
# From your Windows machine, upload current checkpoint:
scp C:\colossus\checkpoints\latest.pt root@instance_ip:~/colossus/checkpoints/
```

Start Training
```bash
# Use screen (stays running after disconnect)
screen -S training

# Start with more workers (cloud has more CPU cores)
cd ~/colossus
python scripts/train.py \
    --workers 24 \
    --simulations 100 \
    --hours 240 \
    --resume checkpoints/latest.pt

# Detach: Ctrl+A then D
# Reconnect: screen -r training
```

Monitor
```bash
# New SSH session
screen -r training   # Watch live

# Or check logs
tail -f checkpoints/training.log
```
Phase 5: Download Results
When done or budget running low:
```bash
# From Windows, download checkpoint:
scp root@instance_ip:~/colossus/checkpoints/latest.pt C:\colossus\checkpoints\cloud_latest.pt

# Download all checkpoints:
scp -r root@instance_ip:~/colossus/checkpoints/ C:\colossus\cloud_checkpoints/
```
Budget Tracking
| Item | Hours | Cost |
|---|---|---|
| Budget | - | $100 |
| Instance ($0.40/hr) | 250 | -$100 |
| Remaining | 0 | $0 |

Expected Results for $100
| Instance Type | Hours | Workers | Games/hr | Total Games |
|---|---|---|---|---|
| RTX 4090 (32 core) | 250 | 24 | ~150 | ~37,500 |
| A100 (64 core) | 100 | 48 | ~250 | ~25,000 |
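For planning, the table boils down to simple arithmetic; a quick sketch (the rates are the rough estimates above, not guarantees):

```python
# Back-of-the-envelope planner: how many games does a budget buy?
def games_for_budget(budget_usd, price_per_hr, games_per_hr):
    hours = budget_usd / price_per_hr
    return hours, hours * games_per_hr

hours, games = games_for_budget(100, 0.40, 150)
print(f"RTX 4090: {hours:.0f} hrs, ~{games:,.0f} games")   # 250 hrs, ~37,500 games

hours, games = games_for_budget(100, 1.00, 250)
print(f"A100:     {hours:.0f} hrs, ~{games:,.0f} games")   # 100 hrs, ~25,000 games
```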
Cloud Training Config
Update `scripts/train.py` for cloud:

```python
# Cloud-optimized settings
CLOUD_CONFIG = {
    'workers': 24,               # 32-core machine
    'simulations': 100,
    'batch_size': 512,           # Bigger GPU
    'hours': 240,                # 10 days max
    'checkpoint_interval': 600,  # Every 10 min
}
```

Or create `train_cloud.sh`:

```bash
#!/bin/bash
python scripts/train.py \
    --workers 24 \
    --simulations 100 \
    --batch-size 512 \
    --hours 240 \
    --resume checkpoints/latest.pt \
    2>&1 | tee training.log
```
Exit Criteria
Stop training when:
| Condition | Action |
|---|---|
| Budget exhausted | Download checkpoint, stop instance |
| 50,000 games reached | You have enough for evaluation |
| AI beats random 80%+ | Success! Time to evaluate |
| Loss stops decreasing | May need hyperparameter tuning |
| Still 100% draws at 20K games | Something's wrong, stop and debug |
Troubleshooting
Instance Dies / Gets Preempted
- Checkpoints auto-save every 10 min
- Use rclone to sync to Google Drive
- Restart on new instance, resume from latest.pt
Out of GPU Memory
```bash
# Reduce batch size
python scripts/train.py --batch-size 256 ...
```

Training Too Slow
```bash
# More workers (up to CPU cores - 2)
python scripts/train.py --workers 48 ...

# Fewer simulations (faster but lower quality)
python scripts/train.py --simulations 50 ...
```
Summary Checklist
Before Cloud:
- 5,000 games on PC
- Watched games, AI is learning
- WSL test passed
- Git repo pushed
- Checkpoint sync tested
On Cloud:
- Instance launched
- Setup script ran
- Uploaded local checkpoint
- Training started in screen
- rclone syncing checkpoints
After Cloud:
- Downloaded final checkpoint
- Stopped instance (stop billing!)
- Tested checkpoint locally
- Watch AI play
Last updated: 2026-01
-
@kindwind Rather than a mercy rule, I would suggest a fixed turn limit, where the game is adjudicated based upon VCs. For the boardgame ports, a 10 turn limit would be sufficient.
Keep in mind it is possible for a game to reach a stalemate.
-
@rogercooper Right now I am trying to pin down a good signal from the tree search. I finally stopped getting draws. Axis were winning like 25%, but the game engine was wrong. I think I have the game engine pinned down. I will do some training tonight to see if I can get a signal. If I have to, I will add turns. We'll have to see how it plays out.
-
@kindwind It would seem that training against random games would be very inefficient compared to training against the AI that TripleA comes with. TripleA is not Go; random play should be very bad.