ML Meta-Model Optimisation
Systematic hyperparameter and feature search using OOB (out-of-bag) predictions as the honest out-of-sample objective
120 HP combos, 18 feature subsets, 20-step forward selection. 1,178,824 training rows.

OOB Top-20 Alpha
9.27%
Honest out-of-sample
OOB Top-5 Alpha
11.51%
Out-of-bag predictions
OOB Hit Rate
92.0%
Top-20 picks above national
Overfitting Gap
0.31pp
In-sample 9.58% vs OOB 9.27%
Why OOB matters. Random Forest's out-of-bag predictions give each sample a prediction from only trees that did not see it during training. This is a genuine out-of-sample metric, unlike the previous in-sample results (where RF Deep showed 10.81% alpha but was heavily overfitted). The optimised model achieves 9.27% OOB alpha-20 with only a 0.31 percentage point gap to in-sample, confirming minimal overfitting.
Improvement over baselines. Original forecast: 5.99% OOB top-20 alpha. Simple sum model (10 fields): 7.17%. Optimised RF: 9.27%. The ML model adds 3.28pp of genuine out-of-sample alpha over the original forecast and 2.10pp over the simple sum model.
Overfitting Gap
The gap between in-sample alpha and OOB alpha measures how much the model memorises training data rather than learning genuine patterns. A gap of 0.31pp (9.58% vs 9.27%) is very small, confirming the model generalises well. By contrast, the previous RF Deep model (depth 10) had an in-sample alpha of 10.81% but its OOB R squared of 0.20 suggests its true alpha would be much lower.

Hyperparameter Grid Search
120 combinations of max_depth (3-8), min_samples_leaf (20-500), and max_features (0.3-1.0) were tested. Depth 8 dominates the top positions. Using 70% of features per split (max_features=0.7) achieves the best OOB alpha.

Top 10 Configurations by OOB Alpha-20
| Depth | Min Leaf | Max Feat | OOB R2 | Top-5 | Top-20 | Hit Rate | Corr |
|---|---|---|---|---|---|---|---|
| 8 | 20 | 0.7 | 0.1779 | 10.58% | 8.64% | 90.7% | 0.4266 |
| 8 | 50 | 0.7 | 0.1769 | 10.23% | 8.55% | 90.5% | 0.4252 |
| 8 | 20 | 0.5 | 0.1794 | 10.11% | 8.50% | 90.1% | 0.4316 |
| 8 | 50 | 1.0 | 0.1699 | 10.25% | 8.44% | 90.5% | 0.4136 |
| 8 | 20 | 1.0 | 0.1712 | 10.67% | 8.43% | 90.3% | 0.4152 |
| 8 | 100 | 0.7 | 0.1756 | 9.93% | 8.42% | 90.2% | 0.4236 |
| 7 | 20 | 0.7 | 0.1638 | 8.42% | 7.87% | 88.2% | 0.4107 |
| 7 | 50 | 0.7 | 0.1636 | 8.30% | 7.80% | 87.8% | 0.4104 |
| 6 | 20 | 0.7 | 0.1498 | 7.34% | 7.28% | 87.4% | 0.3946 |
| 6 | 20 | 0.5 | 0.1498 | 7.17% | 7.10% | 86.4% | 0.3944 |

Feature Subset Search
Tested 18 feature subsets: top-N by importance, domain-grouped (price, growth, signals), and exclusion sets. The top-10 features by importance achieve the best OOB alpha at 8.72%, slightly above all 20 features at 8.64%. Removing synthetic features drops alpha by 0.47pp, confirming they provide genuine orthogonal signal.

| Subset | Features | OOB R2 | OOB Alpha-20 | Hit Rate |
|---|---|---|---|---|
| top-10 (by importance) | 10 | 0.1714 | 8.72% | 90.9% |
| top-20 (all) | 20 | 0.1774 | 8.64% | 90.6% |
| no_low_coverage | 20 | 0.1779 | 8.64% | 90.7% |
| top-12 | 12 | 0.1751 | 8.60% | 90.4% |
| top-15 | 15 | 0.1775 | 8.56% | 90.3% |
| no_census | 18 | 0.1715 | 8.43% | 90.4% |
| top-7 | 7 | 0.1616 | 8.19% | 89.5% |
| no_synth / univariate_only | 15 | 0.1735 | 8.17% | 88.5% |
| top-5 | 5 | 0.1541 | 7.98% | 88.6% |
| top-4 | 4 | 0.1419 | 7.11% | 85.1% |
| signals+forecast | 6 | 0.0922 | 7.00% | 85.2% |
| price+forecast | 7 | 0.1521 | 6.76% | 83.6% |
| growth+forecast | 5 | 0.1444 | 6.54% | 83.1% |
| forecast_only | 1 | 0.0512 | 5.82% | 78.3% |
| top-3 | 3 | 0.1289 | 5.73% | 79.5% |
Greedy Forward Selection (OOB Alpha)
Starting from forecast_pred alone, features were added one at a time, always picking the feature that maximises OOB top-20 alpha. Peak performance at 13 features (9.24%). The forward selection chose very different features from the importance ranking: synthetic composites (transport, urban heat, cultural integration) were selected early despite low standalone importance, because they provide orthogonal signal that the forecast does not capture.

| Step | Feature Added | Total | OOB R2 | OOB Alpha-20 | Hit Rate |
|---|---|---|---|---|---|
| 0 | (forecast_pred only) | 1 | 0.0512 | 5.82% | 78.3% |
| 1 | synth_transport_ecosystem | 2 | 0.0860 | 6.90% | 86.8% |
| 2 | synth_environment_urban_heat | 3 | 0.0974 | 7.82% | 89.7% |
| 3 | buy_3yr_growth | 4 | 0.1027 | 8.31% | 89.9% |
| 4 | buy_price | 5 | 0.1098 | 8.73% | 91.1% |
| 5 | census_public_housing | 6 | 0.1155 | 8.84% | 90.9% |
| 6 | house_vacancy_rate | 7 | 0.1201 | 9.02% | 91.1% |
| 7 | synth_cultural_integration | 8 | 0.1206 | 9.16% | 91.9% |
| 8 | stock_on_market | 9 | 0.1209 | 9.19% | 92.4% |
| 9 | rent_price | 10 | 0.1227 | 9.20% | 92.3% |
| 10 | census_overseas_born | 11 | 0.1256 | 9.21% | 91.8% |
| 11 | synth_business_innovation | 12 | 0.1255 | 9.19% | 91.8% |
| 12 | months_of_supply | 13 | 0.1253 | 9.24% | 92.0% |
| 13 | mib_perc_renters | 14 | 0.1289 | 9.06% | 91.3% |
| 14 | owner_occupied | 15 | 0.1292 | 8.99% | 90.9% |
| 15 | rent_3yr_growth | 16 | 0.1411 | 8.79% | 91.0% |
| 16 | pct_sold_at_loss | 17 | 0.1561 | 8.78% | 90.5% |
| 17 | buy_10yr_growth | 18 | 0.1785 | 8.89% | 90.9% |
| 18 | buy_1yr_growth_75 | 19 | 0.1778 | 8.76% | 90.9% |
| 19 | synth_dev_infrastructure | 20 | 0.1774 | 8.63% | 90.5% |
Note on OOB R squared vs OOB alpha divergence. After step 12 (13 features), OOB R squared keeps rising (from 0.13 to 0.18 at all 20 features) while OOB alpha declines (from 9.24% to 8.63%). This happens because the high-importance growth features (10yr growth, rent growth, distress) improve overall R squared across all suburbs but reduce the model's ability to identify the very best suburbs. They add noise to the top-N ranking even though they improve average prediction accuracy.
Optimised Model: Feature Importance
In the final 13-feature model, the forecast prediction accounts for 44% of importance. Synthetic transport ecosystem (20%) and urban heat (11%) are the second and third most important. These spatial composites capture suburb-level characteristics that the pure price/growth features miss.

Tree Count Sensitivity
Diminishing returns beyond 200 trees. OOB R squared plateaus at 0.178 and alpha-20 stabilises around 8.7%. 300 trees is a good balance of accuracy and training speed.
| Trees | OOB R2 | OOB Top-5 | OOB Top-20 | Hit Rate |
|---|---|---|---|---|
| 50 | 0.1761 | 10.57% | 8.44% | 90.1% |
| 100 | 0.1776 | 10.65% | 8.64% | 90.5% |
| 150 | 0.1778 | 10.67% | 8.63% | 90.6% |
| 200 | 0.1779 | 10.58% | 8.64% | 90.7% |
| 300 | 0.1781 | 10.70% | 8.74% | 91.0% |
| 400 | 0.1782 | 10.68% | 8.74% | 91.0% |
| 500 | 0.1782 | 10.69% | 8.76% | 91.0% |
Decile Performance (OOB)
Suburbs ranked by OOB predictions (predictions from trees that did not train on each suburb). A clear monotonic gradient from bottom to top decile confirms the model's out-of-sample ranking ability.

Rolling Alpha (OOB vs Original)
12-month rolling average of OOB top-20 alpha vs the original forecast. The optimised model consistently outperforms the original across all time periods.

Methodology
Optimisation Objective
- Primary metric: OOB top-20 alpha (out-of-bag predictions ranked per date)
- OOB: each sample predicted only by trees that did not see it in training
- Genuinely out-of-sample, no separate validation set needed
- Also tracked: OOB R squared, OOB hit rate, OOB correlation
Search Strategy
- Search 1: 120-combo HP grid (depth x leaf x max_features)
- Search 2: 18 feature subsets (importance, domain, exclusion)
- Search 3: n_estimators sensitivity (50 to 500)
- Search 4: Greedy forward feature selection (20 steps)
Optimised Model
- Random Forest: depth=8, leaf=20, max_features=0.7, 500 trees
- 13 features (forward selection optimum)
- OOB R squared: 0.1251 (on 13-feature subset)
- Overfitting gap: 0.31pp (in-sample 9.58% vs OOB 9.27%)
Data
- 1,178,824 training rows (house, 2-year, SAL markets with actuals)
- 6,492 suburbs across 192 forecast dates
- Target: log growth relative to national average
- Missing values filled with column median
Next Step: Walk-Forward Backtesting
OOB provides a good estimate of generalisation. Walk-forward backtesting with true temporal holdout will give the definitive out-of-sample result.