Faculty · AI is mathsMethods, models and assumptions
On this page
OverviewThe easy part: forests already contain parallel workBootstrap is not just a way to keep processors busyOut-of-bag data are the forest's built-in diagnosticWhen diagnostics become a method: VSURFThe parallel trap: chunks are rarely random populationsWhen data never stop, the experiment changes againSeveral routes preserve different parts of the methodWhy these connected works still make a sharp teaching articleThe research behind this article
DSTI TechBlog  /  AI is maths
FacultyAI is maths

Scaling a random forest without losing the statistics

Random forests seem made for big data: grow trees independently, distribute the work, combine the votes. Christine Tuleau-Malot and colleagues showed why the central challenge is preserving the sampling logic, diagnostics and meaning of the forest when data are split — or never stop arriving. The same out-of-bag and variable-importance machinery also became VSURF, connecting statistical research to a reusable R package.

random-forestsbootstrapout-of-bagvariable-importanceVSURFR-packagedistributed-learningdata-streams

Not all artificial intelligence is a neural network. Random forests are one of the clearest demonstrations that machine learning is a mathematical construction: resampling changes the data seen by each model, randomisation changes the decisions available to each tree, and aggregation turns many unstable predictors into one robust predictor. When the data become too large for one machine — or arrive as a stream — each part of that construction has to survive the engineering change.

The useful question is not simply, “Can the algorithm run at scale?” It is, “After scaling it, are we still estimating the same thing — and can we still tell when it is wrong?”

01The easy part: forests already contain parallel work

The 2015 paper by Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot and Nathalie Villa-Vialaneix begins from an obvious attraction. A random forest is an ensemble of many decision trees. The trees are deliberately made different from one another, then their predictions are aggregated — a majority vote for classification, or an average for regression. Because one tree does not need to wait for the preceding tree to finish, the forest looks “embarrassingly parallel”.

That observation is correct, but incomplete. It describes how to distribute the trees. Big data often forces us to distribute the data themselves, and those are not the same operation.

01Resample

Draw a bootstrap sample from the learning data, with replacement.

02Randomise splits

At each node, consider only a random subset of predictor variables.

03Grow fully

Build the randomised decision tree without the usual pruning step.

04Aggregate

Combine many trees into one classification vote or regression average.

This distinction between tree-level parallelism and data-level partitioning is the article's first lasting lesson. A distributed system can make the code faster while quietly changing the statistical experiment underneath it.

02Bootstrap is not just a way to keep processors busy

For a learning set of n observations, a classical bootstrap sample contains n draws made with replacement. Some observations appear several times; others are not selected at all. That pattern is not an implementation detail. It is one of the mathematical mechanisms that creates diversity between trees.

How often one observation is selected
Ki ∼ Binomial(n, 1/n) → Poisson(1)

As the dataset grows, the number of copies of a particular observation in a bootstrap sample is well approximated by a Poisson distribution with mean 1.

Probability it is left out
P(Ki = 0) = (1 − 1/n)n → e−1 ≈ 0.368

Roughly 36.8% of the observations are absent from the bootstrap sample of a given tree. Those are its out-of-bag observations.

Bootstrap laboratory

Draw 12 times from 12 observations, with replacement. Repeated labels are used to train the tree; observations that never appear become that tree's out-of-bag test set.

Training population
Bootstrap counts — gold means out-of-bag
out-of-bag observations
share in this draw
36.8%theoretical large-n limit

The online variants reviewed in the paper use this convergence in reverse: when one new observation arrives, each tree is updated k times, with k drawn from a Poisson(1) distribution. That is a compact way to imitate the multiplicities that a batch bootstrap would have produced — without storing and resampling the full historical dataset.

03Out-of-bag data are the forest's built-in diagnostic

An observation excluded from a tree's bootstrap sample can test that tree because the tree did not train on it. Across the forest, every observation is out-of-bag for a subset of trees. Their predictions can be combined into an out-of-bag error, giving the forest an internal estimate of predictive performance without setting aside a separate validation sample.

The same mechanism supports permutation variable importance. Take the out-of-bag sample for one tree, shuffle one predictor, and measure how much the error increases. If destroying the relationship carried by that predictor damages prediction, the variable mattered to the tree.

Permutation importance, conceptually
VI ( Xj ) = 1Q t ( errTree ~ t j errTree t )

Average, over the trees, the extra out-of-bag error caused by permuting predictor Xj.

This is where the paper becomes more than a survey of faster implementations. Its authors treat error estimation and variable importance as part of the method, not as optional reporting. A scaled system that still returns predictions but loses its trustworthy diagnostics is not obviously the same random forest.

04When diagnostics become a method: VSURF

Out-of-bag error and permutation importance also became the engine of a connected line of work. Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot — the core trio shared by both research projects — developed a variable-selection procedure and made it available as VSURF, an R package distributed through CRAN.

The connection is direct. The big-data work asks what happens when a scalable implementation can no longer reproduce classical OOB error or variable importance. VSURF shows how much those quantities can do when they are preserved: they rank predictors, identify a data-driven noise floor, compare nested forests and decide whether an additional variable improves prediction enough to be retained.

One method, two scientific objectives

Variable selection depends on what the analysis is for

VSURF · Variable Selection Using Random Forests
Interpretation set

Retain the variables strongly related to the response, including useful redundancy. In imaging or functional data, correlated predictors may describe a whole region or scientific structure worth understanding.

Prediction set

Build a smaller, lower-redundancy subset that remains sufficient for accurate prediction. The goal is a compact operational model, not a complete map of every associated variable.

01Threshold

Average permutation importance over repeated forests, estimate the variability associated with uninformative predictors and remove variables below the data-driven threshold.

02Interpret

Compare nested forests built from the ranked variables and retain a compact model whose OOB error remains within the uncertainty of the best observed result.

03Predict

Introduce ranked variables sequentially and keep a new variable only when its reduction in OOB error exceeds a threshold estimated from the noisy tail.

library(VSURF)
selection <- VSURF(x = predictors, y = response)
summary(selection)

The package operationalises the research for regression and supervised classification, including high-dimensional settings. Its computations can also be parallelised while preserving reproducible random-number generation.

When a scalable implementation loses OOB error or variable importance, it loses more than a diagnostic chart. It may lose the mathematical quantities required to select variables reproducibly.

05The parallel trap: chunks are rarely random populations

A common MapReduce adaptation splits a very large dataset into smaller chunks, builds a forest independently on each chunk, and merges all the trees. Computationally, this is appealing. Statistically, it can be dangerous.

Real data on disk are often ordered by time, geography, acquisition system, customer, class, or some other form of locality. Send contiguous chunks to separate workers and each forest may learn a different population.

Contiguous partition

Each worker receives a locally homogeneous slice.

Worker forests see A-only, B-only and C-only worlds.

Random or stratified partition

Every worker receives a more representative mixture.

Each local forest sees a miniature of the global population.
1 — Locality bias

Neighbouring records on disk may share attributes, so naive chunks are not random samples.

2 — Heterogeneous forests

Local forests may be so different that averaging all their trees has no clear statistical meaning.

3 — The bootstrap size problem

The behaviour of an m-out-of-n bootstrap depends strongly on m, which is difficult to tune inside a simple distributed scheme.

4 — Diagnostics disappear

Workers lose the global training indexes needed to reconstruct classical out-of-bag error and variable importance.

Averaging many local forests does not automatically recreate one forest trained on the global population.

06When data never stop, the experiment changes again

In an online setting, the learner sees the current observation but may not retain all previous observations. The forest must update as data arrive. The online random forests reviewed by the authors combine Poisson online bagging, Extremely Randomized Trees and incremental node statistics.

new observation
(xt, yt)
draw k ∼ Poisson(1)
for each tree
update tree k times
or test it when k = 0

If k = 0, the current observation is out-of-bag for that tree and can update its error estimate. But the paper points out the approximation: after the tree changes on later data, that old prediction cannot be recomputed unless the observation was stored. The online OOB estimate is therefore not identical to the classical batch quantity.

Variable importance is harder still. Permutation importance asks us to shuffle one variable in an out-of-bag sample. A stream that is discarded after processing leaves nothing to permute. The computational constraint removes the object required by the statistical definition.

That is the heart of “AI is maths”

The algorithm is not only the code that produces a prediction. It is also the sampling experiment, the error estimate and the definition of importance. Change the data lifecycle, and those mathematical objects may have to be redefined.

07Several routes preserve different parts of the method

The authors map several directions for preserving more of the method's meaning under big-data constraints. Different forms of scale call for different compromises, and the choice should follow the statistical property that matters most.

01Partition deliberately

Randomise or stratify the data before distributing them, especially on the response, rather than trusting physical storage order.

02Use the Bag of Little Bootstraps

Construct nominally size-n bootstrap samples from only m ≪ n distinct observations, keeping the resampling logic while reducing computational burden.

03Make individual trees cheaper

Use more strongly randomised tree families such as Extremely Randomized Trees, Perfect Random Tree Ensembles or Purely Random Forests.

04Weight forests, not only trees

Treat the result as an ensemble of local forests and adapt the vote to account for sampling bias instead of merging every tree indiscriminately.

05Update rather than rebuild

Use online forests to address both volume and velocity, processing only enough of the data stream to reach adequate accuracy.

06Keep diagnostics in the design

Judge a scalable variant by what it preserves about OOB error and variable importance, not only by throughput.

08Why these connected works still make a sharp teaching article

MapReduce is no longer the fashionable headline it was in 2015. The underlying problem has not aged: distributed learning still partitions observations, streaming systems still forget history, and production constraints still tempt engineers to treat a mathematically defined estimator as interchangeable with any implementation that produces similar-looking predictions.

The paper's value is its refusal to confuse scalability with correctness. It asks four questions that remain useful whenever a machine-learning method moves from a notebook to an infrastructure:

Those are mathematical questions expressed through systems architecture. The machines, storage layout and update strategy are part of the statistical model whether we acknowledge them or not.

VSURF adds the complementary lesson. When OOB error and variable importance are preserved carefully, they can drive an end-to-end selection workflow, distinguish interpretation from prediction, and become software that other researchers and engineers can apply to their own data.

The hard question is not whether the code can run on many machines. It is whether the result is still the same statistical object — and whether its mathematics remains usable.

09The research behind this article

Original conference paper

Random forests and big data

Robin Genuer, Jean-Michel Poggi, Christine Tuleau-Malot and Nathalie Villa-Vialaneix. Presented at the 47th Journées de Statistique de la Société Française de Statistique, Lille, .

Connected method and software

VSURF: An R Package for Variable Selection Using Random Forests

Robin Genuer, Jean-Michel Poggi and Christine Tuleau-Malot. Published in The R Journal, volume 7, issue 2, pages 19–33, . DOI: 10.32614/RJ-2015-018.

The peer-reviewed article explains the selection strategy and its implementation; the package makes the method directly usable in R.

Faculty connection and research lineage

Dr Christine Malot

Published in the research literature under her full name, Christine Tuleau-Malot. At DSTI, she is co-President, with Pr Fabien Gandon, of the DSTI Scientific and Advisory Board.

Her doctorate on variable selection for high-dimensional discrimination and functional-data classification was supervised by Pr Jean-Michel Poggi. Their later work with Robin Genuer continues this research relationship through a statistical method, a peer-reviewed software article and the VSURF package.

She teaches Foundations of Statistical Analysis - Part 2 and Advanced Statistical Analysis in the MSc in Data Science & AI, and Mathematics Harmonisation in the BSc Computer Science & Engineering.

The big-data work continued

Random Forests for Big Data

The four authors developed the conference contribution into a longer article published in Big Data Research in . It expands the review and the discussion of scalable random-forest variants.

Across the conference paper, the journal extension and VSURF, the shared theme is consistent: mathematical definitions, diagnostics and implementation choices belong to one engineering object.

Editorial note. This DSTI TechBlog article is an educational interpretation of the cited research, written for the “AI is maths” series. It is not presented as a new paper, and it does not attribute the editorial wording to the researchers. Mathematical notation has been simplified where that improves readability; the original paper remains the authoritative source.