There is a problem that every data scientist eventually meets, and it is the opposite of the one they are trained for. Not too much data — none. The dataset that would answer the question does not exist, cannot be collected in time, or sits behind walls that will not come down. This article is about what a serious modeller does next, the course DSTI built around that question, and the researcher who teaches it.

Why this course exists: the data that is not there

Modern data science is taught as though data is the easy part and method is the hard part. In practice the order is frequently reversed. Useful data is fragmented across systems that were never designed to talk to each other, locked by legitimate security and privacy policy, expensive to assemble, or simply never recorded.

The evidence is consistent across the industry. IBM notes that most enterprise data environments remain too fragmented to support AI at scale, reporting 2025 figures in which the large majority of organisations intend to deploy advanced AI within a year yet most concede they lack a well-defined data foundation (IBM, What is data fragmentation?). IDC's analyses put the bottleneck plainly: fewer than half of AI pilot projects reach production, and the binding constraint is the accessibility and operationalisation of data across heterogeneous environments rather than compute or model architecture. Forrester has estimated that knowledge workers lose on the order of a working day each week simply locating data across disconnected systems; survey after survey finds data scientists spending something close to half their time finding, cleaning and preparing data before any modelling begins; and DATAVERSITY's 2024 management survey found data silos cited as the top concern by roughly two-thirds of organisations.

It is worth being precise about why the data is unavailable, because two different mechanisms are often conflated. Some barriers are accidental — silos, incompatible formats, lost lineage. Others are deliberate and entirely legitimate: role-based access control, corporate IT policy, medical confidentiality, and data-protection law exist precisely to restrict who may see what. A model that needs individual-level behaviour to answer a public question frequently runs into the second kind of wall, and no amount of engineering removes it. As Bobashev puts it to his students, social-network data on how people actually influence one another — the very thing you would need to model, say, how drug use begins — is almost never collectible at all.

The industry's response has a name. Gartner has forecast that the majority of data used in AI projects will be synthetically generated within a few years, and that by 2030 synthetic data will overshadow real data in a wide range of AI models, a striking claim from a firm that does not usually traffic in hyperbole (Gartner, Top Data & Analytics Predictions; MIT Sloan, What is synthetic data?). The same Gartner work warns, in the same breath, that most organisations will mismanage it. Both halves of that forecast matter, and the second half is most of what this course is about.

Synthetic data is not one thing. At one end sits a generative model fitted to a real dataset, producing statistically similar records that contain none of the originals — the approach behind MIT's Synthetic Data Vault, where freelance data scientists building predictive models on synthetic versions of five public datasets showed no significant difference from those built on the real data (MIT Sloan). At the other end — the end this course teaches — sits something older and more demanding: you build a mechanism, a population of interacting actors following rules drawn from what is actually known, and let it generate the data the world would have produced if you could have watched it.

A confession at the origin

DSTI's interest in this is not abstract. When the school launched in 2015 with a single programme, what is now the MSc in Data Science & AI, its co-founder, Sébastien Corniglion, wanted students exposed early to multi-agent modelling and to the broader craft of simulating synthetic populations. The motivation was personal. His own doctoral work with Nadine Tournois, reported in Towards a Numerical, Agent-Based, Behaviour Analysis: The Case of Tourism (Corniglion & Tournois, 2012), had run straight into the wall described above.

The problem there was structural. No single party holds a global view of how tourists actually spend across a destination — the data is scattered among independent shops, hotels, public bodies and joint ventures, and assembling it would have required partnerships and a thicket of privacy and legal work. So, rather than wait for a dataset that was never going to arrive, the work generated artificial sales data with an agent-based simulation in NetLogo, combining cellular-automaton rules with stochastic processes, and calibrated not from a master database but from observable regularities and expert insight — plausible expenditure by visitor profile, realistic ratios of hotels, bars and restaurants observed across the region. The contribution was deliberately modest and exploratory, and its most interesting finding cut against received practice: nationality, the variable the tourism industry segments on by reflex, turned out to be a poor discriminator of behaviour, while spending patterns revealed coherent groups and a recurring "group-leader" effect concentrated in the first three to five days of a stay.

Corniglion is candid that he never felt authoritative enough to teach the subject. What changed was a Scientific Advisory Board meeting. Dr Gregory Piatetsky-Shapiro — founder of KDnuggets, a pioneer of knowledge discovery and data mining, and an honorary member of DSTI's board — was aware of the intention, and introduced the school to a researcher who had spent a career doing exactly this, rigorously, where the stakes were human. That researcher, Dr Georgiy Bobashev, has taught Agent-Based Modelling at DSTI ever since.

01 What an agent-based model is, and the question it answers

An agent-based model (ABM) is a bottom-up description of a system. Instead of writing equations for the population as a whole, you specify the individuals — agents — give each one a small set of attributes and rules, place them in an environment and perhaps a network, and let them interact. Structure that no one programmed in directly — clusters, waves, tipping points, segregation, contagion — emerges from the local interactions. The intellectual lineage runs through Epstein and Axtell's Growing Artificial Societies, which made the case that whole classes of social phenomena are best understood by growing them from the bottom up rather than assuming them at the top.

The classic demonstrations are deliberately small. Schelling's segregation model, in which only a mild individual preference for not being a local minority produces sharply divided neighbourhoods; a wealth-distribution model whose almost trivial trading rules settle into a Pareto curve; the El Farol bar problem, wolf–sheep predation, a flock of birds — each one a case of macro-structure that no single agent intended or could see. Bobashev draws the contrast memorably: a system-dynamics model is a classical orchestra, every player following one global score; an agent-based model is a jazz band, where the music is whatever emerges from musicians reacting to one another, locally and in the moment.
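To make the Schelling example concrete, here is a minimal sketch in Python rather than NetLogo. It is not the course's laboratory model: the grid size, the share of empty cells, the 30% tolerance threshold and the "move to a random empty cell" rule are all illustrative assumptions. It only shows how a mild individual preference can raise the average share of like neighbours well above what any agent asked for.

```python
import random

def run_schelling(size=20, empty_frac=0.1, threshold=0.3, steps=50, seed=0):
    """Toy Schelling model on a wrapped grid.

    Returns (initial, final) mean share of like neighbours across agents.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    cells = [(r, c) for r in range(size) for c in range(size)]
    n_empty = int(len(cells) * empty_frac)
    n_agents = len(cells) - n_empty
    kinds = [0] * (n_agents // 2) + [1] * (n_agents - n_agents // 2) + [None] * n_empty
    rng.shuffle(kinds)
    grid = dict(zip(cells, kinds))

    def like_share(pos):
        # Share of occupied neighbours (8-cell neighbourhood) of the same type.
        r, c = pos
        same = total = 0
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == 0 and dc == 0:
                    continue
                other = grid[((r + dr) % size, (c + dc) % size)]
                if other is not None:
                    total += 1
                    same += other == grid[pos]
        return same / total if total else 1.0

    def mean_like_share():
        occupied = [p for p in cells if grid[p] is not None]
        return sum(like_share(p) for p in occupied) / len(occupied)

    initial = mean_like_share()
    for _ in range(steps):
        # Agents unhappy at the start of the sweep each move to a random empty cell.
        unhappy = [p for p in cells if grid[p] is not None and like_share(p) < threshold]
        if not unhappy:
            break
        rng.shuffle(unhappy)
        empties = [p for p in cells if grid[p] is None]
        for p in unhappy:
            idx = rng.randrange(len(empties))
            target = empties[idx]
            grid[target], grid[p] = grid[p], None
            empties[idx] = p  # the cell just vacated becomes available
    return initial, mean_like_share()

initial, final = run_schelling()
print(f"mean like-neighbour share: {initial:.2f} at start, {final:.2f} at end")
```

No agent in this sketch wants a segregated neighbourhood; each only objects to being in a small local minority. The change between the two printed numbers is the emergent part.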

The course does not start with agents, though. It starts with a more disciplined question: why model at all, and which kind of model? Bobashev frames modelling through systems science, and insists that the choice of tool follow the objective. There are, in the course's framing, four reasons to build a model — to predict a number, to make a decision, to understand a relationship, or to estimate a risk — and a spectrum of model families to choose from, ordered by how much structure they admit: statistical models, Markov models, system-dynamics models, microsimulations, and — at the far end, where the agents stop being passive and start interacting — agent-based models. An ABM is the right instrument only for some objectives, and a substantial part of the teaching is learning to tell which. The standard text for this is Railsback and Grimm's Agent-Based and Individual-Based Modeling: A Practical Introduction, and the laboratory tool is NetLogo.

Course connection. Bobashev teaches Agent-Based Modelling (MSc in Data Science & AI and, since 2025, the MSc in Data Analytics with AI), which builds on Foundations of Statistical Analysis, Parts 1 & 2, the school's "FSML" foundation, taught by Dr Christophe Bécavin and Dr Christine Malot. Modelling rests on statistical reasoning; the prerequisite is not decoration.

02 Building a world from evidence, not from nothing

The crucial discipline — and the answer to anyone who suspects synthetic data is just "making things up" — is that you do not invent data arbitrarily. You encode what is genuinely known into the mechanism, and you let the mechanism, not your wishes, produce the output.

Corniglion's tourists never existed, but the rules they followed were not fiction: pedestrian movement, a bounded probability of entering a shop, expenditure drawn from distributions anchored to expert estimates, structural constraints on the mix of businesses taken from direct observation. The artificial data was a consequence of those evidence-based rules, which is exactly why its conclusions were interesting rather than circular — the surprising result about nationality was not assumed, it fell out of the simulation.
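The idea of data as a consequence of rules can be sketched in a few lines. The following is not the original NetLogo model: the visitor profiles, entry probabilities, median spends and the log-normal spending distribution are hypothetical placeholders standing in for the expert estimates the text describes. Movement is reduced to a fixed number of shops passed per day.

```python
import math
import random

# Hypothetical visitor profiles: (probability of entering a shop passed,
# median spend per visit in euros). Placeholders, not figures from the paper.
PROFILES = {"budget": (0.10, 8.0), "family": (0.20, 25.0), "luxury": (0.30, 90.0)}

def simulate_sales(n_tourists=200, days=5, shops_passed_per_day=6, seed=42):
    """Generate artificial sales records from simple, explicit rules."""
    rng = random.Random(seed)
    names = list(PROFILES)
    sales = []
    for tourist_id in range(n_tourists):
        profile = rng.choice(names)
        p_enter, median_spend = PROFILES[profile]
        for day in range(1, days + 1):
            for _ in range(shops_passed_per_day):
                if rng.random() < p_enter:
                    # Log-normal spend whose median equals the assumed estimate.
                    amount = rng.lognormvariate(math.log(median_spend), 0.5)
                    sales.append({"tourist": tourist_id, "profile": profile,
                                  "day": day, "amount": round(amount, 2)})
    return sales

sales = simulate_sales()
print(len(sales), "synthetic sales records; first record:", sales[0])
```

Every number in the output traces back to a stated rule or a stated assumption, which is what makes the result auditable: change an assumption and the data changes with it.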

Bobashev's epidemiological work makes the same move at a far higher level of rigour. To model how an infection moves through a city you need a population with structure, not a flat list of individuals: people grouped into households, schools, workplaces and social groups, mixing at different rates. That structured population is synthetic, and at RTI it is a concrete artefact, not a metaphor: a dataset of anonymous synthetic persons and households, placed geographically and matched to United States census and American Community Survey distributions down to the block level, complete with group quarters (dormitories, nursing homes, prisons, military bases) and with school and workplace assignments that encode the contact network itself (RTI Synthetic Population viewer). No real person is in it; the structure that drives the disease is. The disease dynamics are then simply a consequence of who plausibly meets whom, and the many reference surveys that no single database unifies become, in aggregate, enough to constrain a credible model. This is the same instinct that drives privacy-preserving synthetic data in regulated settings: reproduce the population, not the individuals, so that no real person is exposed while the structure that matters is preserved.
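A heavily simplified sketch of the idea, not RTI's method: draw households and persons from marginal distributions, then assign schools and workplaces so that shared membership defines who can meet whom. The shares below are hypothetical, there is no geography, and joint constraints (for example, that a child should not live alone) are ignored.

```python
import random
from collections import defaultdict

# Hypothetical marginals; a real synthetic population would take these
# from census tables and fit them jointly.
HOUSEHOLD_SIZE_SHARES = {1: 0.30, 2: 0.35, 3: 0.15, 4: 0.20}
AGE_BAND_SHARES = {"child": 0.20, "adult": 0.60, "senior": 0.20}

def build_population(n_households=500, n_schools=3, n_workplaces=20, seed=7):
    """Create synthetic persons grouped into households, schools and workplaces."""
    rng = random.Random(seed)
    sizes = rng.choices(list(HOUSEHOLD_SIZE_SHARES),
                        weights=list(HOUSEHOLD_SIZE_SHARES.values()),
                        k=n_households)
    people = []
    for household_id, size in enumerate(sizes):
        for _ in range(size):
            band = rng.choices(list(AGE_BAND_SHARES),
                               weights=list(AGE_BAND_SHARES.values()))[0]
            place = None
            if band == "child":
                place = ("school", rng.randrange(n_schools))
            elif band == "adult":
                place = ("work", rng.randrange(n_workplaces))
            people.append({"id": len(people), "household": household_id,
                           "age_band": band, "place": place})
    return people

def contact_groups(people):
    """Map each household, school and workplace to the ids of its members."""
    groups = defaultdict(list)
    for person in people:
        groups[("household", person["household"])].append(person["id"])
        if person["place"] is not None:
            groups[person["place"]].append(person["id"])
    return groups

people = build_population()
groups = contact_groups(people)
print(len(people), "synthetic persons in", len(groups), "contact groups")
```

The output contains no real individual, yet it carries the grouping that a transmission model needs. How well the marginals and their joint structure match reality is exactly what has to be validated.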

The honest framing is this: a synthetic dataset is only as good as the evidence and the mechanism behind it. Built carelessly, it launders assumptions into conclusions. Built well, it is a way of reasoning rigorously about a system you cannot fully observe.

03 Which model, and at what scale: the hybrid insight

One of Bobashev's most cited methodological contributions shows what maturity in this field looks like. With Joshua Epstein and colleagues, he addressed a genuine tension in epidemic modelling: agent-based models capture the local interaction and individual variation that matter enormously at the start of an outbreak, when a handful of cases either fizzles or ignites — but they are computationally heavy. Equation-based (compartmental) models are tractable and even analytically transparent, but they assume well-mixed averages that misrepresent exactly that early, structured phase (A Hybrid Epidemic Model: Combining the Advantages of Agent-Based and Equation-Based Approaches, Bobashev, Goedecke, Yu & Epstein, Proceedings of the 2007 Winter Simulation Conference, pp. 1532–1537).

Their answer was not to pick a side but to switch: run the model agent-by-agent while the number of infected is small and individual variation dominates, then, once the count is large enough for the law of large numbers to apply, hand over to a much cheaper equation-based description — and switch back if numbers fall again. The hybrid saves computation and, more fundamentally, lets the emergent structure produced by the agents be analysed mathematically. They treat the full ABM as the "gold standard" with the most micro-detail, and ask precisely when a coarser description is safe.
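The switching logic can be illustrated with a toy SIR model. This is a sketch under stated assumptions, not the published algorithm: the population size, rates and the switching threshold of 200 infected are hypothetical, and the agent phase here keeps only individual-level chance (each person draws their own random number), not the contact structure a full ABM would carry.

```python
import math
import random

def hybrid_sir(n=10_000, i0=5, beta=0.3, gamma=0.1, threshold=200, days=300, seed=1):
    """Daily SIR steps: individual draws below `threshold` infected, expected values above.

    Returns a list of (day, mode, susceptible, infected, recovered).
    """
    rng = random.Random(seed)
    s, i = n - i0, i0
    p_rec = 1 - math.exp(-gamma)            # daily recovery probability
    history = []
    for day in range(days):
        if i < 0.5:                          # effectively no cases left
            break
        use_agents = i < threshold
        if use_agents:
            s, i = round(s), round(i)        # back to whole individuals
        p_inf = 1 - math.exp(-beta * i / n)  # daily infection probability per susceptible
        if use_agents:
            # Agent phase: one random draw per person, so small outbreaks can die out.
            new_inf = sum(rng.random() < p_inf for _ in range(s))
            new_rec = sum(rng.random() < p_rec for _ in range(i))
        else:
            # Equation phase: replace the draws with their expected values.
            new_inf = s * p_inf
            new_rec = i * p_rec
        s, i = s - new_inf, i + new_inf - new_rec
        history.append((day, "agent" if use_agents else "equation", s, i, n - s - i))
    return history

history = hybrid_sir()
switches = sum(1 for a, b in zip(history, history[1:]) if a[1] != b[1])
print(f"{len(history)} days simulated, {switches} switch(es) between modes")
```

Both phases use the same per-person probabilities, so the equation phase is simply the agent phase with its randomness averaged out. The modelling judgement lies in choosing a threshold above which that averaging is safe.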

There is a precise mathematical reason this matters, and Bobashev teaches it directly. When a system's response is non-linear, the average of the outcomes is not the outcome of the average; Jensen's inequality gives the direction of the gap, with E[f(X)] ≥ f(E[X]) for a convex response f and the reverse for a concave one. A statistical or system-dynamics model implicitly averages first and then applies the rule; an agent-based model applies the rule to each individual and averages afterwards. For a curved response these give systematically different answers, and the gap is widest exactly where individual variation is largest and the rule bends most sharply, which is to say at the early, structured phase of an outbreak. That is the heterogeneity the agents preserve and the aggregate erases. Seen this way, the hybrid model is a disciplined statement of when that distinction has stopped mattering and a cheaper average has become safe.
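A small numerical illustration of the gap, using hypothetical numbers. Suppose the risk of infection after k independent contacts is 1 - (1 - p)^k, which is concave in k, and suppose 90% of people have one contact while 10% have a hundred.

```python
def infection_risk(contacts, p_per_contact=0.05):
    """Probability of at least one transmission over `contacts` independent contacts."""
    return 1 - (1 - p_per_contact) ** contacts

# Hypothetical, deliberately skewed contact distribution.
contacts = [1] * 90 + [100] * 10
mean_contacts = sum(contacts) / len(contacts)

# "Average first, then apply the rule" (what an aggregate model does).
risk_of_average = infection_risk(mean_contacts)

# "Apply the rule to each individual, then average" (what an ABM does).
average_of_risks = sum(infection_risk(k) for k in contacts) / len(contacts)

print(f"risk at the average contact count: {risk_of_average:.3f}")   # about 0.43
print(f"average of individual risks:       {average_of_risks:.3f}")  # about 0.14
```

With these assumed numbers the aggregate calculation roughly triples the risk. With a homogeneous population (everyone at the mean) the two numbers coincide, which is the condition under which the cheaper description is safe.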

That is the transferable lesson, and it generalises well beyond epidemics: rigour is not loyalty to a favourite method. It is matching the formalism to the question and to the scale, and knowing when aggregation is justified and when it would erase the very thing you are trying to see.

04 Why should one trust a model?

This is, verbatim, one of the questions on the syllabus, and it is where the course earns its seriousness. Bobashev opens it with the modeller's oldest proverb — all models are wrong, but some are useful (George Box) — and then spends real time on what "useful" has to be made to mean. A simulation that runs and produces plausible-looking pictures is the most dangerous artefact in computational science, because plausibility is not validity.

The honest difficulties are well known and are taught as such:

Validation is layered, and most of the layers are not the obvious one. The course separates them carefully: verification (is the code actually doing what you wrote, with no bugs?); internal validation, which is what calibration buys you (are inputs and outputs consistent with the data you built the model from?); external validation (does it match data it was not fitted to?); cross-validation against other models; predictive validity; and plain face validity. Tuning a model until it reproduces known data clears only the second of these — and is routinely mistaken for the third.
Equifinality. Many different parameter settings — and even different mechanisms — can produce the same output. A good fit does not single out a true explanation, and treating it as if it does is a standard error.
Discovery or artefact. When the rules come from expert judgement, it is easy to bake the desired conclusion into the assumptions and then "discover" it. But the opposite failure is just as real: if a theoretical model only ever confirms common sense it has taught you nothing, and when it produces something surprising the first question is always whether that is a genuine finding or an artefact of the model. The defence is the same in both directions — derive results that were not assumed, and test sensitivity to every uncertain choice.
The map and the territory. A model is an argument about a system, not the system itself. Its value is in disciplined, falsifiable exploration, not in the authority of a confident-looking output.

The course is correspondingly precise about the three things people tend to blur together: sensitivity (how much results move when parameters or initial conditions are nudged), uncertainty (how parameter uncertainty propagates into the reliability of the output), and robustness (whether the conclusion survives a change to the model's structure, not just its numbers). The field's main instrument for making all of this inspectable is the ODD protocol (Overview, Design concepts, Details), a standard structure for describing an agent-based model in full so that another researcher can scrutinise and reproduce it (Grimm et al., JASSS, 2020 update). DSTI's course teaches model building through ODD, alongside uncertainty analysis, interpretation, documentation and presentation — the unglamorous parts that separate a result from a screenshot. Bobashev's own published work models this restraint: the hybrid paper is careful to state where validation remains for future work rather than overclaiming.
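The first two of these can be separated in code. The sketch below uses a hypothetical stand-in for a model run (a simple contagion among 200 agents) and assumed parameter ranges. Sensitivity is measured by moving one parameter and averaging over seeds; uncertainty is measured by drawing the parameter from a range and reading off an output interval. Robustness is not shown, because it requires changing the model's structure, not its inputs.

```python
import random
import statistics

def toy_model(contact_rate, seed, n_agents=200, steps=20):
    """Hypothetical stand-in for one ABM run: share of agents reached after `steps`."""
    rng = random.Random(seed)
    reached = 1
    for _ in range(steps):
        # Chance that a not-yet-reached agent is reached by at least one reached agent.
        p = 1 - (1 - contact_rate / n_agents) ** reached
        reached += sum(rng.random() < p for _ in range(n_agents - reached))
    return reached / n_agents

def sensitivity(rates, seeds):
    """Mean output at each parameter value, averaged over the same set of seeds."""
    return {rate: statistics.mean(toy_model(rate, s) for s in seeds) for rate in rates}

def uncertainty(low, high, n_draws, seed=0):
    """Approximate 5th and 95th percentile of the output when the parameter is uncertain."""
    rng = random.Random(seed)
    outputs = sorted(toy_model(rng.uniform(low, high), seed=1000 + k)
                     for k in range(n_draws))
    return outputs[int(0.05 * n_draws)], outputs[int(0.95 * n_draws)]

print("sensitivity:", sensitivity([0.1, 0.3, 0.5], seeds=range(20)))
print("uncertainty interval:", uncertainty(0.1, 0.5, n_draws=40))
```

Reporting only a single run at a single parameter value hides both effects, which is one reason the ODD description asks for the full set of inputs and procedures.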

DSTI's position. A model is not a substitute for evidence; it is a way of reasoning when evidence is incomplete. The skill we teach is not "running simulations" — it is knowing what a simulation can and cannot be trusted to tell you, and being able to defend the answer.

05 Making it run: agents, compute and reproducibility

Built honestly, ABMs are also demanding to run. Exploring a model means sweeping its parameters and repeating stochastic runs many times over, which quickly outgrows a laptop. This is engineering as well as science, and it is recognisably DSTI's territory.

Bobashev's recent work with Michael Duprey is a practical guide to running large-scale NetLogo models on cloud infrastructure (Enhancing Computational Efficiency in NetLogo: Best Practices for Running Large-Scale Agent-Based Models on AWS and Cloud Infrastructures, 2026). It is exactly the kind of operational detail students need: memory and JVM tuning, BehaviorSpace parameter sweeps, and matching the AWS instance family to whether a model is compute-bound or memory-bound — a comparison that, on a standard benchmark, found a compute-optimised instance roughly a third cheaper than a memory-optimised one for the same work. Two of its themes deserve emphasis beyond the cost saving. The first is reproducibility: seeding each run deterministically so that results can be regenerated exactly — a scientific virtue, not merely an engineering convenience. The second is computational sustainability: more efficient simulation is less energy, less cost and less waste, which is the same principle DSTI teaches across its engineering curriculum.
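The reproducibility point translates to any language. The sketch below is a Python analogy, not the NetLogo or BehaviorSpace workflow from the paper: `run_model` is a hypothetical stand-in for one stochastic run, and deriving the seed from a hash of the run's configuration is one possible convention among several. It means any single run in a sweep can be regenerated from its parameters and replicate index alone, on any machine.

```python
import hashlib
import json
import random

def run_seed(params, replicate):
    """Derive a stable 32-bit seed from the parameter set and replicate index."""
    key = json.dumps({"params": params, "replicate": replicate}, sort_keys=True)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4], "big")

def run_model(params, seed):
    """Hypothetical stand-in for one stochastic model run."""
    rng = random.Random(seed)
    return sum(rng.random() < params["p_event"] for _ in range(params["n_agents"]))

def sweep(grid, replicates):
    """Run every parameter set `replicates` times, recording the seed with each result."""
    results = []
    for params in grid:
        for rep in range(replicates):
            seed = run_seed(params, rep)
            results.append({"params": params, "replicate": rep,
                            "seed": seed, "output": run_model(params, seed)})
    return results

grid = [{"p_event": p, "n_agents": 1000} for p in (0.1, 0.2)]
results = sweep(grid, replicates=3)
record = results[0]
# Any single run can be regenerated exactly from what was recorded.
print(run_model(record["params"], record["seed"]) == record["output"])
```

A hash is used instead of Python's built-in `hash()` because the latter is randomised per process for strings and would not give the same seed across machines.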

06 Where the stakes are human

It would be possible to teach all of this on toy problems. Bobashev does not, and the choice of problems is itself part of what students absorb. His research sits, by deliberate design, where data is scarcest, most sensitive, and most consequential: public health and substance use.

His group has used agent-based and statistical modelling to study the combined effect of medication for opioid use disorder and naloxone on overdose deaths across counties in New York (Cerdá et al., Epidemiology, 2024); HIV transmission among people who inject drugs, including through the COVID-19 period (Des Jarlais, Bobashev et al., Drug and Alcohol Dependence, 2022); and buprenorphine diversion examined through a harm-reduction lens rather than a purely punitive one (Adams et al., Harm Reduction Journal, 2023). His statistical methods work is in the same service: the mobForest R package for random-forest model-based recursive partitioning was demonstrated on alcohol-dependence treatment data (Garge, Bobashev & Eggleston, BMC Bioinformatics, 2013).

These are precisely the settings where you cannot simply collect the dataset, for reasons of privacy, ethics, stigma and law, and where getting the modelling wrong has a human cost. The subtitle of this article, decision-grade worlds, is not rhetorical in his work: the same family of models was pressed into service during the COVID-19 pandemic to forecast regional hospital and intensive-care bed demand on a rolling basis, the kind of output a public-health authority actually plans against. It is here that the course's framing of an agent-based model as, in effect, an artificial-intelligence system for a whole society (a population of decision-making agents whose collective behaviour you can interrogate) stops being a slogan. Taken together, these studies are a quiet argument that modelling-from-evidence is not a workaround for missing data but, handled responsibly, a way to reason about interventions that matter. The care shows in the choice of questions.

07 The course at DSTI

Agent-Based Modelling is taught within the MSc in Data Science & AI and the MSc in Data Analytics with AI, by Dr Georgiy Bobashev. The Data Analytics with AI route was added in 2025 on the recommendation of DSTI's Scientific Advisory Board (the same body that first brought Bobashev to the school), which judged that the discipline of modelling-from-evidence matters as much for data analysts as for AI specialists. The course assumes the school's statistical foundation, the "FSML" prerequisite: Foundations of Statistical Analysis, Part 1 (Dr Christophe Bécavin) and Part 2 (Dr Christine Malot). At DSTI this is knowledge students are expected to have revised, not an optional extra. The core reference is Railsback and Grimm; the working environment is NetLogo, which students install and use from the first laboratory.

It runs as an intensive sequence of paired lecture-and-laboratory days. The arc goes from why model? through the systems-science families and the matching of method to objective, into the ODD protocol and the components of an ABM — agents, rules, environments, networks — and then into building, running and analysing models in the laboratory, individually and in teams, before closing on calibration, validation and the relationship between ABM and AI. Assessment is a project, and the brief itself teaches the two halves of the craft: each student either builds a working model in NetLogo or documents a complex one in full through the ODD protocol. The standard Bobashev sets is the same one this article has tried to honour — produce something someone has a reason to trust.

Closing: an honest kind of data

Data science spends most of its attention on abundance. This course is a deliberate counterweight: a serious treatment of what to do when the data is absent, fragmented or rightly out of reach — which, for a great many real questions, is the normal condition rather than the exception. The answer is not to invent data and hope. It is to build a mechanism from genuine evidence, to be ruthless about validation, and to remain clear that a model is an argument, not an oracle.

It is fitting that the course exists because of an admission rather than a credential — a founder who knew the limits of his own authority on the subject, a board member who knew the right person, and a researcher who has spent a career doing this where it counts. That is the version of expertise DSTI tries to teach: not the confidence to simulate, but the judgement to know what a simulation is worth.

References and sources

DSTI faculty and originating work

Corniglion, S. & Tournois, N. (2012). Towards a Numerical, Agent-Based, Behaviour Analysis: The Case of Tourism. In Agents and Data Mining Interaction (ADMI 2011), LNAI 7103, pp. 58–85. Springer.
Bobashev, G. V., Goedecke, D. M., Yu, F. & Epstein, J. M. (2007). A Hybrid Epidemic Model: Combining the Advantages of Agent-Based and Equation-Based Approaches. Proceedings of the 2007 Winter Simulation Conference, pp. 1532–1537.
Duprey, M. A. & Bobashev, G. V. (2026). Enhancing Computational Efficiency in NetLogo: Best Practices for Running Large-Scale Agent-Based Models on AWS and Cloud Infrastructures. arXiv preprint.
Garge, N. R., Bobashev, G. & Eggleston, B. (2013). Random forest methodology for model-based recursive partitioning: the mobForest package for R. BMC Bioinformatics 14:125.
Cerdá, M., Bobashev, G., Epstein, J. M. et al. (2024). Simulating the simultaneous impact of medication for opioid use disorder and naloxone on opioid overdose death in eight New York counties. Epidemiology 35(3):418–429.
Des Jarlais, D., Bobashev, G., Feelemyer, J. & McKnight, C. (2022). Modeling HIV transmission among persons who inject drugs (PWID)… Drug and Alcohol Dependence 238:109573.
Adams, J. W., Duprey, M., Bobashev, G. et al. (2023). Examining buprenorphine diversion through a harm reduction lens: an agent-based modeling study. Harm Reduction Journal 20:150.

Method and foundations

Epstein, J. M. & Axtell, R. (1996). Growing Artificial Societies: Social Science from the Bottom Up. MIT Press.
Railsback, S. F. & Grimm, V. Agent-Based and Individual-Based Modeling: A Practical Introduction. Princeton University Press.
Grimm, V. et al. (2020). The ODD Protocol for Describing Agent-Based and Other Simulation Models: A Second Update. JASSS 23(2):7.
Wilensky, U. (1999). NetLogo. Center for Connected Learning, Northwestern University.
RTI International. Synthetic Population viewer.

On data scarcity, fragmentation and synthetic data

IBM. What is data fragmentation?
MIT Sloan. What is synthetic data — and how can it help you competitively?
Gartner. Top Data & Analytics Predictions for 2025 and beyond.
Figures attributed in-text to IDC, Forrester and DATAVERSITY's 2024 Trends in Data Management survey (data scientists' preparation time and "data silos as top concern").

People

Dr Gregory Piatetsky-Shapiro — KDnuggets profile · Wikipedia
Dr Georgiy Bobashev — RTI profile