Propensity models are not dead. They got promoted.

Generative AI does not replace propensity models. Most practitioners have already worked this out: generative AI generates things (text, images, synthetic data), while propensity models produce calibrated probabilities from structured signals. Different tools, different jobs.

The more useful question in 2026 is how GenAI has changed each step of the workflow that produces and deploys a propensity model, and what that means for the practitioner doing the work.

Walk into any bank, insurer, or major retailer building a propensity model today and the workflow looks familiar: four layers, each with a defined purpose and a measurable output, from problem framing through to measurement and iteration. The model in the middle is still a gradient-boosted tree or a calibrated logistic regression. What is happening around that model, at each step, has shifted materially.

The workflow

Four layers. Each has a distinct purpose and produces a defined output.

Layer 1: Problem Identification. Define the right problem and the business value opportunity. The output is a well-defined, actionable use case with agreed success metrics. Nothing downstream can be trusted if this layer is rushed. Most model failures trace back here.

Layer 2: Model Development. Build a predictive model that accurately estimates propensity. The output is a validated model ready for operational use. This layer is the technical core of the workflow, but it is not where most of the value is created.

Layer 3: Translating Scoring to Strategy. Convert scores into decisions, actions, and customer strategies. The output is an actionable strategy that turns scores into optimized actions. This is where data science becomes analytics leadership.

Layer 4: Operationalize, Measure & Learn. Execute interventions, measure impact, and continuously improve. The output is measured business impact and a continuously improving system. Value is only realized at this layer. A model that is never deployed, or deployed without a feedback loop, is a cost centre.

Machine learning workflow — four layers map the entire lifecycle

The sorting logic

Two variables predict where GenAI integrates at each step: how codifiable the output is, and how costly an error is. High codifiability and low error cost means substitution is underway. High codifiability and high error cost means automation with governance. Low codifiability and low error cost is co-creation territory. Low codifiability and high error cost is augmentation only.
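The sorting logic is simple enough to state as code. This is a minimal sketch that assumes each task gets a yes/no rating on both variables, which is a simplification: in practice both are matters of degree. The `Mode` enum, the `integration_mode` function, and the example ratings are illustrative names introduced here. The ratings follow the placements given for each layer below.

```python
from enum import Enum


class Mode(Enum):
    SUBSTITUTE = "Substitute"
    GOVERN_AND_AUTOMATE = "Govern-and-Automate"
    CO_CREATE = "Co-create"
    AUGMENT = "Augment"


def integration_mode(codifiable: bool, high_error_cost: bool) -> Mode:
    """Map the two sorting variables onto the four integration modes."""
    if codifiable:
        return Mode.GOVERN_AND_AUTOMATE if high_error_cost else Mode.SUBSTITUTE
    return Mode.AUGMENT if high_error_cost else Mode.CO_CREATE


# Illustrative (codifiable, high_error_cost) ratings for four workflow tasks.
tasks = {
    "Hyperparameter tuning": (True, False),
    "Validation & explainability checks": (True, True),
    "Define outreach strategy": (False, False),
    "Define business objectives": (False, True),
}

for task, (codifiable, high_cost) in tasks.items():
    print(f"{task}: {integration_mode(codifiable, high_cost).value}")
```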

The full framework is mapped in "GenAI does not disrupt uniformly. Here is a way to think about why." What follows is the applied version for the propensity model workflow.

Machine learning workflow enhanced with GenAI

Layer 1: Problem Identification

Problem identification is the highest-stakes layer in the workflow: a misframed problem produces a model that is technically valid and strategically useless. All five tasks are weakly codifiable with high error costs, so the entire layer sits in Augment; LLMs contribute at the edges but do not own the decisions.

Define business objectives: LLMs draft structured problem statements from vague briefs; human owns the decision about what to predict and why.
Assess intervention feasibility: LLMs model scenarios and surface operational constraints; human judges viability given budget, capacity, and organizational appetite.
Identify success metrics: LLMs benchmark KPIs and draft metric definitions; human sets targets and accountability.
Define scope & assumptions: LLMs generate assumption registers and flag data gaps; human approves scope boundaries.
Stakeholder alignment & prioritization: LLMs draft alignment documents and presentations; human owns the influence and negotiation.

GenAI compresses documentation and preparation time. The decisions remain human.
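One way to make the Layer 1 output concrete is to hold the use case in a structured record. An LLM can draft the record and a human can review it. This is a minimal sketch. The `UseCaseSpec` class, its field names, and the `open_issues` check are hypothetical; they mirror the tasks listed above rather than any standard template.

```python
from dataclasses import dataclass, field


@dataclass
class UseCaseSpec:
    """Hypothetical structure for a propensity use case; field names are illustrative."""
    business_objective: str
    target_definition: str        # what event counts as a positive outcome
    prediction_window_days: int   # how far ahead the score looks
    intervention: str             # the action a high score will trigger
    success_metric: str           # agreed business KPI, not just AUC
    owner: str                    # the human accountable for the decision
    assumptions: list[str] = field(default_factory=list)

    def open_issues(self) -> list[str]:
        """Return gaps a reviewer should resolve before modelling starts."""
        issues = []
        for name in ("business_objective", "target_definition", "intervention",
                     "success_metric", "owner"):
            if not getattr(self, name).strip():
                issues.append(f"missing {name}")
        if self.prediction_window_days <= 0:
            issues.append("prediction window must be positive")
        if not self.assumptions:
            issues.append("assumption register is empty")
        return issues


spec = UseCaseSpec(
    business_objective="Reduce voluntary card churn",
    target_definition="Account closed within the prediction window",
    prediction_window_days=90,
    intervention="Retention offer",
    success_metric="",
    owner="Head of Retention",
)
print(spec.open_issues())
```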

Layer 2: Model Development

Model development is the most codifiable layer and the most transformed. Data preparation and documentation already sit at Substitute. In feature engineering, the structural change is that LLMs bring unstructured signals into the model far more comprehensively than before. Validation sits in Govern-and-Automate, because a misleading regulatory explanation carries direct legal consequences.

Data preparation: Substitute. Agentic tools handle cleaning, temporal splits, and base table construction. CTGAN and TVAE, deep generative models for tabular data, address class imbalance and cold-start by generating synthetic records, including minority-class examples, while reducing direct exposure of customer data.
Feature engineering: Substitute (generation) / Augment (selection). LLMs-as-encoders convert transcripts, tickets, and reviews into structured features that extend reach beyond what structured pipelines can touch. Feature selection and leakage prevention remain human judgment.
Model training & selection: Mixed. Hyperparameter tuning is Substitute. Algorithm choice and business-relevant model acceptance sit in Augment. TabPFN-class models are a credible default for datasets below 10,000 rows; XGBoost still dominates at enterprise scale.
Validation & explainability checks: Govern-and-Automate. LLMs translate SHAP values into plain-language explanations that support regulatory obligations such as US adverse-action notices under ECOA and Regulation B, or the transparency requirements of the EU AI Act. Human review before deployment is structurally required.
Model documentation: Substitute. LLMs generate model cards, data lineage summaries, and assumption registers from validation outputs. What previously took a day now takes an hour of review.
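Temporal splits and base table construction are codifiable. They are also where leakage usually enters, so reviewers of an agent's output need to know what correct logic looks like. This is a minimal sketch with the following assumptions:

- Events arrive as dictionaries with hypothetical `customer`, `date`, and `type` keys.
- The target is a churn event within a fixed window after each snapshot date.
- The function names and the single login-count feature are illustrative.

In the sketch, features see only events up to the snapshot, and labels see only events inside the window. Training snapshots are kept only if their label window has closed by the test snapshot.

```python
from datetime import date, timedelta


def build_base_table(events, customer_ids, snapshot, window_days):
    """One row per customer at a snapshot date.

    Features use only events on or before the snapshot; the label uses only
    events in (snapshot, snapshot + window_days], so no future information
    leaks into the features.
    """
    horizon = snapshot + timedelta(days=window_days)
    rows = []
    for cid in customer_ids:
        mine = [e for e in events if e["customer"] == cid]
        past = [e for e in mine if e["date"] <= snapshot]
        future = [e for e in mine if snapshot < e["date"] <= horizon]
        rows.append({
            "customer": cid,
            "snapshot": snapshot,
            "logins_before": sum(e["type"] == "login" for e in past),  # illustrative feature
            "label": int(any(e["type"] == "churn" for e in future)),
        })
    return rows


def out_of_time_split(snapshots, window_days):
    """Hold out the latest snapshot for testing; keep only training snapshots
    whose label window has closed by the test snapshot date."""
    test = max(snapshots)
    train = [s for s in sorted(snapshots) if s + timedelta(days=window_days) <= test]
    return train, test


events = [
    {"customer": "a", "date": date(2025, 1, 5), "type": "login"},
    {"customer": "a", "date": date(2025, 2, 10), "type": "churn"},
    {"customer": "b", "date": date(2025, 1, 20), "type": "login"},
    {"customer": "b", "date": date(2025, 1, 25), "type": "login"},
]
table = build_base_table(events, ["a", "b"], snapshot=date(2025, 1, 31), window_days=30)
train, test = out_of_time_split([date(2025, m, 1) for m in (1, 2, 3, 4)], window_days=45)
print(table)
print(train, test)
```

With a 45-day window, the March snapshot is dropped from training because its label window would still be open on the April test date.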

Most codifiable work is Substituted. Judgment-heavy steps remain Augmented. Validation sits in Govern-and-Automate where error costs are highest.
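For explainability, a deterministic reason-code mapping gives reviewers a baseline to check LLM-drafted explanations against. This is a minimal sketch. It assumes per-feature contributions (for example, SHAP values) have already been computed for one applicant, and that a positive contribution pushes the score toward the adverse outcome. The `adverse_reasons` function, the feature names, and the reason wording are hypothetical.

```python
def adverse_reasons(contributions, reason_text, top_k=4):
    """Plain-language reasons for the features that pushed one applicant's score
    furthest toward the adverse outcome (positive contribution = more adverse)."""
    adverse = [(name, value) for name, value in contributions.items() if value > 0]
    adverse.sort(key=lambda item: item[1], reverse=True)
    # Fall back to the raw feature name if no approved wording exists yet.
    return [reason_text.get(name, f"Unmapped factor: {name}") for name, _ in adverse[:top_k]]


contributions = {"utilization": 0.42, "recent_inquiries": 0.18,
                 "tenure_months": -0.10, "late_payments": 0.31}
reason_text = {
    "utilization": "Proportion of available credit in use is too high",
    "late_payments": "Number of recent late payments",
    "recent_inquiries": "Too many recent credit inquiries",
}
print(adverse_reasons(contributions, reason_text, top_k=2))
```

The human review step can then compare the LLM's prose against this list. Any cited reason that is not on it, or any top reason the prose omits, is a finding.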

Layer 3: Translating Scoring to Strategy

This is where the model becomes a business decision, and Co-create dominates. Running and refreshing scores is codifiable and moving toward Substitute; the strategic work around it requires contextual judgment that cannot be automated.

Define segmentation & scoring approach: Substitute (score runs and refresh) / Co-create (segment strategy). Automated scoring is table stakes; deciding what constitutes a persuadable tier versus one where intervention backfires requires operational context.
Generate business insights: Co-create / Augment. LLMs generate business-ready narratives from segment profiles in seconds; whether an observation reflects a real behavioral pattern or a data artifact requires a practitioner to evaluate.
Define decision rules: Co-create. GenAI surfaces optimal thresholds and models the business case under different scenarios; human sets eligibility criteria, exclusions, and suppression logic.
Define outreach strategy: Co-create. GenAI generates treatment variants and models expected lift; human owns strategic logic, channel decisions, and brand constraints.
Test strategy: Co-create / Augment. GenAI drafts test designs and flags common methodological errors; human signs off on holdout construction and success metrics, since a flawed test compounds into bad decisions downstream.
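The simplest decision rule behind a threshold is a break-even calculation. Contact a customer when the expected value of contact exceeds its cost: p * V - C > 0, i.e. p > C / V. This is a minimal sketch with hypothetical names (`break_even_threshold`, `select_for_outreach`) and made-up economics. It assumes the score approximates the incremental probability of conversion caused by contact. A raw propensity score does not separate persuadables from customers who would convert anyway, which is the segmentation judgment described above. Suppression is applied after the threshold, as a human-owned list.

```python
def break_even_threshold(value_if_converted, cost_per_contact):
    """Contact pays off when p * value - cost > 0, i.e. when p > cost / value."""
    return cost_per_contact / value_if_converted


def select_for_outreach(customers, value_if_converted, cost_per_contact, suppressed):
    """Apply the expected-value threshold, then human-owned suppression rules."""
    threshold = break_even_threshold(value_if_converted, cost_per_contact)
    return [c["id"] for c in customers
            if c["score"] > threshold and c["id"] not in suppressed]


customers = [{"id": "a", "score": 0.10}, {"id": "b", "score": 0.02}, {"id": "c", "score": 0.30}]
# Hypothetical economics: 200 of value per conversion, 5 per contact -> threshold 0.025.
print(select_for_outreach(customers, value_if_converted=200, cost_per_contact=5, suppressed={"c"}))
```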

AI generates options and evidence across this layer. Human judgment determines what to do with them.
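Signing off on holdout construction includes checking that the holdout is large enough to detect the effect the business case depends on. This is a minimal sketch of the standard sample-size formula for comparing two conversion rates with a two-sided z-test, using only the Python standard library. The function name and the example rates are illustrative assumptions.

```python
import math
from statistics import NormalDist


def n_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Customers needed in each of treatment and holdout to detect the
    difference between two conversion rates with a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p_control * (1 - p_control) + p_treatment * (1 - p_treatment))
    ) ** 2
    return math.ceil(numerator / (p_control - p_treatment) ** 2)


# Detecting a lift from a 5% to a 6% conversion rate at alpha = 0.05, power = 0.8.
print(n_per_group(0.05, 0.06))
```

Because the required size scales roughly with the inverse square of the difference, halving the detectable lift roughly quadruples the holdout. The minimum effect worth detecting therefore has to be agreed before the test launches.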

Layer 4: Operationalize, Measure & Learn

The most automated layer in the workflow, with the clearest governance lines. Deployment, drift monitoring, and test tracking run with minimal human involvement once the system is in production; the feedback loop that closes the system is the least automated step and requires the most contextual judgment.

Operational deployment: Substitute. Containerization, API endpoints, CI/CD, and load testing are within range of agentic development tools; human approves the go-live runbook.
Audience activation: Co-create. Agents generate personalized message variants per propensity tier and execute outreach at consumer scale. Treatment logic, suppression rules, and channel constraints remain human decisions given brand and regulatory consequences.
A/B testing & measurement: Govern-and-Automate / Augment. Tracking, lift computation, and significance flagging run autonomously; distinguishing a genuine effect from a measurement artifact requires human review.
Measure model health: Govern-and-Automate. Score distribution tracking, drift tests (PSI, KS), and alert triggering need no human involvement to function. Root cause diagnosis and the decision to act on a drift alert require a practitioner who can distinguish a pipeline issue from a genuine behavioral shift.
Capture outcomes & iterate continuously: Augment. GenAI surfaces patterns in outcome data and drafts the iteration case; decisions about what to change, and whether the original problem framing has become stale, belong to the practitioner.
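Drift monitoring is the most mechanical piece of Layer 4. This is a minimal sketch of the Population Stability Index (PSI), one of the drift tests named above. It compares the current score distribution against a baseline, with bins set at the baseline's deciles. The `psi` function and the small `eps` floor for empty bins are implementation choices assumed here. Commonly quoted rules of thumb treat PSI below 0.1 as stable and above 0.25 as a significant shift, but the alert threshold is a local decision.

```python
import bisect
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index: sum over bins of (a_i - e_i) * ln(a_i / e_i),
    where e_i and a_i are baseline and current shares per bin. Bin edges are the
    baseline's quantiles; eps keeps empty bins from producing log(0)."""
    ordered = sorted(expected)
    edges = [ordered[int(len(ordered) * i / bins)] for i in range(1, bins)]

    def shares(scores):
        counts = [0] * bins
        for s in scores:
            counts[bisect.bisect_right(edges, s)] += 1
        return [max(c / len(scores), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 1000 for i in range(1000)]
shifted = [min(x + 0.2, 0.999) for x in baseline]  # simulated upward score drift
print(psi(baseline, baseline), psi(baseline, shifted))
```

The number only says that the distribution moved. Whether the cause is a broken upstream feed or a genuine behavioral shift is the diagnosis the text assigns to a practitioner.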

Autonomous execution with governance at the boundary. Humans own interpretation and iteration decisions.
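The autonomous part of A/B measurement, lift computation and significance flagging, reduces to a few lines. This is a minimal sketch using a pooled two-proportion z-test from the standard library. The function name and the example counts are illustrative. A small p-value is the flag, not the conclusion. For example, a holdout contaminated by another campaign can produce a "significant" lift that is a measurement artifact.

```python
import math
from statistics import NormalDist


def lift_and_p_value(conv_treat, n_treat, conv_ctrl, n_ctrl):
    """Absolute lift of treatment over holdout, with a two-sided p-value
    from a pooled two-proportion z-test."""
    p_t, p_c = conv_treat / n_treat, conv_ctrl / n_ctrl
    pooled = (conv_treat + conv_ctrl) / (n_treat + n_ctrl)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_treat + 1 / n_ctrl))
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_t - p_c, p_value


# Hypothetical campaign: 600 of 10,000 treated converted vs 500 of 10,000 in holdout.
print(lift_and_p_value(600, 10_000, 500, 10_000))
```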

What changes for the practitioner

The framework above shows exactly where autonomous execution is taking over. The harder question is what practitioners do with the capacity that gets freed up.

Routine, codifiable work (data cleaning, hyperparameter search, score refresh, drift monitoring) is moving toward autonomous execution. That is time returned to the work that matters more.

That work concentrates in three places. The first is judgment: problem framing, model selection trade-offs, causal interpretation, and decisions about when to retrain. A function that hands off the codifiable work without investing the recovered capacity in judgment-intensive work becomes a faster pipeline that still misses the important calls.

The second is organizational. LLMs can generate analysis faster than any analyst ever could. What they cannot do is walk into a room of skeptical stakeholders with competing priorities and make the case that this particular problem is worth solving. That requires business understanding, organizational credibility, and the ability to influence a decision without formal authority. In a world where the technical execution gap between analytics teams is narrowing, these capabilities become the primary differentiator between functions that shape strategy and functions that serve it.

The third is architectural thinking. The model builder and the systems architect are increasingly the same person. Which components need to exist? Which can be bought? How does the pipeline span model selection, feature engineering, retrieval design, agent orchestration, API contracts, and feedback loops? Where does human judgment stay structurally in the decision path, as a deliberate design choice rather than a governance checkbox? These questions now belong to the practitioner's brief.

The propensity model did not get replaced. It got promoted. And so did the practitioner who builds and deploys it, into a job that is technically broader, strategically more visible, and more consequential than the one they had before.

If this was useful, Analytics in the AI Era goes deeper on questions like this every week. Analysis grounded in what is actually happening, applied to the work of leading data and analytics teams.


If any of this connects to something you are working through, I would love to compare notes. Reach out.