AC.
Back to blog
·13 min readGenAI TransformationFrameworks & Models

AI is Ready for Simple Analytics. Then it Hits a Cliff

A practitioner's benchmark of 60 SQL tasks finds the exact point where AI analytics reliability ends.

Every vendor demo answers the same question: can the AI do it? You type a question in plain English, a clean SQL query appears, the numbers look right, the room nods.

That is not the question analytics practitioners need answered.

The question that matters when you are deciding whether to put AI into an analytics workflow is narrower and far less flattering: how often will it produce a correct, trustworthy result without a human checking its work, and at what level of complexity does that stop being true?

I could not find a clear answer in terms I could actually use. The published research tends to answer a different question: can the model map natural language to valid SQL? That answers something real. The question analytics leaders need to act on is different.

So I ran my own structured test: sixty SQL tasks across six models, three prompt strategies, scored three independent ways. This article is what I found.

If you lead an analytics team, here is the whole thing in six lines:

  1. AI is genuinely production-ready for simple SQL, and then falls off a cliff.
  2. How you prompt it matters more than which model you buy, at least at the easy end.
  3. The real risk is confident, wrong SQL that runs fine and returns the wrong number.
  4. Open-weight models are shockingly competitive, with frontier-level quality at a fraction of the cost and real caveats.
  5. Do not trust an AI to grade its own homework, especially on correctness.
  6. Your "AI is good at X" baseline expires in about one model generation.

How I structured the test

Sixty SQL tasks across five complexity tiers, from single-table aggregations to multi-CTE pipelines, tested against six models using three prompt strategies: naive (question only), structured (question plus full schema), and optimised (structured plus chain-of-thought guidance). Every response was scored three ways: AI judge, execution against a reference result set, and blind human calibration.

Each combination lands in one of three operational modes:

  • Automate: production-ready with spot-checks. (Correctness ≥ 4.0, execution-confirmed.)
  • Augment: reliable first draft; human reviews before it ships. (Correctness ≥ 3.0.)
  • Human-led: AI accelerates a person but cannot be the primary author. (Correctness < 3.0.)

Those three modes are the main output. Everything that follows is evidence for where each type of work lands.


Insight 1: AI is ready to automate simple SQL. Then it hits a cliff.

With proper schema context, all three frontier models reach Automate at Tiers 1 and 2, covering single-table aggregations and standard multi-table joins. A large share of day-to-day analytics work lives here: most dashboards, most routine reporting questions.

Then, at Tier 3, every model falls off a cliff.

Correctness score chart showing a sharp drop at Tier 3, crossing below the 4.0 Automate threshold and not recovering through Tier 5

The drop is abrupt. Correctness falls from 4.22 at Tier 2 to 3.69 at Tier 3 in a single step, crossing below the 4.0 Automate threshold and not recovering at any tier beyond it. This pattern holds for every model in the study, which confirms a structural capability boundary rather than a quirk of one vendor.

The reason is very interesting. Tiers 1 and 2 require the model to recall syntax and apply schema references in a relatively linear way. Tier 3 requires holding multiple schema relationships in context simultaneously while applying window functions or constructing nested subqueries. The limit seems to be structural.

What lives at Tier 3 and beyond? Window functions, ranking-within-groups, multi-step aggregation, correlated subqueries, multi-CTE pipelines. In business terms: cohort retention, month-over-month attribution, lifetime-value models. The analyses that are worth the most are precisely the ones AI is least able to produce unsupervised.

The cliff is your deployment boundary. For most teams, a significant share of work falls into Tiers 1 and 2: routine reporting, dashboard maintenance, segment breakdowns. That is your immediate automation opportunity, and it is also the work consuming the most analyst calendar hours. Automate it, then redeploy those hours toward the Tier 4–5 work where analysts are genuinely irreplaceable.


Insight 2: Prompt strategy beats model selection

The single highest-leverage move was free: putting the database schema into the prompt. Moving from naive to structured prompting delivered a correctness lift of 1.16 to 2.09 points (on a five-point scale) across every model tested at Tier 1, at zero additional model cost.

Bar chart: structured prompts score 4.42–4.67 at Tier 1 versus naive prompts at 2.33–3.42, demonstrating that schema context outweighs model choice

Why such a large jump? Without schema context, the dominant failure is the model inventing tables that do not exist (FROM sales instead of store_sales). It writes perfectly valid SQL for a schema it imagined. Across all naive runs, 45% of failures were classified as wrong_schema. Hand it the real schema and that entire failure class largely disappears.

GPT-5.4 has the worst naive baseline and the biggest structured lift (+2.09 points), which means "model A beats model B" is often really "model A was prompted better." Be skeptical of head-to-head comparisons that do not hold prompting constant.

At Tiers 1 and 2, structured and optimised are essentially tied on correctness. The picture changes at Tier 3:

Judge correctness (1–5), structured vs. optimised, averaged across frontier models:

TierStructuredOptimisedOptimised lift
14.564.53–0.03
24.224.220.00
33.693.94+0.25
43.173.25+0.08
53.443.31–0.13

At Tier 3, optimised delivers a meaningful +0.25 correctness lift, though the benefit is model-dependent: Claude and Gemini respond well to chain-of-thought guidance; GPT-5.4 does not. Optimised also recovers the ambiguity handling and explanation quality that structured's "SQL only" constraint suppresses, and that recovery is worth the two to three times input cost increase where auditability matters. Use structured for Tiers 1–2 and move to optimised for Tier 3 Augment work.

The largest performance lever available today is schema injection into your AI tooling. It costs engineering time, paid once. If your team is evaluating models without schema-aware prompting, those results are measuring the wrong thing.


Insight 3: The real danger is confident, wrong SQL.

Once you give models the schema, their failures do not go away. They go underground. Errors stop being obvious and become dangerous: queries that execute perfectly and return a wrong answer.

Naive prompts fail mostly on hallucinated schema / Structured prompts fail on semantics:

Donut charts showing schema context shifts failures from hallucinated tables (45% of naive failures) to wrong results (48% of structured failures) — from obvious to invisible

A schema-reference error tells on itself the instant you run the query. A wrong_result error does not. To catch it, you have to already know what the right answer is. Prompt strategy reshapes the failure mode; it does not make failures disappear. Schema context eliminates the obvious ones and surfaces the subtle ones. That is a better failure mode for a supervised workflow, but only if your supervision checks outcomes. Runtime errors announce themselves; wrong results hide.

Syntax errors were 8% of structured-prompt failures. Modern models are excellent typists. The hard part is being right.


Insight 4: Open-weight models are shockingly cost-competitive

I expected the open-weight models to be clearly worse. On the easy-to-medium tiers, they were not.

Judge correctness (1–5), structured prompt, with operational readiness and cost per task:

ModelTier 1Tier 2Tier 3Cost/taskType
Claude Sonnet 4.64.67 (Automate)4.22 (Automate)4.00 (Augment)~$0.023Frontier
GPT-5.44.42 (Automate)4.00 (Automate)3.83 (Augment)~$0.028Frontier
Gemini 3.1-Pro4.58 (Automate)4.22 (Automate)3.25 (Augment)~$0.010Frontier
DeepSeek V3.24.50 (Automate)4.00 (Automate)3.50 (Augment)~$0.001Open-weight
Gemma4-OR4.08 (Automate)4.25 (Automate)3.67 (Augment)~$0.001Open-weight
Llama 4 Scout3.75 (Augment)3.08 (Augment)2.25 (Human-led)~$0.001Open-weight

DeepSeek V3.2 reaches 4.50 at Tier 1, within 0.17 points of Claude Sonnet 4.6, at roughly 1/25th the cost per task. Gemma4 is the stronger open-weight option for mid-complexity work, edging out DeepSeek at Tiers 2 (4.25 vs 4.00) and 3 (3.67 vs 3.50). Llama 4 Scout does not rise above Augment and drops to Human-led at Tier 3.

For regulated industries, there is a privacy angle worth noting. Gemma4 runs locally via Ollama with approximately 16 GB of VRAM, which means capable SQL generation fully on your own infrastructure is a real option today.

The catch: open-weight models depend entirely on schema context. Without it, correctness scores collapse toward zero; they are far less forgiving of sloppy prompting than frontier models. The cost savings are real only if your schema-injection plumbing is reliable.


Insight 5: Do not trust an AI to grade its own homework

I used an AI judge to score thousands of responses, then blind-scored a calibration sample to check whether the judge could be trusted. Mostly yes, and there is a bias pattern you need to know about.

Overall, the AI judge agreed with my human scoring 93.9% of the time on frontier models and 95.2% on open-weight ones (agreement = within one point on a five-point scale).

The agreement varies by dimension:

Bar chart showing AI judge agreement rates: correctness at 87.5% is the weakest dimension and the one that matters most, trailing schema adherence and explanation quality at 96.9%

Correctness is the weakest dimension, and the one you care about most. The judge systematically over-scores plausible-looking queries that actually return wrong data: when a generated query has sophisticated structure and executes successfully, the judge rates it highly even when the result set is wrong. At Tier 4, correctness agreement between human and judge drops to 79.2%.

The second issue is subtler. Some analytics questions have more than one defensible answer. Ask for transaction counts by warehouse, and one analyst groups by warehouse ID alone; another groups by warehouse ID and warehouse name. Both are semantically valid; which one the question intended depends on context the AI judge cannot access. A rigid scoring rubric penalizes one interpretation as wrong. That both depresses agreement scores and obscures real capability. Correctness in analytics is often contextual, and evaluation frameworks that treat it as a strict binary will systematically undercount what the model is actually getting right.

Let the AI judge style, structure, and clarity. Verify correctness with something objective: execution against a reference result set. Where a question has legitimate alternative interpretations, flag those before scoring them as failures. A model's confidence is not evidence that the answer is correct. A 12.5% error rate on the most important dimension, in a domain where wrong answers inform real decisions, is a serious governance gap that a human calibration layer needs to close.


Sidebar: Your AI baseline has a one-generation shelf life

Two of the three prior-generation models I targeted had been deprecated and inaccessible within roughly 18 months of release. Where I could compare (GPT-4o to GPT-5.4), improvement was real but tier-bounded: new models clear the easy bar faster; the hard ceiling barely moves. Providers retire models on 12 to 18-month cycles, shorter than most enterprise audit or model-risk-management cycles. A benchmark result, a vendor evaluation, a "we validated this model" sign-off: all of it has a shelf life of about one model generation. Build model re-evaluation into your analytics governance calendar as a recurring item.


A framework you can actually use: Automate / Augment / Human-led

Strip away the specific models and the durable output of this exercise is a decision lens. For any analytics task you are considering handing to AI, sort it into one of three modes and resource it accordingly.

Task complexityTypical analytics workFrontier verdict
Tier 1Single-table aggregations, simple filters, most dashboardsAutomate (with schema context)
Tier 2Standard joins, GROUP BY / HAVING, segment breakdownsAutomate (with schema context)
Tier 3Window functions, ranking, cohort and trend analysisAugment: human reviews
Tier 4–5Multi-CTE pipelines, attribution, correlated logicHuman-led: AI assists only

Read that bottom-up and it is a roadmap. The work that is safe to automate is also the work that is lowest-value and most commoditised. The work that matters most is where AI still needs a human in the loop. AI clears the Tier 1–2 queue off your senior analysts' desks. That frees them for the Tier 4–5 problems only they can solve.


Here is where I would start

  1. Build a schema-injection layer before you upgrade models. Highest-ROI move in this study. Engineering time, paid once.
  2. Stop treating "it ran" as a quality gate. Add reference-result validation, or keep a domain expert in the loop to catch confident-but-wrong queries.
  3. Match your prompt strategy to the tier. Structured for Tiers 1–2; optimised for Tier 3 Augment work where the correctness lift earns its cost.
  4. Pilot an open-weight model for high-volume, batch, or privacy-sensitive workloads, but only once your schema plumbing is solid.
  5. Keep humans firmly in the lead on Tier 3+ analysis. That is where the data says the models break.
  6. If you use AI to evaluate AI, verify correctness objectively. A model's confidence is not evidence that the answer is correct.
  7. Budget for re-evaluation every model generation. Your validated baseline expires faster than your procurement cycle.

This is a practitioner's structured test. The human calibration was done by a single rater, and the AI judge is a Claude model scoring a benchmark that includes Claude, so treat the results accordingly. Phase 1 covers SQL generation only; Phase 2 moves up the stack to data interpretation and analysis. If you want to dig into the full methodology, I am glad to share it. For analytics SQL today, the reliability line sits right at the edge of Tier 2. Plan around it.


If this was useful, Analytics in the AI Era goes deeper on questions like this every week. Analysis grounded in what is actually happening, applied to the work of leading data and analytics teams.

Subscribe here; free, and you can leave at any time.

If any of this connects to something you are working through, I would love to compare notes. Reach out.

I write about analytics leadership and AI transformation on LinkedIn.

Connect on LinkedIn

More in this topic