At AI Con 2026, everyone agreed: QA is harder now

2026-06-22 · 6 min · Oleg Neskoromnyi

I went to Seattle expecting to hear about what AI can do. I came back thinking about what it can't.

AI Con 2026 ran June 7 to 12 — six days, 20 sessions, from a pre-conference "AI for Leaders" training through hands-on workshops and a closing Leadership Summit for engineering managers. The crowd was practitioners: engineers, QA leads, CTOs, people who build software for a living. Nobody was selling anything.

The main theme, repeated across keynotes and breakout rooms and hallway conversations: writing code is cheap now. Knowing what to build, and proving it works — that's still the hard part.

If you came expecting AI to reduce the need for quality work, you left disappointed.

Why AI Making Code Cheaper Makes QA More Important

The idea pushed hardest across AI Con 2026: the bottleneck in software development has shifted. It's no longer writing the code.

One session laid out the math plainly: if you write 100% of your code with AI assistance, you might come out 15% more productive — because the bottleneck isn't code generation, it's everything around it. Reviewing it. Understanding whether it actually solves the right problem. Verifying it works the way you think it does.
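
One way to see how 100% AI-assisted coding can yield only a modest overall gain is Amdahl's law. The sketch below is an illustration, not the session's actual calculation: the 20% coding share and the 3x coding speedup are assumed numbers, chosen only to show the shape of the arithmetic.

```python
def overall_speedup(coding_share: float, coding_speedup: float) -> float:
    """Amdahl's law: overall speedup when only one part of the work gets faster.

    coding_share   -- fraction of total delivery time spent writing code (assumed)
    coding_speedup -- how many times faster AI makes that part (assumed)
    """
    if not 0.0 <= coding_share <= 1.0:
        raise ValueError("coding_share must be between 0 and 1")
    if coding_speedup <= 0:
        raise ValueError("coding_speedup must be positive")
    # Time left after the speedup: the untouched work plus the faster coding work.
    remaining_time = (1.0 - coding_share) + coding_share / coding_speedup
    return 1.0 / remaining_time


# Assumed inputs: coding is 20% of delivery time, and AI makes it 3x faster.
gain = overall_speedup(coding_share=0.20, coding_speedup=3.0) - 1.0
print(f"Overall productivity gain: {gain:.0%}")  # prints "Overall productivity gain: 15%"
```

Under the same assumed 20% share, even an infinitely fast code generator would cap the overall gain at 25%, because the other 80% of the work (review, understanding, verification) is untouched.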

The 2025 DORA State of AI-Assisted Software Development report lands in the same place: AI doesn't automatically improve software delivery performance. It amplifies what's already there. High-performing teams get better. Teams with fragmented processes get exposed.

More code generated means more code to review. More velocity means more risk of shipping something that technically runs but functionally fails. The humans don't exit the loop — they shift from producing to evaluating, from writing to validating.

That's a QA problem. It's always been a QA problem. The volume just went up.

The Technical Idea I Keep Coming Back To: LLM-as-a-Judge

The most interesting technical concept I encountered at the conference came from a session on context engineering for agentic workflows — specifically, what to do when AI pipelines run without a human watching at every step. The answer: LLM-as-a-Judge.

LLM-as-a-Judge is a pattern where a second AI model evaluates the output of a first AI model at each step in an automated pipeline. The evaluator scores quality, flags problems, and escalates to a human when something falls outside expected bounds.

Think of it as automated code review, but for AI outputs instead of human outputs. The first model does the work. The second model checks whether the work was done correctly. When it finds something wrong, it stops the pipeline and routes to a human reviewer.
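
A minimal sketch of that loop, under stated assumptions: `fake_worker` and `fake_judge` are hypothetical stand-ins for real model calls, and the names `Verdict`, `EscalateToHuman`, `run_pipeline`, and the 0.7 threshold are invented for illustration. It is not a specific framework's API, only the control flow described above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    score: float  # 0.0 (unusable) to 1.0 (fully acceptable)
    reason: str


@dataclass
class StepResult:
    step: str
    output: str
    verdict: Verdict


class EscalateToHuman(Exception):
    """Raised when the judge scores a step below the threshold."""

    def __init__(self, result: StepResult):
        super().__init__(f"{result.step}: {result.verdict.reason}")
        self.result = result


Worker = Callable[[str], str]          # first model: does the work
Judge = Callable[[str, str], Verdict]  # second model: checks the work


def run_pipeline(steps: List[str], worker: Worker, judge: Judge,
                 threshold: float = 0.7) -> List[StepResult]:
    """Run each step, have the judge score it, stop on the first low score."""
    results = []
    for step in steps:
        output = worker(step)
        verdict = judge(step, output)
        result = StepResult(step, output, verdict)
        if verdict.score < threshold:
            # Stop the pipeline and hand the failing step to a person.
            raise EscalateToHuman(result)
        results.append(result)
    return results


# Hypothetical stand-ins; a real pipeline would call two separate models here.
def fake_worker(task: str) -> str:
    return "" if "migrate" in task else f"done: {task}"


def fake_judge(task: str, output: str) -> Verdict:
    if not output.strip():
        return Verdict(0.0, "empty output")
    return Verdict(0.9, "output present")


try:
    run_pipeline(["write tests", "migrate schema"], fake_worker, fake_judge)
except EscalateToHuman as exc:
    print(f"Needs human review -> {exc}")
```

Raising an exception is one way to make the stop explicit: nothing downstream of a failed step can run by accident, and the exception carries the step, output, and verdict a reviewer needs.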

The reason this landed harder than other evaluation concepts I've encountered: it mirrors what QA engineers do in a manual review process, applied at the pipeline level. When nobody is watching the pipeline, you need something that is. A second model watching the first is one concrete answer.

I've been thinking about how this fits into the spec-based agentic workflows I already use. The spec tells the AI what to build. A judge model tells you whether it built what the spec required. That closes a loop I've been patching manually.

A 2025 paper, "When AIs Judge AIs: The Rise of Agent-as-a-Judge Evaluation," describes how this pattern is moving beyond offline testing into production pipelines: continuous monitoring of AI agents in deployment, feeding results back to developers in real time.

LLM-as-a-Judge is a pattern where a second language model evaluates the outputs of a first model at each step in an automated workflow. It scores quality, flags issues, and escalates to human review when outputs fall below a threshold — bringing automated validation to pipelines that run without direct human oversight.

AI-DLC: Thinking About the Full Lifecycle

One workshop introduced a framing I found immediately useful: AI-DLC, or using AI across the full software development lifecycle — not just at code generation.

The idea: writing code is cheap now. What remains expensive is knowing what problem you're solving and verifying the solution is correct. AI handles the middle. Humans own the edges.

The workshop had us build our own AI-powered workflow, which made it concrete fast. The point wasn't which tools to use — it was being deliberate about where in the lifecycle AI fits naturally and where it still needs a human hand.

I already work this way in my personal projects: AI for generation, me for problem definition and validation. What the workshop added was a more structured way to think about that split — something I can explain to a team instead of just doing it informally.

If you're working with AI in your development process, map where AI currently sits in your lifecycle versus where you still do work manually. The gaps are usually revealing — and often not where you expect them.
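
The mapping exercise can be as simple as a small table in code. This is a sketch with a made-up lifecycle: the stage names, the `Stage` fields, and the `unvalidated_ai_stages` helper are all hypothetical, meant to be replaced with your own process.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    name: str
    done_by: str       # "ai" or "human"
    validated_by: str  # e.g. "human", "tests", "judge-model"; "" if nothing checks it


# Hypothetical lifecycle for illustration only.
lifecycle = [
    Stage("problem definition", done_by="human", validated_by="human"),
    Stage("spec writing", done_by="human", validated_by="human"),
    Stage("code generation", done_by="ai", validated_by="tests"),
    Stage("test generation", done_by="ai", validated_by=""),
    Stage("release notes", done_by="ai", validated_by=""),
]


def unvalidated_ai_stages(stages: List[Stage]) -> List[str]:
    """Names of stages where AI does the work and nothing checks the result."""
    return [s.name for s in stages if s.done_by == "ai" and not s.validated_by]


for name in unvalidated_ai_stages(lifecycle):
    print(f"Gap: AI does '{name}' and nothing validates it")
```

In this made-up example the gaps are test generation and release notes, not code generation, which is the kind of surprise the exercise tends to surface.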

Fast Code and Right Code Are Not the Same Thing

Several sessions touched on the same trap: AI lets you move from idea to working product very quickly — ideation, building, testing, pitching — and that speed creates a false sense of confidence.

Just because AI built it fast doesn't mean it's right. Speed of generation and correctness of solution are different axes. You can ship faster and still ship the wrong thing.

The 2025 Stack Overflow Developer Survey backs this up: 66% of developers name AI solutions that are "almost right, but not quite" as a frustration, and 45% say debugging AI-generated code is more time-consuming. Adoption keeps growing. Trust keeps dropping.

The conversations I had outside the sessions confirmed the same thing: AI is moving fast enough that last month's practices might not be current. Most teams still aren't using it well. And the human judgment layer, evaluating whether what AI produced is actually correct, isn't being automated away.

Generating code faster than your team can review it isn't a productivity gain. Velocity without validation just means you find the problems later, when they cost more to fix.
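
The arithmetic behind that warning is easy to sketch. The rates below are assumptions picked for illustration (12 changes opened per day, 8 reviewed per day), and `review_backlog` is a hypothetical helper, not a measurement from any team.

```python
from typing import List


def review_backlog(generated_per_day: int, reviewed_per_day: int, days: int) -> List[int]:
    """Unreviewed changes at the end of each day, given fixed daily rates."""
    backlog = 0
    history = []
    for _ in range(days):
        # The backlog grows by whatever generation outpaces review; it never goes below zero.
        backlog = max(0, backlog + generated_per_day - reviewed_per_day)
        history.append(backlog)
    return history


# Assumed rates: AI-assisted authors open 12 changes a day, reviewers clear 8.
print(review_backlog(generated_per_day=12, reviewed_per_day=8, days=5))
# prints [4, 8, 12, 16, 20]
```

With these assumed rates the queue grows by four changes every day. The extra generation speed shows up as waiting work, not as shipped and verified work.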

What the Conversations Outside the Sessions Confirmed

The best insights at a conference rarely come from the scheduled content. The hallway conversations, the lunch tables, the end-of-day discussions — that's where people say what they actually think.

The consistent thread across all of them: AI handles execution, but judgment stays human. The AI doesn't know what you actually need — it only knows what you told it. That gap between instruction and intent is where the quality work lives.

I talked with people from different industries and different countries — healthcare, finance, infrastructure, startups. The problems were similar across all of them. Everyone was working through the same thing: how to stay in control of output you didn't write yourself.

That breadth of context was probably the most valuable part of the week. What a QA team at a bank is figuring out about AI-generated code isn't that different from what a small product team is figuring out. The scale differs. The core problem doesn't.

Three Things I'm Taking Back

Six days in Seattle left me with a clearer picture of where the actual work is.

  • LLM-as-a-Judge for automated pipelines. I want to add a validation step to my agentic workflows where a second model reviews outputs before anything reaches a human reviewer. I have a project where this fits immediately — I want to try it in practice before writing up what I find.

  • The AI-DLC framing for team conversations. The question "should we use AI for this?" is less useful than "where in the lifecycle does AI fit, and what validation practices do we need at each handoff?" That framing is something I can bring back to my team.

  • The 15% stat as a calibration tool. It came from a conference session, not a published study, so I won't cite it as data. But it captures something the DORA research supports: AI doesn't raise productivity just by generating code faster. The gains come from everything around the code. Teams that understand that distinction will get more out of AI than teams that just increase generation speed.


If you were at AI Con this year, I'd be curious what stuck with you. And if you're already using LLM-as-a-Judge patterns in your own pipelines, I'd like to hear how you've implemented them. I'm about to find out the hard way.
