Someone recently shared that their team was excited to use AI for writing user stories. I get why—AI helps with formatting, acceptance criteria, and those edge cases we often miss. Still, I felt a bit sad. The real value isn't in the user story itself, but in the conversation that leads to it.

The back-and-forth where someone says "we need faster deploys" and you ask "what happens today when you deploy?" and they describe a twenty-minute process where they lose focus every time — that's where the insight lives. Not in the ticket. In the moment someone stops describing what they want and starts describing what they actually experience.

I wanted to know if AI could do that part. So I tested it.


Two Ways of Listening

I took nine conference talks from PyCon US 2025 and PlatformCon 2025. These were developers and platform engineers from the same industry, speaking just weeks apart. I ran their talks through an AI pipeline using two frameworks.

Jobs To Be Done (JTBD) gives you structure:

When I try to [activity], I can't because [restriction], so I have to [workaround].

This framework is the skeleton. A developer doesn't just want "better CI/CD." They want fast feedback on their code, but the cloud runner is much slower than their local machine, so they skip the pipeline. Same job, completely different product problem. JTBD handles this well, and so does AI.
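The template renders mechanically once the three fields are extracted. A minimal sketch (the function name is mine, not part of the actual pipeline), restating the CI/CD example through the template:

```python
def jtbd_statement(activity: str, restriction: str, workaround: str) -> str:
    """Fill the JTBD template from three extracted fields."""
    return (f"When I try to {activity}, I can't because {restriction}, "
            f"so I have to {workaround}.")

print(jtbd_statement(
    activity="get fast feedback on my code",
    restriction="the cloud runner is much slower than my local machine",
    workaround="skip the pipeline",
))
```

The filling is the easy part; everything interesting happens in extracting the three fields faithfully.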

Indi Young's Practical Empathy goes deeper. While JTBD asks what's broken, Practical Empathy asks what it feels like to live with the problem. I asked the model to extract three fields for each talk:

Extraction schema — Practical Empathy fields
{
  "mental_model": "What does the practitioner believe about how the system should work?",
  "emotional_friction": "What frustration, workaround, or cognitive load does the restriction create?",
  "guiding_principle": "What core belief drove this person to give their talk?"
}

These fields go beyond just capturing requirements. They help you understand the person behind them.

This distinction matters. For example, one model extracted "frustration with web-app muscle memory breaks" from a talk, but the transcript actually says, "my muscle memory doesn't work." The first is just a label for a ticket. The second describes the real, physical experience of losing control over your tools. A product team that sees the label might build a settings page, while a team that understands the experience might redesign the interaction model.

That gap between labeling an emotion and truly understanding it is the difference between a summary and an insight. This is exactly where AI struggles.


What AI Got Right

The structured extraction was genuinely impressive. Gemini 2.5 Flash processed all nine transcripts against a strict JTBD schema, and the output was clean. Here's what it produced for one talk, Glyph's "Program Your Own Computer in Python":

Gemini extraction — JTBD fields
{
  "persona": "Local-First Developer",
  "activity": "Scripting desktop applications and creating local heads-up displays",
  "aim": "To reclaim the performance of the local CPU (70-80% faster than cloud runners) and stop treating PCs as dumb terminals",
  "restriction": "Proprietary zip-formats and the cognitive wall between high-level Python and low-level Objective-C APIs",
  "evidence": [
    "Your personal computer... is a super powerful machine...",
    "I can get that feedback 70% faster if I just run it locally.",
    "Barrier between the tutorial and the full power... is just an import statement."
  ],
  "confidence": "high"
}

That's a solid JTBD extraction. The persona is accurate, and the restriction is real. Glyph actually talked about the friction of bridging Python to native APIs. The evidence quotes come straight from the transcript. A product team could use this.

When I passed the results to Claude for cross-conference synthesis, the pattern-matching was strong too. It found that developers at PyCon and platform engineers at PlatformCon share the same fundamental job: shipping reliable software at scale. However, they face opposite restrictions. The model also found collision zones between the two communities, then categorized and ranked them.

This is the part that should make people feel optimistic about using AI in product work. Extraction and synthesis that once took weeks of manual analysis now happened in a single day. The structural listening scaled up.


What AI Got Wrong

Then I asked the models to go deeper, beyond the job itself and into the belief system. What do these speakers think the world owes them? What happens when that expectation is broken?

Every single empathy-layer insight was fabricated. All nine talks. Here's what that same Glyph extraction looked like on the deeper fields:

Gemini extraction — empathy fields (fabricated)
{
  "reactions": "Annoyance at web-app 'muscle memory' breaks; joy in 'rainbow identifiers' and local automation",
  "guiding_principle": "The barrier to total system control is just an import statement or a pip install away"
}

That "guiding principle" sounds exactly like something Glyph would say, but it isn't. Glyph never said it. The schema required this field, and when the transcript didn't have a clear quote, the model simply made one up. It was polished, confident, and completely invented. The model didn't show any uncertainty or leave the field blank.

And "reactions" is a label — "annoyance" — when the transcript says "my muscle memory doesn't work." One you'd file in a report. The other you'd redesign a product around.

In the worst case, the fabrication reversed a speaker's actual argument. Samuel Colvin was questioning the mantra "don't bet against the model." The model attributed that mantra to him as his guiding principle:

Gemini extraction — reversed argument (fabricated)
{
  "guiding_principle": "Don't bet against the model, but don't let the model bet against your system's stability"
}

This wasn't a hallucination in the usual sense. It was more like a plausible-sounding guess that turned out to be the opposite of the truth.

I only caught this because I sent a second model back to the raw transcripts to check. That model found every fabrication. But neither model would have checked on its own.
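A first pass of that audit can be automated before any second model gets involved: check whether each quoted field actually appears in the transcript. A minimal sketch (function names are mine; a substring check catches verbatim fabrication only, not paraphrase-level distortion like the Colvin reversal, which still needs a reader):

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so quoting differences don't matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def audit_quotes(extraction: dict, transcript: str) -> list:
    """Return (field, quote) pairs whose text never appears in the transcript."""
    source = normalize(transcript)
    flagged = []
    for field, value in extraction.items():
        quotes = value if isinstance(value, list) else [value]
        for quote in quotes:
            if normalize(quote) not in source:
                flagged.append((field, quote))
    return flagged

# Toy example: one grounded quote, one invented "guiding principle".
transcript = "My muscle memory doesn't work. I can get that feedback faster locally."
extraction = {
    "evidence": ["my muscle memory doesn't work"],
    "guiding_principle": "Total system control is just an import statement away",
}
print(audit_quotes(extraction, transcript))  # flags only the guiding_principle
```

Cheap checks like this don't replace the second-model pass, but they catch the most brazen inventions for free.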

The lesson isn't just that "AI is bad at empathy." It's more specific: AI is very good at the structural layer, often wrong at the meaning layer, and unable to tell the difference. When required fields meet insufficient evidence, the result is invention, no matter what instructions you give.
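One mitigation, assuming you control the extraction schema: give the model an explicit way to abstain, and treat a missing empathy field as an abstention rather than forcing an answer. A sketch (the sentinel value and parsing helper are mine, not the pipeline's):

```python
import json

EMPATHY_FIELDS = ("mental_model", "emotional_friction", "guiding_principle")
ABSTAIN = "INSUFFICIENT_EVIDENCE"  # the prompt instructs the model to emit this

def parse_empathy(raw_json: str) -> dict:
    """Map abstentions (explicit or silent) to None instead of forcing a value."""
    data = json.loads(raw_json)
    return {
        field: None if data.get(field, ABSTAIN) == ABSTAIN else data[field]
        for field in EMPATHY_FIELDS
    }

print(parse_empathy('{"mental_model": "the local CPU should do the work", '
                    '"guiding_principle": "INSUFFICIENT_EVIDENCE"}'))
```

This doesn't make the model honest; it removes the structural pressure to invent. Whatever survives still has to be verified against the transcript.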


What This Means for Product Work

The work people are most eager to automate—like writing tickets, summarizing interviews, and extracting requirements—is the procedural layer. AI is genuinely helpful for these tasks. But the work that truly changes outcomes happens at a deeper level: understanding what someone believes, what they expect, and how it feels when the system lets them down.

You can't extract that deeper layer from a transcript. It comes out in conversation—in the follow-up question, the silence after someone trails off, and the moment when you resist jumping to a solution and instead ask, "what happens next?"

Indi Young built an entire methodology around this. She calls it Practical Empathy, and the whole point is to get past the label ("the user is frustrated") to the reasoning underneath ("I expect my tools to remember how I work, and every update resets that"). JTBD captures the functional job. Practical Empathy captures the human one.

When I ran nine talks through AI, I found that JTBD scales beautifully with large language models. Practical Empathy does not. The procedural framework worked perfectly as a prompt, but the empathy framework needed a human touch.


The Real Opportunity

This isn't an argument against using AI in product work. I used AI for this entire analysis, and the structural insights were real and useful. The pipeline found a genuine pattern: individual autonomy is the goal at developer conferences but the obstacle at platform conferences. I had sensed this in consulting but never been able to name it before.

The argument is about what you do with the time AI saves you.

If AI takes care of extraction, synthesis, and pattern-matching—the parts it does well—that should give you more time for the listening that really matters. You'll have more conversations where you focus on what someone is actually saying instead of rushing to categorize it. You'll notice the difference between "my muscle memory doesn't work" and "frustration with workflow changes."

The teams that use AI to write better tickets will ship faster. The teams that use AI to free up time for deeper listening will ship things people actually want.

The skill isn't going away. It's becoming more important, because organizations that can't tell the difference between a label and an insight will automate the wrong layer and wonder why their products still miss the mark.


The full analysis, including nine talk-by-talk extractions, collision zones, and the fabrication audit, is available in the Jambot Insight Engine v2. Pipeline: Gemini 2.5 Flash → Claude Opus 4.6 × 3 sub-agents → unplanned Phase 6 transcript audit. Completed in under a day.