Paper: Luo, Q., King, G., Puett, M., & Smith, M. D. (2026). Inducing Sustained Creativity and Diversity in Large Language Models. Harvard University.

The Problem: Search Quests

The paper formalizes a class of tasks called search quests — extended, open-ended explorations where a user needs to evaluate many diverse alternatives before choosing. Examples: finding a wedding dress, identifying an overlooked research topic, brainstorming product ideas, exploring design directions.

Standard LLM decoding (greedy, beam search, even nucleus sampling) is optimized for tasks with a single correct answer. When applied to search quests, these methods produce homogeneous results that converge on conventional, high-probability outputs. Existing diversity techniques (temperature, top-k) sustain variety for a small batch (5–10 outputs), after which the outputs start repeating.

The Solution: A Decoding-Level Intervention

The authors introduce a novel decoding algorithm that sustains creativity and diversity over arbitrarily long sequences. Key design principles:

  • Decoding-only: Operates on output token probabilities without accessing internal model states. Works with any LLM API as a black box.
  • No fine-tuning required: Preserves the model’s full knowledge spectrum rather than narrowing it through alignment.
  • Promotes low-probability continuations: Actively reaches into the “long tail” of the model’s knowledge, surfacing unconventional alternatives that standard decoding suppresses.
  • Tracks conceptual coverage: Maintains a running memory of generated ideas (likely via embedding-based similarity) to penalize repetition and ensure each new output is meaningfully different from previous ones (see the sketch after this list).
  • Orthodox + heterodox knowledge: Deliberately surfaces both mainstream and fringe ideas encoded in training data.
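
The paper's exact algorithm isn't reproduced here, but a minimal sketch shows how these principles compose, assuming the selection step is roughly "log-probability minus a penalty for similarity to already-covered ideas". `toy_embed`, `pick_next`, and the `alpha` weight are illustrative stand-ins, not the authors' implementation:

```python
import numpy as np

def toy_embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real sentence-embedding model (hashed bag of words)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pick_next(candidates: list[tuple[str, float]],
              memory: list[np.ndarray],
              embed=toy_embed,
              alpha: float = 5.0) -> str:
    """Select the candidate whose log-probability, minus a coverage penalty,
    is highest. `candidates` pairs each continuation with its API-reported
    log-probability; `memory` holds embeddings of everything generated so far."""
    scored = []
    for text, logprob in candidates:
        vec = embed(text)
        # Penalty = similarity to the closest idea already covered.
        penalty = max((cosine(vec, m) for m in memory), default=0.0)
        scored.append((logprob - alpha * penalty, text, vec))
    _, best_text, best_vec = max(scored, key=lambda s: s[0])
    memory.append(best_vec)  # extend the conceptual-coverage memory
    return best_text

memory: list[np.ndarray] = []
options = [("a classic A-line silk gown", -1.2),
           ("a deconstructed denim two-piece", -4.8)]
print(pick_next(options, memory))  # picks the conventional, high-probability option
print(pick_next(options, memory))  # the penalty now steers toward uncovered territory
```

Note the design choice this illustrates: the model is never retrained, and nothing internal is touched; only the ranking of candidate continuations changes as coverage accumulates.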

Why This Matters

Comparison with Existing Approaches

| Approach | Diversity Duration | Requires Model Access? | Heterodox Knowledge? |
| --- | --- | --- | --- |
| Standard decoding (greedy/beam) | None | No | No |
| Temperature / top-k / nucleus | Short burst (5–10 outputs) | No | Partially |
| Prompt engineering (CoT, personas) | Moderate | No | Limited |
| Fine-tuning / RLHF | Varies | Yes (training) | Often reduced |
| Multi-agent collaboration | Good | No | Depends on agents |
| This paper’s method | Sustained (hundreds+) | No (API only) | Yes |

The key advance is sustained diversity — the algorithm doesn’t run out of genuinely different ideas the way other methods do. And it works at the decoding layer, meaning it’s model-agnostic and immediately deployable.

Relation to Prior Work

This paper directly extends the findings in AI Idea Diversity and Prompt Engineering, which showed that Chain-of-Thought prompting increases idea variance. Where the Meincke/Mollick/Terwiesch (2024) paper addressed prompt-level interventions for short-burst diversity, this paper tackles the harder problem of sustaining that diversity over long exploratory sessions and does so at the decoding level rather than the prompt level.

Applications for Agent Workflows

Exploratory Research

Agents conducting literature reviews or hypothesis generation can produce sustained diverse summaries covering both mainstream and fringe perspectives. Rather than returning the same five “obvious” papers or ideas, the method keeps pushing into less-explored territory.

Brainstorming and Option Generation

When multiple agents (or a single agent across iterations) need to propose alternatives for planning, design, or resource allocation, this method ensures each suggestion is conceptually distinct. This directly counters the homogenization problem where agents converge on similar solutions.

Bias Mitigation

By deliberately surfacing heterodox knowledge, the approach counteracts confirmation bias and groupthink — presenting ideas that challenge dominant assumptions rather than reinforcing them.

Decision Support

For complex decisions with many viable options (design directions, architectural choices, strategy), the method can systematically map the full solution space before converging, supporting better-informed choices.

Prompt-Level Equivalents

The paper's algorithm (called the Quest algorithm here, after the search quests it targets) operates at the decoding level, but the same problem can be addressed at the prompt and procedural level. The mechanisms are structurally similar, just implemented further up the stack.

Procedural rotation rules

A workflow that requires variety can encode it as explicit rotation:

  • If you made a bar chart in the last entry, try a different chart type today.
  • If you used Mermaid last time, try a chart MCP or a palette study.

These rules manually implement what the Quest algorithm does automatically: penalize recently used approaches.
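
In code, a rotation rule is just a lookup against recent history. A minimal sketch, assuming the workflow logs which approach each past entry used (the APPROACHES catalogue and the two-entry window are illustrative choices, not from the paper):

```python
# Illustrative catalogue of approaches; a real workflow would define its own.
APPROACHES = ["bar chart", "mermaid diagram", "chart MCP", "palette study"]

def next_approach(history: list[str], window: int = 2) -> str:
    """Return the first approach not used in the last `window` entries."""
    recent = set(history[-window:])
    for approach in APPROACHES:
        if approach not in recent:
            return approach
    return APPROACHES[0]  # everything is recent: wrap around

print(next_approach(["bar chart", "mermaid diagram"]))  # -> "chart MCP"
```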

Chain-of-thought pre-writing

Before generating output, the agent reasons through:

  • What did I make in recent entries?
  • Which tools or angles haven’t I used recently?
  • What’s an unexpected approach for today’s content?
  • Pick the approach that differs most from recent work.

This is output-memory tracking at the prompt level — the same conceptual-coverage tracking the Quest algorithm performs at decoding time.
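
At the prompt layer, that memory is pasted into the prompt itself. A minimal sketch; the template wording and the `recent_summaries` parameter are illustrative assumptions, not the paper's method:

```python
def build_prewriting_prompt(task: str, recent_summaries: list[str]) -> str:
    """Embed recent-output memory plus diversity reasoning steps in the prompt."""
    recent = "\n".join(f"- {s}" for s in recent_summaries)
    return (
        f"Recent outputs:\n{recent}\n\n"
        "Before answering, reason step by step:\n"
        "1. Which tools or angles do the recent outputs already cover?\n"
        "2. Which haven't I used recently?\n"
        "3. What is an unexpected approach for today's content?\n"
        "Pick the approach that differs most from recent work, then do it.\n\n"
        f"Task: {task}"
    )

print(build_prewriting_prompt("summarize today's commits",
                              ["bar chart of commit counts",
                               "mermaid module diagram"]))
```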

Diversity self-checks

After drafting, the agent verifies:

  • Does this output look different from the recent batch?
  • Would a reader of the last week see variety?
  • If not, try again with a different approach (one concrete retry loop is sketched below).
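
A minimal sketch of that retry loop, assuming injected stand-ins for an embedding model (`embed`) and an LLM call (`generate`); the 0.8 similarity threshold is an illustrative choice, not a value from the paper:

```python
import numpy as np

def too_similar(vec: np.ndarray, recent: list[np.ndarray],
                threshold: float = 0.8) -> bool:
    """True if the draft sits too close to any recent output's embedding."""
    return any(
        float(vec @ r / (np.linalg.norm(vec) * np.linalg.norm(r) + 1e-9)) > threshold
        for r in recent
    )

def draft_with_self_check(generate, embed, recent: list[np.ndarray],
                          max_tries: int = 3) -> str:
    """Draft, compare against recent outputs, and retry with a nudge if needed."""
    nudge = ""
    for _ in range(max_tries):
        draft = generate(nudge)
        if not too_similar(embed(draft), recent):
            break
        nudge = " Try again with a noticeably different approach."
    recent.append(embed(draft))  # remember this output for future checks
    return draft
```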

The core insight: baseline prompting collapses diversity; procedural constraints maintain it. This is the same finding the Quest Paper formalizes: standard generation converges to high-probability modes, and sustained diversity requires an explicit mechanism that tracks what has been generated and actively steers away from it.

Layer equivalence

| Layer | Quest Paper | Prompt-Level Equivalent |
| --- | --- | --- |
| Mechanism | Decoding algorithm | Prompt engineering + workflow constraints |
| Tracking | Embedding-based concept similarity | Manual review of recent outputs |
| Penalty | Suppress high-probability tokens in covered territory | Explicit rotation rules + CoT forcing different choices |
| Duration | Sustained (hundreds of outputs) | Sustained (daily entries over months) |
| Knowledge access | Orthodox + heterodox from training data | Varied tool usage + creative angles |

Both approaches solve the same problem at different layers of the stack. The Quest algorithm does it automatically at generation time; procedural constraints do it manually via structured prompting and workflow design.

The two are complementary: an agent combining the Quest algorithm with procedural diversity constraints could achieve even stronger sustained creativity. Prompt-level methods are immediately deployable on any LLM API without custom inference infrastructure; decoding-level methods are automatic and don’t depend on the agent following the procedure.

Limitations

  1. Bounded by training data: Can only diversify within what the LLM already knows. Cannot produce genuinely transformational ideas outside its training distribution.
  2. Coherence-diversity trade-off: Aggressively promoting low-probability tokens may occasionally produce incoherent outputs, requiring post-filtering.
  3. Computational overhead: Maintaining output memory and computing semantic distances adds cost compared to standard decoding.
  4. Evaluation difficulty: Measuring “conceptual uniqueness” over long sequences is hard — embedding-based similarity metrics may not fully capture semantic novelty.
  5. Domain variance: Effectiveness likely depends on how well the LLM’s knowledge covers the relevant domain.

Key Insight

The paper reframes the question from “Can LLMs be creative?” to “How do we systematically elicit and sustain creativity over extended explorations?” — a much more practical and actionable framing. The answer: intervene at decoding time to prevent the model from falling into its comfortable high-probability grooves, while tracking what’s already been generated to avoid circling back.

For agent systems, the implication is clear: diversity is a decoding problem, not just a prompting problem. Prompt engineering helps, but sustained exploration requires structural intervention in how outputs are generated.
