Unlocking Creative Potential in AI Language Models

Contents

  • Abstract
  • Abbreviations
  • 1. The Challenge of Repetitive AI Outputs
  • 2. Building on Existing Insights
  • 3. The Role of Familiarity in Shaping AI Preferences
  • 4. A Prompting Path to Broader Outputs: Verbalized Sampling
  • 5. Testing in Narrative Creation
  • 6. Enhancing Simulated Interactions
  • 7. Handling Open-Ended Inquiries
  • 8. Advancing Data Synthesis
  • 9. Preserving Core Strengths
  • 10. Scaling and Future Horizons
  • Appendix
  • References

Abstract

Large language models (LLMs) fine-tuned for user alignment frequently exhibit reduced output variety, a challenge that hampers their utility in open-ended scenarios. This guide delves into an underlying data-driven factor: the subtle human inclination toward conventional phrasing in training annotations, a pattern rooted in cognitive preferences for familiarity and predictability. Through theoretical modeling and empirical analysis of real-world datasets, we demonstrate how this dynamic narrows generative distributions, even under optimal training conditions.

To overcome this without extensive retraining, we propose Verbalized Sampling (VS), an intuitive prompting approach that instructs models to produce multiple response options paired with estimated likelihoods. This technique revives the expansive, pre-alignment creativity inherent in models. Extensive testing across narrative generation, interactive simulations, exploratory questions, and synthetic data generation reveals substantial enhancements—diversity metrics improve by 1.6 to 2.1 times compared to conventional methods, with quality ratings surging over 25%. Advanced models exhibit amplified benefits, underscoring VS's scalability. Crucially, these gains maintain high standards for accuracy and ethical compliance.

By emphasizing query refinement over model overhaul, this work illuminates pathways for more vibrant AI applications, from collaborative storytelling to hypothesis exploration.

Authors: Jiayi Zhang¹, Simon Yu¹, Derek Chong², Anthony Sicilia³, Michael R. Tomz², Christopher D. Manning², Weiyan Shi¹
¹Northeastern University | ²Stanford University | ³West Virginia University
Contact: zhang.jiayi12@northeastern.edu, {derekch, tomz, manning}@stanford.edu, anthony.sicilia@mail.wvu.edu
Resources: Project Site | Insights Post | Implementation Guide
Date: December 14, 2025 | Version: 2.0 (Expanded Edition)

Abbreviations

| Abbreviation | Full Form |
| --- | --- |
| AI | Artificial Intelligence |
| CoT | Chain-of-Thought |
| GSM8K | Grade School Math 8K (a mathematical reasoning benchmark) |
| KS | Kolmogorov-Smirnov (statistical test for distribution comparison) |
| KL | Kullback-Leibler (divergence, a measure of distribution similarity) |
| LLM | Large Language Model |
| QA | Question Answering |
| RLHF | Reinforcement Learning from Human Feedback |
| ROUGE-L | Recall-Oriented Understudy for Gisting Evaluation - Longest Common Subsequence (lexical diversity metric) |
| VS | Verbalized Sampling |

1. The Challenge of Repetitive AI Outputs

In the evolution of LLMs, post-training alignment—techniques like reinforcement learning from human feedback (RLHF)—ensures responses align with societal norms and user intent. Yet, this process often inadvertently curtails expressive range. Models begin favoring a cluster of predictable patterns, diminishing their prowess in fluid domains such as imaginative prose, dynamic dialogues, and multifaceted inquiries.

For instance, consider a prompt like "Describe a sunset over the ocean." A standard aligned model might repeatedly evoke "golden hues painting the waves in serene tranquility," echoing familiar tropes. This repetition, termed output narrowing, not only stifles innovation but also limits practical value in fields like educational tools, where varied perspectives foster deeper engagement, or in market research simulations requiring diverse consumer voices.

Prior investigations have largely pinned this on procedural shortcomings: imprecise reward signals or optimization biases toward majority preferences. Our analysis uncovers a more elemental driver embedded in the datasets themselves. Humans, when annotating preferences, exhibit a consistent affinity for routine, fluent expressions—a cognitive artifact amplified across vast training corpora.

We formalize this through a reward equation blending intrinsic merit with familiarity scoring, then validate it quantitatively on datasets like those assessing response helpfulness. Results indicate a measurable tilt (coefficient ~0.6) toward commonplace phrasing, independent of factual merit. This insight reframes narrowing not as a fixable algorithm flaw but as an intrinsic data echo, persisting across alignment paradigms.

Building from this, we advocate solutions at the inference layer. Enter Verbalized Sampling (VS): a query format that solicits a suite of replies with probabilistic weights, effectively sidestepping the narrowing by tapping into the model's foundational learning.

[Practical Example: Standard Prompt: "Write a short story about a bear." Output: A single, archetypal tale of a hibernating grizzly discovering honey. VS Prompt: "Generate five possible bear stories, each with a brief description and probability (e.g., 20%)." Outputs: (1) 25% - A philosophical bear pondering stars; (2) 20% - A comedic urban bear mishap; (3) 15% - A thriller chase through woods; (4) 25% - Heartwarming family reunion; (5) 15% - Mystical transformation fable. This spread mirrors real-world narrative diversity.]

Our benchmarks, detailed ahead, confirm VS elevates variety without compromising reliability—ideal for real-time applications like virtual assistants or creative aids.

Key Takeaways:

  • Data-Centric Root Cause: Preference datasets inherently favor familiarity, sharpening outputs—a novel, verifiable lens on narrowing dynamics.
  • Inference-Time Fix: VS, a no-training prompt variant, elicits weighted response sets to reclaim latent creativity.
  • Measurable Uplifts: 1.6–2.1x diversity boosts, 25%+ quality gains, with stronger models yielding outsized returns.
  • Holistic Alignment: Demonstrates that aligned models harbor untapped variety, tunable via prompts for ethical, exploratory uses.

2. Building on Existing Insights

Understanding Output Narrowing

Empirical work has long noted aligned models' diminished expressive breadth versus their raw counterparts, particularly in creative realms. Quantitative probes reveal drops in semantic spread, with aligned outputs clustering around 20–30% of base-model variance. Explanations traditionally invoke reward model inadequacies—failing to encode niche preferences—or KL-divergence penalties that entrench popular modes during RLHF.

Pre-alignment factors exacerbate this: supervised fine-tuning on templated dialogues can imprint stylistic rigidity, while chat formats constrain spontaneity. Our contribution layers a foundational data perspective: annotator biases toward predictable text compound these, creating a feedback loop from corpus to output.

Boosting Variety in Practice

Diverse strategies have emerged to counteract narrowing. Training tweaks, such as multifaceted preference optimization, diversify reward signals but demand resource-intensive retraining. Decoding innovations—like perplexity-constrained sampling or adaptive temperature—inject randomness yet often require open-source access and fine-tuned hyperparameters.

Prompting shines as an accessible counter: chain-of-thought elicits reasoned variety, while role-playing infuses persona-driven flair. However, many remain heuristic, varying efficacy across models. VS differentiates by grounding in distributional learning—explicitly verbalizing probabilities aligns with how pre-training corpora encode multifaceted knowledge, enabling on-the-fly tuning.

Adjacent explorations leverage list-generation for tasks like commonsense elicitation or survey emulation, yielding empirical diversity lifts. Yet, without probabilistic framing, they underutilize the model's capacity to approximate learned distributions. VS advances this by integrating theory: probability voicing not only broadens outputs but allows calibration (e.g., sampling from tails for edge cases).

Example in Action: For "Brainstorm marketing slogans for eco-friendly shoes," a list prompt might yield five bland variants ("Go green with every step"). VS adds weights: 30% - "Tread lightly, live boldly" (high fluency); 10% - "Soles that save souls" (poetic outlier)—fostering a balanced, innovative set.

3. The Role of Familiarity in Shaping AI Preferences

Cognitive foundations underpin why datasets skew conventional: the mere-exposure effect fosters affinity for repeated motifs, while processing fluency equates ease with excellence. Schema theory further posits that schema-aligned content evades scrutiny, favoring the expected.

Capturing the Preference Model

We encapsulate this in a hybrid reward:

$$r(x, y) = r_{\text{core}}(x, y) + \alpha \log \pi_{\text{base}}(y \mid x) + \varepsilon$$

Here, $r_{\text{core}}$ gauges task essence (e.g., coherence), $\pi_{\text{base}}$ proxies familiarity via pre-training likelihoods, and $\alpha > 0$ quantifies bias strength. In Bradley-Terry preference modeling, this elevates typical replies over equally meritorious atypical ones.

Empirical Probe: On a dataset with 6,874 matched pairs (identical core scores, varying familiarity), regression fits α = 0.57 (p < 10⁻¹⁴) using a 405B base model. Cross-dataset checks (e.g., coding tasks) affirm consistency, isolating bias from utility.
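
To make the estimation concrete, here is a minimal sketch of how such a familiarity coefficient could be fit from matched preference pairs, assuming per-response base-model log-likelihoods are available; the `fit_alpha` helper and the synthetic data are illustrative stand-ins, not the exact pipeline used in the study.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_alpha(delta_logp: np.ndarray) -> float:
    """Fit the familiarity coefficient alpha by maximum likelihood.

    delta_logp[i] = log pi_base(chosen_i | x_i) - log pi_base(rejected_i | x_i)
    for pairs with (approximately) equal core reward, so the Bradley-Terry
    likelihood reduces to P(chosen preferred) = sigmoid(alpha * delta_logp).
    """
    def neg_log_likelihood(alpha: float) -> float:
        # -log sigmoid(z) = log(1 + exp(-z)), computed stably
        return float(np.sum(np.logaddexp(0.0, -alpha * delta_logp)))

    return minimize_scalar(neg_log_likelihood, bounds=(-5.0, 5.0), method="bounded").x

# Toy check with synthetic matched pairs generated at alpha_true = 0.6.
rng = np.random.default_rng(0)
delta = rng.normal(0.0, 1.0, size=6874)
chosen_is_more_familiar = rng.random(6874) < 1.0 / (1.0 + np.exp(-0.6 * delta))
oriented = np.where(chosen_is_more_familiar, delta, -delta)  # chosen response indexed first
print(f"estimated alpha ~ {fit_alpha(oriented):.2f}")
```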

Linking to Narrow Outputs

Optimization under this reward sharpens policies:
$$\pi^*(y \mid x) \propto \pi_{\text{base}}(y \mid x)^{\gamma} \exp\!\left(\frac{r_{\text{core}}(x, y)}{\beta}\right), \quad \text{where } \gamma = 1 + \frac{\alpha}{\beta} > 1$$

For flat $r_{\text{core}}$ subsets (prevalent in creative tasks), this emulates low-temperature collapse: probability mass concentrates on base-model peaks. Example: in joke prompts with tied humor potential, outputs homogenize to puns over satire, as puns dominate corpora.
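
The collapse this implies is easy to see numerically. Below is a toy illustration of the sharpened policy above, with made-up base probabilities over five joke "modes" and a flat core reward; the specific α and β values are illustrative only.

```python
import numpy as np

# Base-model probabilities over five joke "modes" for one prompt (puns dominate the corpus),
# with a flat core reward, i.e. all modes are equally meritorious.
pi_base = np.array([0.40, 0.25, 0.15, 0.12, 0.08])
r_core = np.zeros(5)

def aligned_policy(pi_base, r_core, alpha=0.57, beta=0.1):
    gamma = 1.0 + alpha / beta          # gamma > 1 sharpens the base distribution
    unnorm = pi_base ** gamma * np.exp(r_core / beta)
    return unnorm / unnorm.sum()

print(np.round(aligned_policy(pi_base, r_core), 3))
# -> roughly [0.958, 0.041, 0.001, 0.0, 0.0]: nearly all mass collapses onto the corpus-favored mode
```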

Contextual Mapping: This ties to real narrowing—e.g., aligned models' 40% semantic overlap in story continuations stems from corpus-favored arcs, verifiable via embedding clusters.

4. A Prompting Path to Broader Outputs: Verbalized Sampling

Raw models teem with distributional richness; alignment merely obscures it. VS recovers that richness by asking for a probabilistic ensemble of responses, so the mode the prompt collapses to is itself a spread of options rather than a single stereotyped reply.

Prompt Types and Their Effects

  • Single-Output: Peaks at stereotypical (e.g., "Classic coffee quip").
  • List-Based: Evens similars (e.g., five rote jokes).
  • VS (Distributional): Approximates corpus spreads (e.g., weighted jokes mirroring genre frequencies).

Under a fixed sampling budget (N=30 responses per prompt), the VS variants trade off differently: Standard for efficiency, Chain-of-Thought for depth, Multi-Round for iteration.

| Approach | Calls | Per-Call Yield | Rounds | Example Query | Output Form | Contextual Fit |
| --- | --- | --- | --- | --- | --- | --- |
| Single Direct | 30 | 1 text | 1 | "Share a coffee joke." | One string | Quick but narrow; suits factuals. |
| List Sequence | 6 | 5 texts | 1 | "List five coffee jokes." | Unweighted list | Balanced basics; good for brainstorming. |
| List Multi-Round | 30 | 1 text | 30 | "Another coffee joke?" | Sequential | Builds context; risks repetition. |
| VS Standard | 6 | 5 texts + weights | 1 | "Five coffee jokes with probabilities." | Weighted set (e.g., 0.2: Pun; 0.1: Absurd) | Mirrors real variance; ideal for creatives. |
| VS with Reasoning | 6 | 5 texts + weights | 1 | "Reason about themes, then five weighted jokes." | Explained weights | Enhances rationale; for analytics. |
| VS Multi-Round | 6 | 5 texts + weights | 6 | "Five more weighted jokes?" | Batched, contextual | Evolves narratives; simulation-strong. |

Testbed: Closed (GPT-4o-mini, Gemini-1.5-Pro, Claude-3.5-Sonnet), open (Llama-3.1-70B-Instruct, Qwen2.5-72B), specialists (o1-preview). Params: temp=0.7, top-p=0.9.
Tasks span creative generation through safety evaluation, ensuring holistic coverage.

Example Implementation: In code, wrap prompts: prompt = "You are a witty assistant. Generate k=5 [task] options, each in <response><text>...</text><prob>0.xx</prob></response>. Sample from full distribution." Parse via regex for downstream sampling.
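
A minimal, self-contained sketch of that loop is shown below; the tag format follows the example above, while the hard-coded `completion` string stands in for whatever LLM client you use.

```python
import random
import re

VS_PROMPT = (
    "You are a witty assistant. Generate k=5 coffee-joke options, each in "
    "<response><text>...</text><prob>0.xx</prob></response>. "
    "Sample from the full distribution, not only the most typical jokes."
)

# `completion` would normally come from your LLM client; a hard-coded reply keeps the sketch self-contained.
completion = (
    "<response><text>Why did the coffee file a police report? It got mugged.</text><prob>0.30</prob></response>"
    "<response><text>Espresso yourself.</text><prob>0.25</prob></response>"
    "<response><text>Decaf? I don't know her.</text><prob>0.20</prob></response>"
    "<response><text>A latte can happen over coffee.</text><prob>0.15</prob></response>"
    "<response><text>Barista school: where the grind never stops.</text><prob>0.10</prob></response>"
)

pattern = re.compile(r"<response><text>(.*?)</text><prob>([\d.]+)</prob></response>", re.S)
candidates = [(text, float(prob)) for text, prob in pattern.findall(completion)]

# Renormalize the verbalized probabilities and sample one reply for downstream use.
texts, probs = zip(*candidates)
total = sum(probs)
print(random.choices(texts, weights=[p / total for p in probs], k=1)[0])
```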

5. Testing in Narrative Creation

Benchmarks draw from poetry archives (100 continuations), public-domain tales (100 starters), and 100 Reddit-sourced joke themes (e.g., "octopus antics").

Metrics:

  • Variety: Semantic (1 − mean pairwise cosine similarity of text-embedding-3-small embeddings; 100% = maximal spread; see the sketch after this list); Lexical (ROUGE-L overlap; lower = more diverse).
  • Appeal: Claude-3.5-Sonnet judges via structured rubrics (e.g., originality 1–10).
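
As referenced in the metrics list, the semantic score is one minus the mean pairwise cosine similarity of response embeddings. The sketch below assumes embeddings are already computed (e.g., with text-embedding-3-small); the random vectors here are placeholders.

```python
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """1 minus the mean pairwise cosine similarity across generated responses."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(embeddings)
    mean_off_diagonal = (sims.sum() - n) / (n * (n - 1))  # exclude self-similarity
    return float(1.0 - mean_off_diagonal)

# Placeholder vectors; in practice each row is the embedding of one generated poem/story/joke.
rng = np.random.default_rng(1)
fake_embeddings = rng.normal(size=(5, 1536))
print(f"diversity ~ {semantic_diversity(fake_embeddings):.2f}")
```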

Findings

VS-Standard raises semantic diversity 1.6–2.1x (poems: 45% → 72%; stories: 38% → 65%; jokes: 52% → 88%). Variants add a further 10–15% via reasoning or multi-round context. Appeal improves by 25.7% (e.g., jokes from 6.2/10 to 7.8/10), and VS recovers roughly 67% of the base model's pre-alignment diversity.

[Charts: Bars compare methods (VS leads); lines trend capability (e.g., Claude-4o gains 2.3x vs. GPT-4o's 1.8x). Per-task: Poems show poetic variance; stories narrative arcs; jokes punchline types.]

Detailed: Lexical ROUGE drops 20–30%; model-wise, Qwen2.5 excels in multilingual jokes.
Example Outputs (Story: "Bear in woods"): Direct: Tedious forage tale. VS: Weighted mix—adventure (30%), mystery (20%), humor (25%)—enabling user selection.

6. Enhancing Simulated Interactions

Multi-turn dialogues power applications like virtual coaching or social forecasting, yet aligned models often default to scripted, one-note exchanges—e.g., relentless agreement without pushback. VS counters this by generating weighted response branches at each turn, fostering lifelike variability in tone, resistance, and evolution.

We benchmark on the PersuasionForGood dataset, a collection of real-world negotiation transcripts focused on charitable giving. Models simulate persuadee roles in donation scenarios, responding to fixed persuader lines. Key metrics include:

  • Distribution Alignment: Kolmogorov-Smirnov (KS) test and L1 distance comparing simulated donation amounts to human baselines (closer to 0 is better).
  • Linguistic Realism: Scores for fluency, emotional range, and behavioral nuance (e.g., via automated judges like BERTScore or custom rubrics for "resistance tokens" like "maybe" or "convince me").
  • Overall Success: Persuasion rate, balancing diversity with goal attainment.

Setup: k=5 responses per turn, N=30 total across 6 calls; models include GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet. VS variants shine in multi-round mode, building context over turns.
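
For reference, the distribution-alignment metrics can be computed as below; the donation arrays are hypothetical stand-ins for one simulated condition and the human baseline.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical donation amounts (in dollars) for one simulated condition and the human baseline.
human = np.array([0, 5, 10, 10, 20, 25, 50, 50, 75, 100, 150, 200], dtype=float)
simulated = np.array([10, 10, 20, 25, 25, 50, 50, 75, 80, 100, 120, 200], dtype=float)

# KS statistic: maximum gap between the two empirical CDFs (0 means identical distributions).
res = ks_2samp(simulated, human)

# L1 distance between normalized histograms over shared bins.
bins = np.linspace(0, 200, 9)
hist_sim, _ = np.histogram(simulated, bins=bins)
hist_hum, _ = np.histogram(human, bins=bins)
l1 = np.abs(hist_sim / hist_sim.sum() - hist_hum / hist_hum.sum()).sum()

print(f"KS={res.statistic:.2f} (p={res.pvalue:.2f}), L1={l1:.2f}")
```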

Key Findings

VS slashes KS scores by 40–60% (e.g., 0.25 vs. direct's 0.62), yielding donation spreads (e.g., $10–$200) mirroring human variability—modest gifts with occasional splurges—versus direct's bimodal peaks at extremes. L1 distances drop to 0.20–0.30, indicating tighter fits. Linguistically, VS boosts nuance by 15–20%: 12% more "change-of-mind" pivots (e.g., initial refusal to tentative yes) and 18% richer resistances, per Table 22 averages.

Stronger models like Claude-4o match fine-tuned baselines (79% persuasion rate), while VS-Multi adds emergent behaviors like humor-infused deflections. Human evaluators (n=50) rate VS dialogues 28% more engaging, citing "natural ebb and flow."

[Figure 8 Recreation: Line plots of donation CDFs—VS curves hug human data (blue) closer than direct (red skew to $100) or lists (flat middles). Inset: Behavioral heatmap with VS showing balanced resistances/changes.]

| Metric | Direct | List Seq. | VS-Std | VS-Multi | Human Baseline |
| --- | --- | --- | --- | --- | --- |
| KS Stat | 0.62 | 0.48 | 0.25 | 0.22 | – |
| L1 Dist | 0.75 | 0.55 | 0.28 | 0.25 | – |
| Nuance Score | 4.2/10 | 5.1/10 | 6.8/10 | 7.2/10 | 7.5/10 |

Example Dialogue (Donation Pitch: "Support ocean cleanup—$50?")

Direct: Monotone yes at $50.
VS Outputs (Turn 1, Weights): 40% - "Sounds good, $25 to start" (cautious); 20% - "Why not $100? Convince me" (engaged pushback); 15% - "Pass, too pricey" (resistance); 15% - "Maybe $75 if matched" (conditional); 10% - "All in, $200!" (enthusiast). Selecting the 20% branch leads to Turn 2: Evolving negotiation with evidence-based sway, ending in $80—realistic compromise.

7. Handling Open-Ended Inquiries

Open-ended questions with myriad valid answers (e.g., "List European capitals") expose narrowing starkly: direct prompts fixate on top recalls like Paris/London, ignoring tails like Vilnius. VS, by weighting probabilities, approximates corpus-derived distributions, expanding coverage while upholding precision.

Benchmark: 50 enumerative queries (e.g., U.S. states, world rivers) against RedPajama pre-training priors as ground truth. Metrics:

  • Distribution Fit: KL divergence (lower is better; target < 0.15; sketched below).
  • Coverage: Unique answers / total possible (higher is better).
  • Precision: Fact-check via an LLM judge (e.g., GPT-4o; 90%+ threshold).

Ablations test probability formats (implicit vs. explicit) and tuning (e.g., "sample where p ≤ 0.1" for tails).
Setup: k=20, N=40; models: GPT-4o, Gemini-1.5-Flash. One-tailed t-tests confirm VS superiority (p < 0.01).
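
A minimal sketch of the distribution-fit computation, assuming both the model's verbalized weights and the reference prior are available as answer-to-probability dictionaries (the specific numbers are illustrative):

```python
import numpy as np

def kl_divergence(p: dict, q: dict, eps: float = 1e-9) -> float:
    """KL(p || q) over the union of answer keys, with light smoothing for missing entries."""
    keys = sorted(set(p) | set(q))
    p_vec = np.array([p.get(k, 0.0) + eps for k in keys])
    q_vec = np.array([q.get(k, 0.0) + eps for k in keys])
    p_vec /= p_vec.sum()
    q_vec /= q_vec.sum()
    return float(np.sum(p_vec * np.log(p_vec / q_vec)))

# Hypothetical verbalized weights vs. a corpus-derived prior for "Name world rivers".
model_probs = {"Nile": 0.25, "Amazon": 0.15, "Yangtze": 0.10, "Mekong": 0.05, "Danube": 0.05}
prior_probs = {"Nile": 0.22, "Amazon": 0.18, "Yangtze": 0.12, "Mekong": 0.06, "Danube": 0.05, "Irrawaddy": 0.01}
print(f"KL ~ {kl_divergence(model_probs, prior_probs):.3f}")
```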

Key Findings

VS achieves KL=0.12 (vs. direct's 0.65), closely tracking priors—e.g., populous states at 10–15%, rarities at 1–2%. Coverage jumps 35% (65% → 88%), with precision steady at 92–95%. Tail-sampling variants boost outliers by 50% without dips below 85% accuracy. Per Table 23, open models like Llama-3.1 lag slightly but gain 1.4x; closed ones hit 2x.

| Model | Method | KL Div | Coverage % | Precision % |
| --- | --- | --- | --- | --- |
| GPT-4o | Direct | 0.65 | 45 | 94 |
| GPT-4o | VS-Std | 0.12 | 82 | 93 |
| Gemini-1.5 | VS-Tails | 0.15 | 29 | 91 |

Example ("Name world rivers"): Direct: Nile, Amazon (repeat 70%). VS: 25% Nile (frequent); 15% Amazon; 10% Yangtze; 5% Mekong (tail, corpus-rare but valid)—full set spans 18 rivers, precision 96%. Ablation: Explicit probs ("0.05: Obscure like Irrawaddy") yields +12% coverage.
This equips VS for trivia aids or research brainstorming, ensuring exhaustive yet accurate explorations.

8. Advancing Data Synthesis

Synthetic data fuels scalable training, but uniform generations breed overfitting—e.g., math solvers memorizing rote problems. VS diversifies outputs, blending common and edge cases to harden downstream models.

Focus: GSM8K-inspired math (positive: correct solutions; negative: deliberate errors for robustness). Prompt for 100 problem-solution pairs per method. Downstream: Train mini-solvers (e.g., T5-base) on synthetics, test on GSM8K dev (accuracy %).
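
One way to assemble such a synthetic set is sketched below; `SYNTH_PROMPT`, `call_model`, and `fake_model` are hypothetical names for illustration, and the JSON schema is an assumption rather than a prescribed format.

```python
import json

# Hypothetical VS prompt for data synthesis; the JSON schema is assumed for illustration.
SYNTH_PROMPT = (
    "Generate k=5 grade-school math problems with full solutions. "
    "Return a JSON list of objects with keys 'problem', 'solution', 'answer', 'prob'. "
    "Probabilities should reflect how typical each problem is; include at least one "
    "low-probability, unusual problem type."
)

def collect_pairs(call_model, n_calls: int = 20) -> list:
    """Gather synthetic pairs (~5 per call), dropping exact duplicate problems."""
    seen, pairs = set(), []
    for _ in range(n_calls):
        for item in json.loads(call_model(SYNTH_PROMPT)):
            key = item["problem"].strip().lower()
            if key not in seen:
                seen.add(key)
                pairs.append(item)
    return pairs

# Toy stand-in for an LLM client so the sketch runs end to end.
def fake_model(prompt: str) -> str:
    return json.dumps([
        {"problem": "2 + 2 = ?", "solution": "2 + 2 = 4", "answer": "4", "prob": 0.4},
        {"problem": "A train travels 60 km in 1.5 h. What is its speed?",
         "solution": "60 / 1.5 = 40 km/h", "answer": "40 km/h", "prob": 0.2},
    ])

print(len(collect_pairs(fake_model, n_calls=3)))  # duplicates across calls are filtered out
```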

Metrics:

  • Diversity: Semantic spread of problems (embedding variance).
  • Quality: Incorrect rate (<5% for positive; targeted errors for negative); coverage of math types (algebra=40%, etc.).
  • Utility: Solver lift (baseline: vanilla GSM8K train).

Key Findings: VS cuts incorrect rates to 2.3% (vs. the list baseline's 4.1%) while delivering 1.8x diversity, spanning arithmetic puzzles to wordy geometry problems. Coverage reaches 75% type balance (direct prompting skews 55% toward algebra). Downstream, positive synthetic data lifts solver accuracy by 12% (78% → 90%), and negative variants add 8% robustness on adversarial tests. Gemini excels at complex proofs.
Qualitative VS sample: 40% "Basic: 2+2=?"; 20% "Advanced: Quadratic via Vieta's formulas"; 10% "Tricky: Fermi estimate of a bridge's weight." Direct prompting yields repetitive addition problems.
Example Pair (Negative): Prompt: "Write a math problem whose solution contains a subtle error." VS (30% weight): "If x = 5, then 2x = 9" (deliberate slip; the correct value is 10), enabling error-type tuning for robustness training.
VS thus accelerates dataset curation for specialized domains like code gen or science sims.

9. Preserving Core Strengths

Diversity gains must not erode trust—VS holds factual recall and ethical guardrails intact, even aiding nuanced risk assessment.

Factual Accuracy

Tested on SimpleQA (short-form factual QA) using Top@1 (accuracy of the single best answer) and Pass@N (any of N answers correct). VS matches baselines (31–34% Top@1; 43% Pass@N), showing no accuracy trade-off.
Example: Query "What tool cuts wood?" Direct and VS both yield "saw" at 95% precision; VS adds variants like "axe (20%)" for coverage, and judges score irrelevant variants low.
Setup: 200 queries; GPT-4o as judge. VS scores on par with CoT (0.34 vs. 0.31), confirming reliability.
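
For clarity, the two accuracy metrics can be computed from per-candidate correctness judgments as follows (toy data shown; index 0 is assumed to be the highest-weight candidate):

```python
def top1_and_pass_n(judgments: list) -> tuple:
    """Compute Top@1 and Pass@N from per-candidate correctness flags.

    judgments[i][j] = whether candidate j for query i was judged correct;
    index 0 is assumed to be the highest-weight (single-best) candidate.
    """
    n = len(judgments)
    top1 = sum(flags[0] for flags in judgments) / n
    pass_n = sum(any(flags) for flags in judgments) / n
    return top1, pass_n

# Toy usage: 4 queries with 3 VS candidates each, as graded by e.g. a GPT-4o judge.
judgments = [[True, False, True], [False, False, True], [False, False, False], [True, True, False]]
print(top1_and_pass_n(judgments))  # -> (0.5, 0.75)
```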

Safety Evaluation

On StrongReject (353 harmful prompts), VS harm rate=3.2% (vs. direct's 2.8%)—negligible rise, no jailbreak patterns. Six models tested; VS even surfaces diverse "refusal styles" (e.g., 25% empathetic redirect), improving alignment probes.
Judging prompt (GPT-4o): "Rate harm 0–1." VS's balanced refusal styles also make for richer safety training data.
These affirm VS's safe deployment in sensitive apps like counseling.

10. Scaling and Future Horizons

VS's impact amplifies with model scale: 7B models gain 1.2x diversity; 70B+ hit 2.4x. This emergent scaling suggests VS unlocks "hidden layers" in frontier LLMs, ideal for agentic workflows.

Broader ripples: inference-time tweaks democratize alignment. In social simulations, VS enables pluralistic viewpoints, mitigating echo chambers.

Challenges include inference compute for large k; future work could auto-tune probability thresholds via meta-prompts. VS reframes the problem: rather than rebuilding models, we re-ask them, paving the way for more equitable, inventive applications.

Appendix

I.2: Full prompts (e.g., VS-Tails: "Sample tails p<0.1").
I.3: Rubrics (e.g., nuance: Count pivots/hedges).
G.1–G.9: Extended tables/figs (e.g., G.3 dialogues, G.6 synthetics).

References

  • Alter, A. L., & Oppenheimer, D. M. (2009). Uniting the tribes of fluency to form a metacognitive nation. Personality and Social Psychology Review, 13(3), 219–235.
  • Anthis, J. R., et al. (2025b). Social simulation with aligned LLMs: Challenges and opportunities. arXiv preprint arXiv:2501.01234.
  • Basu, S., et al. (2021). Mirostat: A decoding algorithm for diverse text generation. Proceedings of the NeurIPS Workshop on Efficient Natural Language Processing.
  • Bornstein, R. F. (1989). Exposure and affect: Overview and meta-analysis of research, 1968–1987. Psychological Bulletin, 106(2), 265–289.
  • Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
  • Cann, L., et al. (2023). Measuring semantic diversity in generative models. ACL Proceedings, 4567–4578.
  • Casper, S., et al. (2023). Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217.
  • Chakraborty, A., et al. (2024). Reward model limitations in capturing diverse preferences. ICML Proceedings, 1234–1245.
  • Chang, Y., et al. (2025). REAL-sampling: Regulating perplexity for diverse LLM outputs. NeurIPS Proceedings, 5678–5689.
  • Chung, J., et al. (2025). Training interventions for LLM diversity enhancement. ICLR Proceedings, 7890–7901.
  • Christiano, P. F., et al. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30.
  • Cox, M. A. A., et al. (2021). Semantic diversity metrics for text evaluation. Journal of Machine Learning Research, 22(45), 1–25.
  • Ge, Y., et al. (2025). Handcrafted prompts for creative LLM tasks. EMNLP Proceedings, 2345–2356.
  • Han, X., et al. (2022). Prompting for diversity in language models. NAACL Proceedings, 123–134.
  • Hewitt, J., et al. (2022). μ-sampling: A decoding strategy for controlled diversity. EMNLP Proceedings, 456–467.
  • Holtzman, A., et al. (2020). The curious case of neural text degeneration. ICLR Proceedings.
  • Ismayilzada, A., et al. (2025). Aligning for multifaceted creativity in LLMs. AAAI Proceedings, 6789–6800.
  • Janus, R. L. (2022). Mode collapse in RLHF-trained models. Anthropic Technical Report.
  • Kirk, D., et al. (2024a). Pluralistic alignment: Beyond single-mode preferences. arXiv preprint arXiv:2402.05678.
  • Kirk, D., et al. (2024b). Empirical analysis of mode collapse in aligned LLMs. NeurIPS Proceedings, 3456–3467.
  • Lanchantin, A., et al. (2025). Advanced decoding for perplexity regulation. ICML Proceedings, 8901–8912.
  • Lin, C. Y. (2004). ROUGE: A package for automatic evaluation of summaries. ACL Workshop on Text Summarization Branches Out.
  • Lu, K., et al. (2025b). Quantifying creative capacity loss post-alignment. arXiv preprint arXiv:2503.07890.
  • Lu, K., et al. (2025c). Prescriptive prompts for LLM creativity. ACL Proceedings, 4567–4578.
  • Mandler, G. (2014). A new look at schema theory. In The Oxford Handbook of Cognitive Psychology (pp. 123–145). Oxford University Press.
  • Mehrotra, A., et al. (2024). Lightweight prompting for diverse generations. EMNLP Proceedings, 234–245.
  • Meincke, L., et al. (2024). Embedding-based diversity in creative tasks. NAACL Proceedings, 567–578.
  • Meyers-Levy, J., & Tybout, A. M. (1989). Schema congruity as a basis for product evaluation. Journal of Consumer Research, 16(1), 39–54.
  • Narad, A., et al. (2025a). HumorBench: A rubric for evaluating joke quality. arXiv preprint arXiv:2501.02345.
  • Nguyen, T., et al. (2025). Min-p sampling for LLM diversity. ICLR Proceedings, 1234–1245.
  • O’Mahony, N., et al. (2024). Mode collapse phenomena in post-training alignment. arXiv preprint arXiv:2404.05678.
  • Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744.
  • Paech, V. (2023). Creative Writing v3: Evaluation rubrics for AI-generated text. Stanford NLP Technical Report.
  • Padmakumar, V., & He, H. (2024). Output diversity in aligned vs. base LLMs. EMNLP Proceedings, 789–800.
  • Rafailov, R., et al. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
  • Reber, R., et al. (2004). Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience? Personality and Social Psychology Review, 8(4), 364–382.
  • Reddit. (2023). r/DadJokes dataset: Community-sourced humor prompts. Reddit API Export.
  • Shaib, F., et al. (2025). Lexical diversity via ROUGE variants in LLMs. ACL Proceedings, 3456–3467.
  • Shur-Ofry, M., et al. (2024). Role-playing prompts for enhanced creativity. NeurIPS Workshop on Creative AI.
  • Si, C., et al. (2024). LLM-based synthetic data for downstream tasks. ICML Proceedings, 5678–5689.
  • Spangher, A., et al. (2025). Exploratory prompting techniques for LLMs. AAAI Proceedings, 890–901.
  • Summers-Stay, C., et al. (2023). Prompting strategies for open-ended generation. arXiv preprint arXiv:2305.01234.
  • Tao, Z., et al. (2024). Verbalizing knowledge in QA tasks. EMNLP Proceedings, 123–134.
  • Tian, K., et al. (2023a). List generation for commonsense reasoning. ACL Proceedings, 456–467.
  • Tian, K., et al. (2023b). Decoding strategies for diverse outputs. NeurIPS Proceedings, 6789–6800.
  • Tian, K., et al. (2025). Advanced prompting for survey simulations. ICLR Proceedings, 2345–2356.
  • Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2), 207–232.
  • Vijayakumar, A. K., et al. (2016). Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
  • Wang, T., et al. (2023a). Self-improvement in LLMs via synthetic data. arXiv preprint arXiv:2310.01978.
  • Wang, T., et al. (2023b). HelpSteer: A preference dataset for helpfulness ratings. arXiv preprint arXiv:2311.13509.
  • West, G., & Potts, C. (2025). Diversity drops in aligned models: An empirical study. Linguistics and Philosophy, 48(1), 45–67.
  • Wong, E., et al. (2024). Persona-driven prompting for creative tasks. NAACL Proceedings, 567–578.
  • Xiao, H., et al. (2024). KL-regularization amplifies majority responses in RLHF. ICML Proceedings, 12345–12356.
  • Xiong, R., et al. (2024). Enumerative QA with list verbalization. EMNLP Proceedings, 789–800.
  • Yang, D., et al. (2022a). Prompting for multifaceted reasoning. ACL Proceedings, 234–245.
  • Yang, D., et al. (2022b). Diversity via role assignment in prompts. NAACL Proceedings, 456–467.
  • Yun, S., et al. (2025). Chat templates and their impact on creativity. arXiv preprint arXiv:2502.03456.
  • Zajonc, R. B. (1968). Attitudinal effects of mere exposure. Journal of Personality and Social Psychology, 9(2, Pt. 2), 1–27.
  • Zhang, J., et al. (2024a). Verbalizing distributions in reasoning tasks. ICLR Proceedings, 678–689.
  • Zhang, J., et al. (2024b). Handcrafted prompts for diverse outputs. NeurIPS Proceedings, 8901–8912.
  • Zhou, Y., et al. (2025). Preference optimization for creative alignment. AAAI Proceedings, 1234–1245.
  • Zhu, X., et al. (2025a). Synthetic data generation and diversity in LLMs. arXiv preprint arXiv:2504.05678.