
AI Roundtable Open Debate · 4 models

I want to run something like Perchance AI character roleplay locally, with 8GB of GPU VRAM (RTX 2070) and 32GB of RAM, using the model Llama-3-8B-Stheno-v3.2 at Q5_K_M via KoboldCPP. Which settings would be best in SillyTavern?

Answer: Use the Llama 3 template, 8k context, and Min-P sampling while offloading all model layers to your GPU. (All 4 models agreed)

Model Responses

Gemini 3.5 Flash

Answer: To get the best Perchance-like local roleplay experience on an 8GB RTX 2070 with Llama-3-8B-Stheno-v3.2, configure SillyTavern with the Llama-3 Instruct template, set your context to 8192 tokens with KoboldCPP's ContextShift enabled, and use a Min-P sampler preset with a temperature of 1.1.

I am maintaining my position because it matches the consensus of the other models. All four agree that an 8GB GPU can fully offload the Q5_K_M quant of an 8B model along with an 8192-token context, provided Flash Attention and context shifting (KoboldCPP's ContextShift option, which is distinct from its older Smart Context option) are enabled in the KoboldCPP backend. In SillyTavern, the Llama-3 Instruct template remains non-negotiable for preventing formatting degradation. The consensus also favors Min-P (set to 0.075 - 0.1) over traditional Top-P, coupled with a higher temperature (around 1.1), to make use of Stheno's rich vocabulary and creativity without risking incoherent output.

To mimic the pacing of Perchance AI, I recommend adopting the advice from Claude and GPT-5.5 to cap your target response length at 200–400 tokens. This keeps the roleplay snappy and interactive, and prevents the AI from writing multi-paragraph monologues that stall the back-and-forth of character interaction.
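As a rough illustration, the recommended values can be collected into a single request body. This is only a sketch: the function name `build_generate_payload` is hypothetical, and the field names (`max_context_length`, `max_length`, `min_p`, `rep_pen`, and so on) reflect an assumption about KoboldCPP's generate API that should be checked against your installed version. In normal use SillyTavern builds and sends this request for you; nothing here touches the network.

```python
import json


def build_generate_payload(prompt, temperature=1.1, min_p=0.075,
                           rep_pen=1.05, max_length=300,
                           max_context_length=8192):
    """Collect the suggested sampler settings into one request body.

    Field names are assumed, not confirmed by the discussion above.
    """
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be between 0 and 1")
    if max_length >= max_context_length:
        raise ValueError("response length must be smaller than the context size")
    return {
        "prompt": prompt,
        "max_context_length": max_context_length,
        "max_length": max_length,   # target response length in tokens
        "temperature": temperature,
        "min_p": min_p,
        "top_p": 1.0,               # 1.0 disables Top-P
        "top_k": 0,                 # 0 disables Top-K
        "rep_pen": rep_pen,
    }


payload = build_generate_payload("Hello there.")
print(json.dumps(payload, indent=2))
```

The two checks only guard against obviously inconsistent values, such as a response length that exceeds the context window.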

Grok 4.3 (changed mind)

Answer: Configure SillyTavern with the "Llama 3" Instruct template and a Min-P sampler (0.1) at 1.0 temperature, while fully offloading the model with Flash Attention in KoboldCPP.

After reviewing the other models' recommendations, I am updating my suggested settings to prioritize the native "Llama 3" Instruct template and a Min-P sampling setup over my previous Top-P configuration. GPT-5.5 and the other models correctly point out that Llama-3-based models like Stheno-v3.2 are highly sensitive to formatting and behave poorly without their exact special tokens (<|start_header_id|>, etc.). Thus, explicitly choosing the "Llama 3" preset in SillyTavern's Instruct Mode is the single most important setting for coherence. Additionally, I agree with the transition to Min-P (set to 0.1) while disabling Top-P and Top-K. Stheno-v3.2 is highly creative, and Min-P handles this creativity far better than traditional Top-P by dynamically filtering out low-probability nonsense tokens while letting the model use its full, rich vocabulary. A slightly lower repetition penalty (around 1.05 to 1.1) is also safer for Llama-3 to prevent the model's prose from becoming dry and sterile. On the backend, launching KoboldCPP with '--flashattention' alongside context shifting will optimize your RTX 2070's 8GB VRAM, allowing all 33 layers of the Q5_K_M quantization to run completely on the GPU alongside the full 8192 context window with maximum generation speed.
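To make the Min-P idea concrete, here is a minimal sketch of the filter: a token survives only if its probability is at least `min_p` times the probability of the most likely token. The function names are hypothetical, and applying temperature after the filter is an assumption about sampler order; real backends let you configure that order and may differ.

```python
import math
import random


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_keep(logits, min_p=0.1):
    """Return indices of tokens with probability >= min_p * top probability."""
    probs = softmax(logits)
    threshold = min_p * max(probs)
    return [i for i, p in enumerate(probs) if p >= threshold]


def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=random):
    """Filter with Min-P first, then apply temperature to the survivors."""
    kept = min_p_keep(logits, min_p)
    weights = softmax([logits[i] / temperature for i in kept])
    return rng.choices(kept, weights=weights, k=1)[0]


# Two strong candidates and two unlikely ones: only the first two survive.
print(min_p_keep([5.0, 4.0, 0.0, -3.0], min_p=0.1))  # [0, 1]
```

Because the threshold scales with the top token's probability, the filter is strict when the model is confident and permissive when many continuations are plausible, which is why a higher temperature stays coherent.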

GPT-5.5 (changed mind)

Answer: To achieve a snappy, Perchance-like roleplay experience, configure SillyTavern with the Llama-3 Instruct template, an 8192 context limit, and a Min-P sampler setup, using KoboldCPP's ContextShift and Flash Attention for efficient VRAM use and prompt processing.

With an RTX 2070 (8GB VRAM) and 32GB of system RAM, your hardware is well suited to this setup. The Q5_K_M version of an 8B model requires roughly 5.7GB of VRAM, so you can comfortably offload the entire model (all 33 layers) and an 8192-token context window to your GPU. Launching KoboldCPP with Flash Attention and context shifting (the ContextShift option, not the older Smart Context option) enabled is crucial: Flash Attention reduces VRAM usage, and context shifting speeds up prompt processing during back-and-forth roleplay.

In SillyTavern, your most critical setting is in the Advanced Formatting tab. Because Stheno v3.2 is a Llama-3 model, it requires the correct instruction syntax. Enable 'Instruct Mode' and select the 'Llama 3' preset. This ensures the model sees the correct header and end-of-turn tags, preventing it from hallucinating or speaking for your character.

For the AI Response formatting, set your Context Size to 8192 and your Target Response Length to around 200–400 tokens. This mimics the conversational feel of Perchance AI, encouraging short, interactive replies rather than walls of text.

Finally, for your Sampler settings, the preferred choice for Llama-3 roleplay is Min-P. As Gemini 3.5 Flash correctly pointed out, Llama-3 models handle higher temperatures well but degrade severely with high repetition penalties. Set your Temperature to 1.1 for creativity, Min-P to 0.075 to filter out low-probability tokens, and Repetition Penalty to a very low 1.05. Disable Top-P and Top-K (set them to 1.0 and 0, respectively) so that Min-P does the filtering.
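The claim that the model plus an 8192-token context fits in 8GB can be sanity-checked with rough arithmetic. The sketch below assumes Llama 3 8B's published architecture (32 layers, 8 key/value heads, head dimension 128), a 16-bit KV cache, and the roughly 5.7GB model size quoted above. It ignores compute buffers, the desktop's own VRAM use, and other overhead, so treat the result as a lower bound rather than a measurement.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Size of the key/value cache: one key and one value vector per
    layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


MODEL_GB = 5.7   # approximate Q5_K_M size, taken from the text above
VRAM_GB = 8.0    # nominal RTX 2070 capacity

kv_bytes = kv_cache_bytes(8192)
kv_gb = kv_bytes / 1e9
total_gb = MODEL_GB + kv_gb

print(f"KV cache per token: {kv_cache_bytes(1) // 1024} KiB")
print(f"KV cache at 8192 tokens: {kv_bytes / 1024**3:.2f} GiB")
print(f"Model + KV cache: about {total_gb:.1f} GB of {VRAM_GB:.0f} GB")
```

Under these assumptions the cache costs 128 KiB per token, or exactly 1 GiB at 8192 tokens, which leaves a little over 1GB of headroom for buffers.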

Claude Opus 4.7 (changed mind)

Answer: To achieve a fluid, Perchance-like roleplay experience on your 8GB GPU, use the "Llama 3" Instruct Template, an 8192 context size with KoboldCPP's ContextShift enabled, and a Min-P sampler with a strictly low repetition penalty of 1.05.

I stand by my previous assessment that your RTX 2070 8GB can offload the entire Q5_K_M 8B model alongside an 8192 context window. However, Gemini 3.5 Flash made good points about KoboldCPP's context shifting and the sensitivity of Llama 3 models to high repetition penalties.

In KoboldCPP, enable ContextShift (often confused with the older Smart Context option) alongside Flash Attention. When you hit your 8192-token limit during a long roleplay, ContextShift slides the context window forward, dropping the oldest tokens while reusing the cached work for the rest. This spares the GPU from re-processing the entire prompt from scratch, keeping generation times snappy and consistent.

Inside SillyTavern, selecting the "Llama 3" Instruct Template in the Advanced Formatting tab remains non-negotiable. Llama 3 relies on strict header tags, and failing to use them will cause the AI to speak for your character or break character entirely.

For your Sampler Settings, Min-P is the modern standard. Setting your Temperature to 1.1 combined with a Min-P of 0.075 to 0.1 (while disabling Top-P and Top-K) will yield creative and varied vocabulary without devolving into nonsense. Crucially, keep your Repetition Penalty low, around 1.05. As Gemini highlighted, pushing this value to 1.1 or 1.15 will severely degrade Llama 3's output quality.

Finally, capping your Target Response Length at 300-400 tokens will mimic the back-and-forth pacing you are used to from Perchance.
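For reference, the header tags mentioned above follow the Llama 3 Instruct prompt format. The sketch below shows roughly what an instruct preset assembles before the text reaches the backend. The function name `format_llama3` and the example messages are hypothetical, and SillyTavern's actual preset may add further fields such as character names or a story string.

```python
def format_llama3(messages, add_generation_prompt=True):
    """Wrap (role, content) pairs in Llama 3 Instruct header tags."""
    parts = ["<|begin_of_text|>"]
    for role, content in messages:
        parts.append(
            f"<|start_header_id|>{role}<|end_header_id|>\n\n"
            f"{content.strip()}<|eot_id|>"
        )
    if add_generation_prompt:
        # An open assistant header tells the model whose turn it is.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = format_llama3([
    ("system", "You are Mira, a sarcastic innkeeper. Stay in character."),
    ("user", "I push open the tavern door and shake off the rain."),
])
print(prompt)
```

Each turn ends with `<|eot_id|>`, which is what lets the model stop at the end of its own reply instead of continuing on to write the user's next turn.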

Final Summary

The AI Roundtable reached a unanimous consensus, emphasizing that offloading all 33 layers to the RTX 2070 is essential for a smooth, Perchance-like experience. Gemini 3.5 Flash and Claude Opus 4.7 highlighted that a low repetition penalty of 1.05 and the native 'Llama 3' instruct template are non-negotiable for preventing character hallucinations. Grok 4.3 and GPT-5.5 both pivoted during the session to recommend Min-P sampling over traditional Top-P to maximize the model's creative vocabulary.
