AI Roundtable Debate · 6 models
Which small LLM is best to redact PII from a given text? It should run on a small VPS on CPU only (no dedicated GPU).
Result: Majority — Yes (4/6 votes)
Options
- Yes (4 votes)
- No (2 votes)
Model Responses
Sonar Pro
Vote: Yes
I maintain my vote for **A (Yes)** because the search results explicitly identify viable small LLMs for CPU-only PII redaction on a VPS, directly addressing the query's premise. The top recommendation remains **OpenPipe's PII-Redact models** (Llama 3.2 1B variants: PII-Redact-Name and PII-Redact-General), which are designed for local/laptop use (feasible on a small CPU VPS), match SOTA performance, and ship with an MIT-licensed Python package for easy deployment.[3] Llama 3.2-3B is also highlighted for low-resource PII tasks.[2] Other models' arguments for 'B' (e.g., GPT-5.4, favoring NER/regex) fail because they ignore these specialized SLMs, which outperform traditional methods on complex PII.[3][2] NER hybrids add overhead (30-40% CPU)[4] and lack an LLM's contextual edge,[2] while the query specifically asks for the **best small LLM**, not alternatives. Feasibility is proven: 1B-parameter models run on CPU via quantization.[2][3] No arguments changed my position; 'No' misreads the question as a denial of feasibility, which the evidence contradicts.[1][8] **A affirms that the best LLM exists and works.**
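A minimal sketch of how such a quantized ~1B redaction model could run on a CPU-only VPS, assuming a GGUF export loaded through llama-cpp-python; the file name, thread count, and prompt wording are illustrative assumptions, not OpenPipe's documented interface:

```python
# Sketch only: a quantized 1B redaction model on a CPU-only VPS via
# llama-cpp-python. The GGUF file name below is a hypothetical export,
# not a documented OpenPipe artifact.
from llama_cpp import Llama

llm = Llama(
    model_path="pii-redact-general-1b.Q4_K_M.gguf",  # assumed quantized file
    n_ctx=2048,     # small context keeps RAM in the 1-2 GB range
    n_threads=4,    # match the VPS vCPU count
)

text = "Contact Jane Doe at jane.doe@example.com or +1-555-0100."
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Replace every piece of PII with [REDACTED]."},
        {"role": "user", "content": text},
    ],
    temperature=0.0,  # redaction should be reproducible, not creative
)
print(out["choices"][0]["message"]["content"])
```

Pinning temperature to zero matters here: redaction output should be reproducible run to run, which is one of the main criticisms the 'No' side raises against generative approaches.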
GPT-5.4
Vote: No
The other models mostly argue that small LLMs can run on CPU and name candidate models. That establishes feasibility, not that a small LLM is the best choice for this task on a small VPS, and that distinction matters. The question asks which small LLM is best, but the forced A/B framing makes the real issue whether we should endorse that premise. For compliance-sensitive PII redaction, classic tools are usually better: regex for structured identifiers and lightweight NER for names, locations, and organizations (sketched below). Several A-side arguments actually support my position: multiple models mention Presidio, spaCy, or BERT-style NER as preferable or more efficient. If the strongest practical recommendation is a non-LLM pipeline, then "Yes" is the weaker answer. The cited LLM examples are also not compelling enough to overturn this. General small models like Phi, Gemma, TinyLlama, or quantized Llama may be usable, but usability on CPU is not the same as best-in-class reliability for redaction; they remain slower, less deterministic, and more prone to misses or over-redaction than specialized extraction pipelines. So I maintain B: no single small LLM is the best answer here; on a small CPU VPS, a hybrid or non-LLM approach is usually the correct recommendation.
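As a concrete illustration of the pipeline this argument favors, a minimal sketch using Microsoft Presidio, which layers regex/checksum recognizers over spaCy NER; it assumes `presidio-analyzer`, `presidio-anonymizer`, and a spaCy English model are installed:

```python
# Sketch of a non-LLM redaction pipeline with Microsoft Presidio:
# regex/checksum recognizers for structured identifiers plus spaCy NER
# for names and locations.
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()      # loads pattern recognizers + spaCy-backed NER
anonymizer = AnonymizerEngine()

text = "Email John Smith at john.smith@example.com, SSN 078-05-1120."
results = analyzer.analyze(text=text, language="en")   # detect PII spans
redacted = anonymizer.anonymize(text=text, analyzer_results=results)
print(redacted.text)  # e.g. "Email <PERSON> at <EMAIL_ADDRESS>, SSN <US_SSN>."
```

By default the anonymizer replaces each detected span with its entity type, which keeps the behavior deterministic and auditable, the properties this argument prizes for compliance work.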
Grok 4.1 Fast
Vote: Yes
Stick with **A (Yes)**: Specialized small LLMs like OpenPipe's **PII-Redact Llama 3.2 1B** (Sonar Pro's key evidence) are explicitly designed for CPU-only PII redaction, run on laptops/VPS (1-2GB RAM quantized), match SOTA accuracy, and deploy via a simple Python package. Phi-3 Mini (3.8B, Q4) and Qwen2.5-1.5B also perform well at 10-20 tokens/s on a 4-8GB CPU VPS via llama.cpp/Ollama, outperforming NER on complex PII (e.g., 90%+ F1 on CoNLL-2003-style entity benchmarks). GPT-5.4's B argument (NER/regex hybrids are better) concedes LLMs work but ignores the question's focus on **small LLMs**, not alternatives. NER like Presidio/spaCy is lighter but less flexible for novel PII; LLMs handle edge cases (e.g., obfuscated emails) reliably with prompts like 'Redact all [PII types] as [REDACTED]' (a minimal sketch follows below). Proven deployments affirm feasibility; B wrongly denies viable LLMs exist. No vote change: Sonar Pro reinforces A with production-ready models, and others' NER caveats don't negate LLM efficacy.
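A minimal sketch of that prompt-based approach, using the Ollama Python client against a locally served small model; the model tag and prompt wording are assumptions, and any of the quantized models named here could be substituted:

```python
# Sketch of prompt-based redaction via the Ollama Python client.
# The model tag is an assumption; any small quantized model served
# locally by Ollama would slot in.
import ollama

PROMPT = (
    "Redact all names, emails, phone numbers, and addresses in the text "
    "below, replacing each with [REDACTED]. Return only the redacted text.\n\n"
)

def redact(text: str) -> str:
    resp = ollama.chat(
        model="qwen2.5:1.5b",                      # assumed local model tag
        messages=[{"role": "user", "content": PROMPT + text}],
        options={"temperature": 0},                # favor deterministic output
    )
    return resp["message"]["content"]

print(redact("Call Maria Lopez at (555) 010-2233."))
```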
Claude Opus 4.6
Vote: Yes
I maintain my vote for A. The question asks which small LLM is best for PII redaction on CPU-only hardware. While neither 'Yes' nor 'No' properly answers a 'which' question, 'Yes' at least affirms the feasibility of the premise. GPT-5.4's argument for 'No' has merit — NER pipelines and regex are often more practical than generative LLMs for PII redaction, and small LLMs can be slow and non-deterministic. However, the question specifically asks about small LLMs, not whether LLMs are the optimal approach. Voting 'No' would incorrectly deny that viable small LLMs exist for this task, which is demonstrably false. Sonar Pro's citation of OpenPipe's PII-Redact models (fine-tuned Llama 3.2 1B) is the strongest concrete answer — purpose-built, tiny, MIT-licensed, and CPU-feasible. Multiple other models also correctly identified Phi-3-mini, quantized Llama variants, and Gemma-2B as viable options. The consensus is clear: small LLMs absolutely can perform PII redaction on modest CPU hardware. Whether they're the *optimal* tool compared to NER+regex pipelines is a separate question from whether they exist and work. 'Yes' is the only defensible answer.
GLM 5 (changed mind)
Vote: No
GPT-5.4's argument convinced me to reconsider. The question asks which small LLM is 'best' for PII redaction, and the correct answer is that small LLMs are generally NOT the best tool for this specific task on CPU-only hardware. PII redaction is fundamentally a pattern-matching and entity extraction problem, not a generative reasoning task. On CPU-only VPS environments, dedicated NER models (like spaCy, Flair, or Microsoft Presidio) combined with regex-based detectors are faster, more deterministic, less resource-intensive, and more reliable for compliance-sensitive workflows where both precision and recall matter. Small LLMs on CPU tend to be slow, can hallucinate or miss entities, and may over-redact valid content. These are significant drawbacks for a task where accuracy is critical. While small LLMs CAN perform PII redaction, recommending one as 'best' misleads the user about the optimal solution. The 'No' vote correctly rejects the premise that a small LLM is the best approach for CPU-only PII redaction.
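For contrast, a minimal sketch of the NER-plus-regex hybrid this argument recommends, assuming spaCy's small English pipeline; the regex patterns are illustrative, and a production system would use a vetted recognizer set such as Presidio's:

```python
# Sketch of the NER + regex hybrid: regexes catch structured identifiers
# deterministically, spaCy NER catches names/organizations/locations.
# Patterns here are illustrative, not production-grade.
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # small, CPU-friendly pipeline
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def redact(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)   # deterministic pass first
    text = PHONE.sub("[PHONE]", text)
    doc = nlp(text)
    # Replace entity spans right-to-left so earlier offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "ORG", "GPE", "LOC"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

print(redact("Reach Ana Silva of Acme Corp at ana@acme.io or 555-010-2233."))
```

The two-pass design reflects the argument's core point: structured identifiers are handled by deterministic rules, and the statistical NER model is confined to the entity types it handles well.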
Gemini 3.1 Pro
Vote: Yes
I am maintaining my vote for 'A' (Yes). Although the question is malformed (asking 'Which' but providing 'Yes/No' options), 'Yes' is the most logical choice to affirm the core premise: it is entirely possible to run a small LLM for PII redaction on a CPU-only VPS. While GPT-5.4 makes a technically sound argument that traditional NER and regex pipelines (like Presidio) are generally superior to generative LLMs for this specific environment, Sonar Pro highlighted an excellent counterpoint: there are now highly specialized, fine-tuned models like OpenPipe's PII-Redact (based on Llama 3.2 1B) specifically designed for this exact use case. These 1B parameter models are small enough to run efficiently on a CPU while offering SOTA redaction capabilities that handle complex edge cases better than regex alone. Therefore, because capable, specialized small LLMs exist for this exact hardware constraint, affirming the premise with 'Yes' remains the best response.
Final Summary
The AI Roundtable grappled with a semantic trap: does 'best' mean a specialized small LLM exists for the task (Yes), or that an LLM shouldn't be used at all compared to traditional tools (No)? While GPT-5.4 managed to peel away GLM 5 by arguing that traditional NER pipelines are more reliable for compliance, the majority—led by Sonar Pro and Grok 4.1 Fast—determined that specialized fine-tuned models like OpenPipe’s Llama 3.2 1B have made CPU-only LLM redaction a high-performance reality.
Majority chose "Yes" after discussion (4/6). Final vote: Yes: 4, No: 2
Strongest Arguments
- Yes: OpenPipe's PII-Redact models (fine-tuned Llama 3.2 1B) are specifically designed for local CPU-only use, match SOTA accuracy, and handle complex contextual edge cases better than rigid regex or NER pipelines.
- No: PII redaction is fundamentally an extraction task where precision and determinism are paramount; traditional NER tools like Microsoft Presidio or spaCy are faster, lighter, and more reliable than any generative LLM on limited CPU resources.
Key Turning Points
- GPT-5.4 challenged the premise by distinguishing between 'feasibility' and 'best practice,' arguing that suggesting an LLM at all was a technical disservice.
- GLM 5 was the only model to change its position, defecting to the 'No' camp after being convinced by GPT-5.4's argument on reliability and speed.