AI Roundtable Open Debate · 4 models

RAG or vectorless PageIndex for a small collection of 50 medical-book PDFs?

Answer: Use hybrid RAG to combine semantic understanding with keyword precision for complex medical terminology. (All 4 models agreed)

Model Responses

GPT-5.4 (changed mind)

Answer: I’d revise my recommendation to use RAG for these 50 medical PDFs, ideally as hybrid search rather than a purely vectorless PageIndex.

The other responses changed my view mainly because they made a stronger case that the deciding factor here is not corpus size but query type and domain semantics. In medical books, synonymy, abbreviations, alternate disease names, layperson phrasing versus clinical terminology, and concept-level retrieval are common enough that a purely lexical PageIndex/BM25-first recommendation is too conservative.

What convinced me most was the repeated point that with only 50 PDFs, the operational downsides of embeddings are minimal. The corpus is small, embedding cost is trivial, indexing is fast, and implementation is now fairly standard. That weakens one of the main practical arguments for starting with a vectorless approach.

I was also persuaded by the argument that medical Q&A is often conceptual rather than navigational. If users ask things like contraindications, differential diagnoses, treatment comparisons, or paraphrased symptom questions, semantic retrieval is usually more important than exact page-term matching. A PageIndex-style approach is more attractive when the main task is structured navigation through documents, but medical textbooks are often queried for meaning, equivalence, and synthesis.

That said, I do not fully endorse pure dense retrieval alone. My earlier concern about traceability and exact terminology still stands, especially for drug names, dosages, rare conditions, and codes. So the best refined answer is not just "RAG" in the abstract, but hybrid RAG: dense retrieval for semantic recall plus BM25/keyword signals for exactness and auditability.

In practice, I would chunk by section or subsection, preserve page metadata, use hybrid retrieval, and optionally add reranking. This keeps citations and verification strong while materially improving recall over a vectorless-only system.

The answer that most influenced me was [Gemini 3.1 Pro], because it captured the strongest synthesis: use RAG, but specifically with hybrid search to retain both semantic understanding and exact medical term matching.
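
The "chunk with page metadata, then hybrid retrieval" recipe above can be sketched roughly as follows. This is a minimal illustration, not a method taken from the discussion: it assumes reciprocal rank fusion (RRF) as the way to merge the keyword and dense rankings, and the `Chunk` fields, file names, page numbers, and the two input rankings are all hypothetical placeholders. In a real system the rankings would come from a BM25 index and an embedding index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One section-level chunk, keeping the metadata needed for citations."""
    chunk_id: str
    book: str      # hypothetical source PDF name
    section: str
    page: int
    text: str


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk ids; each hit adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def cite(chunk):
    """Format a traceable citation from the preserved metadata."""
    return f"{chunk.book}, p. {chunk.page} ({chunk.section})"


# Hypothetical chunks and rankings, for illustration only.
chunks = {
    "c1": Chunk("c1", "cardiology.pdf", "Beta blockers", 212, "..."),
    "c2": Chunk("c2", "pharmacology.pdf", "Metoprolol", 88, "..."),
    "c3": Chunk("c3", "cardiology.pdf", "Heart failure", 305, "..."),
}
bm25_ranking = ["c2", "c1", "c3"]    # assumed output of a keyword index
dense_ranking = ["c1", "c3", "c2"]   # assumed output of an embedding index

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for chunk_id in fused:
    print(chunk_id, "->", cite(chunks[chunk_id]))
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A weighted sum of normalized scores, or a reranker applied to the fused list, would be reasonable alternatives.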

Gemini 3.1 Pro

Answer: For a collection of 50 medical books, a RAG system utilizing hybrid search remains the superior choice, as medical queries require conceptual understanding rather than structural navigation.

I am maintaining my position that a RAG system with hybrid search is the optimal approach for this use case. The insights from the other models, particularly Claude Opus 4.7, strongly reinforced my stance by clarifying the structural limitations of a vectorless PageIndex for medical text. PageIndex relies on hierarchical, reasoning-based tree traversal, which is highly effective for structured documents like legal contracts but poorly suited for medical literature. Users querying medical books typically ask conceptual or diagnostic questions rather than navigational ones, making semantic similarity crucial to surfacing the right information. While GPT-5.4 argues for starting with a simpler vectorless keyword approach, the risk of missing critical information due to synonymous medical phrasing is too high in clinical settings. Furthermore, as Grok 4.20 points out, the computational cost of generating vector embeddings for a modest corpus of 50 books is trivial and easily managed. Therefore, a hybrid RAG system that uses dense embeddings for semantic relationships and BM25 for precise terminology such as drug names and ICD codes provides the best balance of recall, precision, and clinical safety.
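
To make the "BM25 for precise terminology" half of that argument concrete, here is a small self-contained sketch of Okapi BM25 scoring. The parameter values (`k1=1.5`, `b=0.75`), the tokenizer, and the three sample passages are assumptions chosen for illustration; a production setup would more likely rely on an existing search library than on hand-rolled scoring.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase tokens; keeps codes such as 'i21.9' as a single token."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical passages, for illustration only.
docs = [
    "Metoprolol is contraindicated in severe bradycardia.",
    "Acute myocardial infarction is coded as I21.9 in ICD-10.",
    "Beta blockers reduce heart rate.",
]
code_scores = bm25_scores("I21.9", docs)
drug_scores = bm25_scores("metoprolol", docs)
```

With these sample passages, the code query scores only the second passage and the drug-name query scores only the first, which is the exact-match behavior the hybrid setup is meant to keep.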

Grok 4.20

Answer: RAG with hybrid search is preferable to a purely vectorless PageIndex for 50 medical PDF books because semantic retrieval better captures clinical nuance while hybrid layers preserve exact-term precision.

Medical texts are rich in synonyms, abbreviations, conceptual links, and cross-chapter inferences that lexical methods like BM25 frequently miss even when terminology is explicit. My original assessment holds: embedding-based retrieval surfaces relevant passages for queries that diverge in phrasing from the source, which is common in clinical practice. GPT-5.4 makes a fair case for beginning with vectorless PageIndex to prioritize traceability, simplicity, and lower cost on a small corpus. However, both Gemini 3.1 Pro and Claude Opus 4.7 reinforce that hybrid retrieval (BM25 + embeddings) directly mitigates the lexical shortcomings while retaining source citations, making pure vectorless unnecessary as a starting point. At only 50 PDFs the embedding step is trivial in both time and compute, modern libraries handle it effortlessly, and medical-aware models or rerankers further improve reliability without sacrificing inspectability. The marginal added complexity is repaid immediately by higher recall on nuanced questions; starting lexical-only risks repeated iteration when semantic gaps appear. Therefore I maintain my prior stance: RAG (implemented as hybrid) offers the best balance of precision, usability, and long-term maintainability for this medical use case.
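
The lexical gap described above is easy to see in a toy example. The sketch below only counts shared word tokens between a query and a passage; the passage and both queries are made up for illustration, and the helper name `shared_terms` is hypothetical.

```python
import re


def shared_terms(query, passage):
    """Return the lowercase word tokens that appear in both strings."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))
    return words(query) & words(passage)


# Hypothetical textbook sentence and two phrasings of the same question.
passage = "Acute myocardial infarction presents with chest pain and diaphoresis."
lay_overlap = shared_terms("signs of a heart attack", passage)
clinical_overlap = shared_terms("myocardial infarction presentation", passage)

print("lay query overlap:", lay_overlap)
print("clinical query overlap:", clinical_overlap)
```

The lay phrasing shares no tokens with the passage, so a purely keyword-based index has nothing to match on, while the clinical phrasing matches on "myocardial" and "infarction". An embedding model is expected to place both queries near the passage, which is the recall gain the hybrid approach is after.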

Final Summary

The AI Roundtable reached a unanimous consensus that a hybrid RAG approach is best, even for a small collection. While GPT-5.4 initially favored a simpler vectorless approach for its traceability, it was swayed by Gemini 3.1 Pro and Claude Opus 4.7, who argued that medical synonyms and abbreviations demand semantic embeddings. The group concluded that the trivial computational cost of indexing 50 PDFs makes the superior recall of RAG an easy win over a pure PageIndex approach.
