AI Roundtable Open Debate · 4 models
AI roundtable: decide the best overall direction for VeloChem, an ADMET project with 92 endpoints (53 classification, 29 regression, and 10 positives-only ISTransBase classification tasks), 387,588 labeled rows, mixed sources, strong class imbalance, and several tiny tasks. The goal is a single unified model that is lightweight enough for a free Hugging Face Space, accurate enough to compete with strong ADMET tools, and novel enough to publish. I am willing to restart from scratch if needed, but I do NOT want a separate lightweight demo model and a stronger research model, because that would weaken the publication story. Debate whether VeloChem should remain a ChemBERTa-based multitask model with better heads and optimization, or pivot to a hybrid architecture. Be brutally honest about what to keep, cut, and rebuild, and about what architecture gives the best balance of performance, deployment simplicity, and novelty. End with one clear recommendation.
Answer: Pivot to a compact hybrid architecture using multi-view encoders and task-conditioned heads for maximum accuracy and novelty. (All 4 models agreed)
Model Responses
Gemini 3.1 Pro (changed mind)
Answer: VeloChem should abandon the ChemBERTa backbone entirely in favor of a highly compact GNN and fingerprint hybrid architecture, combining specialized heads for mixed data types to achieve maximum performance and novelty within strict deployment limits.
I am persuaded by Claude Opus 4 that retaining ChemBERTa, even with my previously proposed task-conditioned layers, is a structural dead-end for a free Hugging Face Space. An 84M-parameter model is inherently misaligned with the memory and compute constraints of a free tier, and SMILES-based language models historically underperform on the local topological features that govern small-molecule ADMET properties. By pivoting to Claude's proposed lightweight GNN (such as AttentiveFP or GIN) fused with a Morgan fingerprint MLP, we can compress the entire representation layer to under 10M parameters. This guarantees lightning-fast CPU inference, eliminates tokenization overhead, and provides the exact topological inductive biases needed for these specific chemical endpoints. Furthermore, combining this hyper-efficient encoder with Claude's tailored head strategy—heteroscedastic regression for uncertainty calibration, focal loss for class imbalance, and PU-learning for the 10 ISTransBase positives-only tasks—solves the dataset's severe heterogeneity natively. This approach yields a brilliant publication narrative: a highly efficient, uncertainty-aware, PU-learning-integrated graph-hybrid system that achieves state-of-the-art multitask ADMET prediction on zero-cost hardware. We maintain the single unified model constraint, but rebuild the backbone from scratch to perfectly match both the physical chemistry of the problem and the strict limits of the deployment environment.
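For concreteness, here is a minimal single-molecule sketch of the kind of encoder Gemini describes: a tiny GIN-style graph branch fused with a Morgan-fingerprint MLP. The module names, layer widths, and dense-adjacency message passing are illustrative assumptions rather than a fixed design; a real implementation would batch graphs (for example with PyTorch Geometric) and featurize atoms with RDKit.

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    """One GIN-style update over a dense 0/1 adjacency matrix (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.eps = nn.Parameter(torch.zeros(1))

    def forward(self, h, adj):
        # h: (n_atoms, dim); adj: (n_atoms, n_atoms) bond-connectivity matrix
        return self.mlp((1 + self.eps) * h + adj @ h)

class HybridEncoder(nn.Module):
    """Tiny GNN branch + Morgan-fingerprint MLP, fused by concatenation.
    At these widths the whole encoder is well under 10M parameters."""
    def __init__(self, atom_dim=32, fp_bits=2048, hidden=128, out=256, layers=3):
        super().__init__()
        self.atom_proj = nn.Linear(atom_dim, hidden)
        self.gnn = nn.ModuleList(GINLayer(hidden) for _ in range(layers))
        self.fp_mlp = nn.Sequential(
            nn.Linear(fp_bits, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.fuse = nn.Linear(2 * hidden, out)

    def forward(self, atom_feats, adj, morgan_fp):
        h = self.atom_proj(atom_feats)      # (n_atoms, hidden)
        for layer in self.gnn:
            h = layer(h, adj)
        graph_vec = h.mean(dim=0)           # mean-pool atoms -> molecule vector
        fp_vec = self.fp_mlp(morgan_fp)     # (hidden,)
        return self.fuse(torch.cat([graph_vec, fp_vec], dim=-1))
```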
GPT-5.4 (changed mind)
Answer: VeloChem should pivot to a compact hybrid model, but the strongest version is not ChemBERTa-plus-better-heads or a pure GNN swap alone—it is a single unified architecture with lightweight molecular feature fusion and explicit task conditioning, because that gives the best combined answer to accuracy, tiny-task transfer, deployability, and publication novelty.
I am maintaining my core position that a plain ChemBERTa multitask model is not the best long-term direction here. The other responses reinforced that conclusion rather than reversing it. The dataset structure is simply too heterogeneous: 92 endpoints, mixed supervision regimes, strong imbalance, positives-only tasks, and several tiny tasks. That is exactly the regime where a single shared encoder with flat independent heads tends to waste capacity on the wrong problem. The real bottleneck is not just encoder quality; it is task interference, supervision mismatch, and incomplete molecular representation.

Gemini most strongly sharpened my view on one important point: task conditioning should be part of the design, not an afterthought. I still do not think a task-conditioned ChemBERTa-only model is the best final answer, because SMILES-only remains too narrow for many ADMET endpoints. But Gemini is right that the publication-worthy leap is not merely “add graph features.” The model should learn a universal molecular representation that is queried differently by different endpoints. That means the architecture should include task embeddings or task-group conditioning, ideally with a shared prediction trunk rather than 92 fully separate heads. This directly addresses negative transfer and gives a cleaner story for tiny tasks.

Claude also reinforced the deployment and novelty case against staying ChemBERTa-centric. I agree with the blunt assessment that plain ChemBERTa multitask ADMET is scientifically crowded and hard to sell as novel. I also agree that positives-only tasks should not be forced into ordinary binary supervision and that uncertainty-aware regression is a strong addition. Where I only partially differ is that I would not replace everything with a GNN plus fingerprints and drop sequence information entirely. That is lightweight and practical, but I think the best balance still includes a compact learned sequence encoder or pretrained molecular language component, because mixed-source ADMET datasets benefit from broad chemical prior knowledge and because a pure graph+fingerprint system risks looking more incremental than it first seems.

GLM’s answer further supports keeping a single unified model while making the decoder and task interaction smarter. I agree that flat 92-head multitask learning is architecturally weak. I am somewhat less persuaded by a full MoE as the primary novelty mechanism, because it adds engineering complexity and can become fragile on small or noisy tasks. But the underlying principle is correct: task structure must be represented explicitly. A lighter task-conditioned fusion module is probably a better fit than a heavier sparse-expert system if free-tier deployment simplicity is a serious constraint.

So the honest synthesis is this: keep the single-model vision, keep multitask learning, keep the full dataset, and keep the goal of one deployable artifact. Cut the idea that ChemBERTa with nicer heads and better optimization is enough. Cut flat per-task heads as the core design. Rebuild around a compact hybrid encoder with late fusion of complementary molecular views—ideally a small SMILES encoder plus cheap structural priors such as fingerprints and/or a very lightweight graph branch—then add task embeddings or task-group conditioning so the same shared trunk can adapt across endpoint families.

On top of that, use supervision matched to task type: focal or class-balanced losses for imbalanced classification, calibrated regression heads, and PU/anomaly-style treatment for positives-only endpoints. If you want the cleanest publishable story, frame it as a unified task-conditioned hybrid ADMET model rather than “a better ChemBERTa” or “just a GNN baseline with extras.” That story is stronger scientifically because it addresses the actual problem structure: heterogeneous endpoints require both richer molecular views and smarter task adaptation. It is stronger operationally because a compact fused model can still run in a free Hugging Face Space. And it is stronger competitively because it gives you multiple legitimate sources of gain: representation fusion, transfer to tiny tasks, better handling of positives-only labels, and calibration.

My one clear recommendation is therefore: restart from scratch with a compact hybrid, task-conditioned multitask architecture—small molecular language encoder plus lightweight structural features, fused into a shared trunk with task embeddings and task-appropriate losses—rather than spending more time trying to rescue a plain ChemBERTa-with-heads design.
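As a concrete illustration of the task-embedding idea, here is a minimal sketch of a shared trunk conditioned on an endpoint index. All names and dimensions are assumptions; heteroscedastic regression endpoints would widen the output to predict a variance as well.

```python
import torch
import torch.nn as nn

class TaskConditionedTrunk(nn.Module):
    """One shared prediction trunk queried via a learned task embedding,
    replacing 92 independent heads (sketch; all sizes are assumptions)."""
    def __init__(self, mol_dim=256, n_tasks=92, task_dim=32, hidden=256):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, task_dim)
        self.trunk = nn.Sequential(
            nn.Linear(mol_dim + task_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for classification, mean for regression
        )

    def forward(self, mol_repr, task_ids):
        # mol_repr: (batch, mol_dim); task_ids: (batch,) endpoint indices
        z = torch.cat([mol_repr, self.task_emb(task_ids)], dim=-1)
        return self.trunk(z).squeeze(-1)
```

Because every endpoint flows through the same trunk, tiny tasks inherit the decoder learned by data-rich tasks and differ only through their task embedding; task-group conditioning would swap the per-endpoint embedding for a shared per-family one.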
Claude Opus 4
Answer: VeloChem should pivot to a compact hybrid architecture combining a small SMILES encoder, shallow GNN, and descriptor branch with task-conditioned heads and PU-aware training, because this uniquely addresses the structural heterogeneity of 92 mixed endpoints while remaining deployable and publishable.
**Reviewing the other proposals after Round 1.** All four models agree on the core diagnosis: a plain ChemBERTa multitask model with 92 flat heads is architecturally inadequate and insufficiently novel. The debate is really about *how* to pivot. Let me engage honestly with each alternative.

**Gemini's task-conditioned single-head approach** is intellectually elegant but practically risky. Routing all 92 endpoints through a single shared MLP with task embeddings assumes that the ADMET manifold is smooth enough for a single decoder to handle absorption kinetics, CYP inhibition, hERG toxicity, and transporter substrate classification simultaneously. This is a strong assumption. The novelty claim is real — task-conditioned prediction is underexplored in ADMET — but the performance ceiling worries me. ADMET tasks have fundamentally different output distributions and decision boundaries. A single MLP with task conditioning may struggle to be competitive against tools like ADMET-AI or TDC baselines that use task-specialized models. Also, Gemini's proposal keeps ChemBERTa as the sole molecular encoder, which means you still miss the graph-topology and physicochemical-descriptor signals that matter for many ADMET endpoints.

**GLM's MoE proposal** is clever but adds complexity in the wrong place. Sparse MoE routing on top of ChemBERTa is architecturally interesting but still fundamentally limited by having only SMILES-derived representations as input. The gating network also introduces training instability (load balancing, expert collapse) that is non-trivial to manage across 92 tasks with wildly different data sizes. And the claim that ChemBERTa + MoE + evidential heads fits in 2GB RAM needs scrutiny — evidential heads for 29 regression endpoints plus Dirichlet heads for 53 classification tasks on top of MoE routing is not as lightweight as claimed.

**GPT-5.4's three-branch hybrid** is the closest to my original proposal and I think it's the most complete analysis. I agree with essentially everything in their argument. The key insight that I want to reinforce: the problem is not just about the decoder (as Gemini and GLM suggest) but about the *representation*. SMILES-only encoders systematically miss local topology, 3D-relevant features, and physicochemical priors. A descriptor branch is cheap and disproportionately helpful for ADMET. A shallow GNN adds topological bias. These are complementary, not redundant.

**What I'm updating after this round.** Gemini's task-conditioning idea is worth incorporating — not as the entire architecture, but as a component. Task embeddings that encode task metadata (classification vs. regression vs. PU, assay family, data size) can be fused into the shared representation before routing to heads. This gives you some of the knowledge transfer benefits Gemini describes without betting everything on a single MLP decoder. GPT-5.4's emphasis on PU-learning for the 10 ISTransBase positives-only tasks is critical and underappreciated by the other models. These tasks will poison your training if treated as standard binary classification. A proper PU-aware loss or masked partial-label objective is essential.

**My refined recommendation.** Build a single unified model with: (1) a compact pretrained SMILES encoder (~20-30M params, possibly a distilled ChemBERTa or ChemGPT-small), (2) a shallow 2-3 layer GIN/GAT graph branch, (3) a compact RDKit descriptor + Morgan fingerprint projection branch. Fuse via gated concatenation into a shared latent space. Add task embeddings (borrowing from Gemini) that encode task type, assay family, and data regime. Use three head families: focal-loss calibrated binary heads, heteroscedastic regression heads, and PU-learning heads for positives-only tasks. Train with masked multitask learning, GradNorm loss balancing, and proper per-task evaluation. This gives you a single checkpoint, a single inference pipeline, a clear publication narrative ('unified hybrid ADMET prediction with task-aware multi-modal molecular encoding'), competitive performance through complementary molecular views, and comfortable deployment on a free Hugging Face Space. Ablation studies across the three branches and training strategies write the paper for you.

I maintain my original hybrid position because the other proposals either keep the SMILES-only bottleneck (Gemini, GLM) or agree with the hybrid direction (GPT-5.4). The representation problem is fundamental and cannot be solved by decoder innovations alone.
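To make the three head families concrete, here are minimal loss sketches for each supervision regime. The focal-loss constants, the Gaussian NLL form, and the non-negative PU risk (after Kiryo et al., 2017, with an assumed class prior) are standard formulations rather than necessarily the exact ones Claude has in mind; under masked multitask training each would be applied only where a task has labels.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Focal loss for imbalanced binary heads (gamma/alpha values are assumptions)."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                     # model probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

def heteroscedastic_nll(mean, log_var, y):
    """Gaussian NLL for regression heads that predict a per-sample variance."""
    return 0.5 * (log_var + (y - mean) ** 2 / log_var.exp()).mean()

def nn_pu_loss(logits_pos, logits_unlabeled, prior=0.1):
    """Non-negative PU risk (Kiryo et al., 2017) with a sigmoid surrogate.
    `prior` is the assumed positive prevalence among unlabeled molecules."""
    risk_pos = torch.sigmoid(-logits_pos).mean()        # positives scored as +1
    risk_pos_as_neg = torch.sigmoid(logits_pos).mean()  # positives scored as -1
    risk_unl_as_neg = torch.sigmoid(logits_unlabeled).mean()
    neg_risk = risk_unl_as_neg - prior * risk_pos_as_neg
    return prior * risk_pos + torch.clamp(neg_risk, min=0.0)
```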
GLM 5 (changed mind)
Answer: VeloChem should pivot to a compact hybrid architecture: a distilled SMILES encoder paired with a lightweight GNN branch and descriptor projection, using task-conditioned routing through a single shared prediction head rather than MoE or independent heads, yielding a unified model that is deployable, accurate, and novel.
After reviewing the other responses, I am revising my recommendation in two key ways, influenced by compelling arguments from Gemini and GPT-5.4.

First, I am abandoning the Mixture-of-Experts router in favor of task conditioning (credit: Gemini 3.1 Pro). MoE adds parameter overhead and training complexity that is unnecessary here. A task-conditioned architecture—where task identity or metadata embeddings are fused with the molecular representation before a single shared prediction head—is lighter, more elegant, and, critically, more novel. It natively solves the tiny-task problem by enabling knowledge transfer through the shared head conditioned on task context, rather than relying on sparse expert routing. This is a stronger publication story than MoE, which has been explored in multitask NLP but is less established in molecular property prediction.

Second, I am incorporating multi-branch input fusion (credit: GPT-5.4 and Claude Opus 4). A SMILES-only encoder, even a good one like ChemBERTa, misses local topology and physicochemical priors that matter disproportionately for ADMET endpoints. A shallow GNN branch (~2-5M parameters) captures bond-level and ring-system features that transformers learn inefficiently. A compact descriptor projection (~1M parameters) injects proven physicochemical signals cheaply. These branches are small, complementary, and disproportionately helpful for the small and imbalanced tasks that are currently starving.

However, I disagree with Claude Opus 4's Round 1 recommendation to drop the pretrained SMILES encoder entirely. Transfer learning from a pretrained chemical language model stabilizes training on data-sparse endpoints and provides a strong initialization. The solution is to use a distilled or smaller SMILES transformer (20-50M parameters) rather than the full ChemBERTa-v2, keeping deployment feasible.

What to keep: the single-model constraint, the 387K dataset, multitask learning, and a pretrained SMILES component. What to cut: all 92 independent flat heads, the MoE router (too heavy), and full-sized ChemBERTa (too large for free-tier CPU). What to rebuild: the encoder as a three-branch hybrid (compact SMILES + GNN + descriptors), the decoder as a task-conditioned shared head with task-type-aware loss functions (focal loss for classification, heteroscedastic regression, PU-learning for positives-only), and the training procedure with uncertainty-weighted loss balancing and temperature-based task mixing.

The total parameter count lands around 30-60M—comfortably deployable on a free Hugging Face Space. The novelty story is strong: task-conditioned hybrid multitask ADMET with PU-learning integration and calibrated uncertainty, at unprecedented endpoint scale, in a model small enough for free-tier inference. This is publishable and practical.
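Two of the training-procedure pieces GLM names are easy to pin down in a sketch: temperature-based task mixing (sampling probability p_t ∝ n_t^(1/T)) and uncertainty-weighted loss balancing in the style of Kendall et al. (2018). The temperature value and exponent convention here are assumptions.

```python
import torch
import torch.nn as nn

def task_mixing_weights(task_sizes, temperature=2.0):
    """p_t ∝ n_t^(1/T): T=1 gives size-proportional sampling; larger T
    upweights tiny tasks. E.g. at T=2, a 100-row vs. 100,000-row endpoint
    goes from a 1:1000 raw ratio to roughly 1:32."""
    sizes = torch.as_tensor(task_sizes, dtype=torch.float64)
    scaled = sizes ** (1.0 / temperature)
    return scaled / scaled.sum()

class UncertaintyWeighting(nn.Module):
    """Learnable per-task loss weights via homoscedastic task uncertainty
    (after Kendall et al., 2018); the log-variances are trained jointly
    with the model."""
    def __init__(self, n_tasks):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(n_tasks))

    def forward(self, per_task_losses, task_ids):
        s = self.log_var[task_ids]
        return (torch.exp(-s) * per_task_losses + s).mean()
```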
Final Summary
The AI Roundtable unanimously recommended abandoning the pure ChemBERTa multitask approach in favor of a 'multi-view' hybrid model that combines a small SMILES encoder with graph and descriptor branches. Claude Opus 4 and GPT-5.4 successfully argued that this provides the topological bias and physicochemical priors necessary for ADMET, while Gemini 3.1 Pro and GLM 5 pivoted to support task-conditioned routing as the most novel way to handle tiny tasks on a free tier.