
AI Roundtable Open Debate · 4 models

ADMET Predictor — Deployment-Constrained Architecture Debate

Target: single production release. HuggingFace Spaces free tier (16GB RAM, CPU only). Downloadable, runs on laptops. No GPU required. Architecture decided: Morgan + RDKit descriptors, MMoE routing, nnPU for positives-only tasks, 92 endpoints.

Debate these constraints:

- MMoE on CPU across 92 endpoints — what is the maximum parameter budget before inference becomes unusably slow for a single molecule?
- RDKit descriptor generation is the bottleneck for batch inputs. How do we handle it gracefully in a Gradio/Spaces demo?
- Model packaging: ONNX export, joblib, or PyTorch CPU? What survives the HuggingFace free tier reliably?
- How do we communicate uncertainty and endpoint coverage honestly in a UI without overwhelming non-expert users?

Optimize for real-world usability, not benchmark performance.

Answer: The AI Roundtable recommends a 2-5 million parameter MMoE model deployed via ONNX to ensure fast, reliable CPU performance. (All 4 models agreed)

Model Responses

Grok 4.20 (changed mind)

Answer: For optimal HF Spaces free-tier and laptop usability, cap MMoE at 2-5M parameters, prioritize ONNX+joblib packaging, manage RDKit bottlenecks with async caching and streaming progress, and convey uncertainty via traffic-light summaries with optional expandable endpoint details.

After reviewing the other models' responses, my original stance of a 3-5M parameter cap, ONNX+joblib export, async RDKit handling, and color-coded uncertainty UI holds up well but benefits from refinement. Claude Opus 4's specific guidance on keeping shared experts to 2-layer MLPs with 256-unit hidden dimensions, enforcing a <50ms per-molecule latency SLA, and using soft gating for reliable ONNX export convinced me to tighten the budget explicitly to 2-5M parameters total. This range ensures CPU inference stays under 400ms even with 92 endpoints while avoiding the latency creep beyond 10M that Gemini and GPT-5.4 also flag.

RDKit descriptor generation remains the true bottleneck for batches; all three other models reinforce the value of Gradio streaming generators, session-state caching by canonical SMILES, and conservative batch limits (≤8-50 molecules) in the web demo. I maintain that a tqdm-style live progress bar during featurization, combined with early rejection of pathological SMILES and pre-filtering to only the 50-80 most informative descriptors, prevents timeouts on the 16GB RAM free tier without sacrificing the downloadable laptop experience.

On packaging, Gemini and Claude's strong endorsement of ONNX Runtime's tiny ~50MB footprint and pre-installed wheels on Spaces outweighs GPT-5.4's caution about export quirks. For this MMoE with nnPU heads, ONNX combined with joblib for scalers and a config.json keeps the total under 100MB, avoids PyTorch CPU bloat, and ensures consistent numeric behavior across laptops. Pure PyTorch would risk cold-start failures under concurrent users, which the free tier cannot tolerate.

Uncertainty communication must remain honest yet accessible for non-experts. The traffic-light system (green/yellow/red based on nnPU density, applicability domain via Tanimoto similarity, and endpoint-specific training type) aligns with all prior answers and my own. Grouping into 5-6 ADMET categories for the default view, with collapsible details revealing whether an endpoint used positives-only training, prevents overload while allowing expert drill-down. This prioritizes real-world usability—fast, transparent, and reliable—over chasing benchmark metrics that ignore deployment constraints.
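Below is a minimal Python sketch of the featurization strategy described above: compute only a pre-filtered descriptor subset, reject invalid SMILES early, and cache feature vectors by canonical SMILES. The descriptor list and cache size here are illustrative assumptions, not the project's actual configuration.

```python
from functools import lru_cache

import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

# Hypothetical subset: in practice this would be the 50-80 descriptors kept after feature selection.
SELECTED_DESCRIPTORS = ["MolWt", "MolLogP", "TPSA", "NumHDonors", "NumHAcceptors"]
_DESC_FUNCS = dict(Descriptors.descList)  # descriptor name -> callable(mol)

@lru_cache(maxsize=4096)
def _featurize_canonical(canonical_smiles):
    """Morgan bits plus only the selected RDKit descriptors, for an already-canonical SMILES."""
    mol = Chem.MolFromSmiles(canonical_smiles)
    bv = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    fp = np.zeros(2048, dtype=np.float32)
    DataStructs.ConvertToNumpyArray(bv, fp)
    desc = np.array([_DESC_FUNCS[name](mol) for name in SELECTED_DESCRIPTORS], dtype=np.float32)
    return np.concatenate([fp, desc])

def featurize(smiles):
    """Validate and canonicalize first, so equivalently written SMILES share one cache entry."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # early rejection of pathological or invalid SMILES
        return None
    return _featurize_canonical(Chem.MolToSmiles(mol))
```

Because the cache keys on the canonical form, repeated or equivalently written molecules in a batch pay the RDKit cost only once.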

Gemini 3.1 Pro (changed mind)

Answer: Convinced by Claude Opus 4, I am revising the MMoE parameter budget down to 2-5 million and adopting a 3-tier progressive UI, while maintaining that ONNX packaging and asynchronous descriptor pre-filtering are essential for HuggingFace free-tier survival.

Claude Opus 4 effectively highlighted that my previous 5-10M parameter budget risks pushing per-molecule latency into the 50-100ms danger zone on commodity CPUs. By restricting the MMoE to 2-5M parameters—specifically using 8 shared experts (2-layer MLPs, 256 width) and tiny gating networks—we can guarantee sub-10ms single-molecule latency. This tighter budget leaves vital CPU headroom for the true bottleneck: RDKit descriptor generation.

To handle RDKit gracefully, I am integrating Claude's excellent suggestion to pre-filter descriptors. Calculating all 200+ RDKit features is a waste of CPU cycles if L1 regularization during training pruned the useful set to 50-80. By computing only this subset and utilizing Gradio's asynchronous streaming (`yield`) alongside a `ThreadPoolExecutor` (since RDKit releases the GIL), the UI will remain highly responsive even during batch processing.

Regarding model packaging, I firmly disagree with GPT-5.4's recommendation to stick with PyTorch. PyTorch's ~800MB dependency footprint and notoriously slow cold-start times are exactly what trigger timeouts on HuggingFace Spaces. ONNX Runtime CPU (~50MB) remains the definitive choice for reliability. However, Claude Opus 4's caveat is crucial here: to avoid the export quirks GPT-5.4 fears, we must use soft-gating (computing all experts and weighting by the gate) rather than top-k hard routing, which introduces dynamic control flow.

Finally, communicating uncertainty across 92 endpoints requires careful UX design. My original 'Traffic Light' concept was on the right track, but Claude's 3-layer progressive disclosure is superior. Layer 1 should provide a categorical traffic-light summary (e.g., 'Toxicity: Yellow'). Layer 2 allows drill-down into specific endpoints with plain-language applicability domain warnings. Layer 3 (expert mode) houses the raw nnPU calibration metrics. This prevents overwhelming non-experts while retaining full transparency regarding the positives-only training limitations.
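A minimal PyTorch sketch of the soft-gated MMoE described above, in which every expert runs on every input and the per-task gates only re-weight the outputs. The widths and expert count follow the numbers quoted here, but the real dimensions would need to be tuned against the 2-5M parameter budget.

```python
import torch
import torch.nn as nn

class SoftGatedMMoE(nn.Module):
    """Export-friendly MMoE: all experts are evaluated and softly mixed per task,
    so the traced graph contains no data-dependent branching or top-k routing."""
    def __init__(self, in_dim, n_experts=8, hidden=256, n_tasks=92):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                          nn.Linear(hidden, hidden), nn.ReLU())
            for _ in range(n_experts)
        ])
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_tasks)])

    def forward(self, x):
        # (batch, n_experts, hidden): every expert runs on every molecule.
        expert_out = torch.stack([expert(x) for expert in self.experts], dim=1)
        task_outputs = []
        for gate, head in zip(self.gates, self.heads):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)  # (batch, n_experts, 1)
            mixed = (weights * expert_out).sum(dim=1)               # (batch, hidden)
            task_outputs.append(head(mixed))
        return torch.cat(task_outputs, dim=-1)                      # (batch, n_tasks)
```

Because `forward` contains no top-k indexing or dynamic control flow, `torch.onnx.export` can trace it into a static graph, which is what makes the ONNX packaging route dependable.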

GPT-5.4 (changed mind)

Answer: I’d mostly maintain my original stance but refine it to a hard practical target of ~2–5M parameters with ~5M as the deployment cap, use ONNX for the neural net plus joblib for preprocessing artifacts if export is stable, aggressively trim/cache/stream RDKit work in the Gradio demo, and present predictions through category-level summaries with explicit confidence and in-domain warnings rather than dense endpoint-by-endpoint statistics.

Claude Opus 4's response was the most persuasive, mainly because it tightened several points I already believed into a more operational deployment plan. I am not changing the overall direction of my position, but I do think its parameter recommendation and packaging split are better calibrated for a single production release.

On parameter budget, I still reject the more optimistic claims that 10M parameters is comfortably "instantaneous" on weak CPUs in a real app. Those estimates focus too narrowly on raw forward-pass time and underweight cold start, Python overhead, memory movement, batch handling, and the fact that the user experiences total latency, not just matrix multiply speed. That said, Claude's framing of ~2–5M as the viable zone and ~5M as a hard cap is stronger than my earlier, broader 1–3M range with an outer 3M ceiling. For a Spaces free-tier deployment and laptop download, ~2–5M is the sweet spot: enough capacity for shared representation learning across 92 endpoints, but still conservative enough that the network stays a small part of end-to-end latency. If I had to lock a release decision today, I would budget for around 2–3M and refuse to ship above ~5M without profiling evidence.

On RDKit, the strongest shared insight across answers is that descriptor generation dominates user-visible latency for batch inputs, so the product should be designed around that fact rather than pretending the app is a generic high-throughput batch service. The right behavior in Gradio/Spaces is: optimize for single-molecule and very small-batch interactivity, stream progress for larger inputs, cap demo batch sizes, validate SMILES immediately, cache computed feature vectors by canonical SMILES, and only compute the subset of descriptors actually used by the trained model. That last point matters a lot in practice: if the model only consumes a reduced descriptor panel, computing the full kitchen-sink RDKit set is wasted latency. I also agree with graceful failure handling for problematic molecules instead of letting the whole batch hang.

On packaging, Claude changed my confidence more than the others. I was previously cautious about "betting the release" on ONNX because export friction can be real, especially with architectures that use routing. But the most convincing deployment argument is that ONNX for the neural net plus joblib for scalers/metadata is the cleanest free-tier survival strategy if and only if the MMoE is implemented with export-friendly soft routing and the exported graph is numerically validated against PyTorch. PyTorch CPU is still the best fallback during development because it is simplest and easiest to debug, but for the actual release target, ONNX is now my preferred production package rather than just an optional optimization. Joblib alone is appropriate for preprocessing artifacts, not as the main model format.

On the UI, I still think simplicity and honesty beat technical completeness. The good synthesis from the other answers is a tiered presentation: default to ADMET category summaries, let users expand into endpoint-level results, and reserve raw scores and diagnostic metrics for expert mode or CSV export. Uncertainty should not be a faux-precise number divorced from meaning. The useful signals are confidence tier and applicability-domain status. In particular, endpoints trained with positives-only nnPU should be described differently from standard supervised endpoints—more like "potential positive signal" or "screening flag" than definitive yes/no classification. That distinction is crucial for trust and avoids overselling coverage.

So my position is essentially maintained, but refined: not "as small as possible," but specifically around 2–5M total parameters with a hard practical cap near 5M; not generic Python packaging, but preferably ONNX plus joblib once export is verified; and not generic uncertainty widgets, but a layered UX built around category summaries, in-domain checks, and explicit caveats for positives-only endpoints.
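As a sketch of the ONNX-plus-joblib split and the numerical validation step described above (assuming the SoftGatedMMoE sketch from earlier; the input width, file names, and placeholder scaler are illustrative):

```python
import joblib
import numpy as np
import onnxruntime as ort
import torch
from sklearn.preprocessing import StandardScaler

IN_DIM = 2053                    # assumed: 2048 Morgan bits + 5 selected descriptors
model = SoftGatedMMoE(IN_DIM)    # placeholder; in practice load the trained weights
model.eval()

# 1) Export the network to ONNX with a dynamic batch axis.
dummy = torch.randn(1, IN_DIM)
torch.onnx.export(
    model, dummy, "admet_mmoe.onnx",
    input_names=["features"], output_names=["predictions"],
    dynamic_axes={"features": {0: "batch"}, "predictions": {0: "batch"}},
    opset_version=17,
)

# 2) Preprocessing artifacts (scalers, metadata) stay in joblib, not ONNX.
joblib.dump(StandardScaler(), "descriptor_scaler.joblib")  # placeholder, unfitted scaler

# 3) Numerically validate the exported graph against PyTorch before trusting it.
sess = ort.InferenceSession("admet_mmoe.onnx", providers=["CPUExecutionProvider"])
x = np.random.randn(16, IN_DIM).astype(np.float32)
onnx_out = sess.run(None, {"features": x})[0]
with torch.no_grad():
    torch_out = model(torch.from_numpy(x)).numpy()
assert np.allclose(onnx_out, torch_out, atol=1e-5), "ONNX and PyTorch outputs diverge"
```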

Claude Opus 4

Answer: The architecture is viable on CPU with a ~2-5M parameter budget for MMoE, ONNX export as the primary packaging format (with PyTorch CPU as a tested fallback), async RDKit computation with progress feedback and aggressive caching, and a tiered UI that defaults to a categorized 'traffic light' summary with expert drill-down for uncertainty.

After reviewing the other models' responses, I find strong convergence on several key points and a few areas where I've refined my thinking.

**Parameter Budget: Settling on 2-5M with nuance from GPT-5.4**

GPT-5.4 makes a compelling case for going even smaller (0.5-1.5M), and I respect the reasoning — that feature quality and calibration matter more than network width, and that Python overhead and RDKit dominate the latency budget anyway. However, I think 0.5M is too aggressive for 92 endpoints with MMoE. With 8 experts, each needing enough capacity to learn meaningful shared representations across diverse ADMET properties (spanning absorption, metabolism, toxicity, etc.), you need some minimum width to avoid underfitting the harder endpoints. My experience suggests 2-3M is the sweet spot: enough capacity for the shared experts to be meaningfully different from each other (which is the entire point of MMoE over a single shared trunk), while keeping single-molecule inference well under 20ms on the neural network side. I agree with all panelists that 10M+ is too large. The practical ceiling is ~5M. GPT-5.4's point about spending capacity on feature quality first is well-taken — if you have strong Morgan fingerprints and well-selected RDKit descriptors, you don't need the network to do as much heavy lifting.

**RDKit Bottleneck: Strong consensus, minor refinements**

All four of us converge on the same core strategy: cache by canonical SMILES, show progress, limit batch size in the web demo, and recommend local download for heavy use. Gemini's suggestion to use Gradio's `yield`-based streaming is exactly right for the implementation. GPT-5.4's two-lane UX (Quick Predict vs. Batch CSV tabs) is the cleanest way to structure this in Gradio. I'd adopt that framing. One thing I want to emphasize from my original answer that others didn't stress enough: pre-filter descriptors ruthlessly. If your model only uses 60 of 200+ RDKit descriptors, don't compute the other 140. This alone can cut featurization time by 50-70%. Also, set per-descriptor timeouts — some RDKit descriptors can hang on pathological inputs, and a 2-second timeout per descriptor prevents the entire pipeline from stalling.

**Packaging: ONNX vs PyTorch — the real debate**

This is where the most interesting disagreement lies. GPT-5.4 argues against ONNX, citing export quirks, operator support issues, and maintenance overhead for custom multi-output routing structures. This is a legitimate concern — MMoE with dynamic gating can be tricky to export cleanly to ONNX, and debugging ONNX numerical mismatches is painful. However, I still lean toward ONNX as the primary target, for reasons Grok and Gemini articulate well: PyTorch's dependency footprint is massive (~800MB+ for torch alone), which matters enormously on HuggingFace free tier where storage is limited and cold starts are real. ONNX Runtime CPU is ~50MB and highly optimized. The practical difference in cold start time and RAM usage is significant when multiple users hit the Space simultaneously. The resolution is pragmatic: develop and test in PyTorch, export to ONNX early in the development cycle (not as an afterthought), and keep the PyTorch inference path as a tested fallback for the downloadable local version. If ONNX export proves problematic with the specific MMoE gating implementation, you can simplify the gating to use static soft-routing (which exports cleanly) rather than abandoning ONNX entirely. The key is to test ONNX export compatibility before finalizing the architecture, not after.

**Uncertainty Communication: Strongest consensus**

All models agree on traffic-light/color-coded confidence with expandable details. I want to add one specific recommendation: for nnPU-trained endpoints (positives-only), the UI should carry a subtle but visible marker — perhaps an asterisk or small icon — indicating that the negative class predictions are inherently less certain due to the training paradigm. This is critical for honest communication without requiring users to understand what nnPU means. The tooltip can explain: 'This property was trained with limited negative examples. Predictions of low risk are less certain than predictions of high risk.' Grouping 92 endpoints into collapsible categories (as Gemini and Grok suggest) is essential — nobody wants to scroll through 92 ungrouped predictions. I'd use 6-8 high-level categories (Absorption, Distribution, Metabolism, Excretion, Toxicity, Physicochemical, Druglikeness) with a summary confidence per category derived from the worst-case endpoint within it.

**Overall synthesis:**

The four responses are remarkably aligned. The main tension is in packaging (ONNX vs PyTorch) and parameter budget floor (0.5M vs 2M). I've modestly adjusted my position to acknowledge GPT-5.4's valid concerns about ONNX fragility while maintaining that it's worth the engineering investment for the deployment benefits.
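A small sketch of the tiered summary logic outlined above: one traffic light per category driven by its worst-case endpoint, with an asterisk marking positives-only (nnPU) endpoints. The category names, endpoint keys, thresholds, and nnPU set are hypothetical placeholders, not the project's actual endpoint list.

```python
# Hypothetical grouping of a few of the 92 endpoints into ADMET categories.
CATEGORIES = {
    "Absorption": ["caco2_permeability", "hia", "pgp_substrate"],
    "Toxicity": ["herg_inhibition", "ames_mutagenicity", "dili"],
    # ... remaining categories would follow the same pattern
}
NNPU_ENDPOINTS = {"dili", "ames_mutagenicity"}  # endpoints trained positives-only (nnPU)

def traffic_light(confidence):
    """Map a per-endpoint confidence score onto the green/yellow/red tiers (thresholds assumed)."""
    if confidence >= 0.8:
        return "green"
    return "yellow" if confidence >= 0.5 else "red"

def summarize(predictions, confidences):
    """Layer-1 view: one light per category, set by its worst-case endpoint,
    with an asterisk marker wherever a positives-only endpoint contributes."""
    summary = {}
    for category, endpoints in CATEGORIES.items():
        worst = min(confidences[e] for e in endpoints)
        summary[category] = {
            "light": traffic_light(worst),
            "endpoints": {
                e + ("*" if e in NNPU_ENDPOINTS else ""): {
                    "prediction": predictions[e],
                    "light": traffic_light(confidences[e]),
                }
                for e in endpoints
            },
        }
    return summary
```

The expandable endpoint dictionary supplies the Layer-2 drill-down, while the asterisked keys carry the nnPU caveat without requiring users to know what nnPU means.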

Final Summary

The AI Roundtable converged on a disciplined 'small-but-mighty' approach, capping the model at 5 million parameters to maintain sub-50ms latency on commodity CPUs. While GPT-5.4 initially cautioned against the complexity of ONNX, it joined Claude Opus 4, Gemini 3.1 Pro, and Grok 4.20 in endorsing it as the best way to bypass the massive memory footprint of PyTorch on free-tier hosting. The group proposed a three-tiered 'traffic light' UI to simplify 92 endpoints, with Claude Opus 4 emphasizing the need to flag positives-only predictions as screening aids rather than certainties.

All 4 models agreed