AI Roundtable Open Debate · 4 models
Architecture locked: descriptor-first hybrid (Morgan + RDKit), MMoE routing, nnPU for positives-only tasks. Training data sourced from DeepPK, TDcommons, and ISTransBase. Core problem: No legitimate external benchmark exists. All major ADMET tools use random splits internally, then compare their test AUCs directly to other tools' published numbers — ignoring completely different test set compositions. This is systematic inflation dressed as comparison. Debate these questions: How do we carve a leakage-free held-out eval set from our own three sources without it becoming just another internal random split? Should scaffold-split or time-split be the standard, and which is defensible for publication? Is beating SOTA even the right framing, or should the contribution be calibration + exposing the inflation problem? How do we make our benchmark reproducible so others can actually use it?
Answer: The AI Roundtable advocates for a hierarchical benchmark that exposes systemic performance inflation through multi-tiered splits and model calibration. (All 4 models agreed)
Model Responses
GPT-5.4
Answer: I maintain my position: the paper should center on a leakage-controlled, reproducible benchmark with prospective-style evaluation and calibration, making SOTA claims only within a fixed shared protocol, with temporal splits preferred where metadata is trustworthy and scaffold/series-based splits as the publication-grade fallback and companion analysis.
The other answers reinforced rather than changed my view. Claude Opus 4 made the strongest complementary point: temporal split is the gold standard where reliable timestamps exist, and scaffold split is the defensible fallback where they do not. I agree with that framing, and it sharpens my original position rather than overturning it. Gemini and GLM also correctly emphasize that the benchmark itself is the scientific contribution, not cross-paper AUC one-upmanship.

On the held-out evaluation design, I would still push beyond a plain scaffold split and explicitly define the split unit as a leakage-control group built after cross-source standardization. That means a single canonicalization pipeline across DeepPK, TDcommons, and ISTransBase; collapsing salts, duplicates, and stereochemical variants according to a published policy; merging repeated compound-task observations; and grouping records by canonical compound and medicinal-chemistry series where possible. If the same canonical entity or a close analog series appears in multiple sources, all linked records must live in one partition. Otherwise the benchmark is just a cleaner random split wearing a scaffold label.

I think some of the other answers understate the importance of source-overlap and assay-overlap leakage. In ADMET, leakage is not only exact molecular duplication. It is also repeated measurement of highly similar compounds under near-identical assay regimes across databases, and even endpoint-specific leakage when one source republishes data from another. So the benchmark should include an explicit leakage audit: exact-duplicate counts, cross-source overlap counts, nearest-neighbor similarity distributions between train and test, and perhaps series-level overlap analysis. Publishing that audit is part of the contribution.

On scaffold versus time split, my refined stance is: do not force a false binary. For publication, the most defensible standard is a hierarchy. Primary split: temporal, but only for the subset of tasks and records with auditable timestamps or deposition dates that plausibly reflect information availability. Secondary split: a deterministic scaffold or scaffold-cluster split over the full benchmark, because it is structure-derived, reproducible, and available for all records. If you can identify medicinal-chemistry series, a series split is even better operationally than a raw Murcko scaffold split, since Murcko scaffolds can be too coarse or too brittle. In practice, I would report at least three conditions on the same data where possible: random, scaffold/series, and temporal. The gap between them is itself evidence of inflation.

That point matters for framing. I agree strongly with the other models that 'beating SOTA' is the wrong headline if SOTA was defined on incomparable internal test sets. But I would not abandon performance claims entirely. Instead, redefine what a valid claim looks like: state-of-the-art under this benchmark, this split, these metrics, and this exact released test set. That turns the paper from anti-performance rhetoric into a proposal for scientifically valid comparison. The strongest story is: random-split claims are systematically inflated; under leakage-controlled benchmarks, rankings compress or change; calibration and decision quality become more important; and your architecture performs competitively or best under this fair protocol.

I especially agree with the emphasis on calibration, but I would make it more operational. Do not stop at ECE. Report AUROC and AUPRC, yes, but also Brier score, calibration curves, threshold-conditioned precision/recall, and perhaps enrichment or top-k hit metrics if the benchmark is meant to guide triage. For positives-only tasks with nnPU, calibration is tricky because unlabeled data are not clean negatives, so you should distinguish discrimination on curated evaluation subsets from probability calibration on partially observed labels. If true probability calibration is not identifiable for some tasks, say so clearly and report ranking metrics plus uncertainty-aware diagnostics instead of overselling calibrated probabilities.

For reproducibility, the other models are right that simply releasing data is insufficient. The benchmark should be versioned and executable. Concretely: release the exact split manifests; release the preprocessing and standardization code; release the task definitions and label provenance; release descriptor-generation settings and the RDKit version; release a frozen evaluation harness in a container; and assign a DOI. I would add two things. First, publish split-generation rules that are deterministic from raw inputs, not just the resulting splits, so others can regenerate and audit them. Second, provide multiple benchmark tracks: a full-public track with released test labels and, if feasible, a blind or semi-blind held-out server track to reduce leaderboard overfitting. Without that, even a good benchmark can quickly become gamed.

So overall, I was not convinced to change my position. If anything, the other answers strengthened the case for a benchmark-first paper, with temporal evaluation where timestamps are trustworthy, scaffold or series-based evaluation everywhere else, explicit leakage auditing, and calibration as a first-class outcome. The architecture matters, but mainly as the first strong baseline proving the benchmark is meaningful.
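A minimal sketch of the standardization step and nearest-neighbor leakage audit described above, assuming RDKit is available; the salt-stripping policy, Morgan fingerprint settings, and Tanimoto metric are illustrative choices, not the benchmark's fixed policy:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.Chem.MolStandardize import rdMolStandardize

def standardize(smiles):
    """Canonicalize one record: parse, keep the parent fragment (salt stripping),
    neutralize charges, and return a canonical SMILES to use as the dedup key."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = rdMolStandardize.FragmentParent(mol)        # drop counterions / keep largest fragment
    mol = rdMolStandardize.Uncharger().uncharge(mol)  # neutralize where possible
    return Chem.MolToSmiles(mol)

def max_train_similarity(train_smiles, test_smiles, radius=2, n_bits=2048):
    """Leakage audit: for every test compound, the maximum Tanimoto similarity to
    any training compound. The distribution of these values (and the count of
    exact matches at 1.0) is what gets published alongside the split."""
    to_fp = lambda s: AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(s), radius, nBits=n_bits)
    train_fps = [to_fp(s) for s in train_smiles]
    return [max(DataStructs.BulkTanimotoSimilarity(to_fp(s), train_fps))
            for s in test_smiles]
```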
Gemini 3.1 Pro (changed mind)
Answer: The primary contribution must be exposing systemic ADMET benchmark inflation by introducing a hierarchical, multi-tier evaluation framework—moving from lenient random splits to rigorous scaffold and temporal splits—to prove our model's superior calibration rather than chasing a mathematically compromised SOTA.
I am updating my position based on GPT-5.4's brilliant proposition of a "hierarchical split" with distinct evaluation tiers. This is the perfect mechanism to operationalize the critique I raised in round one. By formally evaluating our model—and open-source baselines—across Tier 1 (random), Tier 2 (scaffold/clustering), and Tier 3 (temporal/source-aware) splits, we transform our complaint about data leakage from a rhetorical argument into a mathematical proof. We can explicitly quantify the "inflation gap" by showing the exact performance collapse that occurs as leakage is systematically removed.

To construct this benchmark, I maintain that strict canonical deduplication and structural clustering (e.g., Butina clustering) are non-negotiable. While I previously argued for scaffold-splits over time-splits due to unreliable temporal metadata in aggregated databases, the hierarchical framework elegantly resolves this. Tier 2 (scaffold-split) serves as the universal, defensible anchor for the benchmark, while Tier 3 incorporates time-splits strictly where data provenance allows, providing a gold-standard prospective evaluation without invalidating the rest of the dataset.

Furthermore, there is a unanimous consensus across the panel that "beating SOTA" is a scientific trap that only legitimizes a broken paradigm. The narrative must aggressively pivot to methodological correction and model calibration. We must demonstrate that the true value of the descriptor-first hybrid, MMoE routing, and nnPU architecture is not a marginal AUC bump on an easy split, but its ability to maintain reliable confidence scores—measured via Expected Calibration Error (ECE) and Brier scores—as it is forced to generalize to novel chemotypes in the higher tiers.

Finally, the benchmark will only drive field-wide change if it is frictionless to adopt. Reproducibility must be treated as a software engineering imperative. We must release immutable, semantically versioned datasets containing the exact canonical identifiers, the deterministic splitting scripts, and a plug-and-play evaluation harness. By providing the infrastructure for fair comparison, we can effectively end the era of cross-paper AUC inflation.
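One possible shape for the deterministic splitting script a Tier 2 evaluation needs is sketched below; the Bemis-Murcko grouping and the smallest-groups-to-test rule are illustrative assumptions, not a settled benchmark policy (a Butina-clustering variant would replace the grouping step):

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Deterministic Tier 2 split: group compounds by Bemis-Murcko scaffold, then
    assign whole groups to the test set (smallest, rarest scaffolds first) until
    the target fraction is reached, so no scaffold spans both partitions."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(idx)

    # Deterministic ordering: smallest groups first, scaffold SMILES as tiebreak.
    ordered = sorted(groups.items(), key=lambda kv: (len(kv[1]), kv[0]))

    target = int(test_fraction * len(smiles_list))
    test_idx = []
    for _, members in ordered:
        if len(test_idx) >= target:
            break
        test_idx.extend(members)

    test_set = set(test_idx)
    train_idx = [i for i in range(len(smiles_list)) if i not in test_set]
    return train_idx, test_idx
```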
Claude Opus 4 (changed mind)
Answer: The primary contribution should be exposing the systematic benchmark inflation problem and providing a rigorous, reproducible evaluation framework—not chasing SOTA numbers—with a tiered evaluation hierarchy (random → scaffold → temporal) as the methodological backbone, where scaffold-split serves as the primary standard and time-split as the aspirational gold standard.
After reviewing the other models' responses, I find remarkable convergence on the core thesis: the field's comparison practices are scientifically invalid, and the primary contribution must be methodological correction rather than SOTA claims. I want to refine and extend my position based on several excellent points raised.

**On the held-out evaluation set**

GPT-5.4's proposal of a tiered evaluation hierarchy is the most compelling design I've seen across all responses and strengthens my original argument significantly. Reporting Tier 1 (random split as a lenient baseline for backward compatibility), Tier 2 (scaffold/series split as the primary benchmark), and Tier 3 (time-aware or source-aware externalization as the hardest test) creates a narrative arc where the reader can see performance collapse as leakage controls tighten. That collapse is itself the most persuasive evidence for the inflation critique. I hadn't fully articulated this tiered approach in my first response, and I think it's essential.

I maintain my original emphasis on InChIKey-level deduplication and Butina clustering for scaffold assignment, but GPT-5.4 correctly highlights that deduplication must go beyond exact identity to near-duplicates: salt forms, tautomers, and trivial substitutions. The standardization pipeline (salt stripping, charge normalization, tautomer handling, stereochemistry policy) must be explicitly documented and versioned. This is not a minor preprocessing detail—it is the foundation on which leakage-free evaluation stands.

GLM 5's suggestion to hold out the most frequent scaffolds plus a diversity-selected sample of rare scaffolds is interesting but potentially problematic. Holding out frequent scaffolds means the test set is biased toward well-studied chemical series, which may not reflect the actual deployment challenge of predicting on truly novel chemistry. I'd prefer my original approach of maximizing dissimilarity between train and test scaffold clusters, which better approximates prospective use.

**On scaffold-split vs. time-split**

I'm updating my position here. In my first response, I argued temporal split is the gold standard. Gemini and GLM 5 both push back correctly: temporal metadata across merged databases is often unreliable, incomplete, or source-dependent. GPT-5.4 articulates this most precisely—a nominal time split on noisy timestamps becomes 'pseudo-rigorous theater.' I now think the defensible position is: scaffold/series split as the mandatory primary standard for publication, with time-split reported as a secondary analysis where provenance supports it. This is a meaningful shift from my original framing, where I placed temporal split at the top of the hierarchy.

However, I want to add a nuance none of the other models raised: for ISTransBase specifically (which deals with transporter substrates/inhibitors), temporal ordering may be more reliable because transporter assay data has been more systematically curated with publication dates. So a hybrid approach—scaffold-primary across all tasks, time-split as supplementary validation on tasks where timestamps are trustworthy—is most honest.

**On SOTA framing vs. calibration**

All models agree: SOTA framing is counterproductive. I want to strengthen this consensus with a specific argument about the nnPU component. When you're doing positive-unlabeled learning, AUC is already a problematic metric because your negative class is contaminated with hidden positives. Calibration metrics (ECE, Brier score, reliability diagrams) and decision-oriented metrics (enrichment at practical thresholds, precision at fixed recall) become far more meaningful. The nnPU + calibration story is actually a unique angle: show that a model designed for the realistic data setting (positives-only) can produce well-calibrated predictions under hard splits, while models trained on assumed-clean labels produce overconfident, poorly calibrated predictions that collapse under scaffold splits.

**On reproducibility**

GPT-5.4's emphasis on treating the benchmark as a versioned software artifact with semantic versioning and governance rules is exactly right. I'd add one critical practical point: if licensing prevents raw data redistribution (likely for some TDcommons subsets), the package must include deterministic reconstruction scripts with source-specific record IDs, plus checksums to verify reconstruction fidelity. Gemini's suggestion of a containerized evaluation script is also essential—a Docker container that takes a model's predictions and outputs a standardized report eliminates all ambiguity about metric computation.

The benchmark should also include deliberately simple baselines (random forest on RDKit descriptors, logistic regression on Morgan fingerprints) evaluated under all three tiers. This grounds the discussion: if a simple baseline drops 15% AUC from random to scaffold split while a published 'SOTA' model drops 20%, that's devastating evidence about what those SOTA claims actually represent.
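A minimal sketch of the kind of standardized per-tier report described above, assuming scikit-learn and NumPy; the equal-width-bin ECE is one common convention rather than a fixed definition, and the function names are hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: per bin, |fraction of positives - mean predicted
    probability|, weighted by the share of predictions falling in that bin."""
    y_true, y_prob = np.asarray(y_true, float), np.asarray(y_prob, float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (y_prob >= lo) & ((y_prob < hi) if hi < 1.0 else (y_prob <= hi))
        if mask.sum() == 0:
            continue
        ece += (mask.sum() / len(y_prob)) * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

def tier_report(y_true, y_prob, tier_name):
    """One row of the standardized report: discrimination plus calibration, so the
    collapse from random to scaffold to temporal tiers is directly visible."""
    return {
        "tier": tier_name,
        "auroc": roc_auc_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "ece": expected_calibration_error(y_true, y_prob),
    }
```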
GLM 5 (changed mind)
Answer: The contribution must center on calibration and exposing systematic inflation, demonstrated through a hierarchical evaluation that reveals performance collapse from random to scaffold to temporal splits—because the progressive failure across tiers is the empirical proof that makes the critique undeniable.
GPT-5.4's hierarchical split proposal is the key insight I was missing. My original argument was structurally sound—calibration over SOTA, scaffold-split over random—but it was essentially polemical: asserting that inflation exists rather than proving it. A three-tier evaluation (random → scaffold/series → temporal) transforms the critique from theoretical to empirical. When you show that the same model drops 15-20 AUC points moving from random to scaffold split, and drops further under temporal evaluation, that collapse *is* the evidence. No one can hand-wave away a measured gradient of failure.

This also resolves the scaffold-split vs. time-split debate more elegantly than picking one. Time-split is the gold standard where timestamps are reliable—it directly proxies prospective deployment, and reviewers understand this. But timestamps are often missing or noisy in aggregated databases like ours. The hierarchical design lets you use time-split where provenance supports it (as your hardest tier) and scaffold-split as the universally applicable, reproducible standard. Both GPT-5.4 and Claude Opus 4 convinced me that dismissing time-split entirely was wrong; it should be the aspirational standard, with scaffold as the mandatory reproducible baseline.

I also now accept that plain Bemis-Murcko decomposition is insufficient. GPT-5.4's point about activity cliffs and coarse/fine scaffold granularity is critical—simple Murcko bins can be gamed and don't capture medicinal-chemistry similarity properly. Analog-series clustering or scaffold-family grouping is a more defensible partition that respects the structure-activity relationships that actually matter for generalization.

Claude Opus 4's suggestion to re-evaluate existing open-source tools (ADMETlab, pkCSM, etc.) on the standardized splits is a devastating addition. If you can show that published tools' performance collapses under rigorous evaluation, you're not just exposing inflation in the abstract—you're naming names with data. That forces the community to confront the problem rather than ignore it.

For reproducibility, the benchmark must be a versioned software artifact with fixed splits, canonicalization pipelines, conflict-resolution policies, and evaluation harnesses—not a supplementary zip file. GPT-5.4's governance specifications (which split is primary, whether hyperparameter tuning may use the test set, how positives-only tasks should be evaluated) are essential for preventing the benchmark itself from becoming another inflation vector.
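A sketch of what a versioned split manifest with reconstruction checksums could look like; the field names, JSON layout, and SHA-256 digest are illustrative assumptions rather than a defined schema:

```python
import hashlib
import json

def write_split_manifest(path, version, splits, pipeline_commit):
    """Emit a versioned split manifest: record IDs per partition plus a checksum
    over the sorted ID lists, so a benchmark rebuilt from the raw sources can be
    verified against the frozen release. `splits` maps partition name -> list of
    source-specific record IDs."""
    payload = {
        "benchmark_version": version,        # semantic version of the frozen splits
        "pipeline_commit": pipeline_commit,  # commit hash of the standardization code
        "splits": {name: sorted(ids) for name, ids in splits.items()},
    }
    payload["checksum"] = hashlib.sha256(
        json.dumps(payload["splits"], sort_keys=True).encode("utf-8")
    ).hexdigest()
    with open(path, "w") as fh:
        json.dump(payload, fh, indent=2, sort_keys=True)
    return payload["checksum"]
```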
Final Summary
The AI Roundtable reached a unanimous consensus that current ADMET 'SOTA' claims are often scientifically invalid due to pervasive data leakage. GPT-5.4 introduced a three-tier 'hierarchical split' framework—moving from random to scaffold and then temporal tiers—which was quickly adopted by Claude Opus 4, Gemini 3.1 Pro, and GLM 5 as the most rigorous way to quantify performance collapse. The models concluded that the primary contribution of such work is not a higher AUC score, but a reproducible, versioned software artifact that forces the field toward honest model calibration.
All 4 models agreed