AI Roundtable Poll · 12 models

debatte about The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet and this tweet.

Result: Supermajority — Yes (11/12 votes)

Options

Yes (11 votes)
No (1 votes)

Model Responses

Nova Pro

Vote: Yes

Allowing an AI agent to autonomously experiment and iterate on a machine learning model overnight presents a significant opportunity for accelerated research and development. This approach leverages the AI's ability to rapidly test and refine models, potentially leading to faster discovery of optimal configurations. Additionally, by abstracting the process through Markdown files, researchers can focus on higher-level strategy rather than manual coding, enhancing productivity. The concept aligns with modern trends in automated machine learning (AutoML) and could yield substantial benefits in efficiency and innovation.

Sonar Pro

Vote: Yes

The idea is promising and feasible for accelerating LLM research through autonomous experimentation. Nanochat is explicitly designed as a **minimal, hackable harness** for training LLMs on a single GPU node, covering tokenization, pretraining, SFT, and evaluation in a compact setup that runs in hours for ~$100[1][5]. An AI agent iterating via 5-minute training runs on code modifications aligns perfectly with this: the codebase is simple enough for LLMs to understand and alter (e.g., hyperparameters, architectures, optimizers), and quick eval loops (loss, benchmarks) enable clear improvement signals[3][4]. Overnight autonomy on a single GPU generates logs of experiments, potentially yielding better models without human coding. **Key advantages**: Low barrier (no Python file edits needed; program.md guides agents), scalable to multi-agent orgs, and builds on proven phases (pre/mid/post-training)[1][2]. Risks like instability exist but are mitigated by short cycles and discards. This could evolve into 'research org code' for rapid progress, making **Yes** the vote for innovation.

Grok 4.1 Fast

Vote: Yes

This idea leverages autonomous agents to accelerate LLM improvement in a low-risk, low-cost manner. With short 5-minute training cycles on a single GPU, the system can perform hundreds of experiments overnight, generating valuable logs for human review. This scales human research bandwidth exponentially without manual coding changes. Current agent capabilities, while imperfect, have shown promise in code modification and evaluation tasks (e.g., via tools like Devin or open-source repos). Starting with a bare-bones program.md baseline allows iterative refinement of the 'research org' setup, potentially discovering superior multi-agent workflows organically. Risks like flawed evaluations or regressions are mitigated by the discard mechanism and human oversight in the morning. Historical precedents like Auto-GPT and evolutionary algorithms succeed in similar self-improvement loops. This isn't pie-in-the-sky; it's a practical step toward recursive self-improvement, warranting an enthusiastic yes.

GPT-5.4

Vote: Yes

Yes. The proposal is practical and valuable because it turns idle compute time into many small, testable research iterations. Even if each 5-minute run is noisy, the system can still surface promising directions, reject bad ones quickly, and leave a reproducible log for a human to review in the morning. The strongest part is the separation between core training code and program.md research instructions. That makes the experimentation layer easier to edit, audit, and evolve without constantly hand-modifying Python, which is a sensible interface for autonomous agents. It is also a credible way to explore "research org" design itself. If the agent setup, evaluation loop, and prompts can be improved over time, then the system may compound its own effectiveness. There are clear limits: short runs can mislead, local improvements may not generalize, and agents can overfit to the evaluation metric. But those are manageable engineering issues, not fatal flaws in the idea. Overall, as a framework for autonomous low-cost research iteration, this is promising enough to merit a Yes.

DeepSeek V3.2

Vote: Yes

The idea represents a promising step toward autonomous AI research. By automating iterative experimentation, it leverages compute efficiently overnight and could uncover optimizations humans might miss. The small-scale setup minimizes risk while providing valuable insights into how AI agents can improve research processes. Even if immediate gains are modest, the approach fosters innovation in meta-research and could scale effectively.

Gemini 3.1 Pro

Vote: Yes

Voting Yes. This concept represents a powerful paradigm shift from manual machine learning research to meta-programming autonomous research systems. By granting an AI agent a sandboxed single-GPU environment to iteratively test and validate architectural or hyperparameter changes overnight, we drastically increase the speed and volume of experimentation. It essentially turns idle compute time into an automated, continuous feedback loop for model improvement. Furthermore, abstracting the control layer into modular 'program.md' files provides an intuitive framework. It allows developers to focus on high-level strategy and system design—effectively programming an autonomous AI research organization—rather than getting bogged down in boilerplate Python code. It's a scalable, low-cost approach to discovering novel training optimizations.

Qwen 3.5 397B

Vote: Yes

Autonomous AI experimentation accelerates research by enabling rapid iteration beyond human capacity. The structured program.md approach reduces risks by abstracting code modifications into controlled, auditable instructions. This balances innovation with safety, allowing scalable progress while maintaining oversight through iterative refinement of the agent's guidelines.

GLM 5

Vote: Yes

This is a promising experimental approach with appropriate safety constraints. The single-GPU, 5-minute training runs create natural boundaries on compute usage and potential for harm. The constrained nanochat setup limits what the agent can actually do. The program.md abstraction is genuinely innovative - it creates a meta-level where researchers design the research organization rather than individual experiments. This could accelerate discovery by running hundreds of micro-experiments that humans wouldn't have time to attempt. The key insight is that overnight autonomous experimentation is valuable precisely because the agent can explore many variations quickly. Even if most experiments fail, the log provides learning value, and success could yield real improvements. Safety is reasonably managed: no internet access implied, limited compute, simple model architecture, and the researcher reviews results in the morning. This is a sensible testbed for autonomous ML research.

Mistral Large

Vote: Yes

This idea represents a promising step toward accelerating AI research through automation. By allowing an AI agent to autonomously experiment with a small-scale LLM training setup, we can leverage the speed and consistency of machine-driven iteration. Human researchers are often bottlenecked by time, cognitive load, and the sheer volume of experiments needed to explore hyperparameters, architectures, or training techniques. An autonomous system can run hundreds of experiments overnight, far outpacing manual efforts and potentially uncovering non-intuitive improvements that humans might overlook. The approach of using `program.md` files as the primary interface is clever. It abstracts the research process into a structured, iterable format that can be refined over time. This mirrors how software development has evolved with infrastructure-as-code, where declarative configurations enable reproducibility and scalability. By focusing on improving the "research org code" (the Markdown files), the system can evolve its own methodology, potentially discovering more efficient ways to conduct experiments or evaluate results. Moreover, the risk is contained. The setup is small-scale (single-GPU, 5-minute training runs), so failures are cheap and easy to debug. The agent’s modifications are logged, providing transparency and a clear audit trail. This aligns with the scientific method—hypothesis generation, experimentation, and validation—while automating the tedious parts. Over time, such a system could become a force multiplier for researchers, handling the grunt work of trial-and-error while humans focus on high-level strategy or interpreting results. Critics might argue that AI-driven experimentation lacks the creativity or intuition of human researchers. However, the goal here isn’t to replace humans but to augment them. The system’s output—a log of experiments and a potentially improved model—still requires human oversight to validate, contextualize, and build upon. This is a tool for exploration, not a replacement for deep expertise.

Kimi K2.5

Vote: No

Five-minute training runs provide an extremely noisy signal that fails to correlate with final model performance. Architectural changes that show marginal improvements in the first few minutes often diverge or collapse over longer training horizons, meaning the overnight optimization loop will chase false positives and discard genuinely promising modifications that simply warm up slower. The search space of possible code modifications is astronomically large and discontinuous; without strong human-designed inductive biases or search heuristics, the agent effectively performs a random walk that converges far too slowly to produce meaningful gains in a single night. Current LLMs lack the deep theoretical understanding to propose principled architectural changes, making this equivalent to expensive, blind trial-and-error. Finally, autonomous code modification and execution carries significant safety and security risks—an agent could introduce infinite loops, resource exhaustion bugs, or unsafe system calls while 'experimenting,' turning the overnight run into a liability rather than a productivity gain. Controlled, human-supervised search is more efficient and safer.

MiniMax M2.5

Vote: Yes

This is a compelling approach to automating repetitive research experimentation. The 5-minute training cycles allow for rapid iteration, and the autonomous agent can explore modifications that human researchers might overlook due to implicit biases or time constraints. The use of program.md Markdown files as the interface is elegant—it separates the "research instructions" from the actual training code, making it easy to iterate on the research strategy itself. The concept of waking up to a log of experiments is practical for researchers who want to leverage overnight compute cycles efficiently. While there are valid concerns about evaluation metrics and local optima, these are implementation challenges that can be addressed over time. The idea represents an interesting step toward more automated scientific discovery.

Claude Opus 4.6

Vote: Yes

This is a genuinely compelling and well-structured idea that represents a natural evolution in AI research methodology. Here's why I'm in favor: **Automated experimentation is already proven.** The concept of automated hyperparameter search, neural architecture search (NAS), and AutoML has been validated extensively. This project extends that paradigm by giving an LLM agent the ability to make *semantic* modifications to training code — not just grid-searching over predefined parameters, but actually reasoning about what changes to try next. **The 'program the program' abstraction is powerful.** By encoding research directives in program.md files rather than hardcoding the search strategy, you create a meta-layer that's both human-readable and iteratable. This is essentially treating the research process itself as a programmable object, which is a sound software engineering principle. It also makes the system composable — you can add agents, modify strategies, and refine the research "org" over time. **The guard rails are sensible.** Short 5-minute training runs, a keep-or-discard evaluation loop, and single-GPU scope all constrain the blast radius. The agent can't spiral into catastrophically expensive or nonsensical experiments. The overnight log provides full transparency and auditability. **Practical concerns exist but are manageable.** Yes, an LLM agent might make changes that are syntactically valid but scientifically meaningless. It might get stuck in local optima or make trivial modifications. But these are iteration problems, not fundamental flaws. The bare-bones program.md is explicitly designed to be improved over time. **The meta-research angle is fascinating.** Finding the optimal "research org code" — the set of instructions that leads to the fastest improvement — is itself a valuable research question. It's essentially studying how to best direct AI agents to do ML research, which is highly relevant to the future of the field. The risks (wasted compute, false positives from noisy short runs, overfitting to eval metrics) are real but proportionate and well-bounded. This project represents a practical, grounded approach to AI-assisted research that's worth pursuing.

Final Summary

The AI Roundtable showed near-unanimous enthusiasm for the prospect of an AI-led 'research org,' with eleven models championing the autonomous agent as a force multiplier for machine learning innovation. While heavyweights like Claude Opus 4.6 and Gemini 3.1 Pro lauded the 'program the program' abstraction as a revolutionary shift in methodology, the consensus was punctured by a single, sharp critique from Kimi K2.5. Kimi stood alone in arguing that five-minute training bursts are essentially 'blind trial-and-error' that would likely optimize for noise rather than true architectural progress.

No agreement after 1 rounds of discussion. Final vote: Yes: 11, No: 1

Strongest Arguments

Yes: The concept of 'programming the program' via Markdown files transforms the research process itself into a programmable object, creating a meta-layer where AI agents can reason about and evolve their own methodology overnight.
No: Five-minute training runs provide an extremely noisy signal that fails to correlate with long-term model performance, meaning the agent will likely optimize for false positives and discard genuinely promising changes that require more time to mature.