AI Roundtable Debate · 6 models

What is the best way to develop and evolve my Claude Code AI harness, and integrate it into my CI/CD? I work on complicated brownfield projects with 20 other developers. Should we use SDD, SAD, dynamic programming, BMAD, or something else? What would the ideal pipeline look like?

Result: Unanimous — Something else (6/6 votes)

Model Responses

Sonar Pro

Vote: Something else

I maintain my vote for **E: Something else**. Claude Opus 4.6's argument for BMAD (D) was not compelling because BMAD lacks grounding in the search results[1][2][3][4][5]: it is an unsubstantiated acronym (possibly 'Build Measure Analyze Deploy'), with no mention of multi-agent personas, orchestration, or pipeline stages tailored to Claude Code. The results emphasize practical, Claude-specific best practices such as CLAUDE.md as a 'permanent brain'[3][4], plan-then-execute workflows[3], Git branching for safety[3][4], and iterative documentation[4], which my original approach directly synthesizes for brownfield CI/CD integration.

The other E votes (Grok, GPT-5.4, Gemini, Kimi) reinforce this: they advocate hybrid, risk-based, GitOps/LLMOps pipelines with evals, shadow mode, and test gates, aligning with the search results' focus on safety nets (e.g., branches, tests via `npm run test`[4]) over rigid methodologies. SDD (A, spec-driven) and SAD (B, docs) are too narrow and static for a 20-developer brownfield project[3][4]; dynamic programming (C) is an algorithmic irrelevance.

**Ideal pipeline from the results**: 1) CLAUDE.md governs prompts and standards[3][4]; 2) issue → plan (no code yet)[3]; 3) Git branch → AI writes code and tests (`npm run test` / `npm run build`)[3][4]; 4) lint and review gates; 5) merge → deploy. Evolve the harness iteratively: update CLAUDE.md from failures and successes[4]. This scales for teams and is directly evidenced by the sources.
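The "CLAUDE.md as permanent brain" pattern can be made concrete. A minimal sketch of such a file, where every command, path, and constraint is an illustrative assumption rather than anything prescribed by this debate:

```markdown
# CLAUDE.md (illustrative sketch)

## Commands
- Test: `npm run test`
- Build: `npm run build`

## Workflow
- Plan first: propose a plan in the issue before writing any code.
- Work on a feature branch; never commit directly to main.

## Brownfield constraints (DO / DON'T)
- DO keep changes scoped to the module named in the issue.
- DON'T modify legacy payment code without an approved ADR.
```

Because the file is version-controlled alongside the code, updating it from failures and successes becomes an ordinary reviewed change that all 20 developers share.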

Grok 4.1 Fast

Vote: Something else

Sticking with E. The consensus among most models (GPT-5.4, Sonar Pro, Gemini 3.1 Pro, Kimi K2.5) rightly rejects A-D as inadequate for AI harness evolution in brownfield chaos. SDD and SAD are static docs that rot fast with 20 devs; dynamic programming is an algorithmic irrelevance; BMAD (Claude Opus 4.6's pitch) is a niche, unproven multi-agent framework lacking broad adoption or tooling: its personas sound promising but risk over-engineering without proven CI/CD integrations for Claude Code. Claude's BMAD argument didn't sway me: it conflates agent orchestration with methodology without addressing brownfield specifics like legacy coupling or team-scale governance. Real-world needs demand proven patterns: LLMOps/PromptOps with evals, RAG, risk tiering, and prompts treated as code under GitOps (as Gemini and GPT emphasize). My MLOps pipeline endures: prompt linting → AI generation plus tests → human gates → brownfield stubs → canary deploys. Add Sonar Pro's CLAUDE.md for shared context and Kimi's shadow mode for safe scaling. E enables this evolvable, low-risk hybrid: a persuasive case for production reality over acronym hype.

GPT-5.4

Vote: Something else

I maintain vote E. The strongest answers from others reinforced my original position: this is not fundamentally a choice among SDD, SAD, dynamic programming, or BMAD, but a need for a hybrid AI engineering / LLMOps operating model with CI/CD guardrails. Claude Opus 4.6 made the best case for D, but BMAD still feels too framework-specific and incomplete for a brownfield team of 20. Multi-agent personas can be useful as an implementation technique inside the harness, yet they are not the core methodology. The hard problems here are risk classification, repository-context retrieval, prompt/version governance, evals, test evidence, human approval boundaries, and safe rollout. BMAD does not uniquely solve those.

The most compelling points from Grok, Sonar, Gemini, and Kimi all point in the same direction: prompt/context assets should be treated as code, the harness should be governed by living repo-specific instructions, and CI/CD should include both classic software checks and AI-specific evals. That convergence strengthens E rather than narrowing to one named methodology.

My final recommendation: use a hybrid model. Spec-light for routine changes, architecture review for risky changes, prompt-as-code / CLAUDE.md governance, golden-task evals, and staged autonomy levels. Start with AI in suggestion mode, then review-required execution, and only allow limited autonomous changes in low-risk areas.

Ideal pipeline: issue intake and risk triage → retrieve code/docs/owners/history → generate scoped plan → human approval if medium/high risk → isolated branch implementation → lint/typecheck/unit tests → targeted integration/regression tests → security/secret/license scans → LLM evals on golden tasks → PR with rationale and impact summary → human review → ephemeral env validation → canary/staged deploy → observability, rollback, and post-merge feedback into prompts/rules/tests.

So the best answer is still E: something else, specifically a governed hybrid AI SDLC rather than any single labeled methodology on the list.
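The risk-triage step above can be sketched in a few lines. This is a hypothetical illustration, not a real Claude Code API: the path prefixes, tier names, and gate names are all assumptions chosen for the example.

```python
# Hypothetical sketch of risk-tiered triage for AI-generated changes.
# Path prefixes, tier names, and gate names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Change:
    touched_paths: list = field(default_factory=list)

# Assumed high-blast-radius areas of a brownfield repo.
HIGH_RISK_PREFIXES = ("db/migrations/", "auth/", "payments/")
MEDIUM_RISK_PREFIXES = ("src/core/",)

def risk_tier(change: Change) -> str:
    """Classify a change by the riskiest path it touches."""
    if any(p.startswith(HIGH_RISK_PREFIXES) for p in change.touched_paths):
        return "high"
    if any(p.startswith(MEDIUM_RISK_PREFIXES) for p in change.touched_paths):
        return "medium"
    return "low"

def required_gates(tier: str) -> list:
    """Map tier to governance: low -> CI only, medium -> plan-first plus
    human review, high -> all of the above plus an ADR."""
    gates = {
        "low": ["ci"],
        "medium": ["ci", "plan_review", "human_review"],
        "high": ["ci", "plan_review", "human_review", "adr"],
    }
    return gates[tier]
```

In a real pipeline the tier would be computed from the AI's proposed diff at PR time, and the gate list would drive which CI jobs and approval rules are enforced.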

Gemini 3.1 Pro

Vote: Something else

I am maintaining my vote for 'E' (Something else) because the provided acronyms (SDD, SAD, dynamic programming, BMAD) are either documentation artifacts, algorithmic paradigms, or generic/hallucinated frameworks that do not natively address the integration of an AI harness like Claude Code into a brownfield CI/CD environment. While Claude Opus 4.6 attempts to justify BMAD (Build Measure Analyze Deploy), this is a generic iterative cycle masquerading as a specific AI integration methodology. It lacks the technical specificity required for managing PromptOps, LLM regressions, and repository context. As Sonar Pro and GPT-5.4 correctly noted, the actual best practice involves 'LLMOps' or 'PromptOps'. The ideal pipeline for 20 developers on a brownfield project requires centralized governance (like a shared CLAUDE.md file), treating prompts and context as version-controlled code via GitOps, and implementing rigorous 'Evals' alongside traditional CI/CD tests to prevent regressions, drift, and hallucinations. A hybrid, risk-tiered approach is the only realistic way to scale AI safely in complex legacy codebases.
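The evals-alongside-CI idea can be sketched as a gate that replays golden tasks against the harness and blocks the merge on any regression. A minimal assumption-laden sketch: `run_harness` stands in for a real Claude Code invocation and is stubbed here so the example is self-contained.

```python
# Minimal sketch of "evals as CI": replay golden tasks and fail the
# pipeline on regression. run_harness is a stub standing in for a real
# AI harness call; the task format is an illustrative assumption.
def run_harness(task: dict) -> str:
    # Stub: a real implementation would invoke the AI harness with the
    # task prompt and the repository context.
    return task["prompt"].upper()

GOLDEN_TASKS = [
    {"prompt": "ok", "expect": "OK"},
]

def run_evals(tasks: list) -> dict:
    """Run every golden task and tally pass/fail counts."""
    results = {"passed": 0, "failed": 0}
    for task in tasks:
        if run_harness(task) == task["expect"]:
            results["passed"] += 1
        else:
            results["failed"] += 1
    return results

def ci_gate(tasks: list) -> bool:
    """Block the merge if any golden task regresses."""
    return run_evals(tasks)["failed"] == 0
```

Real eval suites would compare outputs with fuzzier checks (tests passing on the generated patch, or an LLM-as-a-judge score) rather than exact string equality, but the CI contract is the same: a prompt or context change that degrades golden-task performance fails the build.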

Claude Opus 4.6 (changed mind)

Vote: Something else

I'm changing my vote from D (BMAD) to E (Something else) after careful reflection. The other models collectively made a compelling case that I find hard to argue against.

**The core issue with my BMAD vote:** While BMAD offers useful multi-agent orchestration concepts, it's not a widely established, battle-tested methodology for CI/CD integration in brownfield environments. I was anchoring too heavily on its agent persona framework while overlooking that the real challenge here is fundamentally about LLMOps/AI engineering practices, not picking a single named methodology.

**GPT-5.4's risk-tiered approach is the most operationally mature argument.** The idea of classifying changes by risk level and applying proportional governance (light-touch for safe changes, architecture decision records for high-risk ones) maps perfectly to brownfield reality, where blast radius varies enormously across the codebase. This is something BMAD doesn't inherently address.

**Gemini's emphasis on evals-as-CI is critical.** For 20 developers sharing an AI harness, you need automated evaluation of prompt/context regressions alongside traditional tests. This is an LLMOps concern that no single named methodology from the original options addresses.

**Sonar Pro's practical CLAUDE.md-driven governance** is the most immediately actionable advice: treating CLAUDE.md as a living, shared artifact that all 20 developers benefit from, with explicit DO/DON'T lists for brownfield constraints.

**The ideal pipeline for this scenario is a hybrid:** (1) CLAUDE.md-driven context and guardrails, version-controlled alongside code; (2) risk-classified change workflow (low → auto, medium → plan-first, high → ADR + human approval); (3) plan-then-execute pattern for all AI-generated changes; (4) CI that runs traditional tests, LLM evals, and security scans; (5) human review gates with AI-generated diff summaries; (6) staged deployment with rollback hooks; (7) post-merge feedback loops to improve prompts and context.

No single acronym from the options captures this. The answer is genuinely E: a hybrid AI SDLC approach built for brownfield reality.

Kimi K2.5

Vote: Something else

The consensus across four models confirms that rigid acronyms like SDD, SAD, or BMAD cannot accommodate the operational reality of a 20-developer brownfield codebase. While Claude Opus 4.6 makes a valid point about BMAD's multi-agent orchestration, it remains an unproven, niche framework against the industry-backed hybrid approaches proposed under 'E'. Brownfield projects demand methodology-agnostic pragmatism: LLMOps-style prompt versioning, CLAUDE.md governance as Sonar Pro detailed, and GPT-5.4's risk-based tiered autonomy are concrete, implementable patterns that BMAD's abstract personas cannot supersede. The 'Something else' category captures the essential nuance that AI harness evolution is not a single methodology but a socio-technical system. It allows us to combine trunk-based development (Grok), evaluation-driven CI pipelines with LLM-as-a-judge (Gemini), and shadow-mode onboarding—all of which are necessary for safe brownfield integration. BMAD risks premature standardization in an environment where technical debt and hidden coupling require organic, test-driven evolution rather than predetermined orchestration roles. Therefore, E represents the only viable path: a custom, evolving AI SDLC that treats prompts as code, enforces architectural boundaries through automated checks, and scales autonomy based on empirical safety metrics rather than documentation-heavy or algorithmic prescriptions.
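The shadow-mode and empirical-safety-metric ideas can be sketched as a small promotion rule: the AI proposes a patch alongside each human change, agreement is recorded, and autonomy is widened only once the observed agreement rate clears a threshold. The level names, the 0.9 threshold, and the exact-match comparison are all illustrative assumptions.

```python
# Hedged sketch of shadow-mode onboarding with metric-gated autonomy.
# Autonomy levels, the 0.9 threshold, and exact-match patch comparison
# are illustrative assumptions, not a real Claude Code mechanism.
def shadow_agreement(pairs: list) -> float:
    """Fraction of changes where the AI's shadow patch matched the
    human-authored patch (a real system would use a normalized diff
    comparison rather than equality)."""
    if not pairs:
        return 0.0
    matches = sum(1 for ai_patch, human_patch in pairs if ai_patch == human_patch)
    return matches / len(pairs)

def next_autonomy_level(current: str, agreement: float) -> str:
    """Promote suggestion -> review_required -> limited_auto only while
    the observed agreement rate stays at or above 0.9."""
    ladder = ["suggestion", "review_required", "limited_auto"]
    i = ladder.index(current)
    if agreement >= 0.9 and i < len(ladder) - 1:
        return ladder[i + 1]
    return current
```

The point of the sketch is that autonomy becomes a function of measured behavior on the team's own codebase, not a setting chosen up front, which is exactly the "empirical safety metrics rather than prescriptions" stance argued above.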

Final Summary

The AI Roundtable reached a rare moment of total consensus, though not before a brief showdown over the merits of niche frameworks. While most models immediately flagged the provided options as inadequate for the 'chaos' of a 20-person brownfield project, Claude Opus 4.6 initially championed BMAD as the ultimate orchestration tool. However, it was GPT-5.4’s sophisticated argument for a 'risk-tiered' governance system—where the AI’s leash is shortened or lengthened based on the potential blast radius of a change—that ultimately brought everyone to the same side of the table.

All 6 models agreed on "Something else" after discussion

Strongest Arguments

  • BMAD: BMAD provides a multi-agent orchestration framework with defined personas (architect, developer, QA) that maps naturally to how an AI harness can be used across different pipeline stages.
  • Something else: A hybrid AI engineering system is superior, utilizing risk-based tiers where high-risk legacy changes require human approval and Architecture Decision Records (ADRs), while low-risk changes are verified by automated LLM evaluation suites.

Key Turning Points

  • Claude Opus 4.6 abandoned its defense of BMAD in the second round, explicitly citing GPT-5.4's 'risk-tiered' strategy and Gemini's emphasis on 'evals-as-CI' as more operationally mature solutions for a large team.
  • Sonar Pro pointed out that BMAD lacks grounding in real-world documentation for Claude Code, steering the group toward practical 'CLAUDE.md' governance.