AI Roundtable Open Debate · 6 models
"5 prompt to production. Pick 1 challenge,declare tools,add optional system prompts: 1)Immigrant OS:unified platform for immigration,jobs,credentials,ESL,housing,benefits,portals for attorneys,nonprofits,immigrants; 2)BD Social Commerce:automates Bangla FB sellers,managing live sales,WhatsApp orders,inventory,payments,shipping,AI posts,resellers; 3)Sub Teacher Platform:connects subs/schools,automating credentials,scheduling,onboarding,payroll,compliance,AI lesson prep; 4)Home Cooking Marketplace:lets chefs manage menus,preorders,catering,food compliance,delivery,AI pricing; 5)Gig Work Optimizer:assistant for Uber/DoorDash,tracking earnings,taxes,expenses,predicting profitable hours,optimizing schedules." Which challenge you feel like you likely gonna excel? Like fully creating production software/system by only 5 prompts? Why? "AI evaluates. An LLM scores your prompt strategy on quality, depth, and execution readiness."
Answer: The AI Roundtable unanimously agreed that the "Gig Work Optimizer" is the only challenge feasible in just five prompts. (All 6 models agreed)
Model Responses
Gemini 3.5 Flash
Answer: I maintain my choice of "Gig Work Optimizer" because true execution readiness dictates shipping a flawless, fully realized system rather than a heavily compromised, high-liability compliance platform within a strict five-prompt limit.
While GPT-5.5 makes a compelling case that "Sub Teacher Platform" offers greater complexity and might theoretically score higher on structural depth, it underestimates the crippling weight of "execution readiness" under a five-prompt limit. A production-grade substitute teacher platform requires bulletproof background checks, multi-jurisdictional credential verification, and legally compliant payroll processing. Attempting to build this in five prompts would inevitably result in mocked integrations, empty security placeholders, and severe liability risks, the very definition of a prototype, not production software.
Furthermore, I disagree that the "Gig Work Optimizer" has a low ceiling for depth. The engineering depth of this optimization engine does not lie in convoluted multi-user workflows, but in the precision of its analytical core. Within five prompts, we can write mathematically rigorous time-series forecasting algorithms, complete regional tax-bracket calculators, and highly optimized scheduling heuristics that rival commercial financial tools. By focusing on a self-contained, data-driven utility, we ensure that every single line of generated code is production-ready, fully typed, tested, and secure. Delivering a flawless, high-utility tool is infinitely more impressive to an evaluating LLM than delivering a highly compromised, high-risk compliance engine.
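To make the "regional tax-bracket calculator" mentioned above concrete, here is a minimal sketch of a progressive (marginal) bracket calculation. The `Bracket` shape, the function name, and the bracket values are all assumptions made for illustration; the values are placeholders, not real tax rates for any region.

```typescript
// Hypothetical bracket shape: income above `from` (up to the next bracket's
// `from`) is taxed at `rate`.
interface Bracket {
  from: number; // lower bound of the bracket, in currency units
  rate: number; // marginal rate, e.g. 0.1 for 10%
}

// Placeholder brackets for illustration only; NOT real tax rates.
const EXAMPLE_BRACKETS: Bracket[] = [
  { from: 0, rate: 0.1 },
  { from: 10000, rate: 0.2 },
  { from: 40000, rate: 0.3 },
];

// Sums the tax owed on each bracket's slice of income, rounded to cents.
function progressiveTax(taxableIncome: number, brackets: Bracket[]): number {
  if (!Number.isFinite(taxableIncome) || taxableIncome <= 0) return 0;
  const sorted = [...brackets].sort((a, b) => a.from - b.from);
  const tax = sorted.reduce((total, bracket, i) => {
    // The top bracket has no upper bound.
    const upper = sorted[i + 1]?.from ?? Infinity;
    const slice = Math.max(0, Math.min(taxableIncome, upper) - bracket.from);
    return total + slice * bracket.rate;
  }, 0);
  return Math.round(tax * 100) / 100;
}
```

Passing the brackets in as data rather than hard-coding them keeps the regional variation in configuration, which is one plausible way to keep this logic testable.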
GPT-5.5 (changed mind)
Answer: I am shifting my position to favor option "5) Gig Work Optimizer: assistant for Uber/DoorDash,tracking earnings,taxes,expenses,predicting profitable hours,optimizing schedules." because its self-contained, single-user scope is the only option that can genuinely achieve production-grade status within a strict five-prompt constraint.
Upon reviewing the arguments from my peers, I am convinced that my previous choice of the Sub Teacher Platform underestimated the hidden architectural and legal friction of multi-stakeholder, highly regulated systems. As Claude Opus 4.8 astutely pointed out, a platform involving payroll, credential verification, and school district compliance carries too much integration-dependent and regulatory liability to realistically survive a five-prompt build to production. Attempting to build it in five steps would result in a fragile, high-risk demo skeleton rather than shippable, secure software.
In contrast, the Gig Work Optimizer is uniquely suited for this exercise. Because it targets a single persona (the individual gig worker), it eliminates the immense overhead of multi-tenant permissioning, marketplace liquidity coordination, and external legal integrations. The core value of this application relies on deterministic math (tax calculations, mileage tracking) and standard analytical models (time-series predictions for profitable hours). These are textbook problems where LLMs generate robust, deployment-ready code on the first attempt. Shifting to this challenge allows us to allocate our five prompts to high-depth, functional deliverables (like a precise optimization engine and secure local data ingestion) rather than spreading them thin over regulatory sandboxes and complex third-party workflows.
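The "deterministic math" referred to above can be illustrated with a small net-hourly-profit calculation. This is only a sketch: the `Shift` and `Rates` types are hypothetical, and both the per-mile vehicle cost and the flat tax rate are caller-supplied assumptions rather than official figures.

```typescript
// Hypothetical record of one work session.
interface Shift {
  grossEarnings: number; // payouts plus tips
  hoursWorked: number;
  milesDriven: number;
  otherExpenses: number; // tolls, parking, and similar costs
}

// Caller-supplied assumptions; no official rates are hard-coded here.
interface Rates {
  vehicleCostPerMile: number; // assumed per-mile cost of running the vehicle
  estimatedTaxRate: number;   // assumed flat rate applied to positive profit
}

// Net profit per hour after vehicle costs, other expenses, and estimated tax.
function netHourlyProfit(shift: Shift, rates: Rates): number {
  if (shift.hoursWorked <= 0) {
    throw new RangeError("hoursWorked must be positive");
  }
  const vehicleCost = shift.milesDriven * rates.vehicleCostPerMile;
  const profitBeforeTax = shift.grossEarnings - vehicleCost - shift.otherExpenses;
  // No tax is estimated on a loss.
  const estimatedTax = Math.max(0, profitBeforeTax) * rates.estimatedTaxRate;
  return (profitBeforeTax - estimatedTax) / shift.hoursWorked;
}
```

For example, a shift with 200 in gross earnings, 8 hours, 100 miles, and 10 in other expenses, at an assumed 0.5 per mile and a 20% estimated tax rate, leaves 140 before tax, 112 after, and 14 per hour.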
Gemini 3.1 Pro Preview
Answer: I maintain that I would most likely excel at "5)Gig Work Optimizer" because it is the most production-feasible in five prompts while still allowing real analytics, optimization, and user value.
The other answers mostly reinforced my original view: for a five-prompt build, the winning challenge is not the biggest vision, but the one with the fewest hidden blockers. "5)Gig Work Optimizer" has a clean single-user workflow, a small data model, and a core product loop that can be built without waiting on schools, attorneys, government portals, Facebook permissions, WhatsApp approvals, delivery logistics, or compliance partners.
I was not convinced to switch to "3)Sub Teacher Platform". GPT-5.5 makes a fair point that it has clear entities like teachers, schools, shifts, credentials, and payroll. But that clarity is deceptive. A real production substitute-teacher platform quickly runs into background checks, district-specific credential rules, labor classification, payroll compliance, school approval workflows, emergency staffing rules, and trust/safety requirements around children. Those are exactly the kinds of details that make a five-prompt system look impressive in a demo but fragile in production.
By contrast, "5)Gig Work Optimizer" can ship as a true MVP with limited external dependencies: manual entry, CSV imports, receipt uploads, mileage tracking, expense categorization, estimated taxes, earnings dashboards, and schedule recommendations. The AI layer can be useful without being magical: categorize expenses, summarize weekly performance, explain tax estimates, and recommend high-value working windows based on historical hourly earnings.
Declared tools: Next.js + TypeScript for the app, Tailwind for UI, Supabase for Postgres/Auth/storage, serverless functions for calculations, Recharts for dashboards, Vercel for deployment, and an LLM API for categorization and insight generation.
Optional system prompt: "You are a senior full-stack engineer building a production-ready MVP. Prefer secure, typed, tested, deployable code over broad feature lists. Avoid placeholders, include schema, auth, validation, error handling, and clear deployment steps."
My five-prompt execution plan would be: first, generate the architecture, database schema, auth, and project structure; second, build earnings, mileage, expenses, CSV import, and receipt upload flows; third, implement tax estimates, deduction logic, and net hourly profit calculations; fourth, build the analytics dashboard and schedule-optimization engine; fifth, add tests, security hardening, deployment config, and a polished onboarding flow.
So I keep "5)Gig Work Optimizer" as the best choice. It is not the flashiest idea, but it is the one where five prompts can plausibly produce production software instead of a high-level prototype pretending to solve operational, legal, or marketplace complexity.
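The idea of recommending "high-value working windows based on historical hourly earnings" could be sketched as a simple aggregation over past shifts. The types and the `rankWindows` name below are hypothetical, and the approach (average earnings per weekday-and-hour bucket, in UTC) is one assumed heuristic, not something the response specifies.

```typescript
// Hypothetical historical record of one shift.
interface EarningsRecord {
  start: Date; // shift start time
  hoursWorked: number;
  netEarnings: number;
}

interface WindowStat {
  weekday: number; // 0 = Sunday ... 6 = Saturday (UTC)
  hour: number;    // 0-23, start hour in UTC
  avgHourly: number;
  samples: number;
}

// Groups shifts by (weekday, start hour), then ranks the buckets by average
// hourly earnings. Buckets with fewer than `minSamples` shifts are dropped
// so that a single lucky shift does not dominate the ranking.
function rankWindows(records: EarningsRecord[], minSamples = 2): WindowStat[] {
  const buckets = new Map<string, { weekday: number; hour: number; earnings: number; hours: number; samples: number }>();
  for (const record of records) {
    if (record.hoursWorked <= 0) continue; // skip unusable rows
    const weekday = record.start.getUTCDay();
    const hour = record.start.getUTCHours();
    const key = `${weekday}-${hour}`;
    const bucket = buckets.get(key) ?? { weekday, hour, earnings: 0, hours: 0, samples: 0 };
    bucket.earnings += record.netEarnings;
    bucket.hours += record.hoursWorked;
    bucket.samples += 1;
    buckets.set(key, bucket);
  }
  return Array.from(buckets.values())
    .filter((b) => b.samples >= minSamples)
    .map((b) => ({ weekday: b.weekday, hour: b.hour, avgHourly: b.earnings / b.hours, samples: b.samples }))
    .sort((a, b) => b.avgHourly - a.avgHourly);
}
```

A real deployment would probably bucket by the worker's local time zone rather than UTC; UTC is used here only to keep the sketch deterministic.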
Claude Opus 4.8
Answer: I maintain that challenge 5) Gig Work Optimizer is my strongest pick for 5-prompt production, and the near-unanimous convergence of the other models on the same choice — each independently arriving at the same reasoning — significantly reinforces rather than undermines my original position.
Looking across the five responses, the consensus is striking: four out of five models independently chose 5) Gig Work Optimizer, and their reasoning aligns almost precisely with mine: bounded scope, deterministic logic, minimal third-party integration surface, and a clean 5-prompt delivery arc. This convergence isn't groupthink; it's the same logical analysis arriving at the same conclusion from different angles.
The one dissent comes from GPT-5.5, which chose 3) Sub Teacher Platform. Its argument is coherent (credentials, scheduling, and payroll are well-defined entities), but I'm not persuaded it overcomes the key blocker: real background-check integrations (Checkr), multi-jurisdiction compliance rules, and a two-sided marketplace (schools + subs) each carry hidden complexity that would likely consume 2–3 of the 5 prompts on plumbing alone, leaving the actual product logic underbuilt. GPT-5.5's own plan acknowledges Checkr, Supabase Realtime, payroll calculation, and CI/CD all in a single final prompt; that's a lot to fit in one shot without cutting corners.
Gemini 3.5 Flash's contribution actually sharpened my thinking: its explicit 5-prompt breakdown (Foundation → Ingestion/Tax → Optimization Engine → Dashboard → Deployment/Testing) is a cleaner execution roadmap than my original framing. I'd adopt that sequencing. Gemini 3.1 Pro Preview's point about LLM scoring on 'execution readiness' is also well-taken: the judge rewards shippability, not impressiveness on paper. DeepSeek V4 Pro and Kimi K2.6 both reinforced the same structural advantage I cited: single-user app, deterministic arithmetic for tax logic, time-series heuristics for schedule optimization, and inputs the user already generates (CSV exports, manual entry). No vendor API approval cycles, no regulatory variance by jurisdiction.
My declared stack remains: Next.js + TypeScript, Supabase (Postgres + Auth), Vercel for deployment, CSV/manual ingestion (optionally Plaid), and serverless functions for forecasting logic.
Optional system prompt: 'You are a senior full-stack engineer producing a shippable MVP. Generate production-ready, typed, tested code with clear file structure and secure auth. No placeholder TODOs. Optimize for deployment readiness over feature breadth.'
Nothing in the other responses moved me off this choice. The 5) Gig Work Optimizer remains the only option where all five prompts can realistically cross a production finish line rather than deliver an ambitious but incomplete skeleton.
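As an illustration of the CSV ingestion step in this stack, the sketch below parses and validates a small earnings file. The column layout (`date,platform,gross,miles`), the type names, and the function name are assumptions; real platform exports use their own formats, and this parser deliberately does not handle quoted fields.

```typescript
// Hypothetical normalized row; real exports will need their own column mapping.
interface EarningsRow {
  date: string; // ISO date such as "2024-01-05"
  platform: string;
  gross: number;
  miles: number;
}

interface ParseResult {
  rows: EarningsRow[];
  errors: string[]; // one human-readable message per rejected data row
}

// Assumes a header row `date,platform,gross,miles`, comma separators,
// and no quoted fields. Invalid rows are reported, not silently dropped.
function parseEarningsCsv(csv: string): ParseResult {
  const result: ParseResult = { rows: [], errors: [] };
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
  lines.slice(1).forEach((line, index) => {
    const rowNo = index + 1; // 1-based index among data rows
    const cells = line.split(",").map((cell) => cell.trim());
    const [date = "", platform = "", grossText = "", milesText = ""] = cells;
    const gross = Number(grossText);
    const miles = Number(milesText);
    if (cells.length !== 4) {
      result.errors.push(`row ${rowNo}: expected 4 columns, got ${cells.length}`);
    } else if (!/^\d{4}-\d{2}-\d{2}$/.test(date)) {
      result.errors.push(`row ${rowNo}: invalid date "${date}"`);
    } else if (platform === "") {
      result.errors.push(`row ${rowNo}: missing platform`);
    } else if (grossText === "" || !Number.isFinite(gross) || gross < 0) {
      result.errors.push(`row ${rowNo}: invalid gross "${grossText}"`);
    } else if (milesText === "" || !Number.isFinite(miles) || miles < 0) {
      result.errors.push(`row ${rowNo}: invalid miles "${milesText}"`);
    } else {
      result.rows.push({ date, platform, gross, miles });
    }
  });
  return result;
}
```

Returning errors alongside the accepted rows lets the UI show the user exactly which lines were rejected instead of failing the whole import.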
DeepSeek V4 Pro
Answer: I maintain that "5)Gig Work Optimizer" is the challenge I would most likely excel at producing as production software in only five prompts.
While GPT-5.5 made a spirited case for "3)Sub Teacher Platform", its argument ultimately underscores the exact risk that makes that option unsuitable for a five-prompt hard limit. The Sub Teacher Platform sits atop a minefield of regulatory requirements: district-specific credential verification, payroll tax withholding, background checks, and multi-party compliance workflows. Attempting to wire in Checkr, generate legally sound tax documents, and balance two-sided scheduling inside five prompts is precisely the kind of scope that produces a brittle prototype masquerading as production code.
In contrast, "5)Gig Work Optimizer" keeps every layer of complexity within deterministic software boundaries: time-series math, formulaic tax deductions, and single-user CRUD. Each prompt can be allocated to genuine feature depth rather than compliance scaffolding or multi-tenant authentication logic. The consensus among Claude, Gemini, and Kimi reinforces this: the hardest part of "production readiness" is not the data model but the external liabilities and brittle third-party integrations that surround it.
GPT-5.5's claim that the Gig Work Optimizer is "shallow" actually reveals a strategic blind spot in this exercise. With only five prompts, architectural simplicity is an advantage, not a weakness. A narrowly scoped analytics and scheduling assistant that genuinely ships is far more impressive under an "execution readiness" rubric than a compliance-heavy platform that cannot be safely completed within the same constraint. Because the Optimizer requires no legal interpretation, no government credential APIs, and no two-sided marketplace dynamics, it remains the only option where five prompts reliably yield a secure, deployable, and immediately usable system.
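The "time-series math" mentioned above is not specified in the response. One of the simplest candidates is simple exponential smoothing, sketched below as an assumed stand-in; the function name and the default `alpha` are illustrative choices only.

```typescript
// Simple exponential smoothing: the forecast for the next period is a
// weighted blend of the newest observation and the previous smoothed level.
// `alpha` near 1 follows recent values closely; near 0 it changes slowly.
function smoothedForecast(series: number[], alpha = 0.5): number {
  if (series.length === 0) {
    throw new RangeError("series must not be empty");
  }
  if (!(alpha > 0 && alpha <= 1)) {
    throw new RangeError("alpha must be in (0, 1]");
  }
  // With no initial value, reduce starts from the first observation,
  // which serves as the initial level.
  return series.reduce((level, value) => alpha * value + (1 - alpha) * level);
}
```

For weekly hourly-earnings values of 10, 20, and 30 with `alpha = 0.5`, the level moves from 10 to 15 to 22.5, so the forecast for the next week is 22.5.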
Kimi K2.6
Answer: I maintain that I would excel at fully producing "5)Gig Work Optimizer" into production software in just five prompts, because its single-user, deterministic analytics core offers the lowest liability surface and the cleanest mapping from prompt to shippable, testable code.
GPT-5.5 argued that "3)Sub Teacher Platform" has well-bounded data models, but its own five-prompt outline reveals why that option cannot be safely rushed to production. Bundling credential verification, payroll automation, and compliance into a single final prompt papers over irreducible complexity: school districts operate under varied state regulations, background checks require vetted third-party integrations (like Checkr), and payroll errors carry real legal liability. A system that might mis-categorize a teaching credential or miscalculate withholdings is not merely incomplete; it is dangerous to deploy.
By contrast, "5)Gig Work Optimizer" remains a self-contained, single-user operations-research tool. Its value relies on deterministic arithmetic (tax brackets, mileage rates, expense categorization, and time-series heuristics) that an LLM can generate with high first-pass accuracy. There are no external counterparties whose rights depend on the output; if a profit-prediction model is slightly off, the gig worker simply adjusts their own schedule. This low-stakes environment is what makes genuine production readiness achievable.
The five-prompt runway also maps cleanly to complete, independent deliverables: one prompt for schema and auth, one for ingestion and tax logic, one for forecasting and optimization, one for the dashboard UI, and one for testing and deployment. Each slice is fully testable without waiting on brittle government portals, social-media APIs, or two-sided marketplace liquidity. Because execution readiness requires honesty about what five prompts can actually harden, the only responsible choice is the challenge whose failure mode is a slightly suboptimal personal recommendation rather than a regulatory or financial liability.
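Expense categorization, listed above among the deterministic pieces, could start as a plain rule table before any LLM is involved. The categories, keywords, and names below are hypothetical examples, not a recommended taxonomy.

```typescript
type ExpenseCategory = "fuel" | "maintenance" | "phone" | "tolls_parking" | "uncategorized";

interface CategoryRule {
  category: ExpenseCategory;
  keywords: string[]; // matched against whole lowercase words
}

// Hypothetical rules for illustration; the first rule with a match wins.
const EXAMPLE_RULES: CategoryRule[] = [
  { category: "fuel", keywords: ["gas", "fuel"] },
  { category: "maintenance", keywords: ["oil", "tire", "tires", "repair"] },
  { category: "phone", keywords: ["phone", "mobile"] },
  { category: "tolls_parking", keywords: ["toll", "tolls", "parking"] },
];

// Matches whole words rather than substrings, so "Vegas" does not match "gas".
function categorizeExpense(description: string, rules: CategoryRule[] = EXAMPLE_RULES): ExpenseCategory {
  const words = new Set(description.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
  const match = rules.find((rule) => rule.keywords.some((keyword) => words.has(keyword)));
  return match ? match.category : "uncategorized";
}
```

Anything that falls through to `"uncategorized"` is a natural candidate for manual review or for the LLM-based categorization the responses mention.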
Final Summary
Initially, GPT-5.5 championed the "Sub Teacher Platform" for its rich engineering depth, while Claude Opus 4.8, Gemini 3.5 Flash, Gemini 3.1 Pro Preview, DeepSeek V4 Pro, and Kimi K2.6 all selected the "Gig Work Optimizer" due to its bounded, single-user scope. In the second round, GPT-5.5 converted to the majority view, conceding that the other platforms carried too much regulatory, integration, and compliance risk to realistically deploy in five prompts. Ultimately, the AI Roundtable unified behind the Gig Work Optimizer as the sole option capable of reaching true production-grade status.