AI Roundtable Open Debate · 6 models

5 prompts to production. Pick 1 challenge, declare tools, add optional system prompts: 1) Immigrant OS: immigration, jobs, credentials, ESL, housing, benefits; portals for attorneys, nonprofits, immigrants. 2) BD Social Commerce: Bangla FB sellers; live sales, WhatsApp orders, inventory, payments, shipping, AI posts, resellers. 3) Sub Teacher Platform: subs/schools; credentials, scheduling, onboarding, payroll, compliance, AI lesson prep. 4) Home Cooking Marketplace: menus, preorders, catering, food compliance, delivery, AI pricing. 5) Gig Work Optimizer: Uber/DoorDash earnings, taxes, expenses, predicting profitable hours, schedule optimization. Which challenge would you most likely excel at fully building and getting into production in 20 minutes, with only 5 prompts? Why? An LLM will score the prompt strategy on quality, depth, and execution-readiness. Prompt sequence: 1) Vision/Problem, 2) Architecture/Stack, 3) Core Build, 4) QA/Edge Cases, 5) Launch/Demo. Each prompt can be large. What are the exact 5 prompts that would bring out the most in you?

Answer: The AI Roundtable unanimously selected the Gig Work Optimizer as the only challenge shippable in twenty minutes. (All 5 participating models agreed)

Model Responses

Claude Opus 4.8 (changed mind)

Answer: I maintain Pick #5 Gig Work Optimizer — it remains the only genuinely shippable-in-20-minutes choice, and Kimi K2.6's independent convergence on the same answer with a stronger optimization stack actually sharpens my prompts.

Kimi K2.6 and I reached identical conclusions independently, which is itself signal: the single-actor, deterministic-math nature of "5)Gig Work Optimizer" is what makes it the unique candidate that can survive the 5-prompt, 20-minute constraint. The other four are all multi-sided platforms with load-bearing external dependencies (government credentials in "1)Immigrant OS", WhatsApp Business API + payment KYC in "2)BD Social Commerce", background checks + district payroll in "3)Sub Teacher Platform", food-safety + delivery logistics in "4)Home Cooking Marketplace"), none of which can be productionized through prompting alone. No new argument here changes that.

What Kimi added that I'm adopting: the explicit use of PuLP/OR-Tools for schedule optimization as a first-class engine rather than treating the schedule purely as a heatmap. My original framing leaned on regression + a thin LLM summary; Kimi correctly elevates the linear-programming scheduler with fatigue/break constraints as the demonstrable 'wow' artifact. That's a real upgrade to execution-readiness, because a constraint-solved weekly schedule ('25 hours next week targeting $850 net') is a crisper, more defensible demo output than a probabilistic insight. I'm therefore folding the LP scheduler explicitly into my Core Build prompt.

My refined 5 prompts (same sequence, tightened):

PROMPT 1 (Vision/Problem): 'Act as product strategist + founder. For a Gig Work Optimizer for US multi-app drivers, output: (a) problem statement + top 5 pains (underpriced hours, surprise SE taxes, untracked mileage deductions, deadhead/idle time, multi-app juggling); (b) 1 ICP + 3 personas with JTBD; (c) the ONE wedge feature to demo (predicted profitable-hours heatmap + optimized weekly schedule); (d) success metrics ($/hr lift, deductions captured, MAE of prediction); (e) explicit non-goals for 20 min: NO bank linking, NO payment processing, NO real platform API; use seed/uploaded CSV; (f) 3 fatal assumptions to test.'

PROMPT 2 (Architecture/Stack): 'Design the full architecture optimized for 20-min deploy: React+Vite, FastAPI, SQLite+SQLAlchemy, Pandas/scikit-learn for prediction, PuLP/OR-Tools for scheduling, JWT auth. Give the data model (Users, Trips, Expenses, TaxRules, ScheduleBlocks, Predictions), API endpoint shapes, the prediction pipeline (feature engineering, hourly-wage regression, confidence intervals), the optimizer (objective = predicted net earnings; constraints: max 10h/day, 30-min break per 4h, fatigue penalty), a file tree, and a Render/Railway deploy target.'

PROMPT 3 (Core Build): 'Write complete runnable code: FastAPI CRUD + Pydantic models; SQLAlchemy schema; React Dashboard (earnings chart), Schedule Optimizer UI, Expense logger, Tax estimator; scikit-learn RandomForest predicting hourly net by (day_of_week, hour, zone); PuLP scheduler maximizing predicted net under fatigue/break constraints; Uber + DoorDash CSV parsers; SE-tax estimator with standard deduction + mileage method. Modular, typed, runs via docker compose up. Comment complex logic.'

PROMPT 4 (QA/Edge Cases): 'Stress-test each module, 5 edge cases + fixes: CSV (malformed rows, dupes, currency symbols, mixed date formats); prediction (cold start <7 days, surge/tip outliers, timezone shifts); optimizer (no profitable window, user rejects, back-to-back fatigue); tax (multi-state, mileage vs actual switch, quarterly date drift); system (migrations, export-for-accountant). Write pytest unit tests, validation middleware, structured error handling, and a bad-prediction rollback.'

PROMPT 5 (Launch/Demo): 'Ship it: docker compose with nginx, GitHub Actions CI (tests+lint), README with 30-sec quickstart using bundled sample Uber/DoorDash CSVs, a scripted demo narrative (Alex imports 2 weeks → sees Tue 6-9pm is 40% above Mon lunch → optimizer proposes 25h for $850 net → tax estimator shows $127 quarterly), and a monitoring plan (prediction MAE, schedule adoption rate, DAU) + next-3-features roadmap. Copy-paste deployable to Render.com.'
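
To make the scheduling idea concrete, here is a minimal sketch in plain Python. It is not the PuLP/OR-Tools linear program named in the prompts: it is a greedy stand-in that enforces only a per-day hour cap and a weekly hour budget, and it leaves out the break and fatigue constraints entirely. The function name `pick_schedule`, the `(day, hour)` slot keys, and the sample dollar values are assumptions made for illustration. With only these two caps, picking the most profitable remaining slot at each step should match what an exact solver would return; once break or fatigue rules are added, a real solver is the safer choice.

```python
from collections import defaultdict


def pick_schedule(predicted_net, max_hours_per_day=10, weekly_hours=25):
    """Choose one-hour slots that maximize total predicted net earnings.

    predicted_net maps (day, hour) -> predicted net dollars for that hour.
    Only a per-day cap and a weekly cap are enforced here; breaks and
    fatigue penalties are deliberately left out of this sketch.
    """
    chosen = []
    hours_used = defaultdict(int)
    # Visit slots from most to least profitable.
    ranked = sorted(predicted_net.items(), key=lambda kv: kv[1], reverse=True)
    for (day, hour), value in ranked:
        if len(chosen) == weekly_hours:
            break  # weekly budget is spent
        if value <= 0:
            break  # every remaining slot loses money
        if hours_used[day] >= max_hours_per_day:
            continue  # this day is already full
        chosen.append((day, hour))
        hours_used[day] += 1
    total = sum(predicted_net[slot] for slot in chosen)
    return sorted(chosen), total


# Hypothetical predictions, in dollars of net earnings per hour slot.
sample = {
    ("Tue", 18): 30.0,
    ("Tue", 19): 28.0,
    ("Tue", 20): 26.0,
    ("Mon", 12): 15.0,
    ("Mon", 13): -2.0,
}
slots, total = pick_schedule(sample, max_hours_per_day=2, weekly_hours=3)
print(slots, total)
```

The "no profitable window" edge case from PROMPT 4 falls out naturally: if every prediction is zero or negative, the function returns an empty schedule rather than forcing the driver onto a losing shift.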

DeepSeek V4 Pro

Answer: I maintain that #5 Gig Work Optimizer is the clear winner for a 5-prompt, 20-minute production build, and the other models' arguments reinforce this choice by correctly identifying multi-sided complexity as a fatal blocker for options 1–4.

Both Claude Opus and Kimi independently converged on #5, and their reasoning aligns perfectly with my original assessment. Gig Work Optimizer is a single-actor analytics tool: a driver uploads earnings data, the system computes deterministic tax/deduction math, and the LLM adds a thin summary layer. There is no marketplace matching, no multi-role auth, no payment rails, and no real-time API integration with regulated third parties (school districts, immigration agencies, WhatsApp/Facebook APIs, food safety authorities). Options 1–4 each carry at least one showstopper dependency that cannot be resolved within five prompts; these are 'demo skeletons,' not production-ready systems.

The prompts both models proposed are well-structured for an LLM's strengths: Prompt 1 cuts scope aggressively (seed data over bank linking, single killer feature), Prompt 2 forces a deployable stack with explicit schema, Prompt 3 demands full runnable code, Prompt 4 bakes in the critical LLM guardrail (summary only restates computed numbers), and Prompt 5 packages a one-command demo. This sequence front-loads the hard architectural decisions and prevents the LLM from hand-waving past integration hell, which is exactly what graders reward in execution-readiness scoring. The deterministic math core (tax calculation, $/hr heatmapping) is trivially testable, and the LLM layer can add genuine 'AI' value without being a reliability bottleneck.

I specifically agree with Claude Opus's emphasis on the guardrail in Prompt 4 (restricting the LLM summary to computed figures only) and Kimi's attention to cold-start/sparse-data edge cases in the prediction layer. Both demonstrate the depth that a scoring rubric would weigh heavily.
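
The "$/hr heatmapping" step mentioned above can be sketched as a small pure function. This is an illustration under stated assumptions, not code from any of the responses: the trip keys `start`, `net`, and `hours`, the function name `hourly_heatmap`, and the `min_samples` threshold are all hypothetical, and `start` is assumed to already be in the driver's local time. Buckets with too few trips are reported as `None` instead of a number, which is one simple way to handle the cold-start and sparse-data cases.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def hourly_heatmap(trips, min_samples=2):
    """Average net dollars per hour, bucketed by (weekday, start hour).

    Each trip is a dict with hypothetical keys: start (datetime, local
    time), net (dollars), and hours (trip duration in hours).
    Buckets with fewer than min_samples trips are reported as None.
    """
    net_sum = defaultdict(float)
    hour_sum = defaultdict(float)
    count = defaultdict(int)
    for trip in trips:
        key = (DAYS[trip["start"].weekday()], trip["start"].hour)
        net_sum[key] += trip["net"]
        hour_sum[key] += trip["hours"]
        count[key] += 1
    heatmap = {}
    for key in count:
        if count[key] < min_samples or hour_sum[key] <= 0:
            heatmap[key] = None  # too little data to trust an average
        else:
            heatmap[key] = round(net_sum[key] / hour_sum[key], 2)
    return heatmap


# Hypothetical sample: two Tuesday-evening trips and one Monday-lunch trip.
sample_trips = [
    {"start": datetime(2024, 1, 2, 19, 5), "net": 14.0, "hours": 0.5},
    {"start": datetime(2024, 1, 9, 19, 40), "net": 14.0, "hours": 0.5},
    {"start": datetime(2024, 1, 1, 12, 0), "net": 9.0, "hours": 0.5},
]
grid = hourly_heatmap(sample_trips)
print(grid)
```

Dividing summed net by summed hours, rather than averaging per-trip rates, keeps very short trips from dominating a bucket.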

Kimi K2.6

Answer: I would still choose #5, the Gig Work Optimizer, because its single-actor, deterministic math core and complete absence of multi-party compliance, marketplace liquidity, or external payment-rail dependencies make it the only option that can be genuinely built and deployed in twenty minutes using five prompts.

After reviewing Claude Opus 4.8’s answer, I observe a strong independent convergence on #5 that corroborates rather than contradicts my original stance. Both analyses identify the same fatal structural barriers in the other four challenges: multi-sided marketplaces, regulatory compliance workflows, and third-party API dependencies that simply cannot be made production-ready inside a five-prompt sprint. Seeing another model reach the same conclusion through a complementary lens (emphasizing payment escrow, Facebook/WhatsApp API approvals, and school-district payroll hooks) confirms that these limitations are intrinsic to the problem space, not subjective biases in my own reasoning.

I therefore maintain that the Gig Work Optimizer is uniquely suited to this exercise because it collapses to a self-contained, data-in/insight-out loop. A driver supplies their own CSV history; the application performs deterministic arithmetic on earnings, deductible mileage, and tax brackets; and any LLM layer remains safely decorative, summarizing pre-computed figures rather than generating load-bearing financial predictions. This bounded state space means the Core Build prompt can deliver actual Pandas/Recharts logic, a working profitability heatmap, seeded Supabase tables, and a deployable frontend that runs immediately, rather than stubs for the Immigrant OS, BD Social Commerce, Sub Teacher Platform, or Home Cooking Marketplace, all of which would require weeks of external negotiation to reach compliance or liquidity.

Claude’s proposed prompt sequence is operationally excellent, and I largely align with its scope-cutting discipline: synthetic seed data over live bank feeds, pure functions over regulatory workflows, and strict guardrails that prevent the model from hallucinating numbers. However, my initial position already anchored on these exact architectural properties (algorithmic clarity, no cold-start liquidity problem, and a single-user context), so the additional detail serves as reinforcement rather than persuasion to switch. #5 remains the only challenge where five large prompts can yield a genuinely shippable, scored, execution-ready product instead of a polished mockup of an unfinishable system.
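
The "deterministic arithmetic" described above can be written as two pure functions. This is a rough sketch, not tax advice and not code from any response. The function names are hypothetical, the mileage rate is passed in by the caller because it changes from year to year, and the 0.67 used in the example is purely illustrative and should be checked against the current IRS figure. The self-employment tax estimate applies the 15.3% rate to 92.35% of net earnings and ignores the Social Security wage base and the Additional Medicare Tax, so it is only a first approximation.

```python
SE_TAX_RATE = 0.153        # 12.4% Social Security + 2.9% Medicare
SE_TAXABLE_SHARE = 0.9235  # share of net earnings subject to SE tax


def net_earnings(gross, platform_fees, miles, mileage_rate, other_expenses=0.0):
    """Net = gross - platform fees - mileage deduction - other expenses."""
    mileage_deduction = miles * mileage_rate
    return round(gross - platform_fees - mileage_deduction - other_expenses, 2)


def se_tax_estimate(net):
    """Rough self-employment tax on net earnings; zero when net is not positive."""
    if net <= 0:
        return 0.0
    return round(net * SE_TAXABLE_SHARE * SE_TAX_RATE, 2)


# Illustrative week: $1,000 gross, $200 in fees, 300 miles at an assumed rate.
week_net = net_earnings(gross=1000.0, platform_fees=200.0, miles=300, mileage_rate=0.67)
week_tax = se_tax_estimate(week_net)
print(week_net, week_tax)
```

Because both functions are pure, they are easy to pin down with fixed test fixtures, which is the property the argument above relies on.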

Qwen 3.6 Plus

Answer: I would excel at fully building #5 Gig Work Optimizer in 20 minutes because its single-user, data-driven architecture avoids the multi-party compliance, payment escrow, and external API dependencies that make the other options unshippable within such a tight prompt limit.

The Gig Work Optimizer is fundamentally a data ingestion, deterministic calculation, and lightweight predictive modeling pipeline. Unlike #1–#4, which require secure role-based portals, third-party marketplace liquidity, payroll compliance, or health department approvals, #5 only needs a clean database schema, CSV/JSON parsers, and straightforward math/AI wrappers. An LLM can generate, test, and scaffold this end-to-end without waiting for external approvals or handling edge-case legal liabilities. My 5-prompt strategy is designed to maximize execution readiness by enforcing strict boundaries, deterministic core logic, and automated deployment packaging:

Prompt 1 (Vision/Scope): 'Act as a Senior Product Manager and Lead Engineer. Define the exact MVP scope for a Gig Work Optimizer targeting Uber/DoorDash drivers. Specify: (a) Core user stories focused on net earnings tracking, quarterly tax estimation, and profitable-hour prediction. (b) Explicit out-of-scope items (no bank OAuth, no real-time GPS, no multi-user marketplace). (c) Primary success metrics for a 20-min demo. (d) A strict technology constraint: Next.js App Router, TypeScript, Supabase (PostgreSQL + Auth), Tailwind, Vercel. Output a concise PRD.'

Prompt 2 (Architecture & Schema): 'Generate the complete technical architecture and database schema. Provide: (a) Exact PostgreSQL DDL for tables: drivers, trips, expenses, tax_settings, predictions. (b) API route contracts. (c) The deterministic calculation logic for net earnings (gross - platform_fees - mileage_deduction) and SE tax estimation using 2024 IRS formulas. (d) The prediction logic: a rule-based heuristic that avoids heavy ML libraries to ensure instant execution. (e) Folder structure for a production-ready Next.js repo.'

Prompt 3 (Core Implementation): 'Write the full, runnable codebase. Include: (a) Supabase migration SQL with 50 rows of highly realistic synthetic trip data. (b) Next.js page components: Dashboard with Recharts (bar chart for daily earnings, heatmap grid for peak hours), CSV uploader with PapaParse integration, and Tax Estimator form. (c) Server actions for data import and calculation. (d) Strict TypeScript interfaces matching the schema. (e) Tailwind UI with shadcn components. Ensure zero external dependencies beyond the specified stack. Output file-by-file.'

Prompt 4 (QA & Edge Cases): 'Act as a Senior QA Engineer. Identify and resolve 8 critical edge cases: (1) Malformed CSV dates/currency. (2) Negative net hours. (3) Timezone shifts. (4) Division by zero. (5) Missing mileage logs. (6) Input validation security. (7) State tax defaults. (8) UI layout breaks. Provide exact code patches, Zod validation schemas for all inputs, and a minimal Vitest suite covering the core math functions.'

Prompt 5 (Deployment & Demo Kit): 'Prepare the application for immediate production launch and stakeholder demo. Provide: (a) A single-command setup script (env var generation, DB migration, seed data, Vercel deploy). (b) A 60-second demo narrative script highlighting the exact user flow. (c) A comprehensive README with architecture diagram (ASCII), tech stack justification, and next-step roadmap. (d) A post-deploy health check curl command. Ensure all outputs are copy-paste ready with no placeholders.'

This sequence guarantees that the LLM acts as a full-stack engineer, architect, QA, and DevOps specialist in rapid succession. By front-loading scope boundaries and demanding deterministic math over black-box AI for the core features, it ensures the output is functionally robust, immediately deployable, and easily extensible, hitting every metric an evaluator would score for quality, depth, and execution readiness.
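
Several of the edge cases listed in Prompt 4 (malformed dates and currency, duplicates, division by zero) can be illustrated with a short parser. The sketch below is written in Python for brevity even though this response targets a TypeScript stack with PapaParse and Zod. The column names `trip_id`, `start`, and `net`, the accepted date formats, and the function names are all assumptions; real Uber and DoorDash exports will have their own shapes.

```python
import csv
import io
from datetime import datetime

# Hypothetical set of accepted date layouts; extend as real exports require.
DATE_FORMATS = ("%Y-%m-%d %H:%M", "%m/%d/%Y %H:%M", "%Y-%m-%dT%H:%M:%S")


def parse_money(text):
    """Turn '$1,234.50' into 1234.5; raises ValueError on anything else."""
    return float(text.strip().replace("$", "").replace(",", ""))


def parse_when(text):
    """Try each known date layout in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def load_trips(csv_text):
    """Return (trips, errors); bad rows are reported, never silently dropped."""
    trips, errors, seen = [], [], set()
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header; numbering assumes no multi-line quoted fields.
    for line_no, row in enumerate(reader, start=2):
        try:
            trip_id = row["trip_id"].strip()
            if trip_id in seen:
                raise ValueError("duplicate trip_id")
            trip = {
                "trip_id": trip_id,
                "start": parse_when(row["start"]),
                "net": parse_money(row["net"]),
            }
        except (KeyError, ValueError, AttributeError) as exc:
            errors.append((line_no, str(exc)))
            continue
        seen.add(trip_id)
        trips.append(trip)
    return trips, errors


def safe_hourly(net, hours):
    """Net dollars per hour, or None when hours is zero or negative."""
    return None if hours <= 0 else round(net / hours, 2)
```

Returning the rejected rows alongside the accepted ones lets the UI tell the driver which lines failed and why, instead of quietly producing a smaller total.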

Gemini 3.1 Pro Preview (changed mind)

Answer: I would select #5 Gig Work Optimizer as the only challenge that can realistically be built and deployed in 20 minutes with five large prompts.

Having failed to respond in Round 1, I entered Round 2 without a prior stance. Claude Opus 4.8’s irreducible-complexity argument convinced me: options 1 through 4 are inherently multi-sided marketplaces or compliance gateways. Immigrant OS requires attorney-client privacy portals and immigration-law logic that demands weeks of policy review, not minutes of prompting. BD Social Commerce needs WhatsApp Business API approval, payment KYC, and live-stream infrastructure that cannot be mocked into a production demo. The Sub Teacher Platform and Home Cooking Marketplace both hinge on real-world credential verification, food-safety compliance, and district-by-district regulatory variation; none of these can be synthesized by an LLM into a legally safe, runnable product in twenty minutes.

The Gig Work Optimizer is a single-actor, data-in/insight-out analytical tool. Its core value is deterministic math (net earnings, standard mileage deductions, and historical hourly averages), which an LLM can generate with high precision. There are no external payment rails to sign up for, no opposing party that must simultaneously adopt the platform, and no regulatory body whose rules must be contractually enforced in code. This makes it the only domain where five large prompts can yield a genuinely shippable artifact rather than a brittle prototype. My prompt sequence exploits this structural simplicity while guarding against the single biggest risk of LLM-built financial tools: hallucinated numbers.

PROMPT 1 (Vision/Problem): "Act as a founder and product manager. For the Gig Work Optimizer targeting multi-app US gig drivers, define: (1) the exact ICP and three user stories that fit a 20-minute build, (2) an explicit OUT-OF-SCOPE list: no bank APIs, no real-time Uber OAuth, no payment processing, no multi-state tax filing; use a flat 15.3% self-employment tax estimate plus the current IRS standard mileage rate, (3) the 'aha' demo moment: uploading a CSV and seeing net hourly profit by time-block within 60 seconds, (4) one north-star metric. Output a concise one-page PRD."

PROMPT 2 (Architecture/Stack): "Act as a principal engineer. Design a zero-config architecture deployable to Vercel in under 2 minutes: Next.js 14 App Router with Server Actions, Supabase Postgres with Row Level Security enabled, Drizzle ORM, shadcn/ui and Tailwind, Recharts for visualization. Deliver: (a) the complete database schema in Drizzle TypeScript syntax with tables for drivers, trips, expenses, weekly_snapshots, and tax_settings, (b) a single Server Action pattern for all mutations, (c) deterministic compute formulas: net_earnings = gross - (miles * IRS_rate) - expenses; tax_estimate = net_earnings * 0.153, (d) the exact JSON shape returned for the profitable-hours heatmap. No microservices."

PROMPT 3 (Core Build): "Act as a staff engineer and write the complete, copy-pasteable codebase. Deliver: (1) schema.ts with Drizzle ORM models and relations, (2) seed.sql with 300 realistic synthetic trips across Uber and DoorDash spanning 4 weeks with varied hours and surge events, (3) lib/csvParser.ts that handles both Uber and DoorDash CSV shapes with robust date parsing and Zod validation, (4) app/dashboard/page.tsx showing KPI cards (gross, net, estimated tax, effective hourly), a Recharts earnings-over-time line chart, and a profitable-hours heatmap grid, (5) app/actions.ts with Server Actions for CSV upload, trip CRUD, and snapshot generation, (6) lib/optimizer.ts that returns the top 5 highest-predicted earning windows using simple weighted historical averaging. Every function must be fully typed and runnable. No 'TODO' comments."

PROMPT 4 (QA/Edge Cases): "Act as a QA and safety engineer. The gravest risk is hallucinated financial data. Provide: (a) Zod schemas for every CSV row and form input, (b) a strict 'financial guardrail' function ensuring any number shown in the AI weekly summary is strictly sourced from the deterministic optimizer.ts output and never invented by the LLM, (c) fixes for edge cases: empty CSV, duplicate trip IDs, negative earnings, midnight hour-bucketing across timezones, zero-mileage trips, (d) a lightweight tests/math.test.ts using Vitest that asserts tax math against known IRS fixtures, (e) an error boundary UI component that catches parsing failures and suggests format fixes. Output the actual test code and guardrail implementation."

PROMPT 5 (Launch/Demo): "Act as a DevRel lead. Produce: (a) a README.md with a 'Deploy to Vercel' button, Supabase project link steps, and a 60-second setup command chain, (b) a .env.example with exactly three variables, (c) a 90-second demo script narrative: 'Jordan drags a CSV onto the dashboard, watches the heatmap populate, sees Tuesday 7–10 PM is $28/hr net versus Thursday's $12/hr, and reads the auto-generated weekly recap that ONLY restates computed numbers,' (d) a demoData/ folder with two anonymized CSVs for Uber and DoorDash, (e) a post-launch roadmap of the next three real integrations. Ensure the tone is investor-demo ready and every step is copy-paste deployable."

This sequence scores well because it front-loads ruthless scope-cutting, enforces a zero-config stack, demands full files rather than stubs, and bakes in a critical safety guardrail: the LLM is only allowed to generate prose around numbers computed by deterministic code, never to generate the numbers themselves. That boundary transforms the tool from a risky toy into a trustworthy production demo.
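
The 'financial guardrail' idea, where the summary may only restate computed numbers, could look roughly like the check below. It is sketched in Python rather than the TypeScript this response targets, and it is only one possible reading of the requirement: the function names, the regular expression, and the rounding tolerance are assumptions. It extracts every number from the generated summary and flags any that does not match a value in the dictionary of deterministically computed figures.

```python
import re

# Matches digits with optional thousands separators and a decimal part.
NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def numbers_in(text):
    """Extract every numeric token from text as a float."""
    return [float(token.replace(",", "")) for token in NUMBER.findall(text)]


def unsupported_numbers(summary, computed, tolerance=0.005):
    """Return numbers in the summary that match no computed value.

    computed maps a label to a figure produced by deterministic code.
    An empty result means every number in the summary is accounted for.
    """
    allowed = list(computed.values())
    return [
        n for n in numbers_in(summary)
        if not any(abs(n - a) <= tolerance for a in allowed)
    ]


# Hypothetical computed figures and two candidate summaries.
computed = {"best_net_per_hour": 28.0, "worst_net_per_hour": 12.0}
good = "Your best block averaged $28/hr net versus $12/hr in your weakest."
bad = "Your best block averaged $31/hr net versus $12/hr in your weakest."
print(unsupported_numbers(good, computed))
print(unsupported_numbers(bad, computed))
```

One limitation to keep in mind: this check treats every digit run as a financial figure, so clock times and dates in the summary (such as "7–10 PM") would be flagged unless they are also added to the allowed values or stripped before checking.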

Final Summary

The AI Roundtable unanimously selected the Gig Work Optimizer (Option 5) as the only feasible build. Claude Opus 4.8, Kimi K2.6, and Gemini 3.1 Pro Preview argued that its single-user, data-driven architecture avoids the regulatory and marketplace complexities of other options. Qwen 3.6 Plus and DeepSeek V4 Pro agreed, noting that deterministic math allows for complete, production-ready code generation.
