AI Roundtable Open Debate · 6 models
5 prompts to production. Pick 1 challenge, declare tools, add optional system prompts:
1) Immigrant OS: immigration, jobs, credentials, ESL, housing, benefits; portals for attorneys, nonprofits, immigrants.
2) BD Social Commerce: Bangla FB sellers; live sales, WhatsApp orders, inventory, payments, shipping, AI posts, resellers.
3) Sub Teacher Platform: subs/schools; credentials, scheduling, onboarding, payroll, compliance, AI lesson prep.
4) Home Cooking Marketplace: menus, preorders, catering, food compliance, delivery, AI pricing.
5) Gig Work Optimizer: Uber/DoorDash earnings, taxes, expenses, predict profitable hours, schedule optimization.
Which could you best build into production software, live-judged by an auditorium of experts, using only 5 prompts? Why? An LLM scores prompt strategy on quality, depth, execution-readiness. Prompt sequence: 1) Vision/Problem, 2) Architecture/Stack, 3) Core Build, 4) QA/Edge Cases, 5) Launch/Demo. Each prompt can be large. What these 5 prompts which upon receiving you could have delivered effectivel
Answer: The AI Roundtable unanimously selected the Gig Work Optimizer as the most feasible five-prompt live-demo build. (All 6 models agreed)
Model Responses
GPT-5.5 pro
Answer: I maintain my pick of option 5) Gig Work Optimizer, as every other model independently converged on the same answer for the same structural reasons, which only strengthens my confidence.
All six models (myself, DeepSeek V4 Pro, Gemini 3.1 Pro Preview, Claude Opus 4.8, Qwen 3.6 Plus, and Kimi K2.6) arrived at option 5) Gig Work Optimizer through the same independent logic: bounded scope, a deterministic, data-driven core, no multi-sided marketplace liquidity, and minimal fragile external dependencies. This convergence is not groupthink that I am deferring to; it is a strong signal that the constraint structure (5 prompts, live expert judging, execution-readiness) genuinely funnels toward this single answer. None of the other responses surfaced a consideration that changes my view, so I have no reason to switch.
That said, the other models sharpened two points worth crediting. Gemini 3.1 Pro Preview articulated the single killer metric ('true hourly net') and the one demo moment ('see your real $14.20/hr, not $28') better than I did; naming the concrete before/after number is what actually wins a room of skeptical experts. Claude Opus 4.8 and Kimi K2.6 both correctly flagged that the 'prediction' must be framed as rules plus lightweight regression rather than an ML promise, because an over-ambitious ML claim is exactly the kind of thing experts will poke and break live. I'm incorporating that discipline.
My refined 5-prompt plan therefore stays:
(1) Vision/Problem: lock the persona, the 'true net $/hr' metric, and the scripted winning demo moment.
(2) Architecture/Stack: Next.js + typed API + Postgres/SQLite + pandas-style analytics layer, CSV ingestion (no fragile live OAuth on stage), seed dataset.
(3) Core Build: net-pay calculator, IRS mileage/expense engine, quarterly tax estimator, and the profitable-hours/zone heatmap with real aggregation queries.
(4) QA/Edge Cases: messy multi-platform merges, missing fields, timezone/DST, multi-state taxes, EV vs. gas cost models, and negative-net trips, with tests proving correctness.
(5) Launch/Demo: seeded realistic data, a 90-second guided walkthrough ending on a quantified recommendation, deployment, and an offline fallback.
Every prompt yields a verifiable, self-contained artifact that survives expert scrutiny, which is the actual scoring axis.
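A minimal sketch of the 'true net $/hr' metric from step (3), to make the gross-versus-net gap concrete. The `Shift` fields, the `cost_per_mile` and `tax_rate` parameters, and the flat-rate tax treatment are illustrative assumptions, not part of any model's stated design; a real build would substitute the current IRS mileage rate and a proper tax engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shift:
    """One work session (hypothetical schema for illustration)."""
    gross_pay: float     # platform payout including tips, in dollars
    miles: float         # all miles driven, including unpaid dead miles
    hours_online: float  # logged-in time, including waiting


def true_net_hourly(shifts, cost_per_mile, tax_rate):
    """Return gross and after-cost, after-tax dollars per online hour.

    cost_per_mile and tax_rate are caller-supplied assumptions; tax is a
    single flat rate applied to positive pre-tax profit only.
    """
    hours = sum(s.hours_online for s in shifts)
    if hours <= 0:
        raise ValueError("no online hours recorded")
    gross = sum(s.gross_pay for s in shifts)
    vehicle_cost = sum(s.miles for s in shifts) * cost_per_mile
    pre_tax = gross - vehicle_cost
    tax = max(pre_tax, 0.0) * tax_rate  # no tax owed on a loss
    return {
        "gross_per_hour": round(gross / hours, 2),
        "net_per_hour": round((pre_tax - tax) / hours, 2),
    }


# Placeholder numbers chosen only to show the shape of the result.
week = [Shift(140.0, 80.0, 5.0), Shift(140.0, 120.0, 5.0)]
summary = true_net_hourly(week, cost_per_mile=0.50, tax_rate=0.20)
```

Keeping the calculation a pure function of its inputs is what lets a judge re-derive the on-screen number by hand.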
Gemini 3.1 Pro Preview
Answer: I maintain my pick of option 5) Gig Work Optimizer, as the universal consensus here only strengthens my conviction that its bounded, deterministic, single-user scope is the optimal choice for a 5-prompt, live-judged production build.
Every model at this table independently converged on 5) Gig Work Optimizer, and the reasoning is remarkably aligned: bounded scope, deterministic financial logic, structured inputs, minimal external dependencies, and instant visual payoff. That unanimity isn't groupthink; it arises because the constraints (5 prompts, live experts scoring execution-readiness) mechanically select for the project with the least integration surface and the most auditable math. My original argument holds fully.
Where I want to differentiate rather than just agree: the discriminating insight in this round is the single killer demo metric. GPT-5.5 pro and Claude Opus 4.8 both correctly emphasized that the winning move isn't feature breadth but one quantified before/after: 'real $/hour after taxes and costs' or '+$140/week from these 6 windows.' That framing is what separates a winning prompt strategy from a merely competent one, and it sharpens my Prompt 1 and Prompt 5 beyond my initial draft. I'm adopting that emphasis: the demo narrative must be locked in Prompt 1, not improvised in Prompt 5.
The one point of genuine disagreement worth flagging is the stack choice. Several models proposed React Native or full Next.js + Postgres + serverless backends. Under a hard 5-prompt cap with live judging, I'd argue for the leaner option Kimi K2.6 hinted at (Streamlit or a single Next.js app with SQLite/in-memory): fewer moving parts means fewer ways to break on stage. Experts reward a flawless, math-honest demo over an ambitious architecture with stubbed connectors. The 'predict profitable hours' engine (time-bucketed historical $/hr by day-of-week × time-of-day with confidence bands) is the real depth play, and it must dominate Prompt 3.
So my refined 5-prompt sequence:
(1) Vision/Problem + lock the single demo metric 'true $/hr' and the before/after narrative.
(2) Lean deployable stack + seed dataset + CSV import to dodge fragile live OAuth.
(3) Core Build: earnings normalizer, IRS mileage/self-employment tax engine, and the profitable-hours predictor with the heatmap.
(4) QA/Edge Cases: sparse data, multi-platform double-counting, DST/timezone, rounding, multi-state tax, and seeded fixtures so the screen is never empty.
(5) Launch/Demo: one-click 'load realistic week,' deploy, and the quantified on-stage walkthrough.
I am not changing my position.
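The day-of-week × time-of-day bucketing with confidence bands described above could be sketched roughly as follows. The function name, the `min_samples` cutoff, and the normal-approximation band (mean ± z · s/√n) are assumptions chosen for illustration; the response does not specify how the bands should be computed.

```python
from collections import defaultdict
from datetime import datetime
from math import sqrt
from statistics import mean, stdev


def hourly_buckets(records, min_samples=3, z=1.96):
    """Group net $/hr observations by (weekday, hour) of the local start time.

    records: iterable of (local_start: datetime, net_per_hour: float).
    Returns {(weekday, hour): (low, mean, high)}; buckets with fewer than
    min_samples observations map to None so the UI can grey them out
    instead of showing a misleading number.
    """
    groups = defaultdict(list)
    for start, rate in records:
        groups[(start.weekday(), start.hour)].append(rate)

    result = {}
    for key, rates in groups.items():
        if len(rates) < max(min_samples, 2):  # stdev needs at least 2 points
            result[key] = None
            continue
        centre = mean(rates)
        half_width = z * stdev(rates) / sqrt(len(rates))
        result[key] = (
            round(centre - half_width, 2),
            round(centre, 2),
            round(centre + half_width, 2),
        )
    return result


# Three Friday 18:00 observations and one lone Monday 09:00 observation.
history = [
    (datetime(2024, 3, 1, 18, 15), 20.0),
    (datetime(2024, 3, 8, 18, 40), 22.0),
    (datetime(2024, 3, 15, 18, 5), 24.0),
    (datetime(2024, 3, 4, 9, 0), 11.0),
]
heatmap = hourly_buckets(history)
```

Returning `None` for sparse buckets is one way to keep the "prediction" honest on stage: the heatmap shows only cells with enough history to back them.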
DeepSeek V4 Pro
Answer: Gig Work Optimizer remains the only viable choice for a 5-prompt production build because its deterministic financial logic and self-contained data model eliminate the fragile third-party dependencies and regulatory risks that doom the other options.
I am maintaining my position because the consensus among peer models actually reinforces the technical reality of prompt-constrained development: when you only get five shots at production code, you must choose the project with the narrowest, most deterministic core. Options 1 through 4 inherently rely on live marketplace liquidity, fragile third-party APIs (WhatsApp, payment gateways, government portals), or heavy legal compliance that an LLM cannot reliably scaffold or test in a single pass. A live-judged demo demands something that works flawlessly on the first run, which math and self-contained data deliver.
My five-prompt sequence is explicitly engineered for execution-readiness and avoids any scope creep:
Prompt 1 (Vision & Data Contract): Define the core persona (multi-platform gig worker), lock the single success metric ('true net hourly wage'), and strictly define the data ingestion contract (standardized CSV/JSON schemas for earnings, mileage, and tips). It explicitly defers all social or marketplace features.
Prompt 2 (Architecture & Stack): Specify a lean, type-safe stack (Next.js frontend, Node/Python API, SQLite/Postgres) with a modular directory structure. Define the database schema for shifts and expenses, and mandate that the forecasting module remains isolated from the UI layer for easier debugging.
Prompt 3 (Core Engine & Logic): Generate the deterministic calculation backend: automated expense categorization rules, IRS standard mileage deduction logic, quarterly tax estimator, and the time-series profitability heatmap algorithm. This prompt produces the actual working computational core.
Prompt 4 (QA, Edge Cases & Hardening): Systematically address real-world data messiness: handling missing fields, timezone/DST shifts across state lines, multi-platform overlap deduplication, and negative-profit scenarios. It requires generating unit tests and graceful UI fallback states for sparse data inputs.
Prompt 5 (Deployment & Demo Script): Package the application in a deployment-ready configuration (Docker/Vercel), generate a highly realistic synthetic dataset for the live environment, and produce a scripted 90-second auditorium walkthrough showing the transformation from scattered financial data to a concrete, optimized schedule and accurate $/hr projection.
By focusing entirely on internal data processing and mathematical modeling, this approach guarantees a functional, visually compelling, and bug-resistant demo. The other options inherently require live network calls or regulatory accuracy that an LLM simply cannot reliably guarantee in a constrained prompt sequence.
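As one possible shape for the Prompt 3 tax logic, here is a hedged sketch of a per-quarter self-employment tax estimate. It assumes the standard U.S. treatment (15.3% applied to 92.35% of net profit) and deliberately ignores the Social Security wage cap, income tax, and state taxes. The function name and parameters are hypothetical, and `mileage_rate` must be supplied by the caller because the IRS rate changes from year to year.

```python
SE_NET_FACTOR = 0.9235  # share of net profit subject to self-employment tax
SE_TAX_RATE = 0.153     # 12.4% Social Security + 2.9% Medicare


def quarterly_se_tax(gross_income, business_miles, mileage_rate, other_expenses=0.0):
    """Estimate one quarter's net profit and self-employment tax.

    Simplified on purpose: standard mileage deduction plus other expenses,
    no wage-base cap, no income tax, no state tax. A loss yields zero tax.
    """
    if min(gross_income, business_miles, mileage_rate, other_expenses) < 0:
        raise ValueError("inputs must be non-negative")
    deduction = business_miles * mileage_rate + other_expenses
    net_profit = max(gross_income - deduction, 0.0)
    se_tax = net_profit * SE_NET_FACTOR * SE_TAX_RATE
    return {
        "deduction": round(deduction, 2),
        "net_profit": round(net_profit, 2),
        "se_tax": round(se_tax, 2),
    }


# The 0.50 rate below is a placeholder, not a published IRS figure.
estimate = quarterly_se_tax(gross_income=10_000.0, business_miles=4_000.0, mileage_rate=0.50)
```

Exposing every rate as a named constant or parameter keeps the assumptions inspectable, which matters when experts audit the math live.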
Claude Opus 4.8
Answer: I maintain my choice of option 5) Gig Work Optimizer, because it is the most bounded, testable, visually compelling, and production-plausible product to build with only five large prompts.
I was not convinced to change my position; the other models largely reinforced it. The strongest shared point is that option 5) Gig Work Optimizer has a single-user workflow, structured data, deterministic financial calculations, and a clean demo arc. That is exactly what matters when an expert auditorium is judging prompt strategy on quality, depth, and execution-readiness.
The other options are more ambitious but worse under the five-prompt constraint. 1) Immigrant OS is socially valuable, but it spans immigration law, jobs, credentials, ESL, housing, benefits, and three portals; that is too broad and legally sensitive. 2) BD Social Commerce depends on Facebook, WhatsApp, payments, shipping, inventory, live commerce, and reseller workflows; too many integrations can fail live. 3) Sub Teacher Platform involves credentialing, payroll, compliance, school approvals, and scheduling; production readiness requires district-specific rules and trust infrastructure. 4) Home Cooking Marketplace needs food compliance, delivery, seller vetting, catering, and marketplace liquidity. All four risk looking like attractive prototypes rather than credible production systems.
5) Gig Work Optimizer can be honestly built as a production-grade MVP: CSV/manual import for Uber/DoorDash-style earnings, expense and mileage tracking, net hourly wage calculation, quarterly tax estimates, profitable-hour prediction, and a schedule optimizer. The product can avoid fragile live OAuth in the demo while still using a data model that later supports API integrations. It also has an immediate expert-friendly reveal: gross earnings look good, but true net hourly earnings after mileage, gas, waiting time, and taxes tell a different story.
The five prompts I would want are:
1) Vision/Problem: Define the target user as a multi-app gig worker who does not know true net hourly income. Lock the product to one killer outcome: maximize after-cost, after-tax dollars per hour. Specify personas, core jobs-to-be-done, success metrics, user journey, demo narrative, and non-goals. Non-goals should include live bank integrations, full tax filing, and real-time platform scraping.
2) Architecture/Stack: Choose a lean stack such as Next.js/TypeScript, server actions or a small API layer, Postgres or SQLite, a pure calculation engine, charting components, and CSV/manual ingestion. Define schemas for trips, shifts, platforms, expenses, mileage, zones, tax assumptions, and recommendations. Require clean separation between UI, data normalization, tax/expense logic, and optimization logic.
3) Core Build: Generate the working app: upload/import earnings, normalize trips, calculate gross versus net hourly, estimate mileage/gas/depreciation, categorize expenses, estimate self-employment/quarterly taxes, produce day-of-week/time-of-day profitability heatmaps, and recommend an optimized weekly schedule. Include dashboard, calendar, charts, and plain-English explanations of why each recommendation was made.
4) QA/Edge Cases: Add tests and hardening for missing CSV fields, duplicate trips across platforms, timezone/DST issues, negative-profit shifts, cash tips, cancelled orders, multi-state tax assumptions, EV versus gas vehicles, no-data states, outliers, and privacy/security risks. Require deterministic fixtures so experts can inspect the math and see that the system is not just AI theater.
5) Launch/Demo: Package the app for deployment, seed a realistic month of gig-worker data, and script a live demo: upload data, reveal true net hourly wage, show the worst and best earning windows, adjust vehicle/tax assumptions, then generate next week’s optimized schedule. Include fallback seed data, demo reset, monitoring notes, and a concise roadmap for real integrations.
This is the best five-prompt production challenge because each prompt produces a concrete artifact: product scope, architecture, implementation, verification, and launch narrative. The result would be functional, defensible, and impressive rather than a broad concept held together by placeholder integrations.
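Two of the QA items in prompt 4 (missing CSV fields and duplicate trips) lend themselves to a small, deterministic ingestion sketch. The column names in `REQUIRED`, the dedup key of platform plus trip ID, and the `load_trips` function are assumptions for illustration only; real Uber or DoorDash exports use their own schemas.

```python
import csv
import io

# Hypothetical normalized columns; real platform exports differ.
REQUIRED = ("platform", "trip_id", "start", "payout")


def load_trips(csv_text):
    """Parse CSV text into (trips, rejected).

    Rows with missing required fields, non-numeric payouts, or a repeated
    (platform, trip_id) pair are rejected with their source line number
    and a reason, rather than silently dropped. Negative payouts are kept,
    since negative-profit trips are valid data.
    """
    trips, rejected, seen = [], [], set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        line_no = reader.line_num
        missing = [f for f in REQUIRED if not (row.get(f) or "").strip()]
        if missing:
            rejected.append((line_no, "missing: " + ", ".join(missing)))
            continue
        try:
            payout = float(row["payout"])
        except ValueError:
            rejected.append((line_no, "payout is not a number"))
            continue
        key = (row["platform"].strip().lower(), row["trip_id"].strip())
        if key in seen:
            rejected.append((line_no, "duplicate trip"))
            continue
        seen.add(key)
        trips.append({
            "platform": key[0],
            "trip_id": key[1],
            "start": row["start"].strip(),
            "payout": payout,
        })
    return trips, rejected


sample = (
    "platform,trip_id,start,payout\n"
    "Uber,a1,2024-03-01T18:00,12.50\n"
    "uber,a1,2024-03-01T18:00,12.50\n"
    "DoorDash,d9,,8.00\n"
    "DoorDash,d10,2024-03-01T19:00,-3.25\n"
)
trips, rejected = load_trips(sample)
```

Surfacing the rejected rows with reasons gives the demo a visible answer when an expert asks what happens to bad input.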
Qwen 3.6 Plus
Answer: I maintain my selection of "Gig Work Optimizer" because its tightly bounded data model, deterministic financial logic, and lack of external marketplace dependencies make it the only option that can be fully engineered, tested, and impressively demoed within exactly five prompts.
The consensus among the other models correctly highlights that extreme prompt constraints demand projects with minimal integration points and maximum internal coherence. "Gig Work Optimizer" operates as a self-contained analytical tool rather than a multi-sided platform. Its core value (tracking gig earnings, managing expenses, estimating taxes, and predicting profitable hours) relies on structured numerical data and rule-based calculations that an LLM can generate with high precision and zero ambiguity.
In contrast, the other options introduce unavoidable live-judgment risks. "BD Social Commerce" and "Home Cooking Marketplace" require fragile third-party API integrations for payments, messaging, and logistics, which are prone to failure in a controlled demo environment. "Immigrant OS" and "Sub Teacher Platform" are burdened by heavy regulatory compliance, credential verification, and government portal dependencies that demand legal accuracy and cannot be reliably stubbed or built out in five prompts without exposing critical gaps.
My five-prompt execution strategy remains highly viable because each step maps to a discrete, verifiable engineering milestone. I will use the first prompt to lock the data schema and core metrics, the second to architect a lean full-stack framework, the third to generate the deterministic calculation engine, the fourth to harden the code against real-world edge cases like missing CSV fields or overlapping shifts, and the fifth to package the application with seeded datasets for a flawless, quantifiable auditorium presentation. This approach guarantees execution-readiness, avoids demo-day network failures, and delivers immediate, auditable results that experts can validate on the spot.
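The "overlapping shifts" edge case mentioned above can be illustrated with a small interval-merge sketch: a worker logged into two apps at once should have that time counted only once when computing hours. The function name and the (start, end) tuple format are assumptions for illustration.

```python
from datetime import datetime, timedelta


def online_hours(sessions):
    """Total hours covered by possibly overlapping (start, end) sessions.

    Sessions from every platform are sorted and merged, so simultaneous
    logins on two apps count once. Zero-length or reversed sessions raise.
    """
    total = timedelta()
    current_start = current_end = None
    for start, end in sorted(sessions):
        if end <= start:
            raise ValueError(f"session ends before it starts: {start} -> {end}")
        if current_end is None or start > current_end:
            # Gap before this session: close out the previous merged block.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total.total_seconds() / 3600


evening = [
    (datetime(2024, 3, 1, 20, 0), datetime(2024, 3, 1, 22, 0)),    # app B
    (datetime(2024, 3, 1, 18, 0), datetime(2024, 3, 1, 21, 0)),    # app A
    (datetime(2024, 3, 1, 23, 0), datetime(2024, 3, 1, 23, 30)),   # app A
]
hours = online_hours(evening)
```

Summing the three sessions naively would give 5.5 hours; the merged total is 4.5, which changes any per-hour figure derived from it.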
Kimi K2.6
Answer: I maintain my selection of 5) Gig Work Optimizer as the only option that can be built into production-ready, demoable software within five prompts and under live expert scrutiny.
The unanimous agreement across all peer models confirms the decisive constraint: projects 1, 2, 3, and 4 all require multi-sided coordination, fragile third-party integrations, or high-stakes regulatory compliance that five prompts cannot reliably encode or convincingly stub on stage. Immigrant OS spans legal portals and benefits; BD Social Commerce hinges on live WhatsApp and payment rails; Sub Teacher Platform demands school-district credentialing and payroll accuracy; Home Cooking Marketplace drags in food compliance and delivery logistics. Each carries integration points that can fail visibly under expert questioning.
Option 5, by contrast, is a deterministic, single-user analytics engine. It ingests structured gig data, applies rule-based tax and expense logic, and outputs an auditable heatmap of profitable hours. Its complexity is algorithmic and data-driven, which is exactly what an LLM can scaffold across five large prompts without dependency on external live systems. Because no peer model offered a credible path to compressing marketplace liquidity or regulatory compliance into the same constraint, the consensus only strengthens my view: bounded scope and provable math win live judging.
Final Summary
In a rare display of total alignment, the AI Roundtable unanimously rejected the other options as over-engineered, integration-heavy, or legally risky. Claude Opus 4.8, GPT-5.5 pro, and Gemini 3.1 Pro Preview agreed that the Gig Work Optimizer's single-user scope and deterministic, data-driven math make it uniquely suited to survive expert live judging. DeepSeek V4 Pro, Qwen 3.6 Plus, and Kimi K2.6 further emphasized that bypassing fragile live OAuth and third-party APIs in favor of simple CSV ingestion guarantees a robust, fail-safe demo within just five prompts.