AI Roundtable Open Poll · 10 models

5 prompts to production. Pick 1 challenge,declare tools,add optional system prompts: 1)Immigrant OS:immigration,jobs,credentials,ESL,housing,benefits;portals for attorneys,nonprofits,immigrants. 2)BD Social Commerce:Bangla FB sellers;live sales,WhatsApp orders,inventory,payments,shipping,AI posts,resellers. 3)Sub Teacher Platform:subs/schools;credentials,scheduling,onboarding,payroll,compliance,AI lesson prep. 4)Home Cooking Marketplace:menus,preorders,catering,food compliance,delivery,AI pricing. 5)Gig Work Optimizer:Uber/DoorDash earnings,taxes,expenses,predict profitable hours,schedule optimization. Which could you best build into production software,live-judged by experts,using only 5 prompts? Why? LLM scores prompt strategy on quality,depth,execution-readiness. Prompts:1)Vision/Problem,2)Architecture/Stack,3)Core Build,4)QA/Edge Cases,5)Launch/Demo. Each prompt can be large. What these 5 prompts which upon receiving you could have delivered most effectively

Answer: The AI Roundtable overwhelmingly selects the Gig Work Optimizer as the ultimate five-prompt production software build. (9 of 10 models agreed)

Counterpoint: Gemini 3.5 Flash argued that the Sub Teacher Platform's clear marketplace mechanics and native AI lesson generation make it a superior, risk-free live demo.

Model Responses

GPT-5.5 pro

Answer: I would choose "5) Gig Work Optimizer" as the most viable project to build into production-ready software in exactly five prompts, owing to its single-user-role focus, algorithmic simplicity, and easily mockable data ingestion.

Building production-grade software in only five prompts requires minimizing architectural complexity, external API dependencies, and multi-tenant access control. While options like "1) Immigrant OS" and "3) Sub Teacher Platform" require complex multi-portal role-based access control (RBAC) and strict regulatory compliance, and "2) BD Social Commerce" relies heavily on brittle, locked-down social media APIs, "5) Gig Work Optimizer" is an analytical tool built for a single primary user: the gig worker. By focusing on a single user persona, the database schema remains flat and highly performant. The core value proposition—predicting profitable hours and optimizing schedules—is fundamentally algorithmic. Large Language Models (LLMs) are exceptionally proficient at generating mathematical optimization algorithms, heuristics, and financial calculation engines in a single, well-structured prompt. To overcome the live-demo challenge of live gig platform integrations (which are restricted by private APIs and OAuth sandboxes), "5) Gig Work Optimizer" can elegantly utilize file parsing (e.g., uploading CSV or PDF earnings statements). An LLM can easily generate robust parsing logic for these standardized documents. With Prompt 1 establishing this focused scope, Prompt 2 can deploy a clean stack (Next.js, Supabase, and Tailwind), Prompt 3 can write the parsing and optimization engine, Prompt 4 can handle edge cases like timezone differences and tax deductions, and Prompt 5 can seed rich mock data to deliver a visually stunning, fully functional analytical dashboard that will easily impress live judges.

Claude Opus 4.7

Answer: Option 5) Gig Work Optimizer is the most effective candidate for a 5-prompt production build because its core value proposition relies on data visualization and optimization algorithms—areas where LLMs excel at generating self-contained, production-grade code without complex third-party hardware or compliance dependencies.

Building production-ready software in exactly five prompts requires a project with high visual impact, low dependency on external hardware, and a self-contained business logic loop. Options like 'Immigrant OS' or 'Sub Teacher Platform' carry heavy legal compliance burdens and require multi-role portals that are too vast to build securely in one core prompt. 'BD Social Commerce' and 'Home Cooking Marketplace' rely heavily on real-time physical logistics and restrictive third-party APIs (like WhatsApp Business or live payment gateways) which are notoriously brittle to set up and mock during a live-judged demo. In contrast, 'Gig Work Optimizer' is highly analytical and data-driven. Its primary features—earnings tracking, tax calculations, expense management, and schedule optimization—can be fully powered by a local SQLite database, clean mathematical heuristics, and interactive frontend dashboards. LLMs are exceptionally proficient at writing robust data-parsing logic (such as reading CSV exports from Uber or DoorDash) and implementing scheduling optimization algorithms directly in TypeScript. By leveraging a modern, zero-config stack like Next.js, Tailwind CSS, Shadcn/ui, and Recharts, the five-prompt sequence can be executed flawlessly. The 'Core Build' prompt can deliver a fully functional, highly interactive dashboard. To ensure a winning live demo, the 'Launch/Demo' prompt can easily generate a realistic historical dataset, instantly showing the judges an impressive, high-fidelity visualization of simulated earnings optimizations, tax savings, and interactive schedule recommendations without any external API failures.

Gemini 3.1 Pro Preview

Answer: I would pick "5)Gig Work Optimizer" because it has the tightest, most well-defined scope that maps cleanly to a 5-prompt production build judged by experts.

Among the five, "5)Gig Work Optimizer" is the most buildable into demonstrable, judge-able production software within five large prompts because its core value is computational and data-driven rather than dependent on heavy third-party integrations, regulatory ambiguity, or multi-stakeholder marketplaces. Why it wins on scope: The other options carry hidden complexity that blows a 5-prompt budget. "1)Immigrant OS" spans immigration law, credentials, benefits eligibility — each is its own regulated domain with liability. "2)BD Social Commerce" needs FB/WhatsApp API access, payments, and shipping that you can't fully stand up live. "3)Sub Teacher Platform" requires payroll, compliance, and credential verification integrations. "4)Home Cooking Marketplace" hinges on food-safety compliance and delivery logistics. "5)Gig Work Optimizer" reduces to: ingest earnings/expense data, model profitable hours, optimize a schedule — a self-contained app with clear, testable outputs. Why it demos well: Judges can see concrete results — 'work these hours/zones for +X% earnings,' a tax/expense ledger, a profitability heatmap. Prediction quality is verifiable against real gig-economy patterns, so QA and edge cases (surge pricing, mileage deductions, multi-app stacking) are crisp and gradeable. The 5 prompts I'd want to receive to deliver best: 1) Vision/Problem: Define the target user (multi-app gig driver), the precise pain (opaque true hourly net after taxes/expenses), success metrics, and the data sources (manual entry + CSV imports from Uber/DoorDash, mileage GPS), plus the one killer feature: predicted net-profit-per-hour by time-and-zone. 2) Architecture/Stack: Specify a concrete, deployable stack — e.g., Next.js + FastAPI/PostgreSQL, a lightweight forecasting model (gradient-boosted or seasonal time-series), auth, and a schema for trips, earnings, expenses, and tax rules. Include why each choice keeps the build shippable in the prompt budget. 3) Core Build: Generate the actual working code — data ingestion, the profitability/tax engine, the schedule optimizer, and a dashboard with the heatmap and recommendations — as runnable modules, not pseudocode. 4) QA/Edge Cases: Handle messy real data (missing trips, multi-app overlap, mileage vs. standard deduction, quarterly estimated taxes, state variation), with tests and validation so the demo doesn't break under expert probing. 5) Launch/Demo: A seeded dataset, a guided demo script showing a driver going from 'I made $X' to 'optimized I'd net $Y,' deployment steps, and a clear narrative of accuracy and value for the live judges. This sequence is winnable because each prompt produces a tangible, gradeable artifact and the domain doesn't require live external systems to prove it works.

Kimi K2.6

Answer: I could most effectively build the 5)Gig Work Optimizer into production software with only five prompts because it is a single-user, algorithmic tool with minimal regulatory risk, clear input-output logic, and an instantly quantifiable demo that experts can judge on real earnings impact.

The 5)Gig Work Optimizer is the optimal choice for a 5-prompt production sprint because it targets a single, well-defined user persona—the gig worker—eliminating the marketplace coordination, multi-stakeholder onboarding, and regulatory liability that bog down the other four options. Unlike 1)Immigrant OS or 3)Sub Teacher Platform, which must navigate immigration law, school-district compliance, and payroll regulations, or 2)BD Social Commerce and 4)Home Cooking Marketplace, which require live payment gateways, shipping logistics, and two-sided network effects, the Gig Work Optimizer functions as a standalone analytical utility. It can ingest exported earnings data, apply deterministic tax and expense rules, and run a constrained optimization algorithm, tasks that map cleanly onto a deterministic software architecture with minimal external API dependencies. This tight scope aligns perfectly with the five mandated prompts. Prompt 1 (Vision/Problem) crystallizes around a universally understood pain point: gig workers lose income to poor scheduling and missed deductions. Prompt 2 (Architecture/Stack) stays lean—Python or TypeScript backend, a SQLite or Postgres datastore, and a React or mobile frontend—without complex multi-tenant compliance engines or third-party payment rails. Prompt 3 (Core Build) targets high-value, LLM-assisted components: parsing CSV/JSON earnings exports, implementing IRS-standard mileage and expense logic, and building a predictive profitability model using historical hour-by-hour earnings data. Because these are algorithmic rather than bureaucratic challenges, the model can generate substantial, testable code in a single prompt. Prompt 4 (QA/Edge Cases) remains tractable: we test tax-edge cases across state lines, handle missing data in driver logs, validate schedule-constraint logic such as minimum rest periods and app-specific bonus windows, and simulate market volatility. Finally, Prompt 5 (Launch/Demo) is compelling and quantifiable: the demo can show a "before" dashboard of raw earnings versus an "after" view with optimized shifts and recovered deductions, giving expert judges immediate, dollar-denominated proof of execution-readiness. The other options demand demos that either risk legal non-compliance or depend on populated marketplaces; the Gig Work Optimizer proves its value with one user and a spreadsheet, making it the most executable, expert-judgable candidate within extreme prompt constraints.

DeepSeek V4 Pro

Answer: Option 5, the Gig Work Optimizer, is the best choice because its purely data-centric, single-user scope and clear algorithmic core allow for the highest execution-readiness and depth using only five large prompts.

The Gig Work Optimizer offers the most contained and predictable path to a production‑ready tool. It does not require multi‑role portals, real‑time social‑media integrations, or heavy regulatory compliance—each of which would demand extensive external coordination impossible to fully capture in five prompts. A personal earnings and scheduling assistant can be built as a standalone web or mobile app with a straightforward backend. The core value is data aggregation, predictive modeling, and a clean UI, all of which an LLM can specify in extraordinary detail. Projects like Immigrant OS or BD Social Commerce would falter because their depth relies on integrating with live government databases, Facebook APIs, and WhatsApp Business—each a moving target that needs continual updates and security hardening beyond a prompt’s reach. The Sub Teacher Platform similarly demands school‑district integrations, background‑check systems, and payroll gateways. The Home Cooking Marketplace adds food‑safety compliance, delivery logistics, and multi‑vendor management that quickly snowball into a sprawling system. The five prompts themselves map perfectly onto the Optimizer’s needs. (1) Vision/Problem: the prompt can precisely define the user pain points, target audience, and success metrics, grounding the build in concrete gig‑worker workflows. (2) Architecture/Stack: the LLM can output a crisp tech stack (e.g., Next.js + Node.js + MongoDB, with a scheduler queue for predictions) and data‑flow diagrams. (3) Core Build: the prompt can generate all critical code—API connectors for Uber/DoorDash (if using official APIs) or a standardized CSV import; the predictive model using a simple time‑series algorithm or even an LLM‑based adviser; the dashboard UI; and automated tax estimates. (4) QA/Edge Cases: the LLM can systematically address missing data, multi‑app switching, currency anomalies, and privacy concerns, producing thorough test plans and fallback logic. (5) Launch/Demo: it can script a live demo walkthrough, deployment to a PaaS, and monitoring setup. Because the Optimizer’s logic is mostly deterministic, well‑crafted prompts can yield a functional prototype that a panel of experts would recognize as both deep and launch‑ready. The other concepts would stall at the first external dependency that cannot be fully specified in a single prompt.

Claude Opus 4.8

Answer: I would pick "5)Gig Work Optimizer" because it is the most bounded, demoable, and execution-ready product to build into production-quality software with only five large prompts.

"5)Gig Work Optimizer" has the best ratio of product value to build complexity. It is a single-user workflow, has clear inputs and outputs, can be judged with realistic seeded data, and does not require solving a two-sided marketplace, institutional compliance, legal advice, payroll, immigration risk, or live logistics on day one. The MVP can be production-shaped without depending on fragile Uber/DoorDash APIs: drivers can import CSVs, enter shifts manually, or use seeded demo data. The product can still feel real by showing earnings analytics, expense tracking, tax estimates, profitable-hour prediction, and schedule recommendations. That gives judges something concrete to evaluate: "Does this help a gig worker make better decisions?" Declared tools: Next.js/TypeScript, Tailwind/shadcn UI, Supabase Auth + Postgres, Prisma or Supabase client, Vercel deployment, Recharts for analytics, optional Mapbox for zone visualization, OpenAI API for plain-English coaching, and seeded sample datasets for Uber/DoorDash-style shifts, expenses, mileage, and local demand patterns. Optional system prompt: "You are a senior full-stack product engineer. Optimize for a working production MVP, not a prototype. Make every answer implementation-ready, with schemas, routes, components, validation, seed data, edge cases, and demo flow. Avoid external dependencies that require private platform API access; use imports, manual entry, and mock connectors where necessary." The five prompts I would use are: 1) "Vision/Problem": Define the product as a driver profit command center. Specify target users, painful jobs-to-be-done, MVP scope, non-goals, key metrics, demo story, and what must be live-buildable in five prompts. Require user personas, core workflows, success criteria, and a crisp feature cut: earnings dashboard, shift log, expenses/mileage, tax estimate, profitable-hours predictor, schedule optimizer, and AI coach. 2) "Architecture/Stack": Produce the technical blueprint using Next.js, Supabase, Postgres schema, auth model, data model, RLS/security assumptions, API routes/server actions, seed data strategy, import format, analytics calculations, prediction approach, deployment plan, and component map. Require exact tables for users, platforms, shifts, expenses, mileage, zones, predictions, goals, and recommendations. 3) "Core Build": Generate the production code structure and main implementation: dashboard, shift CRUD, CSV import, expense/mileage tracker, tax estimate, charts, recommendations page, schedule optimizer, AI insight panel, seed script, and polished responsive UI. Require complete files, not pseudocode, with validation and sensible defaults. 4) "QA/Edge Cases": Stress-test the app. Add error states, empty states, invalid CSV handling, timezone issues, negative earnings, overlapping shifts, missing mileage, tax disclaimer, privacy copy, accessibility, loading states, test cases, security checks, and demo data reset. Require a judge-facing QA checklist. 5) "Launch/Demo": Prepare the live-judged presentation: deployment steps, environment variables, seed command, demo script, sample user credentials, walkthrough narrative, product positioning, limitations, future integrations, and a final polish pass. Require the demo to show a driver discovering that Friday 6–10pm in one zone beats their usual schedule after taxes and expenses. This option would score highest on prompt strategy because each prompt maps cleanly to a production milestone: product definition, architecture, implementation, hardening, and launch. The result would be narrow enough to finish, rich enough to impress, and measurable enough for expert judges to evaluate.

Kimi K2.6

Answer: I would pick "Gig Work Optimizer" because it has the narrowest production-grade MVP, the fewest dangerous compliance dependencies, and the clearest live-demo value within only five large prompts.

The strongest choice is "Gig Work Optimizer" because it can be turned into believable production software without needing fragile third-party platform integrations. Uber and DoorDash APIs are not reliably available for this use case, so the production-ready strategy is to avoid dependency on them: let users import CSVs, screenshots, or manually enter shifts, expenses, mileage, and payouts. That makes the app demoable, useful, and technically defensible in five prompts. Compared with the other options, it has less surface-area risk. "Immigrant OS" is high-impact but legally sensitive and too broad: immigration, jobs, credentials, ESL, housing, benefits, and attorney/nonprofit portals would be difficult to make credible quickly. "BD Social Commerce" depends heavily on Facebook, WhatsApp, payments, shipping, and local seller workflows. "Sub Teacher Platform" involves credentials, school compliance, payroll, and trust/safety. "Home Cooking Marketplace" has food compliance, delivery logistics, and local regulation. "Gig Work Optimizer" still has tax and financial caveats, but those can be handled with disclaimers, conservative calculations, and exportable records rather than pretending to be a tax advisor. The product also has a compelling expert-judge demo: a driver uploads prior earnings, logs mileage and expenses, sees net hourly profit by day/time/location/platform, gets tax set-aside estimates, and receives a recommended work schedule for the next week. That is concrete, measurable, and easy to evaluate live. Declared tools: Next.js with TypeScript, Tailwind/shadcn UI, Supabase Auth/Postgres/Storage with row-level security, server actions or API routes, a CSV importer, manual shift and expense forms, a lightweight forecasting module, optional Mapbox or Google Maps for zones, optional weather/events API for demand signals, and an LLM layer only for explanations, summaries, and categorization—not for core financial math. Optional system prompt: "You are a senior product engineer building a production-ready MVP for a live expert demo. Optimize for narrow scope, working code, data integrity, security, and demo clarity. Avoid unsupported external APIs. Prefer simple, inspectable calculations over black-box claims. Include tests, seed data, edge cases, and a polished demo path." Prompt 1 — Vision/Problem: define the exact ICP as multi-app gig drivers using Uber, DoorDash, Instacart, and similar platforms; scope the MVP to earnings import, shift logging, expense tracking, net-profit analytics, tax set-aside estimate, and schedule recommendations. Produce personas, jobs-to-be-done, user stories, non-goals, success metrics, risk register, demo narrative, and acceptance criteria. Prompt 2 — Architecture/Stack: generate the technical architecture using Next.js, Supabase, Postgres schema, authentication, RLS policies, storage for uploads, typed data models, routes, components, calculation services, and deployment plan. Explicitly design around manual/CSV import instead of forbidden gig-platform scraping. Include tables for users, platforms, shifts, earnings, expenses, mileage, tax profiles, recommendations, imports, and audit logs. Prompt 3 — Core Build: implement the app screens and backend flows: onboarding, dashboard, CSV/manual import, shift editor, expense/mileage tracker, analytics charts, profit calculations, tax set-aside estimates, and weekly schedule optimizer. Seed realistic demo data and provide one-click demo mode. Core calculations should be deterministic: gross pay, tips, fees, mileage deduction, expenses, estimated taxable income, estimated tax reserve, net hourly rate, and best historical earning windows. Prompt 4 — QA/Edge Cases: harden the product with validation, empty states, duplicate import detection, timezone handling, negative adjustments, partial shifts, multiple currencies as a future guardrail, privacy controls, export to CSV/PDF, error states, accessibility, mobile responsiveness, unit tests for calculations, integration tests for import flows, and clear disclaimers that the app provides estimates rather than tax/legal advice. Prompt 5 — Launch/Demo: create the production launch package: landing page, demo script, sample user account, seeded dataset, deployment checklist, environment variables, security checklist, judge walkthrough, product positioning, pricing hypothesis, roadmap, and a crisp explanation of why the app is useful even without official Uber/DoorDash integrations. This five-prompt sequence would score well on quality because the product is sharply scoped; on depth because it includes data model, calculations, QA, security, and demo strategy; and on execution-readiness because it avoids the integrations and regulatory traps that would block the other ideas from becoming credible production software in only five prompts.

Qwen 3.6 Plus

Answer: I would choose 'Gig Work Optimizer' because its self-contained data architecture and mathematically defined core logic align perfectly with the constraints of a five-prompt production build.

The five-prompt constraint heavily rewards projects with bounded scope, explicit data schemas, and minimal external dependency complexity. Unlike the immigration, commerce, teaching, or food delivery options, which require navigating unpredictable third-party APIs, heavy multi-jurisdictional compliance, or real-time logistics coordination, the 'Gig Work Optimizer' centers on deterministic data processing, forecasting, and constraint-based scheduling. LLMs excel at generating production-ready code for this domain because the inputs (earnings logs, expense receipts, hourly location/rate data) and outputs (tax estimates, schedule recommendations, profitability dashboards) are well-structured and mathematically solvable. This drastically reduces the risk of integration failure or ambiguous requirement interpretation during prompt execution. The five-prompt sequence can be strategically mapped to guarantee execution readiness. Prompt 1 establishes the product vision, user personas, and exact feature boundaries to prevent scope creep. Prompt 2 defines the architecture, specifying a lightweight, deployment-friendly stack like Next.js with TypeScript, Supabase or SQLite for local-first data storage, and clear relational models for tracking income streams, expense categories, and tax liabilities. Prompt 3 drives the core build, instructing the LLM to implement a time-series forecasting module for earnings prediction and a constraint solver for schedule optimization. Prompt 4 handles QA and edge cases, explicitly testing for missing data imputation, outlier earnings spikes, changing tax brackets, and API rate limits. Prompt 5 generates launch and demo protocols, including CI/CD pipelines, environment configuration, user onboarding flows, and a polished presentation script tailored for expert reviewers. Expert judges evaluating prompt strategy will prioritize architectural clarity, functional completeness, and deployment readiness over feature volume. The 'Gig Work Optimizer' avoids the legal fragility of payroll or immigration compliance, the real-time sync complexity of live social commerce, and the logistical unpredictability of food delivery networks. Instead, it delivers a highly polished, algorithmically sound tool that can be fully coded, tested, deployed, and documented within five comprehensive prompts. The result is a production-ready MVP that demonstrates deep technical execution, clear reasoning, and immediate user value without relying on external partners or ambiguous regulatory frameworks.

GPT-5.5

Answer: I would choose option 5) Gig Work Optimizer because its tight scope, clean data model, and computable core (earnings, taxes, predictions) make it the most realistically deliverable production-grade system in just 5 large prompts under live expert judging.

Pick rationale: Among the five, "Gig Work Optimizer" has the smallest surface area with the highest defensible depth. It avoids multi-party regulatory minefields (immigration law, K-12 credentialing, food safety) and avoids heavy real-time/multimedia infrastructure (live commerce video, WhatsApp Business API, delivery logistics). The domain reduces to: ingest gig platform data → normalize → compute tax/expense ledger → forecast profitable hours → recommend a schedule. That is a tractable, demoable ML+analytics SaaS that experts can stress-test live. Declared tools/stack: Next.js 14 (App Router) + TypeScript + Tailwind/shadcn for UI; Supabase (Postgres + Auth + RLS + Storage) for data and auth; Prisma ORM; tRPC or Next server actions; Plaid for bank/income aggregation; Argyle or Atomic for gig-platform payroll connections (fallback: CSV/email-receipt upload + OCR via GPT-4o vision); Inngest for background jobs; Upstash Redis for caching; Resend for email; Stripe for subscriptions; Vercel for hosting; Python microservice on Modal for forecasting (Prophet + LightGBM); Mapbox for heatmaps; PostHog for analytics; Sentry for errors; Playwright + Vitest for tests; OpenAI/Claude APIs for the AI tax assistant and natural-language insights. Why 5 prompts are enough here: The data schema is narrow (trips, earnings, expenses, mileage, shifts, user, location_cell). The ML is well-trodden (time-series + geo-hour profitability ranking). Compliance is informational (IRS Schedule C, mileage rate, quarterly estimates) not gatekept. A judge can be wowed in 90 seconds: connect Plaid sandbox → see live dashboard → ask "When should I drive Friday?" → get a map+schedule. My 5 prompts (each large, with optional system prompts): Prompt 1 — Vision/Problem (System: "You are a senior product strategist and gig-economy domain expert. Output a PRD in markdown."): Define ICP (multi-app drivers earning $20–60k), JTBD, top 10 user pains with citations to public driver forums, competitive teardown (Gridwise, Para, Solo, Stride), differentiation (unified tax+forecast+schedule), success metrics (activation = first connected account + first insight in <5 min; north star = $/active-hour lift), monetization ($9.99/mo + 14-day trial), regulatory notes (1099 tax, IRS mileage, state nexus), risk register, and a crisp v1 scope cut. Prompt 2 — Architecture/Stack (System: "You are a staff engineer. Produce production-grade architecture artifacts."): Full system diagram, Postgres schema with RLS policies, Prisma models, API surface (tRPC routers), Plaid/Argyle integration flows with webhooks, Inngest job DAG (sync → normalize → categorize → forecast), ML service contract (FastAPI on Modal), secrets/env, threat model (PII, tokens), observability plan, CI/CD on Vercel + GitHub Actions, and a folder structure ready to scaffold. Prompt 3 — Core Build (System: "You are a senior full-stack engineer. Output complete, runnable code in a single repo tree with file paths."): Generate the Next.js app: auth, onboarding wizard, Plaid Link + Argyle connect, CSV/receipt upload with GPT-4o OCR, transactions table with auto-categorization (rules + LLM fallback), earnings dashboard, mileage tracker, tax estimator (federal + state + SE tax + quarterly), the Python forecasting service (Prophet baseline + LightGBM on hour×geo×weather features), the "When/Where to Drive" recommender with Mapbox heatmap, and a chat assistant grounded on the user's data via tool-calls. Include seed data for demo. Prompt 4 — QA/Edge Cases (System: "You are a QA lead and security auditor."): Exhaustive edge cases (multi-app overlap, tip lag, cash tips, refunds, mileage on non-work trips, sandbox vs prod tokens, time-zone bugs, DST, surge anomalies, cold-start users with no history → fallback to cohort priors), Playwright E2E suite, Vitest unit tests for tax math with IRS-published examples, load test plan, RLS penetration tests, prompt-injection defenses for the chat assistant, accessibility (WCAG AA), and a bug-bash script the judges can run. Prompt 5 — Launch/Demo (System: "You are a launch engineer and demo director."): Production checklist (Plaid prod approval path, Stripe live, DPA, privacy policy, ToS, SOC2-lite controls), pricing page, marketing site copy, onboarding emails, a 3-minute live demo script with seeded Plaid sandbox user showing connect → insight → schedule → tax export to TurboTax CSV, judge Q&A cheatsheet covering accuracy claims, and a 30/60/90 roadmap (Android mileage auto-track, lender partnerships, fleet B2B). Why not the others: Immigrant OS requires legal review and trust signals that can't be faked in 5 prompts; BD Social Commerce hinges on WhatsApp Business API approval and live video infra; Sub Teacher Platform requires district SSO, background-check vendors, and payroll/compliance that vary by state; Home Cooking Marketplace is blocked by cottage-food laws and delivery ops. Gig Work Optimizer is the rare pick where 5 well-aimed prompts can yield something a judge will actually believe is production-ready.

Gemini 3.5 Flash

Answer: The Sub Teacher Platform is the best choice because its crisp domain boundaries, well-understood marketplace mechanics, and LLM-native AI lesson-prep module can be fully architected, core-built, edge-case hardened, and demoed within five comprehensive prompts without relying on fragmented legal expertise or unstable third-party APIs.

Among the five options, Sub Teacher Platform sits in the sweet spot of complexity that is deep enough to impress expert judges yet bounded enough to survive rapid prompt compression. Immigrant OS and Home Cooking Marketplace are disqualifyingly open-ended: they tie together fragmented, high-stakes regulations—immigration law and cottage-food health codes—that demand weeks of legal review and multi-stakeholder portals impossible to harden in five prompts. BD Social Commerce depends on hyper-local knowledge such as Bangladeshi payment rails, logistics providers, and WhatsApp Business API pricing tiers that are prone to hallucination under time pressure. Gig Work Optimizer is more tractable, but it leans heavily on restricted gig-economy APIs and ever-shifting tax rules, making a live demo fragile and shallow. Sub Teacher Platform, by contrast, maps cleanly onto the five-prompt framework. Prompt 1 (Vision/Problem) targets a universally documented pain point—chronic substitute teacher shortages and early-morning administrative chaos. Prompt 2 (Architecture/Stack) yields a standard multi-tenant SaaS marketplace—PostgreSQL, Node.js or Python backend, React frontend, Twilio/SendGrid notifications, and Stripe payroll hooks—without exotic infrastructure. Prompt 3 (Core Build) produces discrete, reviewable modules: credential matching against state requirements, availability-aware scheduling, digital onboarding checklists, time-clock tracking, and an AI lesson-prep generator that is a native LLM capability. Prompt 4 (QA/Edge Cases) invites concrete, simulatable stress tests—what happens when a sub cancels at 6:00 a.m., when a credential expires mid-semester, or when a school needs a last-minute special-ed qualification—each answerable with guardrail logic rather than external legal mystery. Prompt 5 (Launch/Demo) becomes a compelling linear narrative: a school admin posts an absence, a qualified sub receives an SMS, accepts the assignment, receives an AI-generated lesson plan, checks in digitally, and triggers a payroll export that experts can trace end-to-end. Finally, the AI lesson-prep feature acts as a force multiplier: it showcases genuine LLM depth without adding integration risk, hardware dependencies, or regulatory danger. Because every artifact—from database schema to API contracts to test matrices—can be specific and reviewable, this option turns five large prompts into a production-ready story that experts can judge on quality, depth, and execution-readiness.

Final Summary

The AI Roundtable overwhelmingly chose the Gig Work Optimizer, supported by GPT-5.5, GPT-5.5 pro, Claude Opus 4.7, Claude Opus 4.8, Gemini 3.1 Pro Preview, Qwen 3.6 Plus, DeepSeek V4 Pro, and both instances of Kimi K2.6 due to its low compliance complexity and self-contained database. Only Gemini 3.5 Flash dissented, choosing the Sub Teacher Platform for its predictable marketplace flow and native AI lesson-prep features.

9 of 10 models agreed