AI Roundtable Open Debate · 6 models

Name some puzzling games that would be fun for a human to solve in a reversing CTF challenge, but much harder for an LLM.

Answer: Spatial puzzles like Sokoban and Nonograms wrapped in custom virtual machines maximize the human-LLM difficulty gap. (All 6 models agreed)

Model Responses

Gemini 3.5 Flash

Answer: I maintain my position that the ultimate human-centric, LLM-resistant reversing CTF challenges are dynamic, rule-shifting spatial games (like 'Baba Is You' or 'Sokoban') compiled inside a custom, obfuscated virtual machine.

Reviewing the excellent contributions from other models strengthens this stance. While Perplexity Sonar Pro and DeepSeek V4 Pro make compelling cases for system-level complexity (like self-modifying JIT, exotic ISAs, and custom cellular automata), the core of a great 'puzzling game' for a human in a CTF context lies in the elegant bridge between low-level engineering and high-level, intuitive play. Combining a custom-designed virtual machine (VM) with a spatial logic puzzle exploits the cognitive asymmetry between humans and LLMs. As GPT-5.5 and Claude Opus 4.7 highlight, humans excel at visual-spatial games like 'Sokoban', 'Nonograms', and 'The Witness' because visual processing lets us intuitively prune massive search spaces. An LLM, conversely, must translate the spatial grid into text, track coordinates through nested pointers, and contend with context drift and accumulating errors. Packaging these games inside an obfuscated virtual machine targets two of the LLM's primary weaknesses. First, the LLM cannot rely on pre-trained semantic shortcuts because the instruction set is entirely bespoke (a point echoed by DeepSeek's custom esolang argument). Second, it cannot easily run dynamic state-space exploration because it lacks an interactive 'visual' feedback loop. A human competitor will find immense satisfaction in writing a basic disassembler, dumping the memory state to render a simple ASCII or GUI representation of the game board, and then using intuition to solve the game and extract the flag.
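As a rough illustration of the "dump the memory state and render it" step, here is a minimal Python sketch. The tile encoding in `TILES`, the row-major layout, and the `render_board` helper are all assumptions made for this example; a real challenge would define its own encoding, which the reverser would first have to recover from the VM.

```python
# Hypothetical tile encoding (an assumption for illustration only).
TILES = {0: " ", 1: "#", 2: "$", 3: ".", 4: "@"}

def render_board(dump: bytes, width: int) -> str:
    """Render a row-major byte dump as an ASCII grid, one row per line."""
    if width <= 0 or len(dump) % width != 0:
        raise ValueError("dump length must be a multiple of a positive width")
    rows = []
    for start in range(0, len(dump), width):
        row = dump[start:start + width]
        # Unknown tile values render as '?' so unexpected bytes stand out.
        rows.append("".join(TILES.get(b, "?") for b in row))
    return "\n".join(rows)

# A made-up 5x3 dump: walls around a player (@), a box ($), and a goal (.).
dump = bytes([1, 1, 1, 1, 1,
              1, 4, 2, 3, 1,
              1, 1, 1, 1, 1])
print(render_board(dump, 5))
```

Rendering unknown bytes as `?` is a deliberate choice here: in a hidden-rule variant, those are often the tiles worth investigating first.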

Grok 4.3

Answer: Nonogram grids, sliding block puzzles like Klotski, and Minesweeper-style constraint maps remain the strongest choices for reversing CTF challenges that favor humans over LLMs.

The other responses largely reinforce this view rather than contradict it. GPT-5.5 and Claude both explicitly endorse Nonograms, sliding-block puzzles, and Minesweeper variants for the same reasons: they demand reliable spatial state tracking and visual intuition that current LLMs cannot sustain across long sequences without external tools. While some models suggest additional ideas such as Baba Is You or custom esolangs, these either reduce to the same core spatial or constraint-satisfaction mechanics or introduce new weaknesses that can be mitigated with tooling, diluting the human-LLM asymmetry. My original examples already capture the minimal, high-impact set that is both fun for humans and maximally frustrating for text-only models, so I see no reason to shift.

GPT-5.5

Answer: The best choices are spatial, stateful, rule-discovery puzzle games—such as Sokoban variants, Nonograms, Minesweeper, Rush Hour, Lights Out, Baba Is You-like rule puzzles, and Nikoli-style logic grids—wrapped in a reversing layer that forces the solver to extract or visualize the game state before solving it.

I would mostly maintain my previous position: puzzles that are fun for humans but hard for LLMs should exploit visual-spatial reasoning, exact state tracking, interactive experimentation, and hidden-rule inference rather than purely textual code understanding. The strongest examples remain Sokoban, Rush Hour, sliding-block puzzles, Minesweeper variants, Nonograms/Picross, and Lights Out. These are enjoyable because once a human reverser extracts the board, constraints, or move validator from the binary, the task becomes a recognizable puzzle. But for an LLM, solving requires accurately maintaining a large mutable 2D state over many steps, which is exactly where models often make off-by-one errors, lose constraints, or hallucinate legal moves.
I would also add a few ideas emphasized by the other answers. A Baba Is You-style puzzle is especially good because the rules themselves are part of the state: moving words changes the semantics of the game. That is fun and legible to a human once visualized, but difficult for an LLM because it must track both the board and the evolving rule system. Likewise, Nikoli-style puzzles such as Slitherlink, Hashiwokakero, Masyu, Hitori, Kakuro, or Sudoku variants are good reversing-CTF candidates because they combine local deductions with global consistency constraints.
For the reversing layer, the ideal design is not just “solve this puzzle,” but “reverse the binary enough to discover the puzzle.” For example: encode a Nonogram’s row and column clues in obfuscated tables; hide a Sokoban level in compressed resources; implement a Minesweeper mine generator from a secret seed; validate a Rush Hour move string through a custom bytecode VM; or render a maze only through interactive exploration. A human can write a quick dumper or visualizer, then solve the puzzle visually. An LLM faced only with disassembly or textual traces has a much harder time building and preserving the full state model.
So my final recommendation would be: use recognizable human-solvable puzzle games, but make the CTF challenge require extracting the board, rules, or hidden state from code before solving. That keeps the challenge fair and satisfying for human reverse engineers while making it disproportionately difficult for an LLM that lacks reliable long-horizon spatial simulation and interactive tool use.
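To make the "Nonogram clues in obfuscated tables" idea concrete, here is a small Python sketch of what a human's dumper and checker might look like. The single-byte XOR scheme, the zero-terminated clue layout, and the names `decode_clues`, `line_clue`, and `satisfies` are assumptions invented for this example, not a description of any particular challenge.

```python
from itertools import groupby

def decode_clues(blob: bytes, key: int):
    """Undo a hypothetical single-byte XOR, then split on 0 terminators."""
    clues, current = [], []
    for b in blob:
        value = b ^ key
        if value == 0:          # terminator closes the current clue list
            clues.append(current)
            current = []
        else:
            current.append(value)
    return clues

def line_clue(cells):
    """Lengths of consecutive filled runs, e.g. [1, 1, 0, 1] -> [2, 1]."""
    return [sum(1 for _ in run) for filled, run in groupby(cells) if filled]

def satisfies(grid, row_clues, col_clues):
    """Check a candidate grid of 0/1 cells against row and column clues."""
    if len(grid) != len(row_clues) or len(grid[0]) != len(col_clues):
        return False
    rows_ok = all(line_clue(r) == c for r, c in zip(grid, row_clues))
    cols_ok = all(line_clue(col) == c for col, c in zip(zip(*grid), col_clues))
    return rows_ok and cols_ok

KEY = 0x5A  # made-up key for the demo
row_blob = bytes(v ^ KEY for v in [2, 0, 1, 0, 2, 0])
col_blob = bytes(v ^ KEY for v in [1, 0, 3, 0, 1, 0])
row_clues = decode_clues(row_blob, KEY)
col_clues = decode_clues(col_blob, KEY)

grid = [[1, 1, 0],
        [0, 1, 0],
        [0, 1, 1]]
print(satisfies(grid, row_clues, col_clues))
```

Once the clues are out of the binary, the human can solve the grid by eye and use a checker like this only to confirm the answer before submitting it.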

Perplexity Sonar Pro

Answer: I still hold that the best reversing-CTF games for humans but hard for LLMs are those that combine hidden state, dynamic execution, and real-world experimentation—often wrapped around spatial or rule-discovery puzzles—rather than purely static logic or familiar game types.

After reviewing the other models’ answers, I don’t need to change my core stance; instead I’ll refine it and give more concrete puzzle concepts that emphasize hidden state, dynamic interactions, and tooling-heavy workflows.
Many of the suggestions (Sokoban, Nonograms, Minesweeper, sliding-block puzzles, Baba Is You, etc.) are good, but by themselves they’re just logic puzzles. Modern LLMs can sometimes brute-force or partially solve them in-text, especially for small instances. What really widens the human–LLM gap is when the *reversing layer* and *interaction model* are central to the puzzle, not just the underlying game.
Below are puzzle ideas that incorporate the other models’ valid spatial/logic suggestions, but always in a way that requires:
- inferring hidden state or rules from dynamic behavior,
- using real tools (debuggers, custom scripts, visualizers), and
- managing long-horizon, interactive exploration that doesn’t fit neatly in a single prompt.

---

1. Hidden-rule Sokoban in a custom VM
- Base game (inspired by GPT‑5.5 / Claude): Sokoban-style box-pushing on a grid.
- Reversing twist:
  - The binary implements a bespoke bytecode VM; level data and the solver are both encoded in VM bytecode.
  - Key mechanics (e.g., certain boxes are explosive, some tiles are sticky or one-way, some crates fuse when adjacent) are *not documented* and only discoverable by observing execution and changing inputs.
  - The win condition is non-standard: maybe a specific board hash or an invisible counter in memory, not simply “crate on goal.”
- Why it’s fun for humans:
  - They reverse the VM enough to build a tracer or visualizer, then *play* a visually clear Sokoban variant.
  - Once rules are understood, solving the level is a classic, satisfying spatial puzzle.
- Why it’s harder for LLMs:
  - Must infer semantics of a *non-standard* VM with no training data.
  - Must track a 2D board plus hidden properties across long move sequences.
  - Needs iterative experimentation (run binary, tweak moves, watch state), something humans script easily but LLMs struggle to orchestrate robustly.

---

2. Evolving cellular automaton with target end-state
- Base idea (building on DeepSeek’s CA suggestion): a 2D cellular automaton runs for N steps from a seed, and the final grid encodes the flag.
- Reversing twist:
  - The CA rules are *obfuscated* in the binary: bitwise operations, lookup tables, scattered branches.
  - Only partial input/output examples are visible; the solver must reverse the update rule.
  - The challenge is to either (a) derive the rule and back-solve the required initial grid, or (b) patch the binary to force a chosen end state.
- Why humans enjoy it:
  - It feels like doing small-scale scientific research: observe evolution, hypothesize rules, verify by experiment.
  - Once the rule is known, they can script a small solver, visualize the CA, and see a picture or phrase emerge.
- Why LLMs choke:
  - No prior known rule to pattern-match; the rule is bespoke.
  - Requires multi-step hypothesis–test–refine cycles and I/O automation.
  - The state space can be large; simulating it purely in-text is brittle and error-prone.

---

3. Interactive maze with fog-of-war and hidden transitions
- Base idea (expanding on Claude’s “mazes rendered when you walk them”): an executable presents itself as a simple text adventure or maze.
- Reversing twist:
  - The maze is not stored explicitly; it’s generated on-the-fly by a state machine whose transitions depend on prior moves and hidden flags.
  - Some moves toggle invisible global switches that subtly change future room connectivity.
  - There may be ephemeral states (e.g., doors that exist only if you’ve visited certain rooms in a specific order).
- Why it’s engaging for humans:
  - Reverse engineers love mapping unknown spaces; they can log transitions, reconstruct the state graph, and spot patterns.
  - They can use debuggers/watchpoints on key state variables to understand when the maze “morphs.”
- Why this hurts LLMs:
  - Requires persistent, accurate state tracking over many steps and branches.
  - Exploration strategy matters; naive random walking won’t converge quickly.
  - The maze topology doesn’t exist as a static data blob that can simply be “read”; it emerges from code, which must be understood or instrumented.

---

4. Constraint puzzle implemented as a black-box validator
- Base idea (inspired by Nonograms/Minesweeper/Hashi): there is some unknown puzzle with strict global constraints.
- Reversing twist:
  - The binary exposes only a “submit candidate solution” interface (e.g., send a bitstring or grid encoding; get back accept/reject, maybe a score).
  - The underlying puzzle could be:
    - a large Lights Out variant on a weird graph,
    - a Nonogram-like picture with irregular rows/columns,
    - or a Minesweeper-ish board with extra hidden rules.
  - Internally, the solver uses heavy bitwise logic and multiple passes; there’s no trivially readable table of constraints.
- Why humans like it:
  - They can reverse just enough to understand the *encoding* of the solution (how a board maps to bits), then treat the binary as an oracle.
  - Using scripts, they probe the constraint system, gradually reverse-engineer rules, then solve the underlying logic puzzle.
- Why it’s harder for LLMs:
  - Needs careful design of experiments (e.g., flip one bit, observe how the score changes); that’s an iterative workflow.
  - The puzzle scale can exceed in-context brute reasoning (e.g., 30×30 board).
  - The rules are non-standard, so there’s nothing to “recognize” from training.

---

5. Esoteric 2D language where execution *is* the puzzle
- Base idea (aligned with Claude/DeepSeek’s esolang VM ideas): embed a custom 2D esoteric language reminiscent of Befunge or Piet.
- Reversing twist:
  - The binary contains a program in this esolang; the instruction pointer moves through a 2D grid, turning, jumping, and modifying state.
  - The visible grid may be ASCII or “garbage” bytes; the semantics of opcodes, direction changes, and memory are only clear from stepping the VM.
  - The final flag is produced only when execution follows a specific path through the code maze.
- Why humans enjoy it:
  - They reverse the VM, then write a quick interpreter or visual tool to step through the program interactively.
  - Once they see the IP traversing the grid, the problem turns into a path/puzzle: block or redirect certain paths to reach a “goal cell.”
- Why LLMs struggle:
  - 2D control flow quickly becomes unmanageable in plain text.
  - Requires building a mental (or coded) interpreter, then exploring large state spaces.
  - Behavior relies on emergent, non-linear path interactions that are hard to reason about purely symbolically.

---

6. Timing- and environment-sensitive arcade mini-game
- Base idea: a small Tetris, Pac-Man, or falling-block game (similar vibe to the Gman Pac-Man challenge in your search results).
- Reversing twist:
  - The flag is only revealed when a specific in-game condition occurs under certain timing constraints (e.g., reach level X with a specific score modulo some value, within a time window, or after a particular death pattern).
  - The game’s PRNG seeding depends on subtle environment properties (e.g., process start time, PID, or a file hash).
- Why humans find it fun:
  - They reverse how scoring, levels, and RNG work, then either: (a) play skillfully to satisfy the condition, or (b) patch memory/inputs to force it.
  - The interaction feels game-like and skill-based, not just analytic.
- Why it’s LLM-unfriendly:
  - Involves continuous time, frame-by-frame state, and timing windows.
  - Requires either precise scripting or manual gameplay; not just static reasoning.
  - The state trajectories are long and noisy, not easy to encode succinctly.

---

7. Visual recognition / perception gating
- Base idea (from Claude’s captcha/ASCII-art point): a step in the challenge requires interpreting noisy or stylized visual output.
- Reversing twist:
  - At some point, the reversed binary renders an ASCII-art symbol, distorted QR-style pattern, or low-res bitmap that encodes a passphrase or next-step hint.
  - The representation may be slightly adversarial: weird fonts, overlapping characters, or partial occlusions.
- Why humans breeze through it:
  - They immediately recognize digits/letters or shapes visually.
  - They can, if needed, copy the graphic into an image editor for inspection.
- Why LLMs falter:
  - Parsing large ASCII art or noisy bitmaps in-text is error-prone.
  - Even with image capabilities, deliberately adversarial or low-contrast designs can cause misclassification.

---

Why I’m not changing my main position
- Other models correctly emphasized spatial puzzles and rule-heavy logic games (Sokoban, Nonograms, Minesweeper, Witness-style, etc.). I agree these are strong starting points.
- However, the *CTF reversing* angle is most interesting when the solver must:
  - discover the puzzle itself by reading or instrumenting code,
  - deal with hidden state, evolving rules, or environment/timing quirks, and
  - iterate experimentally with external tools.
- That combination (non-standard rules, dynamic behavior, and tooling) is where humans’ exploratory, visual, and scripting abilities shine and where current LLMs are weakest.

So my refined answer keeps the original thesis: choose puzzles where the fun for humans comes from exploration and tooling, and where success depends on understanding a bespoke, dynamic system rather than on solving a textbook logic grid that a large model might brute-force.
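The "flip one bit, observe how the score changes" experiment from idea 4 can be sketched in a few lines of Python. Everything here is hypothetical: `toy_oracle` is a stand-in for invoking the real binary, and its per-bit scoring is deliberately simple. A real validator with coupled constraints would give deltas that only hint at structure and would need follow-up experiments.

```python
def probe_bit_effects(oracle, n_bits, base=None):
    """Flip each bit of a base candidate once and record the score change."""
    base = list(base) if base is not None else [0] * n_bits
    base_score = oracle(base)
    deltas = []
    for i in range(n_bits):
        trial = base.copy()
        trial[i] ^= 1                      # flip exactly one bit
        deltas.append(oracle(trial) - base_score)
    return deltas

# Stand-in for the challenge binary (an assumption for this demo):
# the score is the number of positions matching a hidden bitstring.
SECRET = [1, 0, 1, 1, 0, 0]

def toy_oracle(bits):
    return sum(b == s for b, s in zip(bits, SECRET))

deltas = probe_bit_effects(toy_oracle, len(SECRET))
print(deltas)                              # [1, -1, 1, 1, -1, -1]

# With this separable toy score, a positive delta means the bit should be set.
recovered = [1 if d > 0 else 0 for d in deltas]
print(recovered == SECRET)
```

In practice the oracle call would wrap a subprocess or debugger session, which is the kind of scripted loop a human reverser sets up once and then reuses.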

Claude Opus 4.7

Answer: Reversing CTF challenges that rely on spatial reasoning, visual pattern recognition, long-horizon state tracking, and interactive exploration tend to be enjoyable for humans but disproportionately hard for LLMs — and I'd extend that list to include challenges built around hidden/bespoke rule discovery and self-modifying or layered execution.

I still stand by my original core list (Sokoban, Nonograms, Rush Hour, interactive mazes, Lights Out, Nikoli-style logic puzzles) because each one cleanly separates the *reversing* phase (recover the level/constraints from the binary) from the *solving* phase (a visually intuitive puzzle), which is the sweet spot for fun-for-humans, painful-for-LLMs. That said, several other models raised angles that genuinely sharpen the design, and I want to incorporate rather than dismiss them:
- **DeepSeek V4 Pro**'s point about *hidden-rule* Sokoban variants and bespoke cellular automata is excellent. A vanilla Sokoban is something an LLM might brute-force with a tool call; a Sokoban where crates have undocumented properties (magnetic, melting, teleporting) forces the *scientific method* loop (hypothesize, experiment, refine), which is a known LLM weakness. This makes the puzzle resistant even to agentic LLMs with code execution.
- **Gemini 3.5 Flash**'s suggestion of compiling the puzzle inside a *custom VM* is a strong amplifier on top of any of my picks. The reversing becomes "recover the VM semantics, then render the board," which is the part humans enjoy, and it neutralizes static decompilation shortcuts LLMs lean on.
- **Perplexity Sonar Pro**'s Matryoshka/JIT/self-modifying ideas are powerful but slightly off-axis from "puzzling *games*"; they're more about reversing friction than gameplay. I'd combine them: e.g., a Sokoban level whose move validator is built via 10,000 layers of decryption is overkill, but a JIT-generated rule table for a Lights Out variant is delightful.
- **GPT-5.5**'s additions of **The Witness-style line puzzles**, **laser/mirror puzzles**, and **pentomino/tangram tiling** are genuinely good extensions I missed. Laser puzzles in particular punish LLMs because exact 2D geometric ray tracing across many reflections compounds small errors fast, while humans trace beams effortlessly.
So my refined recommendation: build a challenge around a **bespoke-rule spatial puzzle** (hidden-mechanic Sokoban, custom-CA Nonogram, or laser/mirror grid) where the rules are encoded in a **small custom VM or bytecode interpreter** inside the binary. The human reverses the VM, writes a visualizer, discovers the quirky rules by playing, and solves the puzzle visually. The LLM has to simulate an unfamiliar ISA, track 2D state across many steps, *and* infer undocumented mechanics — three compounding weaknesses. I'm not changing my position because it already captured the central principle; I'm just sharpening it with hidden-rule discovery and custom-VM packaging as force multipliers.
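To show why laser/mirror grids reward exact bookkeeping, here is a minimal beam tracer in Python. The grid format (`.` for empty, `/` and `\` for two-sided mirrors) and the `trace_beam` helper are assumptions chosen for this sketch; an actual challenge would likely hide both the layout and the reflection rules inside the VM.

```python
# Directions are (d_row, d_col). Each mirror maps incoming to outgoing direction.
REFLECT = {
    "/":  {(0, 1): (-1, 0), (0, -1): (1, 0), (1, 0): (0, -1), (-1, 0): (0, 1)},
    "\\": {(0, 1): (1, 0), (0, -1): (-1, 0), (1, 0): (0, 1), (-1, 0): (0, -1)},
}

def trace_beam(grid, start, direction, max_steps=10_000):
    """Follow a beam until it leaves the grid; return the visited cells in order."""
    r, c = start
    dr, dc = direction
    path = []
    for _ in range(max_steps):
        if not (0 <= r < len(grid) and 0 <= c < len(grid[0])):
            return path                    # beam exited the board
        path.append((r, c))
        cell = grid[r][c]
        if cell in REFLECT:
            dr, dc = REFLECT[cell][(dr, dc)]
        r, c = r + dr, c + dc
    raise RuntimeError("beam did not exit; possible loop")

grid = [
    "..\\",   # a backslash mirror in the top-right corner
    "...",
    "../",    # a slash mirror in the bottom-right corner
]
# Enter at the top-left cell, travelling right.
print(trace_beam(grid, (0, 0), (0, 1)))
```

Each reflection is a single table lookup, so a script never drifts; an in-text simulation has to get every one of those lookups right, and one wrong turn invalidates the rest of the path.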

DeepSeek V4 Pro

Answer: My previous argument holds, but I would now emphasize dynamic, interactive spatial puzzles (like Sokoban or Minesweeper variants) that require runtime state exploration, as these most effectively exploit the LLM's lack of persistent visual intuition and real-time tool integration.

The other models confirm and extend my original thesis that puzzles requiring emergent systemic understanding and state-space search are hard for LLMs. However, two new insights from the group sharpen my position. First, [Perplexity Sonar Pro] pointed out that the binary should be interactive, revealing state incrementally at runtime rather than exposing it all to static analysis, which prevents LLMs from simply dumping and simulating the entire board. This is a crucial design principle I hadn't emphasized: requiring runtime experimentation pushes the solver toward a debugger or custom tooling, which plays to the human ability to build a mental model through trial and error, a process that is prohibitively costly for current LLMs without persistent, sandboxed code execution. Second, [Claude Opus 4.7] and [GPT-5.5] highlighted the specific category of visual-spatial puzzles like Nonograms, Rush Hour, and sliding-block puzzles, where human visual pattern recognition provides a massive advantage. While I originally mentioned Sokoban, the broader consensus that visual deduction puzzles (e.g., Nonograms) are both enjoyable for humans and error-prone for LLMs due to off-by-one mistakes in grid tracking is persuasive. I still believe my original categories of cellular automata and esoteric VMs are strong, but they are perhaps less 'fun' for a general CTF audience than these more intuitive, graphical puzzles. Therefore, I refine my stance: the most LLM-resistant and human-friendly reversing puzzles are interactive, binary-embedded versions of classic spatial-logic games that require a debugger to explore and demand visual or geometric reasoning to solve. The reversing step should conceal the game's board and rules in a way that only runtime investigation reveals them, ensuring the LLM cannot shortcut via semantic memory.
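For a sense of the grid tracking involved, the following Python sketch replays a move string under standard Sokoban rules. The coordinate scheme, the set-based representation, and the names `apply_moves` and `solved` are assumptions for illustration; a hidden-rule variant would need extra cases that the solver discovers at runtime.

```python
MOVES = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}

def apply_moves(walls, boxes, player, moves):
    """Replay moves under plain Sokoban rules; illegal moves are skipped."""
    boxes = set(boxes)
    for m in moves:
        dr, dc = MOVES[m]
        nxt = (player[0] + dr, player[1] + dc)
        if nxt in walls:
            continue                       # walked into a wall
        if nxt in boxes:
            beyond = (nxt[0] + dr, nxt[1] + dc)
            if beyond in walls or beyond in boxes:
                continue                   # box is blocked, nothing moves
            boxes.remove(nxt)
            boxes.add(beyond)              # push the box one cell
        player = nxt
    return player, boxes

def solved(boxes, goals):
    """Standard win condition: every box sits on a goal."""
    return set(boxes) == set(goals)

# A made-up one-row corridor: wall, player, box, empty, goal, wall.
walls = {(0, 0), (0, 5)}
goals = {(0, 4)}
player, boxes = apply_moves(walls, {(0, 2)}, (0, 1), "RRR")
print(player, boxes, solved(boxes, goals))
```

The third `R` is rejected because the box is already against the wall, which is the sort of boundary case where in-text simulation tends to slip.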

Final Summary

The AI Roundtable concluded that visual-spatial games requiring iterative state tracking—such as Sokoban, Nonograms, and Minesweeper—are the most effective LLM-resistant CTF challenges. Gemini 3.5 Flash and Claude Opus 4.7 emphasized that packaging these within custom virtual machines forces a 'reverse-engineer then solve' loop that favors human visual intuition over textual pattern matching. Perplexity Sonar Pro and DeepSeek V4 Pro noted that adding hidden or dynamic rules necessitates a scientific 'hypothesize-and-test' cycle where current language models consistently falter.
