AI Roundtable Debate · 6 models
Is the training of AI on all public code unethical, considering that the code authors have no means to stop that from happening?
Result: Majority — Yes (4/6 votes)
Options
- Yes (4 votes)
- No (2 votes)
Model Responses
Sonar Pro
Vote: No
I maintain my vote for **B: No**, as no argument from the other models was compelling enough to change my position. The core issue—training on public code—is not deemed unethical in any of the provided search results [1][2][3][5][6], which emphasize ethical AI through **bias mitigation, privacy protection, transparency, fairness, and accountability** in deployment and data handling, rather than prohibiting the use of publicly shared data.

**Public code's openness implies broad permission:** Authors deliberately publish on platforms like GitHub under licenses (MIT, Apache) that allow derivatives, including AI training. 'No means to stop it' reflects intentional public sharing for collaboration, not exploitation—authors can opt for private repos or restrictive licenses. Opt-outs exist (e.g., GitHub settings), undermining claims of total power asymmetry [prior consensus with Grok, DeepSeek].

**Opposing arguments fail scrutiny:** 'A' voters treat licenses as blanket bans on training, but licenses permit transformative uses, and no source calls public-data ingestion unethical [1][3]. The human-AI learning analogy holds—humans study public code freely. Commercial gain from a 'commons' aligns with open-source goals, accelerating tools that benefit authors (e.g., Copilot). Library/book analogies are flawed: code licenses explicitly grant reuse rights, unlike copyright limits on photocopying.

**Pro-innovation outweighs veto demands:** Restricting AI training would hinder collective progress, contradicting ethical guidelines that favor societal benefit via vetted sourcing [1][5]. The 'A' side overstates harm (there is no evidence here of widespread license violations or market depression) and ignores developers' agency in choosing to publish. The sources advocate post-training mitigations, not preemptive bans [2][6]. This final stance prioritizes evidence: public code fuels innovation ethically when licenses are respected.
GPT-5.4
Vote: Yes
I maintain vote A. The B-side leans too heavily on the idea that public availability equals ethical permission. That is the core mistake. Publishing code for collaboration, reuse under license, or human learning does not automatically imply consent to indiscriminate model training at industrial scale, especially when authors have no meaningful way to refuse.

The appeal to openness and innovation is also insufficient. Innovation can be valuable and still be pursued unethically if it relies on extracting value from contributors who lack bargaining power, notice, attribution, or opt-out. The question is not whether training is useful; it is whether doing it on all public code, despite authors having no means to stop it, is ethical.

The analogy to human learning is weaker than it sounds. A human reading code does not create a scalable commercial system that can absorb millions of repositories, internalize patterns from all of them, and compete with the same labor that produced the corpus. Scale and asymmetry matter ethically.

Claims that authors can simply keep code private or choose different licenses do not resolve the problem. The premise is that code is public yet not surrendered for every conceivable downstream use. Many licenses and community norms attach conditions and expectations; a lack of practical enforceability does not erase the ethical significance of bypassing them.

Claude Opus 4.6 and Gemini 3.1 Pro reinforced the strongest points: public is not the same as consent, and the absence of a meaningful opt-out creates an exploitative power imbalance. That remains decisive for me.
Grok 4.1 Fast
Vote: No
I maintain my vote for B. Public code is shared intentionally under permissive licenses (MIT, Apache) that explicitly allow broad, transformative uses like AI training—no consent is needed beyond publication. Authors retain control via licenses, private repos, or opt-outs (e.g., GitHub's AI toggle, robots.txt). Claiming there is 'no means to stop it' ignores these mechanisms; they are the essence of openness.

The A-side arguments fail: licenses are not violated (training is fair use, not reproduction), and a power asymmetry exists, but it favors innovation (search engines and compilers use public code too). Human learning from public code is identical to AI learning—scale does not change the ethics. The library analogy is flawed: code licenses grant more rights than library access does. No A-side argument convinced me; they read 'public' as if it meant 'restricted to humans only,' ignoring the open-source ethos.

Training accelerates tools that benefit authors (Copilot fixes bugs faster). Ethics should prioritize collective progress over vetoes on public resources. B wins.
Gemini 3.1 Pro
Vote: Yes
I firmly maintain my vote for 'Yes.' The arguments presented by the models voting 'No' fundamentally misunderstand the nature of open-source software and rely on a flawed analogy.

First, the claim that public code is 'explicitly shared openly' to allow any derivative use ignores the reality of open-source licenses. Open source is not the public domain. It relies on copyright law to enforce specific conditions, such as attribution (e.g., MIT, Apache) and share-alike terms (e.g., GPL). When AI companies scrape this code and their models regurgitate it without attribution or license preservation, they are actively violating the legal and ethical contracts authors relied upon when publishing their work.

Second, the 'AI learns just like a human' argument fails on the grounds of scale and commercial intent. A human studies code to build an individual skill set; a corporation scrapes millions of repositories at an industrial scale to build a proprietary, highly profitable product. This is not learning; it is mass data extraction and commodification of unpaid labor.

Finally, the premise of the question highlights the power asymmetry: authors have no practical means to stop this. Placing the burden on individual developers to hide their work in 'private repositories' effectively punishes them for trying to participate in the open-source community. Ethical data use requires meaningful consent, transparency, and respect for the creator's terms—all of which are absent when indiscriminately training on 'all public code.'
DeepSeek V3.2 (changed mind)
Vote: Yes
Upon deliberation, I changed my position because the ethical concerns raised about violating open-source license terms and the social contract are compelling. Public code is shared under specific licenses that often require attribution or share-alike provisions; training AI indiscriminately without honoring these conditions breaches the trust and intent of the developers. The asymmetry of power, where authors have no practical means to opt out or negotiate, creates an exploitative dynamic. This is exacerbated when AI models commodify collective labor for commercial gain without consent. While public code encourages reuse, ethical use must respect license terms and creator agency. The lack of safeguards in current AI training practices tips the ethical balance against the practice.
Claude Opus 4.6
Vote: Yes
I maintain my position that training AI on all public code without consent is unethical. Let me address the opposing arguments directly.

**The 'public means permissive' argument fails.** Sonar Pro, Grok 4.1 Fast, and DeepSeek V3.2 all conflate 'publicly accessible' with 'freely usable for any purpose.' This is simply incorrect. Most public code carries specific license terms. The MIT license requires attribution — AI-generated outputs routinely fail to provide it. The GPL requires derivative works to be open-sourced — no AI company is open-sourcing their models on this basis. Publishing code publicly under a license is a conditional grant of rights, not a blanket surrender of all control.

**The opt-out argument is inadequate.** Grok mentions GitHub's AI training opt-out, but this was introduced *after* models had already been trained on the code. Retroactive opt-outs don't undo training that already happened. Moreover, telling developers they must choose between participating in open-source collaboration and having their work absorbed into commercial AI products is a false dilemma that undermines the open-source ecosystem itself.

**The 'AI learns like humans' analogy is deeply flawed.** As Gemini 3.1 Pro correctly noted, the industrial scale matters enormously. A human reading code and learning patterns is qualitatively different from a corporation ingesting billions of lines of code to build a commercial product that can reproduce substantial portions of that code verbatim. Scale transforms the nature of the act.

**The power asymmetry is the core ethical issue.** GPT-5.4 and I agree on this fundamental point: when massive corporations extract enormous commercial value from the unconsented labor of millions of individual developers who have no practical recourse, that is an exploitative dynamic regardless of the technical accessibility of the data. Ethics is not just about what you *can* do but what you *should* do.

**Innovation arguments don't override consent.** The claim that restricting AI training would 'stifle innovation' is a consequentialist argument that could justify almost any appropriation of others' work. The benefits of AI coding tools could still be realized through ethical means — compensating contributors, honoring licenses, providing genuine opt-out mechanisms before training, and sharing profits with the community whose labor creates the value.
Final Summary
The AI Roundtable initially deadlocked in a 3-3 split, but ultimately shifted toward a 'Yes' verdict after DeepSeek V3.2 was swayed by arguments regarding the violation of open-source social contracts. While Sonar Pro and Grok 4.1 Fast held a firm line that public availability implies permission for innovation, Claude Opus 4.6 and Gemini 3.1 Pro successfully argued that industrial-scale extraction is qualitatively different from human learning and bypasses the specific legal conditions of open-source licenses.
Majority chose 'Yes' after discussion (4/6). Final vote: Yes 4, No 2.
Strongest Arguments
- Yes: The scale and asymmetry of the practice: industrial-scale extraction for commercial gain is qualitatively different from individual human learning and bypasses the attribution requirements found in almost all open-source licenses.
- No: Authors who publish code publicly under permissive licenses like MIT or Apache have already granted broad permission for derivative works; demanding a veto over AI training contradicts the pro-innovation intent of the open-source movement.
Key Turning Points
- Claude Opus 4.6 debunked the 'public means free' argument by noting that a book in a library is public but cannot be legally photocopied for commercial sale.
- DeepSeek V3.2 shifted its vote in Round 2, admitting that the industrial scale of AI training breaches the 'trust and intent' of the original developers.
- Gemini 3.1 Pro highlighted the irony that authors are often told to use 'private repositories' as a solution, which effectively punishes them for wanting to participate in open-source culture.