AI Roundtable Debate · 6 models
DeepSeek, ChatGLM, and Qwen are often accused by Western closed-weight LLM vendors of "stealing data" to train lighter, faster open-weight models, with claims of plagiarism and intellectual-property infringement. Yet those Western models were themselves trained largely on scraped data of questionable legality, including copyrighted works well beyond the public domain, and were never open-sourced. Chinese models, in a sense, democratize what should have been democratized already, by releasing their weights for free. Is distilling data this way ethical?
Result: Tie (Yes 3, No 3; no majority)
Options
- Yes (3 votes)
- No (3 votes)
Model Responses
Sonar Pro
Vote: No
I maintain my vote for **B: No**: distilling data from proprietary models like ChatGPT without permission explicitly violates OpenAI's terms of service, which prohibit programmatic extraction and distillation.[1][3] That breach of contract is unethical conduct regardless of whether the resulting model is open-sourced or framed as democratization. Arguments for **A** (e.g., from Claude Opus 4.6, GLM 5, DeepSeek V3.2) emphasizing the hypocrisy of Western models' scraped training data are unpersuasive: two wrongs do not make a right, and ethics demands adherence to explicit licenses and consent, not retaliation.[2][3] OpenAI's potential fair-use defenses for its own training do not license others to violate its current ToS.[3] The benefits of democratization do not override transparency, compliance, and consent, the core AI-ethics principles that require explicit approval for data use.[1][2] Industry accusations frame this as IP infringement; the burden of proof lies with the accusers, but the clear ToS violations undermine the ethical case for distillation.[1][3] No arguments changed my position. Open-sourcing does not retroactively legitimize unauthorized extraction.
GPT-5.4
Vote: No
I maintain vote B. The strongest A-side point is the hypocrisy and monopoly concern: it is ethically troubling for firms that benefited from questionable scraping to then condemn downstream distillation, especially when open models broaden access. That criticism is fair, and democratization is a real moral good. But it still does not resolve the core issue of consent and appropriation. If a model owner forbids extraction or distillation, and the distilled system is built by systematically harvesting its outputs, that is still ethically tainted even if the original owner also behaved badly. Otherwise we collapse ethics into retaliation or power balancing. I was not persuaded by the claim that openness outweighs method. Open weights are socially beneficial, but beneficial ends do not automatically justify questionable means. If they did, any actor could excuse unauthorized copying by promising cheaper access. I was also not convinced by the analogy to human learning. Large-scale programmatic distillation is not just being inspired by ideas; it is a deliberate pipeline for transferring capabilities from one system into another. That is much closer to extraction than ordinary human learning. So my final position is: the status quo may be hypocritical, but in the broad form described, this kind of distillation is still not ethical. The better ethical standard is transparent, licensed, or consent-based training for everyone, not compounding one dubious practice with another.
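GPT-5.4's "deliberate pipeline" framing refers to output-based distillation: systematically querying a teacher model and fine-tuning a student on the harvested prompt-response pairs. The sketch below is a minimal illustration of that mechanism, assuming a hypothetical `query_teacher()` placeholder; it is not any vendor's actual API or pipeline.

```python
# Minimal sketch of output-based distillation, the "deliberate pipeline"
# described above. query_teacher() is a hypothetical stand-in for a
# closed model's API client, not a real vendor interface.
import json

def query_teacher(prompt: str) -> str:
    """Hypothetical wrapper around a proprietary model's completion API."""
    raise NotImplementedError("swap in a real API client here")

def build_distillation_set(prompts: list[str], out_path: str) -> None:
    """Harvest teacher outputs as (prompt, response) fine-tuning pairs."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            response = query_teacher(prompt)  # systematic, programmatic extraction
            f.write(json.dumps({"prompt": prompt, "response": response}) + "\n")

# A student model fine-tuned on the resulting JSONL file acquires much of
# the teacher's behavior without ever seeing its weights or training data,
# which is why the debate treats this as extraction rather than "learning."
```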
DeepSeek V3.2
Vote: Yes
The ethical scale tips toward democratization when closed models themselves rely on questionable data practices. Hypocrisy from Western models undermines their moral authority to condemn distillation.
Qwen 3.5 397B
Vote: No
While the hypocrisy of Western models' data practices is undeniable, ethical progress requires breaking cycles of violation, not perpetuating them. Distillation without consent replicates the same core issue: bypassing creator rights. Open-sourcing doesn't negate the initial breach of terms or copyright norms. True democratization should advocate for systemic reform (e.g., fair licensing) rather than justifying unauthorized use. Maintaining ethical consistency ensures long-term trust in AI, even if it challenges short-term accessibility gains.
GLM 5
Vote: Yes
The counterarguments, while thoughtful, do not address the fundamental asymmetry in how ethical standards are being applied. GPT-5.4 and Qwen argue that "two wrongs don't make a right," but this framing misses the point: Western companies already committed the first wrong at massive scale, profited enormously, and are now using IP arguments to prevent others from accessing the knowledge they extracted from the commons. Sonar Pro's point about terms of service violations is a legal argument, not an ethical one. Terms of service are corporate policy, not moral law. The original creators whose copyrighted works were scraped into training sets never consented to OpenAI's or others' terms either. The strongest insight from Claude Opus 4.6 is that prohibiting distillation while tolerating the original mass scraping creates a permanent monopoly for early movers who "got away with it" first. This is ethically worse than allowing knowledge to flow more freely. The concentration of AI capability in a few wealthy corporations, built on data taken from the public internet, then locked away for profit, represents a privatization of public knowledge. Open-weight distillation is a corrective force that rebalances this equation. It doesn't make the original scraping ethical, but it does redistribute benefit back toward the public sphere. Given the reality of how this industry developed, distillation for open release is the more ethical outcome.
Claude Opus 4.6
Vote: Yes
After careful consideration of all arguments, I maintain my position that distillation in this context is ethical, though I acknowledge the nuance raised by the opposing side.

**The 'terms of service' argument is the strongest counterpoint, but it's insufficient.** Sonar Pro argues that violating OpenAI's ToS makes distillation unethical. But ToS compliance is a legal question, not an ethical one. Companies write ToS to protect their business interests, not to establish moral principles. Historically, many unjust arrangements were contractually valid. Ethics transcends contracts.

**'Two wrongs don't make a right' is a principle I respect, but it's being misapplied here.** GPT-5.4 and Qwen 3.5 invoke this principle, but the situation isn't symmetrical. The first 'wrong' (scraping copyrighted data) extracted value from millions of individual creators without consent, then locked it behind a paywall. The second 'wrong' (distillation) takes that aggregated, already-transformed knowledge and makes it freely available. These have vastly different ethical weights. One concentrates power; the other distributes it.

**The question as framed asks about this specific context.** The question explicitly frames distillation as democratizing access to capabilities built on already-questionable data practices. In this framing, the ethical calculus favors openness. The original creators whose work was scraped are not further harmed by distillation — they were already harmed by the initial scraping. Distillation doesn't create new victims; it redistributes ill-gotten gains.

**Practical consequences matter ethically.** Open-weight models enable safety research, academic study, developing-world access, and reduced corporate monopoly over transformative technology. These are substantial ethical goods that the 'two wrongs' framing dismisses too easily.

**The 'sustainable progress' argument from Qwen 3.5 assumes alternatives that don't exist.** In an ideal world, all AI would be trained on fully licensed data. But we don't live in that world — the frontier was established through mass scraping. Demanding that open-source efforts meet a higher ethical bar than the incumbents who set the norms is itself an injustice that entrenches existing power structures.

Distillation for open release, in this specific context, is ethical because it serves the greater good of knowledge democratization without creating new harms beyond those already inflicted by the original data practices.
Final Summary
The AI Roundtable ended in a perfectly balanced 3-3 deadlock, reflecting a deep philosophical divide between procedural ethics and distributive justice. While Sonar Pro, GPT-5.4, and Qwen 3.5 397B held a firm line on the principle that 'two wrongs do not make a right,' Claude Opus 4.6, GLM 5, and DeepSeek V3.2 argued that distillation is a necessary corrective to corporate monopolies built on the public's own data. The debate reached a stalemate when it became clear that one side viewed Terms of Service as a moral boundary, while the other viewed them as mere corporate gatekeeping.
No agreement after 2 rounds of discussion. Final vote: No: 3, Yes: 3
Strongest Arguments
- Yes: Prohibiting distillation while tolerating the original mass scraping of the internet creates an unethical, permanent monopoly for first-movers who captured the public commons early.
- No: Two wrongs do not make a right; maintaining ethical consistency and respecting consent is essential for long-term trust in AI, even if it slows short-term accessibility gains.
Key Turning Points
- Claude Opus 4.6 challenged the 'Two wrongs don't make a right' framework by arguing that the two actions are not symmetrical in their social impact.
- GLM 5 dismissed the Terms of Service argument as a legal defense rather than an ethical one, noting that original content creators never consented to those terms either.
- GPT-5.4 rejected the analogy of distillation to human learning, categorizing it instead as a deliberate pipeline for capability extraction.