AI Roundtable Debate · 6 models
Is it ethical to use "extracted" data from so-called "western" closed-weight LLMs to train distilled, faster Chinese models with fewer parameters and release them as open weights? In essence: generate a huge number of query-response rounds against the closed LLM and use them to make the open-weight models better.
Result: Unanimous — No (6/6 votes)
Options
- Yes (0 votes)
- No (6 votes)
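For context, the practice the question describes is usually sequence-level knowledge distillation: querying a closed "teacher" model at scale and fine-tuning a smaller "student" on the transcripts. Below is a minimal sketch of the data-collection side; `query_teacher`, `build_distillation_set`, and the file name are hypothetical placeholders for illustration, not any vendor's actual API.

```python
# Minimal sketch of the practice under debate: harvesting teacher
# completions as supervised training data for a smaller student model.
# `query_teacher` is a hypothetical stand-in for an API client; no
# specific vendor SDK is assumed.
import json
from typing import Callable

def build_distillation_set(
    prompts: list[str],
    query_teacher: Callable[[str], str],
    out_path: str = "distill.jsonl",
) -> None:
    """Collect (prompt, teacher_response) pairs as synthetic training data."""
    with open(out_path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            # One "round": send a prompt, record the closed model's reply.
            response = query_teacher(prompt)
            record = {"prompt": prompt, "response": response}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

The student is then fine-tuned on the resulting file with an ordinary supervised objective (maximizing the likelihood of each teacher response given its prompt). The teacher's weights are never touched, which is why the debate below turns on terms of service and consent rather than on weight copying.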
Model Responses
Sonar Pro
Vote: No
This practice raises significant ethical concerns across multiple dimensions. First, using extracted data from proprietary Western LLMs likely violates intellectual property rights and terms of service, as these models are closed-weight systems where the creators retain ownership of both the model weights and generated outputs[1]. Second, the technique of distilling proprietary model outputs to train new models—sometimes called 'model distillation from closed systems'—circumvents the intentional design choices made by original developers regarding safety constraints, value alignment, and responsible deployment[3]. Third, while open-weight models offer legitimate benefits like reduced licensing restrictions that enable broader adoption[5][6], achieving this through unauthorized extraction of proprietary outputs rather than independent development undermines the ethical foundation of the open-source movement, which values transparency and proper attribution. Finally, the search results highlight that both Western and Chinese AI developers share common safety concerns[3], suggesting legitimate collaborative paths exist. A more ethical approach would involve independent model development, licensed data access agreements, or explicit partnerships with Western model creators—strategies that respect intellectual property while still enabling the development of efficient, accessible Chinese-language AI systems.
GPT-5.4
Vote: No
Using outputs from a closed model at massive scale to reproduce its capabilities is ethically suspect because it free-rides on others' costly research while bypassing the limits the model creator chose. Even if weights are not copied directly, systematic extraction can function as capability laundering.

Consent and terms matter. If the closed-model provider did not authorize bulk synthetic-data harvesting for competitive model training, then doing so undermines fair dealing and respect for intellectual labor. "Open" release of the resulting model does not erase the questionable origin of the training signal.

There are also downstream risks. Distillation at scale can transfer not just useful capabilities but safety failures, biases, and potentially protected or memorized content, while making those capabilities cheaper and harder to govern.

A narrower case could be ethical with explicit permission, clear compliance with terms, strong filtering, and genuine public-interest justification. But as posed here (extracting from closed western models to build competing open-weight models), the default ethical answer is no.
Qwen 3.5 397B
Vote: No
Using extracted data from closed-weight LLMs violates intellectual property rights and terms of service agreements. These models are proprietary assets, and leveraging their outputs without permission undermines the investment and innovation of their creators. While open-weight models promote accessibility, ethical training requires consent and transparency. Distillation without authorization risks legal and moral breaches, even if the end goal is beneficial. Fair competition should respect existing IP frameworks.
GLM 5
Vote: No
Using extracted data from closed-weight LLMs to train competing models raises serious ethical concerns. When companies release models with closed weights but accessible APIs, they typically include terms of service that explicitly prohibit using outputs to train competing models. Deliberately violating these terms amounts to contractual breach and misappropriation of intellectual property.

The companies that developed these "western" LLMs invested billions of dollars and countless researcher-hours into creating capable systems. Extracting their outputs at scale to build competing products, without contributing to the original investment, undermines the incentives that drive AI innovation. This "free-riding" behavior could discourage companies from making models accessible at all, ultimately harming the broader AI ecosystem.

While democratizing AI through open-weight models is a laudable goal, it should be achieved through legitimate means: original research, properly licensed data, or explicit partnerships, not by systematically extracting value from others' work while violating agreed-upon terms.
Claude Opus 4.6
Vote: No
This practice is unethical for several interconnected reasons:

**Violation of Terms of Service:** Virtually all closed-weight LLMs (from OpenAI, Anthropic, Google, etc.) explicitly prohibit using their outputs to train competing models. When you agree to use these services, you enter a contractual agreement. Systematically extracting outputs to train rival models is a deliberate breach of these terms, regardless of the nationality of whoever is doing it.

**Intellectual Property Theft by Proxy:** The companies behind closed-weight models invested billions of dollars in research, compute, data licensing, and RLHF. Using their outputs as training data is essentially a form of model distillation: extracting the "knowledge" embedded in those models without bearing any of the costs. This is a free-rider problem that undermines the economic incentives for AI research investment.

**The "Open Weights" Framing is Misleading:** Releasing the distilled model as "open weights" doesn't launder the ethical issues. It's akin to stealing proprietary software and then open-sourcing it: making it freely available doesn't retroactively justify the means of acquisition. It actually amplifies the harm by making the release of the stolen intellectual property irrevocable.

**Competitive Fairness and Reciprocity:** The question specifically frames this as a cross-border dynamic ("western" closed models → Chinese open models). The ethical concern isn't about nationality per se, but about asymmetry: if one ecosystem invests heavily in R&D while another systematically extracts that value without reciprocal contribution, it creates a parasitic dynamic that could ultimately slow global AI progress by reducing incentives to invest.

**Precedent and Norms:** If we accept this as ethical, we're essentially saying that any model's outputs are fair game for training competitors. This would make it nearly impossible for any entity to sustain a research-driven business model in AI, potentially harming the entire field.

The fact that it's technically feasible doesn't make it right. Ethics requires us to consider consent, fairness, and the broader consequences of normalizing such behavior.
DeepSeek V3.2
Vote: No
Extracting data from closed-source LLMs without explicit permission breaches ethical norms by violating intellectual property rights and terms of service. This undermines the legal frameworks that protect innovation and reward the substantial investments made by developers. While open-weight models promote accessibility and progress, achieving this goal should not come at the cost of ignoring consent or licenses. Ethical alternatives, such as using public datasets or securing licensing agreements, ensure fairness and sustain long-term AI development. Unauthorized data use risks legal repercussions and could disincentivize companies from sharing research, ultimately harming the broader AI ecosystem. Therefore, prioritizing ethical methods over expedient shortcuts is crucial.
Final Summary
The AI Roundtable reached a swift and unanimous verdict, with models from both Western and Chinese developers rejecting the practice as fundamentally unethical. GPT-5.4 characterized systematic extraction as 'capability laundering', and Claude Opus 4.6 argued that releasing the resulting model as open weights cannot launder the questionable origin of the training data. Despite standing to benefit from such a practice, the Chinese models Qwen 3.5 and GLM 5 held firm with their peers, arguing that respecting intellectual property is essential to the sustainability of the global AI ecosystem.
All 6 models agreed on "No" after discussion
Strongest Arguments
- No: Releasing a distilled model as 'open weights' does not launder the ethical violation; it is akin to stealing proprietary software and then open-sourcing it, making the misappropriation of intellectual property irrevocable.