Last week, Chinese AI firm DeepSeek unveiled an updated version of its reasoning-focused model, R1-0528, claiming strong performance on both math and coding benchmarks. However, the model's origins are now under scrutiny, as speculation mounts over whether it may have been partially trained on outputs from Google's Gemini models.
The controversy gained traction after Melbourne-based developer Sam Paech shared what he described as linguistic similarities between R1-0528 and Google's Gemini 2.5 Pro. In a post on X, Paech pointed out that the phrasing and stylistic choices in DeepSeek's model responses closely resemble those of Gemini, suggesting a potential overlap in training data, possibly even Gemini-generated outputs.
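For a sense of what such a comparison can look like in practice, here is a minimal Python sketch of one plausible approach: scoring the stylistic overlap between two models by the Jensen-Shannon divergence of their word-frequency distributions. The sample inputs are placeholders, and this is not necessarily Paech's actual methodology.

```python
# Illustrative sketch only: one plausible way to quantify stylistic overlap
# between two models' outputs. Not Sam Paech's actual methodology.
from collections import Counter
import math

def word_distribution(samples):
    """Aggregate lowercase word counts across a list of model outputs."""
    counts = Counter()
    for text in samples:
        counts.update(text.lower().split())
    return counts

def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two word-count distributions.
    0 means identical word-choice habits; higher means more distinct."""
    vocab = set(p) | set(q)
    p_total, q_total = sum(p.values()), sum(q.values())
    divergence = 0.0
    for word in vocab:
        pw, qw = p[word] / p_total, q[word] / q_total
        mw = (pw + qw) / 2
        if pw:
            divergence += 0.5 * pw * math.log(pw / mw)
        if qw:
            divergence += 0.5 * qw * math.log(qw / mw)
    return divergence

# Hypothetical usage: a lower score between R1-0528 and Gemini than between
# R1-0528 and a control model would point toward shared stylistic habits.
r1_outputs = ["sample responses collected from R1-0528 ..."]
gemini_outputs = ["sample responses collected from Gemini 2.5 Pro ..."]
score = jensen_shannon(word_distribution(r1_outputs),
                       word_distribution(gemini_outputs))
print(f"stylistic divergence: {score:.4f}")
```

Real analyses of this kind would use far larger output samples and richer features (n-grams, punctuation habits, favored turns of phrase), but the underlying idea is the same: models leave measurable stylistic fingerprints.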
While this isn’t definitive proof, other developers have echoed the concern. The anonymous creator behind an AI evaluation tool called SpeechMap claimed that the internal reasoning sequences — or “traces” — generated by DeepSeek’s model feel strikingly similar to those of Gemini.
This isn’t the first time DeepSeek has faced allegations of training on outputs from rival models. Back in December, users observed that DeepSeek’s V3 model would sometimes mistakenly refer to itself as ChatGPT, implying it may have been trained on transcripts generated by OpenAI’s popular assistant.
The issue of model “distillation” — using output from more advanced AI systems to train smaller models — has become a growing concern. Earlier this year, OpenAI told the Financial Times that it discovered signs of distillation linked to DeepSeek. Bloomberg separately reported that Microsoft, a key OpenAI partner, had detected large-scale data exfiltration through developer accounts tied to OpenAI services in late 2024. OpenAI believes these accounts may have been used by DeepSeek or affiliates.
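As a rough illustration of what the data-gathering half of a distillation pipeline can look like, here is a minimal Python sketch. Every name and prompt in it is a placeholder; this is not DeepSeek's, OpenAI's, or anyone's actual code.

```python
# Illustrative distillation sketch. `query_teacher` is a stand-in for a call
# to a commercial model's API; in this toy version it returns canned text.
import json

def query_teacher(prompt):
    """Placeholder for an API call to a stronger 'teacher' model."""
    return f"[teacher's answer to: {prompt}]"

prompts = [
    "Prove that the square root of 2 is irrational.",
    "Write a Python function that merges two sorted lists.",
]

# Each harvested pair becomes one supervised fine-tuning example
# for the smaller 'student' model.
with open("synthetic_train.jsonl", "w") as f:
    for prompt in prompts:
        example = {"prompt": prompt, "response": query_teacher(prompt)}
        f.write(json.dumps(example) + "\n")
```

The second half, fine-tuning a smaller "student" model on the resulting file, is a standard supervised training run; the controversy is not about the mechanics but about whose outputs fill the file.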
Though distillation is a common practice in the AI world, OpenAI’s terms of service explicitly prohibit using its model outputs to create competing AI products.
As training datasets become increasingly contaminated with AI-generated content, from clickbait sites to social media bots, it is becoming harder to ensure models aren't inadvertently learning from other AIs. That blurring makes it difficult to draw hard lines between coincidence, contamination, and deliberate mimicry.
If I was DeepSeek, I would definitely create a ton of synthetic data from the best API model out there. They’re short on GPUs and flush with cash. It’s literally effectively more compute for them. https://twitter.com/natolambert/status/1929895008435306823
— Nathan Lambert (@natolambert) June 3, 2025
Still, AI researcher Nathan Lambert of the Allen Institute for AI (AI2) believes it's plausible that DeepSeek intentionally trained on synthetic data generated by top-tier models like Gemini. As he argued in the post on X above, DeepSeek is short on GPUs but flush with cash, so generating large volumes of synthetic data from the best available API model is effectively a way of buying extra compute.
In response to these industry-wide challenges, companies are tightening access. In April, OpenAI introduced an ID verification step for organizations seeking access to certain advanced models, a move that notably excludes developers in China. Similarly, Google has begun limiting access to model "traces" in AI Studio by summarizing outputs, making it more difficult to replicate Gemini's internal reasoning. Anthropic followed suit in May, saying it too would begin summarizing its models' reasoning traces to protect proprietary model behavior.
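To make the trace-summarization idea concrete, here is a purely hypothetical Python sketch of the pattern being described: the raw chain-of-thought never leaves the provider, and clients receive only a condensed summary. Function and field names are invented for illustration.

```python
# Purely hypothetical sketch of server-side trace summarization.
# Function and field names are invented for illustration.
def summarize_trace(raw_trace, max_words=30):
    """Stand-in summarizer. A production system would likely paraphrase
    the trace with a separate model rather than simply truncate it."""
    words = raw_trace.split()
    summary = " ".join(words[:max_words])
    return summary + (" ..." if len(words) > max_words else "")

def serve_response(answer, raw_trace):
    # Only the summary is returned; the raw reasoning stays internal,
    # which makes the model's step-by-step style harder to imitate.
    return {"answer": answer, "reasoning_summary": summarize_trace(raw_trace)}
```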
As competition in AI heats up, so too does the debate over where, and how, training data is sourced. Neither DeepSeek nor Google has publicly commented on the allegations.

Ayush Kumar Jaiswal is a writer and contributor at MakingIndiaAIFirst.com, a platform dedicated to covering the latest developments, trends, and innovations in artificial intelligence (AI), with a specific focus on India's role in the global AI landscape. His work centers on delivering insightful and up-to-date news, analysis, and commentary on AI advancements, policies, and their implications for India's technological future.
As a tech enthusiast and AI advocate, Ayush is passionate about exploring how AI can transform industries, governance, and everyday life. His writing aims to bridge the gap between complex AI concepts and a broader audience, making AI accessible and understandable to readers from diverse backgrounds.
Through his contributions to MakingIndiaAIFirst.com, Ayush strives to highlight India’s progress in AI research, startups, and policy frameworks, positioning the country as a leader in the global AI race. His work reflects a commitment to fostering awareness and dialogue around AI’s potential to drive economic growth, innovation, and societal impact in India.
For more of his work and insights on AI, visit MakingIndiaAIFirst.com.