DeepSeek’s Latest AI Model Raises Eyebrows Over Potential Use of Google’s Gemini Outputs

Last week, Chinese AI firm DeepSeek unveiled an updated version of its reasoning-focused model, R1-0528, claiming strong performance on both math and coding benchmarks. However, the model's origins are now under scrutiny, as speculation mounts that it may have been trained in part on outputs from Google's Gemini models.

The controversy gained traction after Melbourne-based developer Sam Paech shared what he described as linguistic similarities between R1-0528 and Google's Gemini 2.5 Pro. In a post on X, Paech pointed out that the phrasing and stylistic choices in DeepSeek's model responses closely resemble Gemini's, suggesting a potential overlap in training data, possibly including Gemini-generated outputs.
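For illustration only, one crude way to quantify that kind of stylistic overlap is to compare word-frequency vectors of two models' responses to the same prompt. The Python sketch below is not Paech's methodology, which he has not published in detail; the example strings are invented, and a real analysis would aggregate far richer signals across thousands of prompts.

```python
# Illustrative sketch: cosine similarity over bag-of-words frequency
# vectors as a crude proxy for stylistic overlap between two models.
# NOT the actual methodology behind the DeepSeek/Gemini comparison.
from collections import Counter
import math

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between word-frequency vectors of two texts."""
    freq_a = Counter(text_a.lower().split())
    freq_b = Counter(text_b.lower().split())
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical responses from two models to the same prompt.
r1_output = "Certainly! Let's break this down step by step and verify each case."
gemini_output = "Certainly! Let's break the problem down step by step and check each case."
print(f"similarity: {cosine_similarity(r1_output, gemini_output):.3f}")
```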

While this isn’t definitive proof, other developers have echoed the concern. The anonymous creator behind an AI evaluation tool called SpeechMap claimed that the internal reasoning sequences — or “traces” — generated by DeepSeek’s model feel strikingly similar to those of Gemini.

This isn’t the first time DeepSeek has faced allegations of training on outputs from rival models. Back in December, users observed that DeepSeek’s V3 model would sometimes mistakenly refer to itself as ChatGPT, implying it may have been trained on transcripts generated by OpenAI’s popular assistant.

The issue of model “distillation” — using output from more advanced AI systems to train smaller models — has become a growing concern. Earlier this year, OpenAI told the Financial Times that it discovered signs of distillation linked to DeepSeek. Bloomberg separately reported that Microsoft, a key OpenAI partner, had detected large-scale data exfiltration through developer accounts tied to OpenAI services in late 2024. OpenAI believes these accounts may have been used by DeepSeek or affiliates.
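In its classical form, distillation trains a small "student" network to match a larger "teacher" network's output distribution over tokens; the allegations against DeepSeek concern a looser, black-box variant in which the teacher is only queried for text. Below is a minimal sketch of the classical recipe in PyTorch; the model sizes, temperature, and random stand-in data are toy placeholders, not anything DeepSeek has disclosed about its pipeline.

```python
# Minimal sketch of knowledge distillation: a small "student" model is
# trained to match a frozen "teacher" model's softened output distribution.
# All sizes and data here are toy placeholders for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 100, 32
teacher = nn.Sequential(nn.Embedding(VOCAB, DIM * 4), nn.Linear(DIM * 4, VOCAB))
student = nn.Sequential(nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution

tokens = torch.randint(0, VOCAB, (64,))  # stand-in for real prompt tokens
for step in range(100):
    with torch.no_grad():  # the teacher is frozen; only the student learns
        teacher_logits = teacher(tokens)
    student_logits = student(tokens)
    # KL divergence pushes the student's predicted token distribution
    # toward the teacher's temperature-softened distribution.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The factor of temperature squared keeps gradient magnitudes comparable across temperature settings, a standard detail from the original distillation recipe. The black-box variant alleged here skips the logits entirely and simply fine-tunes on text sampled from the teacher's API.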

Though distillation is a common practice in the AI world, OpenAI’s terms of service explicitly prohibit using its model outputs to create competing AI products.

As training datasets become increasingly contaminated with AI-generated content, from clickbait sites to social media bots, it is becoming harder to ensure models aren't inadvertently learning from other AIs. That blurring makes it difficult to draw hard lines between coincidence, contamination, and deliberate mimicry.

Still, AI researcher Nathan Lambert from AI2 believes it’s plausible DeepSeek intentionally used synthetic data generated from top-tier models like Gemini. “If I were DeepSeek, I’d absolutely generate as much synthetic data as possible using the best models available,” Lambert said in a post on X. “They’ve got more money than GPUs, and this gives them effective compute.”

In response to these industry-wide challenges, companies are tightening access. In April, OpenAI introduced an ID verification step for organizations seeking access to certain advanced models — a move that notably excludes developers from China. Similarly, Google has begun limiting access to model “traces” via AI Studio by summarizing outputs, making it more difficult to replicate Gemini’s internal reasoning. Anthropic followed suit in May, announcing it would begin obfuscating trace data to protect proprietary model behavior.

As the competition in AI heats up, so too does the debate over where and how training data is sourced. Neither DeepSeek nor Google has publicly commented on the allegations.
