Every time you paste a contract into ChatGPT, upload financial projections to Gemini, feed proprietary code into Claude, or ask DeepSeek to debug your production database queries, you are handing data to a corporation that has every financial incentive to extract maximum value from it and very little regulatory pressure to stop.
The AI industry runs on data. Not just training data scraped from the public internet, but the live, real-time data that hundreds of millions of users voluntarily submit every day — their business strategies, their medical concerns, their legal disputes, their source code, their private thoughts. This data is the most valuable commodity in the AI arms race, and every major provider, American and Chinese alike, is structured to capture as much of it as possible.
This guide does not assume that these companies are acting in your best interest. These are corporations worth tens or hundreds of billions of dollars. Their primary obligation is to their shareholders, not to your privacy. Their terms of service are written by teams of lawyers whose job is to maximize the company’s legal latitude to use your data while minimizing its liability to you. When a company says it “may” use your data for “service improvement,” that language exists because a lawyer ensured maximum flexibility, not because anyone prioritized your protection.
What follows is a provider-by-provider examination of what is known, what is suspected, and what is almost certainly happening behind the API endpoints and chat interfaces of every major AI provider.
Google (Gemini, Vertex AI, Google AI Studio)
Google is the most experienced mass-scale data harvester in human history. Before Gemini existed, Google had already built a two-decade apparatus for collecting, indexing, correlating, and monetizing personal data across Search, Gmail, Maps, YouTube, Android, Chrome, and dozens of other products. Gemini does not exist in isolation from this infrastructure. It is built on top of it.
What Google Explicitly Collects
Google’s privacy policy and Gemini-specific terms state that for consumer Gemini (free and Google One AI Premium):
- Conversation content: Your prompts, responses, and any files you upload are collected and stored.
- Conversations are reviewed by human reviewers: Google explicitly states that human reviewers read Gemini conversations to improve products. These conversations are retained for up to three years, even if you delete them from your activity log.
- Usage data: Device information, IP address, browser type, operating system, referring URLs, timestamps, interaction patterns, and click behavior.
- Cross-product data linkage: Google ties Gemini activity to your broader Google account profile, which already contains your email history, search behavior, location history, YouTube watch patterns, and purchase history from Gmail receipt scanning.
- Voice and audio: If you interact with Gemini through voice, audio recordings are collected and may be reviewed by humans.
For Workspace and Vertex AI enterprise tiers, Google states that customer data is not used for model training. However, Google retains operational logs, telemetry data, and metadata even on enterprise plans. The distinction between “content data” and “service data” is defined by Google, not by you.
What Google Could Reasonably Be Collecting
Given Google’s existing data infrastructure and business model, the following are reasonable inferences:
Behavioral modeling across all Google products. Your Gemini conversations are linked to your Google account. Google already builds comprehensive behavioral profiles from Search, Gmail, Maps, and YouTube data. Adding your AI conversation patterns — what you ask about, when, how you phrase questions, what problems you are trying to solve — creates an extraordinarily detailed profile of your cognitive patterns, professional challenges, and personal concerns. Even if conversation content is not directly used for ad targeting today, the behavioral signals derived from it almost certainly are.
Inference-time data extraction. When you provide context to Gemini — pasting in a document, sharing a spreadsheet, asking it to analyze a dataset — that content passes through Google’s infrastructure. Even if Google does not permanently store the raw content on enterprise tiers, the act of processing it generates metadata: document length, language, topic classification, entity extraction, and structural patterns. This metadata has value independent of the raw content.
Model training on user interactions. Google states that consumer Gemini data is used to improve products. “Improve products” is an umbrella that covers direct model training, RLHF (reinforcement learning from human feedback), evaluation dataset creation, safety tuning, prompt engineering research, and synthetic data generation. The practical effect is that your conversations become part of the system that shapes future model behavior, and Google retains discretion over how this is executed.
Third-party data sharing through advertising infrastructure. Google’s advertising ecosystem involves thousands of data partnerships. While Google may not sell raw Gemini transcripts to advertisers, the behavioral signals derived from Gemini usage feed into the same targeting infrastructure that powers Google Ads. A user who asks Gemini about bankruptcy law, cancer symptoms, or divorce proceedings has revealed targeting-relevant intent that Google’s ad systems are designed to exploit.
Documented Concerns
In March 2024, researchers at Cornell published findings demonstrating that Google’s Bard (now Gemini) retained user conversation data and incorporated it into retrieval-augmented responses for other users in certain edge cases, effectively leaking private information between unrelated user sessions. Google patched the specific vulnerability, but the architectural pattern — feeding user data back into live systems — remains.
A 2024 FTC investigation into Google’s data practices broadly examined how the company’s cross-product data sharing extends to AI services, raising questions about whether users are providing meaningful informed consent when their Gemini activity is correlated with their broader Google data profile.
Google’s own AI Principles page commits to “be accountable to people” and “avoid creating or reinforcing unfair bias,” but these principles are non-binding, self-assessed, and unaudited. No independent body has verified Google’s compliance with its own stated principles regarding user data handling in Gemini.
OpenAI (ChatGPT, GPT API, DALL-E, Sora)
OpenAI transitioned from a nonprofit research lab to one of the most highly valued private companies in history in under five years. That transition was funded by Microsoft’s multi-billion-dollar investment and sustained by hundreds of millions of ChatGPT users providing free labor in the form of training data. OpenAI’s business model is structurally dependent on user data, and its governance has repeatedly prioritized growth over safety.
What OpenAI Explicitly Collects
OpenAI’s privacy policy and terms of use detail:
- All input and output content: Every prompt, response, uploaded file, and generated image is logged.
- Consumer ChatGPT conversations are used for training by default: Unless users opt out through the settings menu (a buried toggle many users never find), all ChatGPT conversations feed directly into model training.
- Usage metadata: IP address, browser type, device information, operating system, session duration, feature usage patterns, click paths, and referring URLs.
- Account information: Email address, payment details, name, and organizational affiliation.
- Cookies and tracking: OpenAI uses both first-party and third-party cookies, including analytics and advertising trackers.
The API has different terms: OpenAI states that data submitted through the API is not used for model training by default. However, OpenAI retains API data for up to 30 days for “abuse monitoring” and “safety purposes,” and this retention window has been extended without notice in the past.
ChatGPT Enterprise and Team plans exclude data from training, but OpenAI still retains it for operational purposes. The “zero data retention” option on the API requires an explicit opt-in and is not available on all endpoints.
What OpenAI Could Reasonably Be Collecting
Comprehensive behavioral profiling. OpenAI knows what millions of people think about, worry about, create, and struggle with on a daily basis. This includes deeply personal queries about health, relationships, finances, and legal problems. Even without selling this data directly, the ability to profile user intent at this scale has enormous commercial value — for targeted product development, feature prioritization, partnership negotiations, and eventual advertising.
Training data laundering. OpenAI has faced multiple copyright lawsuits alleging that copyrighted material — books, articles, code — was used in training without permission. When users paste copyrighted content into ChatGPT (a document they are reviewing, a news article, a book excerpt), that content enters OpenAI’s training pipeline if they have not opted out. This creates a mechanism for continuously refreshing training data with copyrighted material provided by users, bypassing the legal and ethical issues of scraping.
Employee access to conversations. OpenAI’s safety team reviews flagged conversations. The scope of what triggers review, who has access, and how conversations are selected is not publicly disclosed. Former OpenAI employees have reported that internal access controls to conversation data were less restrictive than public-facing policies suggested, particularly in the company’s earlier growth phase.
Microsoft integration data flow. Microsoft is OpenAI’s largest investor and primary cloud provider. ChatGPT runs on Azure infrastructure. The data sharing agreement between Microsoft and OpenAI is not fully public. Microsoft integrates OpenAI models into Bing, Office 365, Windows Copilot, and GitHub Copilot. The degree to which user data from ChatGPT informs Microsoft’s broader product development, advertising targeting, or enterprise sales strategy is unknown.
Documented Concerns
In March 2023, a ChatGPT bug exposed the titles of other users’ chat histories and, for a small share of ChatGPT Plus subscribers, payment-related information including names, email addresses, payment addresses, and the last four digits of credit card numbers. OpenAI attributed this to a bug in an open-source Redis client library, but the incident demonstrated that user conversations were stored in shared infrastructure where a single bug could expose data across users.
Italy’s data protection authority (Garante) ordered a temporary ban on ChatGPT at the end of March 2023 over GDPR violations, specifically the lack of a legal basis for processing personal data for training, the absence of age verification, and inaccurate outputs about individuals. OpenAI was allowed to resume service in late April 2023 after implementing superficial changes (a cookie banner, an age gate, an opt-out toggle), but the fundamental data collection practices remained unchanged.
In December 2023, The New York Times filed a landmark copyright lawsuit against OpenAI and Microsoft, alleging that ChatGPT could reproduce Times articles nearly verbatim, which would mean that copyrighted content was retained in the model. This case raised broader questions about what other user-submitted content might be reproducible from the model’s weights.
OpenAI’s internal governance failures are also relevant to privacy trust. The November 2023 board crisis, where the board fired and then re-hired CEO Sam Altman within days, revealed that the organization’s safety-focused governance structure could be overridden by commercial and investor pressure. If OpenAI’s board could not maintain its authority over executive leadership, the reliability of any internal privacy commitment is questionable.
Anthropic (Claude, Claude API)
Anthropic positions itself as the “safety-focused” AI company. Founded by former OpenAI researchers who left over safety disagreements, Anthropic’s branding emphasizes responsible AI development and Constitutional AI. This positioning has earned it goodwill among privacy-conscious users. It also serves as a marketing differentiator in a competitive market, and it should be evaluated as such.
What Anthropic Explicitly Collects
Anthropic’s privacy policy and usage policy detail:
- Conversation content: All prompts, responses, and uploaded files are collected and stored.
- Free-tier and Pro conversations may be used for training: Anthropic states it may use inputs and outputs from consumer products (claude.ai free and Pro) to improve models, unless the user opts out. Conversations flagged by safety systems may be reviewed regardless of opt-out status.
- API data is not used for training by default: Similar to OpenAI, API customers receive stronger data handling commitments.
- Usage metadata: IP addresses, device information, browser type, interaction patterns, feature usage, and session data.
- Safety-flagged content: Conversations that trigger Anthropic’s safety classifiers are retained and reviewed by Anthropic staff, regardless of the user’s tier or opt-out preferences.
- Feedback and evaluation data: If you rate a response or provide feedback, that data is collected and used for model improvement.
What Anthropic Could Reasonably Be Collecting
Safety review as a data collection mechanism. Anthropic’s safety-first positioning creates a structural incentive to cast a wide net with its safety classifiers. Every conversation flagged for “safety review” is a conversation that Anthropic’s staff can read, regardless of the user’s privacy settings. The criteria for what triggers a safety flag are not public. The broader the safety net, the more user data Anthropic’s team can access. There is no independent audit of what percentage of conversations are flagged, how they are selected, or how flagged data is subsequently used.
Constitutional AI does not eliminate human evaluation. Anthropic’s Constitutional AI approach has a model critique and revise its own outputs against a written set of principles, which reduces reliance on human labels but does not remove human evaluation, red-teaming, or safety research from the development process. Those activities plausibly draw on real conversation data. Anthropic’s opt-out mechanisms may remove your data from direct model fine-tuning, but they may not exclude it from evaluation pipelines, red-teaming exercises, or safety research, all of which can involve human review of conversation content.
Investor pressure on data utilization. Anthropic has raised over $7 billion in funding from Google, Spark Capital, and other investors. A company valued at approximately $60 billion needs to demonstrate returns. As Anthropic scales, the pressure to monetize data assets — directly or indirectly — increases. Today’s privacy commitments are today’s commitments. They are not binding on future leadership, future investors, or future business models. Anthropic is a private company with no public accountability mechanism for its data practices.
Third-party cloud infrastructure exposure. Anthropic runs on Google Cloud and Amazon Web Services. This means your conversation data, at minimum, passes through infrastructure owned by two of the largest data-harvesting companies in the world. Anthropic’s encryption and access controls add layers of protection, but your data still resides on servers controlled by Google and Amazon, both of which have their own data collection practices, government data request compliance obligations, and employee access policies.
Documented Concerns
Anthropic has had fewer public data incidents than OpenAI or Google. That record may reflect its smaller user base and shorter operating history as much as any superiority in its practices.
In 2024, security researchers demonstrated that Claude’s system prompt and safety instructions could be extracted through carefully crafted adversarial prompts, revealing internal safety guidelines that Anthropic had not publicly disclosed. While this is primarily a prompt injection concern rather than a data privacy issue, it demonstrates that Anthropic’s systems are not immune to information leakage.
Anthropic’s Responsible Scaling Policy establishes AI Safety Levels (ASLs) for model capability and safety evaluation but contains no specific, auditable commitments about user data privacy. The policy focuses on catastrophic risk from model capabilities, not on the everyday privacy risks of the millions of users interacting with Claude daily.
Moonshot AI / Kimi
Moonshot AI, founded in 2023 by former Google and Meta researcher Yang Zhilin, operates Kimi, one of China’s most popular AI assistants. Kimi is notable for its long-context capabilities (supporting up to 2 million tokens) and its rapid growth in the Chinese market. For international users considering Kimi or businesses evaluating its API, the privacy implications are fundamentally different from those of American providers — and in some ways, more severe.
What Moonshot/Kimi Explicitly Collects
Moonshot’s privacy policy and terms of service, primarily available in Chinese and governed by Chinese law, state:
- All conversation data: Prompts, responses, uploaded files and documents.
- User account data: Phone number (required for registration in China), email, device identifiers.
- Device and usage metadata: IP address, device model, operating system, app version, interaction timestamps, session data, and behavioral patterns.
- Content for service improvement: Moonshot states it uses conversation data to improve its services and models.
- Third-party sharing: Moonshot may share data with “partners” and “service providers,” with the scope of sharing defined at Moonshot’s discretion.
The China-Specific Privacy Context
Understanding Kimi’s privacy risks requires understanding the legal environment in which Moonshot operates.
China’s National Intelligence Law (2017) requires all Chinese organizations and citizens to “support, assist, and cooperate with national intelligence work.” Article 7 also obliges them to keep secret any intelligence work they become aware of. In practice, this is widely read to mean that Chinese companies must hand over data to state intelligence agencies upon request and may not disclose that such a request has been made. There is no Chinese equivalent of a warrant canary. There is no independent judiciary to challenge such requests. There is no transparency report mechanism.
This means that every piece of data submitted to Kimi is accessible to Chinese state intelligence services, regardless of what Moonshot’s privacy policy says. Moonshot cannot legally refuse a data request from Chinese intelligence agencies, and it cannot tell you if such a request has been made.
China’s Cybersecurity Law (2017) and Data Security Law (2021) impose data localization requirements, mandate security assessments for cross-border data transfers, and establish government authority to access data stored within China for “national security” purposes. Your data on Kimi’s servers is subject to these laws.
China’s Personal Information Protection Law (PIPL, 2021) provides some consumer privacy protections similar to GDPR, but enforcement is directed by the state, and the law explicitly exempts government data access for national security purposes. PIPL does not protect your data from state surveillance.
What Moonshot/Kimi Could Reasonably Be Collecting
State-directed intelligence collection. Given the legal framework, it is reasonable to assume that Chinese intelligence services have standing access to Kimi conversation data. For international users, this means any proprietary business information, competitive intelligence, personal details, or technical specifications submitted to Kimi are accessible to a foreign government’s intelligence apparatus.
Industrial espionage potential. China has a documented, decades-long history of state-sponsored industrial espionage, as detailed in reports from the FBI, the U.S. Department of Justice, and European intelligence agencies. An AI platform where foreign businesses voluntarily submit proprietary information is an intelligence collection mechanism of extraordinary efficiency.
Cross-platform data correlation. Moonshot operates within China’s broader technology ecosystem. The potential for data sharing or correlation with other Chinese technology platforms — WeChat, Alibaba, Baidu, ByteDance — is governed by Chinese law, which favors state access, and by business relationships that are not transparently disclosed.
Behavioral profiling of foreign users. International users of Kimi provide a dataset of foreign nationals’ thinking patterns, professional concerns, technical capabilities, and potential vulnerabilities. This data has intelligence value independent of any specific piece of content.
DeepSeek
DeepSeek, founded in 2023 by Liang Wenfeng (who also founded the quantitative hedge fund High-Flyer Capital Management), has attracted significant international attention for producing highly capable open-source models (DeepSeek-V2, DeepSeek-V3, DeepSeek-R1) at costs that undercut American competitors by orders of magnitude. DeepSeek’s chat interface and API have drawn millions of international users. The privacy implications are severe.
What DeepSeek Explicitly Collects
DeepSeek’s privacy policy, which applies to the web chat and API, lists the following:
- All input and output content: Prompts, responses, uploaded files, and conversation history.
- Keystroke patterns and rhythms: DeepSeek’s privacy policy explicitly states it collects “keystroke patterns or rhythms” — a form of behavioral biometric data that can uniquely identify individual users and is extremely difficult to anonymize.
- Device information: IP address, device model, operating system, system language, unique device identifiers.
- Usage patterns: Features used, actions taken, time zone, country, interaction timestamps.
- Cookies and tracking technologies: Including third-party analytics and advertising cookies.
- Data storage in China: DeepSeek explicitly states that data is stored on servers in the People’s Republic of China.
- Broad third-party sharing: DeepSeek’s policy permits sharing with “corporate affiliates,” “service providers,” and in response to “legal obligations” — which, under Chinese law, includes state intelligence demands.
The DeepSeek-Specific Risk Profile
DeepSeek’s privacy risks include everything described for Kimi regarding the Chinese legal framework, plus additional unique concerns:
Keystroke biometric collection is a surveillance tool. Keystroke dynamics, the timing patterns of how you type, are a behavioral biometric that can distinguish individual users. They can be used to re-identify users across different accounts and platforms, even if other identifying information is removed. It is hard to identify a product improvement reason for collecting keystroke biometrics from an AI chat interface. This data’s primary utility is surveillance and user identification.
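To make “timing patterns” concrete, the following is a minimal sketch of the two features most commonly discussed in keystroke dynamics: dwell time (how long a key is held) and flight time (the gap between releasing one key and pressing the next). The `KeyEvent` type, the function name, and the sample timestamps are illustrative assumptions; nothing here describes what DeepSeek actually computes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class KeyEvent:
    """Hypothetical record of one key press (timestamps in milliseconds)."""
    key: str
    down_ms: float  # when the key was pressed
    up_ms: float    # when the key was released


def timing_features(events: list[KeyEvent]) -> dict[str, float]:
    """Return mean dwell time and mean flight time for a typing sample."""
    if not events:
        raise ValueError("need at least one key event")
    # Dwell: how long each key was held down.
    dwell = [e.up_ms - e.down_ms for e in events]
    # Flight: gap between releasing one key and pressing the next.
    flight = [b.down_ms - a.up_ms for a, b in zip(events, events[1:])]
    return {
        "mean_dwell_ms": mean(dwell),
        "mean_flight_ms": mean(flight) if flight else 0.0,
    }


# Illustrative sample: three key presses with made-up timestamps.
sample = [KeyEvent("h", 0, 90), KeyEvent("i", 150, 230), KeyEvent("!", 400, 470)]
features = timing_features(sample)
```

Note that the key values themselves are never used: the profile comes entirely from timing, which is why removing the typed content does not anonymize this kind of data.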
In January 2025, multiple security researchers and news outlets flagged DeepSeek’s keystroke collection practice. The U.S. Navy instructed its personnel not to use DeepSeek, citing security and ethical concerns. Italy’s data protection authority blocked DeepSeek for GDPR violations. Australia, South Korea, and Taiwan have restricted DeepSeek on government devices.
Hedge fund origins raise data monetization concerns. DeepSeek’s founder also runs one of China’s largest quantitative trading firms. Quantitative trading is fundamentally a data arbitrage business. The combination of an AI platform that collects detailed user behavior data and a hedge fund that profits from information asymmetry creates an obvious conflict of interest. User queries about markets, companies, financial strategies, or economic conditions have direct commercial value to a trading operation.
Open-source models as a trust offset. DeepSeek’s strategy of releasing open-source model weights builds trust and adoption. But the open-source models are separate from the data collection practices of the web chat and API services. A user running DeepSeek-V3 locally has a completely different privacy profile from a user interacting through DeepSeek’s chat interface. The open-source goodwill should not be conflated with the platform’s data practices. For guidance on running AI models locally, see our guide on why locally run AI outperforms cloud solutions.
Security vulnerabilities in DeepSeek’s infrastructure. In January 2025, security researchers at Wiz discovered a publicly accessible DeepSeek database containing over one million records of chat histories, API keys, backend logs, and operational metadata — completely unprotected, with no authentication required. This was not a sophisticated attack; the database was simply exposed to the open internet. The incident revealed both the volume of data DeepSeek collects and the inadequacy of its security practices.
The Aggregate Privacy Cost: What Using All These Providers Actually Means

Most individuals and businesses do not use just one AI provider. A developer might use ChatGPT for brainstorming, Claude for code review, Gemini through Google Workspace, and DeepSeek for cost-effective batch processing. Each of these interactions generates data across separate corporate entities, each with different privacy policies, data retention periods, jurisdictions, and incentive structures.
The Composite Profile Problem
Individually, each provider builds a partial profile of you. Taken together, those profiles are close to complete. Your ChatGPT conversations reveal your creative thinking and problem-solving patterns. Your Claude interactions show your coding style and technical challenges. Your Gemini usage is correlated with your email, calendar, and location data. Your DeepSeek queries reveal your cost sensitivity and market research patterns.
No single company has this complete picture, but each has a piece. And each piece has value in data broker markets, corporate acquisition scenarios, government subpoena responses, and data breaches. When Equifax was breached in 2017, 147 million Americans had their financial data exposed — from a single company. Now imagine a breach at any AI provider, where the exposed data includes not just demographic information but the full text of people’s private thoughts, business strategies, and personal concerns.
The Legal Jurisdiction Problem
If you use both American and Chinese AI providers, your data is simultaneously subject to:
- U.S. law, including the CLOUD Act (which allows U.S. law enforcement to compel disclosure of data stored overseas by American companies)
- Chinese National Intelligence Law (which compels Chinese companies to provide data to state intelligence)
- GDPR (if you are in the EU, which restricts cross-border data transfers but has limited enforcement against non-EU companies)
- Various U.S. state privacy laws (CCPA, VCDPA, CPA) with inconsistent protections
You have no unified legal framework protecting your AI conversation data across providers. Each provider operates under whatever jurisdiction gives it the most latitude.
The Training Data Externality
When your data is used for model training — which it is by default on most consumer tiers — you are contributing to a product that will be sold to others, potentially including your competitors. The business strategy you discussed with ChatGPT becomes part of the training distribution. The code architecture you described to Claude influences future code suggestions for other users. Your competitive analysis submitted to Gemini feeds into a model that your competitors also query.
This is not hypothetical. It is the explicit business model. These companies collect your data, use it to improve models, and sell access to those improved models to everyone, including people and organizations whose interests are directly opposed to yours. We explored the broader implications of how AI training data shapes model behavior in our guide to red-teaming LLM applications.
The Irrevocability Problem

Data submitted to AI providers cannot be meaningfully “deleted.” Even if a provider offers a deletion mechanism for your conversation history, the data has already been processed. If it was used in training, it is embedded in model weights. If it was reviewed by humans, those humans have seen it. If it was stored in backups, it exists in those backups until they are rotated. If it was logged for abuse monitoring, it exists in those logs. The right to deletion under GDPR and CCPA provides theoretical protection, but in practice, information that has already been incorporated into a neural network’s weights cannot be fully removed.
Provider-by-Provider Risk Comparison
The following table summarizes the key privacy risk factors across all five providers:
| Risk Factor | Google (Gemini) | OpenAI (ChatGPT) | Anthropic (Claude) | Moonshot (Kimi) | DeepSeek |
|---|---|---|---|---|---|
| Default training on user data | Yes (consumer) | Yes (consumer) | Yes (consumer) | Yes | Yes |
| Opt-out available | Partial | Yes (buried) | Yes | Unclear | Unclear |
| Human review of conversations | Yes (explicit) | Yes (safety/flagged) | Yes (safety/flagged) | Undisclosed | Undisclosed |
| Keystroke/biometric collection | No known | No known | No known | Unknown | Yes (explicit) |
| Data stored in China | No | No | No | Yes | Yes |
| Subject to Chinese intelligence law | No | No | No | Yes | Yes |
| Cross-product data linking | Yes (extensive) | Limited (Microsoft) | Limited | Unknown | Hedge fund overlap |
| Major data breach history | None (Gemini-specific) | Yes (March 2023) | None known | None known | Yes (Jan 2025) |
| Government bans | None | Temporary (Italy, 2023) | None | Limited | Multiple countries |
| Enterprise zero-retention option | Yes (Vertex) | Yes (API) | Yes (API) | Unknown | Unknown |
| Independent security audit | Partial | Partial | Partial | None known | None known |
Practical Strategies for Protecting Your Data
Complete avoidance of AI providers is not realistic for most businesses. The productivity advantages are too significant. But the goal should be minimizing unnecessary data exposure while maintaining the benefits. Think of this as a data hygiene practice, similar to how implementing proper security measures protects your web applications.
1. Run Local Models for Sensitive Work
The single most effective privacy measure is to run AI models locally for any work involving sensitive, proprietary, or regulated data. Open-source models like Llama 3, Mistral, Qwen, and — ironically — DeepSeek’s own open-source releases can run entirely on your own hardware using tools like Ollama or LM Studio. No data leaves your network. No third party has access. We published a detailed comparison in our guide to Local LLMs vs Cloud AI.
Best for: Proprietary code review, confidential document analysis, internal strategy discussions, regulated data (HIPAA, FINRA, attorney-client privilege), competitive intelligence analysis.
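As a rough illustration of what “no data leaves your network” looks like in practice, the sketch below sends a prompt to an Ollama server running on the same machine. It assumes Ollama is installed and listening on its default port (11434) and that a model has already been pulled; the model name `llama3` and the helper names are assumptions for the example.

```python
import json
import urllib.request

# Ollama's default local endpoint; traffic stays on localhost.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(prompt: str, model: str = "llama3") -> str:
    """Send the prompt to the local model and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask_local("Summarize this clause: ...")` only works while the local server is running. Because the target host is `localhost`, the prompt is not sent to any third-party service.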
2. Segregate Providers by Sensitivity
Establish a clear internal policy for which AI providers are used for which types of work:
- Tier 1 (Local only): Trade secrets, customer PII, financial data, legal documents, HR records, source code with proprietary algorithms
- Tier 2 (Enterprise API with zero retention): General development assistance, non-confidential code, marketing copy, research summaries
- Tier 3 (Consumer chat): General knowledge questions, public information research, learning and exploration, non-sensitive brainstorming
Never use Chinese-operated AI services (DeepSeek chat, Kimi) for any business-sensitive information. The legal framework makes data protection impossible, regardless of what the privacy policy states.
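A tiering policy like this is easier to enforce when it is written down as data rather than left to individual judgment. The following is a minimal sketch under stated assumptions: the category labels are hypothetical, and a real policy would use whatever classification scheme your organization already has.

```python
from enum import IntEnum


class Tier(IntEnum):
    """Lower number means more restrictive handling."""
    LOCAL_ONLY = 1
    ENTERPRISE_API = 2
    CONSUMER_CHAT = 3


# Hypothetical category labels mapped to the tiers described above.
CATEGORY_TIER = {
    "trade_secret": Tier.LOCAL_ONLY,
    "customer_pii": Tier.LOCAL_ONLY,
    "financial_data": Tier.LOCAL_ONLY,
    "legal_document": Tier.LOCAL_ONLY,
    "general_code": Tier.ENTERPRISE_API,
    "marketing_copy": Tier.ENTERPRISE_API,
    "public_research": Tier.CONSUMER_CHAT,
    "general_knowledge": Tier.CONSUMER_CHAT,
}


def required_tier(categories: set[str]) -> Tier:
    """Return the most restrictive tier among the given categories.

    Unknown or missing categories fall back to LOCAL_ONLY (fail closed).
    """
    return min(
        (CATEGORY_TIER.get(c, Tier.LOCAL_ONLY) for c in categories),
        default=Tier.LOCAL_ONLY,
    )
```

The design choice worth noting is that the function fails closed: content that mixes categories inherits the strictest tier, and anything unclassified is treated as local-only.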
3. Strip Context Before Submitting
Before pasting content into any AI provider, remove or anonymize:
- Company names, client names, and individual names
- Account numbers, financial figures, and revenue data
- Server names, IP addresses, internal URLs, and infrastructure details
- Proprietary terminology that could identify your organization
- Dates and timelines that could reveal strategic plans
Replace specific details with generic placeholders. Instead of “Analyze our Q3 revenue decline from $4.2M to $3.8M after we lost the Acme Corp contract,” ask “Analyze a scenario where quarterly revenue declined approximately 10% after losing a major client.”
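Part of this stripping can be automated as a first pass before a human review. The following is a minimal sketch, not a complete anonymizer: the patterns and the `redact` helper are illustrative assumptions, they cover only a few of the items listed above, and mechanical placeholders do not replace the kind of generalization shown in the example (turning exact figures into "approximately 10%" still takes judgment).

```python
import re

# Illustrative patterns only; real data needs broader coverage and review.
PATTERNS = [
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP_ADDRESS]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    (re.compile(r"\$\d[\d,]*(?:\.\d+)?(?:\s?[KMB]\b)?"), "[AMOUNT]"),
    (re.compile(r"\bQ[1-4]\b(?:\s+\d{4})?"), "[QUARTER]"),
]


def redact(text, known_names=()):
    """Replace known names and pattern matches with generic placeholders."""
    # Longest names first so "Acme Corp" is replaced before "Acme".
    for name in sorted(known_names, key=len, reverse=True):
        text = re.sub(re.escape(name), "[ORGANIZATION]", text, flags=re.IGNORECASE)
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Applied to the example above, `redact("Analyze our Q3 revenue decline from $4.2M to $3.8M after we lost the Acme Corp contract.", known_names=["Acme Corp"])` yields "Analyze our [QUARTER] revenue decline from [AMOUNT] to [AMOUNT] after we lost the [ORGANIZATION] contract." The list of known names has to come from you; a regex cannot guess which words identify your clients.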
4. Use API Tiers with Explicit Data Policies
For business-critical AI usage, always use API tiers rather than consumer chat interfaces:
- OpenAI API: Data not used for training by default; a zero-retention option is available for eligible use cases, typically on request rather than automatically
- Anthropic API: Data not used for training by default
- Google Vertex AI: Enterprise data handling with contractual commitments
API usage is metered and takes more effort to set up than a consumer subscription, but the data protections are materially stronger. Consumer chat tiers are designed for mass data collection; API tiers are designed for business customers who would litigate over data misuse.
5. Audit and Rotate
Regularly audit your organization’s AI usage patterns:
- Which employees are using which providers?
- What types of data are being submitted?
- Are consumer accounts being used for business data?
- Have opt-out settings been properly configured?
- Are conversation histories being deleted on schedule?
Implement a quarterly review cycle. AI providers regularly update their terms of service and privacy policies, often in ways that expand their data usage rights. What was excluded from training last quarter may be included this quarter.
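A quarterly review is easier to keep honest if changes to policy text are detected mechanically. Below is a minimal sketch, assuming you save a copy of each provider's terms and privacy policy at every review; the function names and the provider labels are hypothetical. It fingerprints the normalized text and reports which providers changed since the last review. It only tells you that the wording changed, not whether the change matters, so a flagged policy still needs a human read.

```python
import hashlib
import re


def fingerprint(policy_text: str) -> str:
    """Hash the policy text after collapsing whitespace and case, so
    reflowed copies of the same wording produce the same fingerprint."""
    normalized = re.sub(r"\s+", " ", policy_text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def changed_policies(previous: dict, current_texts: dict) -> list:
    """Compare stored fingerprints with freshly saved policy texts.

    previous:      {provider: fingerprint recorded at the last review}
    current_texts: {provider: policy text saved during this review}
    Returns the providers whose policy is new or has changed.
    """
    return sorted(
        provider
        for provider, text in current_texts.items()
        if previous.get(provider) != fingerprint(text)
    )
```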
6. Implement Technical Controls
For organizations with technical resources, implement guardrails:
- Data Loss Prevention (DLP) rules: Configure network-level policies to detect and block submission of sensitive data patterns (SSNs, credit card numbers, API keys) to AI provider domains.
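- Proxy logging: Route AI provider traffic through a logging proxy to maintain an audit trail of what data is being submitted. An audit trail of this kind is useful evidence when demonstrating compliance with frameworks like SOC 2 and ISO 27001. Treat the logs themselves as sensitive, since they contain the very data you are trying to protect.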
- Separate browser profiles: Use dedicated browser profiles or containers for AI interactions, preventing session cookies from AI providers from tracking activity across other business applications.
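- VPN and DNS considerations: Be aware that AI providers log IP addresses. A VPN hides your office IP address and network location from the provider, but it does nothing to anonymize a logged-in account or the content of your prompts.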
For organizations building internal tools on AI APIs, encrypting stored prompts and responses with keys that the storage layer never sees (the principle behind zero-knowledge encryption) provides an additional layer of protection against both external breaches and internal overreach.
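The DLP rule described above comes down to pattern matching on outbound text. The sketch below shows the kind of check a proxy or pre-submit hook could run before a prompt reaches an AI provider domain. It is an illustration, not a substitute for a real DLP product: the regexes are deliberately simple, and the API key pattern in particular is an assumption about what keys look like, since formats vary by vendor.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
# Assumption: keys look like a short prefix plus a long token ("sk-...").
# Adjust this for the secret formats your organization actually issues.
API_KEY_RE = re.compile(r"\b(?:sk|pk|api|key)[-_][A-Za-z0-9_-]{20,}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to cut false positives on long digit runs."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_sensitive(text: str) -> list:
    """Return the kinds of sensitive data detected in the text."""
    findings = []
    if SSN_RE.search(text):
        findings.append("ssn")
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        findings.append("credit_card")
    if API_KEY_RE.search(text):
        findings.append("api_key")
    return findings


def should_block(text: str) -> bool:
    """Block the submission if any sensitive pattern is present."""
    return bool(find_sensitive(text))
```

The Luhn check is there because long digit runs (order numbers, timestamps) are common in ordinary prompts; without it, a card rule tends to block so much legitimate traffic that people route around it.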
7. Contractual Protections
For enterprise AI deployments, negotiate specific contractual terms:
- Explicit prohibition on using customer data for model training, evaluation, or any purpose beyond fulfilling the specific API request
- Defined data retention limits (ideally zero retention)
- Mandatory breach notification with specific timelines
- Right to audit data handling practices
- Data residency requirements (specify which jurisdictions your data may be stored and processed in)
- Prohibition on sub-processor data access without prior approval
Do not rely on the provider’s standard terms of service. Those terms are written to protect the provider, not you.
The Bottom Line
There is no privacy-safe major AI provider. Every company discussed in this guide — Google, OpenAI, Anthropic, Moonshot, and DeepSeek — is collecting more data than it needs, retaining it longer than necessary, and using it in ways that benefit the company at the expense of the user. The American providers operate under a legal framework that provides some theoretical protections but limited practical enforcement. The Chinese providers operate under a legal framework that mandates government access to your data.
The difference between the providers is one of degree, not of kind. Anthropic is marginally better than OpenAI on stated policies. Google is worse because of its cross-product data correlation capabilities. DeepSeek is the worst because of keystroke biometrics, Chinese intelligence law applicability, and documented security failures. But none of them are safe. None of them are acting as fiduciaries of your data. None of them have binding, enforceable, independently audited commitments to your privacy.
The only entity that will protect your privacy is you. Use local models for sensitive work. Strip context from cloud submissions. Use enterprise API tiers with contractual protections. Audit regularly. And do not trust any corporation — regardless of its stated values, its branding, or its country of origin — with data you cannot afford to lose control of.
For organizations looking to build AI capabilities while maintaining data sovereignty, we provide technical guidance on deploying AI infrastructure securely. For compliance-sensitive deployments, explore our compliance solutions or contact our team to discuss your specific requirements.