Course: AI and Data Governance

Compliance: The Dark Side: Legal Liability

By the end of this lesson, you’ll understand the risks of feeding Personally Identifiable Information (PII) into AI systems, the consequences of doing so, and how to implement safeguards that preserve privacy and regulatory compliance.

What Is PII, and Why Does It Matter in AI?

As mentioned in previous lessons, PII refers to any data that can identify an individual — names, emails, addresses, phone numbers, IP addresses, biometric data, and more. In traditional IT systems, protecting PII is already a serious responsibility. In AI systems it becomes even more critical, because data is absorbed into the model rather than merely stored: once PII enters a model’s training pipeline, it can become deeply embedded and effectively impossible to remove.

How PII Enters AI Systems and Why That’s a Problem

PII can find its way into AI through various channels:

  • Training Data: AI models often learn from massive datasets, some of which may contain PII if not properly filtered.

  • Data Collection Pipelines: When collecting customer feedback or scraping web data, identifiers may be unintentionally included.

  • Poor Anonymization: Flawed anonymization techniques can still leave data re-identifiable.

  • Third-Party Integrations: Sharing data across AI providers without proper governance multiplies the exposure risk.

Once this data is inside the model, it can surface in unpredictable ways — through autocomplete suggestions, chat responses, or image generation.
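
Because these entry points are easy to overlook, many teams screen records before they ever reach the training pipeline. Below is a minimal sketch of such a pre-ingestion check in Python; the patterns and the sample record are illustrative assumptions for demonstration only, and a production detector would typically layer regexes with named-entity recognition and curated allow-lists.

```python
import re

# Minimal pre-ingestion PII scan (illustrative only). These patterns are
# deliberately simple and will both miss and over-match real identifiers;
# production pipelines usually combine regexes with NER models and allow-lists.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_record(text: str) -> dict:
    """Return suspected PII found in one record, grouped by pattern name."""
    return {
        label: pattern.findall(text)
        for label, pattern in PII_PATTERNS.items()
        if pattern.search(text)
    }


if __name__ == "__main__":
    # Hypothetical record from a customer-feedback pipeline.
    sample = "Great support! Reach me at jane.doe@example.com or +1 (555) 012-3456."
    findings = scan_record(sample)
    if findings:
        print("PII detected, quarantining record:", findings)
```

Records that trigger a match would normally be routed to a redaction or review step rather than silently dropped, so the pipeline keeps an audit trail of what was removed and why.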

Consequences of Letting PII Slip Into AI

  • Privacy Breaches: AI may unintentionally reveal personal information in responses, violating user trust.

  • Regulatory Fines: Laws like GDPR, HIPAA, and CCPA mandate strict protections — breaches can result in massive penalties.

  • Reputational Harm: Data misuse in AI systems often makes headlines and quickly erodes public confidence.

Best Practices to Keep AI PII-Free

  • Data Minimization: Don’t collect what you don’t need. Only include data essential to the task.

  • Anonymization and Pseudonymization: Strip out identifiers before training. Use tokenization and masking when full anonymization isn’t possible (a minimal sketch follows this list).

  • Encryption and Access Control: Protect sensitive data at rest and in transit. Only allow access to approved systems or roles.

  • PII Detection Tools: Use automated scans to identify and remove PII from training and interaction datasets.

  • Federated Learning & Differential Privacy: Train AI on local data without centralizing sensitive content. Add statistical noise to obscure specifics while keeping aggregate patterns.

Real-World Examples:

Clearview AI’s Mass Surveillance & Privacy Violations

Case: Clearview AI Inc. vs. Multiple Countries (2021–Present)

Issue: Clearview AI scraped billions of facial images from social media and public websites without consent to train its facial recognition AI.

Consequences:

  • GDPR Violations: Fined €20M by Italy, with bans and enforcement actions in France, Sweden, and the UK.
  • U.S. Lawsuits: Settled with the ACLU (2022) under Illinois’ Biometric Information Privacy Act (BIPA), agreeing to stricter consent controls.
  • Meta & Google Legal Actions: Both companies took legal action against Clearview over unauthorized scraping of their platforms in violation of their terms of service.
A country-by-country summary of regulatory actions (year, authority, fine or action, and outcome):

  • United States (Illinois), 2024–2025, class-action settlement under BIPA: $51.75M equity stake to plaintiffs. Clearview collected biometric data without consent; the settlement was resolved by granting equity in lieu of cash.
  • United States, 2022, ACLU lawsuit: usage restrictions. Clearview was barred from selling to private U.S. entities and limited to offering its services to law enforcement.
  • Canada, 2021, Federal and Provincial Privacy Commissioners: ceased operations in Canada. Found to have engaged in mass surveillance and ordered to delete all data related to Canadians (the company partially complied).
  • UK, 2022–2023, Information Commissioner’s Office (ICO): £7.5M fine (revoked on appeal). Initially fined for violating UK data laws; Clearview later won its appeal on jurisdictional grounds.
  • France, 2022, CNIL (Data Protection Authority): €20M fine for scraping images without consent. Clearview failed to delete the data, resulting in continued non-compliance.
  • Italy, 2022, Italian DPA: €20M fine for unlawful collection of biometric data, plus an order to delete Italian citizens’ data.
  • Greece, 2022, Hellenic DPA: €20M fine for GDPR violations and a ban on processing data of Greek residents.
  • Netherlands, 2024, Dutch Data Protection Authority: €30.5M fine for creating a biometric database without consent, with a warning of additional penalties for continued non-compliance.
  • Australia, 2021–2024, Office of the Australian Information Commissioner (OAIC): deletion order (not enforced). Found to have breached Australian privacy law; the regulator dropped further pursuit after Clearview’s non-cooperation.

Lesson: AI models trained on non-consensual PII (especially biometrics) face global legal repercussions.

Google & University of Chicago Medical Center – Patient Data Leak in AI Research

Case: Dinerstein v. Google & U. of Chicago (2019)

Issue: Google partnered with the hospital to develop AI for predicting medical outcomes using patient records.

What Went Wrong?

  • The hospital allegedly shared full, unredacted medical records (including PII such as names, dates, and notes) with Google.
  • Patients were unaware their data was used for AI training.

Consequences:

  • HIPAA & GDPR Violations: Class-action lawsuit alleged illegal sharing of identifiable health data.
  • Research Halted: Google paused the project amid scrutiny.

Lesson: Even “anonymized” medical data can be re-identified; strict governance is mandatory.

OpenAI Lawsuits for Copyright and Privacy Infringement

Case: Multiple lawsuits (2022–present) over OpenAI’s data usage.

Issue: OpenAI has faced findings, or is facing ongoing proceedings, over allegations that it:

  • Committed copyright infringement through unauthorized use of articles in AI training.
  • Infringed privacy laws through improper data collection.
  • Stored and shared recordings with third-party contractors for AI training without explicit consent.

Consequences:

  • GDPR & CCPA lawsuits: Faced lawsuits in Europe and California for unauthorized data collection.
  • Class-Action Lawsuits: Some have settled and others remain open, with claims of illegal data usage amounting to billions of dollars.

Key lawsuits and regulatory actions (date, plaintiff or authority, allegation, and status):

  • Dec 2023, The New York Times: copyright infringement for unauthorized use of articles in AI training. Lawsuit ongoing; OpenAI’s motion to dismiss was denied in March 2025, allowing the case to proceed.
  • Nov 2024, Canadian media outlets (e.g., CBC, The Globe and Mail): unauthorized use of journalistic content for AI training. Lawsuit filed; plaintiffs seek up to C$20,000 per article, potentially amounting to billions in damages.
  • July–Sept 2023, Authors Guild, Sarah Silverman, George R.R. Martin, and others: copyright infringement for using literary works in AI training. Multiple lawsuits filed; some claims dismissed, but core infringement allegations remain active.
  • Dec 2024, Italian Data Protection Authority (Garante): GDPR violations, including improper data collection and lack of transparency in ChatGPT. Fined €15 million; OpenAI plans to appeal, asserting the fine is disproportionate.
  • Nov 2023, The Intercept, Raw Story, and AlterNet: copyright infringement for unauthorized use of articles in AI training. Lawsuits filed; some claims dismissed, but key allegations proceed.
  • April 2024, eight U.S. newspapers (e.g., Chicago Tribune, Denver Post): copyright infringement and dissemination of false information via AI outputs. Lawsuit filed; plaintiffs allege AI-generated content was falsely attributed to their publications.
  • June 2023, class action by anonymous plaintiffs: unauthorized scraping of 300 billion words without consent. Lawsuit filed in California; plaintiffs allege violation of data privacy rights.
  • April 2024, Daily News LP: copyright infringement for use of news articles in AI training. Lawsuit filed; proceedings ongoing.
  • Nov 2024, Asian News International (ANI): unauthorized use of news content in AI training. Lawsuit filed; proceedings ongoing.
  • Sept 2023, Authors Guild (including John Grisham and Jodi Picoult): copyright infringement for use of literary works in AI training. Class action filed; proceedings ongoing.
  • July 2023, Paul Tremblay and Mona Awad: copyright infringement for unauthorized use of novels in AI training. Lawsuit filed; proceedings ongoing.
  • Nov 2022, Doe 3 v. OpenAI and GitHub: breach of contract related to AI-generated code. Lawsuit filed; proceedings ongoing.
  • 2025, The New York Times Company v. Microsoft Corporation et al. (Case No. 1:23-cv-11195): copyright infringement claims still ongoing. A court order requires retention of data for 30 days, including inputs and outputs and any private data they may contain.

Lesson: Never use private (or copyrighted) data without consent. 

Zoom’s AI Training on User Calls Without Consent

Case: Zoom’s 2023 Terms of Service Scandal

Issue: Zoom quietly updated its terms to claim it could use customer video/audio data to train AI models—without clear opt-in consent.

What Went Wrong?

  • Schools, businesses, and therapists unknowingly risked patient/client PII being fed into AI.
  • Public outcry forced Zoom to backtrack, but data had already been ingested.

Consequences:

  • Legal Threats: Assuming implied consent risked violating GDPR and CCPA.
  • Reputation Damage: Competitors (Microsoft Teams, Webex) capitalized on distrust.

Lesson: Never assume consent—AI training on PII requires explicit, granular permissions.

Stable Diffusion’s AI-Generated “Fake Nudes” of Real People

Case: Stability AI & Midjourney lawsuits (2023–2024)

Issue: AI image generators like Stable Diffusion were trained on scraped online photos, including:

  • Social media images (LinkedIn, Instagram) used without permission.
  • Non-consensual deepfake porn of celebrities and private individuals.

Consequences:

  • Class-Action Lawsuit: Artists & photographers sued for copyright and PII violations (biometric data in faces).
  • EU Investigation: Potential fines under the AI Act for unethical training data.

Lesson: Scraping public data ≠ legal consent. AI must filter PII/biometrics and respect opt-outs.


Prevention is Better Than Cure

PII must be treated as radioactive material in AI development — even a small leak can be harmful. By practicing strong governance, pre-processing data carefully, and selecting privacy-first architectures, we can ensure that the power of AI doesn’t come at the cost of individual rights.