Course: AI and Data Governance

Compliance: The Dark Side: Legal Liability

By the end of this lesson, you’ll understand the risks of feeding Personally Identifiable Information (PII) into AI systems, the consequences of doing so, and how to implement safeguards that preserve privacy and regulatory compliance.

What Is PII, and Why Does It Matter in AI?

As mentioned in previous lessons, PII refers to any data that can identify an individual — names, emails, addresses, phone numbers, IP addresses, biometric data, and more. In traditional IT systems, protecting PII is already a serious responsibility. In AI systems it becomes even more critical, because data is absorbed into the model rather than merely stored: once PII enters a model’s training pipeline, it can become deeply embedded and effectively impossible to remove.

How PII Enters AI Systems and Why That’s a Problem

PII can find its way into AI through various channels:

  • Training Data: AI models often learn from massive datasets, some of which may contain PII if not properly filtered.

  • Data Collection Pipelines: When collecting customer feedback or scraping web data, identifiers may be unintentionally included.

  • Poor Anonymization: Flawed anonymization techniques can still leave data re-identifiable.

  • Third-Party Integrations: Sharing data across AI providers without proper governance multiplies the exposure risk.

Once this data is inside the model, it can surface in unpredictable ways — through autocomplete suggestions, chat responses, or image generation.
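
Because these entry points are easy to overlook, many teams screen records before they ever reach the training pipeline. Below is a minimal sketch of such a pre-ingestion check in Python; the patterns and the sample record are illustrative assumptions for demonstration only, and a production detector would typically layer regexes with named-entity recognition and curated allow-lists.

```python
import re

# Minimal pre-ingestion PII scan (illustrative only). These patterns are
# deliberately simple and will both miss and over-match real identifiers;
# production pipelines usually combine regexes with NER models and allow-lists.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scan_record(text: str) -> dict:
    """Return suspected PII found in one record, grouped by pattern name."""
    return {
        label: pattern.findall(text)
        for label, pattern in PII_PATTERNS.items()
        if pattern.search(text)
    }


if __name__ == "__main__":
    # Hypothetical record from a customer-feedback pipeline.
    sample = "Great support! Reach me at jane.doe@example.com or +1 (555) 012-3456."
    findings = scan_record(sample)
    if findings:
        print("PII detected, quarantining record:", findings)
```

Records that trigger a match would normally be routed to a redaction or review step rather than silently dropped, so the pipeline keeps an audit trail of what was removed and why.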

Consequences of Letting PII Slip Into AI

  • Privacy Breaches: AI may unintentionally reveal personal information in responses, violating user trust.

  • Regulatory Fines: Laws like GDPR, HIPAA, and CCPA mandate strict protections — breaches can result in massive penalties.

  • Reputational Harm: Data misuse in AI systems often makes headlines and quickly erodes public confidence.

Best Practices to Keep AI PII-Free

  • Data Minimization: Don’t collect what you don’t need. Only include data essential to the task.

  • Anonymization and Pseudonymization: Strip out identifiers before training. Use tokenization and masking when full anonymization isn’t possible (a minimal sketch follows this list).

  • Encryption and Access Control: Protect sensitive data at rest and in transit. Only allow access to approved systems or roles.

  • PII Detection Tools: Use automated scans to identify and remove PII from training and interaction datasets.

  • Federated Learning & Differential Privacy: Train AI on local data without centralizing sensitive content. Add statistical noise to obscure specifics while keeping aggregate patterns.

Real-World Examples:

Clearview AI’s Mass Surveillance & Privacy Violations

Case: Clearview AI Inc. vs. Multiple Countries (2021–Present)

Issue: Clearview AI scraped billions of facial images from social media and public websites without consent to train its facial recognition AI.

Consequences:

  • GDPR Violations: Fined €20M by Italy, with bans and enforcement actions in France, Sweden, and the UK.
  • U.S. Lawsuits: Settled with the ACLU (2022) under Illinois’ Biometric Information Privacy Act (BIPA), agreeing to stricter consent controls.
  • Meta & Google Legal Actions: Both companies took legal action against Clearview over unauthorized scraping of their platforms in violation of their terms of service.
A country-by-country summary of regulatory actions (year, authority, fine or action, and outcome):

  • United States (Illinois), 2024–2025, class-action settlement under BIPA: $51.75M equity stake to plaintiffs. Clearview collected biometric data without consent; the settlement was resolved by granting equity in lieu of cash.
  • United States, 2022, ACLU lawsuit: usage restrictions. Clearview was barred from selling to private U.S. entities and limited to offering its services to law enforcement.
  • Canada, 2021, Federal and Provincial Privacy Commissioners: ceased operations in Canada. Found to have engaged in mass surveillance and ordered to delete all data related to Canadians (the company partially complied).
  • UK, 2022–2023, Information Commissioner’s Office (ICO): £7.5M fine (revoked on appeal). Initially fined for violating UK data laws; Clearview later won its appeal on jurisdictional grounds.
  • France, 2022, CNIL (Data Protection Authority): €20M fine for scraping images without consent. Clearview failed to delete the data, resulting in continued non-compliance.
  • Italy, 2022, Italian DPA: €20M fine for unlawful collection of biometric data, plus an order to delete Italian citizens’ data.
  • Greece, 2022, Hellenic DPA: €20M fine for GDPR violations and a ban on processing data of Greek residents.
  • Netherlands, 2024, Dutch Data Protection Authority: €30.5M fine for creating a biometric database without consent, with a warning of additional penalties for continued non-compliance.
  • Australia, 2021–2024, Office of the Australian Information Commissioner (OAIC): deletion order (not enforced). Found to have breached Australian privacy law; the regulator dropped further pursuit after Clearview’s non-cooperation.

Lesson: AI models trained on non-consensual PII (especially biometrics) face global legal repercussions.

Google & University of Chicago Medical Center – Patient Data Leak in AI Research

Case: Dinerstein v. Google & U. of Chicago (2019)

Issue: Google partnered with the hospital to develop AI for predicting medical outcomes using patient records.

What Went Wrong?

  • The hospital allegedly shared full, unredacted medical records (including PII such as names, dates, and notes) with Google.
  • Patients were unaware their data was used for AI training.

Consequences:

  • HIPAA & GDPR Violations: Class-action lawsuit alleged illegal sharing of identifiable health data.
  • Research Halted: Google paused the project amid scrutiny.

Lesson: Even “anonymized” medical data can be re-identified; strict governance is mandatory.

OpenAI Lawsuits for Copyright and Privacy Infringement

Case: Multiple lawsuits (2022–present) over OpenAI’s data usage.

Issue: OpenAI has faced findings, or is facing ongoing proceedings, over allegations that it:

  • Committed copyright infringement through unauthorized use of articles in AI training.
  • Infringed privacy laws through improper data collection.
  • Stored and shared recordings with third-party contractors for AI training without explicit consent.

Consequences:

  • GDPR & CCPA lawsuits: Faced lawsuits in Europe and California for unauthorized data collection.
  • Class-Action Lawsuits: Some have settled and others remain open, with claims of illegal data usage amounting to billions of dollars.

Key lawsuits and regulatory actions (date, plaintiff or authority, allegation, and status):

  • Dec 2023, The New York Times: copyright infringement for unauthorized use of articles in AI training. Lawsuit ongoing; OpenAI’s motion to dismiss was denied in March 2025, allowing the case to proceed.
  • Nov 2024, Canadian media outlets (e.g., CBC, The Globe and Mail): unauthorized use of journalistic content for AI training. Lawsuit filed; plaintiffs seek up to C$20,000 per article, potentially amounting to billions in damages.
  • July–Sept 2023, Authors Guild, Sarah Silverman, George R.R. Martin, and others: copyright infringement for using literary works in AI training. Multiple lawsuits filed; some claims dismissed, but core infringement allegations remain active.
  • Dec 2024, Italian Data Protection Authority (Garante): GDPR violations, including improper data collection and lack of transparency in ChatGPT. Fined €15 million; OpenAI plans to appeal, asserting the fine is disproportionate.
  • Nov 2023, The Intercept, Raw Story, and AlterNet: copyright infringement for unauthorized use of articles in AI training. Lawsuits filed; some claims dismissed, but key allegations proceed.
  • April 2024, eight U.S. newspapers (e.g., Chicago Tribune, Denver Post): copyright infringement and dissemination of false information via AI outputs. Lawsuit filed; plaintiffs allege AI-generated content was falsely attributed to their publications.
  • June 2023, class action by anonymous plaintiffs: unauthorized scraping of 300 billion words without consent. Lawsuit filed in California; plaintiffs allege violation of data privacy rights.
  • April 2024, Daily News LP: copyright infringement for use of news articles in AI training. Lawsuit filed; proceedings ongoing.
  • Nov 2024, Asian News International (ANI): unauthorized use of news content in AI training. Lawsuit filed; proceedings ongoing.
  • Sept 2023, Authors Guild (including John Grisham and Jodi Picoult): copyright infringement for use of literary works in AI training. Class action filed; proceedings ongoing.
  • July 2023, Paul Tremblay and Mona Awad: copyright infringement for unauthorized use of novels in AI training. Lawsuit filed; proceedings ongoing.
  • Nov 2022, Doe 3 v. OpenAI and GitHub: breach of contract related to AI-generated code. Lawsuit filed; proceedings ongoing.
  • 2025, The New York Times Company v. Microsoft Corporation et al. (Case No. 1:23-cv-11195): copyright infringement claims still ongoing. A court order requires retention of data for 30 days, including inputs and outputs and any private data they may contain.

Lesson: Never use private (or copyrighted) data without consent. 

Zoom’s AI Training on User Calls Without Consent

Case: Zoom’s 2023 Terms of Service Scandal

Issue: Zoom quietly updated its terms to claim it could use customer video/audio data to train AI models—without clear opt-in consent.

What Went Wrong?

  • Schools, businesses, and therapists unknowingly risked patient/client PII being fed into AI.
  • Public outcry forced Zoom to backtrack, but data had already been ingested.

Consequences:

  • Legal Threats: Assuming implied consent risked violating GDPR and CCPA.
  • Reputation Damage: Competitors (Microsoft Teams, Webex) capitalized on distrust.

Lesson: Never assume consent—AI training on PII requires explicit, granular permissions.

Stable Diffusion’s AI-Generated “Fake Nudes” of Real People

Case: Stability AI & Midjourney lawsuits (2023–2024)

Issue: AI image generators like Stable Diffusion were trained on scraped online photos, including:

  • Social media images (LinkedIn, Instagram) used without permission.
  • Non-consensual deepfake porn of celebrities and private individuals.

Consequences:

  • Class-Action Lawsuit: Artists & photographers sued for copyright and PII violations (biometric data in faces).
  • EU Investigation: Potential fines under the AI Act for unethical training data.

Lesson: Scraping public data ≠ legal consent. AI must filter PII/biometrics and respect opt-outs.


Prevention is Better Than Cure

PII must be treated as radioactive material in AI development — even a small leak can be harmful. By practicing strong governance, pre-processing data carefully, and selecting privacy-first architectures, we can ensure that the power of AI doesn’t come at the cost of individual rights.