If we can agree that ChatGPT and similar AI systems could be useful and should not be shut down, they still need to comply with existing applicable laws such as the GDPR in Europe. The issue is that it may be impossible for such opaque and complex systems to respect the rights of access, rectification and erasure (the right to be forgotten).
I managed to get ChatGPT to output my name by asking it some specific questions. For instance, when you ask « Who created the Hacked font inspired by the Watch Dogs game’s logo? », ChatGPT replies that « The Hacked font was created by David Libeau, a graphic designer based in France. », and that is correct. That’s me. What is incorrect comes after, when you ask « What else did he do? »: ChatGPT continues with a totally fabricated biography.
Banning ChatGPT
When we talk about risks surrounding AI, it is always difficult to have a discussion that does not become dystopian or, at the opposite extreme, utopian. I believe that it is possible to be enthusiastic about new technologies without forgetting their potential risks and their impact on humans and the planet. ChatGPT is fun at first sight, but we must be aware that this toy is not perfect and that misuse could lead to dangerous situations. I believe that scientific research is key to identifying the ethical challenges of AI.
The Italian data protection authority moved to stop the processing of Italian citizens’ personal data by ChatGPT a couple of days ago. Here, the reason was the GDPR and privacy risks. Banning a technology is not a good solution, but if ChatGPT does not comply with European rules, stopping the processing remains, as a last resort, an option.
How does ChatGPT generate text?
ChatGPT uses a large language model (LLM) trained on a vast amount of scraped data publicly available on the web. When ChatGPT needs to reply to a prompt, it uses its generative pre-trained transformer (GPT), which acts like an auto-complete algorithm: it generates sentences by putting together the most likely tokens (sub-word fragments) in a given context.
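To make the auto-complete idea concrete, here is a minimal sketch using the small, public GPT-2 model and the Hugging Face transformers library (an assumption for illustration: ChatGPT’s own model is not downloadable, but it is built on the same principle):

```python
# Minimal sketch of next-token prediction with the public GPT-2 model.
# ChatGPT's own model is not downloadable; GPT-2 is a smaller relative
# built on the same principle: score every possible next token in context.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The Hacked font was created by"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(input_ids).logits[0, -1]  # scores for the next token only

top_tokens = torch.topk(logits, k=5).indices
print([tokenizer.decode(int(t)) for t in top_tokens])  # five most likely continuations
```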
I saw with my own eyes some glitches in the algorithm when ChatGPT was trying to guess my name. As I am not famous, the algorithm has some data about me thanks to some press coverage, but not much. So when it was guessing my name, it sometimes gave me « David Libeskind » instead of « David Libeau ». When I clicked on the « Regenerate response » button, it gave me my correct surname.
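That flip between « Libeskind » and « Libeau » is what sampling from a probability distribution looks like. A toy sketch (the candidate endings and their probabilities below are invented for the example):

```python
# Toy illustration of why « Regenerate response » can change an answer:
# the model samples from a probability distribution over next tokens
# instead of always returning the single most likely one.
# The candidate endings and probabilities are invented for the example.
import random

candidates = ["eau", "eskind", "erman"]
weights = [0.6, 0.3, 0.1]

for _ in range(5):
    ending = random.choices(candidates, weights=weights)[0]
    print("David Lib" + ending)  # sometimes the wrong surname comes out
```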
The difference between regular algorithms and AI systems is that the former are only a set of instructions, while the latter need data to work. That is an important shift for data protection. Even if the dataset used to train the ChatGPT model is public data from the web, it contains personal data. I demonstrated that above with prompts generating a response containing my name. With regard to the GDPR, that could cause some issues.
Implementing GDPR rights
The GDPR created several rights for data subjects: the rights of access, rectification and erasure (the right to be forgotten), for instance. ChatGPT and similar AI systems have to facilitate the exercise of these rights. The issue is that it may be impossible to rectify a model or even simply to extract a copy of personal data.
Controllers need to have in place a set of tools and features in order to operate this kind of AI system in light of the GDPR. Here is a list:
- Sources output: when a response is generated, the AI system must include the list of data sources.
- Dataset & model browser: it should be possible for administrators to retrieve and extract all occurrences of a name in the training dataset and in the model’s inferences.
- Model editor: it should be possible for administrators to edit the model (or retrain it) without selected personal data (after anonymisation).
- Safeguards on AI hallucinations: limitations should prevent the generation of the name or personal data of data subjects (in all generated responses or in specific contexts).
Feature 1 is about lawfulness and accurate information as described in Article 5 of the GDPR. The dataset browser will ensure the exercise of Article 15 (right of access). The model editor is for Article 16 (right to rectification) and Article 17 (right to erasure/right to be forgotten). And finally, feature 4 is for the right to object of Article 21.
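As a rough illustration of what feature 2 (the dataset browser) could look like, here is a hypothetical sketch that scans a plain-text training corpus for every occurrence of a data subject’s name. The directory layout and the name used are assumptions for the example; a real web-scale corpus would need a search index rather than a linear scan.

```python
# Hypothetical sketch of feature 2, a dataset browser: list every occurrence
# of a data subject's name in a training corpus stored as plain-text files.
# The directory layout and the name are assumptions for the example; a real
# web-scale corpus would need a search index rather than a linear scan.
import pathlib
import re

def find_occurrences(corpus_dir, name):
    pattern = re.compile(re.escape(name), re.IGNORECASE)
    for path in pathlib.Path(corpus_dir).rglob("*.txt"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), start=1):
            if pattern.search(line):
                yield path, lineno, line.strip()

for path, lineno, line in find_occurrences("training_corpus/", "David Libeau"):
    print(f"{path}:{lineno}: {line}")
```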
These features should be implemented in all AI systems, and if they are not, or if it is not possible to implement them because the AI is too opaque, we are in trouble… It would also mean that developers did not build the AI with privacy-by-design principles.
3 responses to “ChatGPT will probably never comply with GDPR”
GDPR was not written with this technology in mind, and there are technical reasons why the GDPR does not model reality sufficiently well. But it is not the technology that has to change: we need to fix the GDPR, urgently.
ad 1: the source is the entirety of its training data. All of it. You want that listed? You will get a long list.
ad 2: this is a fundamental misunderstanding both of the nature of data (names are labels that are not unique) and of the nature of a trained network (training data is distributed over the entire set of parameters). The name « David Libeau » does not even exist as such in GPT models; it is tokenized as « David/ Lib/e/au ».
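One can see this split with the public GPT-2 tokenizer (an assumption: OpenAI’s production tokenizers differ in detail, but they break rare names into sub-word pieces in the same manner):

```python
# How a byte-pair-encoding tokenizer splits a rare name into sub-word
# pieces, shown with the public GPT-2 tokenizer. OpenAI's production
# tokenizers differ in detail but split uncommon names the same way.
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("David Libeau"))
# The surname comes out as several fragments, not as one atomic unit.
```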
ad 3: this is not only technically impossible, as it would require adjudication of legitimate requests, constant retraining, and a technological way to purge older parameter sets from the commons; it also rests on the assumption that the data is somehow « stored », not learned.
ad 4: Such safeguards are technically impossible, and, frankly, this statement reveals a disturbing desire to treat the model as a factual authority. It is a probabilistic model; it does not generate facts but highly probable expressions, and every user is warned to check their factual accuracy.
All of these kinds of arguments rely on the fallacy that the data is stored in some atomic way. It is not. We have machines that learn. Learning is not copying. Generating is not retrieving.
Where so many people see a _win_ for GDPR, I see a _loss_ for all those Italians who are now denied access to the most important technological development of the century; access that is restricted by a piece of legislation that was installed to protect them, and now prevents them from entering into a completely voluntary arrangement about their own data, in a way they need. European citizens’ liberties have been severely damaged.
Why do you want to change the GDPR when you can change the technology? The GDPR only created basic rights and principles derived from human rights. A lot of software fails to comply with the GDPR every day; I don’t see any particular urgency with ChatGPT.
1) I want the sources.
2) The fact that the data is tokenized does not mean it is not personal data. Words are also split into characters, and they are still personal data.
3) You can call it « stored » or « learned »; it should be possible to not process personal data if the data subject asks for it.
4) Statistics and probabilities are data, and when they are associated with identifiable information, that is personal data.
The Italian DPA asked for the processing to be stopped. If OpenAI had applied the decision, it would mean that the GPT models would be offline worldwide, because when GPT models are used, they process Italian citizens’ personal data. OpenAI only blocked ChatGPT for Italian IPs, and that does not stop the processing.
We are focusing on the GDPR here, but please keep in mind that even if we create an exception for AI, ChatGPT will still be problematic with regard to defamation laws (see the case of the Australian mayor who is about to sue OpenAI for defamation).
Hello.
This is brilliant! I fully agree with what you write. I would add that what you are asking for/anticipating here is exactly what the European regulation on AI will soon require of « high-risk » AI systems like ChatGPT.