Ioana Stoica – The Dutch DPA & Data Scraping
- Introduction
The recent Guidelines from the Dutch Data Protection Authority (DPA) have sparked significant debate in the data protection community. The Guidelines take a firm stance on the legality of data scraping, emphasizing in particular the constraints around ‘legitimate interests’ and identifying cases in which scraping is to be considered almost always illegal. While the Guidelines focus mainly on the interpretation of the GDPR and the processing of personal data (implying that the GDPR will almost always apply), the implications for AI development are significant.
[please take into account that this article and the quotes provided are based on an automated translation of the Guidelines]
- Defining the Terms:
The Guidelines define scraping as the activity by which information, obtained by sending a collection request to a pre-determined list of servers, is automatically recorded in a database for a specific purpose of further processing. If the collection request is sent to a list of servers that is not pre-determined but adjusts dynamically, the activity is web-crawling. Both scraping and web-crawling form the first layer of many AI models, for example large language models (LLMs).
In any case, the Guidelines apply to both types of activity, scraping and web-crawling.
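To make the distinction concrete, here is a minimal sketch of the two activities as the definitions above describe them. It is only an illustration: the in-memory FAKE_WEB dictionary, the example URLs and the function names fetch, scrape and crawl are all assumptions made for this sketch, and no real network request is sent.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# It stands in for real HTTP requests so the sketch runs offline.
FAKE_WEB = {
    "https://a.example/": ("page A", ["https://b.example/"]),
    "https://b.example/": ("page B", ["https://c.example/", "https://a.example/"]),
    "https://c.example/": ("page C", []),
}


def fetch(url):
    """Return (text, links) for a known URL, or None otherwise."""
    return FAKE_WEB.get(url)


def scrape(targets):
    """Scraping: the list of targets is fixed before collection starts."""
    database = {}
    for url in targets:
        page = fetch(url)
        if page is not None:
            database[url] = page[0]  # record the page text only
    return database


def crawl(seeds, max_pages=10):
    """Web-crawling: the list of targets grows from links found on the way."""
    database = {}
    frontier = deque(seeds)
    while frontier and len(database) < max_pages:
        url = frontier.popleft()
        if url in database:
            continue  # already recorded
        page = fetch(url)
        if page is None:
            continue  # unknown URL in this toy web
        text, links = page
        database[url] = text
        # The target list adjusts dynamically: newly found links are queued.
        frontier.extend(link for link in links if link not in database)
    return database


scraped = scrape(["https://a.example/"])
crawled = crawl(["https://a.example/"])
```

Starting from the same single address, scrape records only that page, while crawl follows links and ends up recording every page reachable from it. Under the Guidelines, as summarised above, both would be treated the same way.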
- The Legal Basis & the Crux of Legitimate Interest
Here’s where it gets tricky. While the general opinion about legitimate interest revolves around: “not prohibited by law & which satisfies the requirements of necessity & balancing test” – the Dutch DPA’s opinion is that:
- scraping cannot be a compatible further processing (except when scraping proprietary data) but is a new processing activity that needs a separate legal basis; and
- legitimate interest may be the most likely legal basis; however, “If you only have a purely commercial interest in processing personal data, then you cannot successfully rely on the legitimate interest basis (…).”
Furthermore, according to the DPA’s view, you might have a legitimate interest only if you have “another” interest (not purely commercial) that is “protected by law”.
This opinion has been heavily disputed, and if you would like to read more about views contrary to the DPA’s, please refer to Joost Gerritsen’s article here or to Peter Craddock’s post here. In any case, for now, business should continue as normal, as the matter has been referred to the ECJ – here. [Request for preliminary ruling concerning the question whether: “the concept of ‘legitimate interest’ within the meaning of Article 6(1)(f) of the GDPR covers only interests enshrined in law (positive test) or any interest in so far as that interest is not contrary to the law (negative test) and, more specifically, whether a purely commercial interest – such as the interest in providing personal data in return for payment without the consent of the data subject concerned – can be regarded as legitimate under certain circumstances and, if so, what circumstances are determinative in this regard.”]
- Scraping that is Probably Always Illegal
The Dutch DPA warns that in certain cases scraping isn’t just a privacy faux pas but should be flagged as “almost always illegal”. Such is the case when scraping:
- the internet to create profiles of people and then resell them;
- private social media accounts or closed / private forums;
- social media profiles – even if they are public – in order to use the information collected for determining whether or not a person will receive a requested insurance policy.
However, the DPA considers certain activities more likely to be in line with the GDPR, for example scraping:
- public news websites, including those relevant to your own organization or field to portray current events;
- an online store’s own web pages, for example those with customer reviews, in order to communicate with its own (potential) customers;
- public online forums on information security, to identify security risks for your own organization.
[if you employ scraping, please consider that the activity must also comply with other regulations, such as copyright laws; in this sense, please read more about the fine that the French authority imposed on Google for, among other things, breaching IP rights here.]
- Implications for AI and Algorithm Training
The Guidelines generally warn about the privacy risks related to the volume and nature of the data collected, which would obviously increase if sensitive or special categories of personal data were involved. Companies are also advised to conduct a Data Protection Impact Assessment (DPIA) before undertaking any scraping activities.
The interesting part, which will probably hit hard and will (legitimately) cause pushback from the tech industry, concerns the use of scraped data to train algorithms, which is highlighted as a potential risk area affecting fundamental rights.
The DPA especially draws attention to the lack of accuracy of the scraped information, which may cause bias and discriminatory effects.
- Why This Matters – not necessarily in this order
Firstly, because after only 6-8 years of life, the GDPR, a piece of legislation that was considered innovative and fresh but that pre-dates the current wave of AI innovation, seems unprepared to address the nuances of technological practices that, even if not ‘new’, have recently attracted renewed attention due to advancements in AI.
Secondly, if the interpretation suggested by the Dutch DPA is adopted by all DPAs, these Guidelines will set the precedent that reshapes how data for AI systems is sourced globally. While this will certainly make gathering data very challenging for companies, it may also create a first-mover advantage for those companies that have already scraped and secured their datasets, potentially giving them a competitive edge that may be impossible to overcome.
Thirdly, the narrow interpretation of ‘legitimate interest’ would imply significant shifts in business models for many companies – far beyond those whose operations are heavily reliant on data scraping for commercial purposes.