A group of authors are suing various vendors of Large Language Model AIs. The authors claim that the AIs were trained on their copyrighted books without permission.
Is that likely? Well, let's take a quick look at the evidence presented.
First up, Meta's LLaMA Paper. It describes how the LLM was trained:
We include two book corpora in our training dataset: the Gutenberg Project, which contains books that are in the public domain, and the Books3 section of ThePile (Gao et al., 2020)
OK, Gutenberg is out-of-copyright books. That seems like fair game. But what about Books3?
Let's look at The Pile Paper. Here's how that describes "Books3":
Books3 is a dataset of books derived from a copy of the contents of the Bibliotik private tracker made available by Shawn Presser (Presser, 2020).
Following that reference takes us to a Tweet by Shawn Presser, where he describes Books3 as:
Suppose you wanted to train a world-class GPT model, just like OpenAI. How? You have no data.
Now you do. Now everyone does.
Presenting "books3", aka "all of bibliotik"
- 196,640 books
- in plain .txt
- reliable, direct download, for years
And then he links to a 37GB file.
So, what is "Bibliotik"? The site itself isn't particularly instructive. But various Torrent forums describe it as:
Latest Free Indie Books – As a member of Bibliotik.me, you will not have to wait for a long time to get your hands on the latest free indie book or edition of a particular free indie series. The members of the community upload new free indie books every day to the website’s content library.
There is a file listing available (a 20MB .txt file), which appears to include books written by the complainants.
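If you want to check a listing like that yourself, something along these lines would do it. This is a minimal sketch, not the method anyone in the case used: I'm assuming the listing is plain text with one entry per line and that the author's name appears somewhere in each entry. The function name `find_authors` and the sample entries are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable


def find_authors(lines: Iterable[str], authors: Iterable[str]) -> dict[str, list[str]]:
    """Map each author to the listing entries that mention them (case-insensitive)."""
    # Pre-compute the case-folded form of each name once.
    wanted = {author: author.casefold() for author in authors}
    hits: dict[str, list[str]] = defaultdict(list)
    for line in lines:
        folded = line.casefold()
        for author, needle in wanted.items():
            if needle in folded:
                hits[author].append(line.strip())
    # Authors with no matching entries are simply absent from the result.
    return dict(hits)


# Hypothetical entries standing in for lines of the real listing.
sample = [
    "Example Author - A Made-Up Title.epub.txt",
    "Someone Else - Another Book.txt",
    "EXAMPLE AUTHOR - Second Made-Up Title.txt",
]

for author, entries in find_authors(sample, ["Example Author", "Absent Writer"]).items():
    print(f"{author}: {len(entries)} matching entries")
```

Because `find_authors` accepts any iterable of strings, an open file object can be passed in directly, so a 20MB listing is read line by line rather than loaded in one go. A plain substring match will throw up false positives for common names, so treat any hits as a prompt for a closer look rather than proof.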
I don't have the time and space to download the 37GB file. Nor do I want the legal liability if it is full of illicit material. But if those authors' books are in there... isn't this a slam dunk case?
Meta published a paper which, in effect, says "We trained this AI on Intellectual Property which we knew had been obtained without the owners' consent."
Now, you can argue all day about whether an AI being able to summarise a book is fair use. Or if reading a borrowed book is a crime. I'm even happy to hear arguments about whether it is legally binding to say "No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means without the prior written permission of the publisher".
But... come on! If a regular person published a confession about their piracy and how they're storing thousands of pirated works, the copyright goon squad would be knocking down their doors!
I suspect we're about to hear some arguments from AI-maximalists that LLaMA is sentient and that deleting it would be akin to murder - and wiping out AIs trained on stolen property is literally genocide. I don't believe that for a second.
I want to live in a future where Artificial Intelligences can relieve humans of the drudgery of labour. But I don't want to live in a future which is built by ripping people off against their will.