7 comments

  • nicbou 24 days ago

    They already started with the assumption of consent, crawled the web with disregard for resource use, and still provide no mechanism to revoke permission. This is the culture around AI. A quiet little tag that says "please don't do that" won't do much.

    These companies are already behaving like jerks. Do you think they will become more polite once they control how we access information? With investors breathing down their necks?

  • Ukv 24 days ago

    Of the signals used to indicate crawling is prohibited, robots.txt is probably the most effective; OpenAI, Google, Anthropic, Meta, and CommonCrawl all claim to respect it. That often provokes a response of "well they're lying", but I've yet to actually find any cases of the IPs they use for crawling accessing content prohibited by robots.txt.
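
    As a quick sketch of what "respecting robots.txt" means in practice, the stdlib can check whether a given crawler is allowed to fetch a URL. The user-agent strings (GPTBot, CCBot) are the ones those vendors publicly document; the robots.txt content and URL here are made up for illustration.

    ```python
    from urllib.robotparser import RobotFileParser

    # Hypothetical robots.txt blocking two documented AI-training
    # crawlers while leaving everything else open.
    robots_txt = """\
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: *
    Allow: /
    """

    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # A compliant crawler checks before fetching.
    print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
    print(parser.can_fetch("SomeBrowser", "https://example.com/article"))  # True
    ```

    Of course, this only constrains crawlers that voluntarily run a check like this; robots.txt itself has no enforcement mechanism.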

    Newly proposed standards will probably take a while to catch on, if they ever do.

    Not a lawyer, but I believe such measures could in theory become legally enforceable in the US without any new legislation if the fair use defense fails but an implied license defense (the reason you can cache/rehost copies of webpages that don't have a <noarchive> meta tag, as in Field v. Google Inc) succeeds.

  • zzo38computer 24 days ago

    I do not want others to scrape my files from my server for the purpose of training LLMs, but if they acquire a copy by other means, or already have one for other reasons, then they already have it and can do what they want with it.

    I do not care about attribution; what I do care about is that they do not claim additional restrictions in their terms of use when they copy my stuff and use it.

  • abhisek 24 days ago

    I am not sure how this is any different from open source code being embedded in commercial applications. It’s really like a self-accelerating loop.

    At least for OSS, usage defines value. When an OSS project is popular, enterprises notice it and begin to use it in their commercial applications.

    • alissa_v 24 days ago

      I agree with your point about usage defining value in OSS - popular projects gain recognition, contributions, and opportunities through their adoption in commercial applications.

      The critical difference, though, is consent. OSS creators explicitly choose licenses permitting commercial use - they opt in to sharing their work. Many content creators never made such a choice for AI training.

      The current AI training paradigm doesn't even have a true opt-out model - it simply assumes everything is available. The noAI tags are attempting to create an opt-out mechanism where none previously existed. Without enforcement or standards adoption, though, these signals don't seem to have the same weight as established open source licenses.
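
      To make the opt-out idea concrete: the proposed noAI tags are usually expressed as a robots meta tag with a "noai" directive. This is a minimal sketch of how a cooperating crawler might detect it; the `NoAIMetaDetector` class and the sample page are my own illustration, and "noai"/"noimageai" are an emerging convention, not a ratified standard.

      ```python
      from html.parser import HTMLParser

      class NoAIMetaDetector(HTMLParser):
          """Flags pages carrying a robots meta tag with a 'noai' directive."""

          def __init__(self):
              super().__init__()
              self.no_ai = False

          def handle_starttag(self, tag, attrs):
              if tag != "meta":
                  return
              attrs = dict(attrs)
              if attrs.get("name", "").lower() == "robots":
                  directives = {d.strip().lower()
                                for d in attrs.get("content", "").split(",")}
                  if "noai" in directives:
                      self.no_ai = True

      page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
      detector = NoAIMetaDetector()
      detector.feed(page)
      print(detector.no_ai)  # True
      ```

      The catch is exactly the consent problem above: the tag only works if the crawler chooses to look for it.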

      There's also a significant difference in attribution. OSS creators receive clear attribution even when their work is used commercially. For creators whose work trains AI models, their contribution is blended and anonymized with no recognition pathway.

      The core question is whether creating this opt-out approach is sufficient, or if AI training should move toward an opt-in model more similar to how open source licensing works.

  • BobbyTables2 24 days ago

    No

    • alissa_v 24 days ago

      Haha fair enough! Any particular reason why you think they won't be respected?