
For three decades, a tiny text file has kept the internet from chaos. This text file has no particular legal or technical authority, and it’s not even particularly complicated. It represents a handshake deal between some of the earliest pioneers of the internet to respect each other’s wishes and build the internet in a way that benefitted everybody. It’s a mini constitution for the internet, written in code.
It’s called robots.txt and is usually located at yourwebsite.com/robots.txt. That file allows anyone who runs a website — big or small, cooking blog or multinational corporation — to tell the web who’s allowed in and who isn’t. Which search engines can index your site? What archival projects can grab a version of your page and save it? Can competitors keep tabs on your pages for their own files? You get to decide and declare that to the web.
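If you’ve never looked at one, the file itself is strikingly plain. A hypothetical example (the bot names and paths here are invented for illustration) might read something like this:
    User-agent: Googlebot
    Disallow:

    User-agent: ExampleScraperBot
    Disallow: /

    User-agent: *
    Disallow: /private/
Each User-agent line names a robot, and the Disallow lines beneath it say which parts of the site that robot should stay out of: an empty Disallow means come on in, a lone slash means stay out entirely, and the asterisk covers any robot not named elsewhere.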
It’s not a perfect system, but it works. Used to, anyway. For decades, the main focus of robots.txt was on search engines; you’d let them scrape your site and in exchange, they’d promise to send people back to you. Now AI has changed the equation: companies around the web are using your site and its data to build massive sets of training data, in order to build models and products that may not acknowledge your existence at all.
The robots.txt file governs a give and take; AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.
In the early days of the internet, robots went by many names: spiders, crawlers, worms, WebAnts, web crawlers. Most of the time, they were built with good intentions. Usually, it was a developer trying to build a directory of cool new websites, make sure their own site was working properly, or build a research database — this was 1993 or so, long before search engines were everywhere and in the days when you could fit most of the internet on your computer’s hard drive.
The only real problem then was the traffic: accessing the internet was slow and expensive, both for the person seeing a website and the one hosting it. If you hosted your website on your computer, as many people did, or on hastily constructed server software run through your home internet connection, all it took was a few robots overzealously downloading your pages for things to break and the phone bill to spike.
Over the course of a few months in 1994, a software engineer and developer named Martijn Koster, along with a group of other web administrators and developers, came up with a solution they called the Robots Exclusion Protocol. The proposal was straightforward enough: it asked web developers to add a plain-text file to their domain specifying which robots were not allowed to scour their site, or listing pages that are off limits to all robots. (Again, this was a time when you could maintain a list of every single robot in existence — Koster and a few others helpfully did just that.) For robot makers, the deal was even simpler: respect the wishes of the text file.
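On the robot maker’s side, keeping that promise takes only a few lines of code. Here is a minimal sketch in Python, using the standard library’s urllib.robotparser; the site address and crawler name are placeholders rather than anything a real crawler is known to use:
    # A minimal sketch of a well-behaved crawler checking robots.txt
    # before fetching a page. The URL and user-agent are placeholders.
    from urllib.robotparser import RobotFileParser

    robots = RobotFileParser()
    robots.set_url("https://example.com/robots.txt")
    robots.read()  # download and parse the site's robots.txt

    page = "https://example.com/private/notes.html"
    if robots.can_fetch("ExampleCrawlerBot", page):
        print("Allowed: fetch the page")
    else:
        print("Disallowed: respect the file and move on")
Nothing enforces that check; the whole arrangement depends on crawlers choosing to make it.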
Illustration by Erik Carter