A new web crawler launched by Meta last month secretly searches the internet for AI training data

Meta has quietly launched a new web crawler that scours the internet and collects data en masse to feed its AI models.

The crawler, called Meta External Agent, was introduced last month, according to three companies that track web scrapers and bots across the internet. The automated bot essentially copies, or “scrapes,” all of the data publicly displayed on websites, such as the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers website owners a tool to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which crawls the web for AI training data. Two other companies involved in tracking web scrapers confirmed the bot’s existence and its use to collect AI training data.

Meta, the parent company of Facebook, Instagram and WhatsApp, updated a corporate website for developers in late July with a tab announcing the existence of the new scraper, according to a version history found on the Internet Archive. Aside from updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although that crawler – called Facebook External Hit – “has been used for various purposes over time, such as sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson said. “We recently updated our guidelines that show publishers how to best exclude their domains from being crawled by Meta’s AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits from artists, writers, and others who claim AI companies have used their content and intellectual property without their consent. Some AI companies, such as OpenAI and Perplexity, have struck deals in recent months to pay content providers for access to their data (Fortune was one of several news providers that announced a revenue-sharing agreement with Perplexity in July).

Flying under the radar

While nearly 25% of the world’s most popular websites now block GPTBot, only 2% block Meta’s new bot, according to data from Dark Visitors.

For a website to attempt to block a web scraper, it must publish robots.txt, a plain-text file placed at the root of the site that tells crawlers which parts of the site to stay out of. However, the specific name of a scraper bot typically must also be listed in the file for that bot to be excluded, which is difficult if the name has not been publicly disclosed. A scraper bot operator can also simply choose to ignore robots.txt – it is not enforceable or legally binding in any way.
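To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might consult robots.txt rules, using the standard library's urllib.robotparser. The robots.txt content and the example.com URL are hypothetical, and the user-agent tokens, in particular "meta-externalagent" for Meta External Agent, are assumptions for illustration that should be checked against each operator's own documentation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only. The user-agent
# tokens are assumptions; confirm them in each operator's documentation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/article.html"  # hypothetical page
    for agent in ("GPTBot", "meta-externalagent", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

In this sketch only the two bots named in the file are turned away. A crawler whose name the site owner does not know falls through to the catch-all rule and is allowed, which is the gap described above. The check is also purely advisory: it only matters if the crawler chooses to run it.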

Such scrapers are used to pull large amounts of data and written text from the internet to serve as training data for generative AI models, including large language models (LLMs), and related tools. Meta’s Llama is one of the largest LLMs available and powers products such as Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data for the latest version of the model, Llama 3, the first version of the model used large datasets compiled by other sources, such as Common Crawl.

Earlier this year, Mark Zuckerberg, co-founder and longtime CEO of Meta, boasted on a conference call that his company’s social platforms had amassed a dataset for training artificial intelligence that was even “larger than that of Common Crawl,” an entity that has been crawling around 3 billion web pages every month since 2011.

However, the existence of the new crawler suggests that Meta’s vast trove of data may no longer be enough as the company continues to work on updating Llama and expanding Meta AI. LLMs typically require new and high-quality training data to further improve their functionality. Meta is expected to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

Are you a Meta contributor or someone who wants to share insights or tips? Contact Kali Hays securely via Signal at +1-949-280-0267 or [email protected].
