Blocking AI Crawlers
AI crawlers have become a common presence across the web, scanning websites and collecting large volumes of content. While some serve legitimate purposes, others operate in ways that are overly aggressive, consuming resources and capturing information that site owners may want to keep private. To maintain control and protect valuable content, it is important to have effective measures in place for managing and blocking unwanted AI crawler access.
Prerequisites
- Contentstack account
- Access to Launch for your organization
Launch provides two ways to help you control access by AI crawlers:
- Using a robots.txt file.
- Using Contentstack Launch Edge Functions to block crawlers at runtime.
- If you want to disallow all web crawlers, including non-AI crawlers, for a specific domain, you can add the X-Robots-Tag header (for example, "noindex, nofollow") to your responses. See this example of how to implement it in your Launch project using Launch Edge Functions; a rough sketch also follows these notes.
- Some bots may not follow the rules and continue crawling your site.
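As a rough illustration, the edge function sketch below fetches the origin response and attaches an X-Robots-Tag header before returning it. The "noindex, nofollow" value is only an example; adjust the value, and which paths it applies to, to match your indexing policy.

// Sketch: attach an X-Robots-Tag header to responses at the edge.
// The header value here is an example; adapt it to your own policy.
export default async function handler(request) {
  const originResponse = await fetch(request);

  // Copy the origin response so the header can be added before it is returned.
  const response = new Response(originResponse.body, originResponse);
  response.headers.set('X-Robots-Tag', 'noindex, nofollow');

  return response;
}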
Using robots.txt to Disallow AI Crawlers
The robots.txt file provides crawl instructions for compliant bots. You can use it to disallow specific User-Agent strings from accessing certain parts of your site.
Here’s a sample robots.txt to block common AI crawlers:
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: ai-crawler
Disallow: /
User-agent: *
Disallow: /private-directory/
Note: Some bots may ignore the robots.txt file and continue crawling your site. To strictly block AI crawlers, you can enforce the restriction at the edge using Launch Edge Functions.
How to Add robots.txt in Next.js
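If your site is a Next.js application, there are two common options: place a static robots.txt file in the public directory (Next.js serves it from the site root), or generate it from code with the App Router's Metadata API. The sketch below assumes the App Router (Next.js 13.3 or later) and mirrors a few of the sample rules above; extend the rules array with the remaining User-Agents you want to disallow.

// app/robots.js: Next.js generates /robots.txt from this file (App Router).
export default function robots() {
  return {
    rules: [
      { userAgent: 'GPTBot', disallow: '/' },
      { userAgent: 'ClaudeBot', disallow: '/' },
      { userAgent: 'CCBot', disallow: '/' },
      // Add the remaining AI crawler User-Agents from the sample above.
      { userAgent: '*', disallow: '/private-directory/' },
    ],
  };
}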
Blocking AI Crawlers Using Launch Edge Functions
Because bots can ignore the robots.txt file, you can use Launch Edge Functions to detect and block suspicious User-Agent strings in real time.
Example Launch Edge Function
const KNOWN_BOTS = [
  'claudebot',
  'gptbot',
  'googlebot',
  'bingbot',
  'ahrefsbot',
  'yandexbot',
  'semrushbot',
  'mj12bot',
  'facebookexternalhit',
  'twitterbot',
  // More bots can be added here.
];

export default function handler(request) {
  const userAgent = (request.headers.get('user-agent') || '').toLowerCase();
  const isBot = KNOWN_BOTS.some((bot) => userAgent.includes(bot));

  if (isBot) {
    return new Response('Forbidden: AI crawlers are not allowed.', { status: 403 });
  }

  return fetch(request);
}
- Some bots may spoof their identity by faking their User-Agent string. You can refer to the Contentstack example to try this out. You can customize the KNOWN_BOTS list by referring to the verified bots directory or the AI Crawler Bot Metrics GitHub Repository, both of which help you identify and keep relevant AI crawler User-Agents up to date for your use case.
- Whenever you add a User-Agent to the KNOWN_BOTS list, make sure to add it in lowercase, because the handler lowercases the incoming User-Agent header before matching.
Deployment Instructions on Contentstack Launch
- Refer to the Launch Edge Functions documentation for setup.
- Add the edge function to your project’s edge runtime entry point.
- Deploy using your Launch pipeline. The edge function will begin filtering requests before they reach your backend or frontend.
Best Practices & Recommendations
- Regularly update the list of known AI crawler User-Agents to ensure ongoing effectiveness.
- Use a robots.txt file to guide compliant bots, and use edge functions to block non-compliant or deceptive bots.
- Review server logs and User-Agent headers periodically to refine detection and blocking rules; a sketch of how blocked requests could be logged follows this list.
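To support that review, one option is to log the User-Agent of each request the edge function blocks. The sketch below extends the earlier handler with a console.log call and reuses the KNOWN_BOTS list defined above; whether and where these logs are surfaced depends on your Launch project's logging setup, so treat that as an assumption to verify.

// Sketch: the same handler as above, extended to log blocked User-Agents.
// Reuses the KNOWN_BOTS list defined in the earlier example; log visibility
// depends on your Launch project's logging setup.
export default function handler(request) {
  const userAgent = (request.headers.get('user-agent') || '').toLowerCase();

  if (KNOWN_BOTS.some((bot) => userAgent.includes(bot))) {
    // Record which crawler was blocked so the KNOWN_BOTS list can be refined later.
    console.log(`Blocked crawler request: ${userAgent}`);
    return new Response('Forbidden: AI crawlers are not allowed.', { status: 403 });
  }

  return fetch(request);
}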
Conclusion
While the robots.txt file helps communicate your site's crawling preferences, runtime protections like Launch Edge Functions offer more reliable control, especially in an AI-driven environment where bot behavior is increasingly unpredictable.