Complete Guide to Robots.txt File and SEO Protocols

A robots.txt file is a plain text file that webmasters create to instruct search engine robots which URLs on a website they may crawl. Its main purpose is to manage the site's crawl budget.

    What is a robots.txt file used for?

A robots.txt file is used to manage crawler traffic to the site and, in some cases, to keep a file out of Google. Its effect depends on the file type:

Web page

• Manages your website's crawl budget
• Avoids crawling unimportant or similar pages on your site
• Blocked pages can still be indexed and appear in search results, but without a description

Note – If other pages link to your page with descriptive anchor text, Google can still index the page without visiting it.

Digital assets (image, video, audio)

• Prevents image, video, and audio files from appearing in Google search results
• Does not prevent other pages or users from linking to your image, video, or audio file

    Robots.txt file limitations:

    Depending on your goals, consider other mechanisms to ensure your URLs are not findable on the web.

    • Robots.txt directives may not be supported by all search engines
    • Search engine crawlers may interpret syntax differently.
    • A page that is disallowed in robots.txt can still be indexed if it is linked to from other sites.

    Steps to create a perfect Robots.txt file:

A robots.txt file must live at the root of the host it applies to. For instance, if the site URL is www.abc.com, the robots.txt file lives at www.abc.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard and consists of one or more rules. Each rule allows or blocks access for a given search engine crawler to a specified file path on that website. Unless rules in the robots.txt file say otherwise, all files on the site are implicitly allowed for crawling.

A simple example of a robots.txt file with a few rules:

    User-agent: Googlebot
    Disallow: /nogooglebot/

    User-agent: *
    Allow: /

    Sitemap: http://www.abc.com/sitemap.xml

    This file means:

    1. The user agent named Googlebot cannot crawl any URL that starts with www.abc.com/nogooglebot/.
    2. All other user agents can crawl the entire site. By default, user agents are allowed to crawl the entire website even if this rule isn’t added.
    3. The site’s sitemap file is located at http://www.abc.com/sitemap.xml.
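These three points can be sanity-checked with Python's standard-library robots.txt parser, urllib.robotparser; a minimal sketch:

```python
# Verify the example file with the standard-library Robots Exclusion parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /nogooglebot/

User-agent: *
Allow: /

Sitemap: http://www.abc.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# 1. Googlebot is blocked from /nogooglebot/ but allowed elsewhere.
print(rp.can_fetch("Googlebot", "http://www.abc.com/nogooglebot/page"))  # False
print(rp.can_fetch("Googlebot", "http://www.abc.com/home"))              # True

# 2. Every other crawler may fetch anything.
print(rp.can_fetch("SomeOtherBot", "http://www.abc.com/nogooglebot/page"))  # True
```

Note that urllib.robotparser implements the generic Robots Exclusion Standard; it does not reproduce Google-specific behaviour such as the AdsBot exception to the * user agent.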

    Format and Location Rules:

    • The file name must be robots.txt
    • A site can have only one robots.txt file
    • The robots.txt file must be placed at the root of the website host to which it applies. For instance, for the site www.abc.com it lives at http://www.abc.com/robots.txt. It cannot be placed in a subdirectory, e.g. http://www.abc.com/page/robots.txt
    • The file can apply to a subdomain (http://web.abc.com/robots.txt) or to non-standard ports (http://www.abc.com:8181/robots.txt)
    • Google may ignore characters that are not part of the UTF-8 range, potentially rendering robots.txt rules invalid
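Since the file's location is fully determined by the host, the governing robots.txt URL can be derived mechanically. A minimal sketch, using a hypothetical helper name:

```python
# Derive the robots.txt URL that governs a given page URL: keep the scheme,
# host, and port; drop the path, query, and fragment.
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.abc.com/page/article.html"))
# http://www.abc.com/robots.txt -- never in a subdirectory
print(robots_txt_url("http://web.abc.com/dir/index.html"))
# http://web.abc.com/robots.txt -- a subdomain has its own file
print(robots_txt_url("http://www.abc.com:8181/page"))
# http://www.abc.com:8181/robots.txt -- a non-standard port counts as its own host
```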
    Adding rules to the robots.txt file

Instructions tell search engine crawlers which parts of your site they can crawl. Follow these guidelines when adding rules to your robots.txt file:

    • A robots.txt file consists of one or more groups
    • Each group consists of multiple rules or directives, one directive per line. Each group begins with a User-agent line that specifies which crawler the group applies to. For example:

    User-agent: *
    Allow: /
    Disallow: /search/

    • A group gives the following information:
      • Who the group applies to (the user agent)
      • Which directories or files that user agent can access
      • Which directories or files that user agent cannot access

    • Search engine crawlers process groups from top to bottom
    • By default, the user agent can crawl any page or directory not blocked by a disallow rule
    • Rule paths are case-sensitive. For example, Disallow: /login.asp applies to https://www.abc.com/login.asp, but not to https://www.abc.com/LOGIN.asp

    The # character marks the start of a comment.

    Google’s crawlers support the following directives in robots.txt files:

    User-agent:

    Required, one or more per group. This directive identifies the crawler the group's rules apply to and marks the start of a group of directives. An asterisk (*) matches all search engine crawlers except for the various AdsBot crawlers, which must be named explicitly.

    For example:
    # Example 1: Block all but AdsBot crawlers
    User-agent: *
    Disallow: /

    # Example 2: Block only Googlebot
    User-agent: Googlebot
    Disallow: /

    # Example 3: Block Googlebot and Adsbot
    User-agent: Googlebot
    User-agent: AdsBot-Google
    Disallow: /

    disallow:
    At least one disallow or allow entry per rule. A directory or page, relative to the root domain, that the user agent should not crawl. If the rule refers to a page, it must be the full page path as shown in the browser. If it refers to a directory, it must start with a / character and end with the / mark. For example:

    #Disallow: Page and directory

    User-agent: *
    Disallow: /login.html
    Disallow: /search/
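The difference between the page rule and the directory rule can be checked with Python's standard-library urllib.robotparser; a minimal sketch:

```python
# A page rule blocks that path; a directory rule blocks everything under it.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /login.html
Disallow: /search/
""".splitlines())

print(rp.can_fetch("*", "https://www.abc.com/login.html"))    # False: the page itself
print(rp.can_fetch("*", "https://www.abc.com/search/query"))  # False: under /search/
print(rp.can_fetch("*", "https://www.abc.com/home"))          # True: not covered by any rule
```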
    allow:

    At least one disallow or allow entry per rule. A directory or page, relative to the root domain, that the user agent just mentioned may crawl. An allow rule is used to override a disallow directive and permit crawling of a subdirectory or page inside a disallowed directory.

    For a single page, specify the full page path. For a directory, end the rule with a forward slash ‘/’ mark.

    For example:

    #Allow: Page and directory, and override
    User-agent: *
    Allow: /login.html
    Allow: /loan/
    Disallow: /search/
    Allow: /search/
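Google resolves conflicts like Disallow: /search/ vs. Allow: /search/ by taking the most specific (longest) matching rule, with allow winning exact ties. A minimal sketch of that precedence, using a hypothetical helper and literal path prefixes only (no * or $ wildcards):

```python
# Pick the longest matching rule; on a tie, the allow rule wins.
def is_allowed(path, allows, disallows):
    rules = [(r, True) for r in allows] + [(r, False) for r in disallows]
    best_len, verdict = -1, True  # crawling is allowed by default
    for rule, allowed in rules:
        if path.startswith(rule):
            if len(rule) > best_len or (len(rule) == best_len and allowed):
                best_len, verdict = len(rule), allowed
    return verdict

allows = ["/login.html", "/loan/", "/search/"]
disallows = ["/search/"]
print(is_allowed("/search/results", allows, disallows))  # True: Allow ties with Disallow and wins
print(is_allowed("/private/data", [], ["/private/"]))    # False: only a disallow matches
```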

    sitemap:

One or more XML sitemaps can be declared in robots.txt. A sitemap is an effective way to indicate which content Google should crawl, as opposed to which content it can or cannot crawl. For example:

    Sitemap: https://www.abc.com/sitemap-images.xml

    Sitemap: https://www.abc.com/sitemap-index.xml
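Sitemap lines are independent of the user-agent groups, so a tool can pull them out with a simple line scan. A minimal sketch, using a hypothetical helper name:

```python
# Collect every Sitemap: declaration from a robots.txt body.
def extract_sitemaps(robots_txt):
    sitemaps = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.lower().startswith("sitemap:"):
            sitemaps.append(line.split(":", 1)[1].strip())
    return sitemaps

robots = """\
User-agent: *
Disallow: /search/

Sitemap: https://www.abc.com/sitemap-images.xml
Sitemap: https://www.abc.com/sitemap-index.xml
"""
print(extract_sitemaps(robots))
# ['https://www.abc.com/sitemap-images.xml', 'https://www.abc.com/sitemap-index.xml']
```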

    Upload the robots.txt file

Once the robots.txt file is created, make it available to web crawlers by uploading it to the root of your site. How you do this depends on your site and server architecture; contact your hosting company if you are unsure.

    Test robots.txt markup

To test whether your uploaded robots.txt file is publicly accessible, open a private browsing window and navigate to the file's location, e.g. https://www.abc.com/robots.txt

    Google offers testing robots.txt in Search Console. You can only use this tool for robots.txt files that are already accessible on your site.

    Submit robots.txt file to Google

Google’s crawlers automatically find and start using your robots.txt file once it is uploaded to the domain root folder. No action is required here.

If you update the robots.txt file, it is essential that you refresh Google’s cached copy.

    Useful robots.txt rules

    Here are a few common useful robots.txt rules:

    Disallow crawling of the entire website:

    User-agent: *
    Disallow: /

    URLs from the website may still be indexed, even if they have not been crawled.

    Disallow crawling of an entire site, but allow Mediapartners-Google:

    User-agent: *
    Disallow: /

    User-agent: Mediapartners-Google
    Allow: /

    This hides your pages from search results, but the Mediapartners-Google crawler can still examine them to decide which ads to show visitors on your site.

    Disallow crawling of a directory and its contents:

    User-agent: *
    Disallow: /search/
    Disallow: /about-us/
    Disallow: /about-us/archive/

    Add a forward slash to the directory name to disallow crawling of the whole directory. Disallow rules match from the start of the URL path, so Disallow: /about-us/ matches https://abc.com/about-us/ and everything under it, but not https://abc.com/why-us/other/about-us/. To match the string anywhere in the path, use Disallow: /*about-us/ instead.

    Allow access to a single crawler:

    User-agent: Googlebot-News
    Allow: /

    User-agent: *
    Disallow: /

    Only Googlebot-News can crawl the entire site.

    Allow access to all crawlers except one:

    User-agent: Unnecessarybot
    Disallow: /

    User-agent: *
    Allow: /

    All bots can crawl the site except Unnecessarybot.

    Disallow crawling of a single web page:

    User-agent: *
    Disallow: /login_file.html

    Bots can crawl all web pages on your site except the login_file.html page.

    Disallow crawling of files of a specific type or containing a specific string:

    User-agent: Googlebot
    Disallow: /*.gif$

    User-agent: *
    Disallow: /*loan/

    Adding * before a string or file extension disallows paths containing that string or file type. Here, Googlebot cannot crawl any .gif file, and no crawler can crawl paths containing loan/.

    Block all images on your site from Google Images:

    User-agent: Googlebot-Image
    Disallow: /

    Google cannot index images and videos without crawling them.

    Block a specific image from Google Images:

    User-agent: Googlebot-Image
    Disallow: /images/logo.jpg

    This disallows crawling of the logo.jpg file.

    Use $ to match URLs that end with a specific string:

    User-agent: Googlebot
    Disallow: /*.pdf$

    This disallows crawling of all PDF files.
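The * and $ wildcards used above can be modelled by translating a robots.txt path pattern into a regular expression: * matches any run of characters, and $ anchors the match to the end of the path. A minimal sketch, using a hypothetical helper name:

```python
# Translate a robots.txt path pattern (with * and $) into a regex and test it.
import re

def pattern_matches(pattern, path):
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn the escaped \* back into .*
    regex = "^" + re.escape(pattern).replace(r"\*", ".*") + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(pattern_matches("/*.gif$", "/images/photo.gif"))    # True: ends in .gif
print(pattern_matches("/*.gif$", "/images/photo.gif2"))   # False: $ anchors the end
print(pattern_matches("/*loan/", "/home/loan/rates"))     # True: loan/ appears in the path
```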


Learn how Google interprets the robots.txt specification in the upcoming chapters.