Following on from Ben’s post on Technical SEO, I thought that in this month’s blog post I’d look at search engine robots and how to control them. I’m sure that even the SEO newbies reading this will have some idea of how sites are crawled and indexed, but for the purposes of this post I’ll begin with a brief description of how that all goes down.

Robots and Spiders and Bots, Oh My!

Search engines find content on websites by crawling them. This is done by sending in robots/spiders/bots etc. that crawl the site and follow links within the pages looking for new content to index. The spiders read the content on each page and attach it to a URL. The information is indexed and, hey presto, you are now able to search for this content via the search engines.
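
To make that a bit more concrete, here is a toy sketch in Python of the crawl-and-index loop. It is only an illustration: the PAGES dictionary is a made-up, in-memory stand-in for a website, and the PageParser and crawl names are my own, not anything a real search engine uses.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "website": URL path -> HTML.
# A real crawler would fetch these pages over HTTP instead.
PAGES = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a> Welcome home',
    "/about": '<a href="/">Home</a> About our team',
    "/blog": '<a href="/blog/robots">Robots</a> Latest posts',
    "/blog/robots": "Controlling search engine robots",
}


class PageParser(HTMLParser):
    """Collects the links and the visible text of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # Remember every link target so the crawler can follow it later.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def crawl(start):
    """Breadth-first crawl from `start`; returns {url: page text}."""
    index = {}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if url in index or url not in PAGES:
            continue  # already indexed, or not a page on this toy site
        parser = PageParser()
        parser.feed(PAGES[url])
        parser.close()
        index[url] = " ".join(parser.text)  # attach the content to its URL
        queue.extend(parser.links)          # follow links to find new content
    return index


index = crawl("/")
```

Starting from the home page, the crawler reaches /blog/robots purely by following links, which is why a page that nothing links to tends to stay undiscovered.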

Now, this is all great, and a major part of SEO is optimizing websites so that search engine robots can easily crawl pages and index content. But what if we have something on our website that, for some reason, we don’t want the search engines to be able to read?

There are a number of reasons why you might want this, and many different ways to hide content from search engines. Rand Fishkin has a post on SEOmoz with a rundown of these. Today, though, I’m going to talk about controlling robots, specifically with the robots.txt file and the meta robots tag.

Robots.txt

Robots.txt is a file placed in the root directory of a website. It lists the pages and folders that you don’t want search engine robots to crawl.

Robots.txt is great for keeping robots away from pages: compliant crawlers won’t fetch anything it disallows, so the content of those pages won’t be indexed. It can’t, however, prevent URLs that are linked from other pages on the web from appearing in the index. If there is no robots.txt file then it is assumed that robots can access any area of the site.

Creating a robots.txt file is fairly straightforward: you define the user-agent (the robot you are targeting, Googlebot for example) and the folders that robot is not allowed to crawl. Below is an example of a robots.txt file.

User-Agent: Googlebot
Disallow: /assets/
Disallow: /logs/

Sitemap: https://www.yoursitehere.com/sitemap.xml

The “Disallow: /assets/” line will prevent Googlebot, the user-agent named above it, from crawling everything contained within the /assets/ folder on your website.

Note the sitemap addition. It can be added to the same file so that search engines can locate your sitemap easily. If your website has multiple sitemaps you can list them one after another, each on its own Sitemap line.
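
If you’d like to check how a file like this is read, Python’s standard library includes a robots.txt parser, urllib.robotparser. The sketch below feeds it the example file from above and asks a few questions. The test URLs and the “Bingbot” agent are just illustrations, site_maps() needs Python 3.8 or later, and real search engines may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example file from this post, as a string.
ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow: /assets/
Disallow: /logs/

Sitemap: https://www.yoursitehere.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://www.yoursitehere.com"

# Googlebot is named in the file, so its Disallow rules apply to it.
googlebot_assets = parser.can_fetch("Googlebot", base + "/assets/logo.png")
googlebot_blog = parser.can_fetch("Googlebot", base + "/blog/")

# No rule group mentions this robot, so nothing is blocked for it.
otherbot_assets = parser.can_fetch("Bingbot", base + "/assets/logo.png")

# Sitemap lines are collected separately from the crawl rules.
sitemaps = parser.site_maps()

print(googlebot_assets, googlebot_blog, otherbot_assets, sitemaps)
```

Here Googlebot is refused /assets/logo.png but allowed /blog/, while the robot that the file never mentions is allowed everywhere.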

Tips for Using Robots.txt

The user-agent can be defined as ‘*’ which indicates all robots as opposed to a specific one.

If you want to block search engine spiders from your whole website, enter ‘Disallow: /’. This marks every URL on the site as off limits to the crawlers named in the user-agent line.
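
Both tips can be tried out with Python’s built-in urllib.robotparser module. This is a rough sketch rather than a definitive test: the is_allowed helper, the example.com URLs and the “SomeOtherBot” name are all made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# '*' targets every robot, and 'Disallow: /' covers the whole site.
BLOCK_EVERYTHING = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, agent, url):
    """Hypothetical helper: may `agent` crawl `url` under these rules?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


home_for_googlebot = is_allowed(
    BLOCK_EVERYTHING, "Googlebot", "https://example.com/"
)
page_for_other_bot = is_allowed(
    BLOCK_EVERYTHING, "SomeOtherBot", "https://example.com/blog/post"
)

print(home_for_googlebot, page_for_other_bot)
```

Both checks come back False, because the wildcard user-agent matches every robot and the single slash matches every path.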

RobotsTxt.org has some great resources such as a robots database with a list of robots and a robots.txt checker to check your file and meta tags.

Meta Robots

The meta robots tag is a line of code which is entered into the <head> of a webpage. It’s relatively simple to implement and also allows you to remove content from the index. It works well for blocking content on specific pages but can be harder to put into place on a larger scale.

Adding the meta tag into your HTML is straightforward. It consists of the tag, the name (“ROBOTS”) and the content (the commands). Google suggests that meta commands are all placed within one meta tag. This helps to make them ‘easy to read and reduces chance for conflicts’. Below is an example of the general form, here combining two of the commands.

<meta name="ROBOTS" content="NOINDEX, NOFOLLOW">

The following are the commands and their definitions.

noindex: This page shouldn’t be put in the index or should be removed from the index.

nofollow: The links on this page shouldn’t be followed.

nosnippet: A snippet of the page or a cached version shouldn’t be shown in the search results.

noarchive: A cached version of the page shouldn’t be shown in the search results.

noodp: The title and description from the Open Directory Project shouldn’t be used in the search results.

none: This command is equivalent to “noindex, nofollow”.
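
To check which of these commands a page actually carries, a short script can read them out of the HTML. The following is a minimal sketch using Python’s built-in html.parser. The MetaRobotsParser class and robots_directives function are names I’ve made up for the example, and it only looks at the generic “robots” name, not at tags aimed at a specific crawler.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects commands from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # The name is case-insensitive: "ROBOTS" and "robots" both count.
        if (attrs.get("name") or "").lower() != "robots":
            return
        for directive in (attrs.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive == "none":
                # "none" is shorthand for "noindex, nofollow".
                self.directives.update({"noindex", "nofollow"})
            elif directive:
                self.directives.add(directive)


def robots_directives(html):
    """Return the set of meta robots commands found in `html`."""
    parser = MetaRobotsParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = (
    '<html><head><meta name="ROBOTS" content="NOINDEX, NOARCHIVE">'
    "</head><body>Hello</body></html>"
)
found = robots_directives(page)
print(sorted(found))
```

Commands are lower-cased and gathered into a set, so the order they were written in and any repeats across several meta tags don’t matter.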

Robots.txt, along with meta robots, is one of the most common ways of controlling what content on a site is crawled, and hopefully this post on robot basics is a helpful introduction to this aspect of Technical SEO.

Let’s Get Technical: Robot Basics

Boom Online
