Most frequent users and visitors rely on the various publicly accessible search engines to find the information they need. But how is this information supplied by the search engines, and where do they collect it from? Fundamentally, most search engines maintain their own database of information. This database covers the sites available on the web and, in turn, holds details about the pages of every available website. Behind the scenes the search engines use robots to gather this information and maintain the database; they build an index of the gathered data and then present it publicly, or at times for private use.

In this article we will discuss these entities that roam the global web environment, the web crawlers that move around in netspace: what they are and what purpose they serve, the pros and cons of using them, how we can keep our pages away from crawlers, and the differences between ordinary crawlers and robots. The discussion is divided into two sections:

I. Search Engine Spider: Robots.txt
II. Search Engine Robots: Meta-tags Explained

I. Search Engine Spider: Robots.txt

What is the robots.txt file?
A web robot is a program or piece of search engine software that visits sites regularly and automatically and crawls through the web's hypertext structure by fetching a document and recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages crawled by web robots, and they can exclude some pages from crawling by addressing specific user agents. Most robots abide by the Robots Exclusion Standard, a set of conventions that restricts robot behaviour.

The Robots Exclusion Standard is a protocol used by the site administrator to control the movement of robots. When a search engine robot visits a site, it first looks for a file named robots.txt in the root of the domain. This is a plain text file that implements the Robots Exclusion Protocol by allowing or disallowing specific files within the site's directories. The site administrator can, for example, deny access to cgi, temporary or private directories, and can do so per robot by naming robot user agents.

The format of the robots.txt file is very simple. It consists of two kinds of field: a User-agent field and one or more Disallow fields.

What is User-agent?
This is the name used in the robots.txt file to address a specific search engine robot. For example:

User-agent: googlebot

We can also use the wildcard character * to address all robots:

User-agent: *

which means the record applies to every robot that visits.

What is Disallow?
The second field in the robots.txt file is Disallow:. These lines tell the robots which files may be crawled and which may not. For example, to prevent email.htm from being downloaded the syntax is:

Disallow: /email.htm

To stop crawling through a directory the syntax is:

Disallow: /cgi-bin/

White space and comments:
Any line in the robots.txt file that starts with # is treated as a comment. A comment at the beginning of the file is commonly used to note which site the rules belong to, for example:

# robots.txt for www.anydomain.com
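To make these two fields concrete, here is a minimal sketch (not part of the original article) that feeds an equivalent set of rules to Python's standard urllib.robotparser module and asks which URLs a robot may fetch; the rules and paths are illustrative only.

from urllib import robotparser

# Rules equivalent to a simple robots.txt: one record addressing every robot,
# with the cgi-bin directory and one page disallowed.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /email.htm",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("googlebot", "/index.html"))       # True: not disallowed
print(rp.can_fetch("googlebot", "/cgi-bin/form.cgi"))  # False: under /cgi-bin/
print(rp.can_fetch("googlebot", "/email.htm"))         # False: disallowed explicitly

A real crawler would normally call set_url() and read() to download robots.txt from the site's root rather than passing the lines in directly.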
Example entries for robots.txt:

1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field addresses all robots. Since nothing is disallowed, every robot is free to crawl everything.

2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots may crawl every file except those in the cgi-bin, temp and private directories.

3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl any of the directories; / stands for all directories.

4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
The blank line marks the beginning of a new User-agent record. Except for dangerbot, all other bots may crawl every directory except the temp directory.

5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot may not fetch the listing page of the links directory; all other robots are allowed everywhere except the email.html page.

6) User-agent: abcbot
Disallow: /*.gif$
This entry excludes all files of a particular type (here .gif) from crawling.

7) User-agent: abcbot
Disallow: /*?
This entry keeps the crawler away from dynamic pages (URLs that contain a query string).

Note: the Disallow field may contain * to match any sequence of characters and may end with $ to mark the end of the name. For example, to exclude all gif files from Google's image crawling while allowing other image types:

User-agent: Googlebot-Image
Disallow: /*.gif$

Disadvantages of robots.txt:

Trouble with the Disallow field:
Disallow: /css/ /cgi-bin/ /images/
Different spiders read this line in different ways. Some ignore the spaces and read it as /css//cgi-bin//images/, while others may consider only /images/ or /css/ and ignore the rest. The correct syntax is:

Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/

Listing every file:
Specifying each and every file name within a directory is the most commonly made mistake:

Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html

The above can simply be written as:

Disallow: /ab/
Disallow: /op/

A trailing slash says a lot: the whole directory is off limits.

Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Field names are not case sensitive, but the data, such as directory and file names, is case sensitive.

Conflicting syntax:
User-agent: *
Disallow: /

User-agent: Redbot
Disallow:
What will happen here? Redbot's own record permits it to crawl everything, but does that permission override the Disallow of the * record, or does the Disallow win? The sketch below shows how a standard parser resolves it.
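As an illustration only, the following minimal sketch feeds those conflicting records to Python's standard urllib.robotparser module. Under this parser (and under the common reading of the Robots Exclusion Standard), a robot obeys the record that names it and falls back to the * record only when no named record matches, so Redbot ends up allowed while every other robot is blocked.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Redbot",
    "Disallow:",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Redbot", "/page.html"))    # True: Redbot's own record wins
print(rp.can_fetch("otherbot", "/page.html"))  # False: falls back to the * record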
II. Search Engine Robots: Meta-tag Explained

What is the robots meta tag?
In addition to robots.txt, search engines have another tool for controlling how pages are crawled: the robots META tag, which tells a web spider whether to index a page and whether to follow the links on it. It can be more useful in some situations, because it works on a page-by-page basis, and it also helps when you lack the permission to access the server's root directory to manage the robots.txt file. The tag is placed inside the header portion of the HTML.

Format of the robots meta tag:
In the HTML document it is placed in the HEAD section:

<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to...">
<title>...</title>
</head>
<body>

Robots meta tag options:
There are four options that can be used in the CONTENT portion of the robots meta tag: index, noindex, follow and nofollow. The tag above allows search engine robots to index the page and follow all the links residing on it. If the site admin does not want a page indexed or any of its links followed, index,follow can be replaced with noindex,nofollow. Accordingly, the site admin can use the robots meta tag in the following combinations:

<META NAME="robots" CONTENT="index,follow"> - Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> - Don't index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> - Index this page, but don't follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> - Don't index this page, don't follow links from this page.
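The article stops at the tag itself, but as a rough sketch of how the other side works, a crawler could read the robots meta tag from a fetched page using nothing more than Python's standard html.parser module; the RobotsMetaParser class and the sample page below are invented for illustration.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives = [d.strip().lower() for d in content.split(",")]

page = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print("may index:", "noindex" not in parser.directives)    # False for this page
print("may follow:", "nofollow" not in parser.directives)  # True for this page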