Most frequent users and visitors rely on the various publicly accessible search engines to find the information they need. But how is this information supplied by the search engines, and where do they collect it from? Fundamentally, most search engines maintain their own database of information. This database covers the sites available on the web and, in turn, holds details about the pages of every available website. Behind the scenes the search engines use robots to gather this information and maintain the database; they build an index of the gathered data and then present it publicly, or at times for private use.

In this article we will discuss these entities that roam the global web environment, the web crawlers that move around in netspace: what they are and what purpose they serve, the pros and cons of using them, how we can keep our pages away from crawlers, and the differences between ordinary crawlers and robots. The discussion is divided into two sections:

I. Search Engine Spider: Robots.txt
II. Search Engine Robots: Meta-tags Explained

I. Search Engine Spider: Robots.txt

What is the robots.txt file?
A web robot is a program or piece of search engine software that visits sites regularly and automatically and crawls through the web's hypertext structure by fetching a document and recursively retrieving all the documents it references. Sometimes site owners do not want all of their pages crawled by web robots, and they can exclude some pages from crawling by addressing specific user agents. Most robots abide by the Robots Exclusion Standard, a set of conventions that restricts robot behaviour.

The Robots Exclusion Standard is a protocol used by the site administrator to control the movement of robots. When a search engine robot visits a site, it first looks for a file named robots.txt in the root of the domain. This is a plain text file that implements the Robots Exclusion Protocol by allowing or disallowing specific files within the site's directories. The site administrator can, for example, deny access to cgi, temporary or private directories, and can do so per robot by naming robot user agents.

The format of the robots.txt file is very simple. It consists of two kinds of field: a User-agent field and one or more Disallow fields.

What is User-agent?
This is the name used in the robots.txt file to address a specific search engine robot. For example:

User-agent: googlebot

We can also use the wildcard character * to address all robots:

User-agent: *

which means the record applies to every robot that visits.

What is Disallow?
The second field in the robots.txt file is Disallow:. These lines tell the robots which files may be crawled and which may not. For example, to prevent email.htm from being downloaded the syntax is:

Disallow: /email.htm

To stop crawling through a directory the syntax is:

Disallow: /cgi-bin/

White space and comments:
Any line in the robots.txt file that starts with # is treated as a comment. A comment at the beginning of the file is commonly used to note which site the rules belong to, for example:

# robots.txt for www.anydomain.com
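To make these two fields concrete, here is a minimal sketch (not part of the original article) that feeds an equivalent set of rules to Python's standard urllib.robotparser module and asks which URLs a robot may fetch; the rules and paths are illustrative only.

from urllib import robotparser

# Rules equivalent to a simple robots.txt: one record addressing every robot,
# with the cgi-bin directory and one page disallowed.
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /email.htm",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("googlebot", "/index.html"))       # True: not disallowed
print(rp.can_fetch("googlebot", "/cgi-bin/form.cgi"))  # False: under /cgi-bin/
print(rp.can_fetch("googlebot", "/email.htm"))         # False: disallowed explicitly

A real crawler would normally call set_url() and read() to download robots.txt from the site's root rather than passing the lines in directly.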
Example entries for robots.txt:

1) User-agent: *
Disallow:
The asterisk (*) in the User-agent field addresses all robots. Since nothing is disallowed, every robot is free to crawl everything.

2) User-agent: *
Disallow: /cgi-bin/
Disallow: /temp/
Disallow: /private/
All robots may crawl every file except those in the cgi-bin, temp and private directories.

3) User-agent: dangerbot
Disallow: /
Dangerbot is not allowed to crawl any of the directories; / stands for all directories.

4) User-agent: dangerbot
Disallow: /

User-agent: *
Disallow: /temp/
The blank line marks the beginning of a new User-agent record. Except for dangerbot, all other bots may crawl every directory except the temp directory.

5) User-agent: dangerbot
Disallow: /links/listing.html

User-agent: *
Disallow: /email.html
Dangerbot may not fetch the listing page of the links directory; all other robots are allowed everywhere except the email.html page.

6) User-agent: abcbot
Disallow: /*.gif$
This entry excludes all files of a particular type (here .gif) from crawling.

7) User-agent: abcbot
Disallow: /*?
This entry keeps the crawler away from dynamic pages (URLs that contain a query string).

Note: the Disallow field may contain * to match any sequence of characters and may end with $ to mark the end of the name. For example, to exclude all gif files from Google's image crawling while allowing other image types:

User-agent: Googlebot-Image
Disallow: /*.gif$

Disadvantages of robots.txt:

Trouble with the Disallow field:
Disallow: /css/ /cgi-bin/ /images/
Different spiders read this line in different ways. Some ignore the spaces and read it as /css//cgi-bin//images/, while others may consider only /images/ or /css/ and ignore the rest. The correct syntax is:

Disallow: /css/
Disallow: /cgi-bin/
Disallow: /images/

Listing every file:
Specifying each and every file name within a directory is the most commonly made mistake:

Disallow: /ab/cdef.html
Disallow: /ab/ghij.html
Disallow: /ab/klmn.html
Disallow: /op/qrst.html
Disallow: /op/uvwx.html

The above can simply be written as:

Disallow: /ab/
Disallow: /op/

A trailing slash says a lot: the whole directory is off limits.

Capitalization:
USER-AGENT: REDBOT
DISALLOW:
Field names are not case sensitive, but the data, such as directory and file names, is case sensitive.

Conflicting syntax:
User-agent: *
Disallow: /

User-agent: Redbot
Disallow:
What will happen here? Redbot's own record permits it to crawl everything, but does that permission override the Disallow of the * record, or does the Disallow win? The sketch below shows how a standard parser resolves it.
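As an illustration only, the following minimal sketch feeds those conflicting records to Python's standard urllib.robotparser module. Under this parser (and under the common reading of the Robots Exclusion Standard), a robot obeys the record that names it and falls back to the * record only when no named record matches, so Redbot ends up allowed while every other robot is blocked.

from urllib import robotparser

rules = [
    "User-agent: *",
    "Disallow: /",
    "",
    "User-agent: Redbot",
    "Disallow:",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Redbot", "/page.html"))    # True: Redbot's own record wins
print(rp.can_fetch("otherbot", "/page.html"))  # False: falls back to the * record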
II. Search Engine Robots: Meta-tag Explained

What is the robots meta tag?
In addition to robots.txt, search engines have another tool for controlling how pages are crawled: the robots META tag, which tells a web spider whether to index a page and whether to follow the links on it. It can be more useful in some situations, because it works on a page-by-page basis, and it also helps when you lack the permission to access the server's root directory to manage the robots.txt file. The tag is placed inside the header portion of the HTML.

Format of the robots meta tag:
In the HTML document it is placed in the HEAD section:

<html>
<head>
<META NAME="robots" CONTENT="index,follow">
<META NAME="description" CONTENT="Welcome to...">
<title>...</title>
</head>
<body>

Robots meta tag options:
There are four options that can be used in the CONTENT portion of the robots meta tag: index, noindex, follow and nofollow. The tag above allows search engine robots to index the page and follow all the links residing on it. If the site admin does not want a page indexed or any of its links followed, index,follow can be replaced with noindex,nofollow. Accordingly, the site admin can use the robots meta tag in the following combinations:

<META NAME="robots" CONTENT="index,follow"> - Index this page, follow links from this page.
<META NAME="robots" CONTENT="noindex,follow"> - Don't index this page, but follow links from this page.
<META NAME="robots" CONTENT="index,nofollow"> - Index this page, but don't follow links from this page.
<META NAME="robots" CONTENT="noindex,nofollow"> - Don't index this page, don't follow links from this page.
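The article stops at the tag itself, but as a rough sketch of how the other side works, a crawler could read the robots meta tag from a fetched page using nothing more than Python's standard html.parser module; the RobotsMetaParser class and the sample page below are invented for illustration.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collects the directives from a <meta name="robots"> tag, if present."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives = [d.strip().lower() for d in content.split(",")]

page = '<html><head><meta name="robots" content="noindex,follow"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print("may index:", "noindex" not in parser.directives)    # False for this page
print("may follow:", "nofollow" not in parser.directives)  # True for this page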