  You, Some SEO and a Spider
Posted 02-Jan-2007, 02:41 AM by Anilrgowda (Administrator)

Seducing the robots and spiders
What do you imagine when you think of successful seduction?
Right now I’m thinking of thousands of tiny spiders crawling over my computer screen. No, I’m not mentally ill – I’m talking about making your website seductive; or rather, attractive to webspiders and net-bots.
What are webspiders, what are net-bots?
Web-spiders, ants and crawlers are just some of the names for the automatic scripts that browse the Internet in a methodical fashion. They harvest data for different kinds of processing. They can be used internally (a website may employ a net-bot to check for broken links), or they can be used by search engines to index new and updated websites.
For some examples of these webcrawlers please have a browse through Wikipedia’s selection:
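To make that a little more concrete, here is a minimal sketch in Python of the first thing any crawler does with a page: it pulls out the links so it knows where to go next. The names LinkCollector and extract_links and the sample HTML are made up for illustration; real crawlers such as Googlebot are far more elaborate.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: remembers the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag just opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the link targets found in an HTML string, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


# Made-up page fragment, just for the demonstration.
sample_page = '<p><a href="/articles/">Articles</a> and <a href="/contact.html">Contact</a></p>'
print(extract_links(sample_page))
```

A broken-link checker would then request each of those addresses and report the ones that fail; a search engine spider would queue them up for indexing instead.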
http://en.wikipedia.org/wiki/Web_cra...f_web_crawlers
Why would I seduce a spider?
Never thought I’d write that. Crawlers are good for your website because they let the search engines find you. Without them your website would be very difficult to find.
The benefits of webcrawlers:
  • Your website will be indexed by the major search engines.
  • The crawlers will notice updates and the search engines will update accordingly.
  • The search engine will display your website correctly.
How do I seduce a Spider?
Spiders like Googlebot (please see How Google Crawls my site for more details) want to index your website and they will find you if you have:
  • Links to your website from external (and *legitimate) URLs
  • Links to other websites (like directories you may feature in, for example).
  • Internal page links (the Bots use them to navigate)
(*By ‘legitimate’ I mean bona fide websites, which are not connected to your own website. It would not benefit you to create a single-page website just to link back from, for example.)
However, you do not want a crawler to index all the information in your website. It would be a waste of time having your /image directory listed on Google, for example, so you must disallow the crawlers from accessing this content. You may also want to protect your e-mail addresses from malignant crawlers (Please see ‘Are all crawlers safe?’ below).
To do this you should create a robots.txt file.
A robots.txt file is a simple, but potent, document that every website should keep in its root directory. This file is your ‘fart in the lift’; it is small, but very powerful in effect. With it you may stop a crawler harvesting certain pages or even entire directories by using the command -
Disallow:

A Mini robots.txt tutorial:

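1. Start a new plain-text document (in Notepad, for example) and name it robots.txt. The name must be exactly that, with an ‘s’, or the crawlers will not find it.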
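The name and the location matter because crawlers do not go hunting for the file: they ask for /robots.txt at the root of the host and nowhere else. A small sketch in Python of how that address is worked out from any page on the site (robots_txt_url is a name I have made up, and the page address is only an example):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the address a crawler would request before fetching page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, drop the page's own path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yoururl.co.uk/articles/jubjub.html"))
# prints http://www.yoururl.co.uk/robots.txt
```

Whichever page the crawler starts from, the answer is the same, which is why the file belongs in the root directory.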


2. Address the webcrawlers like this:

User-agent: *

The ‘User-agent’ line names the webcrawler you are addressing. If you place an asterisk there, in the way that I have done here, you address every webcrawler that happens upon your website. If you wish to address individual crawlers you should list them by name like this:

User-agent: Googlebot

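But you must then list the disallowed pages/directories for each crawler individually. A crawler that finds a section addressed to it by name follows only that section and ignores the rules listed under the asterisk.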
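If you want to see how the sections are picked, Python ships with a robots.txt reader (urllib.robotparser) that you can try your rules on before uploading them. This is only a sketch: the rules, the can_fetch helper and the crawler name SomeOtherBot are invented for the example, and real crawlers may differ in the finer details.

```python
from urllib.robotparser import RobotFileParser

# Invented rules: one section for Googlebot, one for everybody else.
RULES = """\
User-agent: Googlebot
Disallow: /products/images/

User-agent: *
Disallow: /user-list/email/
"""


def can_fetch(agent, path, rules=RULES):
    """Ask the standard-library parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


for agent in ("Googlebot", "SomeOtherBot"):
    for path in ("/products/images/logo.png", "/user-list/email/"):
        verdict = "allowed" if can_fetch(agent, path) else "blocked"
        print(agent, path, verdict)
```

Googlebot is kept out of /products/images/ but is still allowed into /user-list/email/, because it reads only its own section. If you want a rule to apply to a named crawler, repeat it in that crawler's section.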

For example:

User-agent: *
Disallow: /user-list/email/
Disallow: /products/images/
Disallow: /articles/contributors/

All files and folders inside these directories will be blocked and will not be indexed. Bear in mind that each path is written relative to the root of your website, which is also where the robots.txt file itself must live, for example;

http://www.yoururl.co.uk/robots.txt

Crawlers only look for the file at that address. A robots.txt placed in a sub-directory such as ‘index/’ will simply not be read, so keep the file in the root and start every path with a forward slash.

3. You may also want to disallow certain files. You can do so like this:

Disallow: /articles/jubjub.html
Disallow: /index/error_page.html
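Before you upload the file it is worth checking that the rules block what you expect and nothing more. Here is a rough sketch using Python's built-in urllib.robotparser with the directories and files from this tutorial; the is_blocked helper, the crawler name ExampleBot and the test paths are my own inventions.

```python
from urllib.robotparser import RobotFileParser

# The directory and file rules from the tutorial, gathered in one section.
RULES = """\
User-agent: *
Disallow: /user-list/email/
Disallow: /products/images/
Disallow: /articles/contributors/
Disallow: /articles/jubjub.html
Disallow: /index/error_page.html
"""


def is_blocked(path, agent="ExampleBot", rules=RULES):
    """True if a crawler that honours the rules should stay away from `path`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return not parser.can_fetch(agent, path)


for path in (
    "/products/images/logo.png",  # inside a disallowed directory
    "/articles/jubjub.html",      # a disallowed file
    "/articles/other.html",       # same directory, not listed
    "/products/",                 # the parent directory is not listed
):
    print(path, "blocked" if is_blocked(path) else "allowed")
```

A directory rule covers everything beneath it, while a file rule leaves its neighbours alone.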

Are all crawlers safe?

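No, some can and will bite you. There are many webcrawlers and they may visit your website for reasons other than indexing. You should attempt to protect certain information by disallowing the crawlers as I have shown you in the tutorial above, but remember that robots.txt is only a request: well-behaved crawlers honour it, while malignant ones are free to ignore it.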

Malignant Crawlers

They can be (much to my upset) used for Spamming. Malignant crawlers look through your website with a view to capturing all the e-mail addresses and other useful data displayed there.

If they do this you can expect an inbox full of Spam. I discovered 20 e-mails from a Japanese Adult dating website in my Herds of Words inbox today. I was not a happy bunny.

However, you can avoid this (I was just that little bit too late) if you encode the addresses differently, making it harder for these evil Bots to trap you.

If you are using Cascading Style Sheets (.css):

1. Create an html-tag to fit around the text you want to use as an e-mail address.
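2. In the css file you must define that tag, so (\40 is the CSS escape code for the @ sign, so the browser shows a normal address while the page source never contains one):

postmaster:after { content: "postmaster\40herdsofwords.co.uk"; }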
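If you have several addresses to hide, the rule can be generated rather than typed by hand. The sketch below is only an illustration of the idea: css_email_rule and EMAIL_PATTERN are names I have made up, the plain address is my assumption of what the rule above displays, and the pattern is a deliberately naive stand-in for what a harvesting Bot might search for.

```python
import re


def css_email_rule(tag, address):
    """Build a CSS rule that displays `address` after every <tag> element.

    Assumes the address contains no quotes or backslashes.
    """
    # \40 is the CSS escape for "@". The space after it ends the escape and
    # is swallowed by the browser, so it never shows up on screen.
    escaped = address.replace("@", "\\40 ")
    return '%s:after { content: "%s"; }' % (tag, escaped)


# Naive pattern of the kind an address harvester might use (an assumption).
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

rule = css_email_rule("postmaster", "postmaster@herdsofwords.co.uk")
print(rule)
print(EMAIL_PATTERN.findall(rule))  # the harvester's pattern finds nothing
```

The space after \40 is a precaution: without it, an address whose domain begins with a digit or one of the letters a to f would be read as part of the escape code.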

If that doesn’t help you, or you don’t use cascading style sheets, please have a look through this useful article by Daniel Cody, http://evolt.org/article/Using_Apach...8/15126