  Indexing Process of SEO
Posted by Anilrgowda (Administrator), 13-Jan-2007, 10:45 PM

There is a lot of speculation about how search engines index websites. The exact workings of the indexing process are shrouded in mystery, since most search engines publish little about how they architect it. Webmasters get some clues from the crawler visits recorded in their log reports, but they cannot tell from those how the indexing happens or which pages of their website were actually crawled.
While the speculation may continue, here is a theory, based on experience, research and clues, about how search engines may go about indexing 8 to 10 billion web pages every so often, and why newly added pages are slow to show up in their index. This discussion centers on Google, but we believe that other popular search engines such as Yahoo and MSN follow a similar pattern.
Google is believed to run from about 10 Internet Data Centers (IDCs), each housing 1,000 to 2,000 Pentium 3 or Pentium 4 servers running Linux.
Google has over 200 (some think ‘over 1,000’) crawlers / bots scanning the web each day. They do not necessarily divide the work exclusively, which means different crawlers may visit the same site on the same day without knowing that others have already been there. This is probably what produces the ‘daily visit’ record in your traffic log reports, which keeps webmasters very happy about the frequent visits.
Some crawlers’ only job is to grab new URLs (let’s call them ‘URL grabbers’ for convenience). The URL grabbers collect the links and URLs they detect on various websites (including links pointing to your site) and the old and new URLs they detect on your site. They also capture the ‘date stamp’ of files when they visit your website, so that they can identify ‘new content’ or ‘updated content’ pages. The URL grabbers respect your robots.txt file and robots meta tags, so that they include or exclude the URLs you do or do not want indexed. (Note: the same URL with different session IDs is recorded as different ‘unique’ URLs. For this reason session IDs are best avoided; otherwise the pages can be mistaken for duplicate content.) The URL grabbers spend very little time and bandwidth on your website, since their job is rather simple. However, just so you know, they need to scan 8 to 10 billion URLs on the web each month. Not a petty job in itself, even for 1,000 crawlers.
The URL grabbers write the captured URLs, with their date stamps and other status information, to a ‘Master URL List’ so that these can be deep-indexed by other, specialized crawlers.
The master list is then processed and classified somewhat like this:
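To illustrate the session ID note above, here is a minimal Python sketch of how a crawler (or your own site audit script) might collapse such URLs into one. The parameter names in SESSION_PARAMS and the example URLs are assumptions for illustration only; we do not know how any search engine actually handles this.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of query parameters that carry session state.
# Real sites use many different names; adjust for your own site.
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def strip_session_id(url: str) -> str:
    """Return the URL with known session-ID query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )

# Two visits to the same page, each with a different session ID.
urls = [
    "http://www.domain.com/page.php?id=7&sid=abc123",
    "http://www.domain.com/page.php?id=7&sid=zzz999",
]

# After stripping, both collapse to a single entry.
unique = {strip_session_id(u) for u in urls}
print(unique)
```

Without a step like this, each session ID would produce its own entry in the URL list even though the page content is identical.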
a) New URLs detected
b) Old URLs with new date stamp
c) 301 & 302 redirected URLs
d) Old URLs with old date stamp
e) 404 error URLs
f) Other URLs
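The classification above can be sketched roughly as follows. This is only an illustration of the theory, not Google's actual logic: the UrlRecord fields, the file-extension list and the rule that any URL with a query string counts as ‘other’ are all our assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed extensions for documents that go to the 'Other URLs' bucket.
OTHER_EXTENSIONS = (".pdf", ".doc", ".ppt")

@dataclass
class UrlRecord:
    url: str
    status: int                          # HTTP status seen by the URL grabber
    date_stamp: str                      # ISO date of the file, e.g. "2007-01-13"
    indexed_stamp: Optional[str] = None  # date stamp at last indexing, None if never indexed

def classify(rec: UrlRecord) -> str:
    """Return the letter of the master-list category for one record."""
    if rec.status == 404:
        return "e"  # 404 error URL
    if rec.status in (301, 302):
        return "c"  # redirected URL
    if "?" in rec.url or rec.url.lower().endswith(OTHER_EXTENSIONS):
        return "f"  # dynamic URL or non-HTML document
    if rec.indexed_stamp is None:
        return "a"  # new URL
    if rec.date_stamp > rec.indexed_stamp:
        return "b"  # old URL with new date stamp (ISO dates compare as strings)
    return "d"      # old URL with old date stamp

print(classify(UrlRecord("http://www.domain.com/about.html", 200, "2007-01-13", "2006-12-01")))
```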
The real indexing is done by (what we’re calling) ‘Deep Crawlers’. A deep crawler’s job is to pick up URLs from the master list, deep crawl each one and capture all the content: text, HTML, images, Flash etc.
Priority is given to ‘Old URLs with new date stamp’, as they relate to content that is already indexed but has been updated. ‘301 & 302 redirected URLs’ come next in priority, followed by ‘New URLs detected’. High priority is also given to URLs whose links appear on several other sites; these are classified as ‘important’ URLs. Sites and URLs whose date stamp and content change on a daily or hourly basis are ‘stamped’ as ‘News’ sites, which are indexed hourly or even minute by minute.
‘Old URLs with old date stamp’ and ‘404 error URLs’ are skipped altogether. There is no point wasting resources on ‘Old URLs with old date stamp’, since the search engine already has that content indexed and it has not been updated. ‘404 error URLs’ are URLs collected from various sites that turn out to be broken links or error pages. These URLs do not show any content.
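The priority rules described above could be modelled with a simple priority queue. This is a sketch under our own assumptions: the category names, the numeric ranks and the idea of breaking ties by inbound link count are ours, chosen only to mirror the ordering in the text.

```python
import heapq

# Lower rank = crawled earlier. Categories missing from this table
# (unchanged URLs, 404s, 'other' URLs) are not scheduled here.
CATEGORY_RANK = {"updated": 0, "redirect": 1, "new": 2}

def crawl_order(entries):
    """entries: iterable of (url, category, inbound_links).

    Returns the URLs in the order a deep crawler would pick them up.
    """
    heap = []
    for seq, (url, category, inbound_links) in enumerate(entries):
        rank = CATEGORY_RANK.get(category)
        if rank is None:
            continue  # skipped: nothing new to index, or left to special crawlers
        # Within a category, URLs with more inbound links ('important' URLs)
        # come first; seq keeps the original order for exact ties.
        heapq.heappush(heap, (rank, -inbound_links, seq, url))
    order = []
    while heap:
        order.append(heapq.heappop(heap)[-1])
    return order

master_list = [
    ("/new-page", "new", 5),
    ("/old-same", "unchanged", 50),
    ("/updated-a", "updated", 1),
    ("/moved", "redirect", 0),
    ("/updated-b", "updated", 9),
    ("/gone", "404", 3),
]
print(crawl_order(master_list))
```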
The ‘Other URLs’ may include dynamic URLs, URLs with session IDs, PDF documents, Word documents, PowerPoint presentations, multimedia files etc. Google needs to process these further and assess which ones are worth indexing and to what depth. It perhaps allocates the indexing of these to ‘Special Crawlers’.
When Google ‘schedules’ the ‘Deep Crawlers’ to index ‘New URLs’ and ‘301 & 302 redirected URLs’, just the URLs (not the descriptions) start appearing in the search engine results pages when you run the search “site:www.domain.com” in Google. These are often called ‘supplemental results’, which means that the Deep Crawlers will index the content ‘soon’, when they get the time to do so.
Since Deep Crawlers need to crawl ‘billions’ of web pages each month, they can take as long as 4 to 8 weeks to index even updated content. New URLs may take longer to index.
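To put the monthly figures in perspective, here is a quick back-of-the-envelope calculation in Python. It uses the 8 to 10 billion pages and the 1,000 crawlers mentioned earlier in this post; the 30-day month and the even split of work across crawlers are our simplifying assumptions.

```python
# Seconds in a month, assuming 30 days.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60

def pages_per_second(pages_per_month: float, crawlers: int = 1) -> float:
    """Average fetch rate needed per crawler to cover the pages in one month."""
    return pages_per_month / SECONDS_PER_MONTH / crawlers

for pages in (8e9, 10e9):
    overall = pages_per_second(pages)
    per_crawler = pages_per_second(pages, crawlers=1000)
    print(f"{pages:,.0f} pages/month: "
          f"{overall:,.0f} pages/s overall, {per_crawler:.1f} pages/s per crawler")
```

That works out to roughly 3,100 to 3,900 pages every second around the clock, or three to four pages per second for each of 1,000 crawlers, before any allowance for retries, politeness delays or downtime.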
Once the Deep Crawlers index the content, it goes into their originating IDCs. The content is then processed, sorted and replicated (synchronized) to the rest of the IDCs. A few years back, when the data size was manageable, this synchronization happened about once a month and lasted about 5 days; it was known as the ‘Google Dance’. Nowadays the synchronization happens constantly, which some people call ‘Everflux’.
When you hit www.google.com from your browser, you can land at any of their 10 IDCs, depending upon their speed and availability. Since the data at any given time is slightly different at each IDC, you may get different results at different times or on repeated searches for the same term (Google Dance).
The bottom line is that one needs to wait as long as 8 to 12 weeks to see full indexing in Google. Consider this ‘cooking time’ in ‘Google’s kitchen’. The only way to speed up the indexing process is to increase the ‘importance’ of your web pages by getting several incoming links from good sites, unless you personally know Sergey Brin and Larry Page and have significant influence over them.
Dynamic URLs may take longer to index (sometimes they do not get indexed at all), since even a small amount of data can generate an unlimited number of URLs, which can clutter Google’s index with duplicate content.
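A small Python example shows how quickly dynamic URLs multiply. The page name and the four query parameters below are made up for illustration; any real site will have its own.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical display options for one small product listing page.
params = {
    "sort": ["price", "name", "date"],
    "order": ["asc", "desc"],
    "per_page": ["10", "20", "50"],
    "view": ["list", "grid"],
}

def dynamic_urls(base, params):
    """Yield every URL that the parameter combinations can produce."""
    keys = list(params)
    for combo in product(*(params[k] for k in keys)):
        yield f"{base}?{urlencode(dict(zip(keys, combo)))}"

urls = list(dynamic_urls("http://www.domain.com/products.php", params))
print(len(urls))  # 3 * 2 * 3 * 2 = 36 URLs for the same underlying data
```

Four harmless-looking options already give 36 addresses for one set of products, and every extra parameter multiplies that number again.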
What to do:
Ensure that you have cleared all roadblocks for crawlers and they can freely visit your site and capture all URLs. Help crawlers by creating good interlinking and sitemaps on your website.
Get lots of good incoming links to your pages from other websites to improve the ‘importance’ of your web pages. There is no special need to submit your website to search engines. Links to your website on other websites are sufficient.
Patiently wait for 4 to 12 weeks for the indexing to happen.
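One easy roadblock to check is your own robots.txt. The sketch below uses Python's standard urllib.robotparser module to test whether a crawler may fetch a given page. The robots.txt content, the URLs and the is_crawlable helper are examples we made up; substitute your own file and pages.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content (an assumption; paste in your own file).
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

def is_crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the rules above let the given user agent fetch the URL."""
    return parser.can_fetch(agent, url)

for url in (
    "http://www.domain.com/index.html",
    "http://www.domain.com/private/report.html",
):
    print(url, "allowed" if is_crawlable(url) else "blocked")
```

If a page you want indexed shows up as blocked, fix the Disallow rule first; no amount of incoming links will help a crawler that has been told to stay out.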