Monday, March 21, 2011

How the Googlebot web crawler works - an example

I've been wondering how the Google spider works.  Google's main web crawler is called Googlebot, so I created a website and watched my server logs to see what it does.

For this test I created a nine-page website.  It starts with the home page and three others, all linked to each other symmetrically, so that all four pages are linked equally.  Then I added a fifth web page that is the top page of a four-page section.  I had one of the first four pages link to the top page of the new section, then had the top page link to the four new pages.  Each of the four new pages links only back to the home page.
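
To make the link structure easier to picture, here is a minimal Python sketch of the nine pages as a link graph.  The page names (home, main1, sec1 and so on) are made up for illustration, and I'm assuming the page that links into the new section is one of the three non-home pages.  The click_depth function only counts how many links away from the home page each page sits; it is not a claim about how Googlebot actually schedules its crawl.

```python
from collections import deque

# Hypothetical page names; the real URLs are not given in this post.
MAIN = ["home", "main1", "main2", "main3"]
SECTION = ["sec1", "sec2", "sec3", "sec4"]


def build_site():
    """Return the site as a dict mapping each page to the pages it links to."""
    links = {}
    # The four main pages all link to each other (symmetrical linking).
    for page in MAIN:
        links[page] = [other for other in MAIN if other != page]
    # Assumption: main1 is the one main page that links to the section top page.
    links["main1"].append("section_top")
    # The section top page links to the four new pages.
    links["section_top"] = list(SECTION)
    # Each new page links only back to the home page.
    for page in SECTION:
        links[page] = ["home"]
    return links


def click_depth(links, start="home"):
    """Breadth-first walk: number of clicks from the start page to each page."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links[page]:
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


if __name__ == "__main__":
    for page, clicks in sorted(click_depth(build_site()).items(), key=lambda kv: kv[1]):
        print(clicks, page)
```

Under these assumptions the three other main pages sit one click from the home page, the section top page two clicks, and the four section pages three clicks.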

When Googlebot first visits, it will see the home page with only four internal links.  Since my website is new and nothing links to it yet, I embedded a link to the site in a blog post.  This pinged Google and got the process started.

I don't know how long it took for Googlebot to show up and request the front page and robots.txt.  It might have been a few minutes, it might have been a few hours.  I forgot to clock it, but I think it was less than 6 hours.

Googlebot got the links to the three other main pages and went silent.  Then about three and a half hours later it came back and got the first of the three other pages.  It fetched the remaining pages at a rate of about one every nine minutes.

Then about two and a half hours later Googlebot came and got the section top page, which it had found linked from one of the three other main pages.  About three hours after that, Googlebot came and got each of the four section pages, again at about one every nine minutes.
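
If you want to measure these gaps from your own logs instead of eyeballing them, something like the following Python sketch would do it.  It assumes the server writes the common Apache/Nginx "combined" log format (this post does not say which server or format was used), and the function names are my own.  Matching on the word Googlebot in the user agent is a simplification, since anyone can send that string.

```python
import re
from datetime import datetime

# Assumed "combined" log format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def googlebot_hits(lines):
    """Return (timestamp, path) for every line whose user agent mentions Googlebot."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and "Googlebot" in match.group("agent"):
            when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
            hits.append((when, match.group("path")))
    return hits


def gaps_in_minutes(hits):
    """Minutes between consecutive Googlebot requests, in time order."""
    times = sorted(when for when, _ in hits)
    return [(later - earlier).total_seconds() / 60
            for earlier, later in zip(times, times[1:])]
```

Passing an open log file to googlebot_hits and the result to gaps_in_minutes gives the list of intervals between visits, which is the number I forgot to clock by hand.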

What I did next was make some changes to all of the web pages and add several new web pages and links for the spider to pick up.

So far the web crawler has not come back and re-indexed any of the pages it has already seen.  I don't know if that will take days or weeks.  I'll post here and tell you.
