The Google Crawl

The Google crawl is simply the time Google is sending out requests to all the websites in the world and the sites respond with their text content. I hope to slowly document the Google™ crawl of my websites over time and publish the information here. This Google crawl analysis will start with a one month sample of the Google crawl of my website www.bobshowto.com. I've created another page with the actual sample Google crawl log entries.

Google does not crawl an entire website; it only crawls the text of the website. There are browsers, like the "Lynx" browser, that will show a website in this context. Google does not look at image files per se. Also, the crawl from Google does not look at Cascading Style Sheets. Knowing this, one could also realize that you could "spam" Google by using unusual "style sheets"; this might be a hint to SEOs (Search Engine Optimizers) out there.

Below is a one line example extract from the www.bobshowto.com log files. It contains the Internet Protocol (IP) address of the requestor, in this case Google. 64.68.82.144 is one of many IP addresses Google uses to "crawl" websites. Also in this Google log entry are the date and time and, I suppose, the time zone (-0700, that is, GMT minus 7 hours). You see the command "GET", and its argument is "/robots.txt". Robots.txt is the first file Google will attempt to fetch when beginning a crawl of a website. This is an optional file. Google's request is in HTTP/1.0 format, which I won't get into. My web server produced a return code of 200; this is a successful return, meaning the BobsHowTo website does have a robots.txt file.

Want to see mine? Copy and go to this link: "http://www.bobshowto.com/robots.txt". Did you know you could do that? Yes, my robots.txt file is "empty", but I think it's a good idea to have one these days; if nothing else, this way no error (a 404 error code) is produced indicating the file is not present.
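To make the point about an "empty" robots.txt concrete, here is a minimal sketch using Python's standard urllib.robotparser module. It only illustrates how a standards-following robot would read such a file; it is not a description of how Googlebot itself is implemented. The helper name allowed_paths and the path /private/notes.htm are made up for this illustration, and the file text is passed in as a string rather than fetched from the site.

```python
from urllib.robotparser import RobotFileParser


def allowed_paths(robots_txt_text, user_agent, paths):
    """Return {path: True/False} for whether user_agent may fetch each path.

    Hypothetical helper for illustration: the rules are parsed from a string,
    so nothing is downloaded.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt_text.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


# An empty robots.txt contains no rules, so nothing is blocked.
empty_rules = allowed_paths("", "Googlebot", ["/", "/browse_health.htm"])

# For contrast, a file with one Disallow rule (the path is an assumption).
blocking_rules = allowed_paths(
    "User-agent: *\nDisallow: /private/\n",
    "Googlebot",
    ["/", "/private/notes.htm"],
)

print(empty_rules)
print(blocking_rules)
```

With an empty file every path comes back as allowed, so the only practical effect of having the file is that the robot's request gets a 200 instead of a 404.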
In the log entry, following the return code is the size of the data returned. "Googlebot/2.1" is kind enough to clearly identify itself, and they even provide a link for additional information about the "spider" or "robot": http://www.google.com/bot.html. You know a spider crawls; the Googlebot is known as a spider, so it crawls the web. The Google crawl.

64.68.82.144 - - [09/Aug/2004:19:10:24 -0700] "GET /robots.txt HTTP/1.0" 200 123 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

So that's one line of the log file and the first file to be crawled. Note that the second file to be crawled is not my home page!

64.68.82.144 - - [09/Aug/2004:19:10:24 -0700] "GET /browse_health.htm HTTP/1.0" 200 27872 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

During a crawl from Google, the Googlebot seems to take notes about your website. The very first time your website is crawled, Google will read the robots.txt file. Then it will try to read your home page, typically like this:

64.68.82.178 - - [09/Aug/2004:19:58:56 -0700] "GET / HTTP/1.0" 200 44207 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"

Just "GET /" tells your web server to serve up the default home page. Once Google is familiar with your site, it may get the actual name of your home page, typically "index.html". This is probably the most common default name. The other possibility is that Google is crawling a page on your site due to a "hyperlink" from another website. But in my sample I'm sure this is not the case. Google clearly has generated a list of pages it already knows about and dispatched this list to the many servers that in aggregate do the Google crawl of a given site.

Google crawl broken up into groups

As you may have noted in the sample file on the page below, Googlebot uses many IP addresses to crawl a given site. For my site the crawl appears to be broken up into 4 "groups". I conclude this because there are 4 fetches of the robots.txt file, each one prior to reading or "GETting" many of my website's pages.
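The field-by-field reading of a log line, and the idea of counting "groups" by looking for robots.txt fetches, can be sketched in Python as follows. This is only an illustration under some assumptions: the regular expression assumes the combined log format seen in the three sample lines above, the function names parse_line and split_into_groups are made up here, and treating every robots.txt fetch as the start of a new group is my reading of the observation above, not anything Google documents. The month abbreviation ("Aug") is parsed with the default English locale.

```python
import re
from datetime import datetime

# Pattern for the log format seen in the sample lines (an assumption).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Split one log line into named fields; return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # "-0700" is parsed by %z into a UTC offset of minus 7 hours.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def split_into_groups(entries):
    """Start a new group at every robots.txt fetch (heuristic from the text)."""
    groups = []
    for entry in entries:
        if entry["path"] == "/robots.txt" or not groups:
            groups.append([])
        groups[-1].append(entry)
    return groups


# The three sample lines quoted above.
SAMPLE_LINES = [
    '64.68.82.144 - - [09/Aug/2004:19:10:24 -0700] "GET /robots.txt HTTP/1.0" 200 123 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '64.68.82.144 - - [09/Aug/2004:19:10:24 -0700] "GET /browse_health.htm HTTP/1.0" 200 27872 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '64.68.82.178 - - [09/Aug/2004:19:58:56 -0700] "GET / HTTP/1.0" 200 44207 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
]

entries = [parse_line(line) for line in SAMPLE_LINES]
groups = split_into_groups(entries)
for number, group in enumerate(groups, start=1):
    print(number, [entry["path"] for entry in group])
```

Run over a full month of Googlebot entries, the number of lists returned by split_into_groups would be the number of groups in the sense used above.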
My site had about 74 pages at the time of this crawl. One page is not linked from the site, but Google has remembered this page for a long time, probably a year, and calls it a "Supplemental result". There seem to be several reasons for a supplemental result; one is a page that used to have links pointing to it on a website, but no longer has any links to it at the time of the Google crawl.

The first group is crawled starting 08/09/2004 at 19:10:24. The second group appears to be a single file, canada-pharmacy-insert.htm, at 08/12/2004 09:47:23. I'm not sure whether this file was linked to at the time of the crawl. The third group is crawled starting 08/12/2004 at 12:10:38. The fourth group is crawled starting 08/14/2004 at 12:27:25; note this fourth group is just two lines.

The next Google crawl

Note the next Google crawl begins on 08/22/2004 at 21:59:46. It has a 3 group pattern. But you'll note that although the groups are similar, the order of the pages read in each of the groups is not the same! And again, oddly enough, Google does not read my home page first after reading the robots.txt file!

Characteristics of the Google crawl

Typically two to three days after the Googlebot crawls, or spiders, a website, your revised page content will show up in the Google cache. I have a simple Google search which I use to check the Google status of all my pages. It is:

The link above no longer works well (9/21/2007). I've put my bottom border into an "IFrame" for several reasons. My new search which shows all indexed pages is:

You'll see the entire link in your address bar if you click on it. This link will bring up all my web pages. The trick here is designing your site so you have some very well known and unique stuff in every page of your website. Another neat thing about this is that I get to see the date of the page Google indexed, and this will be the one in the Google cache.

My Google crawl problem

As of 08/29/2004 I have about 13 pages or so from my website stuck at a Google Pagerank™ of 0. The pages in question are encompassed in the second group that Google crawls on my site. I had the misfortune of having my server go down twice in a row, in synchronism with two consecutive Google crawls. This apparently is a real "NO-NO". These pages have had a Pagerank of 0 for more than 3 months, and this shows in the Google search results, the SERPs (Search Engine Results Pages). These pages had been indexed and all had a Pagerank of 3 or 4 before this event. My site had a Pagerank of 6 before this event; now it's stuck at 5.
It could be that Google has not updated Pagerank itself for 3 months or more; more likely, I've encountered some kind of per page penalty. The pages I mention above now have Pagerank. I'm still not sure as of 2/15/2005 whether some kind of SERPs penalty is being applied, but it finally appears to be fading away.

The "new" page SERPs boost, and spamming Google

It's possible I accidentally hit a spam penalty. If Google gives a boost to a new page's position in the SERPs, which it appears to, it's logical for a spammer to try to make his web pages look new even though they're not. This could be achieved in numerous ways, and I probably stumbled on one by accident and coincidence!

The "Limbo" state

I've coined a new term; I call it "Limbo". A web page is in Limbo when Google does not show a title or description in the results but simply shows a URL (Uniform Resource Locator) to the web page. A page can definitely go into Limbo if Google already knows about it and then, when Google crawls, an error is returned for the page or, worse, the Googlebot receives no response at all from the website it is trying to crawl. Several of my web pages went into this state and recovered at the same time I had my Google crawl problem. Here is a link I use to find all my pages that exist, including the ones in "Limbo":

Google crawl conclusion

I hope this gives you a feel for what's going on during an actual Google crawl of a website. What happens after this is not nearly as well known.

Bob