Category Archives: google

Goooooogle

Anil Dash:

Google’s announcement of Knol shows that they understand some of their key business drivers very well; with as much as 5% of the search result links for popular terms going to Wikipedia pages, a solution to capturing some of that traffic in an environment that Google can control and display ads on makes good business sense. The idea of sharing the earnings from that content with authors is also good business sense. But as with Google Pages (Page Creator), Blogger, Google Notebook, JotSpot, Google Docs/Writely and other tools, Google has not proven that it understands content creation and publishing as well as it understands its core businesses of search and advertising, or even its ancillary tools for communication and collaboration.

Worse, Knol shares with Google Book Search the problem of being both indexed by Google and hosted by Google. This presents inherent conflicts in the ranking of content, as well as disincentives for content creators to control the environment in which their content is published. This necessarily disadvantages competing search engines, but more importantly eliminates the ability for content creators to innovate in the area of content presentation or enhancement. Anything that is written in Knol cannot be presented any better than the best thing in Knol.

Danah Boyd:

…given that page rank algorithms are proprietary, I can’t wait to see what happens when Knol articles are “magically” higher in rank than the About and Wikipedia equivalents.


The anatomy of Google

Only nine years late, via Speaking Freely, I am reading the paper ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’ (a.k.a. Google) by Sergey Brin and Larry Page.

I liked this bit about the Google crawler interrupting an online game:

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, “Wow, you looked at a lot of pages from my web site. How did you like it?” There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, “This page is copyrighted and should not be indexed”, which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.

Source: ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’ Brin/Page, p. 10
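
Reading that, it struck me how little code it now takes to be a polite crawler of the kind they describe. Here is a rough sketch of checking the robots exclusion protocol before fetching a page; the site URL and crawler name are placeholders of mine, not anything from the paper:

# A minimal sketch of honouring the robots exclusion protocol
# before fetching a page. The URL and user-agent are placeholders.
from urllib import robotparser, request

USER_AGENT = "ExampleCrawler/0.1"  # hypothetical crawler name
page_url = "http://example.com/some/page.html"

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

if rp.can_fetch(USER_AGENT, page_url):
    req = request.Request(page_url, headers={"User-Agent": USER_AGENT})
    with request.urlopen(req) as resp:
        html = resp.read()
    print("fetched", len(html), "bytes")
else:
    print("robots.txt disallows this page; skipping")

Of course, as the quote makes clear, the hard part was never the protocol but the millions of servers that ignore it or break in ways no test suite anticipates.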

It is also interesting to note the beginnings of Google Book Search in the acknowledgements:

The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA and NASA, and by Interval Research, and the industrial partners of the Stanford Digital Libraries Project.

Source: ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’ Brin/Page, p. 16

Note also their thoughts on the relationship of search engines and advertising:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is “The Effect of Cellular Phone Use Upon Driver Attention”, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Source: ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’ Brin/Page, p. 18
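
The PageRank they refer to is, at heart, an iterative redistribution of importance along links. Here is a toy power-iteration sketch of the idea; the three-page graph is invented for illustration, and the damping factor of 0.85 is the value the paper itself suggests:

# A toy power-iteration sketch of PageRank over a tiny link graph.
# The graph is hypothetical; damping of 0.85 follows the paper.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling pages are ignored in this sketch
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page web: A and B link to each other, both link to C.
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 3))

The cellular-phone example in the quote is exactly this mechanism at work: a heavily cited study outranks commercial pages because citations, not payments, drive the score.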

Web stat tools and pinging

I’ve been looking for ping services and free web statistics tools for DaveUnderwood.com; below is what I’ve come across…

Free web statistics tools via contentious.com (updated):

http://www.google.com/analytics/
http://www.sitemeter.com
http://awstats.sourceforge.net/
http://www.webtrends.com
http://mach5.com/products/analyzer/index.php
http://www.tracewatch.com/
http://www.addfreestats.com/
http://bbclone.de/
http://www.trafficfile.com/
http://www.haveamint.com/
http://www.summary.net/
http://www.reinvigorate.net/
http://statcounter.com (new)

Unfortunately, Google Analytics is invite-only at the moment; sign up here.

XML-RPC Ping Services [codex.wordpress.org]
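
Most of the services on that codex list accept the standard weblogUpdates.ping XML-RPC call, so you can also ping one by hand. A minimal sketch from Python follows; the endpoint shown (Ping-O-Matic) and the blog details are just placeholders to swap for your own:

# A minimal sketch of the weblogUpdates.ping XML-RPC call these
# services accept. Endpoint, blog name, and URL are placeholders.
import xmlrpc.client

PING_ENDPOINT = "http://rpc.pingomatic.com/"  # example aggregator endpoint
BLOG_NAME = "DaveUnderwood.com"
BLOG_URL = "http://daveunderwood.com/"

server = xmlrpc.client.ServerProxy(PING_ENDPOINT)
response = server.weblogUpdates.ping(BLOG_NAME, BLOG_URL)

# A conforming service returns a struct with 'flerror' and 'message'.
if response.get("flerror"):
    print("ping failed:", response.get("message"))
else:
    print("ping accepted:", response.get("message"))

In practice WordPress does this for you: paste the service URLs into the Update Services list and it pings them all on every new post.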