The anatomy of Google

Only nine years late, via Speaking Freely, I am reading the paper ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’ (a.k.a. Google) by Sergey Brin and Larry Page.

I liked this bit about the Google crawler interrupting an online game:

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, “Wow, you looked at a lot of pages from my web site. How did you like it?” There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, “This page is copyrighted and should not be indexed”, which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.

Source: ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’, Brin/Page, p. 10
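The robots exclusion protocol the authors mention is still how crawlers decide what to skip, and Python’s standard library ships a parser for it. A minimal sketch of how a polite crawler would consult it before fetching a page – the robots.txt content and crawler name here are invented for illustration:

```python
# Sketch: checking the robots exclusion protocol before crawling a URL.
# The robots.txt rules and the "MyCrawler" user agent are hypothetical.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /game/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # normally rp.set_url(...) + rp.read()

print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://example.com/game/play"))   # False
```

Of course, as the paper notes, plenty of 1998-era site owners had never heard of robots.txt – a crawler still has to cope gracefully with pages that ask not to be indexed in plain English.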

It is also interesting to note the beginnings of Google Book Search in the acknowledgements:

The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA and NASA, and by Interval Research, and the industrial partners of the Stanford Digital Libraries Project.

Source: ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’, Brin/Page, p. 16

Note also their thoughts on the relationship of search engines and advertising:

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is “The Effect of Cellular Phone Use Upon Driver Attention”, a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page, 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 83], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Source: ‘The Anatomy of a Large-Scale Hypertextual Web Search Engine’, Brin/Page, p. 18
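The citation-style ranking that puts the driver-attention study on top can be sketched in a few lines. This is a simplified power-iteration version of PageRank over an invented four-page link graph; the damping factor 0.85 is the value the paper suggests, and in this toy sketch dangling pages (no outlinks) simply leak rank rather than being redistributed as the full algorithm would:

```python
# Simplified PageRank by power iteration. Graph and page names are invented.
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a base share, then inherits rank from its citers.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                continue  # dangling page: its rank leaks in this sketch
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# The page everyone links to ("study") ends up ranked highest, which is
# how a scholarly article can outrank commercial pages in the prototype.
graph = {
    "study": [],
    "shop1": ["study"],
    "shop2": ["study", "shop1"],
    "blog":  ["study"],
}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # prints: study
```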

On Blogging

I used to be a blogger. I had a LiveJournal to put the details of my day, my thoughts and my life in hypertext somewhere on the web. It coincided, like it does for a lot of the emo set, with my late teenage and university years. Blogging was cathartic, it was good, it felt vital and necessary and right. It was also new, and when you are Raging Against The Machine it is important to have that sense of “they don’t understand it but we do”. So I blogged about the things I hated, the things I loved, I blogged about my day and my dog. I have fond memories of it for a lot of reasons: it felt good to say those things whether I was right to be saying them or not, and it got the words down where I could make sense of them. And sometimes, when it connected you with like-minded people, it felt good to be understood.

But like going to Scouts or keeping a hand-written journal, you started to wonder why you did it. Were you too old for this? You were too busy, you couldn’t be bothered. There was no moment where I thought “why is it I blog again?” but over time that’s what it was about. Why was I telling strangers these things? What was I getting out of it? I now find myself probably a couple of years down the track from having blogged anywhere in any great substance. I don’t feel it anymore, although occasionally I’ll have a strong desire to release some vitriol and write invective, but I’ll write a sentence and feel silly.

Blogging is a paradox. In its purest form it can be your innermost thoughts and deepest secrets, the kind you would only talk to a few people about, yet you publish it online where 600 million strangers (and 5 people you thought didn’t know about your blog) can read it. But it also makes complete sense – a way to engage people about the issues we really want to talk about. I remember coming to a point where I blogged about something and the result wasn’t what I wanted – no passionate debate, no fists in the air… Instead, criticism, stodgy talking heads and cynical one-liners. “Damn”, I thought, “Where have all the cool people gone?”

But they hadn’t gone anywhere. I was changing, they were changing, the crisp black and white lines were being replaced by adult shades of grey, and I wasn’t studying anymore. I was working for the Man, which gave me some ammunition for a while, but it wasn’t the same. Different attempts to find that blogging magic failed to spark; I was answering my own questions as soon as I posed them on the screen, or the questions I wanted to ask suffered in the medium of the internet. I needed real people with real thoughts instead of NietscheApprentice666 firing missives from his keyboard somewhere in Long Island. Whenever I read the comments on a YouTube video these days, or visit an online gaming forum questing for tech answers, once I get what I was after I hastily beat a retreat up the stairs and back to the light.

I look at Bebo or MySpace and don’t begrudge them their fun; I remember logging onto ICQ and needing to connect. You can look at the current crop of Windows Live Spaces and Blogspots and recognise how important it is to get it out, to enter into those communities. It can often be messy, poorly punctuated and a sad reflection on the brain food the kids of the parcus congregatio get fed these days. But I think I understand.

There are those who still blog because that is their gift; they are still relevant and they are still getting something out of it. My insights are now my own, and the answers I want are only found some of the time at the end of a Google query. But what interests me is those of us who grew up with a blog or two: how is that going to play out? Will we return to blogging one day when we’re old and waiting for our next pill? What about all our hopes and dreams sitting out there in databases, on the Wayback Machine and in Google’s cache? It’s bigger now than it was; what have we unleashed?

Problems downloading large compressed files on Xtra Broadband

Where I live we get our broadband via Xtra, probably New Zealand’s largest ISP. In the past three or four weeks we have noticed problems downloading large compressed files. It started when we tried to download the EVE client, and later affected downloading and running any large compressed file: .zip, .exe, .rar. Downloaded executables would display “missing files” or “corrupt file” errors, and archive files would fail with CRC errors.

We’ve progressed from testing the local network and the router for connection problems to today, when I came across a couple of forum posts suggesting the problem may lie further upstream with Xtra: Corrupted compressed files, Completely different problem with Go Large Plan.

The Xtra Help Desk guy I spoke to hadn’t heard of any similar complaints so the problem may yet lie with our hardware here or our phone line. I’ll update this post if I learn more.
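One way to rule out local hardware before blaming the ISP is to hash a downloaded file and compare it against the checksum the publisher lists, where one is provided – if the digests differ on repeated downloads, the file really is being corrupted in transit. A sketch using Python’s standard library; the file name and published digest below are hypothetical:

```python
# Sketch: verifying a download against a published checksum.
# "eve-client.zip" and the expected digest are hypothetical examples.
import hashlib

def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large downloads don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# expected = "ab3f..."  # digest published alongside the download
# if sha256_of("eve-client.zip") != expected:
#     print("Corrupted in transit - re-download or try another mirror")
```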

UPDATE:
This morning we confirmed the issue can affect compressed files as small as 10MB or 15MB. Larger files are more likely to end up with missing or corrupt sections because of the longer download time.

UPDATE#2:
We’ve tested the connection with a different router and the problem is still occurring, which suggests the issue is either with our phone line or with Xtra. I spoke with the same Xtra Help Desk guy and although he didn’t know what was causing the problem, he says the problem is with Xtra and they are working to identify the cause and fix it.

And via Hard News, it looks like there are a bunch of problems over at Xtra at the moment:

Meanwhile, PA reader Janet Digby reports that Telecom is now trying to switch people back from the new accounts it is marketing:

“You are probably aware of this, but some Xtra customers are experiencing major delays with mail sent through Xtra.

Some mails were delayed up to 8 hours yesterday and some are yet to arrive.

While all this was happening I had an odd call from Xtra asking whether I would like to change plans – from Go Large to one with a data cap. The person calling seemed unable to answer even basic questions including why I would change to a plan with a data cap – except to say that there was less ‘interference’ with the plan she was suggesting. When I commented that all plans were supposed to be max speed she seemed confused regarding her mission.

Perhaps they are trying to encourage customers to pull back as their system can’t handle the additional traffic resulting from their new plans.

Customer service acknowledge they are having problems (I understand some people can’t get on to their broadband connection too). I asked why these issues weren’t listed on their website and he didn’t know.

To top it all off – their phone system is on the blink and the opening recording tells you that you might get cut off – which I did, twice!

(Last updated: 2nd December 2006)

Recommended Free Software List

My machine is overdue for a format and with that in mind I have put together my list of free software that I use daily. As a Windows user of many years I am still a creature of habit, but some of this software is Mac OS X/Linux compatible.

Ad-Aware SE Personal (Windows)
Still the most effective anti-spyware available for free. Lavasoft regularly issue update definitions which are a free download from within the program.

AxCrypt (Windows)
AES-128 and SHA-1 file encryption. An easy way to encrypt files for restricted access. For users wanting access to encrypted files but not wishing to install any software, an install-free “viewer” can be downloaded, only 70kb in size.

K-Lite Codec Pack/Media Player Classic (Windows)
Easy way to solve all your codec worries and get a free media player in one hit. Media Player Classic has a built-in DVD player, support for AVI subtitles, QuickTime and RealVideo support. It can also be configured to play video files in a dual monitor setup.

Mozilla Firefox (Windows/Linux/Mac OS X)
Browser of choice. Fast, secure and has decent web standards support. I am particularly fond of Firefox for the ease with which extensions can be installed and managed. Some of my favourite extensions include Adblock which will let you block images, embedded objects, etc from entire domains/IP addresses and Web Developer, a powerful toolset which lets you quickly access technical aspects of a website. See DB’s Best Firefox Extensions for more.

Mozilla Thunderbird (Windows/Linux/Mac OS X)
A long-time user of Microsoft Outlook, I switched to Thunderbird once I ran out of reasons not to. A sleek email client with the same inbuilt support for themes and extensions that Firefox has. Thunderbird’s adaptive junk email filter can be quickly trained to keep your inbox free of spam. I still haven’t found an easy way to export email and archive it to files, but your email can be backed up under Windows XP by manually copying the mailbox store in Thunderbird’s ‘Application Data’ folder.

OpenOffice.org (Windows/Linux/Mac OS X)
Faced with forking out for Microsoft Office 2003 or lumping it with Notepad/Wordpad/Microsoft Office Word Viewer 2003, I turned to OpenOffice.org somewhat reluctantly. So far it has been stellar as I’ve transitioned from version 1.1 to 2.0. OpenOffice supports Office formats like .xls and .doc and will let you save to those formats as well. It is a rock-solid program and now includes Impress and Base, alternatives to PowerPoint and Access.

Also:
Audacity – record, edit and export audio (Windows/Linux/Mac OS X)
Azureus – excellent BitTorrent client (Windows/Linux/Mac OS X)
Crimson Editor – a text editor that just works (Windows)
CutePDF – free PDF maker (Windows)
dBpowerAmp – right-click ‘Convert To’ music file conversion (Windows)
FileZilla – solid FTP client (Windows)
Skype – top notch VoIP and instant messaging (Windows/Linux/Mac OS X)

Update (2007-07-21):
100 Open Source Downloads – (Windows/Linux/Unix/Mac OS X/Classic Mac)

Web stat tools and pinging

I’ve been looking for ping services and free web statistics tools for DaveUnderwood.com; below is what I’ve come across…

Free web statistics tools via contentious.com (updated):

http://www.google.com/analytics/
http://www.sitemeter.com
http://awstats.sourceforge.net/
http://www.webtrends.com
http://mach5.com/products/analyzer/index.php
http://www.tracewatch.com/
http://www.addfreestats.com/
http://bbclone.de/
http://www.trafficfile.com/
http://www.haveamint.com/
http://www.summary.net/
http://www.reinvigorate.net/
http://statcounter.com (new)

Unfortunately Google Analytics is invite-only at the moment; sign up here.

XML-RPC Ping Services [codex.wordpress.org]

Blake Ross on open source marketing of Firefox

Sounds like an interesting Firefox session at Gnomedex.

Wired.com:

The conference espouses a bottom-up, audience-driven approach, making it an unpredictable if not outright chaotic affair… Discussion leaders included Blake Ross, of Firefox fame, whose presentation was upstaged by audience member…

ZDNet.com:

Blake then referenced the Firefox flicks project – and played a video called “Wheee!” from it that poked fun at Microsoft IE, which got a great response from the geeks in the Gnomedex crowd. However Dave Winer found it in poor taste, because it doesn’t address users. Dave asked: “what are you going to do for us?”. Dave said that he thinks Firefox will become just like Microsoft. Blake didn’t accept that – at which point a bit of a ‘Dave vs the crowd’ ruckus ensued. Chris Pirillo, Gnomedex organizer and host, had to step in and ask that the “conversation” be carried on later.

A final question asked about how Firefox will scale. Blake said that “I’m not looking to scale up to the size of Microsoft”. Overall a very interesting session, spiced up by Dave Winer and also Steve Gillmor’s interventions. The crowd was very much in support of Blake and Firefox, but even so Dave’s point that Firefox has to appeal to normal users instead of focusing on fighting Microsoft was a good one.

The curse of too much good web

I finally got round to listening to the podcast of Jason Kottke (kottke.org) and Heather Armstrong (dooce.com) at SXSW [sxsw.com]. And I’m glad I did: their discussion is an interesting insight into their positions as high-profile bloggers.

But it took me ages to get round to it – the curse of too much good web. It depends how you handle it, and I haven’t found a perfect method yet. I have my infamous ‘To Read’ bookmark folder which is meant to capture the overflow; it perennially grows and never gets read. Often the links that go in there are dated by the time I get round to reading them anyway.

It was recommended to me once that I learn to skim-read really fast, and to a certain extent I do that. But the web is brilliant for distractions and deep links: before you realise it, you’ve been reading an article three links away from the one you started out reading. It wasn’t conscious; one link led to a better link, which led to a better article.

I’m at the point now where if I load up my favourite websites to read, the memory usage of Firefox 1.5 under Windows XP is over 200,000 K. While that’s happening Firefox tends to be unstable, so online transactions are out of the question and running anything like streaming media will normally cause a crash. And I find it hard to prune my favourite links; they are all so good. If only I got paid to read them.