by Jeremy Cohen
When I was in college in the late 1980s there was no such thing as a search engine. There was no World Wide Web, at least not the web that you and I know today. The Internet at that time was limited to use by our government (which feared nuclear catastrophe and desperately sought a method to preserve itself in the tragic event of a first strike by the Soviet Union) and by academic geeks who wanted a better way to store, share and access data. Those days, and the Soviet Union, are gone forever.
Since that time the World Wide Web has grown into a behemoth collection of billions of web pages featuring information about almost any subject. Martial arts, fine art, music, chemistry, pickles, planes, plastics, you name it, it’s out there. The amount of data is staggering.
And to me, what’s even more staggering is how people like us are able to plop down in front of our computers, type a quick description of what we seek into something called a search engine and “poof!” we are presented with a myriad of links to and descriptions of the information we desire. It may seem like magic, but it’s not. What’s going on here? What are these mystical creatures called search engines and how do they work?
How May I Help You?
Search engines are typically run by companies that have decided to make a business of reading, memorizing and making sense of the World Wide Web for the rest of us. For whatever reason, many have chosen light-hearted names like Yahoo!, Google, and even Dogpile. I guess the folks at Yahoo! wanted us to have dude-ranch-like fun as we search. Google’s name, on the other hand, was generated by an honest but happy mistake. In their earliest days, before there was even a company, an enlightened investor made a $100,000 check out to Google, Inc. The rest is history. I won’t go into how Dogpile developed its name. Despite their silly names, their ultimate and ongoing goal is serious: to make money and to provide their users with the most relevant and helpful search results.
To attain their lofty goals the most popular search engines, Google, Yahoo! and MSN Search, have adopted similar models. They provide us, the searching public, with a text box and a button on a web page. Type something into that text box, push the button and bingo! You’ve got results. In keeping with their fiscal imperative, search engines provide a mix of “sponsored” and “natural” results. Sponsored results are how search engines generate revenue. Natural results, on the other hand, are at the crux of a search engine’s reputation. If a search engine, like Google, is broadly known to render great results, web surfers will flock to it. And to Google we do flock, millions every day. While it is true that fewer than half of us bother to click on sponsored ads, those who do click on them often enough to make very wealthy people out of those who succeed at digesting and regurgitating the web for us lay people. It’s an enormous task. Here’s how they do it.
While the algorithms each search engine uses to gather the contents of the web and index what it finds are proprietary, each does basically the same thing. To begin, any search engine must first gather as much of the web’s information as possible. Search engines do so by ‘spidering’ or ‘crawling’ the web. Each of these terms means pretty much the same thing. A search engine spider or a web crawler is a computer program written to methodically visit and copy as many web pages as possible in as little time as possible. The pages that are visited are copied to a database which is ultimately used to generate search results. With today’s speedy computers and the availability of massive amounts of inexpensive memory, spidering the web has become the easiest of the tasks a search engine must perform. And as computers continue to get faster and memory cheaper, search engines should be able to keep up with the continuously expanding web.
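For the curious, here is a rough sketch of what a spider does. The real crawlers are proprietary, so this is only an illustration under some loud assumptions: the tiny “web” below (TINY_WEB), its example addresses, and the fetch and crawl functions are all made up for this article, and a simple lookup stands in for an actual download.

```python
from collections import deque

# Hypothetical miniature "web": address -> (page text, outgoing links).
# A real spider would download pages over the network instead.
TINY_WEB = {
    "a.example/": ("Welcome to A", ["a.example/about", "b.example/"]),
    "a.example/about": ("About A", ["a.example/"]),
    "b.example/": ("Welcome to B", ["a.example/", "c.example/"]),
    "c.example/": ("Welcome to C", []),
}


def fetch(url):
    """Stand-in for a real download; returns (text, links) or None."""
    return TINY_WEB.get(url)


def crawl(start_url, max_pages=100):
    """Visit pages breadth-first and copy each one into a local store."""
    store = {}                   # the "database" of copied pages
    queue = deque([start_url])   # pages discovered but not yet visited
    seen = {start_url}           # never queue the same address twice
    while queue and len(store) < max_pages:
        url = queue.popleft()
        page = fetch(url)
        if page is None:         # unreachable page: skip it
            continue
        text, links = page
        store[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return store


if __name__ == "__main__":
    for address in sorted(crawl("a.example/")):
        print(address)
```

Starting from a single page, the spider follows links until it has copied everything it can reach or hits the max_pages limit. The seen set is what keeps it from circling forever between pages that link to each other.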
You Want Me to Make Sense of What?
Now comes the hard part. Having amassed a seemingly insurmountable jumble of data, a search engine must now figure out just what it has. For every page a search engine spiders it must ask itself questions like: What is this page about? Is this a new page? If this is a page I already know, has it been updated? Is there quality information on this page?
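Two of those questions, “Is this a new page?” and “Has it been updated?”, lend themselves to a quick illustration. One common trick (an assumption on my part, not a description of any particular search engine) is to keep a fingerprint of each page and compare it on the next visit. The fingerprint and classify functions and the example address below are invented for this sketch.

```python
import hashlib


def fingerprint(text):
    """Reduce a page's text to a short, fixed-length digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def classify(url, text, known):
    """Return 'new', 'updated' or 'unchanged', then remember the latest digest.

    `known` maps each address to the digest recorded on the previous visit.
    """
    digest = fingerprint(text)
    previous = known.get(url)
    known[url] = digest
    if previous is None:
        return "new"
    return "unchanged" if previous == digest else "updated"


if __name__ == "__main__":
    known = {}
    print(classify("pickles.example/", "Dill pickles", known))
    print(classify("pickles.example/", "Dill pickles", known))
    print(classify("pickles.example/", "Dill and sweet pickles", known))
```

Comparing digests is much cheaper than comparing whole pages, though it says nothing about the last question on the list, whether the information is any good.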
To help answer these questions search engines look at the text that makes up the contents of each page. Every time a page is downloaded to a web browser what goes into the browser is not what comes out. What we see when we visit any website, like Yahoo! for instance, is an interpretation of the text sent by the web server to our browser. Simply put, when a web page is downloaded, the web server gives instructions to the browser about what to display and how to display it. Search engines know to look at special instructions in a web page, like those that describe the title and headings, to identify important words and phrases that may describe the contents of the page.
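To make that concrete, here is a small sketch that pulls the title and headings out of the raw text a web server sends. It uses the HTML parser that ships with Python; the TitleAndHeadings class, the extract function and the sample page are made up for illustration, and a real search engine surely looks at far more than these few tags.

```python
from html.parser import HTMLParser


class TitleAndHeadings(HTMLParser):
    """Collect the text found inside <title>, <h1>, <h2> and <h3> tags."""

    WATCHED = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.found = []        # list of (tag, text) pairs, in page order
        self._current = None   # watched tag we are inside, if any
        self._buffer = []      # pieces of text seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.WATCHED:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            self.found.append((tag, text))
            self._current = None


def extract(html):
    parser = TitleAndHeadings()
    parser.feed(html)
    parser.close()
    return parser.found


SAMPLE = """<html><head><title>Pickle Recipes</title></head>
<body><h1>Dill Pickles</h1><p>Slice the cucumbers.</p>
<h2>Brine</h2><p>Mix salt and water.</p></body></html>"""

if __name__ == "__main__":
    for tag, text in extract(SAMPLE):
        print(tag, "->", text)
```

Notice that the ordinary paragraph text is ignored here. The point is only that the instructions meant for the browser double as hints about which words matter most.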
Phrase That Again
Some search engines use the number of times a phrase appears on a page to determine the importance of that phrase. In other instances a search engine may count the number of links from other web pages to the page it is examining to determine if that page is important. A link to a page is often considered a vote for that page in the eyes of the search engines, particularly Google. And like an election, the more votes the better. Search engines also examine the relative size of text on a page and any special formatting, like making words appear bold or in italics, to determine what a page is about. The larger or bolder the text, the more important those words are assumed to be. After a search engine is done examining a web page it compares the page to all the other pages it knows about and assigns a rank to the page based on the strengths and weaknesses of its keyword phrases.
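Here is a toy version of those ideas rolled together: count the phrase, give extra credit when it appears in emphasized text, and add a point for every other page that links in. Everything in it is an assumption for illustration, including the three example pages, the function names and the weights of 2 and 1. No real search engine publishes its formula, and the real ones are far more elaborate.

```python
import re

# Hypothetical pages: body text, text shown in bold or headings, outgoing links.
PAGES = {
    "pickles.example": {
        "text": "Dill pickles are easy. Make dill pickles at home.",
        "emphasized": "dill pickles",
        "links": ["brine.example"],
    },
    "brine.example": {
        "text": "Brine for dill pickles needs salt.",
        "emphasized": "brine",
        "links": ["pickles.example"],
    },
    "planes.example": {
        "text": "Planes fly. See pickles elsewhere.",
        "emphasized": "planes",
        "links": ["pickles.example"],
    },
}


def count_phrase(text, phrase):
    """Count whole-phrase matches, ignoring upper and lower case."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def inbound_links(pages):
    """Count one 'vote' for each distinct other page linking to a page."""
    votes = {url: 0 for url in pages}
    for url, page in pages.items():
        for target in set(page["links"]):
            if target in votes and target != url:
                votes[target] += 1
    return votes


def rank(pages, phrase, emphasis_weight=2, link_weight=1):
    """Order pages from highest to lowest score for the given phrase."""
    votes = inbound_links(pages)
    scores = {}
    for url, page in pages.items():
        score = count_phrase(page["text"], phrase)
        score += emphasis_weight * count_phrase(page["emphasized"], phrase)
        score += link_weight * votes[url]
        scores[url] = score
    # Highest score first; ties are broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    print(rank(PAGES, "dill pickles"))
```

For the phrase “dill pickles” the pickle page comes out on top: it uses the phrase twice, emphasizes it once, and collects a vote from each of the other two pages.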
The techniques described above are just a glimpse at how a search engine may go about ranking a page. They help search engines do a good job of delivering the results their searchers seek, but not a perfect job. In fact, by some measures, search engine results are relevant only about fifty percent of the time. Improvements are needed and expected.
As long as search engines continue to improve the relevancy of their results and generate those results in a speedy fashion, the average web surfer need not be concerned with how those results are generated. Web surfers should rest assured that search engines are our forever hard-working friends.