michigan informatics

 
 

How Search Engines Work: The Webcrawler

The Webcrawler is the part of a search engine that combs through pages on the Internet and gathers the information that the search engine uses. Crawler features that vary between search engines and can affect your search results include:

  • Included pages
    Most search engines find information by beginning at one page and following all of the links on that page. The crawler then follows all of the links on these new pages, and so on. Therefore, if no other page links to a page, a search engine may never find it. Authors can get unlinked pages included by submitting them to each search engine individually.
  • Excluded pages
    Some web administrators choose to exclude their pages from search engines, for example because they are internal pages or intranets. Many web pages are also excluded because their content is generated dynamically from a database, and a crawler that only follows links cannot find it.
  • Document types
    Different search engines search different document types. All search HTML documents, but some also search PDF, PowerPoint, Word, and Excel files, images, and more.
  • Frequency of crawling
    An important characteristic of a web crawler is how frequently it revisits pages to retrieve their information. Some sites are visited more often than others, so a recent change to a page may not yet be reflected in search results.
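
The link-following behavior described under "Included pages" and "Excluded pages" can be illustrated with a minimal sketch in Python. It is not how any particular search engine is implemented. The page names, the in-memory LINKS table (standing in for real web pages), and the crawl function are all hypothetical, chosen only to show why an unlinked page is never found unless it is submitted, and how an excluded page is skipped.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "home.html": ["about.html", "docs.html"],
    "about.html": ["home.html"],
    "docs.html": ["guide.pdf", "intranet.html"],
    "guide.pdf": [],
    "intranet.html": [],
    "orphan.html": ["home.html"],  # no other page links to this one
}


def crawl(seeds, links, excluded=frozenset()):
    """Follow links breadth-first from the seed pages.

    Returns the pages in the order they are discovered. Pages listed in
    `excluded` are never visited (an assumption standing in for whatever
    exclusion mechanism an administrator uses).
    """
    found = []
    seen = set()
    queue = deque()

    # Start from the seed pages (the starting page plus any submitted pages).
    for page in seeds:
        if page not in seen and page not in excluded:
            seen.add(page)
            queue.append(page)

    while queue:
        page = queue.popleft()
        found.append(page)
        # Queue every link on this page that has not been seen yet.
        for target in links.get(page, []):
            if target not in seen and target not in excluded:
                seen.add(target)
                queue.append(target)
    return found


if __name__ == "__main__":
    # Starting at one page: orphan.html is never reached.
    print(crawl(["home.html"], LINKS))
    # Submitting the unlinked page adds it as an extra starting point.
    print(crawl(["home.html", "orphan.html"], LINKS))
    # An excluded page is skipped even though docs.html links to it.
    print(crawl(["home.html"], LINKS, excluded={"intranet.html"}))
```

In this sketch, submitting a page simply means adding it to the list of starting pages. A real crawler would also fetch pages over the network, decide which document types to read, and schedule repeat visits, none of which is modeled here.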