Wednesday, June 25, 2008

Deep Web (Invisible Web)

The deep Web (also called Deepnet, the invisible Web, or the hidden Web) refers to World Wide Web content that is not part of the surface Web, which is indexed by search engines. It is estimated that the deep Web is several orders of magnitude larger than the surface Web.

These types of pages used to be invisible but can now be found in most search engine results:
  • Pages in non-HTML formats (PDF, Word, Excel, PowerPoint), now converted into HTML.
  • Script-based pages, whose URLs contain a ? or other script syntax.
  • Pages generated dynamically by other types of database software (e.g., Active Server Pages, ColdFusion). These can be indexed if there is a stable URL somewhere that search engine crawlers can find.
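As a rough illustration of the three categories above, a crawler could sort URLs by their shape alone. The following is a minimal sketch, not how any particular search engine works; the extension lists and the function name classify_url are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Hypothetical extension lists, chosen only for illustration.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".xls", ".ppt")
SCRIPT_EXTENSIONS = (".asp", ".cfm")  # Active Server Pages, ColdFusion


def classify_url(url: str) -> str:
    """Guess which of the formerly invisible page types a URL points to."""
    parts = urlsplit(url)
    path = parts.path.lower()
    if parts.query:
        # Anything after a "?" marks a script-based page.
        return "script-based"
    if path.endswith(SCRIPT_EXTENSIONS):
        return "dynamically generated"
    if path.endswith(DOCUMENT_EXTENSIONS):
        return "non-HTML document"
    return "ordinary page"


for example in (
    "http://example.com/search?q=deep+web",
    "http://example.com/report.pdf",
    "http://example.com/catalog.asp",
    "http://example.com/index.html",
):
    print(example, "->", classify_url(example))
```

A URL check like this only guesses at the page type; a real crawler would also look at the Content-Type header the server returns.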

Crawling the Deep Web
Researchers have been exploring how the deep Web can be crawled in an automatic fashion. Raghavan and Garcia-Molina (2001) presented an architectural model for a hidden-Web crawler that used key terms provided by users or collected from the query interfaces to query a Web form and crawl the deep Web resources. Ntoulas et al. (2005) created a hidden-Web crawler that automatically generated meaningful queries to issue against search forms. Their crawler generated promising results, but the problem is far from being solved.
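The general idea of a crawler that feeds terms into a search form and then picks new terms from what comes back can be sketched in a few lines. This is a simplified illustration only, not the algorithm from either paper: the function crawl_hidden_site, the search callable that stands in for submitting a Web form, and the rule of choosing the most frequent unseen word are all assumptions made for this example.

```python
import re
from collections import Counter


def crawl_hidden_site(search, seed_terms, max_queries=10):
    """Query a form repeatedly, choosing new terms from the results so far.

    `search` is a hypothetical callable that takes a term and returns
    (doc_id, text) pairs, standing in for a real Web form submission.
    """
    pending = list(seed_terms)
    issued = set()
    seen_docs = {}
    term_counts = Counter()

    while len(issued) < max_queries:
        # Use the user-provided seed terms first.
        candidates = [t for t in pending if t not in issued]
        if not candidates:
            # Then fall back to the most frequent words seen in results.
            candidates = [t for t, _ in term_counts.most_common() if t not in issued]
        if not candidates:
            break  # nothing new to ask
        term = candidates[0]
        issued.add(term)

        for doc_id, text in search(term):
            if doc_id not in seen_docs:
                seen_docs[doc_id] = text
                term_counts.update(re.findall(r"[a-z]+", text.lower()))
    return seen_docs


# A toy in-memory "database" behind a form, for demonstration only.
TOY_DOCS = {
    1: "deep web crawler",
    2: "crawler forms",
    3: "forms databases",
    4: "unrelated topic",
}


def toy_search(term):
    return [(i, t) for i, t in TOY_DOCS.items() if term in t.split()]


found = crawl_hidden_site(toy_search, ["deep"])
print(sorted(found))
```

Starting from the single seed term "deep", the sketch reaches documents 1 to 3 by following shared words, while document 4 stays hidden because no retrieved page mentions its terms. That gap is a small-scale version of why the problem remains hard.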

Since a large amount of useful data and information resides in the deep Web, search engines have begun exploring alternative methods to crawl the deep Web. Google’s Sitemap Protocol and mod_oai are mechanisms that allow search engines and other interested parties to discover deep Web resources on particular Web servers. Both mechanisms allow Web servers to advertise the URLs that are accessible on them, thereby allowing automatic discovery of resources that are not directly linked to the surface Web.
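To make the Sitemap idea concrete, here is a minimal sketch of how a crawler might read the URLs a server advertises. It uses the sitemaps.org 0.9 XML namespace; the sample sitemap, the example.com URLs, and the function name urls_from_sitemap are made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Namespace used by Sitemap files that follow the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

# A made-up sitemap advertising two pages that nothing links to.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/catalog?item=12</loc>
    <lastmod>2008-06-01</lastmod>
  </url>
  <url>
    <loc>http://example.com/catalog?item=13</loc>
  </url>
</urlset>
"""


def urls_from_sitemap(xml_text):
    """Return every URL listed in the <loc> elements of a sitemap."""
    root = ET.fromstring(xml_text.strip())
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


for url in urls_from_sitemap(SAMPLE_SITEMAP):
    print(url)
```

In practice the crawler would first download the sitemap file from the server; that step is left out here so the sketch runs without a network connection.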

Federated search by subject category or vertical is an alternative to crawling the deep Web. Traditional search engines have difficulty crawling and indexing deep Web pages and their content, but deep Web search engines like CloserLookSearch, Science.gov and Northern Light create specialty engines by topic to search the deep Web. Because these engines are narrow in their data focus, they are built to access specified deep Web content by topic. These engines can search dynamic or password-protected databases that are otherwise closed to general-purpose search engines.
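The fan-out-and-merge pattern behind federated search can be sketched as follows. This is only an illustration of the concept and does not describe how any of the engines named above work; the function federated_search and the two stand-in sources are hypothetical.

```python
def federated_search(query, engines):
    """Send one query to several topic engines and merge the results.

    `engines` maps a source name to a callable that returns a list of URLs.
    Duplicates are dropped, keeping the first source that reported each URL.
    """
    merged = []
    seen = set()
    for name, engine in engines.items():
        for url in engine(query):
            if url not in seen:
                seen.add(url)
                merged.append((name, url))
    return merged


# Stand-ins for topic-specific databases, for demonstration only.
def science_source(query):
    return ["http://example.org/science/" + query, "http://example.org/shared"]


def news_source(query):
    return ["http://example.org/shared", "http://example.org/news/" + query]


results = federated_search(
    "genome", {"science": science_source, "news": news_source}
)
for source, url in results:
    print(source, url)
```

Each source is queried live at search time, which is why this approach can reach databases that a crawler never gets to index.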

Resources
  1. Wikipedia's Deep Web http://en.wikipedia.org/wiki/Deep_web
  2. http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/InvisibleWeb.html