The Deep Web

Yvan Cloutier
(Terminology Update, Volume 34, Number 2, 2001, page 18)

The surface Web, or shallow Web, is the part of the Web that common search engines such as AltaVista and Google can index; in other words, the part most Internet surfers consult regularly. The deep Web, or invisible Web, is the part of the Web that search engines do not index. It includes documents in PDF (Portable Document Format), dynamically generated pages, ASPs (Active Server Pages), databases with restricted access or user fees, and firewall- or password-protected pages.

Studies on the content of the Web show that the most powerful search engines give users access to only 1/500 of all Web pages, or 1 billion of the 500 billion pages on the Web. These figures are both surprising and alarming because they show that common search methods reach only a tiny portion of the Web and that, in concrete terms, we cannot access about 499 billion Web pages: the hidden part of the iceberg!
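
For readers who want to check the arithmetic, the figures quoted above can be reproduced with a short Python sketch. The two page counts are the article's own estimates; the variable names are illustrative only.

```python
from fractions import Fraction

# Figures quoted in the article (estimates, not measurements made here).
indexed_pages = 1_000_000_000      # pages reachable through common search engines
total_pages = 500_000_000_000      # estimated pages on the whole Web

# Share of the Web that search engines cover, as an exact fraction.
share = Fraction(indexed_pages, total_pages)

# Pages that remain out of reach of common search engines.
hidden_pages = total_pages - indexed_pages

print(f"Indexed share: {share} ({float(share):.1%})")
print(f"Pages out of reach: {hidden_pages:,}")
```

The fraction reduces to 1/500, and the difference between the two counts gives the 499 billion pages mentioned above.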

Web specialists talk more and more about the imperfections of traditional indexers because of the irrelevance of the links they provide in the search results. These tools provide only a rough map of cyberspace in the form of hyperlinked tree structures based strictly on machine logic. They can record the link to the homepage of a database, but they cannot index the content—an astronomical amount of information.

According to deep-Web experts, the resources on this part of the Web are of higher quality and are updated more often than those on the surface Web. The deep Web adds information faster than the surface Web; its search results are more relevant, and 95% of the information it contains is free of charge. The resources on the deep Web generally take the form of directories that give users access to information compiled and classified by people, often archiving professionals, making the results more relevant and providing some degree of quality assurance on the content.

The pages of the deep Web often come from sites in domains such as edu and org, used mainly by educational institutions and non-profit organizations, and therefore offer a higher level of language and greater expertise. The com (commercial) domain, however, contains sites with less-refined terminology. These commercial sites are part of the surface Web and can be accessed using general search engines.

The Internet is evolving at an incredible rate and is becoming increasingly organized and structured. The original Web was a huge, disorganized mass of hyperlinks that users had to explore using robots to find the needle in the haystack. However, informed users’ confidence in automatic indexers seems to be diminishing as directories indexed by humans become more common. As language professionals, we may have to rethink how we use search tools because, as searchers, we have to keep up with the rapid pace of development in cyberspace.

An important distinction must be made between tools that provide access to language resources that may be available only temporarily (i.e. the common search engines) and those that provide access to sustainable resources (i.e. engines and directories of the deep Web). Sustainable resources are specialized sites containing reference databases that are likely to provide answers to language questions on an ongoing rather than sporadic basis.

Thanks to the deep Web, we can now access bodies of knowledge that are structured according to proven archiving models and can be consulted on-line. Let’s visit one of these "libraries" and do some browsing.

The site is a directory of resources in all subject fields. The information is retrieved automatically and is catalogued alphabetically and thematically according to the Dewey Decimal Classification. For each site in the directory, there is a short description written by specialists. The links, about 11,000 in number, are evaluated, catalogued, described and updated monthly. This figure is very small compared with the number of links that can be found using general search engines, but the site gives users direct access to specific sustainable resources in a number of fields. Here is an example of a search that can be done on this site. It shows that this type of site is an excellent tool for creating or updating bookmarks.

Click on Life Science. This field is broken down into very specific subfields:

applied psychology
Australian natural resources
biological data
biology education
biology links
biology news
biology research
etc.

Click on biodiversity to access no fewer than 22 sites (directories, portals, specialized search engines, glossaries, etc.).

The following is an example of a site description that shows the quality of the resources:

Bird Biodiversity
Searchable database which contains photographs, information on bird specimen handling, bird dissection and a glossary of avian external anatomy.
Author: Slater Museum of Natural History, Puget Sound University
Subjects: biodiversity, birds, zoology
Dewey Class: 598
Resource Type: museum
Location: usa
Last checked: 20001202
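
Because each description follows the same "Field: value" layout, a record like this one is easy to load into a program, for instance when building a bookmark file. The Python sketch below is only an illustration: the function name parse_record and the dictionary keys are hypothetical, and reading "Last checked" as a date in year-month-day order is an assumption based on the value shown.

```python
from datetime import date, datetime

# The record exactly as it appears in the directory entry above.
RECORD = """\
Bird Biodiversity
Searchable database which contains photographs, information on bird specimen handling, bird dissection and a glossary of avian external anatomy.
Author: Slater Museum of Natural History, Puget Sound University
Subjects: biodiversity, birds, zoology
Dewey Class: 598
Resource Type: museum
Location: usa
Last checked: 20001202
"""


def parse_record(text):
    """Turn one directory entry into a dictionary (hypothetical helper)."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    # Assumption: the first line is the title and the second the description.
    record = {"title": lines[0], "description": lines[1]}
    for line in lines[2:]:
        # Split each remaining line at its first colon: "Field: value".
        key, _, value = line.partition(":")
        record[key.strip().lower().replace(" ", "_")] = value.strip()
    # Subjects are a comma-separated list.
    record["subjects"] = [s.strip() for s in record["subjects"].split(",")]
    # Assumption: "Last checked" is written as YYYYMMDD.
    record["last_checked"] = datetime.strptime(
        record["last_checked"], "%Y%m%d"
    ).date()
    return record


record = parse_record(RECORD)
print(record["title"], record["dewey_class"], record["last_checked"])
```

Once the entry is in this form, the subjects or the Dewey class can be used to sort bookmarks by field.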

The deep Web is an excellent tool for building up a collection of bookmarks in the fields you often work in. A list of some deep-Web sites appears at the end of this article.

English and French | Français et anglais
Active Server Page; ASP | page de serveur actif; page ASP
annuaire; annuaire de sites; répertoire; répertoire de sites | directory; search directory
deep Web; invisible Web | Web caché; Web invisible
directory; search directory | annuaire; annuaire de sites; répertoire; répertoire de sites
dynamically generated page | page générée dynamiquement
firewall | pare-feu
format PDF | Portable Document Format; PDF
invisible Web; deep Web | Web invisible; Web caché
page ASP; page de serveur actif | ASP; Active Server Page
page de serveur actif; page ASP | Active Server Page; ASP
page générée dynamiquement | dynamically generated page
pare-feu | firewall
Portable Document Format; PDF | format PDF
search directory; directory | annuaire; annuaire de sites; répertoire; répertoire de sites
shallow Web; surface Web | Web accessible
Web accessible | surface Web; shallow Web
Web caché; Web invisible | deep Web; invisible Web




Copyright notice for Favourite Articles

© His Majesty the King in Right of Canada, represented by the Minister of Public Services and Procurement
A tool created and made available online by the Translation Bureau, Public Services and Procurement Canada
