The Surface and Deep Webs

There are two webs. The surface web is where we are now: all the blogs in Blogger are part of it. There is also a deep web, which is much larger and contains information of much better quality than that found in the surface web.

The information in the surface web is contained in static web pages which can be found and indexed by search engines. These are the pages that appear when you search on Google. They are free to view and of variable quality. Some searches might produce what you want; others might produce rubbish or disinformation. The surface web has no payment mechanism. There is no way for an information provider to charge for sight of a page. That has made providers reluctant to put valuable information in the surface web.

The deep web is much larger. Some writers have suggested it contains five hundred times more information than the surface web. The surface web search engines cannot index its content. If you conduct a Google search the results will contain very little from the deep web. Many of the deep web databases have been established for a long time, and contain vast quantities of high quality information. The Dialog database started in 1966, and it has been argued that it contains more information than the entire surface web. Access to its content requires payment.

There are two main reasons why the information in the deep web is not indexed by the surface web search engines.

1. The information may be held in web pages that require payment to view. Google cannot index the contents of such a page, and you cannot read it unless you have a subscription to the database that holds it.

2. The information may be held in databases rather than in persistent web pages. If that is the case there are no web pages for the search engines to find and index. Databases create dynamic web pages in response to queries. Each page is unique and does not persist. If you use your browser to search a database, it will return a page containing the response to your query. That page has been created just for you and no longer exists after the database server has sent it to you. For example, when you search on Amazon or eBay their databases create dynamic web pages to answer your query.
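A dynamic page of this kind can be illustrated with a minimal Python sketch. Everything in it is an assumption made for the example: the in-memory `items` table, its rows, and the `render_results` function are hypothetical and are not how Amazon or eBay actually work. The point is only that the page is assembled from database rows when a query arrives and is never stored anywhere a search engine crawler could find it.

```python
import sqlite3
from html import escape

# Hypothetical catalogue: the table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [("Red kettle", 19.99), ("Blue kettle", 24.50), ("Toaster", 30.00)],
)

def render_results(query: str) -> str:
    """Build an HTML page for one query; nothing is saved to disk."""
    rows = conn.execute(
        "SELECT title, price FROM items WHERE title LIKE ? ORDER BY title",
        (f"%{query}%",),
    ).fetchall()
    items = "".join(f"<li>{escape(t)}: {p:.2f}</li>" for t, p in rows)
    return (
        f"<html><body><h1>Results for {escape(query)}</h1>"
        f"<ul>{items}</ul></body></html>"
    )

# The page exists only as this string, created for this one request.
page = render_results("kettle")
print(page)
```

A crawler that follows links between stored pages never sees this output, because it only comes into being when someone submits the query.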


The search engines cannot index most of the deep web for the reasons given above, but there are a few exceptions.

1. Most deep web databases require payment to view, but a few provide free access to some part of the information they hold.

2. Google Scholar is able to index a small number of databases that contain articles which have been published in academic journals. However, you will probably have to pay to read the articles.

3. Probably most of the information in the deep web is held in databases and delivered via dynamic web pages. You cannot use Google, or the other search engines, to search the content of these databases, but you can use Google to find some of the databases. This can be done by including the word “database” in your query. For example, you can carry out a search on Google using the query “toxic chemicals database”. This will produce a list of chemical databases. When you have found a database you then have to find a way of accessing its contents.
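The trick of adding the word “database” to a query can be sketched in a few lines of Python. This is only an illustration: the helper name `database_search_url` is made up, and the sketch assumes the familiar `https://www.google.com/search?q=...` address format. It builds the address for a search and nothing more; it does not fetch or read any results.

```python
from urllib.parse import urlencode

def database_search_url(topic: str) -> str:
    """Return a Google search URL for '<topic> database' (hypothetical helper)."""
    terms = topic.strip()
    # Add the word "database" only if the query does not already contain it.
    if "database" not in terms.lower().split():
        terms += " database"
    return "https://www.google.com/search?" + urlencode({"q": terms})

print(database_search_url("toxic chemicals"))
# prints https://www.google.com/search?q=toxic+chemicals+database
```

Pasting the printed address into a browser runs the same search as typing the query by hand.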

Other databases may not be easy to find using a Google search but they may appear on a list such as this and this.

The diagram attempts to illustrate the situation. The red area represents the total information held in the surface web. Only some of this has been indexed by the major search engines. It has been estimated that Google has only indexed about 40% of the information in the surface web. The various search engines have all indexed different parts of the surface web. Some of the web pages that have been indexed by Google have not been indexed by Bing and vice versa.

The yellow area represents the information that is held in the deep web, mainly in databases. Some of this is free, but most requires payment.

The blue area represents the free content on the World Wide Web. This includes that part of the surface web that has been indexed by the search engines, and the information in the free deep web databases.

It is important to realise that a search of the surface web using a single search engine will usually produce links to only a very small part of the information that might be available on both the surface and deep webs, and what is found will not be the best of what is available.
