Processing data, information and knowledge
© Copyright 1994-2002, Rishab Aiyer Ghosh. All rights reserved.
Electric Dreams #40

Computers are good at processing data. Juggling numbers and indexing names and addresses are the rudimentary tasks upon which much of the world's infrastructure depends. Computers are mediocre at processing information, the layering of data with complex inter-relationships. But they are simply pathetic at handling knowledge: the models, built on piles of information, that we use to understand and predict an aspect of the world around us, and that humans express not in tables and charts but in articles and books.

Computers are organized. They can understand streams of homogeneous inputs, and they can follow links between data when those links are made clear and detailed. This preference for structure makes it difficult to get computers to process the more naturally expressed concepts and knowledge embodied in human-language text.

Passing over the entirely academic debate about whether machines will ever be able to understand human ideas, the fact is that most attempts at getting computers to process, or aid in processing, such ideas have concentrated on making computers 'artificially intelligent': making them form their own structured model of relatively unstructured text.

Computer systems for natural language processing (NLP) try to find meaning in a text by translating it into some internal representation, with the aid of a detailed grammar-book far more explicit than most humans could bear. Most natural language processing is either too slow, too inaccurate, or too limited to a particular human language or set of concepts to be practically useful on a large scale. While it may be pretty good for simple voice-based interfaces, NLP is unlikely in the near future to be able to, for instance, quickly go through two years of Time magazine and identify the US government's changing policy on the war in Bosnia.

While NLP begins with the assumption that machines need some sort of understanding to process text, other methods concentrate more on practical applications. These usually abandon any attempt to search for a structure in textual inputs, and rely instead on identifying a vague pattern. Neural networks, which try to simulate the working of the brain, are frequently used to identify patterns in images, sounds and financial data. Though they are often quite successful at their limited tasks, they are not normally used to process text. One reason is perhaps that text processing has two main uses, and neural networks suit neither. Interpreting small chunks of conversation requires a knowledge of grammar, which conventional NLP provides; organizing huge volumes of text requires speed, and neural networks are too slow.

The alternative comes, strangely enough, from the US National Security Agency. It has always been suspected that the NSA searches through e-mail traffic for 'sensitive' material, which, given the large volumes involved, would require considerable help from computers. Earlier this year, the agency began soliciting collaborations from businesses to develop commercial applications of its technique. It claimed to be able to quickly search through large quantities of text, in any language, for similarities to sample documents, and even to sort documents automatically according to topics that it identifies. A similar though independently developed system is available from California-based Architext.
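This column does not say how the NSA's or Architext's systems actually work, so the following is only a rough sketch of one way such a search might be approached. It assumes, purely for illustration, that each document is reduced to counts of overlapping three-character sequences and that two documents are compared by the cosine of the angle between their count vectors. The function names (ngram_profile, cosine_similarity, rank_by_similarity) are hypothetical. Because nothing here relies on a dictionary or grammar, the same code runs unchanged on text in any language.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=3):
    """Count overlapping character n-grams in lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(sample, documents, n=3):
    """Return (score, document) pairs, most similar to the sample first."""
    sample_profile = ngram_profile(sample, n)
    scored = [
        (cosine_similarity(sample_profile, ngram_profile(doc, n)), doc)
        for doc in documents
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling rank_by_similarity with a sample text and a list of candidate documents returns the candidates ordered by how closely their character patterns match the sample. The comparison is blind in exactly the sense discussed here: it measures overlap in letter sequences, not meaning.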

Though statistical techniques for text processing are not entirely new, the continuing development in the area is a sign of the growing use of computers as knowledge-processing aids. By identifying patterns more-or-less blindly, without any attempt at understanding the concepts they represent, they can help us make some sense of the ocean of information that otherwise threatens to swamp us.
