Douglas Campbell
Professor of Computer Science
Brigham Young University
3318 TMCB BYU, Provo, Utah 84602
campbell@cs.byu.edu
(801) 378-4977
(801) 368-4977 (FAX)

A Hundred-fold Increase in the Internet: Implications for National Security Needs

This white paper addresses extracting information for National Security needs from plain text in email, web sites, newsgroups, and news-wire services after the Internet experiences a hundred-fold increase in speed and capacity.

Three-letter National Security agencies have two different reasons to extract information from plain text on the Internet:

(1) To investigate an event like the Oklahoma bombing, they must scour Internet archives after the event for evidence, traces, and links.

(2) To prevent an event like the Oklahoma bombing, they must constantly filter, gather, analyze, and archive references to such things as bombing courthouses, distributing narcotics, penetrating Federal arsenals, assassinating federal officials, purchasing surplus Russian submarines, and terrorist atomic weapons.

It is safe to say that 99.99% of Internet plain text is of no interest to three-letter National Security agencies; the problem is to extract the remaining 0.01% from a hundred-fold larger Internet, given that present technologies for filtering, indexing, retrieving, and displaying textual information do not scale.

The remainder of this white paper discusses six specific problems, each of which our research group has actively investigated.

I. Proper Noun Extraction

The Internet application Netfind takes an electronic name and returns information on the person with that electronic name. The eclectic nature of electronic names allows Netfind to extract information from such Internet sources as the Domain Name System, the finger service, the Simple Mail Transfer Protocol, and Internet newsgroups.
National Security needs a similar Internet application to recover proper nouns from plain text: (1) people, (2) geographic places, (3) streets, (4) institutions, (5) governmental agencies, (6) businesses. However, proper nouns on the Internet are drawn from a weirder, more variegated, more internationalized set than electronic names. Extracting, storing, and retrieving proper nouns on the Internet is complicated by (1) spelling problems (see Problem III), and (2) the fact that phonetic techniques, such as Soundex, were designed for systems with at most 100,000 different European-based personal names; these older techniques increasingly fail on the massive, variegated, internationalized set of Internet proper nouns.

II. Number Extraction and Classification

Most indexing engines omit numbers because (1) numbers play havoc with compression, since they inflate the number of unique tokens, and (2) numbers are extremely difficult to aggregate effectively for a query. Numbers should not be omitted, since they carry an enormous amount of information. This information can be extracted using Localized Natural Language Processing (LNLP). Unlike conventional natural language processing (which attempts global parsing), LNLP restricts itself to the syntax and semantics within a small window around the token.

Using LNLP tools, numbers can be indexed into categories such as: (1) Social Security Numbers, (2) ISBN Numbers, (3) Street Address Numbers, (4) Phone Numbers, (5) Price Numbers, (6) Zipcodes, (7) Mathematical Formulas.

In addition, using LNLP tools, numbers can be indexed into measurement categories such as: (1) Time, (2) Weight, (3) Distance, (4) Area, (5) Volume, (6) Latitude, (7) Temperature, (8) Money, (9) Dates, (10) Frequency, (11) Power.

These categories allow National Security users to (1) aggregate numerical information and so reduce information overload, and (2) query by category of numerical information.

III. Spelling

The difficulty in designing filters to detect argot for "bombs, hash, kill" is compounded by the fact that much of the traffic is (1) written by people who spell creatively and make frequent typographical errors, and (2) in informal conversational form. Three ways to increase the robustness of filtering algorithms are: (1) modest LNLP tools, (2) improved databases of proper names, (3) improved databases of acronyms.

IV. Scaling of Indexes

Three approaches show promise in avoiding the creation of a super-index for a hundred-fold larger Internet:

(1) Introducing a hierarchy into the set of modest-sized, existing, distributed Internet indexes.

(2) Introducing topic-based indexes, such as indexes of proper nouns and of numbers.

(3) Using data mining. (One form of data mining mines information by cross-correlating N pieces of data; known algorithms that find all pairwise correlations are quadratic in N. Consequently, a hundred-fold increase in N produces a ten-thousand-fold increase in time. Our research group has examined replacing such "exact" quadratic algorithms with linearized randomized algorithms and with linearized approximation algorithms.)

V. Cluster Algorithms

One way to deal with information overload is to cluster documents automatically by similarity measures. Such clustering offers a substantial benefit to National Security users. For example, when an Information Retrieval System clusters documents on the word "stars," the user discovers at least three different clusters: (1) astronomy (stars in the sky), (2) animals and plants (star-shaped), (3) Hollywood (film stars). But it is the user who assigns meaning to these computer-generated clusters. Such automatic clustering tools permit National Security users to "discover" order despite information overload.

VI. Recall/Precision

Evaluations of Information Retrieval Systems are usually based on recall/precision measures.
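For concreteness, the two measures can be computed from the set of documents a system retrieves and the set of documents judged relevant. The sketch below assumes the standard definitions (precision is the fraction of retrieved documents that are relevant; recall is the fraction of relevant documents that are retrieved). The function name and the sample document identifiers are hypothetical and serve only as illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: iterable of document ids returned by the system
    relevant:  iterable of document ids judged relevant (ground truth)
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 judged relevant, 3 in common.
retrieved_docs = ["d1", "d2", "d3", "d7"]
relevant_docs = ["d1", "d2", "d3", "d4", "d5"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Both numbers depend only on set overlap, which is what makes them easy to quantify; nothing in the computation reflects how much effort the user spent reaching the relevant documents.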
Although recall/precision measures have the advantage of being theoretically grounded, quantitatively measurable, and mathematically formulated, they ignore the human needs of the National Security user. An increase in the effectiveness of retrieval and resource discovery tools must be accompanied by easier-to-use Information Retrieval Systems. The National Security user needs tools to:

1. permit multiple ways to browse information;
2. augment (not replace) the National Security user's distinctly human capacity to "see" relevance and to aggregate information;
3. separate, merge, and view the burgeoning diversity of information contained in heterogeneous Internet data repositories;
4. deal with incomplete, inconsistent, and time-dependent repositories of email, web sites, and newsgroups;
5. extract a snippet from a document;
6. process, store, display, archive, manage, annotate, and automatically rank text snippets.

-------------
Douglas Campbell
Computer Science Department, Brigham Young University
Provo, UT 84602
phone: 801-378-4977
fax: 801-378-7775