Web-based data mining

Automatically extract information with HTML, XML, and Java

Level: Advanced
Jussi Myllymaki (jussi@almaden.ibm.com), Researcher, IBM
Jared Jackson (jjared@almaden.ibm.com), Researcher, IBM
01 Jun 2001
The World Wide Web is now undeniably the richest and densest source of information the world has ever seen, yet its structure makes that information difficult to use in a systematic way. The methods and tools described in this article enable developers familiar with the most common Web technologies to extract the Web-delivered information they need quickly and easily.
The rapid growth of the World Wide Web has made a wide variety of public information broadly available. Unfortunately, while HTML, the major carrier of this information, provides a convenient way to present information to human readers, it is a difficult structure from which to automatically extract the information a data-driven service or application needs.
A variety of approaches have been taken to solve this problem. Most take the form of a proprietary query language that maps sections of an HTML page into code that populates a database with information from the Web page. While these approaches may offer some advantages, most are impractical for two reasons. First, they require a developer to take the time to learn a query language that cannot be used in any other setting. Second, they are not robust enough to survive the simple, inevitable changes to the Web pages they target.
This article develops a method for Web-based data mining that uses the standard technologies of the Web: HTML, XML, and Java. The method is at least as powerful as the proprietary solutions, and for those already familiar with Web technologies it requires little effort to produce robust results. As an added bonus, much of the code needed to begin data extraction is included with this article.