Web-based data mining
Automatically extract information with HTML, XML, and Java
Level: Advanced
Jussi Myllymaki (jussi@almaden.ibm.com), Researcher, IBM
Jared Jackson (jjared@almaden.ibm.com), Researcher, IBM
01 Jun 2001
The World Wide Web is now undeniably the richest and most dense source of information the world has ever seen, yet its structure makes it difficult to make use of that information in a systematic way. The methods and tools described in this article will enable developers familiar with the most common technologies of the Web to quickly and easily extract the Web-delivered information they need.
The rapid growth of the World Wide Web in this age of information has led to a prolific distribution of a wide variety of public information. Unfortunately, while HTML, the major carrier of this information, provides a convenient way to present information to human readers, it can be a challenging structure from which to automatically extract information relevant to a data-driven service or application.
A variety of approaches have been taken to solve this problem. Most take the form of a proprietary query language that maps sections of an HTML page into code that populates a database with information from the page. While these approaches may offer some advantages, most are impractical for two reasons: first, they require the developer to spend time learning a query language that cannot be used in any other setting, and second, they are not robust enough to survive the simple, inevitable changes to the Web pages they target.
In this article, a method for Web-based data mining is developed using the standard technologies of the Web -- HTML, XML, and Java. This method is equal in power to the proprietary solutions, if not more powerful, and for those already familiar with Web technologies it requires little effort to produce robust results. As an added bonus, much of the code needed to begin data extraction is included with this article.
HTML: A blessing and a curse
HTML is often a difficult medium to work with programmatically. The majority of the content of Web pages describes formatting that is irrelevant to a data-driven system, and document structure can change as often as every connection to the page because of dynamic banner ads and other server-side scripting. The problem is further compounded by the fact that a major portion of all Web pages are not well-formed, a result of the leniency of HTML parsing in modern Web browsers.
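The gap between what browsers tolerate and what XML tools require is easy to demonstrate. The sketch below (class name and sample strings are illustrative, not from the article) feeds a browser-tolerated HTML fragment and its well-formed XHTML equivalent to the JDK's standard XML parser:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;

public class WellFormedCheck {
    /** Returns true if the markup parses as well-formed XML. */
    public static boolean parses(String markup) {
        try {
            DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(markup.getBytes("UTF-8")));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Browsers accept an unclosed <br> and an unquoted attribute; XML does not
        System.out.println(parses("<p>70<br>humid<font size=2>x</font></p>"));      // false
        // The equivalent well-formed XHTML parses cleanly
        System.out.println(parses("<p>70<br/>humid<font size=\"2\">x</font></p>")); // true
    }
}
```

This is exactly the kind of repair Tidy automates for whole pages.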
Despite these problems, there are advantageous aspects of HTML for data miners. Interesting data can often be isolated to a single element within the page.

Background technologies
The key to the data mining technique described here is to convert existing Web pages into XML, or perhaps more appropriately XHTML, and then to use a few of the many tools for working with XML-structured data to retrieve the relevant pieces.

Fortunately, a solution exists for correcting much of the uneven design of HTML pages. Tidy, a library available in several programming languages, is a freely available product for correcting common mistakes in HTML documents and producing equivalent documents that are well-formed. Tidy may also be used to render these documents in XHTML, a subset of XML. (See Resources.)

The code examples in this article are written in Java and will require the Tidy jar file to be placed in the classpath of your system when compiling and running them. They will also require the XML libraries made available through the Apache project, Xerces and Xalan. These two libraries are based on code donated by IBM and govern XML parsing and XSL transforms, respectively. Each of these three libraries is freely available on the Web. An understanding of the Java programming language, XML, and XSL transformations will be helpful in following the examples; references for these technologies can be found at the end of the article.

Overview of the approach and introduction of an example
We introduce the method of data extraction by means of an example. Suppose we are interested in tracking the temperature and humidity levels of Seattle, Washington, at various times of the day over the course of a few months. Supposing no off-the-shelf software for this kind of reporting fits our needs, we are still left with the opportunity to glean this information from one of many public Web sites.

Figure 1 illustrates an overview of the extraction process. Web pages are retrieved and processed until a data set is created that can be incorporated into an existing data set. In a few short steps we will have a reliable system in place that gathers just the information we need. Each of the steps shown in Figure 1 will be explained in detail below, along with the code necessary to execute it.

Figure 1. An overview of the extraction process

Obtaining the source information as XHTML
In order to extract data, we of course need to know where to find it. In most cases the source will be obvious. If we wanted to keep a collection of the titles and URLs of articles from developerWorks, we would use http://www.ibm.com/developerworks/ as our target. In the case of the weather, we have several sources to choose from. We will use Yahoo! Weather in the example, though others would have worked equally well. In particular, we will be tracking the data at the URL http://weather.yahoo.com/forecast/Seattle_WA_US_f.html. A screen shot of this page is shown in Figure 2.

Figure 2. The Yahoo! Weather Web page for Seattle, Washington

In considering a source, it is important to keep one factor in mind: while we are looking for robust solutions that will work in dynamic environments, our work will be easiest when extracting from the most reliable and stable sources available.

Once the source is determined, our first step in the extraction process is to convert the data from HTML to XML. We will accomplish this and other XML-related tasks by constructing a Java class called XMLHelper, composed of static helper functions. The full source of this class can be found by following the links to XMLHelper.java and XMLHelperException.java. We will be building up the methods of this class as the article proceeds.

We use the functionality provided by the Tidy library to do our conversion in the method XMLHelper.tidyHTML(). This method takes a URL as a parameter and returns an XML Document as a result. Be careful to check for exceptions when calling this or any other XML-related method. The code for doing so is shown in Listing 1. The fruits of this code are shown in Figure 3, a shot of Microsoft's Internet Explorer XML viewer displaying the XML produced from the weather page.

Figure 3. The Yahoo! Weather Web page converted to XHTML

Finding a reference point for the data
Notice that the vast majority of the information in either the Web page or its XHTML source is of absolutely no concern to us. Our next task, then, is to locate a specific region of the XML tree from which we can extract our data without concerning ourselves with the extraneous information. For more complex extractions, we may need to find several instances of such regions on a single page.

Accomplishing this is usually easiest by first examining the Web page and then working with the XML. Simply looking at the page shows us that the information we are looking for is in a section in the upper-middle part of the page. With even a limited familiarity with HTML, it is easy to infer that the data we are looking for is probably all contained under the same <table> element, and that this table probably always contains words such as "Appar Temp" and "Dewpoint," no matter what the data for the day might be.

Making note of our observations, we now consider the XHTML that the page produced. A text search for "Appar Temp" reveals, as shown in Figure 4, that the text is indeed enclosed in a table containing all of the data we need. We will make this table our reference point, or anchor.

Figure 4. The anchor is found by looking for a table containing the text "Appar Temp"

Now we need a way to locate this anchor. Since we are going to be using XSL to transform the XML we have, we can use XPath expressions for this task. The trivial choice would be to use:

    /html/body/center/table[6]/tr[2]/td[2]/table[2]/tr/td/table[6]

This expression specifies a path from the root element to our anchor, and it leaves us very vulnerable to modifications of the layout of the page. A better approach is to specify the anchor based on the content around it. Using this approach, we reconstruct the XPath expression as:

    //table[starts-with(tr/td/font/b,'Appar Temp')]

...or, even better, we can take advantage of the way XSL converts XML trees to strings:

    //table[starts-with(normalize-space(.),'Appar Temp')]

Mapping the data to XML
With this anchor in hand, we can create the code that will actually extract our data. This code will take the form of an XSL file. The goal of the XSL file is to identify the anchor, to specify how to get from that anchor to the data we are looking for (in short hops), and to construct an XML output file in the format we want. This process is really much simpler than it sounds. The XSL code that does this is given in Listing 2 and is also available as an XSL text file.

The <xsl:output> element simply tells the processor that we want XML as a result of our transformation. The first <xsl:template> establishes our root element and applies templates to search for our anchors. The second <xsl:template> keeps us from matching more than we want. The last <xsl:template> defines our anchor in its match attribute and then tells the processor how to hop to the temperature and humidity data we are trying to mine.

Of course, just writing the XSL will not get the job done; we also need a tool that performs the conversion. For this, we take advantage of the XMLHelper class methods that parse the XSL and perform the conversion. These methods are called parseXMLFromURLString() and transformXML(), respectively. The code for using them is given in Listing 3.

Listing 3

    /**
     * Retrieve the XHTML file written to disk in Listing 1
     * and apply our XSL transformation to it. Write the result
     * to disk as XML.
     */
    public static void main(String[] args) {
        try {
            Document xhtml = XMLHelper.parseXMLFromURLString("file://weather.xml");
            Document xsl   = XMLHelper.parseXMLFromURLString("file://XSL/weather.xsl");
            Document xml   = XMLHelper.transformXML(xhtml, xsl);
            XMLHelper.outputXMLToFile(xml, "XML" + File.separator + "result.xml");
        } catch (XMLHelperException xmle) {
            // ... handle the error ...
        }
    }

Merging and processing the results
If we were performing the data extraction only once, we would now be done. However, we don't just want to know the temperature at one time, but at several different times. All we need to do now is repeat our extraction process over and over again, merging the results into a single XML data file. We could again use XSL to do this, but instead we will create one last method in our XMLHelper class. The mergeXML() method allows us to merge the data obtained in the current extraction with an archive file of past extraction data.

The code for running this whole process is given in the WeatherExtractor.java file. I leave the task of scheduling the program's execution to the reader, as system-dependent methods for doing so are often superior to simple programmatic ones. The result of running WeatherExtractor once a day for four days can be seen in Figure 5.

Figure 5. The results of our Web extraction

Conclusion
In this article, we have described and demonstrated the fundamentals of a robust approach for extracting information from the largest source of information in existence, the World Wide Web. We have also included the coding tools necessary to enable any Java developer to begin his or her own extraction work with a minimum of effort and extraction experience. While the example in the article focused on extracting weather information about Seattle, Washington, nearly all of the code presented here is reusable for any data extraction. In fact, aside from minor changes to the WeatherExtractor class, the only code that needs to change for other data mining projects is the XSL transformation code (which, by the way, never needs to be compiled).

The method is as simple as it is sound. By wisely choosing data sources that are reliable, and by picking anchors within those sources that are tied to content rather than format, you can have a low-maintenance, reliable data extraction system, and, depending on your level of experience and the amount of data to extract, you could have it up and running in less than an hour.

Resources

About the authors
Jussi Myllymaki joined the IBM Almaden Research Center as a Research Staff Member in 1999 and holds a PhD in Computer Science from the University of Wisconsin at Madison. You can contact Jussi at jussi@almaden.ibm.com.

Jared Jackson has been with the IBM Almaden Research Center since graduating from Harvey Mudd College in May 2000. Jared is also pursuing graduate studies in Computer Science at Stanford University. You can contact Jared at jjared@almaden.ibm.com.
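Content-based anchors like this one can be checked outside the stylesheet. The article runs its XPath inside Xalan's XSL processor; as an illustration only, the sketch below evaluates the same expression with the JDK's javax.xml.xpath API against a trimmed-down stand-in for the weather page's markup (the class name and sample markup are assumptions, not the article's code):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import java.io.ByteArrayInputStream;

public class AnchorTest {
    // A miniature stand-in for the tidied weather page: a navigation
    // table we want to skip, then the data table we want to anchor on.
    static final String PAGE =
        "<html><body><table><tr><td>nav</td></tr></table>"
      + "<table><tr><td><font><b>Appar Temp:</b></font> 69 F</td></tr>"
      + "<tr><td><b>Dewpoint:</b> 59 F</td></tr></table></body></html>";

    /** Locate the anchor table by its content, not its position. */
    public static Element findAnchor() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(PAGE.getBytes("UTF-8")));
        XPath xp = XPathFactory.newInstance().newXPath();
        return (Element) xp.evaluate(
            "//table[starts-with(normalize-space(.), 'Appar Temp')]",
            doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Element anchor = findAnchor();
        System.out.println(anchor == null ? "no anchor" : "anchor found");
    }
}
```

The navigation table fails the predicate while the data table passes it, which is exactly the resilience to layout changes that the positional expression lacks.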
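Listing 3 leans on the article's XMLHelper wrappers, whose source is linked rather than printed here. The transformation step they wrap can be sketched with the JDK's standard javax.xml.transform API; the toy stylesheet and input below stand in for weather.xsl and the tidied page, and every name in this sketch is illustrative rather than taken from the article's code:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class TransformSketch {
    // A toy stylesheet: pull one value out of the input document,
    // the way weather.xsl hops from the anchor to the temperature.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<reading><temp><xsl:value-of select='//temp'/></temp></reading>"
      + "</xsl:template></xsl:stylesheet>";

    /** Apply the stylesheet to an XML string and return the result. */
    public static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform("<page><temp>69</temp></page>"));
    }
}
```

A transformXML(Document, Document) helper like the article's would do the same work with DOMSource/DOMResult in place of the stream classes.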
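The mergeXML() implementation itself lives in the linked source files, but the core DOM operation such a merge needs -- importing nodes from one document into another -- can be sketched as follows. The <weather>/<reading> structure and all names here are assumptions for illustration, not the article's actual code:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;

public class MergeSketch {
    /** Parse an XML string into a DOM Document. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
    }

    /** Append each child element of the new extraction onto the archive's root. */
    public static Document merge(Document archive, Document current) {
        Element root = archive.getDocumentElement();
        NodeList readings = current.getDocumentElement().getChildNodes();
        for (int i = 0; i < readings.getLength(); i++) {
            Node n = readings.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                // Nodes must be imported before they can join another document
                root.appendChild(archive.importNode(n, true));
            }
        }
        return archive;
    }

    public static void main(String[] args) throws Exception {
        Document archive = parse("<weather><reading day='1'/></weather>");
        Document today   = parse("<weather><reading day='2'/></weather>");
        int count = merge(archive, today).getElementsByTagName("reading").getLength();
        System.out.println(count); // two readings after the merge
    }
}
```

Run once per extraction, this accumulates the archive file that Figure 5 depicts after four days.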