Web-based data mining


Automatically extract information with HTML, XML, and Java


Level: Advanced

Jussi Myllymaki (jussi@almaden.ibm.com), Researcher, IBM
Jared Jackson (jjared@almaden.ibm.com), Researcher, IBM

01 Jun 2001

The World Wide Web is now undeniably the richest and most dense source of information the world has ever seen, yet its structure makes it difficult to make use of that information in a systematic way. The methods and tools described in this article will enable developers familiar with the most common technologies of the Web to quickly and easily extract the Web-delivered information they need.

The rapid growth of the World Wide Web in this age of information has led to a prolific distribution of a wide variety of public information. Unfortunately, while HTML, the major carrier of this information, provides a convenient way to present information to human readers, it can be a challenging structure from which to automatically extract information relevant to a data-driven service or application.

A variety of approaches have been taken to solve this problem. Most take the form of some proprietary query language that maps sections of an HTML page into code that populates a database with information from the Web page. While these approaches may offer some advantages, most are impractical for two reasons: one, they require a developer to take the time to learn a query language that cannot be used in any other setting, and two, they are not robust enough to cope with the simple, inevitable changes to the Web pages they target.

In this article, a method for Web-based data mining is developed using the standard technologies of the Web -- HTML, XML, and Java. This method is as powerful as, if not more powerful than, proprietary solutions, and for those already familiar with Web technologies it requires little effort to produce robust results. As an added bonus, much of the code needed to begin data extraction is included with this article.

HTML: A blessing and a curse

HTML is often a difficult medium to work with programmatically. The majority of the content of Web pages describes formatting that is irrelevant to a data-driven system, and document structure can change as often as every connection to the page because of dynamic banner ads and other server-side scripting. The problem is further compounded by the fact that a major portion of all Web pages is not well-formed, a result of the leniency of HTML parsing in modern Web browsers.

Despite these problems, there are advantageous aspects of HTML for data miners. Interesting data can often be isolated to a single tag (a table, for example) nested deep in the HTML tree, allowing the extraction process to work exclusively within a small portion of the document. In the absence of client-side scripting, there is only one way to define a drop-down menu and other data lists. These aspects of HTML allow us to focus our data-extraction efforts once we have the data in a format we can work with.




Background technologies

The key to the data mining technology described herein is to convert existing Web pages into XML, or perhaps more appropriately XHTML, and use a few of the many tools for working with data structured as XML to retrieve the relevant data.

Fortunately, a solution exists for correcting much of the uneven design of HTML pages. Tidy, a freely available library for several programming languages, corrects common mistakes in HTML documents and produces equivalent documents that are well-formed. Tidy can also be used to render these documents in XHTML, a reformulation of HTML as valid XML. (See Resources.)

The code examples in this article are written in Java and require the Tidy jar file to be on your classpath when you compile and run them. They also require the XML libraries Xerces and Xalan, made available through the Apache project. These two libraries are based on code donated by IBM and handle XML parsing and XSL transformations, respectively. All three libraries are freely available on the Web and can be found through the links in the Resources section at the end of this article. An understanding of the Java programming language, XML, and XSL transformations will be helpful in following the examples; references for these technologies can also be found in Resources.




Overview of the approach and introduction of an example

We introduce the method of data extraction by means of an example. Suppose we are interested in tracking the temperature and humidity levels in Seattle, Washington, at various times of the day over the course of a few months. Supposing no off-the-shelf software for this kind of reporting fits our needs, we are still left with the option of gleaning this information from one of the many public Web sites that report it.

Figure 1 illustrates an overview of the extraction process. Web pages are retrieved and processed until a data set is created that can be incorporated into an existing data set.



Figure 1. An overview of the extraction process

In a few short steps, we will have a reliable system in place that gathers just the information we need. The steps are listed here to give a brief overview of the process, and the process is shown at a high level in Figure 1.

  1. Identify the data source and map it to XHTML.
  2. Find reference points within the data.
  3. Map the data to XML.
  4. Merge the results and process the data.

Each of these steps will be explained in detail and the code necessary to execute them will be provided.




Obtaining the source information as XHTML

In order to extract data, we of course need to know where to find it. In most cases the source will be obvious. If we wanted to keep a collection of the titles and URLs of articles from developerWorks, we would use http://www.ibm.com/developerworks/ as our target. In the case of the weather, we have several sources to choose from. We will use Yahoo! Weather in the example, though others would have worked equally well. In particular, we will track the data at the URL http://weather.yahoo.com/forecast/Seattle_WA_US_f.html. A screen shot of this page is shown in Figure 2.



Figure 2. The Yahoo! Weather Web page for Seattle, Washington

In considering a source, it is important to keep these factors in mind:

  • Will the source produce reliable data over a reliable network connection?
  • Will the source still be there a week, a month, or even a year from now?
  • How stable is the layout structure of the source?

While we are looking for robust solutions that will work in dynamic environments, our work will be easiest when extracting the most reliable and stable sources available.

Once the source is determined, our first step in the extraction process is to convert the data from HTML to XML. We will accomplish this and other XML-related tasks by constructing a Java class called XMLHelper, composed of static helper functions. The full source of this class can be found by following the links to XMLHelper.java and XMLHelperException.java. We will be building up the methods of this class as the article proceeds.

We use the functionality provided by the Tidy library to do our conversion in the method XMLHelper.tidyHTML(). This method takes a URL as a parameter and returns an XML Document as its result. Be careful to check for exceptions when calling this or any other XML-related method. The code for doing so is shown in Listing 1. The fruits of this code are shown in Figure 3, a shot of Microsoft's Internet Explorer XML viewer working with the XML from the weather page.



Figure 3. The Yahoo! Weather Web page converted to XHTML
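If you prefer to work without the downloadable XMLHelper source, the conversion step can be sketched directly against the JTidy API. The class below is illustrative only: the class name, option settings, and printed output are assumptions, not the article's actual code.

import java.io.InputStream;
import java.net.URL;

import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidyDemo {

    // Fetch a Web page and let Tidy repair it into a well-formed XHTML DOM.
    // This mirrors what XMLHelper.tidyHTML() is described as doing, though
    // the implementation details here are assumptions.
    public static Document tidyHTML(String urlString) throws Exception {
        InputStream in = new URL(urlString).openStream();
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);         // emit XHTML rather than plain HTML
        tidy.setQuiet(true);         // suppress Tidy's progress output
        tidy.setShowWarnings(false);
        return tidy.parseDOM(in, null);
    }

    public static void main(String[] args) throws Exception {
        Document xhtml =
            tidyHTML("http://weather.yahoo.com/forecast/Seattle_WA_US_f.html");
        System.out.println("Root element: "
            + xhtml.getDocumentElement().getTagName());
    }
}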




Finding a reference point for the data

Notice that the vast majority of information in either the Web page or source XHTML view is of absolutely no concern to us. Our next task then is to locate a specific region in the XML tree from which we can extract our data without concerning ourselves with the extraneous information. For more complex extractions we may need to find several instances of these regions on a single page.

Accomplishing this is usually easiest by first examining the Web page and then working with the XML. Simply looking at the page shows us that the information we are looking for is in a section in the upper-middle part of the page. Even a limited familiarity with HTML makes it easy to infer that the data we are looking for is probably all contained under the same table element, and that this table probably always contains words such as "Appar Temp" and "Dewpoint," no matter what the data might be for the day.

Making note of our observations, we now consider the XHTML that the page produced. A text search for "Appar Temp" reveals, as shown in Figure 4, that the text is indeed enclosed in a table containing all of the data we need. We will make this table our reference point, or anchor.



Figure 4. The anchor is found by looking for a table containing the text "Appar Temp"

Now we need a way to locate this anchor. Since we are going to be using XSL to transform the XML we have, we can use XPath expressions for this task. The trivial choice would be to use:

/html/body/center/table[6]/tr[2]/td[2]/table[2]/tr/td/table[6]

This expression specifies a path from the root element to our anchor. This trivial approach leaves us very vulnerable to modifications of the layout of this page. A better approach is to specify the anchor based on the content around it. Using this approach we reconstruct the XPath expression to:

//table[starts-with(tr/td/font/b, 'Appar Temp')]

...or even better, we can take advantage of the way XSL converts XML trees to strings:

//table[starts-with(normalize-space(.), 'Appar Temp')]
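Before writing any XSL, an anchor expression like this can be tested on its own. The small program below is one way to do that, using the XPathAPI convenience class that ships with Xalan; it assumes the tidied page has been saved to disk as weather.xml (the file name used in Listing 3), and the class name is illustrative.

import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;

import org.apache.xpath.XPathAPI;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

public class AnchorTest {
    public static void main(String[] args) throws Exception {
        // Parse the XHTML document produced by Tidy and saved to disk.
        Document xhtml = DocumentBuilderFactory.newInstance()
                                               .newDocumentBuilder()
                                               .parse(new File("weather.xml"));

        // Evaluate the content-based anchor expression from the text.
        Node anchor = XPathAPI.selectSingleNode(xhtml,
            "//table[starts-with(normalize-space(.), 'Appar Temp')]");

        System.out.println(anchor == null
            ? "Anchor not found -- the page layout or wording may have changed"
            : "Anchor found: <" + anchor.getNodeName() + "> element");
    }
}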




Mapping the data to XML

With this anchor in hand, we can create the code that will actually extract our data. This code will be in the form of an XSL file. The goal of the XSL file is to identify the anchor, specify how to get from that anchor to the data we are looking for (in short hops), and construct an XML output file in the format we want. This process is really much simpler than it sounds. The code for the XSL that will do this is given in Listing 2 and is also available as an XSL text file.

The <xsl:output> element simply tells the processor that we want XML as the result of our transformation. The first <xsl:template> element establishes our root element and calls <xsl:apply-templates> to search for our anchors. The second keeps us from matching more than we want. The last defines our anchor in its match attribute and then tells the processor how to hop from the anchor to the temperature and humidity data we are trying to mine.

Of course, just writing the XSL will not get the job done. We also need a tool that performs the conversion. For this, we take advantage of XMLHelper class methods for parsing the XSL and performing the transformation. The methods that perform these tasks are called parseXMLFromURLString() and transformXML(), respectively. The code for using these methods is given in Listing 3.



Listing 3
/**
 * Retrieve the XHTML file written to disk in Listing 1
 * and apply our XSL transformation to it. Write the result
 * to disk as XML.
 */
public static void main(String args[]) {
    try {
        Document xhtml = XMLHelper.parseXMLFromURLString("file://weather.xml");
        Document xsl   = XMLHelper.parseXMLFromURLString("file://XSL/weather.xsl");
        Document xml   = XMLHelper.transformXML(xhtml, xsl);
        XMLHelper.outputXMLToFile(xml, "XML" + File.separator + "result.xml");
    } catch (XMLHelperException xmle) {
        // ... Do Something ...
    }
}
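The XMLHelper methods used above live in the downloadable source. As a rough sketch of how a transformXML() method could be implemented, the standard JAXP/TrAX API that Xalan supports is sufficient; the class below is illustrative and is not the article's actual code.

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.dom.DOMSource;

import org.w3c.dom.Document;

public class TransformSketch {

    // Apply an XSL stylesheet (already parsed into a DOM Document) to an
    // XML document and return the transformed result as a new Document.
    public static Document transformXML(Document xml, Document xsl)
            throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new DOMSource(xsl));
        DOMResult result = new DOMResult();
        transformer.transform(new DOMSource(xml), result);
        return (Document) result.getNode();
    }
}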




Merging and processing the results

If we were only performing the data extraction once, we would now be done. However, we don't just want to know the temperature at one time, but at several different times. All we need to do now is repeat our extraction process over and over, merging the results into a single XML data file. We could again use XSL to do this, but instead we will create one last method for merging XML files in our XMLHelper class. The mergeXML() method allows us to merge the data obtained in the current extraction with an archive file of data from past extractions.
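One simple way such a merge can work is to import each top-level child of the newly extracted document into the archive document. The sketch below illustrates that idea; the class name is illustrative, and the article's actual mergeXML() semantics may differ.

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class MergeSketch {

    // Append every child of the new result's root element to the root
    // element of the archive document, so readings accumulate over time.
    public static Document merge(Document archive, Document newResult) {
        NodeList children = newResult.getDocumentElement().getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node imported = archive.importNode(children.item(i), true);
            archive.getDocumentElement().appendChild(imported);
        }
        return archive;
    }
}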

The code for running this whole process is given in the WeatherExtractor.java file. I leave the task of scheduling the execution of the program to the reader, as system-dependent methods for doing so are often superior to simple programmatic ones. The result of running WeatherExtractor once a day for four days can be seen in Figure 5.



Figure 5. The results of our Web extraction




Conclusion

In this article, we have described and demonstrated the fundamentals of a robust approach for extracting information from the largest source of information in existence, the World Wide Web. We have also included the coding tools necessary for enabling any Java developer to begin his or her own extraction work with a minimum amount of effort and extraction experience. While the example in the article focused on merely extracting weather information about Seattle, Washington, nearly all of the code presented here is reusable for any data extraction. In fact, aside from minor changes to the WeatherExtractor class, the only code that needs to be changed for other data mining projects is the XSL transformation code (which, by the way, never needs to be compiled).

The method is as simple as it is sound. By wisely choosing data sources that are reliable and picking anchors within those sources that are tied to content and not format, you can have a low-maintenance, reliable data extraction system, and, depending on your level of experience and the amount of data to extract, you could have it up and running in less than an hour.



Resources

  • Tidy for Java is maintained by Sami Lempinen and can be downloaded from SourceForge.

  • The XML libraries Xerces and Xalan are available at the Apache Project Web site.

  • For more information on XML, developerWorks has a zone related to the technology.

  • A tutorial on XSL and XPath. Many more can be found with your favorite Web search engine.

  • Jussi Myllymaki has a related paper on the relation of Web crawling and data extraction in the ANDES system, presented at WWW10 in Hong Kong.

  • Here are some techniques for personalizing your Web site, as well as tips for maximizing site performance.


About the authors

 

Jussi Myllymaki joined the IBM Almaden Research Center as a Research Staff Member in 1999 and holds a PhD in Computer Science from the University of Wisconsin-Madison. You can contact Jussi at jussi@almaden.ibm.com.


Jared Jackson has been with the IBM Almaden Research Center since graduating from Harvey Mudd College in May of 2000. Jared is also pursuing graduate studies in Computer Science at Stanford University. You can contact Jared at jjared@almaden.ibm.com.